EPISODE 1838 [INTRO] [0:00:00] ANNOUNCER: Modern application development often involves juggling multiple types of databases to handle diverse data models. The lack of unification can lead to complex architectures with attendant security concerns and fragmented development workflows. SurrealDB is an open source, multi-model database developed in Rust that integrates the functionality of many databases, including relational, document, graph, time series, search, and vector databases. It supports both schema-less and schema-full data models and has a SQL-like query language. The project has rapidly grown in popularity, and version 3.0 was just released with a focus on enabling AI-powered analysis of unstructured data directly within the database, along with tooling for building event-driven applications. Tobie Morgan Hitchcock is the CEO and co-founder of SurrealDB. He joins the podcast with Kevin Ball to talk about SurrealDB, handling multi-model data, unstructured data processing, building event-driven AI applications, coupling databases with AI models, and more. Kevin Ball, or KBall, is the vice president of engineering at Mento and an independent coach for engineers and engineering leaders. He co-founded and served as CTO for two companies, founded the San Diego JavaScript meetup, and organizes the AI in Action discussion group through Latent Space. Check out the show notes to follow KBall on Twitter or LinkedIn, or visit his website, kball.llc. [EPISODE] [0:01:43] KB: Tobie, welcome to the show. [0:01:45] TMH: Thank you. Great to be here. Thanks for having me. [0:01:47] KB: Yes, I'm excited to get to learn about what you're doing. So, let's maybe start with you. Do you want to give yourself a brief introduction, and then, talk a little about Surreal and what brought us here? [0:01:56] TMH: Yes. So, I'm Tobie, CEO and co-founder of SurrealDB. SurrealDB is a new multi-model database that is designed to simplify the infrastructure and the development process for developers and organizations, where they might typically have been using multiple different databases to achieve what you can do with just a single database in SurrealDB. It's designed for AI-native applications, so we're being used a lot in knowledge graphs and graph RAG, but generally speaking, anywhere that you need to use time series data, document data, key value access to that same data, and then, graph, and be able to bring all those together in a simple, single SQL-like query language called SurrealQL. That's where SurrealDB really shines. [0:02:45] KB: So, let's dive into that multi-model concept a little bit, because I think that's interesting, and it's sort of a trend that we've seen, Postgres adopting JSON types and letting you do document storage. So, when you say multi-model, I heard a lot of different models in there, but what do you mean by that, and what's the driving vision there? [0:03:01] TMH: Yes. So, in SurrealDB, how we store data is effectively like a document database. Similar to something like MongoDB or ArangoDB. But the query language that we have built, it's very similar to ANSI SQL, but it has some differences. This query language enables you to store and query data in many different modalities at the same time. So, you don't have to think to yourself, I need to query and store data in a JSON-like way, or I need to store it in a tabular-like way. Or for this particular problem, I need to store it in a graph-like way. 
You can actually use key value access or tabular access, or time series access, or document access. But then, also, augment that with other modalities at the same time. So, you could then say, I need to link these different records together with graph-like data. And I want to be able to run a query that perhaps touches on a vector search, or a full-text search. Then, it brings me to a single record from there. I might want to do some graph querying out. Then, after that, I'll get two records with key value access for very fast performance, and I bring all that together into a single response and I'll output that to my application. So, where it had applications before in simplifying the infrastructure and simplifying the development of applications generally, today it's even more important because of the need to query structured and unstructured data, and also, to incorporate lots of different types of querying into one single query, especially when you're looking at vector, and graph, and document, and bringing those together. [0:04:32] KB: Cool. Well, do you mind if I dive in a little bit, and then understand how you're doing that. So, it sounds like kind of the underlying source of truth, the way it's - the final data stored on the disk is kind of as a document store, if I heard you correctly, which makes sense. That's one of the most flexible ways we could store it. Then, are you doing like a set of indexes on top, or how does that end up playing out? [0:04:53] TMH: Yes. So, you can as a developer define your own indexes. So, we've got traditional indexes, full text search indexes, unique indexes, vector search indexes, and we're looking at adding in geospatial indexes in due course as well. But effectively, you as a developer have the control of how you might be querying your data. We also have a piece of functionality called predefined aggregate views. These are effectively materialized views that are a custom index in a way, because they alter data, or they store data, on every write, and then, they're effectively grouping that data up for real time aggregation when you query it. Then, under the hood, effectively, we're storing everything, as you said, as a document. We update everything in the database in an ACID-compliant way. So, everything is transaction-based. It's designed to scale, so it's designed to go from a single node, to multiple nodes, kind of multi-master distributed clusters. But at the end of the day, we're storing everything in documents, we analyze everything in documents, we're processing everything in documents, but with a very flexible query language sitting on top. [0:05:53] KB: Nice. Let's maybe talk a little bit about that query language. You mentioned it's very similar to traditional SQL, folks might be used to. What would somebody need to adapt to start using it? [0:06:03] TMH: It's an interesting question. So, users who are coming from something like a graph database like Neo4j can get going with it relatively easily, because albeit it is different to Cypher, there are similarities in the arrow directions that you might use to go between the graph edges. If you're coming from Postgres, it looks like an SQL language. It has select, insert, update, create. You can define tables. The differences between ANSI SQL and SurrealQL are that we don't have joins. So, where you might have a join in a relational database or a table-based database, we have graph edges or record links. 
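To make the index side of this concrete, here is a rough SurrealQL sketch of the kinds of indexes Tobie mentions - full-text, unique, and vector - plus a query that mixes a full-text match with a graph hop and a record link. All the table, field, and index names here are illustrative, and the exact index syntax can vary between SurrealDB versions:

    -- a full-text search index over article content
    DEFINE ANALYZER simple TOKENIZERS blank FILTERS lowercase;
    DEFINE INDEX article_content ON TABLE article FIELDS content SEARCH ANALYZER simple BM25;

    -- a unique index, and a vector index whose dimension must match your embeddings
    DEFINE INDEX unique_email ON TABLE author FIELDS email UNIQUE;
    DEFINE INDEX article_embedding ON TABLE article FIELDS embedding MTREE DIMENSION 768;

    -- full-text match, then walk the graph back to authors, then follow a plain record link
    SELECT title, <-wrote<-author.name AS authors, publisher.name AS publisher
    FROM article
    WHERE content @@ 'multi-model databases';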
So, record links point directly to other records, and you can traverse those out as many times as you want, and graph relationships go both ways, so they're bi-directional. That leads to you being able to query your data in a tabular-like way, and this makes sense to people who are coming from relational or document databases, because they see things in terms of tables, or collections. But then, you can add graph networked data points between those records. So, you can say, I've got a product table, I've got a user table. But instead of linking those together with an ID field or an ID column, we can actually say, I now want to create a purchased table, and the purchased table then sits between these two other tables, user, and product. Now, I can traverse bidirectionally between the users who purchased the product and the product that was purchased by which users. It leads to very expressive, very dynamic queries without forcing you to say, I need to use graph for this problem or I need to use document for this problem. You can be flexible in the query as a whole. [0:07:37] KB: That makes sense. By ruling out joins, you allow yourself to scale very easily. [0:07:43] TMH: Yes, exactly. [0:07:44] KB: Cool. Can you do graph edges between two nodes in the same table? [0:07:48] TMH: You can do graph edges between two nodes in the same table. It's actually got a lot of flexibility. So, you can actually do graph edges between an edge and a node. So, the node being a product and a user, and the edge being bought by, so the verb. But you can actually create graph edges from the verb or the edge to the nodes or the nouns as well. So, there's a lot of flexibility. You may not want to do that. It's probably not advised, but if you want to, you can. You can create as many edges between multiple nodes as you want, or you can limit it to single relationships. There's lots of flexibility. I think the most important thing with SurrealDB and then SurrealQL, is that, as a developer, you have a lot of functionality, a lot of features, and that gives you a lot of flexibility to build what you want, how you want it. We're not really forcing you into doing things in a certain way. As a result, the applications that we've seen being built on top of Surreal are quite fascinating. We see people building amazing things on Surreal, which we hadn't even imagined. [0:08:42] KB: Yes, that ability to link from an edge is actually really interesting thinking about, for example, you mentioned the knowledge base concept. If you have a knowledge graph, you might want to link back from, "This is a connection, the fact I learned, but here's where I learned it. Here's the episode back over here." [0:08:56] TMH: Exactly. That is a big use case within today's applications, being able to combine that graph with some semantic search. I think, actually, I'm fascinated by graph, and time series as a whole. So, time within graph specifically. It always amazes me how users who don't necessarily know graph or have always seen graph as this kind of secondary database type that sits on the edge, it's not really ever used as a main database type, but people do realize that they need to use it for analytics. When they come to Surreal, they can actually really see the benefit of bringing graph into their main query model. So, they can, say, store things in tables, because humans think in terms of types. We have animals, we have dogs, we have cats, we have products, we have users. 
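As a rough SurrealQL sketch of the user, product, and purchased example Tobie just walked through (record IDs and field names are invented for illustration), the edge is created with RELATE and can then be traversed in either direction:

    -- two plain records in two tables
    CREATE user:tobie SET name = 'Tobie';
    CREATE product:surrealdb_cloud SET name = 'SurrealDB Cloud';

    -- the purchased table sits between them as a graph edge, and can carry its own data
    RELATE user:tobie->purchased->product:surrealdb_cloud SET at = time::now(), quantity = 1;

    -- traverse forwards: which products did this user purchase?
    SELECT ->purchased->product.name AS bought FROM user:tobie;

    -- traverse backwards: which users purchased this product?
    SELECT <-purchased<-user.name AS buyers FROM product:surrealdb_cloud;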
But we don't think in terms of relationships, in terms of number one links to number one in that table, or number 18 links to number 37 over here. We think in terms of the user-bought product. We like to group things together in tables or collections, but we like to relate things with graph relationships, and that's how the human mind works. When you combine those different ways of querying together, it makes absolute sense to developers or humans who come to use the product. [0:10:09] KB: That makes sense. It looked like you are also exposing a GraphQL-based way of directly querying the database. [0:10:15] TMH: We do. You can never really get away from GraphQL. I wouldn't say it's the optimum way of working with SurrealDB. It does enable you as an enterprise to get going with SurrealDB quickly. But, I would say, the flexibility of what you can do with SurrealQL over GraphQL, definitely beats it hands down. That said, you can't get away from it. It's used in many, many places and a lot of organizations use it. So, it's definitely an easy way of getting going with SurrealDB. [0:10:41] KB: Yes. Well, I think in a lot of situations, people want to use it and you have to build a wrapper around the database. So, having that built in, which I think is a little bit of a theme for you all. You're building a lot into this database. Maybe we can talk a little bit about that. You have this concept we described of Surrealism, of like moving things closer to the data. Can we explore that a little? [0:11:03] TMH: Yes. So, a lot of our users who are building these large multi-node distributed knowledge graphs, or knowledge bases, or effectively, just large databases for real-time applications. A lot of the time, they have some kind of artificial intelligence sitting on the side of that, whether it's making inference on the data in the database, or whether it's processing something before it comes into the database or as it goes out. But it's often sitting on the side, and they have to build this whole platform to make that work at scale to sit alongside their data. Now, at Surreal, we're big believers that data is really the power. You see organizations really where they win is when they can store en masse, and query, and analyze that data in the best possible ways. Therefore, if you can enable organizations, and developers, and enterprises to understand their data better, then, you unlock things that previously were unable to be accessed. So, Surrealism is the ability to build modular functions. So, using the development methodologies, the testing, the continuous integration methodologies that you as an organization, or you as a developer already use. But being able to then bring those building blocks into the database to sit right alongside your data, and to effectively run them, or call those functions in an event-driven way. For instance, when a user submits a support ticket, so we have a record created in the ticket table. I want to now take the message and I want to calculate the sentiment. I want to calculate a short summary, because it's a large message and I now want to use an LLM to generate a response. Now you'll be able to do that with three different, very mini functions, and you're able to make the choice of which model, whether you want to run a local model, a remote model, an API, and you're able to build up these building blocks to run either off the shelf or custom-developed code right alongside your data. 
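The support ticket flow Tobie describes can be sketched with SurrealQL's built-in table events and custom functions. In this sketch, fn::sentiment is just a placeholder standing in for the kind of modular Surrealism function he is talking about (a real version would call a local or remote model), and the table names are made up:

    -- a placeholder function; in practice this would call out to a model
    DEFINE FUNCTION fn::sentiment($message: string) {
        RETURN IF $message CONTAINS 'thanks' { 'positive' } ELSE { 'neutral' };
    };

    -- when a record is created in the ticket table, enrich it in an event-driven way
    DEFINE EVENT enrich_ticket ON TABLE ticket WHEN $event = 'CREATE' THEN {
        CREATE ticket_analysis SET
            ticket    = $after.id,
            sentiment = fn::sentiment($after.message),
            at        = time::now();
    };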
So, it's about bringing the functions to your data as opposed to taking parts of your data, and pushing them out to functions. [0:13:04] KB: I think this is really interesting in a couple of different dimensions. So, one, as you highlight, data is really powerful. Data also has inertia. It's hard to move around. It's expensive to move around. We've seen - even going back to like the original MapReduce stuff that Hadoop and Google were publishing about. It was about; how do you co-locate your logic with the data that it's functioning on. You're baking that into the data system, if I'm understanding it right. [0:13:30] TMH: Absolutely. You see this more and more, you know, if you look at the data pipelines of AI workflows and machine learning workflows today, they are incredibly complex. They make use of 30 to 40 different systems. Maybe there's a data cleaning, a data processing, an event driven part of the architecture. That's before pushing it to the different services that do X, Y and Z. At the end of the day, they always come back to a data store, because that's where the data is stored, and analyzed, and queried. You see it with non-AI based apps. A few years ago, when people were just building traditional software as a service, or enterprise applications, it was already a pain point back then. But nowadays, it's even more of a pain point because you're trying to understand your data even quicker than - the requirements nowadays are far greater than they have ever been before. And yet, you're having to also introduce many more parts of that pipeline. Yet, at the same time, really, it's always about your data, and you don't want to duplicate all that data. It comes back to the same very reason that SurrealDB was created in the first place. It was to consolidate databases together. Instead of you saying, I need to store time series data over here, and yes, the time series database stores things in a time order. But effectively, I've got to duplicate up my data, the same data that's in my document database, and the same data that's in my graph database, because I need to store it and query it in the time series like way over here. So, you've now got three or four or more different copies of the data, but stored in slightly different ways for slightly different analysis. Therefore, again, it's the same problem that we're seeing. Instead of having to push your data around to the different platforms, because those are what the platforms need, you can actually bring the modular functionality to your data, and have that single source of truth that is ever more important in the applications being built today. [0:15:13] KB: Yes. I think that's really interesting. I'd be curious to explore that a little bit more, like one common use case I see for this, even in very traditional setups, forget the complex AI stuff is like, "Hey, I've got a stream of business data, which my application is using and running. But I also need to transform it in some way for running business analytics, and have it in another place where my analysts can dive through and do things. It's a lot of work to make sure those stay up to date, and streaming together, and syncing together. The one benefit you get from that is your analysts can run all the queries they want, and it's not going to disrupt the functioning of your production database, because it is a separate place. How do you handle different views of this data with perhaps different priorities or different levels of service agreement? 
[0:15:59] TMH: Yes, it's an interesting question. I think, the first point to say is, we're not an analytical columnar database. The way we store data in SurrealDB is row-based or document-based. Therefore, it lends itself well to transactions and heavy writes. But if you want to run an analytical query on billions of rows, then you're going to be better off doing that in a dedicated columnar database. So, there's always going to need to be some element of duplication depending on the queries you're running. I think a lot of the time, it's unnecessary duplication, which is where SurrealDB can really help. I think to answer your question, when it comes to the priority of how we handle the data, I think the way that SurrealDB can be run is important. So, SurrealDB can be separated from a storage and a compute layer perspective. So, you can scale the storage layer independently. You may want to scale it more if you've got a high number of writes or long running reads. Then, vice versa, you can scale out the compute layer independently. The compute layer is almost a little bit dumb, in that it doesn't really understand what the network topology looks like. It always communicates with the storage layer for that information. Then, therefore, you can scale out. So, you could say, I want lots of small, low-powered instances because I'm handling lots of real-time requests, but they don't need much processing power. Or I might want - I'm processing machine learning compute here and I need to have some four or five, or a small cluster of very high-powered, high-memory compute instances. So, you can separate the queries and push them to different parts of the cluster. That cluster is always running on a consistent view of the data as a whole. That's something we offer in the cloud, but it's also something that you can do on the product as a whole. I think that comes back down to the flexibility of how you can build on SurrealDB, but also how you can run it. [0:17:51] KB: Nice. So, diving into another topic that you kind of mentioned here. So, we talked about one of these things that you're innovating on is data locality, and bringing compute to data. Another thing is, it sounds like essentially creating a reactive programming model, a sort of - I was going to say, event-driven, but it's more data-driven way of thinking about what's going on in your application. Can you describe what the motivators are for that? 
Within those event streams, that's where it becomes really powerful, because then, you can store the data, analyze the data, and store the analyzed data back if you want to, for an example. Without ever having to really leave the place where you're storing that data. That means that other teams can always have this holistic view of the data that your organization holds, albeit, they have it for different needs. So, they all have access to it. They're all able to augment that data in their different ways for their different needs, but they're still operating in the same way. There's never a duplication or data that's old or data that's stale. It's always relevant. It's always up to date. A term that we're building our product on is datagentic. It's about building the agentic workflows that you're seeing out there in other products today, but being able to bring that right alongside your data, as opposed to pushing data to the agentic workflow itself. [0:20:10] KB: That's interesting. So, explore more. What exactly would a datagentic, like what does that end up looking like in practice? [0:20:17] TMH: You see this in lots of other tools out there, that workflows effectively is one way of saying it, right? So, I want to say, I want to go to a website, understand the text, pull that text, make an assumption based on these particular things. Maybe that goes to an AI model. From that, I want to push that into, maybe it's a sales leader, I want to push that into a sales CRM. I want to make more understanding of that. So, you're basically building up these event-driven workflows based on that data. But if you can do that based on the data that's actually happening inside your database, rather than stuff that's happening outside of a database and then coming into it. That's where the true power lies. As I said at the beginning, the benefit of an organization or an enterprise comes from how it handles, how it stores, and how it collects its data. Probably not in that order either. How it collects, how it stores, and then how it analyzes its data. If you can build event-driven AI, agentic workflows based on that data as it's happening without the complexity of having to move that data around to multiple data stores, stale systems, different platforms. If you have one source of truth for that data, that's where you're able to get better insights, and better understanding of the data that your organization holds as a whole. [0:21:33] KB: Neat. I want to dive in now, looking at it, for example, as a developer who might want to use SurrealDB. I'm hearing this and I'm immediately wondering, okay, what's the programming model? How do I set up these workflows and how do I debug what's going on in them? [0:21:48] TMH: Yes. So, that's a really important point because you can't build things in a way that locks people into a system, we know that. You have to build things in a way that there are lots of different ways of building systems out there. Every developer has their own opinions on their system. So, you really have to work with their tooling, with their workflows, with their development methodologies, with their CI environments, with their testing, whatever that is. Obviously, the bigger the organization, the more layers that has. So, you can't go and replace that, and that's important. 
So, Surrealism is available as a Rust library for the moment, but it'll also be available for JavaScript, for Python, and for other languages, where you can write the code as you want to write it, but it packages into a format that can be embedded as a module into the database. We're never trying to replace the developer's way of working. I think that that's important, not just for the Surrealism functions, but also for the database as a whole. I think, even more so than any other database out there, you don't have to store your data in SurrealDB in a single way. In fact, you can come to SurrealDB from a document database like MongoDB, from a relational database like MySQL or Postgres. You can store data from Neo4j and from InfluxDB without having to be forced into, this is the way you have to work with your data. If you understand your data from any of those platforms, you can understand working with your data in SurrealDB, and that is exactly the same as how we're building these Surrealism functions. You have to approach it with the same approach and understanding that developers have, because otherwise, developers don't want to and won't use it. [0:23:24] KB: If I'm understanding correctly, these end up being packaged up similar to how you might do like edge functions, cloud functions, or something like that. Except, instead of running out on the web server near the edge in a CDN, it's running in your database, ready to do whatever you need based on data events rather than user events. [0:23:41] TMH: Absolutely. I think it's interesting how you say, "Oh, you're sitting at the edge or in the database." So, with SurrealDB, because you've got that separation of storage from compute, the compute can sit wherever you want it to. It can sit at the edge. It can sit in a centralized location. You can distribute it across multiple availability zones or regions in a data center. So, you're not limited to just storing this in the central place that you might have done before. You can run this effectively wherever you want to. You can also run this on the edge devices of a user, maybe on a web device, maybe in a phone. So, really where you run it and how you run it is up to you. The difference is, the workflow and the development needed to make that process in terms of bringing the functionality to the data come live, as it were. [0:24:25] KB: Interesting. Now, you have sparked a whole new set of questions for me. In that model, you're saying, okay, this thing is going to run based on some sort of event that happens in my database. It's going to have access to - is it just the data that changed or - actually, maybe I should start there. Because when I start thinking about layers of edge, I start thinking about, "Oh, who has access to what? How do I get it? And what's the performance look like?" Let's go first to the principle. You run one of these functions based on a data change in Surreal. What data does the function see? Is it just the data that changed? Can it fetch other things? What does it have access to? [0:25:03] TMH: Initially, it would have access to the data that was changed, but that depends on how you define it. So, you could define an event based on that data changing, but you can also then augment that data with other queries. So, you could access related records to that record or you could actually just create a whole new query based on that event, and query something else. 
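On the question of what the function sees, SurrealDB's table events expose the change itself as variables ($event, $before, $after), and the event body can run further queries beyond the changed record. A small sketch with invented table and field names, assuming owner is a record link to a user table:

    DEFINE EVENT notify_on_close ON TABLE ticket
        WHEN $before.status != 'closed' AND $after.status = 'closed'
        THEN {
            -- the event sees the change via $before and $after, but can also query related records
            LET $owner = (SELECT * FROM ONLY $after.owner);
            CREATE notification SET
                user    = $after.owner,
                email   = $owner.email,
                message = 'Your ticket has been closed',
                at      = time::now();
        };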
The flexibility is there, again, back with the user in terms of what they want to do, and what they want to build. It goes without saying that, if you have an event that is one kilobyte, and now, you need to run a request that targets a billion rows, that's going to have a performance impact. So, you always have to take that into account. But at the end of the day, you're in control of what you want to query, how you query it, what data you want to access. The event where it comes in can be processed on the node where the client query was run. So, it could be sitting at the edge in a CDN, on the device, vice versa. It could also be more centralized. So, it could be that the requests are coming into the database, and the database - with the storage and the compute layer sitting together, as it were - is where the request is handled or the event is handled. So, there's lots of different ways of handling that and lots of different ways of deploying SurrealDB. I think the most important thing is what I mentioned a few minutes back, which is the compute layer is dumb. You can have a thousand compute nodes talking to the underlying storage. That is the database. That's where the database querying is performed. That's where the event driven data processing, and compute logic, and effectively function execution is happening. But it doesn't have to have any knowledge of the rest of the cluster. So, if you've got a thousand nodes or if you've got two nodes running as a compute layer, you can build that, you can deploy it as you wish, depending on the requirements of the application or the platform being built. [0:26:47] KB: As I think about that compute layer and think about this desire to push things out as far to the edge as possible, which is great if you can do that. What's the permissions model? Once you're inside the compute layer, does it have global access to the data? Or, if I have something that's way out on an edge, or as you mentioned, even on a user's device, can I restrict what it has visibility into? [0:27:08] TMH: Yeah, absolutely. So, if you go back down to the storage layer, let's be honest, the storage layer is itself a database, right? It's a key value ordered database that has no permissions. So, if you have access, full access to the storage layer, you can see everything. But the compute layer itself has its own permissions layer. So, you can say, I want users to only be able to see records where the account ID is one, or records where the label field is green, let's say. So, you can - and permissions can be as complex as you want them to be. You can authenticate into the compute layer with OAuth. You can authenticate with username and password or email and password. You can build custom authorization logic as well into the SurrealDB compute layer. It really, again, comes down to the needs of the application, but effectively, that SurrealDB compute layer controls what you can access from the storage layer and then how it's processed. So, if you wanted to make this available directly at the edge as a web database as it were with real-time functionality, you can do. Therefore, the functions that are exposed or the functions that are available to be run on demand in the kind of Lambda request model can be done in a similar way based around those permissions. [0:28:20] KB: Fascinating. Okay. Another question related to this sort of bridging, because I love that we're breaking it down into layers now. 
So, you have this dumb compute layer, which is very developer-controlled, and then, you have the underlying database layer. One of the premises that we talked about here was co-location. Is there a way for the compute layer to either by the developer knowing this or automatically kind of get to co-location, like, "Oh, I know that this data is over in this part of my network topology. I want any sort of functions that are running on that data to run near them."? [0:28:53] TMH: It's an interesting question. That's not something you can do yet, but I'd say there are lots of different things that you can do that make that not a necessity, if that makes sense. Imagine that the key value store, the storage layer, shards and rearranges its data based on the load of the cluster at the time. So, if one node is really busy, it may move and shift data across. Now, knowing where that data is at any one time is not really part of the compute layer's job, it just knows where to query it. However, as a user, as the person building the application in front of the SurrealDB compute layer, you could choose to host machine learning models or LLM functions or Surrealism functions themselves in particular areas, in S3 buckets, in local storage, in Elastic File Store deployments. So, you have flexibility in terms of, can my compute functions be sitting nearer to where they're going to be run? Or actually, you could then say, one thing you can do in SurrealDB is you can say, I want to store all of these models, these WebAssembly models, and functions in an S3 bucket. This S3 bucket is going to be distributed globally. So, it's going to be replicated in multiple regions. Then, I always want to pull from the region nearer to me and execute that function nearer to where the user is. So, you can do that, definitely. Again, it's never really - albeit it's deployed on S3, or Google Cloud Storage, or Azure Functions, or Cloud Storage, you don't have to think about deploying it or running it in that way. It's always based around the needs of accessing the data, querying the data, or being event driven by that data. You have the flexibility to build it as you want, to deploy it as you want, but you don't have to have the complexity in deploying the infrastructure and the different systems that typically would make a system like that work. [0:30:49] KB: Super cool. So, let's now look a little bit at the debugging story. I set up these functions, they're running off in SurrealDB's compute layer. They're modifying my data or writing new materialized views based on things changing. Something goes wrong. What kind of tracing do I have to see, what happened, how did this end up here? [0:31:12] TMH: Yes. So, everything in SurrealDB is traced and logged through OpenTelemetry. In addition to that, it can trace and log to a local file or the command line as well. So, if you are running an OpenTelemetry stack, maybe with Elasticsearch, or Grafana, or something like CloudWatch, you can trace everything that's happening. Depending on how much you want to have logged and how much you want to have traced, you can trace down to the very requests to the underlying storage engine. There may be hundreds in one particular request, for instance. So, you can trace at a very granular level if you want to. There are also metrics around the number of functions being run, the number of HTTP requests being made. 
It's very easy to integrate this into a traditional stack where you may need SOC 2 compliance or ISO 27001 compliance, or the requirements of FedRAMP, for instance. If you need those requirements, you need the tracing, you need the visibility into what's happening. Not inside of your data necessarily, but inside the platform that's running your data or enabling that data to be queried. Then, you can plug that in effortlessly. One thing we haven't actually tried, which is always something I've wanted to do is, can we actually use SurrealDB to trace SurrealDB being run? It's not on the immediate roadmap, but it is a thing I'd love to try. [0:32:32] KB: Yes. Just dump your telemetry data right back in. [0:32:34] TMH: Exactly. [0:32:36] KB: The modern data version of bootstrapping. That's cool. I find myself wondering also, because some of what you're talking about here reminds me almost kind of, it feels like SurrealDB is well set up for event-driven data models. Where your source of truth is actually the stream of events that's happening, and then, you're materializing views. Some of what you're creating with these datagentic flows is essentially that or could be that. One of the characteristics of those is replay, and being able to do like replay, debugging, and going back, and sort of saying, what exactly was the state at this point in history? Is that a thing that you've looked into at all or had support for? [0:33:17] TMH: There are two things. You bring up good questions here. There are two parts to Surrealism that are designed exactly for this. So, Docker made the aspects of versioning of containers very popular. We didn't want to move away from that. So, in SurrealDB, when you bring in a Surrealism function - and you can already see this with our SurrealML models that you can bring into the database - every single model or every single function is versioned. It can use any versioning type you want. So, again, we're not forcing you to use SemVer if you don't want to use it. But effectively, latest will always bring out the latest uploaded version. But other than that, you can query exact versions of the function that were uploaded at any one time. So, that's the first point. Every single function that is uploaded to the database doesn't have to overwrite the old function. It can be added to, and you could actually then say, "Right, this function doesn't have permission to be run yet. I want to make sure that it queries data in the right way or responds to users' requests in the right way before I make it latest. So, it's currently a development version and no one has permission to run that apart from admins yet. But when they do, I'm going to make this the default version. I'm going to remove permissions from the old version and this is the only function that people can run." So, you can effectively - you could do clever things like A/B testing as well. So, say you want 70% of people running the previous version and 30% running the latest version. Then, we'll switch and move everyone to the latest version when we're happy with the response that it's generating. In addition to that, one of the hardest and most research-based things that we're building at SurrealDB is the ability to travel back in time through your data. So, this is effectively temporal querying. It's not the same as time series data, but it's the ability to go back over your entire data set and see what it looked like at any one point in time. 
Now, this has a lot of different uses and applications across many different industries. One of the main reasons for it is being able to see what a graph looked like at any one point in time, which is a notoriously hard and difficult problem to solve. But being able to say, "Look, I want to find this user and find out what users he was connected to six months ago." I don't want to have to add dates and times to every single relationship. That becomes very, very complicated when you're querying. I just want to go back in time and see what was the entire data set at that time. Now, this particular query, what did it look like at that time? Now, if you can combine those two things together, so you say, I've got a temporal query view over my data and I've got versioned functions. Now, you can give an exact replica of any response that was ever given at any point in time, which becomes incredibly important for organizations who need to have reproducibility and insights into what AI is doing, and how it was doing it. That, and also, what access to data did it have at that particular point in time. So, they're big problems to solve, and we're working on those. But you combine those two things together, and it makes that very easy for enterprises and organizations and developers alike. [0:36:24] KB: That's brilliant. I think, particularly when you combine the machine learning aspect of it, because keeping sort of an eval framework - is my model behaving the way I want? You could keep track of historic queries that maybe did or didn't behave the way you want, run them against your new models as a test set. There's a ton of interesting things you could do there. [0:36:42] TMH: It's even more interesting, because organizations know that they need to have this insight into their data, but at the same time, they don't want to be left behind. They're pushing, depending on the industry, and the requirements of that industry. But organizations are jumping into large language models, and pushing data into large language models without really having an understanding of how it's working, the reasoning behind that. I know that's improving, but it's still got a long way to go. Or even the results that it's giving for the response it's generating. A lot of industries can't touch that if they don't have the understanding of how it's doing something. Even less so, if they don't understand what data it acted on at any point in time. This definitely helps to mitigate some of those issues. It doesn't solve the problem if you're working with large language models yet, but it can enable you to go back in time and say, you know what, this is the data we acted on, and this is the code that we acted on at the time. We can go back, and reproduce that, and replicate that. Then, we can go from there. [0:37:42] KB: You mentioned that you can both upload versions of models, but you can actually call out, if I understand, to third party models. So, if you're using an OpenAI-hosted model or an Anthropic-hosted model, you can do that. Are you able to version that in some ways in the same way of saying like, "Hey, right now we're referencing this third-party endpoint with this version number and this kind of keeps the same historic view."? [0:38:04] TMH: If you have any custom logic around that API built into that function, then that's obviously versioned. If the API call itself cannot be versioned, let's say they've gone from 3.0 to 3.3, and you're calling the API, then you can't really control that. 
However, if you need to be in control of APIs, you probably wouldn't be using an API. You'd probably be using something that can be versioned, or you'd be running Llama locally, and deploying the models locally, and running this on local GPUs. So, you can do all of those. If you're adding context and surrounding code to an API call, that's always versioned. If you're calling something locally, you can always version that. If you're relying on an external API, then you're almost at the whim of how they do version releases. [0:38:49] KB: Yes, that makes a ton of sense. So, I'd love actually to hear a little bit now. So, you've coined this term datagentic, you're obviously reacting to sort of the zeitgeist of the market here in terms of moving this. What are you seeing in terms of new application types being developed using these functionalities? [0:39:08] TMH: So, I don't think it's necessarily anything new. I think that's the same thing for SurrealDB as a database, as a whole. You could always build what SurrealDB offers or how SurrealDB enables it for you, by using four or five different database platforms, and having an API layer, and an event-driven layer in between, and maybe having some API connectors, whatever that is - [0:39:31] KB: I feel called out. [0:39:32] TMH: - you're required to build. [0:39:34] TMH: Exactly. The ability to build, that was always there. You see more and more organizations with these deployments every day. Even more complicated because of the needs of AI, and machine learning, and all of the processing that needs to go on in addition to that. But it's the same thing for AI, with agentic workflows, you're seeing people push to incredibly complicated workflows and pipelines from an infrastructure point of view, from a development point of view. We saw it with the advent of microservices. Everything became a microservice, and a microservice could be 10 lines of code, and it was deployed as a Golang application, as a container. That was great, and it was brilliant, and enabled small teams to work on it. But it also resulted in incredibly complicated setups as a whole, because now, your piece of data now needs to go through 25 different services built by 25 different teams in order to come back and give you a result. Now, if you can enable that organization to work in a similar way, if you can enable them to say, "Look, these teams have different responsibilities when it comes to the data, when it comes to the requirements." But any team can work with the data as they need it, you don't have to relate and work with different teams in order to build your processes and build your pipelines. That's where it becomes really powerful. So, I would say nothing that SurrealDB is doing is groundbreakingly new. That's a really bad way of saying it, I think. But it's probably true, it's just simplifying, massively simplifying, the infrastructure and the process with which people are building the applications now. It's obvious why that happens. A new technology comes out, and suddenly, there are 400 different platforms offering very small pieces of functionality on top of that. As an organization, you need to use 30 of them. So, you do end up using 30 of them. Actually, if you really want to innovate and you really want to move forward faster, you can't be using 30 different platforms. You have to be owning that technology and really coming back to your data, the central source of truth, and operating from there. [0:41:34] KB: Yes. No, I think that makes sense. 
The microservices example is a really good one. Microservices allow you to trade off: you get benefits in developer flexibility and simplicity within the program at the cost of infrastructure complexity and operational complexity. What you're saying is, you know what, you can keep that programming model for this set of things, but we'll take all that infrastructural complexity and make it disappear for you. [0:42:00] TMH: Absolutely, and still make it event-driven, which, as everyone knows, especially in this AI landscape that we're in right now, is the best way of working with your data. For users, and organizations, and enterprises who are using SurrealDB now, it's usually always real-time, event-based data. Being able to say, "Look, someone has purchased a product on my site. When someone else comes to my site, I need to be able to recommend a product based on historical data or based on some search that they provide, or some other information about that user." It's happening when something happens. I think this is the case generally in life, in the world. Everything is event-based. Everything is triggered by some event, and you need to act on that data. Maybe the result of that trigger doesn't happen until two weeks' time or a year's time. But we need to process that data as it happened and create another piece of data or update another piece of data based on something that has happened before. That is effectively what we're enabling, but without the complexity of the infrastructure, and of the overly complex development model that's been created because of the benefits that microservices and multi-platform services offer as a whole. [0:43:14] KB: This kind of reminds me actually of another place where I've seen a lot of this reactive programming model, in building real-time interactive applications. Often, those end up with some sort of either push-based model, or a WebSocket model, where you have some central service that is keeping track of what's changed in my data, and pushing it out to any clients that are active at the moment. If I were to build something like that in SurrealDB, is that something I could put directly into the compute layer, or does that compute layer have to run statelessly enough that it's not going to work very well? [0:43:46] TMH: Another good question, because I might have lied a little bit earlier. The compute layer is dumb, in that it doesn't really have an understanding of topology. But it does have an understanding of the other clients that are connected to other areas of the database as a whole. So, it does understand other queries that are being run by other clients connected to other compute layers. The reason that is the case is because we have a piece of functionality called live queries. Live queries effectively, if you can imagine in Postgres, you can say, "Select all from table," but in SurrealDB, you can then say, "Live, select all from table." Now, when somebody else on a different compute node updates that table with a new record or maybe five new records, I can then receive an event which will give me those five new records. I can receive the full event, or I can receive the diff, so the change in data as it happened, or vice versa. Then, we can start building a workflow and a pipeline around that. So, I can say, when that user updates those records, now, I want to generate a large language model summary. I want to calculate the sentiment. I want to generate this as an image, but I want that image to be minified. 
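For reference, the live query Tobie describes is roughly the following in SurrealQL; the delivery mechanics (WebSocket notifications, stopping a stream by its query ID) depend on the client SDK being used:

    -- stream every change to the ticket table to this connection as it happens
    LIVE SELECT * FROM ticket;

    -- or receive diffs describing each change, rather than the full records
    LIVE SELECT DIFF FROM ticket;

    -- each LIVE statement returns a query UUID; stop the stream with KILL
    KILL $query_id;   -- $query_id holds the UUID returned by the LIVE statement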
So, these are all like mini functions that can be building and operating on that data in different ways. Then, finally, I want to update that support ticket, let's say, and the updated, that support ticket will be pushed to anybody who is interested in that change. So, it's event-driven with a completely custom-built workflow or pipeline sitting in the middle of those two events. Because SurrealDB is effectively a web database, you connect to it with HTTP or WebSockets. You can listen to those events, and then, receive them as they happen. Actually, there's nothing you have to build on the developer level at all. This is all just, how do you want to access your data, what do you want to do with it? That's where the simplification of bringing the agentic AI or the agentic functions to the data makes total sense. Because now, instead of having to push that out to other systems, which then all have to be event-driven in order to push back to us, so that we can push the user eventually. You're sitting within that entire workflow or that entire lifecycle in the database, which is always event-driven. [0:46:00] KB: Yes, that's kind of brilliant, right? Instead of having to orchestrate all of these things going on, I let you take care of that. I, as the consumer know, "Okay, I have some new information, I'm going to dump it in the database, and just wait for you to finish running your updates, and then I'll get them, and I'll be able to update right here." [0:46:16] TMH: Then, instead of the teams building their own different containers or different microservices, the teams can build their own functions. Those functions can listen and respond to the data that they need to respond to. It could be completely different departments. There could be functions designed for sales, which are listening to product purchases, and generating dashboards. It could be functions for engineering or for support teams that are listening to different responses. So, those functions can be built for the different teams, or for different requirements, and deployed inside the database, but never having to force the data to leave that environment, which has applications for security. It has applications for simplicity of development. It has applications for the event-driven nature of that data. At the same time, it's still a database. So, you can still run massive queries, analytical queries over that data if you want to, and put out data for dashboards, or for reports outside of the event-driven nature, but still, you can build everything around that. [0:47:17] KB: That is super cool. When you start talking about other departments, it makes me think about, hey, are you going to build WebUI, no code version of how you can drive these event loops, and integrate them, and kind of absorb that whole like marketing automation world as well? [0:47:32] TMH: I think there's a limit to where we would want to go. And that, yes, being able to more simply define event driven logic from your data is definitely something we want to solve. I think, if you're looking at specific departments like marketing and sales, these often use tools outside of the database. So, for instance, they might use a CRM, or they might use a marketing platform that has insights into web data, which is not stored in the database. So, you always have to link in with those. I would say, where SurrealDB wants to sit is the automation and workflow generation level or layer around your data. 
That could be simplification from a low code perspective for generating those event-driven functions, or it could be more developer-focused where they actually want to write that code and query their data. But then, you can build from that if you want to connect to other platforms. Maybe you build those as functions in Surrealism and bring them into the database, or maybe you push them out to APIs. I think, we're never going to compete with tools that are kind of specialists in those particular areas, but we enable all of the data that you as an organization hold, and own, and want to control, linking more efficiently to the platforms that you need to link to, which are typically external to your central source of truth. [0:49:02] KB: Absolutely. I can't tell you how much time I've spent saying, "Okay, pull this, sync it to HubSpot, or to Zapier, or to this, where they can finally touch it." If I could just put that into my database and let it roll, like that's brilliant. [0:49:14] TMH: Exactly. Yes, exactly. [0:49:15] KB: Well, we're getting close to the end of our time here and I want to make sure, is there anything we haven't talked about yet today that you want to make sure that we touch on before we wrap? [0:49:24] TMH: I think we've touched on a lot, because your questions have delved into areas that I guess, normally, we wouldn't touch on from a technical point of view. So, it's definitely been interesting. I think the only thing that perhaps we only briefly touched on and probably deserves a little bit more looking at maybe is the authentication piece of functionality. In SurrealDB, you've got this very flexible authentication layer, where as an organization or a developer, you can completely fit that into the way that your organization works with authentication or security. So, you could fit in with OAuth, you could fit in with SAML, you could fit in with OpenID, vice versa. You could use your username and password. Or maybe, you build a completely custom layer in front of SurrealDB, and it talks to SurrealDB as a full access database in a traditional way. The point is, I don't think enough databases or data platforms talk about security in the way that we should be talking about it nowadays. When you build on your data in an event-driven way, or in any way, the security around that data is even more important. If you're enabling users to come to your data and access it in different departments, different employees to access that data - that central source of truth - in different ways, then you need to have incredibly fine-grained control over how they can access it, when they can access it, what they can access. I think maybe, to put it in a different way. I think SurrealDB is looking at the security of data in the database in a completely different way to any other provider out there. I think that's important when you start thinking about how can that data really be used in an organization as a whole without having to be just a building block of many other parts of the infrastructure. [0:51:14] KB: Yes, I'd actually love to just take a couple minutes and dive into that, because at a previous job, I ended up essentially building a layer around a database that was about permissions. Because of the security principle that you want to push your permissions and security as close to the data as possible, and make it as hard for developers to screw up as possible. So, with this system, what kind of granularity can you put in here in terms of, is it down to the document level? 
Can you screen out certain sub parts of documents? Can it be invisible to your application layer? What does this end up looking like in practice? [0:51:48] TMH: Effectively, how you define permissions in SurrealDB are like SQL queries. So, you can say, this table, the user table, the user can only access it if their user ID that they're authenticated with equals the ID of the record they want to access. You could also then say, if the user's account ID matches the account ID on multiple documents, then they can access those documents. You can effectively write completely custom SQL to define what a user can or can't do. That authentication for that user will come from an external system. So, it could come from Firebase Auth, it could come from Okta, it could come from Auth0, it could come from your own single sign-on system. So, the authentication data can then be used internally in the database as a variable, and it can be applied to these SQL statements to determine what you can see. What you can see can then be effectively specified at the table layer or the collection layer. So, that limits you from being able to see a table as a whole or specific documents in those tables. But you can also then define permissions on the fields. So, in SurrealDB, you don't have to define fields. You can keep everything completely schema-less, or you can define fields for everything. So, you can go completely schema-full, or you can mix and match. So, you can have schema-less fields with schema-full tables as a whole. But importantly, the permissions then can be defined on those. So, you can say, right, this password field, no one can see it. Only the administrators of the database as a whole can see this field. That becomes incredibly important, because when you have data that's being shared between departments or between applications, potentially, you only want to be able to give access to the data that needs to be seen. So, instead of pushing data out, and taking a minified approach to the data that's getting sent out to a system, you can let people only access the data they're allowed to access. So, you don't have to think about other layers that query the data before sending it out to these other systems or pushing the data that's required to these other systems. You already know at the data level where the data is stored, who can access what, and how can they access it, and what can they see. [0:53:59] KB: Honestly, that might be one of the most exciting things you've told me about Surreal. That's phenomenal. [0:54:04] TMH: Security is one of those things that everyone expects you to just have to do as a platform, but no one really focuses on the real benefits of it. I can be including that. Security is the boring thing that everyone should do. Whereas, when you talk about distributed compute that can sit at the edge, it's much more exciting to talk about. But honestly, I think the security aspect of having those definitions sit right alongside your data, brings another layer or another point to the fact that the data is a central source of truth in any organization in the world. If you can understand your data, if you can have complete oversight, and control over that data, you're going to be in a better position to move forward as an organization or as an enterprise. [0:54:50] KB: Yes. As you say, most people don't like to think about security, but what you allow is for the person who does to set everything up, and the rest of the development team doesn't have to care, it will just work. 
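To make the permissions model concrete, here is a rough SurrealQL sketch of table-level and field-level permissions along the lines Tobie describes. The $auth variable carries the authenticated record, and all of the table and field names here are illustrative:

    -- table-level permissions written as SQL-like expressions
    DEFINE TABLE ticket SCHEMAFULL
        PERMISSIONS
            FOR select, update WHERE account = $auth.account
            FOR create WHERE $auth.id != NONE
            FOR delete NONE;

    -- fields can be defined (or left schema-less), and carry their own permissions
    DEFINE FIELD message ON ticket TYPE string;
    DEFINE FIELD account ON ticket TYPE record<account>;

    -- the password example: only database-level administrators can ever read this field
    DEFINE FIELD password ON user TYPE string PERMISSIONS NONE;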
[0:55:01] TMH: If you set it up correctly. But yes, exactly, exactly. [END]