EPISODE 1662

[INTRODUCTION]

[0:00:01] ANNOUNCER: Managing data and access to data is one of the biggest challenges that a company can face. It's common for data to be siloed into independent sources that are difficult to access in a unified and integrated way. One approach to solving this problem is to build a layer on top of the heterogeneous data sources. This layer can serve as an interface for the data and provide governance and access control. Cube is a semantic layer between the data source and data applications. Artyom Keydunov is the Founder of Cube, and he joins the show to talk about the approach Cube is taking.

This episode is hosted by Lee Atchison. Lee Atchison is a software architect, author, and thought leader on cloud computing and application modernization. His best-selling book, Architecting for Scale, is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee is the host of the podcast Modern Digital Business, produced for people looking to build and grow their digital business. Listen at mdb.fm. Follow Lee at softwarearchitectureinsights.com and see all his content at leeatchison.com.

[INTERVIEW]

[0:01:24] LA: Artyom, welcome to Software Engineering Daily.

[0:01:26] AK: Thank you. Thank you for having me today. Excited about today's conversation.

[0:01:30] LA: Great, great. Let's make sure we're all on the same page to get started. Let's talk about some fundamental definitions first. The term "data silo". When I think of a data silo, what I think of primarily is independent data sources that contain interrelated data, data that's meant to work together. What do you think of that definition? Is that a good definition, or what would you enhance it with?

[0:01:58] AK: Yeah, I think it's a good definition. Essentially, a database, a specific warehouse, or some data storage where data is located becomes disconnected from other places, and it becomes a silo. That's what people usually think about with data silos, and I think it's a good enough definition. The only way I would enhance it is: what if we also think about metadata, or data definition, silos? That's an interesting problem, and the one we solve at Cube. Maybe in your Power BI, in Tableau across the organization, or in some Python Django app, you don't actually hold the data, but you have a lot of SQL scripts that do analysis of the data. They calculate some metrics, and they become metric silos, or data definition silos. You may have calculated something to show data to, say, a customer or a partner, or internally, and people got some idea out of this data. Maybe the data was correct, but the definition was siloed. That's an interesting enhancement to the idea of data silos. It's not only the data itself, but the data definition, the metric definition, that gets siloed.

[0:03:11] LA: Got it. Yes. It's not just the data. It's how the data is used, how it's defined, and the meaning of the data as well. That's actually a great extension of that definition. Why are data silos a critical issue for data modeling in general, or data management and data usage in general?

[0:03:32] AK: Yeah. If you zoom out, I think the whole purpose of having data is to help the business drive decisions. We want to be data-driven as an organization. People at all levels, execs, management, and individual contributors, all want their day-to-day decisions and operations to be data-driven.
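Before the conversation continues, here is a hypothetical sketch (not from the episode) of the "data definition silo" Artyom describes: the same business metric defined independently in two tools. Both definitions are plausible, both queries run fine; nothing keeps them consistent.

```typescript
// Hypothetical illustration of a "definition silo": the same metric,
// "monthly active customers", defined independently in two tools. Both
// queries run fine; they simply disagree, and nothing keeps them in sync.

// Marketing's dashboard counts any customer with a recent session.
const activeCustomersMarketing = `
  SELECT COUNT(DISTINCT customer_id)
  FROM sessions
  WHERE started_at >= DATEADD('day', -30, CURRENT_DATE)`;

// Finance's report counts only customers with a completed order this month.
const activeCustomersFinance = `
  SELECT COUNT(DISTINCT customer_id)
  FROM orders
  WHERE status = 'completed'
    AND created_at >= DATE_TRUNC('month', CURRENT_DATE)`;
```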
They need to have access to data. What's happening is that we try to create more and more touch points for people with data, but naturally, by creating those, we also create silos as a side effect, because we move some data closer to, say, marketing, and they start to use it, but then it becomes siloed. By silo, I usually mean something disconnected, as we just defined, whether it's the data itself or the definition itself. The problem is that it becomes really hard to keep them in sync. You end up in a situation where the organization becomes data-driven. They do work with data. They have access to the data. But is this data correct? Does it show the real number? Does it stay in sync? If your company decides to change a specific metric, maybe five out of your seven silos get updated, but the rest haven't been. Now some of your departments are looking at the old definition. Since we're on a software engineering podcast, if I try to generalize it, I would call it a repetition problem. We have the DRY idea in software engineering: don't repeat yourself. Essentially, what's happening is that we repeat data, or repeat data definitions, in many places, and then we need to keep them in sync. As engineers, we know that's just bad, right? It's really, really hard to repeat things and then keep those things in sync.

[0:05:24] LA: Let's go into an example, just for those who aren't following. You've got data that's collected via an engineering department, and then marketing wants to make use of some of that data, so they take part of that data and bring it into one of their systems. Now the data is duplicated into a different system. They take that data, perhaps enhance it with some other information that they have, reformat it, restructure it, or reanalyze it in a different way, and make different use of it for different purposes. Even though the fundamental data has common roots, the data itself is now out of sync, because there are differences and different interpretations. Not only is the data itself a little different, but the interpretation of the data is considerably different. That's the out-of-sync, non-DRY, if you will, problem that you're talking about here. Is that correct?

[0:06:16] AK: Yeah, exactly. That's exactly it.

[0:06:19] LA: You've created something that's called a semantic layer on top of data silos. I think that's actually what you're calling it as well. Tell me a little bit about what a semantic layer looks like on top of siloed data.

[0:06:34] AK: Right. Yeah. If we think about the semantic layer, many of us work with different types of semantic layers every time we work with a BI tool. Essentially, the semantic layer is usually a part of the BI. In every BI, we can drag and drop measures and dimensions to build charts. Every time we work with these high-level, business-level metric definitions, we work with a semantic layer. What the semantic layer does is take these definitions and translate them into SQL queries; it knows the underlying database structure, or warehouse structure. The semantic layer is a bridge between the metrics that the business utilizes and the underlying data structure. Now, the problem is that every BI has its own semantic layer.

[0:07:26] LA: Just to make sure everyone's on the same page, by BI, you're talking about the business intelligence that's –

[0:07:30] AK: Business intelligence.
[0:07:31] LA: - making use of the data, whether that's –

[0:07:32] AK: Exactly.

[0:07:33] LA: - a marketing use, or some other use?

[0:07:35] AK: Right. Right. Yes, exactly. Now, if an organization has 10 business intelligence, data visualization, data consumption tools, the semantic layer is in every one of those tools, which creates silos. It creates data definition silos. That's a problem for all the reasons we just talked about. It's not DRY. It's going to fall out of sync. The solution to that is a universal semantic layer. That's what we're building at Cube: the idea is to take the piece of your stack where you have repetition, specifically the semantic layer, and put it in a single, universal place. From that place, you can reuse the definitions across all your data visualization, data consumption, and business intelligence tools. It makes your system, your data architecture, DRY at scale.

[0:08:30] LA: Okay. You didn't do anything with the data. The data may or may not be DRY; it probably is at some level. Hopefully, it is anyway. It's the definition itself; you've created a DRY understanding of what the various pieces of data mean and how to interpret them, and created one standard for how to use that data.

[0:08:50] AK: Yes. Yes. Exactly. Cube becomes a universal semantic layer, an interface to your data that holds all the definitions in one place. It knows about the underlying data silos. It knows about the underlying data storage. It knows about potential issues with the data. But all these issues don't need to be exposed to the consumers. The consumers only communicate with Cube, and Cube communicates with all the underlying sources. Cube becomes an interface to the data for the data consumers, like a facade. There's a pattern in software engineering, the facade pattern. Cube is essentially a facade over the data for the data consumers.

[0:09:31] LA: Maybe it'll help if we go into a specific example. Let's assume we have an e-commerce store. That's a great example that everyone loves to use, so let's use an e-commerce store. An e-commerce store is going to be collecting data from multiple places and for multiple purposes. It's going to be collecting data about website hits and clicks and all that stuff. It's going to be collecting data about advertisements and the effectiveness of those advertisements, and who's driving what traffic to the site. Then there's going to be information from cart adds and cart deletes and checkouts, which ultimately turns into order data. Then shipment information from a warehouse, and that's going to be a different set of data. There are 20 or 30 other pieces of data that we haven't talked about, but those are basically the different sources and types of data we're talking about. Using that example, walk through what a semantic layer might do and who might be able to take advantage of that.

[0:10:36] AK: Right. Imagine you collect all this data into, say, a warehouse like Snowflake. In our example, we can keep it simple and have one single warehouse. You collect all your data into that warehouse. Maybe you use an ETL tool like Fivetran to ETL some Stripe data, ETL your Shopify orders. Then you can enhance that data with some analytics coming from your websites through Segment. At the end of the day, all the data arrives in Snowflake. Now, in your organization, you have Tableau, you have Power BI, you have Excel, and then you also need to display data to the customers through dashboards.
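As a toy illustration of the translation step Artyom describes, a semantic layer mapping business-level definitions into SQL, here is a minimal sketch. All names are hypothetical; real semantic layers such as Cube also handle joins, security, and caching.

```typescript
// Toy illustration of what a semantic layer does: business-level names
// compiled into SQL, so consumers never write table logic themselves.
type Definition = { sql: string; table: string };

const definitions: Record<string, Definition> = {
  averageOrderValue: { sql: `AVG(amount)`, table: `orders` },
  orderCount: { sql: `COUNT(*)`, table: `orders` },
};

// "Give me average order value by status" becomes one generated query.
function compile(measure: string, dimension: string): string {
  const def = definitions[measure];
  if (!def) throw new Error(`Unknown measure: ${measure}`);
  return `SELECT ${dimension}, ${def.sql} AS ${measure} ` +
         `FROM ${def.table} GROUP BY ${dimension}`;
}

console.log(compile("averageOrderValue", "status"));
```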
Now you start building different metrics. Say you want to build an average order value. You define that in a Tableau workbook, you define it in Power BI, and then you define it in some SQL script that's powering your customer-facing analytics. At some point, you probably want to change that, right? Because the business evolves, the definitions of the metrics evolve, or you need to fix some discrepancies in the definitions. You go and redefine it, maybe in Tableau, but you forget to redefine it in Power BI, and maybe in some of the charts in the customer-facing analytics, because you have 100 SQL queries that power different charts. It becomes a bigger problem the more metrics you need to change. At some point, they're all out of sync. The solution would be to ask, okay, do we really need to define them in the visualization layer in the first place? What if the visualization layer were thin, dumb from a logic perspective, just rendering things, and went to a universal semantic layer for its definitions? Do we really need average order value to be defined in Tableau? Probably not. Tableau can just go and say, "Hey, Cube. Give me average order value." Without a semantic layer, your Tableau goes directly to Snowflake. With a semantic layer, Tableau goes to Cube, and Cube goes to Snowflake. Cube becomes the hub that receives all the queries, rewrites them against the real definitions of the data, goes to Snowflake to query the data, and sends it back to Tableau. It's a universal proxy, or gateway, to the data for all the tools, and it holds all the definitions. By having it, you can make the visualization layer very thin, without any logic attached to it.

[0:13:17] LA: If you need an average order value, you create an average order value piece of data, if you will, that's calculated. That calculation occurs within the semantic layer, once.

[0:13:29] AK: Right. Right, exactly. Once.

[0:13:31] LA: There are obviously DRY-code advantages to this. Are there performance advantages as well?

[0:13:37] AK: Yes. There are several additional benefits: performance, and unified governance and access control. On performance, once you define everything in the universal semantic layer, it means you're going to query everything through the semantic layer. That means the semantic layer becomes the place where caching starts to make sense, as the single place all requests go through. Once all your systems go through the universal semantic layer, you can do caching there. Cube specifically has a few different implementations and caching strategies that can help with that. The idea is that because you query through the semantic layer, it's an ideal place to cache. There are definitely a lot of opportunities to improve performance here. The same idea applies to access control. Access control also tends not to be centralized; it lives in the different business intelligence and visualization tools. But if you have a single place that you query your data through, it's an opportunity to centralize access control. Those are two additional benefits that a semantic layer architecture can provide as well.

[0:14:47] LA: Okay. You mentioned universal governance, which I know is tied to access control. When I think of universal governance, I think of a lot more than that, too.
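To make the "define it once" idea concrete, here is a hedged sketch in the style of Cube's JavaScript data model (the `cube` and `CUBE` globals come from Cube's model files; exact property names are worth verifying against Cube's documentation). The average order value lives in exactly one place, and the pre-aggregation hints at the caching benefit Artyom mentions.

```typescript
// Sketch of a Cube data model file (Cube models are typically JavaScript
// or YAML; `cube` and `CUBE` are globals Cube provides in model files).
cube(`Orders`, {
  sql: `SELECT * FROM public.orders`, // assumption: orders landed in Snowflake

  measures: {
    count: { type: `count` },
    totalAmount: { sql: `amount`, type: `sum` },
    // The one shared definition of "average order value" for every tool.
    averageOrderValue: {
      sql: `amount`,
      type: `avg`,
      description: `Mean order amount across orders`,
    },
  },

  dimensions: {
    id: { sql: `id`, type: `number`, primaryKey: true },
    status: { sql: `status`, type: `string` },
    createdAt: { sql: `created_at`, type: `time` },
  },

  // Caching: a rollup Cube can maintain so repeated queries skip the warehouse.
  preAggregations: {
    ordersByDay: {
      measures: [CUBE.count, CUBE.totalAmount],
      timeDimension: CUBE.createdAt,
      granularity: `day`,
    },
  },
});
```

Any tool asking Cube for `averageOrderValue` then gets SQL generated from this single definition, rather than its own local copy.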
I'm assuming there are other benefits, like regulatory compliance and things like that. Do you want to talk about that a little bit?

[0:15:05] AK: Yeah. Once you have all the metric definitions and dimensions in place in the semantic layer, you can classify them and say, "This is PII data. This is data under that compliance regime." You can also manage the owners of the data. You have different groups and teams responsible for making sure the data is up to date. You can apply the whole set of governance features that you would expect. I think the difference between a semantic-layer-first architecture and a more classical governance architecture is that the semantic layer is active, meaning you make all your queries directly through the semantic layer. In a traditional governance approach, the governance system usually sits on top of your stack and a little bit on the side, meaning your Tableau queries Snowflake directly, but you have a governance platform that describes the data and talks about the data. There's no strong opportunity to enforce anything, because it's on the side where –

[0:16:07] LA: You can't filter when you're not in line. Yeah.

[0:16:10] AK: Exactly. Exactly. Yeah.

[0:16:12] LA: Okay. Governance is a useful case for this as well. Governance changes; a lot of governance rules change on a regular basis. Having to go through and change the rules in multiple tools can be problematic, and error-prone for that matter. This allows you to change them in a single location. Am I characterizing that correctly?

[0:16:34] AK: Yeah, exactly. I think you're spot on. It's the same problem as with the data model, right? You sprinkle all these definitions across all these tools, and then it's not centralized, it's not DRY. The same idea applies to the data model and to governance. In fact, I believe governance should be closely connected to the data model, because it's all about questions like: what kind of metric is this? Which people can access this metric, or that specific dimension? Is it a PII dimension, and so on, so forth?

[0:17:06] LA: Yeah. In some ways, governance is just part of the metadata associated with the data, and you need to tie it to the data. That's usually not the way it's done, but a layer like this can certainly enforce those sorts of restrictions. In the example that we gave, the e-commerce example, we were collecting data from multiple sources and essentially ETLing it all into a single data warehouse, then putting the semantic layer on top of that. There actually are scenarios where there isn't a universal, single data warehouse. It's data from multiple sources that is disjoint and isn't uniform. First of all, can you deal with that? I'm assuming you can with Cube. But do you recommend centralizing your data before you put a semantic layer on top, or are there advantages to leaving the data decentralized and putting a semantic layer on top?

[0:17:59] AK: Yeah. Great question. A semantic layer, and Cube specifically, can work both on centralized data and on data located in different places, in different silos. The way Cube works, it can dynamically connect to different data warehouses based on the data model definitions. When you design your data model, you can say, "Oh, this part of the data comes from Snowflake. That part of the data comes from BigQuery." That's how we access the data, and that's how we join between data sources if needed.
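Two hedged sketches of the points above, again in the style of Cube's model and configuration files (property and hook names as I understand Cube's docs; treat them as assumptions to verify): per-cube data sources, and access control enforced in line because every query passes through the layer.

```typescript
// Per-cube data sources: the model, not the consumer, knows where data lives.
cube(`Orders`, {
  sql: `SELECT * FROM public.orders`,
  dataSource: `snowflake`, // assumption: a configured connection name
});

cube(`AdSpend`, {
  sql: `SELECT * FROM ads.spend`,
  dataSource: `bigquery`,
});

// Centralized, active access control (cube.js configuration): because all
// queries flow through Cube, one hook can enforce row-level rules in line.
module.exports = {
  queryRewrite: (query, { securityContext }) => {
    // e.g., a partner only ever sees their own orders
    query.filters.push({
      member: `Orders.partnerId`, // hypothetical dimension
      operator: `equals`,
      values: [securityContext.partnerId],
    });
    return query;
  },
};
```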
I think there are different approaches to modeling data. Obviously, the cloud data warehouse way; then you have something like a lakehouse, which is, to some extent, very close to the data warehouse architecture. Then you have a full zero-ETL approach, where you just keep the data where it is and use some federation engine, like Trino, to access it. From a semantic layer standpoint, we fit into any of these. I don't think we position the semantic layer as a federation engine right now; we would rather rely on something like Trino if you really need federation. We can federate data, so we can join across multiple data sources. But if you need to go deep into more complicated use cases, where you need to push some compute down to the source, bring it back, and massage the data further, there are engines like Trino that are really good at that. At the end of the day, we can work with both: fully ETLed data, and non-ETLed data that needs to be federated. In terms of the advantages, I think there are a lot of advantages to having the data in a single place. At the same time, there's always the cost of moving data. If the cost of moving the data is really high, for a variety of reasons, then it's probably not worth it. But if there's an opportunity to centralize data in a single warehouse or a lakehouse architecture, that feels like the preferable solution.

[0:20:09] LA: That makes sense. One of the things that I see when we talk about zero-ETL data processing, where you have multiple sources and you leave the data in multiple sources, is that interpretation of the data can be an issue. As a simple example, you have data coming from, let's say, your Shopify order processing, simply using our example here. Engineering is looking at that data, marketing is looking at that data, finance is looking at that data. They all think it looks different than it really does, because they make assumptions. Every single one of them makes assumptions about what the data means that may or may not be true. Finance may assume every transaction ends up, I don't know, I'm drawing a blank on specific examples here. But what I'm trying to get at is, when we look at data from a given source, different consumers of that data can make different assumptions about what the data actually means or contains, assumptions that may or may not be true, and so you get different interpretations. Tell me how a semantic layer helps remove that problem.

[0:21:19] AK: Yeah. It's a good question. I feel like, at the end of the day, we're talking about trust in data, right? Do our data consumers trust the data? When they go and consume data in Tableau, do they trust it? I feel like the way to solve it is to provide as much context as possible to the data consumer, so they understand where the data is coming from, how it's being processed, and how it's being calculated. At the same time, the problem is that all those steps are very technical. We have a lot of pipelines, transformation code, and data modeling code, and we tell people, "Oh yeah, you can look at it. But you can only understand it if you can read code." I think the problem here is how to build the bridge that tells the data consumer how the data is calculated. Imagine an organization, say an e-commerce store, that has a data engineer, and marketing is looking at the data, or finance is looking at the data, and they don't trust the data.
Now, they come to the data engineer and ask, "Hey, how did you calculate that?" If the data engineer can explain in more detail how the calculation is done, that creates a higher level of trust in the data. The question is, is it possible to automate that? I feel like that's connected to governance, or maybe I should say, to a data catalog solution. That's something we think a lot about at Cube: because we are the place where all the data model definitions are located, is there an opportunity to surface that knowledge to the data consumer in a way that lets them learn about the data, what kind of data they have, its lineage, the definitions, and then take that knowledge, use the data, and trust the data?

[0:23:11] LA: That makes sense. One of the problems that I run into a lot when I do these sorts of data analyses is that you've got a definition of what you want to get out of your data, and how you get it changes the results. A simple example: I'm trying to get the average click-through rate for social-media-engaged sessions coming into my site. I made that up, or something like that. Well, what does the average for that mean? Well, let's look over 90 days. Okay, which 90 days? The last quarter, or the last 90 days, or the last 90 days minus today, or the last 90 days ending yesterday? You see what I'm saying: the definition of how I get that, for what seems like a very well-described piece of information, varies, and different people have different interpretations. What ends up happening is different people come up with radically different answers. In fact, how you collect the data can actually make the data more or less usable as well. Tell me how a semantic layer like this can help with that problem.

[0:24:19] AK: Yeah, yeah. That's a perfect example of the problem we're trying to solve with a semantic layer. Even a single definition can have a lot of variations, right? In your example, average click-through rate may mean different things to different people. If you let this definition sprinkle across the organization, located in different places, then you'll have a lot of different definitions. Everyone creates their own definition on the side, where no one is looking, and then uses it in some presentation that ends up at the board level, and the numbers don't match. The answer to that problem is that we need centralized governance of the metrics. People can have different metrics; that's totally fine. We just need a framework to develop them, document them, and share them with the rest of the organization. If someone came to me as a data engineer and said, "I want this average click-through rate," I would ask a lot of follow-up questions. How exactly do you want to calculate this? Looking at 90 days, are we talking about a rolling window? Are we talking about any specific filters, and all of that? Through that conversation, we would come up with a definition that we mutually agree on, the person who wants the metric and I as the data engineer. I can say, okay, we collect the data in a specific way that can support your calculation; we can give you that metric. Now, I go into my semantic layer and create that as code. Essentially, it's just a code base at the end of the day.
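Here is a hypothetical sketch of how that conversation might be captured as code, again in Cube's model style (measure properties such as `rollingWindow` and `filters` exist in Cube, but the exact shape here is illustrative): the agreed window, filter, and documentation live in one reviewed definition.

```typescript
// Hypothetical "metric as code": the agreed definition of a click-through
// rate, including its window, filter, and human-readable documentation.
cube(`Sessions`, {
  sql: `SELECT * FROM analytics.sessions`,

  measures: {
    socialClickThroughRate: {
      sql: `CASE WHEN impressions > 0 THEN clicks::float / impressions END`,
      type: `avg`,
      description: `Mean per-session CTR over a trailing 90-day window,
                    social-attributed sessions only. Definition agreed
                    between marketing and data engineering.`,
      rollingWindow: { trailing: `90 day` },
      filters: [{ sql: `${CUBE}.channel = 'social'` }],
    },
  },

  dimensions: {
    channel: { sql: `channel`, type: `string` },
    startedAt: { sql: `started_at`, type: `time` },
  },
});
```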
You put this definition in the code base, and now you have this metric. Now that metric is available in Tableau, and in other places. I do need to document it, of course. I need to write a really good description of how it's calculated. The next person who comes for that metric can read the definition and understand: okay, this is an average click-through rate, and this is exactly how it was calculated. Maybe they need another one, a different type of calculation. They come to me and say, "Hey, I need to make a change." That's fine, we can make the change, or we can create a second one. I think the potential challenge with that solution, and that's the state of today, that's how our users use Cube and leverage the whole stack, is that it creates a bit of work for the data engineer to make sure that everything is defined and documented. That's where I believe AI can help us. We haven't chatted about AI yet, but I felt like we should, right? Just like this –

[0:27:09] LA: We will. Yeah.

[0:27:10] AK: Yeah. Right. I think the amazing thing we can do, as long as we keep everything as a code base, is lean on the fact that AI is really good at writing code. I think the best use case for modern-day AI is generating code and generating text descriptions. Essentially, if you need a new metric, maybe you don't need to go to the data engineer. You go to the AI, and the AI can take that metric request, generate the code for the metric definition, and send a pull request. Then the data engineer only needs to review it.

[0:27:45] LA: To your point about AI, I think it's illegal in today's society not to talk about AI when you're talking about data. So, yeah, I agree with that. I hear what you're saying, and I mostly, but not completely, agree with you. It makes a lot of sense, but I think there's one problem you can still run into. Let's keep to the data engineer example, and then we'll extend it with AI in a second. When someone comes to you and wants a definition, and you create that definition of some piece of data and give it to them, ideally, you'd want everyone else to use that same definition. Now, you mentioned someone else is going to come along who wants something slightly different, so you create a second copy of the data definition with a slight variation, and then a third copy with a slight variation, and a fourth copy with a slight variation. Sooner or later, you still have a semantic layer, but rather than having 20 definitions scattered throughout your organization, you have 20 definitions side by side in the semantic layer, and they're all different. How do you avoid that problem, and doesn't AI actually make it worse by making it easier to create new ones?

[0:28:54] AK: Yeah. That's a good question. I still believe that's a better state of things than having those 20 definitions hidden, with no understanding of whether they're being used at all. Once we have the 20 definitions in a central place, we might learn that maybe 15 of them are really legacy at this point. They should be deprecated, because they're not being used, and we can centrally govern them and remove them from the stack entirely. Once we have that central place, we can see the lineage, and we can see, oh, there are literally no charts powered by these definitions anymore.
Or maybe there really are still charts, but people don't use them; let's go and deprecate those charts. It's a place that helps us control that and evolve. I think we need to accept that the definitions are going to change. We just need to build a framework for how we support that change.

[0:29:51] LA: Got it. The simple fact that they're located in one place makes change easier, and not just change; you can also consolidate a lot more easily, because you know which ones are being used and who's using them. You could potentially, and this might also be a good use for AI, say: you've got 30 definitions with minor differences between them, and 20 of them are being used. Can we consolidate that down to five by making certain changes in the definitions that might be acceptable to the consumers? Perhaps AI can actually help make those recommendations, and you can move the usage toward a better definition that's also more uniform. But you can't do that if they're scattered throughout the entire code base. You can only do that if they're all known, all centralized, and understood by one system, or one entity. Is that a fair statement?

[0:30:50] AK: Yeah, exactly. I think you're spot on. Also, it's all a single code base with a single framework. You can refactor it. You can think about how to make these definitions more efficient. We can use AI, as you mentioned, to help us spot the similarities and the differences; it can help refactor. I feel like having it as just a code base, in a central place, gives a lot of downstream benefits.

[0:31:17] LA: If I were to describe two problems, which do you think is the bigger problem with data modeling today? And then the follow-on question is, how can what you're doing help with it? Is the biggest problem with data modeling a lack of cohesive modeling, where data is hard to understand and some tools don't know what the data is or how to make use of it, like, for instance, AI? You can't just throw an AI algorithm at data without any understanding of that data. Or is it that the model's meaning is lost or misunderstood because it's not well documented, and hence the data is misused or misunderstood by some tools? Which do you think is the bigger problem?

[0:32:02] AK: I think, in general, it's data modeling as a concept. Maybe it's easier for data engineers, data people who have been doing it all their lives, for 10, 15, or more years. But when we bring it to the data consumers: what's a measure, what's a dimension? All the BI and intelligence tools show you these multidimensional concepts, but it's still sometimes hard to get an idea of what they are. A lot of people talk about metrics, but really, I don't think we even have a universal definition of a metric. What is a metric? Is a metric a measure? Or is a metric a measure with a time dimension? If you add a filter to that, is it still a metric? I feel like there's a lot of gray area around data modeling, especially in connecting the business concepts, the business users, and the data consumers with the data engineers. On the data engineering and data modeling side, it's a little more determined. We have a lot of well-known approaches, like Kimball, like Data Vault, all of that stuff. It's a little more structured. Where the complexity comes in is in how we translate that structure, which is inherently very complex, to the data consumers, when all they ask you for is a metric.
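One way to pin down the vocabulary Artyom is circling, purely illustrative and not a standard or Cube's definition, is to write the candidate definition as types: a metric as a measure, plus a time dimension and grain, plus optional filters.

```typescript
// Purely illustrative types: one candidate answer to "what is a metric?"
type Measure = { name: string; agg: "sum" | "avg" | "count"; sql: string };
type Dimension = { name: string; type: "string" | "number" | "time" };
type Filter = { dimension: string; operator: "equals"; values: string[] };

// A metric = a measure, a time dimension with a grain, optional filters.
type Metric = {
  measure: Measure;
  timeDimension: Dimension;
  granularity: "day" | "week" | "month";
  filters?: Filter[];
};

// Example: "average click-through rate, daily, social channel only".
const ctr: Metric = {
  measure: { name: "avgClickThroughRate", agg: "avg", sql: "clicks / impressions" },
  timeDimension: { name: "startedAt", type: "time" },
  granularity: "day",
  filters: [{ dimension: "channel", operator: "equals", values: ["social"] }],
};
```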
We then try to explain, oh, but a metric is less tangible, right? Let's think about measures and dimensions and all of this. I think that's a hard thing. Now, how can we make it easier? Documentation, just in general keeping everything up to date and documented. It's not a hard problem from a brainpower perspective, but it's a lot of manual, mundane work that no one wants to do. That's an example of where AI can actually help us. I think it's a big problem, but it's something we really need to automate, and we've never had really good tools to automate it. Then I think we can go and try to solve these many, many problems, which at the end of the day will help bring the non-data folks closer to the data, to better understand the data, because things like better documentation can definitely help them work with the data in a better way.

[0:34:24] LA: Mostly, what we've been talking about with AI here is AI as a tool to analyze the data model, the data structure, and the data itself, essentially automating a lot of the things that a data engineer does with the data, and helping to increase the usefulness of the data through better documentation and better understanding of the data you have available. That's great. What about the customer that's looking to apply AI, a large language model, to analyze their data, to help create customer-useful information based on that data? In other words, building with large language models that need to understand your data. How does the semantic data model help that use of AI?

[0:35:14] AK: I think the way modern AI, transformers, LLMs, work with data is through code generation. Essentially, because of the architecture they're built on, that's probably the best way to do it. We can't really think about uploading a lot of data right into the AI, into its context, because the context is limited by the architecture. The best way for the system to do analysis is to break the complex task down into multiple sub-tasks, execute code snippets to analyze the data, draw some conclusion from that, maybe generate another code snippet, and then arrive at the final answer. That's how, for example, ChatGPT's data analysis works, right? If you upload an Excel file and say, "Hey, run some analysis on top of it," it will just generate a Python script, execute it, and give you the answer back. Now, I think that means many, many AI agents will need to access data in cloud data warehouses, lakehouses, and all these places. The question is how they'll be able to do that, and the answer is by generating and executing SQL. These systems will generate a lot of SQL. Today, a lot of SQL is either written by humans or generated by business intelligence tools; in the next few years, we'll see a lot of SQL generated by AI agents. Now, the question is how we can help them generate that SQL, because they don't really know anything about your data. They don't have context. How do they know what columns exist? The simplest approach would be, let's just take the DDL of your database and give it to the AI agent as context. People have tried that, and there have been research papers and benchmarks comparing different approaches; it usually does not give you really strong accuracy, because the columns are cryptic, and you don't understand the relationships between entities.
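To make that contrast concrete, here is a hypothetical sketch of the two kinds of context a text-to-SQL agent might receive; the strings are invented for illustration, and the next exchange explains why the second tends to work better.

```typescript
// Hypothetical contrast: what a text-to-SQL agent sees with raw DDL
// versus with semantic-layer context. Both strings are invented.

// Raw DDL: cryptic columns, no relationships, no business meaning.
const ddlContext = `
  CREATE TABLE ord_hdr (oh_id INT, cst_fk INT, amt NUMERIC, sts VARCHAR);
  CREATE TABLE cst (cst_id INT, seg VARCHAR);`;

// Semantic context: named measures, dimensions, and joins.
const semanticContext = `
  Orders: measure averageOrderValue = AVG(amount); dimension status;
          time dimension createdAt; joins Customers via customer_id.
  Customers: dimension segment.`;

const question = `What was average order value by customer segment last month?`;
const buildPrompt = (context: string) =>
  `${context}\n\nGenerate SQL to answer: ${question}`;

// Same question, two prompts; the richer context yields far better SQL.
console.log(buildPrompt(ddlContext));
console.log(buildPrompt(semanticContext));
```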
The way to improve that is to give context about the data. What dimensions do you have here, what measures, what columns are in the data, what are the relationships between the different entities? Essentially, give semantics to the AI agent. If you package that as a really good context, attach it to your prompt, and say, now generate SQL, then the AI agent will obviously generate much better SQL, with very high accuracy. That's, I think, how semantic layers can help in that architecture. They can be the provider of that context.

[0:38:01] LA: Yeah, that makes a lot of sense. Again, it all comes down to data understanding, right? The AI has to understand your data, just like the humans doing the analysis do. We've been talking a lot about the BI use cases for data. For the most part, those are batch analyses. Not always, but they're not real-time analyses for the most part. Most of the types of analysis we've been talking about so far happen after the fact. What about real-time data analysis? How useful is a semantic layer there? Does a semantic layer like Cube help, or does it just delay the processing to the point that it's unrealistic for real-time analysis? How does it work in the scope of real-time analysis?

[0:38:52] AK: Yeah. A few things here. First, I think true real-time is needed very rarely. There are use cases where we need true real-time, streaming real-time, but in my experience with data, it's a very rare case. In many use cases, we don't need real real-time. The other thing is, streaming-level real-time is extremely expensive. If an organization or a team decides, oh, we need streaming-level real-time, they need to be ready to pay for it, because all the technologies that help you process streaming data are extremely expensive. Your stack is going to cost a lot. It's also really hard. There isn't a single solution. There's no Snowflake for streaming data where you can just stream everything into the warehouse and then write a lot of SQL queries. People try to do that, and there are a lot of great companies and technologies that have started to address the problem. Probably the oldest was ksqlDB. I don't think it's very active anymore, but it was an attempt to bring SQL to streaming. Then there are newer ones, like Materialize, with interesting ideas around whether we can build a Snowflake-level experience on top of streaming data. It's still hard; everything is still in progress. I'm somewhat bullish that this technology will make our lives easier, but it's hard. Now, how does it all connect to the semantic layer? Cube specifically is built to work on top of a SQL back-end. If there's a way to run SQL on top of the streaming data, for example with Materialize, or with ksqlDB, you can potentially put Cube on top of that, and that's going to work. Cube is not designed to work directly on top of something like Kafka; you still need a SQL back-end. The streaming architectures we see within our community, among our users, are all very complicated.

[0:40:52] LA: Got it. Are you a SaaS service, or are you a standalone application? How are you structured, and how do people engage with you?

[0:40:59] AK: Great question. We have a cloud offering, where we have a shared cloud.
Essentially, it's one VPC in a specific region, on a specific cloud that we support, and our customers share that VPC. It's a multi-tenant architecture. Then we have a dedicated offering, where we spin up a dedicated VPC in a specific region the customer selects, on a specific platform, and run everything in that VPC. Finally, we have what we call bring-your-own-cloud, where we bring everything inside the customer's cloud. It really depends. As you can imagine, the first option is more for SMBs and the mid-market, and as we go to larger enterprises with more compliance and more regulation, it's the more involved deployment options.

[0:41:48] LA: That makes sense. Who are your competitors?

[0:41:52] AK: Google bought a company called Looker about three, four years ago. Looker was essentially a business intelligence tool. They bought Looker to sell more BigQuery. They also wanted to eventually make Looker a more headless, universal semantic layer, because Looker has a really strong semantic layer, LookML, in it. They wanted to take advantage of that: what if that semantic layer could be used not only for the Looker UI, but across all the other tools? It's still TBD, to be honest. It hasn't happened yet. We know that modern Google is not doing well in terms of acquisitions. It's not yet clear when it's going to materialize as a competitor, or if it's going to happen at all. There's also a company called AtScale. They've been around a little longer than Cube, solving the same problem, and some of their concepts are the same. I think the difference between Cube and AtScale is that we are very code-first; we approach this problem with more of an engineering philosophy and engineering rigor, where it's a code base you can put under version control, with everything that enables. AtScale is more of a visual builder, more traditional; think of a BusinessObjects universe style of experience.

[0:43:18] LA: Right. Right. It sounds like it's still a pretty young space, though. Is that a fair statement?

[0:43:23] AK: It's a very young space. Yeah. I think it's a very new category that's developing very fast, and I think we'll see more and more adoption in the next several years, but it's still a very young space.

[0:43:37] LA: Right. Artyom is the CEO of Cube, a data modeling company focused on data semantics. Artyom, thank you for joining me today on Software Engineering Daily.

[0:43:47] AK: Thank you for having me. It was a great conversation.

[END]