[00:00:00] JB: Companies have many things they want to do with customer data beyond sales and outbound marketing. To do that, they need to expose customer data so that data analysts and scientists can build more sophisticated use cases such as marketing attribution, churn prediction, and product recommendations. Making all that customer data work for the entire enterprise requires an entirely new architecture that syncs customer data between the CDP and the cloud warehouse without creating a whole bunch more problems than you had before. Today, we have Soumyadeb Mitra here. He is the founder of the Kleiner Perkins and Insight Partners-backed RudderStack, and he joins us to talk a little bit about the right way to activate all of that data. He is an industry veteran. We're super excited to have him. He's been on the show before, I have learned. And he has a ton of experience in the data, ML, and analytics space, in addition to holding a PhD in computer science from the University of Illinois Urbana-Champaign. Thank you so much for joining us today. [00:00:59] SM: Thanks, Jocelyn, for inviting me to the show. I'm super excited to be back. [00:01:03] JB: Yes. A lot of things have been going on with RudderStack. We've got some exciting new approaches and thought work that we want to hear about from you. But before we get into that, maybe you can just give us a little background on yourself and RudderStack for those who may not be familiar with it. [00:01:21] SM: Yeah, so I'll start with myself. I am Soumyadeb. People call me Soumya. I'll go in reverse chronological order. I've been doing RudderStack for the last four years. I started in June 2019, so this is almost our fourth year. Before RudderStack, I spent a year at a company called 8x8, a public telecom company, where I was trying to set up a very similar stack of collecting customer data, doing things with that data, and activating that data back. The challenges that I ran into in that process prompted me to start RudderStack. Before 8x8, I was the co-founder and CTO of a company called Mariana IQ. At a very high level, we were using machine learning and deep learning to automate marketing. The idea was that a lot of marketing actions are very manual, and you could automate them with something like a ChatGPT-style tech. Of course, we were early, both on the tech but also on the market. But I learned a lot of things during that process. The primary thing was that most customers do not have clean data, and if you don't have data, then you really can't do much ML and so on. So, it was a good learning experience. I ended up selling that company to 8x8, tried to build a very similar stack there, ran into similar challenges, and that's why I started RudderStack. Before that, I started my career in a company called Data Domain. If you know Frank Slootman, the CEO of Snowflake, that was his first company as a CEO. He recruited me into Data Domain, convinced me after my PhD that I should join that instead of Google. I had an amazing run, a great learning experience. He was an amazing leader. But as I said, I have a PhD in data, and I've been working in data pretty much all my life. Finally, I hope with RudderStack we can solve the customer data problem, because that problem is big. Quickly about RudderStack: we are building what is called a warehouse-native CDP. I'm sure we'll talk more about that. But that's RudderStack in one line. [00:03:38] JB: Yes.
Let's break that apart from a terminology perspective and use it as a way to talk about some of the problem statements you're addressing, because it's an interesting way to talk about what you're doing. Warehouse-native CDP. Why did you choose that as your tagline? Why does that matter to customers? [00:03:57] SM: Yes. Warehouse-native, and CDP is a customer data platform, right? So, for folks who don't know what a CDP is, maybe we can start there, and then we can come back to warehouse-native. Even if you split up CDP, the core is customer data. What exactly is customer data? If you look at any brand, take [inaudible 00:04:25], right? They have customers interacting with that brand on multiple channels. You have people coming to the website, browsing things, ordering things there. People are using the mobile app, going to the store, making transactions, et cetera, et cetera. You have all these different channels through which customers are interacting with the brand, and you want to collect all that data. Why is that important? Because you want to know more about your customers, give them a personalized experience, and do all the interesting things you can do if you know your customers better. For that, you need to collect customer data. That's all this is. That's the definition of customer data. [00:05:10] JB: Right. Let me ask you. The problem is – and let me guess, I did some research on some other things you've talked about, but this is also from my own experience. You've probably got multiple – you have multiple lines of business. You have multiple contexts in which this customer data is collected. And what you end up with is little warrens of data that describe your customer when they are swiping their credit card. And that same customer is described somewhere else in servicing that credit card down the line, and you might not have a clear picture of how those personas, which are really just you, the one person, fit together. That's one problem that a CDP is trying to address, right? [00:05:50] SM: Yes, 100%. Actually, that is the whole pitch of a CDP. We didn't start the market, right? CDPs have been around for almost 10, 15 years. They came into existence because the folks who wanted to use this data – the marketing folks mostly, but sometimes other business units also – found it very hard to get access to this customer data. They had to rely on IT to put up the infrastructure to collect all that data. This was the early 2010s. You had to really set up a Hadoop cluster. You're talking about data at scale if you're collecting first-party interactions from the website and mobile app. You had to set up a Hadoop cluster and hire a team of data engineers to write all that code and all that stuff. [00:06:35] JB: And then you've got a freshness problem, right? You've got a freshness problem, because when it's getting collected, it's fresh, but it's messy. [00:06:40] SM: Yes, exactly. [00:06:42] JB: And that goes through this long pipeline process of getting it cleaned up for the analysts. By the time they get it, it's not what they wanted. [00:06:50] SM: Exactly. Then, anytime you need something new, you want to collect a new event, you have to go back to IT. By the time it's done, your need for the data has gone away. Exactly as you pointed out. Marketing said, "Oh, IT is too slow.
I need to get away from them." They went ahead and bought these other SaaS solutions. There are a lot of them – Segment, one of the first players in the space, which got acquired by Twilio. And then there have been a bunch of startups who have built companies in this space, primarily selling to marketing: that way you don't have to talk to IT, we will manage everything, we'll collect the data, we'll do it on our SaaS system. That went a long way. They have solved interesting problems. Now, that also has its own set of challenges, right? These cloud SaaS black boxes are good at collecting data where they can. But if you look at a large enterprise, there is a lot of data that does not flow through the CDP. It's a vendor that the marketing team buys, right? There's only so much data that can flow into them. There is data in your transactional systems and your back-end systems, and marketing cannot get access to that through a SaaS vendor. So, you always have this challenge of the customer data platform having a limited view of the customer. They would get data from properties that marketing owns, like the website maybe. But they will not get data from the transactional systems, the back-end systems. We had the same challenge at 8x8, right? There was incomplete data about customers in our CDP. [00:08:32] JB: So, it's the incompleteness that has traditionally been the problem. You don't get to leverage all the rich data that you have about your customers, and I imagine that's like 10x the problem in the world of training machine learning models, because you want as much data as possible to train, right? So, you've got an even bigger problem than just having unhappy analysts. It's just not going to work. [00:08:54] SM: Hundred percent. In fact, that was the root problem in my previous role when I was trying to build a churn model. Churn is a very simple use case, but a very important use case for anything that is subscription. Telecom providers are one. If you think somebody's going to churn, it is better to give out the phone for free for six months and try to address their concern, because the main cost is the cost of acquiring the customer. But it's still important to find those likely-to-churn customers. You cannot give it out free for everyone. Now, to build this churn model, you need app activity data. If you see that a user's activity in the app has gone down over time, that means they're probably likely to churn. But then we found that is also a very late predictor. By the time their usage has gone down, they've probably already made up their mind. So, we found that ticketing data, like support ticket data, is a very good leading predictor. If you know that somebody has opened a lot of support tickets and the sentiment is very bad, then that data is actually a very good leading indicator of churn. So, to build a churn model – the model itself is very simple, you can use any standard machine learning algorithm – you need this completeness of data. You need the app activity data and you need the ticketing data, you bring them together, join them, create the customer profile, and then you can train a very simple ML model. The hard part is to get this data together. That's one. Your traditional CDP will maybe have the app activity data, but it will not have your ticket data.
Your ticket data might be sitting somewhere in your back-end data warehouse and so on. How do you bring both of them together? Then, how do you train machine learning algorithms on top? That is the hard problem. Not so much the ML part. [00:10:48] JB: Right. That is the hard problem. As you're talking about all these use cases, I'm hearing streaming data. I'm hearing structured data, on-prem, in the cloud. I hear all these examples. Yet, you're using this term, warehouse-native CDP. Can you help me understand that? [00:11:04] SM: Yes. That is our pitch, right? You had those data silos where your first-party app data is sitting in a cloud CDP, while, let's say, ticket data and a lot of back-end transactional data you're pulling into your data warehouse, like Snowflake, BigQuery, and Redshift, to do analytics. Why create these two separate data silos? Earlier, you had to do it because of your data warehouses. If you go back to 2010, there was no cloud data warehouse. You had to buy your Teradata, an on-prem data warehouse, where you could not collect all the streaming data. That was much higher volume and extremely costly, so you could not dump it there. You had to store it separately in a Hadoop cluster and so on, while your transactional data went to a data warehouse, like a Teradata of the world. So, you had to keep them separate. But in the last four or five years, with the innovation in cloud data warehouses, it has finally become possible to bring all of this data together into one single place. You can literally swipe your credit card and get a one-terabyte instance of Snowflake, and store all that data, and process that data. That has been a big shift in this ecosystem. Cloud data warehouses have enabled collection of high-volume streaming data and ETL data into one single platform, right? So, what you need is a CDP – the traditional CDPs did the data collection, data unification, everything. If you could just run that stack on top of a cloud data warehouse, instead of it being another data silo, you could solve a lot of these problems: bringing all the data together, the data completeness problem, the analytics and ML problems you could not do on the traditional CDPs. Now you can do all of that on top of a cloud data warehouse. That's what we are doing with RudderStack, in a nutshell. [00:12:53] JB: Yes, CDP got a little bit dusty, right? People were not as interested because of all these problems, and what you're saying is that cloud architectures really opened up a new way of thinking about it. You also have – let's talk a little bit – let's see if we can talk about two things at the same time. But you have this notion of collection, unification, activation. Can we just kind of talk through that? Because I think those are really great pillars for understanding RudderStack. But I also want to talk about some architectural decisions underneath those as we go along. [00:13:24] SM: If we look at this stack end to end – and this is why I founded RudderStack – you're trying to build this use case of, again, let's go back to the churn model, and other things would be similar. To build that churn model, you want to get your first-party app data, like your streaming app data, and then you also want to get your ticketing data through an ETL kind of pipeline. So, you have to run these two pipelines to get data into your data warehouse. That is the collection piece of it, right?
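To picture the collection step just described, here is a minimal, illustrative Python sketch of the two pipelines: streaming a first-party event to a data-plane endpoint and pulling ticket data for ETL into the warehouse. The endpoint URL, header, write key, and table names are hypothetical placeholders for illustration, not RudderStack's actual API.

```python
import json
import datetime
import urllib.request

# --- Pipeline 1: stream a first-party event to a data plane (hypothetical endpoint) ---
DATA_PLANE_URL = "https://dataplane.example.com/v1/track"  # placeholder, not a real endpoint
WRITE_KEY = "YOUR_WRITE_KEY"                               # placeholder credential

event = {
    "anonymousId": "cookie-1234",   # pre-login identity from the website or app
    "userId": None,                 # filled in once the user identifies themselves
    "event": "Product Viewed",
    "properties": {"category": "furniture", "sku": "SOFA-42"},
    "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
}

req = urllib.request.Request(
    DATA_PLANE_URL,
    data=json.dumps(event).encode("utf-8"),
    headers={"Content-Type": "application/json", "X-Write-Key": WRITE_KEY},
)
# urllib.request.urlopen(req)  # uncomment to actually send the event

# --- Pipeline 2: ETL support tickets into the warehouse (schematic) ---
def pull_tickets(since: str) -> list[dict]:
    """Pretend call to a ticketing system's API; returns rows to load into the warehouse."""
    return [{"user_email": "jane@example.com", "opened_at": since, "sentiment": -0.7}]

rows = pull_tickets(since="2023-05-01")
# In practice these rows would be bulk-loaded into a warehouse table such as `support_tickets`.
print(f"Would load {len(rows)} ticket rows into the warehouse")
```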
The second step is you have to stitch all this data together into this golden customer record. Finally, to feed any analytics or ML, you need this one row per customer with a bunch of features for that customer. The features could be: the total number of tickets they have opened in the last seven days, their average sentiment score, how many times they have logged into the app, how many phone calls they have made, how many transactions they have done. These are all the features you want to compute for a user by combining these two kinds of data, the streaming data and the ETL data, right? This is the final output that you want to get. There are a lot of challenges in there – we can get into the details – but for now let's say that that is the unification step. You want to combine all of this to create this golden customer record. Once you do that, then you want to train an ML algorithm, which, by the way, I think is the simplest part of this puzzle. You don't need a ChatGPT-style deep learning thing to build a churn model. Once you have these features, a very simple gradient [inaudible 00:14:58] would work well for this use case. Once you have, let's say, a churn score, you want to take that and push it back into the business systems. Finally, your support team has to give out the six-months-off coupon, or your marketing team has to run that campaign. You have to put that score back into a tool like Gainsight, if support is using that, or something similar, for them to act on those predictions. This is the activation step – finally getting the data out of the warehouse to a destination. We call it collection, unification, activation. People can call it other things – some vendors have different names for them. But this is the pipeline that you need to build use cases like churn, and everything else kind of fits the same model. [00:15:45] JB: Yes, interesting. That's really helpful. Taking a look at some of that, I want to talk a little bit about some of the architectural decisions, or ways of being, that you've decided on. Because I do think it's interesting. You wanted to kind of unbundle all of these silos, for lack of a better word – it's not just a silo, it's more than that. You've got all these different systems, and then all these different ways of thinking about the same data. You have decided to attack that by using a control plane and a data plane. Do you want to talk a little bit about the decisions you've made there and how RudderStack works in the data plane? [00:16:25] SM: If you go back to this stack, you have the collection piece, which has both streaming data and some ETL data – you have to pull ticket data through ETL. You have the collect piece. The unification piece runs on top of the data warehouse, which does all the stitching and all that stuff. And then activation is taking data back from the warehouse into the destination, right? This is the end-to-end stack. At the core of it, you need some service to move data. Whether it's streaming from cloud to the warehouse, or streaming app data to the warehouse, or warehouse to the cloud, all of these are data pipelines. Some source, some destination, you have to just move data. That's what we call the data plane. It is important that your data plane is closer to where a lot of the data is. If you are, let's say, using Snowflake, everything is running on top of your data warehouse.
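Before the conversation turns to the data plane, here is a minimal Python sketch of the unification-to-activation flow just laid out. It assumes a hypothetical, already-unified `customer_features` table (one row per customer, as described), trains a simple gradient boosting model for churn, and emits scores that could be synced back to a tool like Gainsight. The column names, toy data, and sync step are illustrative assumptions, not RudderStack's actual product.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical "golden customer record": one row per customer with unified features.
customer_features = pd.DataFrame({
    "user_id":              ["u1", "u2", "u3", "u4"],
    "tickets_last_7d":      [5,    0,    2,    7],    # from the ETL'd ticketing data
    "avg_ticket_sentiment": [-0.8, 0.0,  0.1, -0.6],  # negative = unhappy
    "app_logins_last_30d":  [1,    25,   14,   2],    # from the streaming event data
    "churned":              [1,    0,    0,    1],    # historical label for training
})

feature_cols = ["tickets_last_7d", "avg_ticket_sentiment", "app_logins_last_30d"]

# The ML part really is the simple part: a small gradient boosting classifier.
model = GradientBoostingClassifier()
model.fit(customer_features[feature_cols], customer_features["churned"])

# Score current customers and hand the result to the activation step.
customer_features["churn_score"] = model.predict_proba(customer_features[feature_cols])[:, 1]
to_activate = customer_features[customer_features["churn_score"] > 0.5][["user_id", "churn_score"]]

# Activation (schematic): push high-risk users back to a business tool, e.g. via reverse ETL.
for row in to_activate.itertuples(index=False):
    print(f"sync to support/marketing tool: user={row.user_id} churn_score={row.churn_score:.2f}")
```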
You're pulling the data into the warehouse, you're pushing it out from the warehouse. So, if your Snowflake is in an AWS environment, then you want the data plane to run in an AWS environment. Maybe in the same VPC, maybe not, depending on the use case. But at least you want them to run in the same network, because you're moving a lot of data. Similarly, if you're using GCP for your data warehouse, then you want this data plane to run in GCP. Eventually, if you have a private cloud or an on-prem data warehouse, you would want this to run in the private cloud. We don't support that yet, but in theory you could do that. That's what we call the data plane, which is the part that moves the data. [00:18:08] JB: The reason I asked about that is because you have a separate gateway, and then you have this data plane world – and I might have gotten some of this information wrong, so call me on it – but it's designed to politely degrade if there are problems on the data plane and serving side. I thought that was a really great feature for truly large enterprises. Because when I hear everything you do, I get a little scared. As an enterprise software data person, it feels like these actions could take a long time. They could burn a lot of memory. What happens when it fails? Help me understand how you thought through the problems of very large real-time use cases. [00:18:49] SM: Yes. I think there are so many levels at which we can talk about it. Real-time data is hard. I mean, we are talking about real time and high volume. [00:19:00] JB: I'm just interested in scale and performance. Those are huge problems in what you're trying to do. [00:19:04] SM: Yes. And frankly, we have customers who have been sending hundreds of billions of events a month, with strong latency requirements. You cannot lose events, that's one thing. Plus, you have to deliver events in under three seconds, five seconds of latency. You have to operate at those scales. That's all handled by our data plane. Now, the way we do that is, even the data plane is split into multiple components. There is the gateway component, which is responsible for ingesting all the data and dumping it into some kind of a queueing system. Think of it as Kafka. We don't use Kafka – we have built a homegrown queuing engine – but it's a messaging bus like Kafka. [00:19:52] JB: I think of that as like the waiting room, or quarantine, before. [00:19:55] SM: Yes, exactly. That's a good example. It's like a waiting room in a hospital. Events come in, and then they get into a waiting room, right? The only thing you want to guarantee is that the gateway never goes down. Of course, we are a startup; once in a while we have bugs and we go down, but it is very, very rare. We have kept very high availability on the gateway, and you have to handle things like: if you have to do deployments, how do you roll deployments without going down? You have failovers. Then, we have to handle things like, how do you scale up suddenly? Sometimes you have customers who have spikes in their event volume – how do you scale up, scale down? These are, again, hard engineering problems, but not unsolved problems. A lot of companies have solved this. We have built on Kubernetes. But at a high level, you need this highly available gateway layer, which puts events into that waiting room. Then, we have consumers of that data, right?
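As a rough mental model of the gateway-plus-queue-plus-consumer pattern being described here (including the transformation and retry behavior covered just below), here is a toy Python sketch. The in-memory queue, backoff numbers, and destination call are stand-ins for illustration only; the real data plane, as Soumyadeb notes, uses a homegrown durable queue and runs on Kubernetes.

```python
import queue
import time

# Stand-in for the durable "waiting room" the gateway writes to (Kafka-like in spirit).
event_queue: "queue.Queue[dict]" = queue.Queue()

def gateway_ingest(event: dict) -> None:
    """Gateway's only job: accept the event and get it into the queue, never lose it."""
    event_queue.put(event)

def transform(event: dict) -> dict:
    """User-defined transformation: fix or enrich the event before delivery (illustrative)."""
    event.setdefault("properties", {})["source"] = "mobile_app"
    return event

def deliver(event: dict) -> bool:
    """Pretend delivery to a warehouse or cloud destination; returns False on failure."""
    print("delivered:", event["event"])
    return True

def consumer_loop(max_attempts: int = 5) -> None:
    """Consumer side: at-least-once delivery with retries and exponential backoff."""
    while not event_queue.empty():
        event = transform(event_queue.get())
        for attempt in range(1, max_attempts + 1):
            if deliver(event):
                break  # success; at-least-once means a rare duplicate is possible, loss is not
            time.sleep(min(2 ** attempt, 30))  # back off before retrying the destination

gateway_ingest({"event": "Order Completed", "userId": "u1"})
consumer_loop()
```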
I mean, we have consumers who are now taking that event and deciding what to do with it. We have a notion of transformations. Customers can modify the event payload. That is also important when you want to fix an event from an app – you published an app, there was some error in the event, and you want to make changes before it goes to downstream destinations. We have that notion of a transformation that runs on the event. And then that event is eventually delivered to the destination. It could be a data warehouse, but it could also be a cloud destination. During that delivery, again, we have to handle failures, as you pointed out. The destination may be down, there may be other kinds of errors, network errors, and so on. We have to retry and make sure that it does get delivered. Nobody can guarantee exactly-once delivery, but our system does at-least-once, and we try to minimize duplicates. We have built all that logic, as the consumers consume events from that waiting room and do all this processing. This part of the system can also scale up and scale down. The great thing is that this whole thing can be run on your laptop. We are also an open source company, so this entire code is open source. You can run it on your laptop as a Docker image. But we can also run it in a Kubernetes cluster with hundreds of consumers and get the same results. [00:22:12] JB: I was just going to ask about deployment models. Yes, that's helpful. That helps me understand the underlying specialness of this type of architecture, because you really do need that for these multinational, massive data infrastructures. So, that tells us a little bit about what you've been doing. Tell us a little bit, if you're ready to share, about directionally where RudderStack is going. What's next? [00:22:41] SM: Yes. So, I think what we have built are the data pipelines. So far, RudderStack was a data pipeline company to move all the data. As I said, we can do event stream, we can do ETL, bring all the data together, and send it in real time to different destinations. We also have the reverse ETL pipeline to take data out of the warehouse and send it to destinations. So, we have all these data pipelines. That is the collect and activate steps, right? The unification step is what we had left to the customers. We said, okay, now we got all the data in, now you write all the SQL, Spark, whatever transformations to make that – [00:23:24] JB: Can I ask a question quickly? Let's talk about unification and exactly what you mean, because I might be guessing. Well, I guess it's related to a question I had, which is, what is the notion of a customer? Once I pull all this stuff in from all over my company, how do I figure out – because they've got different schemas, different data definitions – what is one customer? [00:23:44] SM: That's a good question. What a customer means to a consumer company and what a customer means to a B2B company can be very different. Let's start with the simple case of a consumer company. Let's say you're a B2C company, you are like Crate & Barrel. You mostly sell to individuals. Then, a customer is finally an end user. There is some end human being that is buying from you. Now, when you are collecting data, you may not have a consistent idea about that user.
You may be coming to their website, you may be going to their mobile app. When you come to the website, you get a cookie, and you browse as a guest – you never logged in. Then you did something on the mobile app, and you never logged in there either. And then maybe you created an account, signed up with an email on the mobile app, or maybe you signed up with a phone number. So, you have all these different identities, which are now generating events. You have events associated with all these IDs. You can't stitch them together till you provide some [inaudible 00:24:49]. So maybe you eventually give your email address on the mobile phone while you're checking out. When you give that email address, at that point, you know that all these activities are from the same user. This email address, this phone number, this mobile device ID, this cookie ID – everything belongs to the same user. So, when you're combining all these activities into one customer table, the first step is to stitch all these identities – this anonymous ID, device ID, cookie ID, email, phone number – into one user. That's step one. This is called identity stitching, ID resolution, and so on. It is surprisingly hard to do in SQL. You have to write a lot of logic, you have to get it correct, and so on. That's step one. Now, think about it – it's not just ID stitching. You're trying to compute features. A feature is, for example, how many times you have browsed, let's say, a product. That's a feature. How many times have you looked at furniture? Now, to compute that feature, you may have to combine activities across all these IDs. You may have browsed furniture as an anonymous user on a mobile device, and you might have browsed as a known user on a web device. You want to combine all of them into one single feature, right? You have to look at all of this and compute that feature. Now, some features are even harder to compute. Let's say you're computing the revenue feature: how much am I getting from this person? Think of somebody having a subscription business, so you have one-time purchases and you have subscription purchases. Subscription data will be coming from your Stripe account, while your one-time purchases may be coming from some transactional systems or web events. So, you have to combine all of them. And the Stripe data may have a separate ID, right? Computing the customer 360 may sound simple, but as we work with customers, we've seen that they spend months and months, hiring a team of data analysts and so on, to compute this table – this golden customer record. [00:27:00] JB: I'm dating myself, but we used to do it all by hand, right? You did it all by hand. Join it all, create what we now call features, and then it immediately changes. That's the other problem. [00:27:13] SM: Exactly. Yes. That is the other part. You're doing all of this because you want to send this data somewhere – your marketing team needs a feature to run a marketing campaign, right? They need the total revenue to run a campaign. Tomorrow, they might need a separate feature, like total revenue in the last seven days, or something. Every time they need a new feature, they have to go back to the data team, the data team has to figure out how to compute that feature, it gets into a sprint, and it takes two weeks to compute that feature – by which time marketing needs something else.
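To illustrate why the identity stitching step described above is essentially a graph problem (and why it gets awkward in plain SQL), here is a small, self-contained Python sketch that resolves a set of observed identity pairs – cookie and email, email and device, and so on – into one user via union-find. The IDs and pairings are made up for illustration; RudderStack's actual identity resolution runs inside the warehouse.

```python
# Each observed link says "these two identifiers were seen together" (e.g., a checkout event
# carrying both a cookie ID and an email). Stitching = connected components over these links.
observed_links = [
    ("cookie:abc123", "email:jane@example.com"),    # web checkout revealed the email
    ("device:ios-555", "email:jane@example.com"),   # mobile app login with the same email
    ("email:jane@example.com", "phone:+15551234"),  # phone number added to the account
    ("cookie:zzz999", "device:android-777"),        # a different, still-anonymous user
]

parent: dict[str, str] = {}

def find(x: str) -> str:
    """Return the canonical representative for identifier x."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a: str, b: str) -> None:
    """Record that a and b belong to the same user."""
    parent[find(a)] = find(b)

for a, b in observed_links:
    union(a, b)

# Group every identifier under its resolved "canonical" user.
resolved: dict[str, list[str]] = {}
for identifier in parent:
    resolved.setdefault(find(identifier), []).append(identifier)

for canonical, ids in resolved.items():
    print(f"user {canonical}: {sorted(ids)}")
```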
This whole process of unification, if done manually by the data team, takes so long that it's not useful for the downstream tools. So, that's – [00:27:57] JB: Maybe people can't do it anymore. There's just so much data that possibly you need to infer it programmatically, right? [00:28:03] SM: Exactly. So, that is the unification product we launched. We thought people could take care of it themselves. But as we worked with more customers, we saw this is often the hardest piece, so we decided to productize it. We launched this product at Snowflake Summit. We have some early customers. We got some great response. But the whole idea is to make it easy to create this customer 360 – for data teams and engineering teams, but also for non-technical people, like marketing, and product, and so on. [00:28:36] JB: That's such a great call out, too. Hopefully, I'll see you at Snowflake Summit. But that's exciting. I think it's such a great evolution of the roadmap, and I can see why people are excited about that. Because as hard as the baseline work is, people feel the result in this activity of unification. That's where they see the output of what they wanted to do. [00:29:02] SM: That's a great point. In fact, when we were working with our early beta customers on this, they were using RudderStack to collect all the data going into the warehouse. But this output, the customer 360, is what really triggers the light bulb: "Okay, I can see this one record per customer with all these features." That's when they also get creative. Maybe I can also add this additional feature. Maybe it would be interesting to get how many support tickets they have, and they will add another source. That's why we don't charge for this product yet. We charge only for the data pipelines. But this product triggers the thought of, "Okay, we need more data. We need more interesting sources," and so on. [00:29:44] JB: I think it really puts your finger directly on the existential problem for most complex businesses. There are just not enough data people. The people who know what features they need, the people who are asking the right questions, they don't have the data chops, right? [00:30:02] SM: Exactly, exactly. [00:30:03] JB: It's a problem. I mean, it used to be something funny we talked about, like, "Oh, we'll just bring in the less technical people." Listen, that's the majority of the people in organizations. They know the business and they don't have enough tools. So, I think that's cool. [00:30:18] SM: That is exactly the problem. The other thing I see is, unless you can create that customer 360 and show value from it, you don't get budget from your execs to hire more data people. But unless you hire those people, you cannot create it in the first place. I think this is actually an opportunity to break out of that cycle. [00:30:44] JB: I want to talk about your business. On the business side – we talked about the product and the tech – I want to talk a little bit about the business. I should have done more research on this, but I assume you're getting questions about generative AI and how it affects this modality that you've identified as something customers need. Are there any common questions you're hearing? [00:31:06] SM: Yes. So, I think everyone and anyone is interested in generative AI, including my mom: what are you doing in generative AI?
We are building some things around embedding generative AI into the product, starting with support, which is the easiest one – how do you do better search over your docs and so on – and then we'll add it in the product. Transformations, as I said – we have a transformation feature, and we can auto-generate code and so on. Those are small things in the product. I think the big driver for us is that I believe we will finally realize one-to-one personalization with generative AI. We have been talking about it forever, one-to-one personalization. We talked about it in my previous company, and even 10 years before that people were talking about it. But there were two missing things. One is that you did not have the tech to do it, right? Even if I knew everything about you, as a human I could write the most perfect marketing message for you. But then, how do you scale that? There was no tech to do that. That's why the whole marketing ecosystem was built around segmentation. You create broad segments – women in New York get this messaging – and then something else for another segment. It could not do one to one. Now, finally, we have the tech to do one to one. You can feed everything about a user into some magic – ChatGPT or something built on ChatGPT – and it will come up with the perfect message for that person. Assuming that is solved, what you still need is to collect everything about that person. If you don't have the data, then you cannot call this model. Hopefully, that will keep us in business, and that will help drive revenue for us. So far, we have been more focused on the data problems, and right now we have not built these use cases. But somebody will build them – us or somebody else – to take all this data and craft the perfect marketing message for that person, given everything they know, or the perfect support message, or whatever. [00:33:18] JB: That's right. That's interesting. Yes, I feel that way, too. I get excited about solving data problems with cloud data, because we've been waiting so long for the promise of true personalization, accurate sub-second underwriting, all of those things, right? I do feel like it's a little bit easier to see it on the horizon than it was the past few years. That's interesting. Let's talk a little bit about the business and the traction you're getting with customers. I know you've got a couple of use cases you can talk about, or some happy customers. Again, let's intersect that with how customers implement. How do they get started? Maybe you can use a couple of examples of real customers, if you have some to share. [00:34:02] SM: Yes, I'm not sure. Maybe I can talk about a specific use case, like the one we are sharing at [inaudible 00:34:09]. [00:34:10] JB: I really didn't ask the question correctly. It's just that it's a lot. With what you do, you would come to me as the head of a bank and say, "Give me all your data." I'm sure people feel like they don't want to do that right away. So, how do you carve out your land before you expand? [00:34:30] SM: In fact, that's the whole value prop of RudderStack. We don't say, "Give us all the data." The story we tell our customers is: you have some data in your data warehouse. Are you truly getting business value out of that data?
Maybe you're powering some dashboards and so on, but are you getting business value out of that data? Then, they will say, "Okay. Yes. I have some basic churn model. It would be good to build these initial use cases to drive business value." That is the conversation. Then we say, okay, to build that churn model – again, taking that example – do you have a golden customer record on which you can train the model, and then run the predictions and so on? More often than not, the answer is no: we don't have that. Then the conversation goes into, do you even have the data to build that golden customer record? Do you have all your first-party data? Do you have your ETL data? Again, more often than not, the answer is no. We have some data, not everything, and so on. That's how we sell. We sell backwards. We ask, "What is the business use case you're trying to drive? Do you have the golden customer record? Do you have the data completeness?" So, we sell that way, but implement forward. You start collecting data, then you build the golden customer record, and then you go and activate the use case. We have customers who have done this whole thing in a month. We have customers who take six months to make this journey. But that's our pitch. That's the one product. The other product is data collection. Thankfully, Segment, which was acquired by Twilio, has built a big business around data collection, and they have told the world that this is a problem: you should not be doing it from scratch, just use a SaaS solution. We are an API-compatible replacement for Segment. I have great respect for Segment and what they have done, but after an acquisition, you always have unhappy customers who want to move on. That's the other product. We get a lot of inbound leads just for that – data collection – and then the whole journey. So we have both of those marketing motions working, if that makes sense. [00:36:43] JB: Yes, absolutely. So, you're all SaaS, is that correct? [00:36:49] SM: Yes, we are all SaaS. [00:36:51] JB: Okay. The other question I had real quick is, help me understand the difference between the open source offering and the rest. [00:37:01] SM: The only piece which is open source for us is the data collection piece, right? For the event streaming data collection, our SDKs are open source, our back-end is open source. In fact, when we first launched RudderStack, we launched it as an open source alternative to Segment. That's when we first got on Hacker News. In fact, that's how I first got on Software Engineering Daily, in response to that Hacker News post. You had reached out; you wanted to talk about why we were doing this, and so on. That's how we got to market. That's open source. We have a thriving community – we have a lot of open source users. We get a lot of commits to the code base, especially around our destination SDKs. We integrate with so many destinations, and that's where we get some community contribution, even from the downstream vendors. That's what is open source. But then, why people pay us is, sometimes they don't want to manage the open source, number one. Number two is sometimes they have scale requirements, right?
If you want high availability, if you want multi-node, or to scale up and scale down on the ingest piece, that's what is only available in the SaaS offering. That's how we make money. That's part one. The other parts – the unify and activate pieces we were talking about – are our commercial offering. Those are only available in SaaS. [00:38:20] JB: Are you priced on consumption? [00:38:22] SM: It is priced on events – the amount of data that moves through RudderStack. It's pretty straightforward. Whether you're moving ETL, streaming, or reverse ETL data, either way, we just charge for whatever you're moving through us. [00:38:33] JB: Gotcha. Well, that sounds amazing. I know you're going to be at Snowflake Summit talking about some amazing new stuff. I know you also had a couple of examples – just walking us through a real business case and what really happened is sometimes really helpful to hear, and it frames up the rest of what we've talked about. [00:38:52] SM: Yes. One case study that we are presenting at Snowflake Summit is with a customer called Wyze. They make mobile cams – cameras that you can put in your house and so on. They have over 24 million subscribers. It's a huge business with, as you can imagine, a lot of data. The mobile cams are IoT devices, they generate data and so on. They wanted to drive similar use cases, right? Can you predict churn? Can you drive up LTV? Any subscription business has the same set of use cases. So, they are using RudderStack to collect all that data into a data warehouse. They're using the unify product to create that golden customer record. They're pushing that back into Braze to run engagement campaigns, to build churn models and so on, and they moved from some legacy CDP onto that stack. I don't remember off the top of my head, but they have something like a 5x increase in productivity in deploying new ML models and adding new features to profiles – [00:40:06] JB: That's big. [00:40:07] SM: Because, again, as we were discussing earlier, a year or two ago you had to handcraft a lot of those features with complex SQL and figure it all out. I think with the unify product that has become much easier. [00:40:19] JB: One thing I forgot to ask you real quick, and then we can wrap it up: customer data has a lot of obligations around it for privacy and tracking. I should know this, but how do you address that, or is that something a partner does? [00:40:35] SM: In fact, that is one of the biggest selling points of RudderStack. That's why we have a lot of traction in the EU as well, where people are very sensitive about data privacy, a lot more than in the US. Our story is simple. We don't store any customer data, right? We are just the data pipelines for moving the data. All the unification also happens on top of the customer's data warehouse, their own Snowflake instance, or BigQuery, and so on. We do not store data, and that really simplifies our story. We also have a deployment in the EU so that EU data stays in the EU and US data stays in the US. We also have a deployment model where we can run the whole data plane – the data plane I was talking about – still SaaS, but inside a customer's VPC. In that case, we don't even see the bits. In SaaS, we at least see the data moving through us. But here, if it is running in a VPC, we don't even see the data. It moves entirely through the customer's environment.
On compliance – in fact, when we started, our pitch was that compliance is going to be important, and we are a solution to that. The warehouse-first approach, everything was centered around compliance. Now, over time, we realized warehouse-first has a lot more advantages, not just compliance. [00:41:56] JB: Yes, that's a good plan. You rise above it, and whatever the customer is doing, they stay in control in terms of protection. That's smart. I can see why, certainly in the EU, they would love that. Well, we're kind of wrapping it up. I have a couple of quick questions. Is there anything you wanted to talk about that I forgot to ask about? [00:42:19] SM: No. This was great. I think you got everything. Thanks for that. [00:42:25] JB: Absolutely. It's so great to have you here. Then, one last question I'd like to ask some of our more technical founders: our audience is made up of software developers, architects, people who might be considering doing a startup. I'm just wondering if you have any words of advice as a technical founder. [00:42:40] SM: Yes. I have been a founder before – this is my third company. I started a company right around college. I did another company as a CTO. And this one I'm running as founder and CEO. I don't know if it applies to every technical founder, but there are two things, actually, which are important. Number one is, as a technologist, you are always trying to look for something new. When I was doing my PhD, the entire focus was that you have to find a research problem that nobody has worked on before. Only then can you get a paper published. As a technologist, you think: what can I build that's new? Nobody has thought about it, I'll build something amazing. That is always the inclination. You almost have to unlearn that. The goal is not to build something new. The goal is to build something which people need. If people need something, more often than not, somebody will have built it once before. Somebody will have tried it; maybe they didn't have the right solution, maybe they had an old architecture, and so on. Don't try to build something new. Try to build something people care about, which, more often than not, means people have done it before. We often reject ideas saying, "Oh, this has been built before. We cannot build a company here." I built RudderStack when Segment was doing $100 million in revenue, and there was still so much market left. I give a lot of credit to my investors and my advisors for pushing me to actually go after this market, irrespective of whether Segment was there or not. I think that is one big learning as a technical founder. The second thing I would say is a follow-up to that. You can either try to create a market or go after a known one. There are only so many risks you can take as a founder: there is market risk, there is technology risk, and there is team risk. These are the three main risks, and some people are good at some of these things. Uber created a market, right? Clearly, the founders were great at creating a market. Technologists like us, engineers, we are often not super eloquent. We are not the great charismatic people who can create a market, right? It is often better – I think we have a higher chance of success – if we go after a known market with a better solution. What we can do is build better products and so on.
It goes back to the earlier point: don't try to do something new. Work on something which has been done before, do it better, and then – [00:45:24] JB: I love that. I think we do suffer from trying to be too cool. [00:45:29] SM: Yes. Because I do it too. [00:45:31] JB: I feel that when I write blogs. I've thought, "Oh, everything's been said. Everything's been written." Or when I'm posting to LinkedIn. But it hasn't all been said. [00:45:40] SM: Yes, I mean, this is the biggest thing. Look at how many generative AI companies there are. In the last three months, I've probably been pitched 20 times by different companies doing generative AI for support and so on, and only one will survive. The market is not even clear. Versus, if we look at Segment – again, I'll take that as an example – for seven years, they had literally no competition, and they made a $300 million business, right? And it's probably a $10 billion market. There are all these white spaces, markets where there is literally no competition. Versus, as technologists, we often try to do cool stuff instead of going where the markets are. I think of Peter Thiel. He's a controversial figure, but I have great respect for his thinking. He has a book around this whole idea: you can build a great company only when there is no competition. Don't try to compete. Anyway, I can go on and on, but – [00:46:40] JB: I think that's great. We have to do another show on the nature of competition and what it means, because I do think that's such an important topic. Even in investing, people say, "I don't have any competition," and I feel like, "Oh, I don't know. I feel like everyone with a good idea should have some competition." [00:46:56] SM: Hundred percent. [00:46:58] JB: Anyway, it's great to talk with you. I wish you all the best at Snowflake Summit. Start drinking water now, wear your sneakers, and we'll see you again on the show. Best of luck. Thank you. [00:47:10] SM: Thanks. Bye. [END]