EPISODE 1670 [EPISODE] [0:00:00] ANNOUNCER: All robust technology platforms require testing to ensure that features work as intended. In many cases, tests require data, but getting access to valid and high-quality test data is a common challenge, especially when the technology runs on sensitive data. Realistically mimicking data that would normally contain sensitive financial or personal information is not easy. Tonic AI was started in 2018 to provide developer tools to transform production data into safe testing data. Andrew Colombi is the CTO and Adam Kamor is the head of engineering at Tonic. They joined the show to talk about creating realistic synthetic data, data deidentification, validating LLM RAG output, Tonic's subsetting engine, and much more. Gregor Vand is a security-focused technologist and is the founder and CTO of Mailpass. Previously, Gregor was a CTO across cybersecurity, cyber insurance, and general software engineering companies. He has been based in Asia Pacific for almost a decade and can be found via his profile at vand.hk. [INTERVIEW] [0:01:16] GV: Hi, Andrew and Adam. Welcome to Software Engineering Daily. [0:01:20] AK: Hey, thanks so much for having us. We're really excited to be here today. [0:01:23] GV: Yeah, it's great to have you both here today. A nice change for us here on Software Engineering Daily to have two guests. So, you are two of the, I believe, is it four co-founders of Tonic? [0:01:34] AK: That's right. [0:01:34] AC: Yes. [0:01:36] GV: And we're going to hear a lot more about Tonic shortly, which is exciting. But maybe, Adam, do you want to first just introduce yourself? And what was your road to co-founding Tonic? [0:01:47] AK: Oh, certainly. We're going to hop right in. So, my name is Adam. I'm Tonic's head of engineering. As you said, one of its co-founders. The road to starting Tonic. Well, a lot of it is being in the right place at the right time and luck. I think that's true of a lot of endeavors, frankly. 
For me, the lucky part was having worked at Tableau Software in Seattle back in 2017, where I met Ian. Ian is Tonic's CEO and one of our other co-founders. He and I worked on the same team at Tableau. We were both on the product management side. We got to know each other. We were friendly. We both respected each other's work ethic, hustle, just like abilities in the engineering space. At some point, it became clear we were both kind of interested in starting a company, and we knew each other, I think, for two years before that really happened. But what ended up happening is I left Seattle to move back to Atlanta for family reasons. Then, Ian ended up leaving Tableau in Seattle to move to San Francisco. Then, I got a call from him one day saying, "Hey, Adam. I'm thinking about starting a company. I have these partners that I'm working with." One of them being on the call today. And he asked if I was interested in kind of like brainstorming with them and iterating on some ideas and building some silly little like demo software, demoware. Fortunately, I agreed, and the rest is history. [0:03:05] GV: That's cool. I didn't realize there's a common Tableau part of the history there. So, that's fun to know. [0:03:09] AK: That's right. I am the outlier, though. I know Ian, but the other three co-founders all know each other a lot better. And I'm sure Andrew is going to talk about that. [0:03:17] GV: Right. So yes, Andrew, what was your road to co-founding Tonic? [0:03:22] AC: Yes. My lucky moment was working at Palantir with Ian and Karl. I actually hired both Ian and Karl to their roles at Palantir, and I was their boss for a little while. Now, the tables have turned. Ian is the CEO and our business daddy. I was working at Palantir with them. We worked on a lot of data things, of course, we were at Palantir. We even kind of explored this problem. I think independently with Ian and with Karl, I did like synthetic data projects. 
I think I did it first maybe with Karl, and then later, because we were working on some projects for Palantir. It was with like a big bank and we needed data for testing, as you do. So, I was like, "You know what, I can do this. I'm just going to make some synthetic data for them." I was like, "How hard can fake data be? We're going to do this in like a week, and we'll be good to go." And like, weeks later, I'm still figuring out Poisson distributions and recalling how a hidden Markov model would work and stuff like that. I realized how hard this problem was. But the cool part about it was it also ended up being more valuable than I thought it would be. So, we did this project for this banking customer that we had. But then, we were using it in demos to other banks, or demos even outside of the banking industry. Then later, Ian had a project in a totally different space. It was like a media website kind of a thing, and they were looking to analyze and work with data about traffic and figure out what their customer patterns were in ad revenue, yada, yada. So, I was like, "Hey, I got this tool set that I built with Karl to do synthetic data. We can use it to test the software that we're building for this customer." So, I did it again with Ian and that was really cool. Years passed, I quit Palantir, and I was working on a video game with my sister. I was just working on the video game and then Ian was like, "Hey, we're working on this thing." And I was like, "I'm a pretty crappy video game developer. But I'm really good at data." So, it's probably the better choice for my career to go do this data thing with him. Fortunately, I was able to bring my sister along for the ride. She is now on our marketing team. So, that's really cool. [0:05:32] AK: One of our first employees. Our first hires. [0:05:33] AC: Yes. It was kind of like my rider. "You want me, you get my sister." No regrets, I think. [0:05:40] AK: No. None at all. 
Let me say, I mean, Andrew is, in fact, excellent at data and I did see an early demo of his video game and I'm glad he switched. [0:05:51] AC: So yes, that's kind of my story and no regrets. It's been a wild ride. Maybe I'll get back to the video game at some point. [0:06:00] GV: Yes. You wouldn't be the first developer to try video games and end up not there. I keep dreaming about that and I haven't even tried it. So, at least you got that far. That's very cool. Let's talk about Tonic. Let's talk about, I believe, at kind of the start of Tonic, there was, and sort of still is, a core platform that is Tonic, and then there are more products that have come since then and we're going to talk about those as well. But what was the founding product? And what is the core platform? [0:06:27] AC: Actually, we're about to kind of rechristen Tonic, core Tonic, as Tonic Structural, and that's to differentiate it from Tonic Textual, which is one of our new products that you're alluding to. But yes, the core idea behind Tonic Structural was this idea of working with structured data and creating great, fake test data in the structured data domain. By structured data, we mean databases with rows, and columns, and tables. But we also mean JSON documents. Those are structured enough for us. We don't mean a random text blob of data. That's come later in Tonic's life. But the beginning was just this idea that people need test data. Everyone needs test data. Applications are built on top of databases, like Postgres, and MySQL, et cetera. There aren't great tools out there for doing this. So, that's what we kind of set out to do. I was actually looking at our blog recently, and some of the core things that we built the product around at the very beginning, in the summer and fall of 2018 (we founded the company in May of 2018), are still the core things that people buy Tonic for today, which is pretty cool. 
[0:07:46] AK: Yes, we got at least one thing right, Andrew and I. Andrew is right. Our primary use case for Tonic Structural has been providing deidentified or synthetic data to engineering teams, so they can do development and test. But that same tool, using the same technology, is able to satisfy other use cases as well. Definitely secondary use cases. They're not our main bread and butter, but they still sell, and it's for things like generating demo datasets for sales demos. Or deidentifying data sitting in a large data warehouse or data lake for safe analytics purposes. The tool has kind of naturally evolved to fit these other use cases using the same underlying technology, which is great. Because obviously, we want to be constantly making the pie bigger. But still, as Andrew said, since day one, it's been about dev and test data and that remains true today. [0:08:38] GV: So, walk me through, I imagine, there's maybe two cases here. We've got one, I guess, where a product is fairly new, and it's going to need a ton of data to confirm it's going to work the way that the developers are sort of intending it to. The other side, I guess, is there is tons of data already in there. But it's not okay for the developers to have access to the real data. So, could you maybe walk us through those two scenarios, or any other scenarios? How does it work is what I'm interested in. [0:09:09] AK: So, of those two scenarios, we primarily see the second scenario. That's where the need and the demand seems to be. I'll speak to that. But like, I guess, before I do, I will say that if you have a new application and you have the need to fill a database with data, so that you can test the application to see how it behaves, that's possible to do with our product and with other products as well. 
But our product relies on there being an existing set of data in the database, so that we can create something for you that is super realistic. If you come to us just with an empty database, yes, we can fill it with data that will satisfy constraints, that will adhere to the correct data types of the schema, et cetera. But it's going to be garbage. Maybe that's fine for what you're trying to test. But it can't be realistic if we don't have anything to base it off of. The alternative would be, it's like, okay, well, I'll just generate rules for how every column has to be generated. But then, what you end up doing in that case is, you have to come up with a rule set that is basically as complex as the business logic of your application, and that becomes difficult to do very quickly and doesn't really scale well. People typically shy away from doing that pretty quickly. To Andrew's point earlier, where he was talking about Poisson processes, and hidden Markov chains, and all of these things. If you're kind of having to set these rules yourself manually, it becomes very difficult. It's better to use a tool like ours that just looks at the existing data set, and is able to create a synthetic or deidentified version of it. Andrew, do you want to take that second question on? Like, okay, so people come to us with a database full of data, their developers can't access it, what do they do? [0:10:45] AC: Oh, yes. I mean, their developers can't access it. That's exactly where we come in. As Adam said, we're building on top of the knowledge of your dataset. From there, we're able to recreate things like the weird business logic that you have, and the weird - it's not even weird. It's like the natural phenomena in the world that are represented in a database. You've got a dataset of healthcare things, healthcare claims, that's a common one. Like healthcare insurance. 
It essentially represents the real world in this kind of esoteric, arcane way that is known as a relational database. The real world is really complicated, and interesting, and rich. That's why we believe the best data comes from data that exists, because then we can learn about these real-world phenomena and recreate them. [0:11:36] AK: Yes. That's right. I think you can possibly bucket the people that come for use case two, Gregor, the one that you said, where they have data, they just can't access it. Really, I think those users can be bucketed into one of two groups. There's development teams that can no longer access production data, so they have to create their own staging and development databases. They're typically very poor imitations of production. They lack the complexity and plethora of combinations that exist in production, they lack the edge cases, and they certainly lack the scale of production. That is to say, the number of rows. For those customers, you bring in Tonic. Tonic, essentially overnight, is giving you high-quality data that looks and feels just like production, and it's at the same scale as production. So, for those teams, Tonic becomes a developer productivity tool. We've just made your development team way more efficient, because they get more realistic data to work with. Then, we have customers that are actually, today, using their production data in their lower environments for dev and test. Typically, there's a lot of red tape associated with that. It's hard to do things, because you have to have the same controls in place on dev and staging that you do in production. But for those customers, we're typically thought of more as a privacy tool, right? Before, they were doing something that really wasn't safe; they were giving everyone access to production data. Okay, well, now they're not. You get the same amount of efficiency with a little bit less red tape as well. 
[0:12:52] GV: I think that's a good thing to call out, that there are really two sides to this. Your mission statement very much clearly says this, that, on the one side, this is about making developers more productive. On the other side, it is about that everyone has a human right, effectively, to data privacy. I think the majority of people out there probably never realized this about a lot of the data. Okay, it's not out there in the public, so to speak. But developers, unfortunately, have often had access to data that they should not have had access to. [0:13:20] AC: Yes, I mean, it is. It's in the public, in a sense, right? It's like these people that you've never interacted with, or aren't even aware of, have access to your data. There are a couple of things that really make me proud of working at Tonic. One is that we're unequivocally doing something good in the world. I think a lot of companies, a lot of tech companies, especially, have these lofty mission statements, and we don't even have a lofty mission statement. I don't think you'll find one on our website. You can go to the about page or the company core values or whatever; it just doesn't exist on the website. But this is pretty clear in the founding of the company and what the company does: we protect people's data. That's what we do. Then, the other thing that I think is really cool about working at Tonic is we also get to work on really gnarly computer science problems. I mean, lots of places get to work on cool problems too, especially scale problems, as you grow a company, in terms of the technology and what have you. But we also get to work on really mathy problems. And if you want to understand the insides of a database, Tonic is a good place to learn that stuff. [0:14:16] AK: Yes, certainly. [0:14:18] GV: In terms of how that data gets, I guess, populated by Tonic. Is it table by table? Or are there actually links, where tables have data that relies on other tables? 
There are links there as well. [0:14:33] AC: It's complicated. I mean, I think we have some of the best technology for quickly loading data into a database, because that's our principal performance bottleneck, getting data into a database. For the most part. Depends on the database. But some databases are really good at ingesting data, like Snowflake, for example. But if you're talking about Postgres or Oracle, then yes, that's a principal part of the technology, and there are a lot of different ways we traverse the table graph. The table graph is sort of like the graph of tables, using foreign keys as the edges, and the nodes are the tables themselves. There are a lot of different ways that we traverse that graph depending on what we're doing. We have another technology we're really proud of, the subsetting technology, which shrinks a database. It does that by starting with a target table, or more than one target table, taking a portion of that target table, and then using that seed set of rows to expand throughout the rest of the database and collect all the rest of the data associated with those rows. So, that would be like, let's say you're an e-commerce website, and you collect some users of your e-commerce website. You'd be getting their purchases, the actual rows that represent the items they purchased, their address book, their credit card book, or wallet, or whatever you want to call it, and hundreds of tables that are being touched by this core set of original rows that you collect. That technology, right there, is fundamentally a technology around how you traverse the graph of tables. Yes, it's too complicated without visuals to explain on this podcast. But it's a really cool piece of technology. We have some ancient blog posts about it that are a little bit cringy when you look back at them, just in terms of, like, we were like, "You know what will sell? A funny blog post title. 
If our blog post title is hilarious, people are going to click the hell out of it." So, we had something like, Honey, I Shrunk the Database. I don't think most of our readers would even connect with that at this point. It's like an old movie. [0:16:32] AK: I actually just watched that movie with my kids recently. They loved it. [0:16:36] AC: Yes. Well, that's a kid's movie. So anyway, yes, there are some really meaty technology problems underlying there, and funny titles didn't make a difference. We were still unknown. [0:16:48] AK: Yes. But we got some of our first customers from the open-source library Condenser that we released. Early on, there's a lot of noise out there. Every company is trying to get your attention, and trying to get your slice of that attention is a real challenge. We were trying things like Andrew was saying, like funny blog posts, and things like this. It doesn't really work. But one thing we did that was successful, but very expensive for us, was to actually open-source some of this technology, the original subsetting engine, which we wrote in Python. We called it Condenser, because it condenses things. It got us our first customer. I think that's correct, Andrew, isn't it? [0:17:24] AC: I don't know. Maybe - [0:17:26] AK: It did. I don't want to say the name of the customer, because I'm not sure what our logo rights are with them. But it got us our first customer, because they actually filed an issue on the repo, and then that started some conversations with them, and then that eventually led to a commercial relationship. I mean, at the time we were writing this, it was just like, I don't know, it was very frustrating, because we had this cool tech and we just wanted to get it out there. And that really put some wind in our sails that was very much needed at that time. [0:17:53] AC: Yes. And subsetting has been a huge thing for us. I mean, it's probably like 50% of our customers use subsetting, which is a pretty big deal. 
I don't think there's any other single feature in Tonic that comes close, except for maybe the very core UI or whatever. But 50% of our customers use it. So, that's a pretty big penetration for us, which is really, really great. [0:18:12] AK: Absolutely. Our subsetting engine is very unique to us. The underlying technology. We have patents on it. The approaches we take, the experience. It's all patented. It's a very good engine. I've never seen anything really approach the level of sophistication that our subsetting engine can do. [0:18:31] GV: For those that are not familiar with it, how would you, just at a high level, describe subsetting? And what about Tonic's engine is kind of superior in that? [0:18:39] AK: Sure. Well, I'll describe it. Andrew can say why the engine is so great, if you can think of a good reason on the spot like that. At a high level, we basically treat the database as a dependency graph. So, the nodes are tables, the edges are your foreign keys. Then, the user basically provides one or more target tables. These are tables that they say, "Hey, give me a random percent of rows from this table, like 5%. Or filter this table based on this provided WHERE clause." Then we can go and reduce that table in size, based on the percent or the WHERE clause. That's the easy part. But then we have to, in a very special way, traverse that dependency graph of tables, bringing over only rows from other tables that are necessary to maintain referential integrity. At the highest level, that's what subsetting is and what it does. Why our engine is so great and sophisticated, Andrew, can you answer that? I'm not even sure if I could answer that now. [0:19:30] AC: I think the thing that we really strove to do is inject determinism into our subsetting algorithm. So, if you just naturally take what Adam described, this graph with tables as nodes and edges as foreign keys. 
A really trivial example of a problem is you have a cycle of foreign keys. You have Table A, which has a foreign key to B, which has a foreign key to C, which has a foreign key back to A. Actually, an even more common thing in real-world tables would be something like Table A has a foreign key to Table A. Right back to itself. If you start collecting data, and you're like, "Okay, I got some rows from A, now I need to get some rows from B, because I've got a foreign key from A to B, which means that for these rows to exist in A, I need certain rows in B. Now, I follow that again to C. And now, I have to go back to A." That means I get more rows from A, which means in turn, I need to get more rows from B, more rows from C, more rows from A, B, C, and so on and so forth. Eventually, you might collect a whole table. There's no end in sight, necessarily. So, the thing that we really strove to do in our algorithm is make it deterministic, and maybe that's not exactly the right word. Finite number of steps. That's the right word. It needed to finish in a finite number of steps. I didn't want any possibility of looping forever, or looping indeterminately. So, that's what we set out to do. Going into the details of how that works is complex, but the long and short of it is, upfront, before we collect a single row from the database, we have a plan. The algorithm comes up with a plan of attack for how it's going to do this subsetting, and it never needs to diverge from the plan. It's essentially a set of traversal rules for how to explore this graph of tables that produces that. But yes, that's the core of it, this finite-step idea. Then, there's a bunch of other stuff in there that's really important. That's the core idea, right? Like, "We're going to do finite steps. Great. We got it. Finite steps. Finite steps." Then we deployed it to the first customer. And they're like, "Oh, but does it handle polymorphic foreign keys?" 
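The cycle-safe traversal described above can be sketched in a few lines. This is a toy illustration of the general idea, not Tonic's patented engine: tables are plain dicts, the foreign-key graph and table names are invented, and the termination guarantee comes from marking each (table, primary key) pair as collected at most once, so the loop finishes in a finite number of steps even when foreign keys form a cycle like A to B to C back to A.

```python
# Toy subsetter: tables are {name: {pk: row}}, and fks maps a child
# table's column to the parent table whose primary key it references.
# Hypothetical data model, for illustration only.

def subset(tables, fks, target_table, keep):
    """Return {table: set_of_pks} that keeps referential integrity.

    keep: predicate choosing the seed rows from the target table
          (the "5% or WHERE clause" step described in the interview).
    """
    collected = {name: set() for name in tables}
    worklist = [(target_table, pk)
                for pk, row in tables[target_table].items() if keep(row)]
    while worklist:
        table, pk = worklist.pop()
        if pk in collected[table]:
            # Already collected: this check is what bounds the loop,
            # even when the foreign-key graph contains cycles.
            continue
        collected[table].add(pk)
        row = tables[table][pk]
        # Follow every foreign key out of this row to its parent row,
        # since that parent must exist for this row to be valid.
        for (child, column), parent in fks.items():
            if child == table and row.get(column) is not None:
                worklist.append((parent, row[column]))
    return collected
```

Note that this sketch only follows child-to-parent links needed for referential integrity; the parent-to-child expansion ("collect the users' purchases") that the interview also mentions is the harder direction and is omitted here.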
What is a polymorphic foreign key? It's like, I have a table that has two columns. One of the columns has a foreign key, like a typical foreign key. Goes to ID five. The other column is like, which table does that foreign key relate to? So, it's not a foreign key that you would ever see in like a Postgres database as a foreign key, created with ALTER TABLE, ADD FOREIGN KEY, whatever. It's like a conceptual foreign key. So, you have all these things that are real-world database problems. Does it work in Oracle 11g? We've tackled all that, too, right? There's this beautiful theoretical concept, that it must have finite steps, and then there's the practical, every day, getting down and dirty with a bunch of databases and actually working for people, that we've added over the last six years, that makes it really valuable as well. [0:22:26] AK: I think we'll get to this later in the podcast. Our subsetting technology has unlocked new experiences, features, and products that we can now provide to our customers. As soon as you're able to take a one terabyte database and reduce it down to a really realistic and deidentified five gigabytes, or one gigabyte, or what have you, you unlock a bunch of scenarios that you couldn't really do before, because working with one terabyte is cumbersome, and difficult, and time-consuming, and expensive. So, for example, our customers can take a subset of databases, and instead of outputting them to a live database server, which is the default mode of operating within Tonic - basically, give us a database, we give you back a new deidentified database - we can now output your data directly to Docker containers, for example. Those can then be pushed to customers' Docker repositories, like an Amazon ECR, or Docker Hub, or Quay.io, what have you. That's something that - you couldn't do that if you were deidentifying a terabyte of data. But as soon as you're reducing it in size, you can start putting it in Docker. 
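The polymorphic foreign key described above can be made concrete with a tiny sketch. The table and column names here are hypothetical, nothing to do with Tonic's schema; the point is that the database engine knows nothing about this link, so any tool walking the table graph has to read the type column to learn which table the ID column points into.

```python
# Hypothetical polymorphic foreign key. No ALTER TABLE ... ADD FOREIGN KEY
# backs this link; it exists purely by application convention.
tables = {
    "invoices": {5: {"total": 120}},
    "refunds": {5: {"amount": 40}},
}

# Each event row carries an ID column plus a column naming the target table.
events = [
    {"ref_id": 5, "ref_table": "invoices"},
    {"ref_id": 5, "ref_table": "refunds"},
]

def resolve(event):
    # The "conceptual" foreign key: dispatch on ref_table, then look up
    # ref_id. Both events reference ID 5, but they land in different tables.
    return tables[event["ref_table"]][event["ref_id"]]
```

A subsetter that only followed declared foreign keys would miss these links entirely, which is why handling them has to be built in deliberately.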
Then, as soon as you have it in Docker, you don't have to have all of your developers hammering the same dev database sitting in some RDS instance all at once. Everyone gets their own local version of the dev data to work with, right? All they do is, hey, just docker pull the container name, and then you can be up and running in minutes with your own dev data that's all local. Then, we have some other products that I think we'll talk about that take that paradigm even further, in terms of making the data easily available and widespread in your organization. [0:24:02] GV: Yes. That feels huge. If the data can be, in itself, a Docker container. Speaking of deployments, Tonic is both cloud and on-prem. What kind of differences are there between those two types of deployments? Or is it pretty much the same? [0:24:18] AC: Not much. Yes, really not much. There are a couple of things about cloud that are a little easier, and a couple of things about self-hosted that are a little easier, in terms of configuration and what have you. Certainly, just being honest, if a customer comes to us and says, "We don't care if it's cloud or self-hosted. Whatever you think is best," we would say cloud, 100%. Because just in terms of getting them set up and onboarded, and us being able to understand what they're up to, and being able to help them quickly, cloud is just easier and better. But we work with the most sensitive data that the org has, and thus the security teams have questions and they're not comfortable with that. So, we end up doing a lot of self-hosted. But in terms of capability, there really isn't that much of a difference. You could probably find a couple of corners where one is supported and the other isn't. But yes, it's pretty much the same. [0:25:06] AK: Yes. Of course, when you run self-hosted - we've been doing that since day one. 
So, our self-hosted install is easy and quick, and assuming the customer has provisioned an EC2 instance or a VM ahead of time, you're up and running with Tonic in 10 to 15 minutes. You can run it entirely air-gapped, if you need to. It's all very smooth at this point. It definitely wasn't originally. So, for your listeners today that are thinking about selling software that customers deploy themselves, there is potentially a long, painful road to get to a point where it's easy. If I'd give a person one piece of advice, and I don't know, Andrew, if you'll agree with this or not. We make our solutions available as Docker containers. So, however your organization is managing, deploying, and dealing with Docker, you can use that to run Tonic. Initially, we supported just Docker Compose, and by support, I mean, we would provide you with a Docker Compose YAML file that you could use. If you wanted to use Kubernetes, ECS, EKS, AKS, whatever, that was on you, and we would be there to help. But it's not something that we had documentation for, ready to go. I think supporting Docker Compose out of the gate was definitely the best move. I think a lot of startups now would probably go straight to Kubernetes, and I think that is likely a mistake. Getting it set up, getting it running, typically there's more of a learning curve. I've seen this a lot lately, where I think when you're starting off, just go with the simplest thing, even if it's not the sexiest, and Docker Compose, I think, really fits that bill, especially if you're just getting started with self-hosted deployments. [0:26:41] AC: Yes. Do I agree or do I disagree? I guess, I mostly agree in the sense that, like, the fastest thing is so important, especially in the very, very beginning. You just need to get it to work somewhere. But with that said, Kubernetes obviously has a lot of benefits in production. Actually, one thing, on self-hosted versus cloud. 
There is a difference, actually: the capabilities that Tonic has if you're in Kubernetes are greater than the capabilities that you have in Docker Compose. Because, unlike Docker Compose, Kubernetes is a real thing that Tonic can interact with, and so it can do stuff with it to upgrade itself, or spin up nodes or pods for the purposes of doing various actions, and what have you. So, there is actually a greater feature set in Kubernetes-deployed Tonic than Docker Compose-deployed Tonic. [0:27:31] AK: But we didn't add those things till much later. [0:27:34] AC: Yes. Much, much, much, much later. [0:27:35] AK: Starting out, keep it simple. I mean, if you can get away with just the docker run command, with everything in a single container, I would do that. Andrew, do you remember, we had that one prospect early on where they were like, "Oh, no, everything has to be in one Docker container. We won't run four containers." So, we built that. What did we call it? We had a name for it. The Omni container? Is that what we called it? [0:27:55] AC: Yes. Something like that. [0:27:56] AK: So Andrew, overnight, had to go create a single-container build that ran everything. I think, initially, Tonic had three containers that ran three different processes. So, Andrew had to go build the Omni container, which ran all three processes at once. He had to basically do it overnight, so we could deploy to the customer the next day. And of course, the customer never converted, or rather, the prospect never even converted. It was a huge pain in the butt. We ended up maintaining that stupid container for months. [0:28:23] AC: I got to learn about some stuff there. At least I learned some stuff, in terms of technology. [0:28:31] AK: We also learned - I mean, it took us more learnings to really get there. We learned not to do things like that, eventually, I think. We're getting better at saying no to these types of requests. Andrew is better at saying no than I am. 
I'm a real pushover sometimes. But oftentimes, it's just not advantageous to do. [0:28:51] AC: The thing I'm about to say is about saying no, which is kind of a random topic. But the thing that I've learned about saying no is that you shouldn't try to guess what their reaction is going to be. You think their reaction is going to be like, "Oh, no, they really need this," whatever, and then you tell them no, and it's like, "Okay." [0:29:12] GV: Yes. "Well, we were just checking, but it's no worries." [0:29:15] AC: Exactly. [0:29:15] GV: And you'd already spent two hours on, "How are we going to say no to them?" [0:29:18] AC: Right. Don't sweat it so much. Also, it's not going to ruin it. If you say no, it's not going to ruin the relationship, probably. [0:29:26] AK: Probably not. [0:29:27] AC: Unless they're really not serious. And if they're not serious, then they weren't going to be a customer anyway. If saying no to something small ends the deal, the deal didn't exist. [0:29:37] GV: These are clearly all good pieces of advice given where Tonic is today. Certainly, something I'm going to be taking notes on, because the product I'm working on is very much at the early stages of where Tonic was, when you were talking about saying no, and Docker Compose. I also was thinking about not putting out blog posts with silly titles anymore. I'm thinking just to skip that part. [0:29:58] AK: That wasn't negative, at least. So, if you've got a funny title, if it makes you laugh, I think it's fine. [0:30:05] GV: That's neutral. [0:30:07] AK: That's neutral. There we go. [0:30:09] AC: Yes, that's neutral. [0:30:09] GV: So, given that things have worked out, let's look to where the product is now. I believe you have three follow-on products: Textual, Ephemeral, and Validate. So, shall we dive into all three? Where would you like to start? [0:30:27] AK: Andrew, let me set the stage for you to talk about Ephemeral. 
So, if everyone recalls, I was saying, "Hey, when you subset data down, you can do cool things like put it in Docker containers," so developers can just pull those containers down. But you can make it even better, and that's where Ephemeral comes into play. Andrew, over to you. [0:30:45] AC: So, Ephemeral is our latest product for test data infrastructure. There are a lot of great products out there for production data infrastructure, things like RDS and the equivalents in Azure and Google Cloud. But if you want test data infrastructure, things like RDS are actually kind of antagonistic to that. If you just accept the defaults in RDS, just spin up the default RDS instance - I calculated this a while ago, and I don't remember the exact figure, but it's going to cost you thousands of dollars a year. Test data infrastructure doesn't need that. A lot of the defaults are things like multi-AZ, and testing infrastructure does not need to be redundant and backed up in multiple AZs. And with RDS, if you shut down an instance, it even has this warning: we're going to start you back up in a week so that we can apply patches and things like that. I've personally shut down an RDS instance, forgot that I shut it down, and then a month later realized, "Oh, I've just been burning through credits," because it just started itself back up. So, we really wanted to build great test data infrastructure and testing infrastructure tools. That's where Ephemeral comes in. It is an ephemeral database, and it works really well with things like subsetting, because it's intended to be lightweight, really quick to bring up, really quick to shut down, really easy to drive with APIs, and it has a very self-service user interface for individual developers. So, we have two main use cases we're targeting with Ephemeral.
The first one - and these aren't ordered, it's just the first one I'll mention - is automated, API-driven ephemeral environments. So, that would be for your CI/CD, if you need a database during testing. CI/CD is such a critical part of modern software development, and you want things to be as fast as possible. There's nothing that's going to make your developers more grumpy than adding 10 minutes to your CI/CD. Ten minutes is anathema; at least at Tonic, we would never allow something to be merged if it was going to add 10 more minutes to our CI/CD. As organizations grow - I mean, I've worked with organizations where the CI can take three hours or even longer. But you want to keep that as tight as possible. So, waiting for infrastructure to start - if you put an RDS instance start in your CI, that's a non-starter, and people don't, as a result. Instead, what they do is run a long-running RDS instance that is up 24/7, so it's just available when the CI needs it. But they're just burning cash when they don't actually need it. If you set it up accidentally as multi-AZ, you're burning twice the cash you need to be burning to make that work. So, that's one use case, the automated CI use case, and the other use case is developers' ad hoc needs. I was just playing around yesterday with Ephemeral, because I needed a test database for a bug that I thought I saw. I was like, "Did this bug actually happen? I should try to repro it." I needed a database to do it, and I just used Ephemeral for that, because it makes it really easy to pick a database that I know is going to be filled with the test data that I need, and it starts up really, really quick, and it shuts down really quick. The other thing Ephemeral has is that it's like the opposite of RDS, where RDS will automatically start your database after a week. Ephemeral will shut down your database automatically, depending on inactivity.
If your database hasn't been used for the last three hours, it's just going to shut it down. Now, the good news is, Ephemeral is so quick at bringing your database back up that it's not a big inconvenience if you actually needed that database. You can bring it right back up, bring it back to life. It works really well with Tonic, of course. Tonic provides test data. Ephemeral provides test data infrastructure. It's providing that place for that data to live. Of course, things like subsetting feature really heavily, because subsetting shrinks the data down, and the technology behind what makes Ephemeral so fast is, I wouldn't say scale-independent, but it has a very small constant with respect to your scale. It's actually quite fast, even if you have a lot of data. But for test data, you probably don't need lots of data. So, shrinking it down beforehand with subsetting and putting it in Ephemeral is a really smart and good idea, and one that we enable really well with the Tonic suite of tools. So yeah, that's Tonic Ephemeral in a nutshell. I'll just recap the highlights. One, it's test data infrastructure first. We designed it as test data infrastructure, unlike RDS, which was designed originally as production data infrastructure. That means it's really quick to bring up and really quick to bring down, because we know test data infrastructure needs those things, and it comes pre-populated with data, meaning you can save a snapshot of a database with the data, and then recreate that database. And it really targets these two use cases: developer ad hoc testing, just needing a database to tool around with for a day or a week, and the automated, API-based deployment of test data infrastructure for continuous integration. [0:36:15] AK: Andrew forgot the most important part about Tonic Ephemeral. [0:36:18] AC: Oh, no. [0:36:19] AK: Yes.
[0:36:20] AC: What's the one part? [0:36:21] AK: That your listeners can go create free accounts today. You can learn more at tonic.ai/ephemeral. From there, you can either book a demo, or just go and create your free account and start spinning up test data infrastructure basically immediately. It's a great experience. I encourage everyone to go try it. If it's not clear to you, Gregor, Andrew is a better engineer than I am, but I'm a little better at selling. So, that's where I pride myself. [0:36:48] AC: I would agree. [0:36:48] GV: I was going to ask how a developer gets up and running with Ephemeral. But actually, I was going to ask more about the - I mean, because we're going to cover a couple more Tonic products in a second. I was curious, when you're getting up and running, can you just sign up to Ephemeral? Or are you signing up to a Tonic account in total? How does it work? [0:37:09] AC: They're separate right now. I mean, that's just - I think, if design and some engineers had their way, we would first develop some sort of unifying mechanism. But back to Adam's point, the only thing that matters is getting it out there, and we did not develop the unifying, the [inaudible 0:37:25] off whatever thing to allow all Tonic projects to be under one system. [0:37:30] AK: That doesn't get us more customers. It's something that's nice to have that we can add later. People will appreciate it. It's not the paradigm shift or game changer that we're constantly searching for as founders of the company. So, should I talk about our other two offerings quickly? [0:37:48] GV: Yes, fantastic. [0:37:49] AK: Excellent. So, we have two other offerings. These other offerings are actually slightly further afield from Tonic's core offering of deidentifying and synthesizing data, primarily for dev and test use cases. The first offering that I'm going to speak about is called Tonic Textual. It's kind of meant to be a stark contrast to Tonic Structural.
That's the product that we've been talking about up until this point. Tonic Textual takes unstructured text data. It could just be raw text, or it could be text sitting in a file, like in a PDF, or a Word document, or even an image, and it will identify the sensitive bits of information within that text, and then redact or synthesize the text. There are two primary use cases. The first is actually very similar to Tonic's core use cases of dev and test. Let's say you're an engineer that's built an unstructured data pipeline. That is to say, you have an S3 bucket filled with Word docs or PDFs, and you want to take all of this data and build a pipeline that eventually lands it in some type of structured form, like in a database table somewhere. So, you build this pipeline, but now you have to test it. You can use Tonic Textual to generate synthetic or fake PDFs, Word documents, text files, et cetera, to test your ETL pipeline, essentially. That's a very Tonic Structural-esque use case. It's for the same customer, the same user. It's just unstructured versus structured data. But there's a better use case for Tonic Textual. I shouldn't say better. They're both good. But there's the use case that's the reason we actually built Tonic Textual, and that's for data scientists. So, I'm a data scientist. I'm starting to get into Generative AI. I want to go build chatbots, or RAG applications, or just applications that are using my company's unstructured data as their backbone. Maybe I want to build an internal chatbot or even an external chatbot. Maybe I'm a healthcare company. I have troves of chat transcripts between patients and doctors. I want to build a chatbot that can help answer questions as the front line on my website. Well, to do that, I need to deidentify and redact all of those transcripts before I train my model, so that I prevent data leakage when it's actually being used. So, Tonic Textual is great for that use case. The form factor is - well, there are two form factors.
We have a UI where you can just upload files and download redacted versions. But then we have an SDK that fits very well into existing data science applications. That's primarily meant for data scientists that are trying to deidentify and synthesize text prior to fine-tuning and training Generative AI models. That's where we see a lot of usage as well. That's the use case that I'm personally most excited about. I very much like that use case. I think it's cool. I love that Tonic is kind of expanding into additional customer profiles. That's always exciting to see. Folks can try that out at tonic.ai/textual. That will take you to a landing page where you can learn more, and then there's a very big button at the top that says, "Create Free Account", and that allows you to create an account and try it out today. I think we give you 100,000 words per month for free, but that might change by the time you create your account, and that'll allow you to try things out and give it a go to see if it works for you. The other offering we have is called Tonic Validate. Tonic Validate is even further afield from our core offerings. Tonic Validate is an evaluation framework for LLM-backed applications. So, what that means is, let's say you're building a RAG application, for example. A RAG application is essentially an application that takes your company's data, and then, when you ask questions of your LLM, the RAG system retrieves just the necessary context from your company's underlying data to provide in the prompt to the LLM, so that the LLM can give an answer. Of course, you can't send the LLM all of your company's data, because it won't fit into the context window. So, the RAG application's main purpose is to get the most important context, so that we can fill the context window with the things that are most relevant for the question being asked. You build this application, you're happy with it, but now you've got to test it.
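Circling back to Textual for a moment: the identify-then-redact step can be illustrated with a deliberately naive sketch. Tonic Textual uses trained models to find sensitive spans; the regexes below are toy stand-ins that only catch rigid formats like emails and US-style phone numbers:

```python
import re

# Map an entity label to a pattern; matches are replaced with a typed
# placeholder, so downstream consumers still know what kind of value
# was there. (Synthesis would substitute a realistic fake instead.)
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach Dr. Lee at lee@example.com or 404-555-0142."))
# Reach Dr. Lee at [EMAIL] or [PHONE].
```

Note that the name "Dr. Lee" slips through, which is exactly why real deidentification relies on NER models rather than pattern lists.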
Well, I mean, typically with a test, it's like, okay, I give the application input, and this is the output I expect. When you're using LLM-based applications, it's a bit different, right? A, they don't have to be deterministic. And B, the quality of the answer provided by your RAG application, or the quality of the answer from the LLM, is very subjective. It doesn't have to be. It could be like, "Hey, what was the contract start date for this contract?" The answer is just going to be a date. It doesn't really change. But it could be a question like, "Hey, what's the best way to get started doing X, X, and X?" Whatever that is, right? Well, that answer is going to be very subjective, and how do you test that? So, Tonic Validate essentially hooks into your LLM-based application, and then it can evaluate responses from your large language model, and it can provide scoring metrics that tell you how the quality of this application, and the answers it provides, is changing over time. Our customers typically integrate this into CI/CD pipelines, so that every time you're making changes to your LLM application, or to your RAG application, or what have you, you're getting information about how answer quality is changing. And then you can see over time, "Hey, wait a second, quality just dipped. Let's go to that commit. Let's see what happened. Let's make some changes, et cetera." So, think of it as a test tool specifically for LLM-backed applications. [0:43:11] GV: Got it. So, I mean, I'm curious. It sounds like Tonic now has quite a fully featured product suite, if you want to call it that. But I mean, there must have been many areas that you could have explored and gone into, and it's quite focused now, down to Ephemeral, Textual, and Validate. How did you arrive at the three extra offerings? [0:43:37] AK: It's being in the right place at the right time. I answered that earlier, when you asked how I got started with Tonic.
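To make the "scoring metrics" idea Adam described a moment ago concrete: one simple shape such a metric can take is token-overlap F1 between a reference answer and the model's answer. This is a generic illustration of the answers-in, score-out contract a CI gate needs, not Tonic Validate's actual metric (which the conversation doesn't specify):

```python
def token_f1(reference: str, candidate: str) -> float:
    """Score how much of the reference answer's vocabulary the
    candidate answer recovers, balanced against how much it pads."""
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    if not ref or not cand:
        return 0.0
    overlap = len(ref & cand)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

score = token_f1("The contract starts on 2021-06-01",
                 "It starts on 2021-06-01")
assert score > 0.5  # a CI gate could fail the build when scores dip
```

Tracking a number like this per commit is what turns "quality just dipped" from a vibe into a bisectable signal, even if production-grade evaluators use LLM-assisted judging rather than word overlap.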
Well, I was in the right place at the right time. For Validate and Textual, there are different answers for each. For Textual, we talk with our customers every day. We learn what they need, where their pain is, what's missing in the market. We just got enough requests about deidentifying and redacting unstructured text that it was like, "Oh, we have to do this." And really, those requests started coming in when Gen AI became big. The need for deidentifying and synthesizing text for Gen AI purposes is clear. The use case is clear, and that's when we started seeing an influx. We were, luckily, in the right place at the right time, meaning we had a company, we had customers, we have good relationships with our customers, and they let us know what's missing. Validate was born out of a failure. We were building our own RAG application offering that we decided was not good. But while we were building it, we built a really cool test framework, so that we could see how our application was doing. We thought the test framework was so nifty that we decided to open-source it, and then build a web UI around this open-source offering, and that's what we have today. By the way, I didn't mention this, but you can check it out at tonic.ai/validate. Andrew, how did we get to Ephemeral? Do you know the history of that? [0:45:01] AC: Yes. I think, honestly, we were just looking at our main offering and asking, what would complement it really well? What else could we sell to our customers? We had the test data, but we didn't have the test data infrastructure, and we saw a gap there in the market, right? Like I said earlier, the default settings in RDS are really kind of antagonistic to being used as test data infrastructure, and there weren't that many offerings in the space for it. So, we thought this could be a place where we could add some value in the market, and that the market would reward us. So far, that's true. That's great.
Ephemeral might have been a little bit simpler in terms of its origin. It was sort of, "Well, what else do we need? And what else do our customers need? Perfect. Let's build that." [0:45:45] AK: It aligns so well with our vision of the test data space. It's not coupled tightly to our core Tonic offering, but it sits adjacent to it very nicely. Yes, you're right, Andrew. It's just, I guess, a less exciting, or less complex, backstory. [0:46:03] GV: What's the makeup of the team, especially, I'm curious, on the engineering side? You've gone from effectively one core product before. How is that team organized now? Are they distributed across the different products? Or is it not that way at all? I'm curious. [0:46:22] AC: Yes. We do distribute them among the different products. So, we have a team that's basically dedicated to Ephemeral. We have a team that's basically dedicated to Textual and Validate. Then we have teams dedicated to Tonic - Tonic Structural, I should say. Tonic Structural is still the majority, because that's where the majority of our customers are, and the majority of our revenues are, et cetera, et cetera. But it's actually kind of beneficial to have a smaller team in the beginning, because when it's smaller, everyone knows what everyone's doing. So, you can get away with less in terms of ceremony and process, and even things like - I mean, this might be anathema to some of your listeners - but even things like testing processes. With a really small team, everyone knows what everyone's doing, so people don't make mistakes as much in terms of, "Oh, I didn't even realize that feature existed." Well, you know, because your neighbor is the one who made it, and there's only three of you, or whatever. Back when Tonic first started, and it was basically me and Adam, and maybe one other engineer coding on it, we didn't have very many tests, and we didn't need them.
Honestly, the software quality wasn't that bad. It's only as the team grew, and grew, and grew, that people became unaware of the impacts of their changes. And it's not just the team growing, it's the codebase growing, right? So, the codebase grows, and you become less aware of how things are coupled in unusual ways, and that's where the gotchas start happening, and the regressions start happening. Now, of course, we have pretty robust test suites. But in the very beginning, we didn't have test suites. Also, test suites don't sell. When you're starting your company - I'm just going to throw this out as a hot take - test-driven development is not a good place to start if you don't have any customers. If you don't have any customers and you're worried about test-driven development... I don't know, I've probably made some enemies. [0:48:21] AK: Not for me, you did not. You should just be throwing as much crap at the wall as you can and seeing what sticks. That's the only approach that you should take. I'm going to add on to what Andrew said about small teams. It's probably a second-order effect. I think what Andrew said is probably more important and more relevant. But initially, when you're building a new product, you don't always know what to build. You rely a lot on feedback. That feedback, initially, is few and far between, because you just don't have a lot of conversations happening with prospects, right? You never get the attention you think you're going to get, right? So, if you have a larger team, what ends up happening is you have good engineers, they want to make a difference, they want to help, they start building stuff. It's almost never the right thing to build initially. So, when you have a larger team working on a brand-new product, people end up bolting on a lot of stuff that is just not really adding value, and it's probably detrimental, because it's code that's never going to be exercised. It doesn't have tests. It's adding to complexity.
It's like no one's winning by this code being written, essentially. But when you have a small team, you get less of that. I think that is another really nice benefit of starting small and then growing the team slowly - like, truly, grow your engineering team as slowly as you can on new products. I only like to add people when it's becoming painful. That's my experience. [0:49:41] GV: It's a great tip. Yes, I'm probably going to make some enemies as one of the hosts by saying this, but I also agree. Test-driven development on an early product is definitely a waste of time and effort. Time is finite. You should not be writing tests. You should be writing features, things that customers want. [0:50:00] AK: When your product fails to achieve product-market fit, which is 99% of all products, including the ones that Andrew and I come up with, you're not going to be thinking, "Darn, I wish I wrote more tests." You're going to be thinking a million other things. [0:50:13] GV: So, we're starting to come to the end of the episode, but I'd love to hear - when you contextualize any of these products, whether it's the newer three or Tonic Structural, what are maybe a couple of examples of customers? Obviously, maybe you can't name them, but things that you've helped with and you're really proud of? What has Tonic helped with, especially, perhaps, given the idea that people's data should have the protections of private data? What are a couple of things that come to mind when you think of proud deployments? [0:50:53] AC: Yes. I think we can name some customers, for sure, because we have them on our website. And I think the first one that comes to mind for me is eBay. I mean, obviously, it's a big impact for Tonic, because eBay is a logo, but also a big impact for eBay. We really helped them, through our subsetting technology, create a better testing environment for their automated tests.
They were struggling with being able to run automated tests against realistic data, and we really helped them solve that through our subsetting. Adam, earlier in the call, mentioned something about taking massive-scale data down to gigabytes. They really do that at eBay. Their production data is eight petabytes, and they extract from that iteratively, meaning they extract many of these one-gigabyte chunks, basically. I mean, they're not guaranteed to be a gigabyte, but they're basically about a gigabyte. And then they accumulate those in a test repository of test scenarios that they can run their automated tests against. They wrote a whole blog post on this, too. So, you can check out eBay's blog to learn more about it. But I mean, that's a big impact for everyone, right? It was a big impact for eBay. It was a big impact for Tonic. It was a big impact for the people of the world, because a lot of people use eBay. So, it made a better product for everyone. It made a better eBay product, and it also made a better Tonic product. eBay was an early customer for Tonic, and we had used subsetting in a couple of places. Going back to my earlier point about subsetting and what makes it such a great feature, the second thing I said was, well, it works, right? It works in the production context. eBay taught us a lot about how it needed to work with Oracle, in particular, and with a really complex schema with lots of databases. I mean, they have hundreds of databases, and then they also shard them, so you can think about that as thousands of databases that it's working across, which is pretty cool. [0:52:54] AK: That was really pivotal for us when that started working. My answer to this question is less detailed. I'll just say, I am most proud and happy when I see our customers using Tonic-generated data across their environments. When we have customers where essentially all the developers only use Tonic-generated data.
I just know that's a customer that is happy. They have private, safe data. It's easy to access. It's higher quality. They are finding bugs faster and more reliably, and I really like it when customers fully deploy Tonic across their organization. [0:53:32] AC: Yes. It is really rewarding. Just as a developer, I mean, putting myself in their shoes, you make so many little decisions every day as a developer. How big should this thing be? How big should this field be? How much space is this image going to take, or whatever? Having really good synthetic data helps with every one of those little decisions, and having it deployed across an organization like that really enables that, which is really cool. [0:53:59] GV: So, Adam, I think you've already helped us out a few times during the episode with the links and how to see the products and get hands-on. But, for example, if a developer is listening today, and time is precious, what's the one entry point you'd suggest for a developer who wants to get up and running with something Tonic product-based? Where's the best place for them to go? [0:54:24] AK: Oh, my God. Just go to tonic.ai. Our product page is right there. And for any of our products, you can create free trials. Anything that we mentioned today, you can go create an account right now and begin using it. These, of course, are accounts that are hosted in our multi-tenant cloud environment. You can use it, get a feel for it, and see if you like it. Then, if you like it, you can either basically use it in the cloud forever, or you can have a conversation with our sales team, and we can discuss how to get you self-hosted, meaning running in your own environments. [0:54:53] GV: Nice. That's a pretty compelling offer. So, Adam and Andrew, it has been fantastic to speak to both of you today.
I feel like I've been in the right place at the right time today, just getting to speak to you, and I've learned a lot. I'm sure our listeners have as well. Yes, I hope we can catch up again in the future sometime and hear how these new products are going. [0:55:17] AC: Yes. Me too. [0:55:20] AK: Likewise. [END] SED 1670 Transcript (c) 2024 Software Engineering Daily