[00:00:01] SF: Amruta, welcome to the show. [00:00:01] AM: Thank you, Sean. Good to be here. [00:00:03] SF: Yeah. This is actually your second time here. I really appreciate you doing this. I know you're not feeling necessarily 100% today. I like that you're soldiering through just the same to be here to talk to me. [00:00:15] AM: No. It's fun to be here. Anything for a really awesome conversation. [00:00:23] SF: Yes. Previously we discussed running secure workflows over sensitive data. And today we're actually going to tackle probably the hottest topic in the world right now, and definitely the hottest topic in the tech world, which is generative AI and how to build with these technologies in a safe and secure way. And I thought a good place to start to dip into this is to look at your background in AI. I know that you've previously worked on projects like Bing at Microsoft. You also worked on Einstein at Salesforce. But where did your background in AI start? Was that something that you studied in university? Or was this mostly experience that you gained in the workplace? [00:01:05] AM: It kind of organically started in university a little bit. Nothing crazy in AI. But when I was in university, one of my main projects was building a matching algorithm to match mentors to mentees and things like that. And it was nothing complex. A basic combination of collaborative filtering and content-based filtering algos. But that's kind of where it started. It piqued my interest in working on AI, and then it just got enhanced as I went through my work life. At Bing, I worked with the Powerset team there. It was all about natural language processing, semantics, machine learning. Then from there, I went to Topsy Labs, where that got even more enhanced because we were doing more things on the Twitter firehose. And then I jumped into the Einstein world, where we were looking at several different algorithms for several different use cases. It started kind of organically. Piqued my interest. And it's just continued through my career. It's been fun. [00:02:22] SF: Yeah. And with Einstein, what was the problem that you were trying to solve there? Was it around something related to data analytics? Because that's built into Salesforce in some fashion, right? [00:02:36] AM: Yeah. Einstein is kind of a blanket term that Salesforce uses for AI. And if you think about it, Salesforce is everywhere in every business workflow. And it was all about how can we use AI to optimize business workflows. And specifically, what we were trying to do when we were thinking of large corpuses of data in BI or analytics areas was democratize data. Because what happens is there are all these data teams, and we would always observe that executives and a lot of different folks who really need insights from this data have to send a request. Someone goes and does some analysis. It takes them a long time. The analyst goes back. And it just keeps going on and on. There's so much inefficiency there. And AI is never really put in the hands of every single person who needs it. If you think on a larger level, it was all about operationalizing AI, democratizing AI to help folks get insights from their data. Do predictions. Do all sorts of intent and trend analysis.
Because, a lot of times, people are looking for seasonality in, let's say, their own customer behavior, their product behaviors, their sales teams. All sorts of things. Looking at corpuses of data and telling them what's wrong and what's right. Those are the different types of problems we were specifically trying to solve there. [00:04:17] SF: Yeah. And you must have had to apply different AI approaches, I would assume, given the diversity of data and problems that you'd be trying to solve with Salesforce. [00:04:27] AM: Oh, yeah. Totally. I don't remember the number, but I think we did a bunch of acquisitions there as well as building a lot of models internally. And they expanded all the way from – I think there was also vision. Of course, language and sentiment. There were bots, of course, because we were trying to do personalized interactions for all these queries and conversations that were happening. Predictions, pretty normal and obvious. Intent. There were a lot of these different algorithms that we were working on. [00:05:02] SF: And was that expertise that Salesforce had in-house at that time, or did you end up having to essentially build AI teams in-house to solve some of those challenges? [00:05:13] AM: It was a mix, actually. There were a few folks internally that had the expertise. And then we also acquired a lot of companies. Because when you're thinking of AI, there are different algorithms. I mean, there are regression models. There's clustering. Different people use different techniques to solve the same problem. We acquired a lot of companies and a lot of AI knowledge came from that. A large part came in from that. Plus, there was a lot built organically. And, yes, that resulted in building AI teams right there. [00:05:46] SF: Yeah. One of the things that you mentioned when you talked about some of the initial thoughts around Einstein and the motivation around it was this idea of bringing AI into the hands of everyone. And I think that's something that we're actually really starting to see with generative AI. And in particular, things like ChatGPT. Even a couple months ago, I went to visit my parents, and my dad, who's in his mid-70s, picked me up at the airport and started asking me questions about AI. Like, it's on his radar at this point. I'm curious, when did ChatGPT first come onto your radar? And what were some of your first thoughts about it? [00:06:27] AM: It's interesting. I have these conversations with my seven-year-old and my 11-year-old now, because she's like, "Oh, I just searched this in ChatGPT." But in terms of when it came on the radar, GPT-3 had been there for some time. Because of the folks that I had worked with in the NLP world, it was already doing the circuit. I think it was late last year. Because I remember Christmas time when a lot of ChatGPT conversations were happening. And especially in Jan or – I don't remember the exact date when it opened up. And every single person could try it out, and I was like, there, someone's actually done what we were trying to do in Einstein for the corporate world, but for everyone else out here. They can actually leverage it and do something. Actually, when we were at Salesforce, we had coined the term citizen data scientists. Making every citizen a data scientist. And I was like, "OpenAI has done the same thing for every single individual here. Now each one of us is a citizen AI person, or whatever you want to call it." That's kind of when it came up.
I had known about GPT-3 for some time, and then it was ChatGPT, where they actually created that interface, that changed everything. [00:07:55] SF: And as somebody who has expertise in the space and has worked on it a long time, did you see that as something like, this is groundbreaking? Or were you more like, "Oh, I've kind of seen this. They just threw a nice interface onto it." [00:08:12] AM: Oh, no. This was groundbreaking. Because I always believed technologies exist. AI is not new. What creates change and becomes groundbreaking is how you use it, how you apply it and how you put it in folks' hands. And this was groundbreaking. And this was this big wow moment. We don't get a lot of these moments in technology history. And this was one of those. And I was like, "This is going to change everything." Like, this is going to change – it's already changed how we do every single thing right now. [00:08:42] SF: Yeah. As you mentioned, AI is not something new. Even neural networks date back to like 1957. Language models, since like the 90s. Even the transformer technology was – I don't know – six or seven years ago at Google. [00:08:59] AM: Six. Yeah. Yeah. Ten years altogether. [00:09:01] SF: It's not all new. Do you see what's been responsible for this huge explosion in AI companies, products, interest from the general public, interest from – my dad – as essentially this UI that OpenAI came out with, that really made AI just an everyday thing for anybody interacting with technology? [00:09:27] AM: I think so. And I think it's, over time, a lot of things that have compounded to the point where we are right now, right? It all starts from compute. When AI was invented, it was extremely compute-heavy. Chipsets and technology were not at a point where we could actually run these models at the kind of efficiencies and on the kind of machines that we run today. That has been a great factor in terms of bringing these to the market. The other big piece has been data, right? Over time, we just have so much more of it. Because a model is only as good as the amount of data you feed it, or how you've trained it, or how much data it's been trained on, and things like that. And I think all of these things have compounded to the moment that we are at right now, where not only do we have efficiencies from a compute standpoint, we have the data, and we have really well-built and well-trained models. And on top of that, what OpenAI did was expose it to everyone else in a way that they can use it and in a way that they are used to using it. Because everyone chats now. If we had done ChatGPT maybe 25 years ago, maybe it wouldn't have had the kind of effect that it has today. But we are so used to using Google. We are so used to chatting with people, messaging people. And it's this same interface now. And, of course, whatever you say, COVID made us digitally forward. Even our parents and everyone else have become very technologically savvy across the world because of COVID. And it's the amalgamation of all these things. [00:11:07] SF: Yeah. Yeah. I think you're exactly right. It's really the combination of the scale of the cloud, the scale of data and a variety of different innovations that have happened in the space that led to this moment in time where this feels transformative. It feels like the birth of the internet or the first smartphone or something like that. [00:11:31] AM: Yeah. [00:11:32] SF: And I think it's going to have a huge impact on the industry.
And I read recently in McKinsey research that generative AI coding support can help software engineers develop code 30% to 45% faster. Refactor code faster or document code faster. Simulate edge cases. All this kind of stuff. I'm curious, before we start getting a little bit deeper into the technology, what are your thoughts on the potential impact, good or bad, for people who are working as developers today? [00:12:02] AM: I always look at innovative times like these as a great opportunity. For developers out there, there are so many statements being made around, like, now every 2x developer is a 10x developer. And, "Oh, you can reduce the cost of developers now that AI is there." That's all great. But the way I really like to think about it is that every developer now doesn't need to spend time doing mundane things. They can actually spend more of their time doing interesting things that they want to do and start upskilling themselves, or doing that 10% that they never got a chance to do. That's kind of how I look at it. With all these copilots and everything that folks are using right now, efficiencies have just improved like anything. And it's all going to multiply and come downstream with the innovations. That's kind of how I feel. But, of course, as they say, with great power comes great responsibility. We have to make sure that it's getting used right. It's being implemented in the right way, with the right controls, and thinking of it the right way. [00:13:15] SF: Yeah. I think that's a good lead-in to start to talk about how some of these technologies work and how they might be different from other forms of AI that we've seen in the past. Maybe a good place to start that conversation is what is generative AI and how is that actually different from other forms of AI that we might be familiar with from the past? [00:13:41] AM: Generative AI, the way to think about it, is like a subset of all the kinds of AI that are there, right? And when I was talking about all these other AI pieces that I had worked on, most of those AI models are focused on solving something specific. Maybe doing some predictions. But all of that on existing data. While generative AI models, their whole purpose is generating new content, new data, right? Data that is similar to what we humans have generated or what has been there before. On a very high level, those are kind of the key distinctions between how we've thought of all these other models, let's say supervised, unsupervised, reinforcement, transfer learning, all of those things, and the generative pieces that are there. And it's mainly based on deep learning architectures and things like that. [00:14:41] SF: A large language model, or LLM, is a form of generative AI for essentially creating text. And that's what GPT is. It's a form of LLM. But how do these models work? What is the training process that leads to being able to generate something like a poem, or a response to a prompt that runs several paragraphs describing some sort of phenomenon? [00:15:06] AM: Yeah. A lot of these models, when the model is built, it doesn't know anything. And then you have to literally feed it a lot of information. And that's where they actually get what we could call natural language understanding and the ability to extract and comprehend meaning from text. And that's where you have to feed it a large amount of information.
And the way you feed information to these models is also in a specific format, like you would do for any machine. You hear about embeddings and vector databases. And I won't go into the technicalities. But basically, you're taking a bunch of data, you're breaking it down into a form that a model will understand, which is numbers, or vectors if we want to call them that, and then feeding it in. You start giving it a bunch of data. And while the model reads all of this information, imagine the huge corpus of information, a whole library, multiple libraries, if you want to really picture it. It's trying to understand which word comes after a particular word, or which character, or what comes after where I am right now. And that's basically what every generative algorithm is doing: trying to accurately predict the possible next thing that's going to come after what it has seen right now. It could be, like, what comes after an "and." And all of this while it's getting this data, it's also trying to understand. This is similar to what we've known from semantic algorithms. Think of something similar, where they are putting all of that natural language understanding in there. It's not just blanket knowing that after an "and" there's a "the." There's context. There's an understanding of what you've written and then what's the next thing that needs to be written down. The way you actually do language models is build a model, give it huge corpuses of data and train it. And then the glory of these models, and why this is huge right now, is you can actually take a model that has been trained on a large amount of data and then fine-tune it. What fine-tuning is, is basically, after a model has been what we call pre-trained, you can give it some additional, smaller data sets and say, "You know what? You have a lot of this knowledge. Now I want you to take that knowledge and apply it to this particular topic or this particular subject and this particular type of data that I'm giving you. And then I'm going to be able to use you for these specific tasks." And that's what we are also seeing happen a lot right now, where you are taking a pre-trained model or an LLM and then fine-tuning it. Giving it additional data, additional context, and letting it give you answers or generate information based on the topic that you're interested in. I probably went a little further than you asked. But that's kind of how you want to think about how these large language models work.
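To make the next-word prediction and fine-tuning ideas above concrete, here is a toy sketch in plain Python. It is only an illustration of the concept: real LLMs learn these statistics with deep neural networks over token embeddings rather than explicit count tables, and real fine-tuning updates model weights. But the shape of the process, pre-train on a broad corpus and then continue training on a smaller domain-specific one, is the same.

from collections import defaultdict, Counter

class BigramModel:
    """Toy 'language model': predicts the next word from counts of word pairs."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, corpus):
        """Count which word follows which word across a list of sentences."""
        for sentence in corpus:
            words = sentence.lower().split()
            for current, nxt in zip(words, words[1:]):
                self.counts[current][nxt] += 1

    def predict_next(self, word):
        """Return the most frequently seen follower of `word`, if any."""
        followers = self.counts[word.lower()]
        return followers.most_common(1)[0][0] if followers else None

# "Pre-training": learn next-word statistics from a general corpus.
model = BigramModel()
model.train([
    "the report is due on friday",
    "the invoice is attached to the email",
])

# "Fine-tuning": continue training on a smaller, domain-specific corpus,
# which shifts predictions toward that domain.
model.train([
    "the patient is stable after surgery",
    "the patient is scheduled for a follow up",
])

print(model.predict_next("the"))  # "patient", once the domain data dominates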
[00:18:23] SF: Yeah. And during the training phase, is there any form of supervised learning in there, where a human is involved with actually giving feedback about whether something makes sense or not? [00:18:35] AM: No. That's the big piece, when we think about privacy and the conversations that are happening right now: there is no human, right? It's all just machine. Whatever you are giving to an LLM, that's it. You can't take it back. And there's a lot of conversation right now around unlearning. And I know there's – recently, I read that, from a regulation standpoint, folks are saying if you've leaked information to an LLM that you didn't want to leak, you've got to, like, disband that whole LLM or not use it anymore. And people don't realize the amount of money, effort and time that goes into training an LLM. But there is no human involved. And what that means is there is no oversight. Privacy becomes a big concern. And all the information that you give to an LLM now, you have to be very cautious about what you're doing there. Because that information can get used at any point in time. You may say, "Oh, I gave it data. I'm not getting any answers right now." But six years from now, six months from now, two months from now, maybe you didn't get something, but someone else gets your information. That's one of the biggest concerns right now when you think of LLMs. Because it's completely unsupervised right now. There's no supervision. [00:19:58] SF: Yeah. And recently, Italy had a temporary ban of ChatGPT, I think in relation to potential GDPR violations. And then Samsung also banned employee use of ChatGPT because – there, it was a different issue – essentially, people were sharing proprietary code with ChatGPT and then asking, presumably, for refactoring or optimization suggestions or something like that. [00:20:23] AM: Yeah. Totally. I think it's all sorts of things, right? It's basically, whatever we have tried to prevent humans from doing, now machines are doing it and we don't know how to prevent it. And the best way is to make sure that you're giving machines the right information and not the wrong things. And this is exactly where we talk about privacy. And it's all about the data and making sure that you are not leaking sensitive information. Like, in the case of Samsung, right? If they had put some proper controls in from the data standpoint – let's say I'm an employee and I'm trying to give information to some AI, but they had a filter where, if I gave it something proprietary, it would automatically filter it out. Even if, let's say, I'm a good employee and I just made a very honest mistake, there was a filter that could take care of that mistake. And now I'm not leaking that to some LLM, so I'm not leaking IP that someone else can actually benefit from. That's kind of where we need to go. [00:21:29] SF: Yeah. And I've seen a few approaches to privacy and security in this space. And maybe we can take each one piece by piece and break down what problems it solves and maybe where the gaps are. That sound good? [00:21:45] AM: Yeah. Yeah. Let's do that. [00:21:47] SF: Yeah. One of the ones that I've seen people talk about, and actually had conversations about at Snowflake Summit last month with some of the Snowflake folks, is that companies like Microsoft, Snowflake, and I'm sure others in the public cloud space that are getting into generative AI and LLMs are proposing this model where it's essentially, let's train and run your own private LLM that no other company has access to. Rather than running that through something like OpenAI's APIs, you're essentially running it within your own cloud infrastructure. No other company is going to touch it. What problem does that solve and what are some of the limitations there? [00:22:28] AM: Yeah. I think it's a great model for those who can do it. Because what we're trying to solve for is you don't want to share your learnings and your data with others. And this is a very common thing we do even without AI, where even SaaS products that are serving multiple customers don't share data across these customers. And that's what these enterprises want: I don't want the intelligence from my data to be used by, let's say, a competitor or someone. It does solve for that.
It does solve for making sure that what you are training, the intelligence you are creating with your own proprietary data, stays with you. It's isolating that part. It's personal isolation, and it completely solves for that. But the limitation there is the fact that you're still giving it sensitive information. Even while you're keeping it all within your own company or within your own enterprise, you're still giving it proprietary information. You're giving it sensitive information that can be used by folks within your enterprise. And these enterprises are not small. Because, let's be honest, it takes a lot of capital to be able to even do this. It's not cheap to create these private, isolated environments and privately train your algorithms. And it takes time to even fine-tune these algorithms. But at the end of the day, the data is going in. You're not preventing sensitive data from leaking into these algorithms, and someone, somewhere, getting that information out and being able to use it in ways it should not be used. While it solves one problem, it has a large limitation as well. [00:24:20] SF: Yeah. It sounds like, essentially, this is probably a fairly expensive thing to invest resources into, because you're essentially going to be running your own LLM cluster that is private to your company. You're talking about fairly expensive cloud infrastructure to do that. Plus, you may have to build out a team to do that. And then on top of that, you have a big challenge around how do you make sure that Joe, who works in accounting, has access to accounting information, essentially, through the LLM. [00:24:57] AM: Exactly. How do you control that, right? And then the other big thing is, there are so many models out there. Do you really want to confine yourself to one single model that you have privately trained? Because tomorrow, there'll be another niche model that might be there for you to use for specific problems. And now you're going to, again, go in and bring that model into your infrastructure and train it. That's also another big limitation. And I think the biggest one is also just governing who gets to see what information and ask for what information as well. [00:25:34] SF: Yeah. I mean, that many-models point I hadn't even really considered. You're right. There's a multitude of models. I mean, there are even applications that will show you how different models perform side by side on different prompts, so you can essentially choose the right model for the problem that you're trying to solve. That's going to be a pretty heavy investment for any sort of reasonable company to both run and manage that infrastructure. Plus, have the domain expertise to do it. Plus, essentially, adapt to all the changes that are happening in the space. [00:26:06] AM: Exactly. And I think I was talking to someone the other day and I told them, I'm like, "It's similar to when we saw SaaS applications, right?" Every SaaS application has been built because it's optimized for a particular use case. They understand how that works. It's going to be the same in the LLM world. Everyone's either taking a foundation model and fine-tuning it with a ton of data for a specific use case, or they're building new models for specific use cases. And we've helped a health LLM company kind of do that and helped them train it by obfuscating some data. And we're going to see a lot of that. And you do want enterprises, and not only enterprises, every single person, to be able to leverage these things.
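A minimal sketch of the kind of "obfuscate before you train" step mentioned above: de-identifying records by replacing detected sensitive values with typed stand-ins before they ever reach a model. The regex detectors and field labels here are deliberately simplistic illustrations, not any particular product's detection logic; a production system would rely on trained entity recognition, validation and many more data types.

import re

# Deliberately simplistic detectors, purely for illustration. Real systems use
# trained NER models, checksums, context and many more sensitive data types.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def deidentify(text):
    """Replace detected sensitive values with typed stand-ins like [SSN_1].

    The stand-in keeps the type of what was removed, so a model trained on this
    text still learns that an SSN or an email appeared here without seeing it.
    """
    counters = {}
    for label, pattern in PATTERNS.items():
        def substitute(match, label=label):
            counters[label] = counters.get(label, 0) + 1
            return f"[{label}_{counters[label]}]"
        text = pattern.sub(substitute, text)
    return text

record = ("Patient reachable at jane.doe@example.com or 415-555-0101. "
          "SSN 123-45-6789 on file. Recovered well after the procedure.")
print(deidentify(record))
# Patient reachable at [EMAIL_1] or [PHONE_1]. SSN [SSN_1] on file. ...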
[00:26:50] SF: Another thing that I've seen people talk about, and I've seen some product offerings out there, is to essentially fully redact or even replace sensitive data with synthetic data before the training step, or fine-tuning, or prompting. Wherever, essentially, the data is entering the model. What does this essentially solve or not solve? [00:27:09] AM: Completely redacting sensitive data from models will sometimes solve your problem of leakage of sensitive data and make sure that, "Hey, you know what? You're not sending sensitive data to the model." It never sees it. It never gets it. But you can't just completely redact the way we normally do redaction, right? You have to redact with a stand-in. And what I mean by that is you redact information, but you still have to tell the model, "You know what? There was a name here." The model knows that the redacted information was a name. Because then, when it's trying to do its understanding, it knows that, "Oh, this information was about a name or a location or anything," right? That's kind of how you want to think about redaction. It's not just the basic redaction that we do. And when we talk about synthetic data, I've always had this notion with synthetic data. Everyone's heard me talk about it for ages. When you generate synthetic data, you're using AI. And AI models have bias. And the synthetic data that's being created is going to have bias. That is one part of it. The other piece is synthetic is not real. And when you are training these models, you want to give them real data so they can actually come back to you with a real answer to things, right? And I think that is the huge power of it. And what you want to figure out is how you can do that in an effective manner, right? Rather than using synthetic data, why can't you just basically do tokenization or stand-ins, where you can give it some information with context? And then, when the information comes back to you at inference, you are now able to say, "You know what? I understand this. This was Amruta. I can kind of put it back." At the end of the day, you're not compromising an end user's experience or even the model's understanding of things. But you're doing it in a safe way. They're okay, both synthetic data and redaction, to a certain point. But they don't really help us get the power of LLMs the way we really want to use them. [00:29:26] SF: Yeah. It sounds like, from a full-redaction standpoint, it can solve some problems. It might be necessary in certain situations. But you're losing, essentially, the contextual information, which is part of the power of the model. And then with synthetic data, so much of model quality is essentially the quality of the data. You hear people in the AI space talking about garbage in, garbage out. If you're lowering the quality of the data, then you're lowering the quality of the model, essentially. [00:29:54] AM: Totally. And you don't know. And with synthetic data, a lot of the time, when you're thinking of the amount of data that's there, there are also collisions in the synthetic data that you create. It just completely confuses the model at some point. [00:30:09] SF: Yeah. That would be bad. There's also a bunch of these companies that are really focused specifically on protecting ChatGPT, which is where the bans we saw from Italy and the bans we saw from Samsung were specific to ChatGPT.
And I believe these work by – they're typically like a Chrome extension that monitors what your employees are pasting into ChatGPT. It's almost like you're putting a firewall around it. And it still seems a little bit half-baked to me, because you could use some ChatGPT competitor, or even run OpenAI's APIs on your local host, and get around the copy-and-paste policy. There seems to be no way of really stopping people from doing something like that, from copying an internal email into one of these systems. I guess, what are your broad thoughts on this? [00:30:58] AM: It's very true, right? I mean, you cannot prevent it. And I always tell people, there is a very powerful tool out there and you are blocking people from using it. They will figure out a way to go get it. It's like a kid during Halloween, right? You're going to tell them, "Do not eat any candy." But they're going to figure out a way to sneak one in. I always say, give people the power, but make it such that they can safely use it. Because don't you want each of your employees to get more efficient with GPT? Have all your kids, and parents, and family and everyone do it? I think the Chrome extensions and all of that, sure. It's kind of a Band-Aid solution. They can remove it at any time and go around it. And I don't think that's the right approach. Sure, you can feel safe. Maybe 10%, 20% of the bad things that would have happened won't happen. But the rest is going to happen. When there is something that has to go rogue, it's going to go rogue. That's not something that's completely going to solve it. The way you want to solve it is to truly put in not a firewall, but a filter, as we've been talking about, right? As someone types anything or copy-pastes an email and sends it to GPT, let it go. But make sure that you're able to remove all the sensitive stuff from there. Let it go, let it come back with a response that this employee can use, and replace all the stuff that you removed while it went to the LLM with the real information. And I think that's what I feel is the right approach when you're trying to use these foundation models or products such as ChatGPT. [00:32:43] SF: You mentioned a couple of things through this conversation: the idea of essentially filtering, and also, I think, tokenization. Are those two things essentially the way that we do this the right way? Is this a way to keep things private but still useful for LLMs? [00:33:03] AM: Yes. I truly think so, right? You can do things three ways, I say. For example, when you're training a model or even fine-tuning a foundation model, based on the use case, you want to redact. Do what we call one-way redaction, where you remove it and put in some stand-ins to make sure your model knows what was there. Or you can do one-way pseudo-anonymization, which is tokens. And these are not LLM tokens we are talking about, which is why I'm very careful when I use the term tokens when it comes to AI. But these are stand-ins, where Amruta becomes 1234name-123478 or something like that, so that the model knows that there is something there. And then the third piece is reversible pseudo-anonymization, right? When I train the model, or when I maybe even give a prompt to the model with a lot of context – because that's what folks are doing right now, right?
I can do prompt engineering and I can send a bunch of context and use attention mechanisms to make sure that even a foundation model is able to give me a very contextual answer. But what we can do is, in that information, you basically pseudo-anonymize the whole thing. You put in stand-in tokens. Send it. And then you re-identify it. You replace all of those pseudo-anonymized tokens with the real data in that whole flow, so that the end user sees that 123ABC is Amruta when it's coming back. I think those are the three key ways, and the right way, to use data with AI and LLMs and all of that today, without compromising sensitive data, while still giving the power of AI to everyone. [00:35:02] SF: Yeah. And without having to run your own version of privacy – [00:35:06] AM: Of course. Not breaking the bank as well, right? Because those GPUs can run up a number. [00:35:15] SF: Maybe we can take a specific example. If I do something like, I type into an LLM, I send in a prompt like, where was Sean Falconer born? In this case, Sean Falconer is my identifier. What happens to that? And then what ends up getting fed into the LLM to generate the response? [00:35:38] AM: When you're typing in where you are and everything, in that case, Sean Falconer and the location that you're in are going to be replaced: Sean Falconer becomes nameABC123. Where you are becomes locationXYZ. And then that goes into the model. Now the model is going to take that information. Let's say I decided to send you a package based on where you are and I said, send Sean Falconer, who is in this location, [inaudible 00:36:10]. Tell me what's the closest restaurant to him, and stuff like that. And it's going to come back with some information to me saying, "Okay, these are the places and these are the things that you want to send." And then, as it comes back in the UI, I'm changing that nameABC123 back to Sean. What's happened is, in the prompt, I've replaced the sensitive data, so it's a pseudo-anonymized sentence. Sent it to the LLM. The LLM has not seen any sensitive information. It's generated a response. Returned it back to me. And I have taken that response, taken all the pseudo-anonymized pieces in that response and re-identified them with the plain-text, real information. So that, as an end user, I'm still getting a real response. Does that make sense? [00:37:03] SF: Yeah. Essentially, you're filtering the prompt before it reaches the LLM. [00:37:08] AM: Exactly. [00:37:09] SF: And then the response goes back out through, essentially, the reverse filter that takes the de-identified data and re-identifies it. And how does – when we're talking about the challenge with the private LLMs of controlling access – like, how do I make sure that Joe in accounting sees accounting information but Susie in customer support doesn't have that level of access? How does that work in this sort of filtering process? [00:37:35] AM: It's a great question, right? Because when the data is coming back, you also want to make sure that the right person sees the right information. And that's where governance and policies come into the picture, right? This is where you set up specific policies on specific roles and users saying, "Hey, you know what? Joe in accounting doesn't need to see my SSN." Then, even if an LLM is coming back with a response that has a token that I'm now going to re-identify as an SSN, as I'm re-identifying it, I'm going to mask it for Joe. Because, based on the policies that have been set on the data, he's not supposed to see the SSN. Maybe he's only able to see the last four of the SSN.
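A minimal sketch of the round trip just described: detect the sensitive values, swap them for stand-in tokens, send the pseudo-anonymized prompt to the model, then swap the tokens back in the response. The call_llm function is a placeholder rather than any real model API, the detection is driven by a hand-supplied list instead of a real detector, and the location is purely illustrative. The point is only that the model never sees the real values while the end user does.

import secrets

def tokenize_prompt(prompt, sensitive_values):
    """Swap known sensitive values for typed stand-in tokens before the LLM sees them."""
    mapping = {}
    for value, kind in sensitive_values.items():
        token = f"{kind}_{secrets.token_hex(3)}"  # e.g. name_a1b2c3
        mapping[token] = value
        prompt = prompt.replace(value, token)
    return prompt, mapping

def reidentify(text, mapping):
    """Swap stand-in tokens in the model's response back to the real values."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text

def call_llm(prompt):
    """Placeholder for a real model call; echoes the prompt so the flow is visible."""
    return f"Here is what I found for: {prompt}"

original = "What is the closest restaurant to Sean Falconer in Half Moon Bay?"
safe_prompt, mapping = tokenize_prompt(
    original,
    {"Sean Falconer": "name", "Half Moon Bay": "location"},
)
response = call_llm(safe_prompt)      # the LLM only ever sees the stand-ins
print(safe_prompt)
print(reidentify(response, mapping))  # the end user sees the real values again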
The response where you're re-identifying the pseudo-anonymized data is re-identified based on the policies that have been set around who gets access to what data and where. And this is also very awesome when you're thinking of regulation. Because one of the things that, interestingly, I was having conversations about two days ago, is that folks have not thought about how localization and data residency are going to play into LLMs. Because let's say I have a model in the US, but I'm feeding it information about citizens all over the world, from places all over the world. I'm not going to train different models in each of these specific locations. That data cannot leave that particular location, nor can a US employee see the response from an LLM that includes information or details of someone who's in the EU or somewhere else. And those are also the places where these governance policies come into play. Where, as I'm re-identifying it, or even as the prompt is going in, I can make sure that some of these things, based on the policies that have been set, are blocked. [00:39:35] SF: Mm-hmm. I see. We were talking about a number of things here around governance, this filtering approach, tokenization, localizing this information to make it so that you can run, essentially, a central version of your model that's de-scoped from things like data residency. How does this, I guess, relate to the work that you're doing at Skyflow and some of the things that you're doing with companies in the LLM space there? [00:40:04] AM: I think this is exactly what we are trying to solve for, right? Going with my whole original mission of making sure we give power to everyone, but give it in the right way. With Skyflow, what we are doing is, we already have a privacy platform, right? And I think that made it easier for us to build a solution for LLMs on top. And what we're doing is we are giving folks the ability to de-identify the data that's going into LLMs. And de-identify it whether it is training, whether you're building your own model, whether you're fine-tuning an existing foundation model, any of that. Additionally, we're also giving them the same ability to de-identify information in the prompts that are going in. All the interfaces. Most commonly, right now, folks are doing conversational interfaces as well. They are starting to experiment. And being able to redact information or pseudo-anonymize it in a way that it does not reach the model. And when it comes back, we are able to re-identify it, right? You tokenize. You de-tokenize. And as we do that, we also have a very, very extensive, fine-grained access control layer that allows us to apply those policies and governance things that we were talking about when folks re-identify the data. Those are kind of the big pieces that we are providing. And another big thing that comes with this, because we have the privacy platform, is regulation and compliance. Because data is data. It's not different when you're giving it to an LLM versus giving it to any other system. And you have governance and compliance to adhere to. Those are kind of the two big pieces.
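A minimal sketch of how role-based policies like the ones described above might be applied at re-identification time. The policy table, role names and rendering rules here are invented for illustration and are not Skyflow's actual configuration format; the idea is simply that the same detokenized value is rendered differently depending on who is asking.

# Illustrative, made-up policy table: which roles may see which data types,
# and how values should be rendered for them at re-identification time.
POLICIES = {
    "accounting": {"ssn": "last_four", "name": "plain"},
    "customer_support": {"ssn": "redact", "name": "plain"},
}

def render(value, data_type, role):
    """Render a detokenized value according to the caller's policy."""
    rule = POLICIES.get(role, {}).get(data_type, "redact")  # default to most restrictive
    if rule == "plain":
        return value
    if rule == "last_four":
        return "*" * (len(value) - 4) + value[-4:]
    return "[REDACTED]"

print(render("123-45-6789", "ssn", "accounting"))        # *******6789
print(render("123-45-6789", "ssn", "customer_support"))  # [REDACTED]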
And then the third thing we built, as we identified a use case similar to what you had mentioned earlier around, say, Samsung, is that there might be information that an enterprise or a company considers sensitive for them which doesn't fall under the standard sensitive fields as we know them. PII, PCI or PHI. And for that reason, we also created this concept of a data dictionary, where someone could go in and say, "Hey, you know what? All these project names, or if anyone's talking about this particular person, I want to redact all that information." They can put that in as well, so we can make sure that any interaction is completely safe. And that's kind of what we're helping a lot of LLMs in different companies solve for. And as I mentioned earlier, one of the companies that we're working with is building out a very niche health LLM, which, initially, they're going to put out there to help, for example, nurses with what they do in post-op, or anyone like that. And what we've helped them do is take all the hospital data from select hospitals, all the doctor notes. You can think of all the different kinds of doctor notes. Like, you have prescriptions. You have someone just writing a report. And take all of that unstructured data, identify where the sensitive information is, then pseudo-anonymize it with stand-ins, and then send it to their LLM so they can train it. And we also, in the process, certified the de-identified data set to be sufficiently de-identified. There are no HIPAA concerns or anything like that. That has been super exciting. Being able to create that pipeline for them. And then there are a few other customers we are currently working with who are trying to build out conversational interfaces, to be able to build that assistant that's taking in very sensitive information. You may think of it as wills, mortgage documents, legal documents, maybe bills and things like that. And being able to process it and give answers to an agent that's maybe helping a family or something, and not leaking all that information into an LLM. Because these companies we are helping don't have deep pockets. They're not going to be able to run a private instance and train their own algorithms. They're all using public algorithms or LLMs that exist. For them, we are able to create a layer that can make sure that they're not compromising the sensitive information of their customers. [00:44:50] SF: Yeah. And I imagine, for companies that are not even sure what they're going to do in the space but want to start experimenting, they're not going to start day one with a private LLM anyway. They're probably going to start with, essentially, existing APIs or off-the-shelf open-source LLMs. [00:45:08] AM: Exactly. And I think even – I mean, I'll wait to see this. But everyone who's going to start training private LLMs, they'll just be like, "Oh, why is my LLM that I privately trained not performing like this other person's LLM that they privately trained?" There are going to be all of those "my model is better than yours" situations that are going to start happening. [00:45:32] SF: Yeah. And the dorkiest, nerdiest of fights – [00:45:36] AM: Exactly. [00:45:37] SF: Yeah. And so, one of the things you mentioned in terms of compliance is the idea that data is not different for an LLM. It still needs to be compliant. But I think the big difference is that – and we touched on this earlier – it's essentially just way more complicated.
Because it's not like data sitting in a database somewhere. It's some representation of the data that's been aggregated into a model. It's not like you can just go easily delete it or something like that without blowing away the model. Yeah. [00:46:05] AM: Oh, totally. Totally. It's funny you ask me this question. Okay. Bear with me. I was trying to explain this to my daughter, and one of the metaphors that came up, and she really understood it, I was like, "Here's a whole rack of books in the library," this was when we were at a library, "go find out which books mention your name. How are you going to do it?" That's the exact same problem that you have. Because today we get DSR requests from customers, right? Saying, delete all of my information. And we have to go and figure out in which database, which log, where I have saved the customer's information, and go delete it. How do you do that in an LLM? You don't know. [00:46:49] SF: Yeah. Or it's like deleting the concept of a cat from your brain. [00:46:53] AM: How do you even do that? There is a lot of research and a lot of work happening right now on unlearning. And I'm very curious and interested to know what happens there. I'm sure someone's going to come up with some genius way to do it. But right now, we don't have it. There is no way to forget. There is no way to delete. What are you going to do? You've given an LLM your social security number. It's gone. Sorry. It knows it. You can't just take it back anymore. [00:47:21] SF: Yeah. Yeah. Well, Amruta, I think that's a good place to leave it. And I want to thank you so much for coming back. This is such an exciting time to be working in, I think, technology in general. But at the same time, it's also really important that we take a moment to slow down as an industry at times and think about the potential privacy impact that investments in things like generative AI have, and how we can still innovate, but do it in a way that protects our customer data or protects our intellectual property. [00:47:50] AM: Totally. And thanks for having me here. I think it's an important topic. And I'm always very passionate about making sure everyone's using the power of AI, but in a responsible manner. Let's make that happen. [00:48:06] SF: Awesome. Well, thank you so much. [END]