EPISODE 1640
[INTRODUCTION]
[0:00:00] ANNOUNCER: The growing use of large data sets and ML in the life sciences has created new demand for data technologies. Snowflake is a cloud-based data warehousing company that provides a platform for storing and analyzing large volumes of data. Harini Gopalakrishnan is the Field CTO of Life Sciences at Snowflake. She joins the show to talk about data challenges and solutions in biotech. This episode of Software Engineering Daily is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.
[INTERVIEW]
[0:00:42] SF: Harini, welcome to the show.
[0:00:43] HG: Thank you, Sean. Thanks for having me here.
[0:00:45] SF: Yeah, thanks so much for doing this. I'm glad we were able to make this happen.
[0:00:49] HG: Yeah, same here. Likewise.
[0:00:50] SF: You are the Field CTO of Life Sciences at Snowflake. Just to start off, what exactly is that job? What does it consist of, being the field CTO of Life Sciences in particular?
[0:01:02] HG: It's an interesting role. Basically, within the field CTO organization, there is an industry focus. There are a lot of peers of mine who take care of different industries. The idea is that Snowflake is a general-purpose technology platform. I like to say that it's like an engine, and then you've got to customize it for the car that you need to build. In my case, it's life science. The positioning that I do for my customers (pharma, small and medium biotechs, AI-powered companies) is: what does Snowflake do in the context of life science? It could be a use case-driven conversation, where we map the native Snowflake features to a specific use case and how to leverage it. It could be, in most cases, a partner-led conversation, where we have a joint partnership with another vendor, say a lab informatics vendor, or a clinical vendor, and then we are going into pharma together. What I do varies. That's the cool thing about the role: you have the flexibility to define it, but you get to speak across small, medium, and large pharmas. Then, you also get to talk to partners and technology vendors. You're keeping yourself updated and, at the same time, trying to connect that back to a use case and value prop for our customers.
[0:02:05] SF: Yeah. A lot of the time, you're working directly with a customer, or maybe with a customer through a partner, to educate them on what is possible with Snowflake to help them solve a particular problem, or something like that?
[0:02:16] HG: Yes. We also work internally on building industry-facing solutions. For example, a lot of our focus right now is around AI and generative AI, because Snowflake is new to that realm of users compared to other players. We build industry solutions internally with our extended team; that's called Polaris. Then we tell the story with the help of an actual, tangible solution, like protein folding, or clinical fine-tuning. If a customer is interested, they can choose to take it forward themselves. Yeah, it's either partner-led, or internal solution building, or directing the customer, telling them how to leverage it for a specific use case based on what we see with other customers.
[0:02:54] SF: Then what was your background prior to having this role as the Field CTO at Snowflake?
[0:03:00] HG: Well, it's an interesting journey. A long, long time ago, I was a bioinformatician. I moved from tech to actually using tech to solve business problems.
To be honest, I didn't know anything about genomics or proteomics before my bioinformatics program. I picked it up during my course, and it was the best experience for me, because I learned a lot of biology during my master's, but also how technology was solving for it. Then I was heading up a small consulting practice for research informatics. That gave me exposure to how R&D is done across pharma. It's pretty much the same patterns, the same vendors you come across. What struck me even then, 15 years ago, before cloud was very popular or widely adopted, is that there was a lot of innovation going into how people leveraged technology to solve research problems, like protein docking, or being able to model and simulate many different possibilities for your drug candidate generation. It was always ahead of the curve compared to, say, the rest of the value chain. I was doing that and learned a lot. Then I moved to industry with Sanofi, building up a technology platform for real-world evidence, a bit later in the pharma value chain. Then a short stint at a CRO called Syneos, then at AWS in the healthcare and life sciences practice, and now at Snowflake. As you can see, I've seen it through consulting, tech products, and industry, so it's more of a 360-degree view at this point.
[0:04:19] SF: Yeah. You've basically done everything. You just need to found a company now and you've come –
[0:04:23] HG: Hopefully, that's my next step. You're giving me good ideas. Yeah.
[0:04:27] SF: Yeah, exactly. I want to talk a lot about technology in this space and how cloud and AI have impacted biotech and the biosciences. But maybe to start off, just to set the context for listeners who are less familiar with this world: prior to the cloud (hard to imagine that world if you live in the world of Snowflake and other public cloud tools and platforms), what did the technology for areas like drug discovery even look like?
[0:04:56] HG: It's a good question. When I started, and I'm going to give away my age here, at least 15, 20 years ago, most of the customers had these high-performance clusters. One of the classic jobs in discovery when you walked into a pharma, and personally where I started off my journey after my master's, was administering this cluster. It's just locked in a room. People were running protein folding studies on these compute grids, and every time they wanted to scale, I remember having to go and physically replace a node, or add additional nodes, so that we could have the compute power. The language of choice back then was Perl. Now it's Python, but it was similar. We used a lot of MATLAB for machine learning, even back then, for classification and regression use cases. Long story short, a lot of the programs were Linux-based, Perl-driven, and parallelized across these compute nodes. People were already thinking about terabytes of storage, because the first human genome had been sequenced. A lot of people were doing sequence analysis. A lot of in-silico drug design, as we call it in pharma, was looking at how a drug and a protein would dock, creating new variations on a candidate, and doing that across millions of candidates, and it was already happening with the help of this high-performance computing. There were very niche software providers, like Schrodinger, which is very popular in this world for computational chemistry, who had leapfrogged the ability to create these algorithms and then deploy them on a cluster in a parallelized environment.
People were tapping into that and were able to do accelerated drug design. If you were to ask me 15 years ago, the way pharma would look at drugs is that they would go from screening millions of compounds, to a few hundred thousand, to a top 10 that went to animal testing. But going from a million to 100,000 already required a lot of computational power and algorithms, and that was happening on these clusters.
[0:06:44] SF: Yeah. I mean, I think in this space, biologists have been dealing with massive amounts of data going back to the 90s, with the Human Genome Project and projects like that. They had the concept of big data long before anybody knew what big data was, essentially, because there's just a massive amount of information that exists in the life sciences and biotech.
[0:07:06] HG: Totally, right? There are two schools. One is the biology part of it, where the human genome was sequenced and there would be a lot of sequence alignment, pairing one sequence with another, and being able to build these hierarchical phylogenetic trees, which show how a certain family of proteins relates to each other. Those already required a lot of computational power. That's on the comp bio spectrum. Then even for small molecules, a lot of these companies were doing modeling, creating new variations of compounds, tweaking a property here or there, and then seeing how that would bind to a protein. That was also happening. Yes, it was big data before it was called big data. The modality of the data was flat; it wasn't tabular. There are specific sequence formats, like FASTA files, or SMILES strings, which were pretty much flat-file representations. Being able to process that at scale was what folks were doing in research labs back then.
[0:07:57] SF: From a technology perspective, who was solving some of these technical challenges back then? Were they people with computer science degrees who were being tapped to solve some of these scalability challenges? Or people with biology or chemistry backgrounds who somehow picked up the tech skills, because they basically had to; they couldn't do their jobs unless they solved these problems?
[0:08:23] HG: Very interesting persona. At every customer that I went to in my first 10 years of consulting, when we were building up this practice, you would find a specific profile in research compared to the rest of the value chain. You would be talking to IT, but they would call it research IT. The folks there were mostly computational biologists; somebody who had done a molecular biology PhD, picked up programming as a hobby, and then moved into IT. Most of the profiles came from the business world into picking up tech to actually solve a business problem. To your point, it was easier for them to learn the technology than for an IT person to pick up the domain, because of the sheer number of constructs you had to learn: what a ligand was, what a protein was, what the amino acids behind the protein were, what protein folding means, why it is important. It was too much. It was easier, I think, for the folks who were curious to pick up technology. A lot of these applications, because of that, were not really SDLC-standard, if you were to put it in software parlance.
You would have people who had picked up programming and created this wonderful app in, say, Python or Perl, and it would be running, but nobody else could do anything with it, because they had written it themselves. It was running, it was doing its job, but there was absolutely no visibility into the secret sauce, because there was rarely any documentation or release process. Most of our time was spent re-engineering it, or taking it over, so we could start scaling it up. We pretty much had to look into the code and tweak things here and there, figure out what was breaking, and then try to create the artifacts. That was most of the time in research. To your point, that's exactly the persona: science folks who became techies.
[0:09:55] SF: Is that still the case today, or has it changed a little bit?
[0:09:58] HG: What's interesting, I think, is that technology is more accessible now. It's a bit more accessible than before. But if you look at some of these AI-based drug discovery companies (I haven't worked for one, but if you look at the profiles), you still see folks with a PhD in biophysics, or computational biology, or a certain domain, and they have always worked with technology all through their journey. I still see that, but they would definitely have a team of engineers who can actually build the foundational models, or algorithms, that can now do these things. Their lead scientists would still need to have that background and grounding in science, because there is no other way to validate the output; whatever you do, you've got to figure out if the algorithm works or not, and that depends on how well you can interpret the outcomes, which requires the science background. That said, because technology is easier (it's not like you have to open vi and write your code all the time now), I feel like a lot more technologists have the ability to pick up the domain and start tinkering with these things.
[0:10:56] SF: Yeah. From some of the companies I've spoken to recently in interviews, it seems like they have their scientists, for sure, but they also have a traditional software engineering department, people who've spent most of their career working on things that have nothing to do with biotech, essentially.
[0:11:12] HG: I think if you're a company that's providing a product, let's say you're in the business of creating something like a Schrodinger, then you need that software engineering team, because you need to be able to sell licensed commercial software that has documentation and training, and that can be reproduced in any environment. Yes. But if you're in a pharma where you're not in the business of creating a software product, where that's not your end goal, then I think you will still see a research IT that's a mix of comp bio and business people who have picked up IT, because your end goal is to solve a specific problem, not necessarily create software that's going to be sold to another company. Yeah.
[0:11:47] SF: Right. Yeah. I mean, I think developing software to solve a specific problem in research is very different from building software that you're going to, essentially, license to a consumer.
[0:12:00] HG: Yeah. That's their commercial bread and butter. Yes, it has to be robust, with grounded engineering principles. Yup.
[0:12:05] SF: Yeah. I mean, most of the things I did during my time in research, I would be very embarrassed to give to anybody as a consumer product.
[0:12:12] HG: Trust me, we have taken those over and it's fun. Some of them you'd like to refactor, but it is not easy. Yeah, totally get it.
[0:12:20] SF: Now with the cloud, we have companies like Recursion, BenevolentAI, Menten AI, BigHat Biosciences, and I'm sure many, many others.
[0:12:28] HG: Exscientia.
[0:12:29] SF: Yeah. I guess, how are cloud and AI changing the landscape of drug discovery and the other problems in this space that we were talking about?
[0:12:38] HG: I don't think six, seven years ago that term could even exist. There was pharma, then small pharma, or the biotechs, but I don't think AI-based drug discovery companies were a thing. As you rattled off, there are four or five big players who all have candidates at different stages in the clinical pipeline. Recursion, Exscientia, Insilico, who had candidates in phase one and are now in phase two, phase three, I think, that are completely AI-driven, where the target and the small molecules are designed by AI technologies. Benevolent as well. That is a game changer, in the sense that I was reading one of the interviews from Insilico's founder: four years ago, when he was asking for funds, I don't think he had a lot of VCs sign up for it, because people didn't think that you could have AI generate a molecule that could actually be taken forward into something like a regulatory setting, like a trial. That's the game changer. The fact that you can now accelerate, not just like in the past when, as I said, we were using machine learning or computational techniques to go down from millions of hits to hundreds of thousands. Now, you have the ability to design that one hit which can actually be a successful candidate in a clinical trial. That de novo design is the game changer. Now, you still need a human in the loop. You still need the humans to validate it. You still need to take it through a valid process of testing it in humans, testing it in animals. It's not to say that you're going to create something that would be sent out immediately to authorities, or get approvals for it. But it has still accelerated the pace, from where you can start with the design of a drug and take it forward to trial in three, four, five, six years, where traditionally it would take 15, 20. Being able to shrink that timeline is the biggest game changer. That's why a lot of the big pharmas are also having collaborations with these AI drug discovery companies, because any new candidate that gets into trials is a win-win for them to accelerate to commercial.
[0:14:29] SF: Where does big pharma come into play? Are they partnering with these companies, or buying the software? How does traditional pharma essentially interface with these different companies that are innovating by applying AI to speed up the timeline of drug discovery?
[0:14:44] HG: There are AI techniques that a pharma can license to accelerate their own internal pipeline. You have pharma, especially the big pharma, sitting on tons of historical data, compound libraries. One of the classic use cases, even seven years ago, when cloud was just starting up, one of the first use cases we were talking to pharma about was drug repurposing. You would have candidates and drugs that didn't work well on a certain target, but you're sitting on a wealth of information, and they could probably be repurposed for another indication, or for another disease.
The way to do that is to run simulations at scale on various different targets and use the literature to annotate and maximize the probability of success for past rejected candidates. Pharma can still leverage AI techniques to power through their pipeline. But given that there are companies whose bread and butter is that they have fine-tuned a model to the point where they can hopefully predict, with a high degree of accuracy, a compound that can bind to a target and go to a trial, pharma is looking at collaboration opportunities, where if the candidate, say, gets past a certain milestone, the pharma could then take it over for further trials, like phase three, or launch. It's like in the past, where they used to work with smaller biotechs and look at promising assets to take forward for clinical trials and thereafter. It's the same modality. You could get into a collaboration with one of these AI-based biotechs. Then the farther the pipeline goes, the pharma could decide to take that molecule through late-stage trials and then commercial launch, which today is not the bread and butter of the AI-based companies. They are really looking to create the molecule that has a bigger chance of success in trials, not necessarily go after the launch and the regulatory paperwork after.
[0:16:23] SF: Can you share some details on how people are actually leveraging generative AI across, essentially, the drug discovery pipeline?
[0:16:33] HG: Yeah. Quite a few techniques. When we think of generative AI, the most common things people think of are summarization, or chatbot interfaces. You interact with your OpenAI, it writes a poem, it writes a blog, cool things that accelerate our productivity in our day-to-day. In discovery, it's used for a lot of things that you wouldn't think of as possible. One of the common challenges is the protein folding problem. At the start of the drug discovery journey, you need to know which disease you're studying, and then there is a target, which is a protein that you want to go after. Then pretty much, you're designing a small molecule that can either activate or inhibit that protein, which means that you're trying to connect this molecule to the protein, like a lock and a key. You need to figure out the exact shape of the protein, so you can figure out which of these compounds would actually be able to connect to it, or dock to it, in what they call a docking study, or a binding study. To start doing this, the first step is to figure out the shape of the protein and what its structure is. Most of the time, that was done in the lab by crystallography, or X-ray techniques: beaming X-rays at it to see where they bounced off, then figuring out where the atoms were and drawing out the protein. It was a long, manual process. The biggest breakthrough came when DeepMind, and later Facebook, announced that they had used a generative AI technique to figure out the 3D structure from the amino acid sequence, which is the sequence that codes for the protein, like how your DNA sequence codes for a gene. They were able to look at the sequence and then predict what the 3D structure might be. That's one huge, compelling value prop of generative AI in drug discovery: you can feed a sequence in and you get a 3D output, taking months of effort down to a few seconds, or minutes. That was the biggest breakthrough.
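To make the "feed a sequence in, get a 3D structure out" idea concrete, here is a minimal sketch in Python. The endpoint URL is a hypothetical placeholder standing in for a hosted folding model (it is not a real DeepMind, Meta, or NVIDIA API), and the sequence is a toy example; the point is only the shape of the workflow: amino acid sequence in, PDB coordinates back.

```python
# Minimal sketch: send an amino acid sequence to a hosted folding service and
# save the predicted structure as a PDB file. The endpoint is a hypothetical
# placeholder, not a real folding API.
import requests

FOLDING_ENDPOINT = "https://example.com/fold"  # hypothetical hosted model

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"

resp = requests.post(FOLDING_ENDPOINT, json={"sequence": sequence}, timeout=300)
resp.raise_for_status()

# The service is assumed to return PDB text: atom coordinates for the predicted
# 3D structure, which downstream tools can render or dock against.
with open("predicted_structure.pdb", "w") as fh:
    fh.write(resp.text)

print("Saved structure with", resp.text.count("\nATOM"), "atom records")
```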
They did that with similar techniques. You look at the patterns, you look at the sequences, and you look at the fold for each sequence. Over the years, a lot of proteins have been crystallized, and those are available in public data sources. The algorithms can train on those patterns and then, for a new sequence that mimics another sequence, predict that the structure is most likely going to be such and such. That's one compelling breakthrough for generative AI. The other one that's interesting is a diffusion model called DiffDock, which I was just reading about. Traditionally, you were doing docking between the protein and the small molecule using physics, or biophysics: what is the optimal energy state that would be necessary for these things to coexist? That was a physics problem, which was converted into an algorithm. DiffDock pretty much uses the same principle you see in DALL·E for image generation, applied to see whether a protein and a small molecule can dock together. Again, it's looking at similar patterns based on patterns that have been seen, but the idea is to accelerate with some high degree of accuracy and bring successful candidates forward faster. That's another promising area for drug discovery and generative AI. I think that's really cool. I've also seen companies like Recursion use images of cells and robotics to understand how a molecule would perform inside a cell and use that to create successful candidates. I've read about more interesting stuff, but for me, the biggest ones are protein folding and DiffDock.
[0:19:44] SF: What is the training data for these different algorithms that they're developing? I mean, if they're using stable diffusion, presumably, they must have a lot of image data that they're feeding in, and then they're applying, essentially, the diffusion technique to train on.
[0:19:58] HG: I agree. Actually, I'll have to hold off on that one, because I haven't figured out what the training data for DiffDock is. But for protein folding, it was definitely the fact that there are a lot of folded proteins available in public databases, like PDB and CASP, and that was the training source for the AlphaFolds of the world. I actually haven't figured out what DiffDock's training data is.
[0:20:16] SF: Is that a problem in the space, just getting enough high-value, high-quality data to actually train these things on?
[0:20:22] HG: Yes. For example, not on the drug discovery side: internally, as I said, as part of the industry solutions we were building, we were trying to create a clinical trial protocol summarization. It's another use case that's more on the development side, the D part of pharma rather than the R. The one thing that we found was that most of the customers we were talking to were asking for two kinds of use cases in clinical, right? One is, can I create a new protocol, and the other, can I create a clinical study report summary? The challenge we had was that if you wanted to create a solution to show our customers, it was easy to pick a protocol, because protocols are available publicly on clinicaltrials.gov. But we don't have access to clinical study reports as a training data set for the summarization. Then we are dependent on the customers to train it on their corpus. Yes, that is a challenge for how well these models can work. As I said, most of these companies whose business is to create fit-for-purpose AI models would be in a collaboration with a pharma who has that corpus of data. The collaboration is not just creating the model and moving it forward; it's getting access to their data set to fine-tune an existing foundational model and then take it forward. That's how they would tackle it.
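As a rough illustration of what "fine-tune an existing foundational model on your own corpus" can look like in practice, here is a minimal sketch using the Hugging Face transformers Trainer. The base checkpoint, the toy SMILES-style strings, and the labels are placeholders, not the specific model or data discussed in the episode.

```python
# Minimal sketch: fine-tune a pretrained language model on a tiny labeled
# domain corpus for a classification-style property prediction task.
# Model id and dataset below are placeholders.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_ID = "distilbert-base-uncased"  # swap in a biomedical or chemical checkpoint

texts = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]  # toy SMILES-like strings
labels = [0, 1, 0, 0]                          # toy binary property labels

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

class ToyDataset(torch.utils.data.Dataset):
    """Wraps tokenized strings and labels for the Trainer."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ToyDataset(texts, labels),
)
trainer.train()
```

In a real setting, the corpus would be the partner's proprietary data and the evaluation would involve domain experts, as discussed above.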
[0:21:30] SF: Are they building all these things from scratch, like they're building, essentially, their own foundation model? If so, how do they balance the cost involved with doing all this from a compute power perspective?
[0:21:40] HG: It's a good question. I can't speak for the companies whose secret sauce is their AI-powered models. Most of the other companies that we talk to, like one of the small-medium biotechs I was talking to, where we were looking to predict the physicochemical properties of chemical compounds from SMILES strings. One of the foundational models they wanted to leverage is a variation of BERT called [inaudible 0:22:00]. A lot of these companies that are not in the business of doing AI-powered drug discovery are leveraging existing models on Hugging Face. There are models trained, for example, on PubMed extracts, like BioGPT, that are available. Taking that and then fine-tuning on your own corpus works better for pharma, or small and medium biotechs, that are not in the business of creating a proprietary AI solution to sell commercially. That's the route that we have also taken when we build industry solutions: look at existing models that have some medical embedding, or grounding in biomedical text, and then layer on top of that with some additional data that we can get hold of, or recommend that the customer do it on their own datasets.
[0:22:41] SF: Okay. We had talked earlier about how there's just so much data in the world of biosciences. Without the advances that we've made in cloud and the advances that we're seeing in AI, do you think it would be, essentially, impossible to make progress as a biologist, or as a computational biologist? Because it's not like you can hold this stuff in your head. You need machines to help you actually do the hard science.
[0:23:10] HG: Yes. A good example, which we didn't touch upon because it's not really generative AI, but it's still compute power, is medical image analysis. Today in a lab, one of the valid use cases is a pathologist who's looking at a lot of cell images, trying to detect if there is a cancer, whether it has progressed to a certain stage, or whether it has metastasized. They literally have to look at many, many different images of the same cells to see if there are any foreign particles, or unusual features, that could potentially help identify, say, a metastasis, right? That's a manual job. They have to look at, say, 4,000, 5,000 images manually. But now, there's the ability to do image recognition with deep learning algorithms that are available, to your point, open source. NVIDIA has made available a set of open-source libraries called MONAI for medical imaging. You could use that to look at some of these images and then predict whether a nucleus or cell falls into a specific classification, or not. That helps turn the manual effort into something that's more manageable. Now, can that be used in diagnosis? For example, can I take a chest X-ray and feed it to a model and say, hey, this person has pneumonia or not? Can I use that to diagnose and create a treatment plan? I think not. You still need somebody to validate that it is true.
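For flavor, here is a minimal sketch of the kind of image-classification helper described here, using MONAI (the open-source medical imaging framework NVIDIA backs) to score an image as normal versus suspicious. The file name, the two-class setup, and the untrained weights are placeholders; a real workflow would train or fine-tune on labeled pathology images and keep a pathologist in the loop.

```python
# Minimal sketch: score a single image with a 2D classifier from MONAI.
# Untrained weights and the file path are placeholders for illustration only.
import torch
from monai.networks.nets import DenseNet121
from monai.transforms import (Compose, LoadImage, EnsureChannelFirst,
                              ScaleIntensity, Resize)

preprocess = Compose([
    LoadImage(image_only=True),   # read the image file
    EnsureChannelFirst(),         # move channels to the first dimension
    ScaleIntensity(),             # normalize pixel intensities
    Resize((224, 224)),           # resize to the network's expected input size
])

# Two output classes standing in for "normal" vs. "suspicious".
model = DenseNet121(spatial_dims=2, in_channels=3, out_channels=2)
model.eval()

image = preprocess("slide_tile_0001.png").unsqueeze(0)  # add a batch dimension
with torch.no_grad():
    probs = torch.softmax(model(image), dim=1)
print("P(suspicious) =", float(probs[0, 1]))
```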
If you're looking to accelerate, or to highlight suspicious cases, I think AI becomes a very powerful tool for these biologists, or pathologists, to move forward faster. Otherwise, you're always going to be bottlenecked by the number of pathologists, or radiologists, you have available to do this kind of screening. In situations where you're looking for a lot of throughput and that involves manual analysis, I think AI has definitely helped. In the drug discovery world with small molecules, as I said, computational technologies have been used forever. Humanly, it's impossible to go from a million to 100,000. Impossible. That has always happened computationally. What cloud has really enabled is that, as I said, you don't have to walk up to the cluster anymore and wait for your [inaudible 0:25:12]; you have the elasticity to scale. You're able to handle petabytes of data now. People are talking about cell painting. You can also store image data of your brain cells and then process that. Just the volume and scale have increased beyond the fixed capacity that you would have been allowed to use before. I think that's the real value driver from a drug discovery perspective, besides the powering up of these AI companies. To me, the real value for the buck is in the hospitals, or in the cancer centers, where you can now crunch through images much faster. That's something that you can see even today. You can also see the number of startups that have come up who do precision image analysis. That tells you how much more real that is.
[0:25:51] SF: Yeah. Even if you could automate some parts of the diagnosis process for parts of the world where people don't necessarily have access to doctors, where there's a doctor shortage, or even specialist shortages. Not that you don't need those specialists, or a human in the loop, to finally make the decision. But if you could do that with some level of accuracy, it could really change people's lives in terms of giving them access to medical attention when they actually need it, without being limited by the number of doctors they can actually reach.
[0:26:22] HG: Totally. I would think that in areas of the world where there's a lot of data connectivity, but people don't have access to a doctor, yes, you could potentially take a picture of a lesion, or something, send that scan off, and hopefully get back a recommendation. At least something that warrants a follow-up visit, or something that tells you, yes, you might want to go check with the care center. That, I think, is very powerful if you can get there. It would not be a prescriptive diagnosis. It wouldn't be the one that says, yes, you need to get a treatment, unless you have a human verify it, for now.
[0:26:50] SF: Yeah. What are some of the challenges that still need to be solved in this space, where we could be leveraging AI, or compute, to help?
[0:26:59] HG: I think hallucination, which everybody brings up when it comes to gen AI. In general, I was talking with a traditional comp chem head, for example, about whether he would try DiffDock, the diffusion model for docking, instead of the traditional approaches, which are more free energy perturbation, the physics-based ones people talk about. The idea is that those are rock solid. They've been tested. They've been there for years. You know that you can trust the predicted outcome. With AI, can you, or can you not?
We don't know, because of the hallucination. To your point, where was the model trained? How does it predict what it predicts? Can I have traceability back to the source that led to the prediction? I think that's not yet there for people. Then the fact is, okay, I got this new protein structure, I got this new compound, and AI is telling me that these can dock, but is that reality? Is it something that's set in stone? Can I go and check if it's valid? Sometimes it may not be, because it's just pattern-based. It's not grounded in electrodynamics, or thermodynamics, right? That's where, I think, the challenge is for us: getting to a robust state where you have some traceability to control the hallucination, or to say that this outcome is based on all these things that have been seen. The other is being able to do it at scale for big pharma, because it does require a lot of expertise. Fine-tuning is not easy. Everybody wants to do gen AI, but when you go and tell them that, in order to get a high-precision, valuable, accurate outcome, you need to fine-tune on an enormous corpus of data, and it requires a lot of sophisticated data science expertise, then the skills become an issue. Probably, they're better off leveraging some of the companies that are doing it for a living. I think the third is that for a technology provider like us, it's easy to show the art of the possible. It's easy to show, hey, you can string together these components, these are the wonderful use cases you can solve, but to be able to tell that story, we need access to the data. That's where I think we also struggle: we don't have access to a lot of public data to solve some of these cooler use cases. We pretty much have to show the art of the possible. For example, in the SMILES string physicochemical prediction, we took a few data sets that were available from ChEMBL, but it would be more accurate if we had access to actual data. We then have to hand it back to the customer and let them take it forward themselves. Access to quality data is another challenge for us. The last is, as I said, the impact. As long as you're using it to accelerate your research, but you're still validating it in a regulated setting, it's fine. If you're really looking for it to be, to your point, point-of-care diagnosis, I think that requires time and investment and a lot of handholding, and experts labeling the ground truth, to be able to get to that high degree of accuracy. That requires a community effort. Then the question is, what is the community that gets behind shaping these models? That's TBD to me at this point, because you can't have one company create this great model that's then very expensive for everybody to use. You're going to get into a black-box situation.
[0:29:53] SF: Is validation, or testing, also a big problem? How do you know, essentially, that you're making the model better as you do more fine-tuning and more iterations on the model?
[0:30:04] HG: It's a feedback loop. That's why the human in the loop, to your point, that I was talking about: there are experts. For example, folks who build these protocol authoring algorithms would have a doctor, or somebody, who verifies the actual authored content to make sure that it is of good quality, versus not. That human in the loop providing feedback for fine-tuning is what improves the model quality. That's why I said, it's not easy.
If you were a pharma and that was what you were going to do, it requires experts from both sides: technology to fine-tune, but also the experts who know the outcome, to say, "Yes, this is right, or this is not," and to take that feedback to get a more accurate model. Then you track the accuracy over time. Which is why grounding it in the source is important. Say, for example, you authored a new protocol for a specific phase three study, and it suggested, based on something it has seen in a related phase three study, a small change in the exclusion criteria, for example. If you were able to link that back to the source, it's easier to validate and say, "Hey, this looks correct, and I could tweak it." Connecting it back to where you got the information from, being able to create that traceability, would also help with the feedback and validation. I think that's still TBD. Again, for the companies that are doing it as their bread and butter, I don't know how they track the quality, how they measure the outcome. They do have humans in the loop, but I don't know how they keep track of model drift and accuracy over time.
[0:31:25] SF: Yeah. Okay. Then how does Snowflake fit into this world? How is Snowflake facilitating the integration of AI and machine learning algorithms into life sciences workflows and, particularly, maybe, the world of drug discovery?
[0:31:39] HG: Yeah, totally. This is my 2 cents as a practitioner leveraging Snowflake as a technology. There was an announcement that came recently called container services. That opened up a lot of possibilities. Basically, you could containerize, create a Docker image of any particular algorithm, and bring it into Snowflake. All the gen AI components I talked about were made possible because of this announcement, because we could now bring an open-source model from Hugging Face, like Llama 2, fine-tune it, create an app in front of it, and let customers use it. One of the gaps we had, or the persona we didn't address in the past, was pure data science, or early research, where, as I told you, they were not SQL users. The data was not tabular. We needed to tackle a couple of things. One is, how do I handle this semi-structured, or unstructured, data, like a FASTA file with amino acid sequences, or a SMILES string, or an image? How do I tackle that? The second one is, what can I run to do predictions, right? I don't want to write SQL-based Python programming. Can I run native Python? Can I run R? Can I run stuff in a notebook? Can I create the inferences via an app, like a ReactJS framework, not like a Power BI? With container services, the answer is that it's possible. You can now dockerize R and run it within Snowflake. You can have PyTorch models that process images; we deployed a container so that we can process images. Images can remain in a cloud store and it can do the inference in real time on Snowflake. We have a partnership, in conjunction with NVIDIA, on protein folding. We didn't host the model, because of the GPUs needed; NVIDIA hosted it, but we were able to send an amino acid sequence, get the PDB file back, manage that PDB within Snowflake, and render it inside an app within Snowflake. The possibility of talking to this data science persona, users who need tools besides SQL, is why Snowflake is now taking that journey with R&D forward.
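As a loose sketch of the "containerize any algorithm" idea, here is the kind of small HTTP inference service you might wrap a PyTorch model in and package as a Docker image. The model here is a stand-in linear layer, and the route and port are arbitrary; this is not Snowflake's container API, just the general pattern of putting a model behind an endpoint that a platform can host.

```python
# Minimal sketch of an inference service you might containerize:
# a small HTTP endpoint wrapping a PyTorch model. Model, route, and port
# are illustrative placeholders.
import torch
from flask import Flask, request, jsonify

app = Flask(__name__)

# Placeholder model: a tiny linear layer standing in for a real trained network.
model = torch.nn.Linear(4, 1)
model.eval()

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]  # e.g. [0.1, 0.2, 0.3, 0.4]
    with torch.no_grad():
        score = model(torch.tensor(features, dtype=torch.float32)).item()
    return jsonify({"score": score})

if __name__ == "__main__":
    # In a container this would typically run behind a production WSGI server.
    app.run(host="0.0.0.0", port=8080)
```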
We are also building a lot of partnerships on that angle. Some of it is not just our story; it's about being able to bring the data in. There is no AI without data. It's about how soon we can get the lab data into Snowflake. We're also looking at targeted partnerships to connect to instrument vendors, so they can pipe the data in and we can do some of the predictions and predictive analytics better and faster. That's another thing. All of this happened with the ability to support multiple modalities with containers. There are also some native LLM features becoming available, like vector data types and semantic search. If you're looking for very domain-differentiated models, the idea would be to fine-tune them yourself, or work with a partner company that is creating an LLM that can run on Snowflake, which is another modality that we're also looking at. Yeah.
[0:34:18] SF: Even the native Python support in Snowflake has gotten significantly better over the last couple of years, which probably makes a big difference if you're working in the life sciences space, because it sounds like they've gone from Perl to Python as their language of choice.
[0:34:32] HG: Yeah, exactly. I don't know if Perl is still used. I mean, I haven't coded in a while; it was all Perl back then, and those artifacts are still around. Yeah, without Python and R, I think it's a very tough sell to talk to anybody in R&D. Then, you'd be surprised, there are still some C++ machine learning algorithms running. Being able to wrap all that into a managed container for compute is where, I think, the value prop is for Snowflake. Yeah. There are partners who are also willing to wrap that and run it on our containers, so we don't have to do it and the customer doesn't have to do it. You could also choose a partner who has done it and run it on us.
[0:35:04] SF: Yeah, the first machine learning algorithms I ever wrote were in C++. But I'm very glad to have moved on from that.
[0:35:13] HG: That's the interesting part of research: you're always surprised. There's no standard set of reporting, no standard set of tools. Everybody has their favorite language. Then, voila, there is an algorithm that's been running and doing its job well, and it has stayed that way for years.
[0:35:26] SF: You recently wrote a blog post about Snowflake and drug discovery. Can you talk me through what you were trying to show in that article?
[0:35:33] HG: Yeah, sure. As I said, that was the one I referred to, which was in partnership with NVIDIA. NVIDIA, besides the GPU chips that they sell, also has foundational models, like BioNeMo, that are focused on drug discovery. The idea was, can we call the model hosted outside and then show how you can manage the data within Snowflake? Because one of the questions we get asked is, why Snowflake, right? The power is not specifically in solving a very niche vertical problem. If I really want to do image analysis, that's not the point. The point is, how can I use this data in conjunction with other data? How can I connect clinical phenotype data back to something like omics? Stuff like that. The idea here is we wanted to show the art of the possible, where the sequence is a FASTA file that has the amino acids, managed in a Snowflake stage. From a Jupyter notebook, we call the folding function hosted by NVIDIA and get the file back, which is again a flat file, the PDB with the coordinates of the protein. Then we created a Streamlit app.
Streamlit is an acquisition we made; it's a Python library for building front-end interactive apps. We fed that PDB to Streamlit, which can understand the PDB because of a Python library called py3Dmol, and then you can actually see how the protein looks in that rendering. The idea is that, imagine you are a company that has a lot of assets, as I said, and you're creating a findability use case: you now have the ability to store the metadata of the protein inside Snowflake, including the structural information, and connect that, if you want, with the compound that docks to it. You can now progressively link from your biology to your chemistry, because we also have support for structure search via RDKit, another Python library. You can do target analysis, compound analysis, and then string that together with later pre-clinical work. It's a way of managing all of the R&D data within Snowflake. That was the pattern there.
[0:37:22] SF: You mentioned earlier that there are some partnerships Snowflake is working on to bring more data specific to the life sciences into Snowflake. Are there other product investments that Snowflake is making to help in this area, to make it easier to, essentially, answer that question of why Snowflake for these particular problems?
[0:37:42] HG: I can speak to that. On the industry side, the partnership angle is what we've been leveraging, because we're not building a vertical solution; we're more of a technology company. The idea is that we'd leverage companies that are doing a fit-for-purpose solution well, or an SI that's building that connectivity, or accelerator, to make it work better with Snowflake. Snowflake itself has certain features that we are leveraging. For example, they announced a set of capabilities called Cortex AI. Instead of the container services world, where it's bring your own model and you manage your PyTorch model, or your gen AI model, in Cortex AI it's managed by Snowflake. It lets you use summarization, sentiment analysis, and vector data types out of the box. The variation of the use case that I was talking about is, say, for example, you want a RAG pattern and you don't want complete fine-tuning. One of the things we want to try out is, can I convert each SMILES string into a chemical embedding, store it inside the native vector data type of Snowflake, and then do a similarity search to find similar compounds, instead of the traditional way of running RDKit with an index mechanism, what we call cartridges? Those are some of the technology capabilities that can be repurposed for life science. That's where, again, we are looking to build some industry points of view and solutions to illustrate it. The other is to create some medical embeddings, like in the protocol use case: create medical embeddings of the terms we see in clinical protocols and then let customers find similar protocols by similarity search. Say, phase three, HER2-positive breast cancer: do I have any information? As long as the embedding is there in your vector database managed by us, you can find all the results that are similar. Again, accelerated search. That would be some of the art-of-the-possible solutions leveraging native capabilities. For specific inferencing solutions, we would lean on our partners, or existing foundation models, like MONAI, that we can repurpose and give to customers.
[0:39:32] SF: That makes sense.
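A minimal sketch of the rendering step described above, in Python: a Streamlit page that reads a PDB file and displays it with py3Dmol. The PDB path is a placeholder, and embedding the viewer via its generated HTML is one common approach, not necessarily the exact code behind the blog post.

```python
# Minimal sketch: render a PDB structure in a Streamlit app with py3Dmol.
# The PDB file path is a placeholder.
import streamlit as st
import streamlit.components.v1 as components
import py3Dmol

st.title("Predicted protein structure")

with open("predicted_structure.pdb") as fh:
    pdb_text = fh.read()

view = py3Dmol.view(width=600, height=450)
view.addModel(pdb_text, "pdb")                     # load the PDB coordinates
view.setStyle({"cartoon": {"color": "spectrum"}})  # cartoon rendering, rainbow coloring
view.zoomTo()

# Embed the generated WebGL viewer HTML inside the Streamlit page.
components.html(view._make_html(), height=470)
```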
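And for the compound-similarity idea, here is a minimal sketch of finding similar compounds from SMILES strings with RDKit, using Morgan fingerprints and Tanimoto similarity in place of a learned chemical embedding. The query and the small library are toy examples.

```python
# Minimal sketch: turn SMILES strings into fingerprint "embeddings" with RDKit
# and rank compounds by Tanimoto similarity to a query. Toy data only.
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

library = {
    "aspirin":   "CC(=O)Oc1ccccc1C(=O)O",
    "ibuprofen": "CC(C)Cc1ccc(cc1)C(C)C(=O)O",
    "caffeine":  "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
}
query_smiles = "CC(=O)Oc1ccccc1C(=O)OC"  # a close analog of aspirin

def fingerprint(smiles):
    """Morgan (circular) fingerprint as a bit vector."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

query_fp = fingerprint(query_smiles)
scores = {name: DataStructs.TanimotoSimilarity(query_fp, fingerprint(smi))
          for name, smi in library.items()}

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```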
Then, outside of Snowflake, from your perspective, what are some of the emerging trends that you foresee at the intersection of cloud technology and AI that you're particularly excited about?
[0:39:45] HG: For me, I think I'd love to see what happens in the drug value chain, to see if the first AI-based candidate makes it further along. That would be exciting. The second thing that's interesting for me is looking at companies that are in the process of creating wearables and making patient engagement better, to your point about healthcare access, right? There are pharma companies that are trying to engage with patients much more closely through digital apps, right? They are looking for information coming back from patients, and they want to be able to understand, because one of the biggest fears for pharma is when patients drop off, when they stop being on a certain drug. Real-time exchange using mobile is something that is being looked at, to see when a patient goes off a certain first-line therapy. Is it because they find it too difficult to use, or is it because they're not monitoring their body measures well? Is that something they can push more information on? Things like that. The most recent one that I think was exciting was being able to look at probes inside your cells and see how a molecule moves inside, or why a particular drug doesn't hit the target. That requires a lot of image capture, and that requires a lot of image data to be analyzed. I'm curious to see how cloud technologies scale up to be able to do that at scale and come up with an inference, because one of the biggest challenges for drugs not working is that you don't know why it doesn't hit the target, why it doesn't bind. What happens once it's inside a specific cell? It's very difficult to trace. Cell imaging and high-content screening are going to help there. I'd love to see what happens. There are companies that are actually doing that as a product as well. Yeah, some of those things.
[0:41:17] SF: How would they send that information back to the cloud? Is it some sort of wearable that the person is using? How do they actually monitor the interaction at the cell level?
[0:41:26] HG: That's a good question. There is a probe. My understanding is that one of the customers we talked to had a probe which tracks how the particular encoded protein moves inside the cell. Then they can actually get snapshots of that movement, just like a heart monitor. Today, you have devices that are plugged into your pacemaker that actually stream data back and get results, and then it can come back to you with a prediction that there's an anomaly, or something that's not regular. Similar to that, the probe would actually stream the data back and, potentially, that's the image that you're capturing. Now, I don't know how the streaming technology works, or what the modality behind it is. That's pretty much how cell painting was described to me. That's interesting. I haven't seen that demo myself in real time. I've seen images of it.
[0:42:10] SF: Yeah, that's really fascinating. That's really taking the heart monitor idea to the next level, for sure.
[0:42:14] HG: Yeah, the next level, because it's not just data. It's more images. Yeah.
[0:42:17] SF: Yeah. Amazing. Then, as we get set to wrap up, is there anything else you'd like to share?
[0:42:22] HG: I think the exciting thing now is looking at the promise of how some of these companies are successful in turning this into more real outcomes.
As I said, everything that we talked about is exploratory. I'd love to see how that scales up to something more real. I was told that people can design vaccines faster, because you can create variations of an effective vaccine. Like in COVID, as things mutate, you can create effective variants faster and launch them in the market more easily. Again, more exploratory use cases. I'd love to see how that works. Yeah, thanks for having me, too. I know you're talking to a lot of cool companies, Sean. You'll probably let me know how things are panning out over there.
[0:42:59] SF: Yeah. You'll have to listen in. Well, Harini, thank you so much for being here. This was really, really fascinating. I think you gave a really good background on a lot of the stuff that's going on in this space. It sounds like this is really the age of biology in a lot of ways. A lot of this stuff is emerging in areas that are probably going to have significant impact in the next five to 10 years and beyond.
[0:43:21] HG: Yeah, yeah. Sure. Every interesting use case happens to be in life science, because of the fact that there are so many possibilities, as we said, in biology, chemistry, imaging. Yeah. I'm sure in five to 10 years there are going to be a lot more breakthroughs.
[0:43:33] SF: All right. Well, I'm glad we're leaving on a positive note. Anyway, thank you so much and cheers.
[0:43:37] HG: Thanks, Sean. Yeah, we'll be in touch and we'll keep you posted as well. I'll watch the rest of your cool videos.
[0:43:43] SF: All right, thanks.
[0:43:44] HG: Take care. Bye.
[0:43:46] SF: Bye.
[END]