EPISODE 1633 [INTRODUCTION] [00:00:00] ANNOUNCER: Recursion is at the leading edge of applying AI and ML to drug development. The company exemplifies a new wave of tech bio companies that tightly couple compute and robotics with biology and chemistry. The task of decoding biology requires vast amounts of biological data and innovative strategies to make use of that data. It also requires close coordination between experts across a wide range of domains, from software to cell biology. Imran Haque is the SVP of AI and Digital Sciences, and Jordan Christensen is the SVP of Technology at Recursion. They join the show today to talk about the unique data engineering challenges in biology, the growing importance of automation, reshaping the drug discovery funnel, their partnership with NVIDIA and much more. This episode of Software Engineering Daily is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him. [INTERVIEW] [00:01:06] SF: Imran, Jordan, welcome to the show. [00:01:08] IH: Thanks, Sean. Great to be here. [00:01:10] JC: Hey, Sean, nice to meet you. [00:01:12] SF: Yeah. Thanks so much. It's great to meet you as well. A fellow Canadian. Also, a fellow Stanford grad. We have a few little connections there. Let's start with some basics. Who are you? What do you do? Imran, let's start with you. [00:01:23] IH: I'm Imran Haque, Senior VP of AI and Digital Sciences here at Recursion. I've been with the company for about four and a half years. Before that, I spent about a decade in technology-enabled diagnostics, first scaling inherited disease screening and then working on building early cancer detection. And prior to that, as you mentioned, I did my academic work at Berkeley and then at Stanford focusing on machine learning for drug discovery. Being at Recursion for the last several years has been a real playground to be able to bring all those pieces of my background, the computation, the biology and the chemistry, all together in service of finding new therapies. [00:01:56] SF: Awesome. And then, Jordan, who are you? What do you do? [00:01:59] JC: I'm the Senior Vice President of Technology at Recursion. I'm responsible for all of the engineering we do at Recursion. Been at the company just over two years. My background is completely different than Imran's. Rather than actually coming from a drug discovery background, my background is more in consumer tech. Before joining Recursion, I worked at companies like Ecobee and Wattpad where we were scaling our engineering platform for all of the millions of customers that we had. And so, a lot of my work at Recursion is really helping us scale our platform and really going from the prototypes and the examples that we can build to the large scale that we need to tackle drug discovery. [00:02:34] SF: How is that transition kind of moving into more of the biotech space from maybe what would be sort of more considered a traditional software engineering or tech company? [00:02:45] JC: I think the easy way to say it is that it takes work. Wikipedia is always open. The company glossary is always open. People like Imran are amazing to just answer all of the silly questions that I have whenever they come up. But two years in, I feel a lot more comfortable and have learned a lot more about the biology and chemistry that we need to do drug discovery, and I can bring all of the other pieces to bear with the deep engineering experience.
[00:03:08] SF: And are the type of scale problems different than what you might have experienced in the past? [00:03:13] JC: They definitely are. A lot of the scale challenges we deal with at Recursion are around the volume of data that we have. It's doing things like massive batch processing of the images from our various labs and things like that. It's a lot more throughput-oriented, I would say, than latency-focused, where in a consumer site you want to try to respond as fast as possible. In this case, we can scale horizontally to just add more and more parallel things that are processing all of our data. It's really more about that scale there rather than just how fast things go. [00:03:44] SF: Mm-hmm. Yeah. I'm sure we'll get into this. But I think like when you're talking about consumer b2c, some of your scale might be around - we just have like tens of millions, or hundreds of millions, or maybe even billions of users to address. Probably not the case when you're talking about things like drug discovery and drug design. But your scale might be: we just have a really massive amount of data to process, and we have to figure out how to handle it at scale. [00:04:07] IH: The thing that's really interesting is you can get large amounts of data in both domains but the shape of that data looks really different. To what Jordan was saying, where in a b2c sort of a web-oriented system, you may have lots of data but it's coming in hundreds of bytes, maybe KBs at a time. You're trying to process really high continuous request loads. For us, a single record might be on the order of 10 Megs for a single image. And then you get them together in a plate where you're getting a thousand of them at a time and you're making a few thousand of those plates a week. The shape of the data looks really different. Both the aggregate amount in terms of the terabytes per week, the petabytes in total. But then how fat the blobs are that you're pushing gives you very different constraints on your engineering and your high-performance compute. [00:04:57] JC: And I would say that on the shape side, it's a lot more unstructured data. Whereas in the consumer web space you're dealing with JSON blobs that are coming at you from a web request. In this case, these are TIFFs. Like six-channel TIFFs, which is like a weird thing. Six channels? Why six? Why not three, like RGB or something like that? And in our world these things matter: the depth and picking up all these different color signals and things like that. It's definitely a unique challenge. [00:05:22] SF: Yeah. That's super interesting. Before we go too deep on some of the engineering aspects of this, I think a good place to maybe start is set a little bit of context for our conversation. Just for the audience, what is Recursion? And sort of what problem are you trying to solve? [00:05:37] JC: The stock answer is that Recursion is a clinical-stage biotech or a biotech company. But we actually use other words. We talk about ourselves as a tech bio company because we believe that there's a space and an opportunity for tech to lead and to really start solving drug discovery with technology as a primary driver for us. And so, Recursion's mission broadly is, as we say, to decode biology to radically improve lives. We're looking at the broad space of biology, the massive, massive unexplored space, and using that to search around and find these opportunities to really make people's lives better. And all the other things are there too.
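As a rough back-of-the-envelope check on the data-shape numbers Imran gives above (a single image on the order of 10 MB, roughly a thousand images per plate, a few thousand plates a week), here is a minimal sketch; the plate and week counts are illustrative round numbers, not official Recursion figures.

```python
# Back-of-the-envelope data volume from the figures quoted above.
# All constants are illustrative approximations, not official Recursion metrics.
MB_PER_IMAGE = 10          # "a single record might be on the order of 10 Megs"
IMAGES_PER_PLATE = 1_000   # "you're getting a thousand of them at a time"
PLATES_PER_WEEK = 2_000    # "a few thousand of those plates a week"

mb_per_week = MB_PER_IMAGE * IMAGES_PER_PLATE * PLATES_PER_WEEK
tb_per_week = mb_per_week / 1_000_000      # using decimal units, 1 TB ~ 1,000,000 MB
pb_per_year = tb_per_week * 52 / 1_000     # 1 PB ~ 1,000 TB

print(f"~{tb_per_week:.0f} TB per week, ~{pb_per_year:.1f} PB per year")
# -> ~20 TB per week, ~1.0 PB per year, in line with "terabytes per week,
#    the petabytes in total".
```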
We talk a lot about how we bring technology to bear to reduce the cost of it. We'll often talk about how everybody that's listening to this podcast knows Moore's law. But if you spell Moore backwards, it's Eroom's law. And what's happening in drug discovery is drugs are getting more expensive and slower to develop rather than getting faster and cheaper like Moore's law would suggest. One way of interpreting Recursion's mission is that we want to fight Eroom's law and we want to change the slope of that to make drug discovery faster and cheaper so that we can really change people's lives. [00:06:48] IH: I think that's exactly right. And Jordan's framed the high level of what we do, I think, exactly right. That we are in the business of decoding biology to radically transform lives. Today that means how can we discover new drug treatments as rapidly as possible? We have focuses in rare disease, in oncology, or cancer, in neuroscience. But I think the how we do it is also potentially really interesting for folks listening. If you look at traditional pharma, if you look at traditional biotech, the order of those syllables, biotech, is really true. You'll see that folks are trying to transform how they do discovery. But like biology is first. Chemistry is first. Tech is very much in a supporting role. And I think one of the things that Recursion has uniquely done has been able to hold both of those sort of at par. And I think like a lot of our secret sauce relates to how we can balance the biology and the chemistry with the technology, engineering, machine learning in order to make those go, and like make those go in a tight cycle. And we think that that's really the root of that tech bio transformation, and that combination of data, compute and experimentation is how we think we're going to drive that impact. [00:08:05] SF: And one of the things that you mentioned, Jordan, when you were talking about Recursion was this idea or I guess like truth that, despite progress and a lot of investment in the world of drug discovery, the price of drugs has continued to go up. And the timeline to actually bring a drug to market has also continued to go up. Why is it that the space that has so much innovation going on in it has essentially that problem? [00:08:30] JC: I think a lot of people just haven't cracked the code on doing that. And I think that's a lot of what Recursion is doing is we're starting from technology-driven approaches that let us process a massive number of experiments, where we produce more data in our labs than any of the other companies that we aim to compete with. And bringing solutions and technologies to bear that allow us to process that data and to get insights from those experiments is something that is shockingly uncommon in this space. It feels like there's tons and tons of opportunity. And Recursion and some of our peers are the companies that are really pushing the envelope there to bring and solve the problems that way. And really a lot of it does come down to the ML-first or the technology-first kind of approaches. Because in a lot of those spaces, we'll see a lot of companies that are using AI and ML but they're not using it in the way that we're using it. They're not using it at the massive scale. They're not using it to do things like develop what we call maps of biology that let us explore all the untouched and unsearched space that exists in this massive world. And so, until other companies start doing those things, we're not going to see a lot of that progress showing up.
[00:09:41] SF: You mentioned ML, AI and both the investments that Recursion is making there as well as other companies. And it feels like a lot of the industry is sort of moving in that direction. But how do you see essentially what is the potential impact of AI or ML like transforming this industry? Where are you seeing essentially improvements to potentially reducing the cost of drugs as well as speeding up the timeline to market? And where do you think this is going five years down the road? [00:10:10] IH: I think the impacts, we're seeing them all through the pipeline. And I think I'd like to highlight one of the things that Jordan mentioned, right? Is that there are different ways that you can approach the problem and different ways you can think about using machine learning or AI. The place that it's made the largest contribution for us is actually in the development of methods to acquire and interpret richer experimental data, and more of it, than people have been able to understand before. If you look at where the money goes in drug discovery and development, it's most expensive to fail later when you're in clinical trials. And for the most part, when you identify a particular target that you'd like to go after for a drug, we have tools to identify whether they're oral therapeutics, whether they're antibodies to go after a particular target. That binding problem, it's not solved but we can often find ways to deal with it. The biggest challenges that you have are when you get into phase two, phase three where you're trying to see not is it safe but did it actually do anything? And often, you'll find, well, you spent an awful lot of money, an awful lot of time and the biology that you thought you understood isn't actually the biology that matters for the disease. And so, the way that Recursion has oriented its machine learning focus in the past has been primarily to crack that nut of how do we collect the right kind of experimental data at scale? Relate those data points to each other in order to understand the biology going on inside the human cell, going on inside the human body. And then to use that to drive the downstream path that we think will have a higher likelihood of success. Now, more recently, we've been investing in a number of other areas. For example, investing in machine learning for chemistry to make it faster to actually identify the compounds that might turn into drugs. Investing in things like language models both to identify novelty of insights with respect to the literature. But also, to assist in things that frankly you probably shouldn't have a human doing. When you do a filing with the FDA, you're talking about literally hundreds of pages, some of which may be entirely boilerplate. And there are probably ways to do that using sort of conventional software as well as AI models in order to accelerate that process and make it more efficient to be able to generate these. Now, underneath the hood, you still have to make sure that you're doing all the experiments with the right level of rigor and so on. And so, we don't compromise anywhere in terms of safety, in terms of thinking about whether we're actually going to get efficacy in a broad population. What you want to do is use AI as an enabling technology to either access larger scales of data than you could by hand or to roboticize processes that ought not be done manually.
[00:13:00] SF: Do you think that if we essentially shorten the cycles on some of these sort of robotic types of tasks, like you mentioned being able to fill out this FDA document, and other parts where you don't necessarily - where you can essentially automate and take away some of the human load, we might actually have more time to spend on rigor around the drug design and the other parts that actually lead to failure in drugs, and it'll actually lead to essentially a higher rate of success? [00:13:27] IH: Absolutely. And that's actually the entire way that we've oriented our discovery process. In particular, we've oriented our discovery process around what we call a series of industrialized workflows. If you think about the way that a traditional target-oriented pharma might initiate a program, you'll have somebody at the company who sees an insight from the latest scientific paper and says, "Oh, maybe we want to go after this particular target that I saw in a paper." You'll stand up a whole new experimental system potentially around that target, screen a bunch of molecules. At every step, you'll have humans looking at that data identifying the hits that they like, picking a handful to advance. You'll have highly trained medicinal chemists who are then trying to identify what are the next molecules that we ought to make to design around the space and try to improve it. And so on and so on. There have been a number of revolutions in both information technology as well as physical technology that allow us to change that. And so, now, today at Recursion, actually big chunks of the step that might take like a year or more at a traditional pharma, we can do in a completely autonomous manner. Where we'll use genetic data, we'll use LLMs in order to identify novel insights from these maps of biology that we've built. From the compounds that we've screened that can be put in relation to literally everything in the genome that we have screened. And then automatically use algorithms to pass these through new rounds of experiments, filter out the ones that didn't work. Automatically go through chemical catalogs that can be up to like 36 billion molecules in scale now to automatically optimize those molecules, get next rounds. Until at the very end, you identify a pool of programs all of which may actually be worth spending human time on. And then you're going to have those highly-trained medicinal chemists, toxicologists, biologists only spend their time on those particular programs, using human insight, using human time and intelligence in the places where they're most relevant. [00:15:27] SF: One of the things you mentioned also when you were describing some of this was around using machine learning to help acquire the data and how that is essentially a differentiating factor in what you're doing. Can you elaborate on that? What do you mean exactly around acquisition of the data? And then how are you sort of using that? [00:15:45] IH: Yeah. Totally. Jordan mentioned earlier that one of the things that's different about our data as opposed to other places you might see is that we get an awful lot of unstructured data. One of the largest sources of data that we have on the platform is this cellular microscopy assay that we do. We grow human cells in these 96-well plates that are about - or, sorry, 1,536-well plates that are about the size of your cell phone and have 1,536 very small wells in them.
We can dose each of those wells with a different genetic knockout, a different small molecule, and literally take pictures of the cells and what they're doing. Now that makes for very pretty pictures. But what does it mean? The really cool thing is that we've been able to train extremely large machine-learning models on these. We actually recently put out a paper that's a spotlight at a NeurIPS workshop on a multi-hundred-million-parameter model that can actually identify biologically meaningful relationships in those phenotypes. For example, we can tell if knocking out a particular gene causes a certain effect, and that dosing another compound might cause a similar effect. And so, you can actually derive these kinds of relationships. That's a kind of thing that you wouldn't be able to do without the machine learning to power that. Because looking at these images by eye, you'd either see nothing or you'd actually see irrelevant signals. [00:17:05] JC: Those millions of experiments that we run every week, each one of those wells is one experiment. We take those really high-resolution pictures of those and then we use our deep-learning models to turn those into high-dimensional representations. And we can compare those high-dimensional representations with other high-dimensional representations from other experiments. And what those help us do is find areas of biology and chemistry where those relationships aren't known. And that scale doesn't exist anywhere else and would take humans millions of years to try to find these kinds of relationships. And we just do that purely by loading larger and larger amounts, petabytes and petabytes of images, into our deep-learning models, asking them to learn the relationships and then using those to find the areas of biology that we want to explore. [00:17:51] IH: In particular, one thing that's really interesting is that anytime you run one of these screens, most of the compounds you look at are not going to be relevant for the particular program that you have. And so, in a traditional setting, you just throw all of that away. It would be waste. Because of the embeddings that we learn from our machine learning models, we can actually make use of all of that data in future programs. And in fact, there are a number of programs that we've initiated off prior molecules that weren't relevant at the time but were interesting in a new context. And so, the idea here is that we take this data that other companies would treat as exhaust, and it actually continues to fuel our discovery. And that's really enabled by those models. [00:18:30] SF: And besides some of the innovation that's happened over the past decade or so in machine learning and AI, especially with deep learning and some of the things that we're seeing around generative AI now, there's also been lots that has happened in the world of compute and storage. You mentioned essentially petabytes of data that you're processing. And then you need large compute to essentially train and run these models. Can you comment a little bit about how you've leveraged essentially other types of technologies in this sort of step function that we've had? I'm sure you're using some form of the cloud to be able to do this at the scale that you're talking about. [00:19:07] JC: Yeah. We benefit from all of the things that are happening in the industry. In some ways, there are places where we want to innovate, like in these machine learning models, and places where we're customers of those.
And so, things like cloud storage. We need to store several petabytes of data. We could go and buy racks and racks of hard drives. In some cases, that's actually the right thing for us to do. But a lot of the times, leveraging blob storage in Google Cloud allows us to store that. And really like take our eye off and stop worrying about that storage. The resiliency that you get with modern blob storage just means that, once our data is there, we don't have to worry about losing it. We don't have to worry about failures. Those kinds of things. We still have to worry about cost and we still have to worry about optimizing that storage. And so, we spend a good amount of time thinking about when are we going to need this data? What's the right way to store it? What are the right formats, and encodings and things like that? Generally, leveraging all of the cloud to do that has been super successful. I think where that stops being successful is actually really interesting. And that tends to be more in the machine-learning world. And what we found is that the promise of the cloud that anything is at your fingertips and you want to spin up hundreds of CPUs or thousands of CPUs and do that computation, those things are still true. But as soon as you change that C to a G and you want to spin up hundreds and thousands of GPUs, they're actually really hard to get and really hard to come by. And so, one of the things that's been great for us is we've had our own supercomputer since I think 2020 and we've been able to leverage that to keep our AI moving forward. And we haven't been slowed down and stuck in ways a lot of people have as GPUs have become scarce. Whether that was because of crypto or whether now because of generative AI. We've had our own cache that we can work on. And we've recently talked about our relationship with NVIDIA and we've talked about expanding that supercomputer. And that's really just us continuing down that path where, unfortunately, the cloud isn't delivering on the promise that modern AI and high-end generative AI needs. And so, companies like Recursion that are owning their compute are going to be able to win where other companies are going to really struggle. [00:21:10] SF: Yeah. I think a lot of companies are probably suffering with GPU starvation right now. You mentioned NVIDIA there. Can you talk a little bit about the partnership that you have with NVIDIA? [00:21:18] JC: Yes. NVIDIA has been a partner of ours since well before I started at Recursion. They were part of the team and the partnership there to help us build that initial supercomputer. We're fortunate enough to have had them invest $50 million in us in the summer, around July. And now we are working with them to build that next-generation supercomputer. And those are the obvious things. But one of the great things about working with NVIDIA is that they know high-end large-scale AI backwards and forwards. And so, when we go from 128-dimension embeddings to 1,024-dimension embeddings and everything gets slow, and everything needs more compute, and everything takes longer and is more expensive, they can actually come and help us optimize those models and try to make them run more efficiently so we can still keep going on the path that we're on and we don't have to make compromises effectively on these really high-end models.
And so, that partnership extends not just to their compute, which obviously is how a lot of people think about them, but also to the tools and technologies that allow us to optimize the use of that compute. [00:22:17] SF: Mm-hmm. Yeah. And their experience. [00:22:20] JC: Totally. [00:22:20] SF: We're talking a lot about the amount of data that you're sort of analyzing and processing and the size of these models. And I think like when we think big data, big data existed in biology before we even called it big data. Going back to like the Human Genome Project in like the 90s. Essentially, the data that exists in the space is massive compared to like what most people are sort of typically dealing with. Given the amount of data that you need to process in biology, and particularly in fields like drug discovery, without AI and other forms of automation, do you think it's possible to really like innovate and push the science forward? [00:22:58] IH: I'm biased. But I think no. This goes well beyond Recursion, right? I mean, the reason that I am in this field is because I think that it's not even just the quantity of data. It's actually the complexity of biology. There's a famous paper that was - I think it's like could a neuroscientist or could a biologist fix a radio, right? And it sort of ridicules some of the approaches that biologists take at unpacking problems compared to how an electrical engineer might fix a radio. But what it misses is that the radio is a human-designed artifact where we know where everything is, and why, and what it does. Whereas biology has for hundreds of millions, billions of years been proceeding by a process of agglomeration and saying, "Yeah, that's good enough." And so, you end up with systems that are not monofunctional, that are not feed-forward. It's all dynamic feedback loops. They're mostly not understood. And so, if you sort of do a five-whys thing - well, why do we need AI to get all this data? Why do we need all this data? It's because most of biology is actually unobserved, right? And it's actually unknown. And so, the goal that we have at Recursion is to collect that data about a broad range of biology and then to be able to stitch that together not just with embedding models that relate cellular phenotypes, but broader foundation models of biology to understand how that system is put together and how it behaves. And so, simply no. I don't think that there is a way to get past that without advanced computation on the back of an awful lot of experimental data. [00:24:34] JC: And to your point, Sean, on the Human Genome Project and things like that, one of the things that Recursion realizes that we need to do is not just look at a single modality. It isn't just about images of cells. We know that sequencing is part of it. We know that all of the parts of all the different layers of the data need to come together. As we think about how AI plays in there, naturally, these things don't relate to each other. Looking at an image and looking at a readout from a sequencer, they don't tell you how they should be connected. And so, we need to use AI to connect these pieces of data together. And that's something that humans just have no ability to interpret on their own. If we didn't have AI as a tool, we wouldn't be able to leverage these multiple platforms. We wouldn't be able to make the advances that we're expecting to make. [00:25:15] SF: Mm-hmm.
I want to dig into sort of some of the technology that's going on behind the scenes a little bit. Can you talk a little bit through what's going on essentially in this AI pipeline? Sort of the makeup of the tech stack, what tools and/or concepts and technologies are you leveraging? [00:25:33] JC: When I joined Recursion, the first article that I got sent was something called the boring technology club. And I think a lot of your listeners will probably be familiar with that. But it's the idea that, like, when it comes to where you choose to innovate and what cool new technologies you use, you only have a limited number of tokens to use there. And you just can't use all the bleeding edge things. Really, a lot of our stuff is boring. We use Postgres as a database for a lot of our data that's structured at a small scale. We use Google BigQuery for things that are larger and we need to do large analytic workloads on. We use Google Cloud Storage for storing the vast majority of our data. We use Kubernetes for running a lot of our workloads. And those are just areas where we're not trying to differentiate on the use of those technologies. We need to be good at them. And as the boring technology club talks about, we need to develop mastery in those things. And we need to know what's good, and what's bad, and what are the rough edges and what are the smooth edges that we need in those tools. And as we get into things like workflows and stuff like that, Apache Airflow, again, not exciting technology. It's the same one that everyone would have chosen by default. But we're choosing those ones because they're the ones that people know. The ones that are fit for purpose. And the ones that enable us to do the work that we do. When we get into the AI side of things, we get into a little more leading-edge things. But we use PyTorch for developing a lot of our models. We use Slurm on our cluster to actually distribute that compute, which is actually more common in sort of the bio space and less in a lot of straight technology spaces where people might use Kubernetes instead. But we've actually found Slurm is a better fit for purpose for what we do there. And we just move a lot of data around using sort of standard tools - if people had to guess what the tool was, they'd probably be right. It's not that exciting. But that's what enables us to be effective: picking the important technologies. [00:27:18] SF: You mentioned Slurm. What is that? I haven't heard of that before. [00:27:22] JC: Slurm is - if you think about Kubernetes and all the things Kubernetes does, at its heart, it's a scheduler. You have like a job you want to run. You say how much memory you need, how much RAM you need, and it finds a node to put that pod on. Slurm is very similar to that, but it's designed for batch workloads. You just say I want to run this workload. I want to run 10,000 copies of this workload with slightly different parameters. Each one of them needs an A100 GPU. Each one of them needs two gigs of RAM. And each one of them needs two-quarters of a CPU. And it will find that on a cluster and run that work for you. And in some ways, it's a lot simpler to use than having to build Docker containers to do that. You can just literally hand it a Python script and it'll do that, or pretty much anything executable. And it's a lot more common in academic circles, for example, where a lot of the folks that are working at Recursion have come from.
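To make the Slurm array-job pattern Jordan describes concrete, here is a minimal, hypothetical sketch; the script name, submission flags and resource numbers are illustrative, not Recursion's actual pipeline, and Slurm's --cpus-per-task only accepts whole CPUs, so "two-quarters of a CPU" becomes one.

```python
# process_well.py - hypothetical worker script for a Slurm array job.
# It would be submitted with something like (flags illustrative):
#   sbatch --array=0-9999 --gres=gpu:1 --mem=2G --cpus-per-task=1 \
#          --wrap "python process_well.py"
# Slurm then runs 10,000 copies, handing each one a distinct index through
# the SLURM_ARRAY_TASK_ID environment variable.
import os


def main() -> None:
    # Each copy of the job reads its own index from Slurm.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    # Use the index to pick this task's slice of work, e.g. which plate or
    # well image to process on the GPU assigned to this task.
    print(f"worker {task_id}: processing its slice of the batch")


if __name__ == "__main__":
    main()
```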
And it really solves similar problems but fits very well in our workflows. [00:28:17] SF: And you mentioned a few times generating embeddings and essentially doing high-dimensional similarity comparison. Are you using some sort of vector database technology behind the scenes as well? [00:28:26] JC: We've been exploring vector databases. Every time we look at it, we're like, "This is really just what GPUs already do in a lot of cases." Imran and I were actually doing a project earlier this year where we were experimenting with, should we use one of the vector databases or should we just, like, load the vectors into memory and then tell the GPU to do the cosine similarity across the whole matrix? And just using the embarrassing amount of compute the GPUs have now made it so much faster and so much easier than dealing with a lot of vector DBs. Now there's a problem there. At some point you have too many vectors to sit in memory. And we know that at some point we'll get there. But we think we're still quite a ways away from doing that. And just doing it the old-fashioned way is working out really well for us. [00:29:06] SF: Yeah. By the old-fashioned way, are you essentially able to load all the vectors into memory and basically, like, brute force compare against all of them, so you don't need to take advantage of the branching and sort of cutting down of the search space that you get from a vector database? [00:29:18] JC: In this case, you're doing large matrix multiplies, right? And so, you're not doing it quite in a "brute force way". Because the GPUs have really optimized methods for doing those matrix multiplies. Really, all cosine similarity searches are matrix multiplies. And so, we do those. And we do those at scale. [00:29:34] SF: Okay. And then the company started, I believe, in 2013. It's been 10-plus years at this point. What has been sort of some of the evolution in terms of the machine learning and AI approaches that you're taking? Of course, all the hype around gen AI wasn't happening, well, even 12 months ago. I imagine that you were using maybe more sort of classical predictive models a few years ago. And I'm sure some are in the works today. But I just wanted to kind of look at how has that changed? Has that toolchain changed over time? [00:30:06] JC: Our founder, Chris Gibson, in his PhD program - the idea of looking at these images, and that computers can do that better than humans, was part of his original PhD thesis. But when he was doing that, that was using a lot of sort of hand-generated and hand-tuned features. And there was an awesome open-source package called CellProfiler from the Broad Institute at MIT that actually helped generate a lot of these original features that were used in these original models. But a few years ago, and I'm thinking it's about four years ago, three, four years ago, Recursion decided that those hand-tuned features were no longer the right way for us to go and we had to start looking into deep learning. Our original deep-learning models were based on semi-supervised learning where we would train a model based on binary features of the images that were there and sort of cut off the last step of that model, and that's what we generated those embeddings from. A lot of what we've been doing to advance since then has been iteratively improving that and enhancing that. And we've run a Kaggle competition that helped us improve that further. And that's been the workhorse for our AI and ML in that phenomic space for a lot of Recursion's history.
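To illustrate the in-memory search Jordan describes a moment earlier - normalize the embeddings, then a single matrix multiply yields every cosine similarity - here is a minimal PyTorch sketch; the corpus size, dimensionality and variable names are illustrative, not Recursion's production code.

```python
# Brute-force cosine similarity as one matrix multiply on a GPU.
# Sizes and names are illustrative only.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# A pretend corpus of one million 1,024-dimensional embeddings,
# plus a small batch of query embeddings.
corpus = torch.randn(1_000_000, 1024, device=device)
queries = torch.randn(32, 1024, device=device)

# L2-normalize so that a dot product equals cosine similarity.
corpus = torch.nn.functional.normalize(corpus, dim=1)
queries = torch.nn.functional.normalize(queries, dim=1)

# One (32 x 1,000,000) matrix multiply scores every query against the corpus.
sims = queries @ corpus.T
top_vals, top_idx = sims.topk(k=10, dim=1)  # 10 nearest neighbors per query
print(top_idx.shape)  # torch.Size([32, 10])
```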
It's only really been in the last year where we've realized that - well, it's been for a little while that we realized it wasn't getting better and that those deep learning models were not going further than what we were seeing from them. And adding more data wasn't helping us. And so, it was this year when we started getting into training foundation models, and those papers that Imran mentioned are the outcomes of those, where we're really seeing what we call the scaling hypothesis come to life. Where adding more data, adding more parameters, adding more compute allows us to solve problems where fancy algorithms were not working anymore. [00:31:41] IH: That's right. And I think going beyond that, what Jordan has outlined is sort of the history of our models in biology, or biological images specifically. A big theme of 2023 for Recursion has been our investment in chemistry. And particularly, machine learning models in chemistry. And so, we actually acquired a pair of companies in May of this year, Cyclica and Valence Discovery. And we've taken Cyclica's models for identifying molecules that might bind to targets in the body and scaled that up by probably four or five orders of magnitude. We announced running a calculation of 36 billion molecules against 15,000 proteins in the body. You can figure it out - 10 to the 14, 15 different calculations there. And with Valence Discovery, which is now Valence Labs, part of Recursion, we're investing in everything from synthetic generation - synthetic data generation - to both discriminative and generative approaches to chemistry. As well as research on autonomous agents working on drug discovery and causal inference as well. Really an explosion of investments in different parts of the machine learning space in this area. [00:32:54] SF: What's usually the input to these models? Is it generally image data? [00:32:57] IH: There are a range of things. For the kinds of embedding models that Jordan was talking about, those are image data. And there are a handful of different imaging modalities that we've applied. The most common one today is, as you said, these six-channel TIFFs that are somewhat odd for the classical computer graphics person. But we've experimented with other imaging modalities as well. For the chemistry data, we can take in experimental data. We can take in things like structures of proteins and small molecules. One area that we haven't talked about actually is there's a lot of video data that we work with. We actually have digitally instrumented cages in our vivarium for animal studies. And using machine learning on those videos and digital biomarkers, we can really rapidly accelerate the progress of animal studies on the drugs and identify safety issues or efficacy earlier. [00:33:45] SF: And then how do you go from, say, video or from these images to the embeddings? What are you essentially pulling out? Is there essentially some form of feature engineering going on there? Or is there something else? [00:33:57] IH: We let the deep learning do the feature engineering for us, right? As Jordan described, the first generation of those embedding models were weakly supervised models, like supervised on a handful of classification tasks, where you just chop off the classification head and take the features that are going into that last stage, right? With the newer generation of autoencoding models, how do you come to the embeddings? Well, there's this complex stack of like tokenization, and transformer layers and so on.
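A minimal sketch of the "chop off the classification head" pattern Imran describes for that earlier generation of weakly supervised models; the specific backbone (a standard ResNet-50 adapted to six channels), the input sizes and the untrained weights are assumptions for illustration, not Recursion's actual architecture.

```python
# Minimal sketch: extract embeddings by removing a classifier's head.
# The backbone choice and sizes are assumptions, not Recursion's models.
import torch
import torchvision

# Start from a standard CNN and adapt its first layer to six-channel images.
backbone = torchvision.models.resnet50(weights=None)
backbone.conv1 = torch.nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Replace the classification head with an identity: the network now returns
# the features that used to feed that final layer.
backbone.fc = torch.nn.Identity()
backbone.eval()

with torch.no_grad():
    batch = torch.randn(8, 6, 256, 256)  # 8 fake six-channel well images
    embeddings = backbone(batch)         # shape: (8, 2048)
print(embeddings.shape)
```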
But at the end of the day, we're not actually trying to do any particular feature engineering, right? We are training these models on particular proxy tasks that it turns out induce really good performance on biological benchmarks. We're training them. We're benchmarking them and then moving them into production. The features that come out are not in themselves interpretable. If you asked me what dimension 124 meant, I wouldn't be able to tell you. But because I can put that 1,000-dimensional embedding in relation to the knockouts of the rest of the genome, I can tell you biologically what it looks like it's doing. [00:34:57] JC: And those proxy tasks are actually really cool. There are some great images that we've circulated, including in some of the papers, where we take the outcome of one of our experiments and we literally hide 75% of the image from the encoder and from the model and ask it to reconstruct it. And it does a shockingly good job of doing that. And those kinds of proxy tasks allow us to really feel confident that it's not guessing at this. It's actually understanding what is important in these images and how to reconstruct them. And that gives us - by taking the output of the encoder before the decoder on the other side, that gives us the embedding that allows us to compare these things in this high-dimensional space. [00:35:32] SF: Yeah. That's really cool. What do you think the limitation or challenge is with AI in the space today? Is it access to enough quality data? Limitations in terms of not having enough GPUs? Where are the sort of limitations of some of these approaches right now? [00:35:47] JC: Our CTO says we need three things. We need compute. We need data. And we need people. And I think all three of those limitations exist broadly in the space. You need access to high-end GPUs that can deal with the massively large models with hundreds of millions of parameters. You need the data sets that allow you to encode, and understand and train in the massive space of biology. You need the talent that can build these models, train these models. It's not just the scientists that design the models or the ML scientists that design the models. But it's also just the engineers that need to handle things that run across hundreds of GPUs. And when there are failures, deal with those and things like that. The talent is a huge part of the story too. [00:36:29] SF: Do you think that AI will give insights into the biology that - or maybe it already is. That, like, would be impossible with humans just doing the work? We kind of touched on the idea that we can't really push the science forward without some of these things just because of the sheer complexity of the data itself. But can we essentially leverage AI to develop novel techniques that maybe a human wouldn't even have been able to think of? [00:36:53] IH: I think that's definitely the case, right? My view on this is AI and associated methods are not a magic wand. They're a tool, right? And so, if you turn the clock back 300 years, you could ask the question, "Are there insights in biology that you wouldn't be able to get if you hadn't built a microscope?" And the answer is sort of obviously yes, right? There's a world of things that we weren't able to see. I think that AI as a tool, like, serves a number of different roles, right? In its role of thinking about these embedding models, it essentially lets us build a better microscope, right?
Where instead of seeing just what the picture looks like, you start to understand the biological relationships of what's going on underneath the hood, right? If you think about the way that you can apply it for chemical interactions and chemical design, estimates vary. But a reasonable estimate of the number of possible small molecules that fit sort of the criteria for what could be an oral drug is something like 10 to the 60th. Literally, there are more possibilities there than there are fundamental particles in the universe, right? And so, being able to intelligently search such super-exponential spaces on the basis of limited data is an algorithmic problem and is a learning problem, right? And so, I think if you think about AI as a set of tools or a set of enablers to understand and explore the complexity of what's going on, then it's indisputable that there are things that we are going to learn with that assistance that we would not have been able to find otherwise. [00:38:24] SF: And then in terms of the sort of predictive power of some of these models that you're using, how accurate are they? Is it okay, essentially, where, in the world of drug discovery, a lot of drugs fail to ever reach market? As we've mentioned, we have all these sort of inefficiencies around go to market. Could you have something that actually fails 80% of the time, or is not necessarily that accurate, but actually still has significant impact in this space? [00:38:49] IH: If we had a method that at the outset failed 80% of the time, we would be doing about 20x better than anybody else in the industry. Because typical failure rates are more like 98% to 99%. 80% failure is fantastic, right? And the way that we talk about the kind of impact we want to make, the kind of pipeline that we want to build - a common illustration of the drug discovery funnel looks like a V. Where you have a lot of candidate programs at the top and then they sort of get filtered out through discovery, through phase one, through phase two. What we talk about is wanting to build a pipeline that looks more like a T, where you explore an absolutely enormous number of potential ideas at the very top. You accept that a lot of them may be false positives but you're able to filter them out extremely fast so that only the ones that remain actually make it all the way through, right? And so, to that end, there is still a sequence of experiments that we go through, in vitro, in vivo and so on. And we have to use those in order to filter out the things that won't work. But the goal is to use that data to build better predictive models. To bring that V into more of a T-shape. [00:39:58] JC: And we talked a little bit earlier about the different modalities that exist. And those things are different kinds of experiments that sort of layer on each other. One thing might have a certain false positive rate. Another thing might have a different false positive rate. And when we layer those together, we get the best of both worlds. And so, the things that pass through them are the ones that progress on in our process. [00:40:17] SF: Fantastic. And could you talk a little bit about some of the work you did to predict the effectiveness of drugs using AI models versus the sort of classical clinical trial approach? [00:40:26] IH: Yeah. Absolutely. I think the best example of this is some work that we did in early 2020. For anybody who doesn't remember, there was this big disease that went around at the end of Q1, beginning of Q2 2020 called Covid-19.
Really bad, right? Caused a lot of sickness. Caused a lot of mortality. We didn't have any treatments for it. And so, this seemed like the kind of thing where Recursion felt a moral obligation to try to work on this. What we did was we were able to very rapidly stand up experimental systems looking both at Covid-19 infection as well as late-stage severe Covid. This is something that was enabled by those AI-driven platforms. The fact that we didn't have to build a whole new experimental platform to test a new disease - we just plop something on and immediately start screening - really enabled this. If you skip forward to the end, this work took us a couple of months. I think the viral screen we started in March. We pre-printed the results in late April. We wrapped up the results and published the results for the late-stage screen in May or June of 2020. That's obviously a lot faster than you can run a clinical trial. We are and were a small company. And so, we couldn't run those trials on our own. We put the results out there. If you look now, three and a half years later, of 10 of the molecules that we made predictions on for which large-scale trials were run and published, we got the correct answer nine out of 10 times. And so, that's the kind of result that gives us a lot of confidence that the signals that we're seeing on the platform really are ones that are going to translate to successful prediction in human trials downstream. [00:42:03] SF: And then if that sort of future plays out, will we see a reduction in the number of clinical trials that need to be run? What are the sort of consequences of being able to do this level of prediction automatically? [00:42:17] IH: I mean, I think it would be fantastic to have fewer trials that fail, right? Trials are expensive not only in a logistical and financial sense, but these are real people who were in the trials. Some of whom may suffer, like, ill effects from the drug. Some of whom have hope. But the hope doesn't play out because the drug doesn't work. I would love to be able to build towards a future where you're in a trial and you have a much higher belief that, like, this is actually something that is going to make you and make others better in the future. [00:42:46] SF: As we start to kind of wrap up a little bit, what do you think the hard problems are that still need to be solved? I mean, I'm sure there's lots of them. But like what is kind of the big, gigantic or biggest sort of gnarliest problem that you think would really help unlock this world? [00:43:03] JC: Imran mentioned the size of the chemical space. The 10 to the 60. And being able to search that space is a huge, huge challenge. Because you have to search it not just on the chemical structure. You have to search it on toxicology. You have to search it on efficacy. There's a whole bunch of things on absorption. How it actually interacts with the human body. And so, I think one of those problems, and one that we're excited to try to solve, is how can we reduce that space? How can we more accurately search in that massive high-dimensional space? And right now, those still feel very intractable, but they're problems that we're spending a lot of time trying to noodle out and trying to get better at. [00:43:39] IH: I think that's right. And to take it one level deeper, you can think about this as an algorithmic problem of modeling which compounds might be interesting and building effective models to that extent.
I think there's sort of a meta layer to that around how do you direct the right experiments to learn the right next things - to take the sort of active learning approach. And one of the areas that I'm personally most excited about is how we can combine software and ML systems with physical systems in order to do this kind of autonomous discovery. [00:44:11] SF: Yeah. We didn't even really - we didn't have time to kind of get into some of the challenges that, I would imagine, exist between sort of running some of these experiments in the computing world and then taking them to a wet lab. And then, also, all the processing that has to happen from the wet lab back to a data science team to kind of figure out what's going on there. A lot of noise. A lot of potential for human error and so forth. [00:44:32] IH: Absolutely. And Jordan manages a massive amount of automation in order to drive that scale, in order to reduce the scope for human error. But physical systems are physical systems, right? And there's an awful lot of work that has to go into building and controlling those. [00:44:47] SF: Is there anything else either of you would like to share? [00:44:50] JC: One thing I'd just say is that being on this podcast is a huge honor for me. I was a listener years and years ago when the previous host - when Jeff was hosting it. And I've been following it for years. And so, I'm super excited to be here. And I love what Jeff created and where it's going still. [00:45:04] SF: Yeah. Thank you. That's fantastic to hear. It's always great to have people on as guests that were listeners as well. And I think it makes for great interviews, and they understand what they're kind of stepping into. [00:45:13] JC: Totally. [00:45:14] SF: Imran, do you have anything you'd like to share? [00:45:16] IH: I think as a parting thought, it's something you addressed a little bit in the introduction with Jordan. But I see this in the data science side as well as the software engineering side. There's a lot of intimidation that people feel about making the jump into a field like biotech. Like, "Oh, I don't know any biology." Or, alternatively, "Oh, well, I would rather be designing software things for engineers." I would encourage folks to rethink that. There are an awful lot of problems in biology, in drug discovery, that either you don't need to know biology to solve in order to drive a huge impact, or you can learn that biology and you can make that happen. And there is a great deal of value and mission, I think, in being able to put, you know, those skills to a higher use. To being able to build these systems in order to make people healthier. [00:46:08] SF: Yeah. I would think that as a software engineer, it'd be incredibly rewarding to work in a space where you're hopefully delivering something that's going to help somebody potentially survive something that could be life-threatening, or even feel better, or whatever it's going to be. Being able to deliver more of this to market faster so that we can essentially have a healthier future - that's much more rewarding than perhaps other types of things that we could build with our time and our engineering skills. Fantastic. I think that's a great place to leave it. Imran, Jordan, I want to thank you so much for being here. I think this is one of the most exciting spaces in the world of AI. We can all get excited about our coding assistants and co-pilots. I think that stuff's cool too.
But I think being able to use AI, machine learning to really have an impact on healthcare and drug discovery is just an incredible thing to see. And I'm super excited about where Recursion is going and some of the work that you've already been successful with. Thank you for being here. [00:47:04] IH: Thanks so much for having us, Sean. [00:47:06] JC: Thanks, Sean. It's great. [00:47:07] SF: Cheers. [END]