EPISODE 1745

[INTRO]

[0:00:00] ANNOUNCER: CRISPR is a powerful tool in biotechnology that allows scientists to precisely edit genes, much like editing lines of code in a computer program. Just as developers can remove or alter specific parts of code to fix bugs or enhance functionality, CRISPR enables researchers to modify DNA to correct genetic disorders, improve crops, or develop new treatments. The development of CRISPR-based editing was recognized by the Nobel Prize in Chemistry in 2020, awarded to Emmanuelle Charpentier and Jennifer Doudna. Profluent Bio is an AI-first protein design company that recently developed OpenCRISPR-1, an AI-generated, CRISPR-like protein that does not occur in nature. Importantly, the company also released the protein and nucleic acid sequences for OpenCRISPR-1. Aadyot Bhatnagar is an ML scientist at Profluent Bio and previously worked at Salesforce. He joins the podcast with Sean Falconer to talk about OpenCRISPR-1 and how it was made. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.

[EPISODE]

[0:01:18] SF: Aadyot, welcome to the show.

[0:01:19] AB: Hi, Sean. Thanks for having me.

[0:01:21] SF: Yes, absolutely. Thanks for being here. So, in preparation for our conversation today, I was reading about the work at Profluent, which I'm sure we'll get into. But in particular, the work that you're doing around using AI to design proteins to help treat genetic diseases. I couldn't help but feel really happy to see AI being used in such a meaningful, potentially impactful way. There's a lot of hype around AI right now, particularly with large language models. A lot of the work, I think, is focused on chatbots, and nothing against chatbots. They're fine, but they're not necessarily that exciting or life-changing for people. But the work that you're doing is actually potentially life-changing. So, what's the background on the company, and how did you first get involved?

[0:02:06] AB: Yes. So, I guess I'll start with my story and how I found my way to Profluent. Before I was working at Profluent as a machine learning scientist, I was at Salesforce AI Research, where I worked on a pretty wide range of more traditional machine learning problems. While there, Salesforce had a few different moonshots, of which Ali, our CEO, was leading one, where he was really working on this idea of using language models to understand and generate proteins. I think he decided that this was an idea that was worth pursuing to a much greater extent, and so he brought me along as one of his initial set of recruits from Salesforce. So, since joining Profluent maybe a year and a half ago, I've been working primarily on sequence-based modeling. With the company more broadly, our goal is to use these AI methods, which I can get into more later, to really unlock protein design for various therapeutic, industrial, agricultural, and many other use cases, in ways that have previously been really hard to achieve with rational engineering and design, and really leveraging the power of these newer models to do a lot more than was previously possible.

[0:03:24] SF: In terms of the work that you're doing around sequence-based modeling, can you describe, for those that are maybe less steeped in this world, what does that mean? What is sequence-based modeling?

[0:03:33] AB: Yes, so by way of analogy to natural language, which is, I think, where a lot of these ideas come from.
In natural language, we can model language as a sequence of word, or more generally subword, tokens. So, for a bit of biological background, proteins are these large molecules that drive a lot of important biological processes. These proteins are comprised of small biochemical building blocks called amino acids, which we can think of as a sort of alphabet. The way that these protein language models work is that they're often large transformer models which basically model proteins as a sequence of amino acids, similar to how large language models like ChatGPT model natural language as a sequence of subword tokens.

[0:04:23] SF: Okay. And then, in terms of those models, in order to do this in the context of basically being able to identify or create new protein sequences, are you building those foundation models specifically for protein sequences from scratch, the way that you would build something like Llama 2 or Claude, where the corpus is basically text found on the Internet?

[0:04:45] AB: Yes. So, for OpenCRISPR specifically, we were using, as our baseline, a model called ProGen2, which you can think of as an equivalent of a Llama 2 or something, where this is a large, pre-trained, autoregressive, generative protein language model that's been trained on a wide diversity of proteins found in nature. In order to actually make this usable for generating CRISPR proteins for gene editing, we needed to first curate a really large dataset of CRISPR proteins, as well as various data associated with them, and actually fine-tune these general language models on the CRISPR data specifically.

[0:05:27] SF: Okay. So, you're starting with something and then you're essentially fine-tuning it for the specific use case.

[0:05:33] AB: Exactly.

[0:05:34] SF: Okay. OpenCRISPR-1 is described as the world's first AI-created, open-source gene editor. Could you explain how AI was used in the development of the system, and what makes that approach particularly unique compared to other CRISPR methods?

[0:05:49] AB: Yes. So, in terms of the overall AI approach, the core driver was a generative protein language model, which I just described. When we ask this language model to generate a protein sequence, we'll have some random sampling procedure, and the result, when you compare it to any natural protein found in the wild, will have hundreds of mutations and is incredibly distinct from anything else that you'll see in nature, or even anything that's been patented previously. Compared to prior work, using these generative models is the first time that anyone's been able to get highly functional proteins that are so distinct from anything that's found in nature. Doing this manually or using previous techniques has just proven infeasible.

[0:06:42] SF: Are they infeasible because of just, essentially, the scale of the data and the complexity of the analysis?

[0:06:48] AB: I think infeasible because, for a bit of background on these CRISPR proteins, they are incredibly complex systems that basically scan through all three billion nucleotides in the entire human genome, navigate to a single specific site, and make a cut in the DNA at just that site. There's a lot of complex machinery that needs to work correctly in order for all of this to happen. Essentially, if you make even one bad mutation, you can break the entire protein.
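To make the amino-acids-as-tokens analogy above concrete, here is a minimal Python sketch of autoregressive protein generation. It is illustrative only: the uniform next-token distribution is a hypothetical stand-in for a trained protein language model (such as the fine-tuned ProGen2 described above), not Profluent's actual model or API.

```python
# Toy sketch: a protein as a sequence of amino-acid "tokens", generated
# autoregressively, one token at a time. The distribution here is uniform;
# a real protein LM would condition on the prefix with learned probabilities.
import random

# The 20 standard amino acids, one letter each: the "alphabet" of proteins.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def next_token_distribution(prefix: str) -> dict:
    """Hypothetical stand-in for a trained model: P(next amino acid | prefix).
    Uniform here; a real model's output would depend on the prefix."""
    p = 1.0 / len(AMINO_ACIDS)
    return {aa: p for aa in AMINO_ACIDS}

def generate_protein(length: int, seed: int = 0) -> str:
    """Sample one amino acid at a time, appending each to the prefix,
    exactly as an autoregressive language model samples subword tokens."""
    rng = random.Random(seed)
    seq = ""
    for _ in range(length):
        dist = next_token_distribution(seq)
        tokens, probs = zip(*dist.items())
        seq += rng.choices(tokens, weights=probs, k=1)[0]
    return seq

if __name__ == "__main__":
    print(generate_protein(50))
    # Note the combinatorics discussed next in the episode: with 20 options
    # per position, even a 100-residue protein has 20**100 possible sequences.
```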
So, previously, when you had people making mutations based on, often, biological hypotheses, sometimes data-driven hypotheses, it would be a few mutations at a time, at most tens of mutations away from a starting point, whereas these methods are able to go hundreds of mutations away. The reason this is so difficult is because the search space is inherently combinatorial, where you have 20 different options at each position. And every time you make a mutation, it could just break the protein completely.

[0:07:54] SF: In terms of OpenCRISPR-1, is that one sort of monolithic, fine-tuned model that you're operating with? Or are there multiple AI models at play there?

[0:08:02] AB: Yes. So, I guess this is a distinction I'd like to draw: OpenCRISPR-1 actually refers to the molecule that we designed. This molecule is what we've chosen to open source and make broadly available for ethical use in gene editing. However, the models that we used to generate it, at this time, we have decided not to open source. But to your other question there, there was a broad family of models that we used in order to actually design these proteins. So, there was the generative model that I mentioned, but then there was also a host of other models that were predicting various properties of these generated proteins, in order to essentially make sure that they were usable in the way that we wanted them to be, in a biological sense.

[0:08:48] SF: Okay. In terms of training those models and getting them ready for this, what does the training process look like? What does the data pipeline look like?

[0:08:58] AB: Yes. It's a good question. I think the data pipeline is one of the most important pieces. We have a really strong bioinformatics team at Profluent who curated the world's largest resource of CRISPR-associated proteins, as well as various metadata around these proteins that's important for determining where in the genome they'll target. So, we first had just a really large database of these CRISPR proteins themselves, which we used to fine-tune the generative language model. But then we also fine-tuned various encoder-style language models that have bi-directional attention, with some prediction head on top, to do property prediction for these various biological properties that we wanted to control for.

[0:09:45] SF: Where does the original source material for that come from? When you're curating, where do they actually originate from?

[0:09:52] AB: Yes. It's quite common that people will go out into nature and take samples. This might be from the soil, from sea water, from the human gut, and just all sorts of places. And with how cheap DNA sequencing has become, they will just sequence everything they find in these samples, which will contain DNA and genetic information from a vast diversity of life, and these data are then just put into large, publicly available repositories. There is a lot of processing work that needs to be done to actually get them in a form that's usable for training models, but that's where the data starts: a lot of people have just collected samples from all over the world and, yes, sequenced the DNA in them.

[0:10:45] SF: In terms of pre-processing this data to get it ready for being used to train the models, are you using any off-the-shelf tooling there? Is there a lot of custom work that you need to do in order to condition the data for that?
[0:10:59] AB: I can't speak to all of the details, but I can say that there were a few off-the-shelf tools that were used. But when they were used, they were strung together in really custom ways, and there was a lot of custom logic in the processing pipeline.

[0:11:13] SF: Okay. Then in terms of the molecule that you open-sourced, OpenCRISPR-1, what is the advantage, I guess, of open sourcing? Why was that decision made?

[0:11:22] AB: Yes. So, fundamentally, Profluent believes that CRISPR medicines are a really flexible tool that have the potential to transform the lives of patients with many different debilitating genetic diseases that have, to this point, proven incurable. As an example, in 2023, the first CRISPR therapy was approved for sickle cell disease, and there was a lot of excitement around this approval, both from the sickle cell community itself, as well as, I think, more broadly from the biomedical community, around the potential of CRISPR as a tool to cure other genetic diseases as well. The big downside, however, is that this initial treatment, just it being the first and having required a lot of work to get to this point, costs patients millions of dollars. Our hope in open-sourcing OpenCRISPR-1 is to start to democratize access to gene editing technology, to lower the cost and increase access for both research and, potentially, future therapeutics.

[0:12:31] SF: Okay. Makes sense. Then, going back to the task at hand of integrating machine learning and this gene editing process, what were some of the big challenges that you faced trying to do that? Was it mostly around getting the right data and getting it ready to train the models, or were there other aspects that were hurdles that you had to work through?

[0:12:55] AB: Yes. So, broadly speaking, one of the big challenges is just the complexity of the CRISPR gene editing system, as I mentioned. When we were starting out with generation of these proteins, we had a certain set of biological properties that we wanted them to match. Specifically, there is a benchmark gene editor called SpCas9, and we wanted our generated proteins to be compatible with certain properties that SpCas9 had. So, there were first a lot of open questions around how we could even do that with the models that we had. In addition to this, I think there were a lot of questions around how to actually select the right proteins that we wanted to send to the lab. Because we can very easily generate millions of different candidate proteins, and many of them likely will not work, there's a major question of: if you're generating, say, a million proteins and the lab is only able to test 100 of them at a time, how do you actually allocate those 100 in a way that maximizes your chances of success, while also meeting all of your other design criteria?

[0:14:04] SF: Yes. I guess, well, I'll put that question back. How did you do that, eventually? How did you solve that problem?

[0:14:10] AB: Yes. So, this is where some of these property prediction models came in. One result that people have found in the protein language model community is that the likelihoods that these models assign to proteins, both natural and the ones they generate, can often be a reasonable proxy for how good that protein is, like whether it's going to work or not. We also used various other datasets that we had in order to train supervised models that would give us auxiliary data to make those predictions.
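As a rough illustration of the likelihood-as-proxy idea Aadyot just described, here is a hedged Python sketch that ranks generated candidates by their log-likelihood under a model. The function model_log_likelihood is a hypothetical placeholder (uniform probabilities plus noise), not a real API; in practice the score would come from the fine-tuned autoregressive protein language model.

```python
# Toy sketch: score each candidate protein by model log-likelihood and keep
# the top K for the lab. Likelihood is a proxy for function, not a guarantee.
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def model_log_likelihood(seq: str) -> float:
    """Hypothetical stand-in for sum_i log P(residue_i | residues_<i) from an
    autoregressive protein LM. Uniform base probability plus noise, so the
    ranking below has variation to sort on."""
    base = len(seq) * math.log(1.0 / len(AMINO_ACIDS))
    return base + random.gauss(0.0, 1.0)

def rank_candidates(candidates: list, k: int) -> list:
    """Keep the k candidates the model scores as most 'natural-protein-like'."""
    return sorted(candidates, key=model_log_likelihood, reverse=True)[:k]

if __name__ == "__main__":
    random.seed(0)
    # Stand-in for "generate a million, test 100": a pool of random candidates
    # and a small lab budget.
    pool = ["".join(random.choices(AMINO_ACIDS, k=30)) for _ in range(10_000)]
    shortlist = rank_candidates(pool, k=100)
    print(len(shortlist), shortlist[0])
```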
Then we also made some efforts to hedge our bets, essentially, where we wanted to make sure that the proteins we selected were all sufficiently diverse. Basically, we didn't want to have 100 proteins that were all one or two mutations away from each other. So, we combined all of these techniques together to both hedge our risk and explore a large enough search space that we felt confident we would get at least some strong hits.

[0:15:13] SF: In terms of the GenAI model development, how did you actually do testing to make sure that, as you're evolving the model, it's actually moving in a way that leads to better results?

[0:15:27] AB: Yes. I think this is a big open question for the community at large. I think in this case, because we were fine-tuning on a domain-specific dataset, we were using a lot of standard measures for fitting and generalization that you see in the NLP community. So, this would be, for instance, perplexity on a held-out validation set, to make sure that the model was learning something useful about CRISPR proteins that was generalizable. Then there were also some benchmarks based on supervised datasets that we essentially compared our model's performance against, to see if it was able to do well on the supervised datasets as well.

[0:16:05] SF: Do you have to handle any kind of versioning of the model, and potentially even have a regression where you have to roll it back to a prior version?

[0:16:13] AB: Yeah. In this case, we decided to choose a particular version of the datasets we were training on, and stuck with those datasets for consistency while varying different model training parameters and selecting the best hyperparameters based on perplexity and such metrics. But yes, for this project at least, we didn't train multiple versions on multiple versions of the data, though the datasets do continue to grow over time.

[0:16:41] SF: Okay. Do things like lineage or explainability matter in this domain? If you're generating these proteins, do I need to know where they came from? Is the black-box result fine, or do I need to understand where that result actually came from?

[0:16:56] AB: Yes. There are some computational methods to understand the protein itself. So, one that's quite popular is to use a model like AlphaFold to predict the structure of your proteins. I think these methods have limitations, but if you have someone who has this biophysical expertise, they can often analyze these generated structures and at least get some sense of whether this protein is reasonable or not. But ultimately, the only way you know if these things work or not is if you test them in the lab, and that requires an experimental team to really define experiments that test the hypotheses that you're interested in testing.

[0:17:39] SF: In terms of the team that was working on this, did people end up surprised by the results? Were there novel things that came out of this that really shocked the team, that no one expected?

[0:17:53] AB: I think we were pretty surprised at first that just using this autoregressive language model generation technique actually yielded proteins as highly functional as they were. We went into this project basically trying out a few different things, and we weren't sure which ones would work or not. I think we were all pretty surprised at how capable these language models seemed, especially given that, prior to this work, no one had really applied them to a task as complex as these gene editing systems.
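For readers curious how the "hedge your bets" diversity selection described a few turns above might look in practice, here is a minimal Python sketch of greedy farthest-point sampling over mutation distance. This is an illustrative technique under stated assumptions, not Profluent's actual selection method, which also weighs model likelihoods and predicted biological properties.

```python
# Toy sketch: choose a lab batch whose members are far apart in mutation
# count (Hamming distance), so we don't spend the budget on near-duplicates.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length sequences,
    i.e., the mutation count separating them."""
    return sum(x != y for x, y in zip(a, b))

def select_diverse(candidates: list, budget: int) -> list:
    """Greedy max-min selection: repeatedly add the candidate whose nearest
    already-selected neighbor is farthest away."""
    picked = [candidates[0]]
    while len(picked) < budget:
        best = max(
            (c for c in candidates if c not in picked),
            key=lambda c: min(hamming(c, p) for p in picked),
        )
        picked.append(best)
    return picked

if __name__ == "__main__":
    random.seed(0)
    pool = ["".join(random.choices(AMINO_ACIDS, k=25)) for _ in range(200)]
    batch = select_diverse(pool, budget=10)
    # Smallest pairwise distance within the batch: larger means more diverse.
    print(min(hamming(a, b) for a in batch for b in batch if a is not b))
```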
[0:18:27] SF: Who's ultimately the end user in this scenario that's actually running these models and then getting the output?

[0:18:33] AB: Yes. For the models themselves, we've largely been using them for internal use, and really it's machine learning scientists, engineers, and protein design scientists, who are fairly familiar with many of the details of these models, who are the ones actually using them to generate these proteins and then select the ones that we think are worthy of wet lab characterization. The handoff is of the sequences that are generated from these models, as opposed to the models themselves.

[0:19:05] SF: In terms of bringing therapeutics or drugs to market, where is Profluent in that life cycle of a treatment? Once you've done your work, how long, and how many people are involved, basically, before that becomes something that actually is impacting an individual?

[0:19:25] AB: Yes. I'd say right now, given that we're such an early-stage startup, we're really focusing on the earliest stages of drug discovery. There's a long road after those initial stages and characterizations in human cells, but it's a necessary step to actually start off that process. Yes, I think what we do versus what others do will evolve as time goes on, and I think there's no way to tell the future exactly. But yes, right now we're focused on the early stages.

[0:19:54] SF: How do you see this technology being adopted more broadly for therapeutic applications?

[0:20:01] AB: Yes. I think these protein language models have been shown to be quite powerful. I think there are a couple of other groups who've been working on leveraging them to do protein design. One of the directions that we at Profluent are really excited about, and that we've just very recently put out two preprints on, is the ability to guide these language models towards generating proteins that have specific properties that we might be interested in. We recently put out a paper called proseLM, which allows us to take a protein structure that we might be interested in and actually ask a generative model to generate a protein that conforms to that structure. We found that it has really strong performance. We're really excited about this guided generation as a topic more broadly.

[0:20:53] SF: In terms of AI's role in the field of genetics and bioinformatics, is AI essentially the key to unlocking research in biology, just given the scale of the data that you're processing, for something like the 21st century becoming the age of biology?

[0:21:13] AB: We definitely think that these AI models are going to play an increasingly important role in driving biology research, drug development, and protein engineering more broadly. I think it remains to be seen to what extent that will be, but I think these protein language models and similar technologies are here to stay.

[0:21:33] SF: How does something like this impact, essentially, the overall cost and timelines for bringing something like that to market?

[0:21:41] AB: Yes. I think the biggest change in terms of the cost for early-stage drug development is that, previously, you may have had an initial screen where you have a few different candidates that you're interested in starting with, and then you'll have a lot of rounds where you incrementally modify those candidates. Every round, you need to do wet lab experiments, which are expensive and take a lot of time. I think one of our big goals with these models is to reduce the number of cycles in the wet lab.
So basically, if previously some design campaign had taken us, say, eight rounds of experiments, maybe now we can do it in two or three.

[0:22:22] SF: There's also a lot of, I think, noise that comes out of the wet lab experiments too, where you do the experiment in the wet lab, and then there's a handoff basically back to the data science team at some point. Essentially, if you're doing multiple rounds in the wet lab, you're probably introducing more new noise to the whole process, whereas if you can cut that down, then there's going to be less chance, essentially, for noise to be introduced into the system. Do you understand what I'm saying?

[0:22:50] AB: I think to some extent. But also, with all of these wet lab experiments, just because biological systems are complex and sometimes things break, we'll often do multiple replicates of each sample just to make sure that we're not getting spurious signal from these individual experiments. Yes, there's a lot of care that's taken to make sure that the experiments are giving reliable results.

[0:23:12] SF: In terms of the way Profluent is set up, or even other companies in the space where you're combining skills from biology, AI, traditional computer science, or software engineering, what ends up being the team makeup, and how do those different disciplines play together?

[0:23:32] AB: Yes. It's a great question. I would say, broadly, Profluent is divided into three major groups. The first is bioinformatics, which is really how we curate the massive datasets we use in order to train our internal models, whether they be unsupervised or supervised. They also do a lot of work in actually analyzing the results of the wet lab experiments. We have a machine learning team who's driving model development and a lot of the actual protein design work. Finally, we have the experimental team, without whom we wouldn't actually be able to test any of our hypotheses. They're the ones who allow the ML team to see which techniques work and which ones don't, and, for specific campaigns that might be, say, of commercial interest, to get the results that we need to show that the work we're doing on the computational side is actually yielding the desired results.

[0:24:31] SF: Then on the ML side, are you, or somebody else within the organization, having to stay up to date on the latest research? And then, are there separate people that are essentially in charge of figuring out how to incorporate these things or find value in them? Or is that something that you're actually doing on a regular basis? You're always kind of figuring out, "Okay, well, we could actually get an improvement here, or we recognize that we have this problem, and we saw this research here. Let's run a sprint on that and see what it maybe yields."

[0:25:03] AB: Yes. I think it's a great question, and everyone on the team is, I think, quite proactive about keeping up with the latest developments, whether they be new research or new features in, say, PyTorch, and thinking deeply about what the implications are for the work we've already done, as well as whether it opens up new avenues for things we can do in the future.

[0:25:26] SF: Speaking broadly about the field of genetics, what is AI's role in that space, do you think, in the next few years?

[0:25:34] AB: I think, broadly, for gene editing, there are a lot of different ways in which proteins interact with the human genome.
These interactions are often complex and very domain-specific, and require a lot of human expertise to really understand deeply. I think the promise of AI for gene editing, especially, is in creating models that can understand these systems at various levels, and really using them to basically replace the current paradigm, wherein we look in nature and see if there are proteins that exist to solve a particular problem, with one where we can instead say, "I have this specific problem that I'd like to solve," and then design a protein that's bespoke for that purpose.

[0:26:29] SF: Yes. It's kind of like the shift, essentially, from accidental science to more of an engineering discipline.

[0:26:36] AB: Exactly.

[0:26:38] SF: What do you think is, I guess, the big challenge in the field right now, in order to get to a place where this is more of an engineering discipline?

[0:26:48] AB: I think the field is quite nascent right now, and there are a lot of different gene editing modalities, of which Profluent has explored a couple that we talked about in the OpenCRISPR preprint. But I think, with every new biological system, it's an open question of what methods are best suited towards that engineering task. I think it's really hard to know what will be feasible and what won't. But I do think the results from OpenCRISPR are incredibly promising.

[0:27:19] SF: In terms of working either in machine learning or as an engineer within a company like Profluent, are there unique challenges because of the space that you're in, that might be different than working in a space outside of biotech?

[0:27:34] AB: Yes. I think the biggest challenge of working in this space is just how interdisciplinary everything is, and I think it demands a lot of knowledge from many different areas. I think that's one major strength of the team we have here at Profluent: different team members come from different backgrounds. I myself come from a pretty traditional machine-learning background. We have other members on the ML team who come from a strong biophysics background, and kind of everything in between. The ability to really leverage others' knowledge in areas where you may not be as familiar, I think that's been both a challenge but also incredibly rewarding, learning CRISPR biology, for instance, coming from a fairly basic bio background two years ago.

[0:28:27] SF: What was that ramp-up time like?

[0:28:29] AB: I've always been interested in biology, but it was definitely a steep learning curve at first. It's been a lot of fun, though.

[0:28:36] SF: Are most people coming to Profluent, or other companies, coming from a biology background? If they're hired essentially into an engineering, machine learning, or data science role, do they have that experience? Or is it something that they end up having to pick up, or some portion of it, as part of the job?

[0:28:53] AB: I think, with how new this field is, almost no one in the world has experience in both biology and machine learning. So, our goal is really to have people from either side who are able to pick things up quickly and learn what they need to in order to execute on the projects that we're doing here at Profluent. So, yes, we have folks from a pretty diverse set of backgrounds, both on the biological side as well as the more traditional software engineering and machine learning side.

[0:29:24] SF: In terms of the machine learning side, do you end up building new approaches?
Or are you applying sort of traditional machine learning models, and it's just a new set of data, a new domain?

[0:29:38] AB: I think there's a healthy mix of both. For instance, with these new proseLM models that I alluded to earlier, this was a pretty novel way of conditioning an autoregressive language model to actually generate a protein that conforms to a structure that you, as a user, specify. I think that's quite novel and requires a lot of domain-specific knowledge.

[0:30:06] AB: On the other hand, you also have more traditional language modeling work, which we can draw a lot of inspiration from the NLP literature for. So, I think it's a bit of both.

[0:30:17] SF: What are the next steps for Profluent in terms of refining and expanding these capabilities?

[0:30:25] AB: At Profluent, we really do believe in the promise of AI to customize different attributes of existing systems so that these new systems can be a perfect fit for a specific application. Yes, as I've alluded to throughout, proseLM, as well as the other Profluent preprint that we released recently, are some of the ways in which we're doing that. And yes, there's a lot more to come. Stay tuned.

[0:30:51] SF: Okay. As we start to wrap up, is there anything else you'd like to share?

[0:30:55] AB: Yes. At Profluent, we are broadly interested in the ability of AI to design proteins for a wide range of different applications. We are very open to folks joining us and are actively hiring. Like I mentioned, folks from all sorts of different backgrounds, whether they be more biological or machine learning-oriented, are very welcome to apply. I think there's room for all of us to grow and learn together.

[0:31:22] SF: Well, fantastic. Hopefully, we piqued listeners' interest enough to go and check out the careers page at Profluent, and maybe you'll get some good applications in there. But Aadyot, thanks so much for being here. I really enjoyed our conversation.

[0:31:34] AB: Yes, thanks, Sean.

[0:31:36] SF: Cheers.

[END]