EPISODE 1639 [EPISODE] [0:00:00] ANNOUNCER: Nextflow is a tool for managing scientific computation workflows. It's increasingly popular for bioinformatics, computational biology, and other life science applications. Evan Floden is the co-founder and CEO of Seqera Labs, which develops Nextflow. He joins the show today to talk about his background as a scientist and engineer, the modular design of Nextflow pipelines, the unique challenges of genomic sequence data formats, and more. This episode of Software Engineering Daily is hosted by Sean Falconer. Check the show notes for more information on Sean’s work and where to find him. [INTERVIEW] [0:00:47] SF: Evan, welcome to the show. [0:00:49] EF: Awesome. Thanks a lot, Sean. Great to be here. [0:00:50] SF: Yes. Thanks for doing this. I'm looking forward to our conversation. So, let's start off with some introductions and some basics. Who are you and what do you do? [0:00:57] EF: My name is Evan Floden. I'm the CEO and co-founder of Seqera Labs, trained as a scientist in biotech and in bioinformatics. I guess, more relevant to this conversation, I've been working with my co-founder Paolo Di Tommaso for the last 10 years or so on an open-source project called Nextflow. I'm very passionate about developing tools for scientific analysis. But yes, I've really been into the more computational side of things for the last 10 years or so. [0:01:23] SF: And you said you're a trained scientist. What was your PhD work in? [0:01:30] EF: So, my Ph.D. was in multiple sequence alignment, which is where you take thousands and thousands of sequences, typically either protein sequences or DNA sequences, and line them up. By using that information, you can generate three-dimensional structure predictions. You can do predictions on phylogeny, that is, what's the relationship between those sequences in evolution.
It’s very much been used over the last, probably, 30 or 40 years; it's sort of the origin of bioinformatics, and that alignment information is now being used to train different models. So, now we start to see it used quite widely in things like AlphaFold, which has come out more recently, for structure prediction. It's a bit of a historical field. I started my Ph.D. on that, but really, through the years, I ended up shifting my focus more and more to Nextflow. But the origins of Nextflow itself really started with that project. [0:02:24] SF: That's very interesting. Then, you actually started Seqera straight out of completing your PhD. I did something similar. But I guess, what was it like to go from academics to actually building a company? In a lot of ways, it’s a pretty different beast to manage. [0:02:39] EF: Absolutely a different beast. I had a bit of a roundabout path, in that I previously worked in biotech for about four years or so, prior to jumping into the bioinformatics field. That work was very much at the bench. So, spending a lot of time in the lab. A little bit tedious, I would say. Folks who have spent some time at the bench will recognize the amount of time that goes into that, and I really wanted to move into the bioinformatics field, onto the more computational side. That's where I really started my PhD, met my co-founder Paolo, and really from there, we ended up spinning out the company. We created Seqera about five years ago, but by then we were about five years into the Nextflow journey. That journey itself has been more of a transition, I would say, than a direct, hard turn. We've obviously kept the project open source. We started off very much doing things like training and spreading the word about the community of Nextflow, so it was a transition into that.
I guess, building the company is obviously very different. As you start to scale up in particular, it puts completely different challenges on you. But really, the focus has remained very much on the goals around Nextflow. [0:03:47] SF: I guess, talk to me about what you're doing at Seqera. What was the goal of the company originally? Then, I guess, how is it different than what other people are doing in the space? [0:03:56] EF: So, Seqera’s goal is very much focused on providing modern software engineering for scientists. That really started with Nextflow. Very early on, we saw the talk by Solomon Hykes, back in 2014, and we saw that containers, and Docker in particular, could be used for workflow execution, when at the time they were mostly being used with long-running web services. So, we took the idea that we can wrap all those dependencies up, and applied it to data pipelines. We also saw that scientists wanted to collaborate on the actual source code that they were working on, on the pipelines themselves, so we integrated very early on with Git. Then we also saw the adoption of cloud starting to come about, and we saw that folks wanted to take more traditional on-prem workloads, things that were running on schedulers like Slurm, and start to run them in the cloud. So, the origins of Seqera and Nextflow come from that world. I think what we've seen over time is that we are applying that now not just to workflow execution, but more broadly to the other areas of scientific data analysis. [0:05:02] SF: Was it, because essentially, a lot of the work going on in biotech is maybe from people who are more trained scientists, trained biologists, but not necessarily trained software engineers. They were essentially aware of some of the modernization that was happening in the world of software.
Then, because of that, they reached certain limits in terms of what they can actually do as an organization, because they can't reach the scale that they need, or they can't operationalize certain things? [0:05:30] EF: And there’s a difference in skills as well. It's a different background that those folks have. I'd probably describe myself much more in the camp of the user, versus the software developer. I have basic training and can do things like basic Python and Nextflow, but I am never going to get into that lower-level, deeper programming. What we're doing is allowing scientists to typically use existing tools. So, maybe you're taking something from GitHub, you're taking a script in the lab from the postdoc, you're linking it in with some Bioconda package, and you have to pull all these things together and write them into a piece of software. So, there's a very different set of skills and backgrounds among the folks who use the software, versus the people who are actually developing the core of Nextflow, people like my co-founder Paolo. That kind of separation of skills means that we can really give the scientists that modern software engineering stack, in a way that is relevant for what they're trying to do and the problems they're trying to solve. Ultimately, it means they can spend more time thinking on the science piece. They are domain experts in particular pieces of software, in particular sequencing technologies, in particular analyses, versus having to focus on spinning up virtual machines and managing dependencies and the like. [0:06:49] SF: Okay, yes. That makes sense. Then, I guess, can you provide an overview of the architecture and design principles behind Nextflow, and what need it's serving for the users of the product? [0:07:02] EF: Yes. I probably can't go too deep into the architecture myself.
I can give you more of a view on how Nextflow pipelines are developed and how that fits in. We like to think of things as very modular. So, you are designing what we call processes. Those individual processes, you can think of as tasks. They typically contain an input, an output, and a script section, essentially a piece of code which is executed. Then, the way that Nextflow works is it follows this dataflow paradigm, where each of those processes is linked together with what we call channels, which are first-in, first-out queues. And that linking of the channels really drives the execution. So, you can create a process, or several processes, then you fill those channels with files or different data structures, and they push the execution of the pipeline, top to bottom. Compare that with something like Make, which takes sort of the opposite approach: with Make, you're pulling execution through, whereas here you're pushing it through. Then, if we think about the other design considerations, it's often around the separation of the workflow logic from the configuration. The thinking there is that the workflow logic should be able to be developed by really anyone. It can be written on your laptop. You can ship it up to the cloud. I can share it with you and you can run it on your cluster. That separation of workflow logic from configuration has been one of the real keys. And obviously, a lot of that's driven from containerization of each of those tasks when it runs. So, when a Nextflow pipeline begins to start, it's got what we call almost like a head job; essentially, the Nextflow runtime is running. Nextflow itself delegates each of those individual tasks to what we call an executor. This could be something like AWS Batch or a Slurm cluster, where those tasks end up running.
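The process-and-channel model described here can be sketched in a few lines of Nextflow DSL2. This is a minimal, hypothetical example (the file glob and the commands are invented for illustration, not taken from the episode):

```nextflow
#!/usr/bin/env nextflow

// A process: an input block, an output block, and a script section.
process COUNT_LINES {
    input:
    path sample_file          // one element consumed from the upstream channel

    output:
    path 'counts.txt'         // emitted into the downstream channel

    script:
    """
    wc -l ${sample_file} > counts.txt
    """
}

process SUMMARIZE {
    input:
    path counts

    output:
    stdout

    script:
    """
    cat ${counts}
    """
}

// Channels are first-in, first-out queues; filling the 'samples' channel
// with files pushes execution through the pipeline, top to bottom.
workflow {
    samples = Channel.fromPath('data/*.txt')
    SUMMARIZE(COUNT_LINES(samples)).view()
}
```

Each file that lands in `samples` triggers one `COUNT_LINES` task, whose output in turn triggers `SUMMARIZE`; that is the push-based dataflow he contrasts with Make's pull-based approach.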
This distributed compute is really passed off to the executor, and Nextflow is acting like an orchestrator, hence the name workflow orchestration. [0:09:01] SF: When you're talking about a pipeline in this space, can you give a little bit more detail of what that consists of? When I think of a pipeline, I'm thinking of a traditional data pipeline that's aggregating data from multiple sources, and then dropping it into something like a warehouse or data lake. Is this something that's substantially different than that? Or is this a similar idea? [0:09:21] EF: For the most part, very different indeed. If we look at some of the more popular workflows which are developed in Nextflow, the main example often given is the RNA-seq pipeline, particularly the one from nf-core. That pipeline consists of somewhere between 50 and 60 different steps, and each one of those steps is typically running a command line tool. These are command line tools which are developed by scientists and other software engineers. It may start off with taking the raw FASTQ files, the raw files that have been coming off a sequencer, or some modification of those files, and aligning them against an index. That will be a different step. Then from there, maybe you're running some quality control against those alignments. You may be taking those alignments and seeing if there are any particular differences between the quantities that you'll be seeing across those different samples. All of that runs in parallel. So, that was maybe how I'd describe it for one sample, but it'll be happening for each of the samples that you have. This can result in upwards of tens of thousands of different tasks. Each of those tasks is running a different piece of software, has different resource requirements, and is essentially running in parallel as well.
So, quite different from the kind of more traditional ETL pipelines that you often see in the space. [0:10:37] SF: Right. Then, in terms of orchestration, in traditional, I guess, data engineering, there are lots of different orchestration tools that exist. Are the problems in this space unique in some way that requires a new type of orchestration tool, something you couldn't do using an off-the-shelf solution? [0:10:58] EF: For scientific pipelines in particular, there are some differences there. Often, when people talk about big data, it's big structured data, or data which is coming from something like Spark. In this case, we're typically dealing with files, and those files are really unwieldy. If you think about a sequencing file, it's really just thousands and millions and billions of As, and Ts, and Gs, and Cs, in a compressed text file. That's really how it works. Dealing with those files at scale in a parallelized way, in a distributed way, isn't always easy, and the current, sort of traditional technologies don't always lend themselves to that. So, one of the key differences is just that bulkiness, that size of dealing with gigabytes of compressed text. The other one is the use of those tools. When you're thinking about a traditional data pipeline, you may be thinking that you're going to be writing the tools yourself, that you're going to be scripting this yourself. The exact tools that we use in the bioinformatics space are very different. They'll be coming from many different sources. They might not necessarily have the same standard of software engineering around each one of them. Then, the resource requirements will be very different for those different tools. So, in one moment, you may be needing, for example, a very large memory machine, because you're trying to do an index of a genome.
The next step, you might want massive parallelization, where you want heaps of CPUs. There's also a big uptake of acceleration in the field. So, whether this is NVIDIA GPUs through something like the Parabricks solution, which is acceleration of that, there's also a tool called [inaudible 0:12:30], which is acceleration at the CPU level, and even Illumina's DRAGEN, which runs on FPGAs. What you're running there is essentially very specialized hardware for that problem. So, an individual pipeline may contain many of these different pieces, which makes it very, very challenging. I would say, the last piece is just the users who are running or developing these pipelines. As I mentioned before, that skill level is not necessarily the same across the board, although there are, of course, many, many well-experienced folks who are using this. [0:13:03] SF: Who's the typical user of Nextflow? [0:13:04] EF: We see a spectrum that's very wide across that user base. There are, of course, bioinformaticians who are developing pipelines for themselves, and they would be, say, a single user running that pipeline. Then, there are, of course, folks who are developing those pipelines so that they can be used by other scientists, maybe scientists in the wet lab who need to run analysis over and over again, routine analysis, and we want to empower those folks, so that the bioinformaticians can create those pipelines and share them out. Then, at the other end of the spectrum, if you're thinking of pure production use cases, like a clinical setting in a hospital, you may be approaching this from a pure software perspective, where it's purely API-driven, completely automated, connected in with other systems.
So, you do see a whole spectrum across that, and that's really something that we're trying to solve with Seqera. Nextflow, like many open source tools, is a command line tool useful for a single user. As you spread that out, and you want to make those pipelines available to other folks, having more of a platform-like approach is obviously required to be able to scale up and to serve the needs of those people. [0:14:15] SF: Then, how does Nextflow handle parallelism? You mentioned that there can be certain pipelines where you need to essentially scale massively from a parallel standpoint. What are you doing there? Is it all based on using the elasticity of cloud resources to be able to address really specific scale needs for a particular pipeline? [0:14:39] EF: It pretty much comes down to the way that the user is going to write the pipeline. But in most use cases, you are taking things and splitting them, say, on a per-file basis. So, maybe you're taking two reads, two files which contain the different reads of your sequencing, and you want to process those together. What takes place is that Nextflow takes the files themselves and puts them into channels; we call these elements of a channel. Then, for every element of the channel, the pipeline execution will submit a task. If you're in a situation with cloud-managed services, things like AWS Batch, Google Batch, Azure Batch, you typically define what's the maximum queue size that you want to have, or the number of virtual machines that you want to be there. And Nextflow will submit those jobs to those virtual machines, or to the system itself, which will then do the queuing itself.
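As a rough sketch of that per-element fan-out, the following hypothetical snippet pairs the two read files per sample with `Channel.fromFilePairs` and submits one task per element. The tool name (`align_tool`), file glob, queue name, and resource figures are invented for illustration:

```nextflow
// main.nf: one task is submitted per channel element (here, per sample)
process ALIGN {
    cpus 8                  // per-task resource requests; the configured
    memory '16 GB'          // executor does the actual queuing

    input:
    tuple val(sample_id), path(reads)   // e.g. ['sampleA', [A_1.fastq.gz, A_2.fastq.gz]]

    output:
    path "${sample_id}.bam"

    script:
    """
    align_tool --threads ${task.cpus} ${reads} > ${sample_id}.bam
    """
}

workflow {
    read_pairs = Channel.fromFilePairs('data/*_{1,2}.fastq.gz')
    ALIGN(read_pairs)
}
```

The same workflow logic can then be pointed at a different back end purely through configuration, for example in `nextflow.config`:

```nextflow
process.executor = 'awsbatch'      // or 'local', 'slurm', ...
process.queue    = 'my-batch-queue'
```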
If you're on a local machine, say you're developing this locally and you have eight CPUs, you define, per process or per task, how much memory and how much CPU you want to request, and it manages it at that level. And then there are the more shared systems, things like Slurm and Grid Engine, which have been around a while. In those systems, Nextflow is submitting those tasks to them, and it's really delegating the batch execution as it takes place. Interestingly, there starts to be some work in the Kubernetes field, although it hasn't been developed as much. We're looking at some really interesting technology now which is coming out about how we can connect into that, and we see very much that that is the future for this kind of batch execution, even though it's still a little bit far away from most of our users, who are, say, less involved in that field. [0:16:17] SF: At Seqera, how do you make decisions around investments in technologies like Kubernetes? You mentioned earlier Git, cloud, containers, and so forth. How are you becoming aware of those things as potential opportunities, and then making decisions about actually building a product that's going to leverage those technologies? [0:16:36] EF: The main thing for us is trying to remain as close as possible to the users and to the community. So, we run things like our Nextflow Summit every year, where we're really bringing people together to share ideas on exactly how they're using the technology. What's the new thing? What technologies are they integrating with? By keeping close to that, and almost providing a space for those conversations to happen, we can keep exposed to it. In terms of the business investment side of things, we're very committed to growing Nextflow and Seqera via the principles of open science. It's a little bit different from pure open source, where you think about making your code available.
Whilst that's one part of it, when we think of open science, it's more around scientists themselves developing analyses and sharing things on Git. So, they're going from just publishing papers to being able to publish a whole study or a whole analysis, which can then be shared and replicated. That's often done by scientists with domain expertise. So, we keep close to them, see the technologies that they're using, and go that way. We're up to just over 120 customers or so now. They obviously keep us very grounded in terms of what their requirements are. It does shift as we get to go into more enterprise deployments, and the enterprise requirements are slightly different, but they provide us with a fantastic, rich set of feedback on the features that we should be developing. [0:18:01] SF: How has the Nextflow community evolved since its inception? [0:18:05] EF: Related to the point I was mentioning before, around different users and different use cases, what we've seen is that, like many open source communities, the community value very much comes from the content which is created. We started Nextflow in 2013 and ran our first community event in person in 2017, and it was really from there that some of the users who were proponents of Nextflow spent some time, got together, and realized that there was a missing layer. That missing layer was the development of those pipelines: how could they come up with a way, which they could all agree on, to define the pipelines themselves? That's really what led to nf-core, which has been a huge success, and it's really driven the growth. You can think of it more like a user-group community than necessarily a developer community developing the core of Nextflow. That second piece is starting to come along now, with things like having a plugin system, so that people can very easily develop their plugins.
Examples of that are things like a SQL plugin that can create channels from query results, as well as a range of other plugins that allow folks to connect in their own different technologies in that way. And we really see the core of Nextflow starting to be developed now by different folks. Still, I would say, it's very Seqera-oriented, and a lot of that comes just with the requirements that we have for developing the software in a certain way. We're obviously always looking for more contributions and to continue to grow the community as well as the ecosystem. It goes beyond just individuals. We now have companies like Oxford Nanopore, one of the large sequencing companies, who are developing all of their workflows in Nextflow, putting them all on GitHub, making them all available. And we start to see that that kind of industry ecosystem engagement is really key to the success of a project. [0:19:59] SF: Then, for people who are developing these plugins, how do they go about actually making people aware that the plugins exist? Is there a marketplace or a platform where I can go and look up the plugins? Or is it maybe a little less organized than that at the moment? [0:20:14] EF: On the plugin side of it, we do have repositories, and we elevate those into the main documentation. So, you can find those plugins in the Nextflow docs. In terms of the pipelines themselves, there are communities like nf-core, which are very easy to search and find. But we are looking at the moment to see how we can aggregate all of that information, to aggregate the contributions that come from folks developing pipelines or developing plugins to a higher level. So, it's a little more organized, people can find each other, and they can also get recognition for those contributions that they're making to the community. [0:20:51] SF: How does someone actually create a pipeline? [0:20:54] EF: There's a bunch of ways.
It can start as simple as a text file, as you would with any coding language. Looking into the documentation, there are some good Hello World examples. A pipeline really can be as simple as three or four lines, where you're defining an input, which could be a file, maybe reading it line by line, and generating some transformation on that. There are also more sophisticated ways of doing this, if you wish to create something a little more production-ready. There's some fantastic tooling, again, that comes from the community. There's the nf-core tool set, which is a Python toolset that allows you to create a pipeline from a template and define exactly what's relevant for you. It creates the repository for you, creates the license file, for example, creates a schema which allows you to create a UI around it, and just helps you get started. I think one of the things which has changed over the last couple of years is we've introduced a module system into nf-core, which now has about a thousand modules. Each of those modules is a command line tool, or a piece of software, which comes with a container and a definition of an input and output, and allows people to just take those individual pieces and link them together without having to write everything from scratch. So, there's a whole bunch of different ways that folks can get involved and develop pipelines. I'd also suggest that if you are going to do this, there's a Slack organization with 8,000 or so people who are more than happy to help people get started, so you can start those conversations there. It's always great to see if there's something that exists already in the community, as opposed to duplicating it, or even better, to help out with something which is underway.
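A "three or four lines" starting point of the kind described above might look like this hypothetical snippet, which reads a text file line by line and applies a transformation (the file name is invented):

```nextflow
// hello.nf: read a file line by line and transform each line
workflow {
    Channel.fromPath('greetings.txt')
        .splitText()                           // one channel element per line
        .map { line -> line.trim().toUpperCase() }
        .view()                                // print each transformed element
}
```

Running `nextflow run hello.nf` would print each line of the file in upper case, one element at a time, as it flows through the channel.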
[0:22:39] SF: Then, you mentioned the idea of a Hello World pipeline. Are you creating that in a standard programming language? Or is this done using a domain-specific language as part of the Nextflow platform? [0:22:51] EF: Yes. So, Nextflow itself is a domain-specific language, which is built on top of Groovy. I'd point out that most folks don't really need to learn any Groovy to use Nextflow. The DSL itself is supposed to be really self-contained, so that it's obvious what you're doing. Whether this is defining inputs, writing these process blocks, or doing modifications on those channels, it really can come from the DSL itself. So, there's no real need to learn another programming language to get involved in Nextflow. The logic behind that is we want folks to be able to focus on the science piece and really to be writing readable code. It almost has a functional style to the way that it's written, so that you can very clearly see what's going in where and what the core pieces are, and you can focus on the scripting section as well, without necessarily jumping in context across multiple files, et cetera. [0:23:48] SF: Then, if I have an existing product, or some piece of technology, that I want to be part of a Nextflow pipeline, do I need to do some level of integration so that Nextflow can actually call out to my piece of technology to be part of a pipeline? Or is Nextflow somehow able to take advantage of something even if it's a command line tool that doesn't necessarily have an API interface that's easily consumed? [0:24:17] EF: The best way to think about this is typically to containerize whatever is required. So, whatever you're able to put in the container, whether that's a very thin client which is calling out to your service, for example.
Or whether you just want to call it from the command line, or you want to have a Python script which is going to do that. Really, Nextflow allows you to mix and match those different languages, and there are different ways to interact with interfaces. The fact that you just containerize whatever you want to run, in terms of dependencies, really solves many of those issues and provides a ton of the flexibility that Nextflow has. [0:24:53] SF: Okay. I see. Then, in terms of the Seqera platform versus Nextflow, is the platform essentially the managed service version of Nextflow? Or does it encompass more than just Nextflow? [0:25:05] EF: It started out very much based on the requirements of organizations that were running Nextflow in production. They wanted to go from that single-user, command line use case to being able to have management of the workflows, monitoring of them, being able to call them from an API as part of a service. The origins of the platform started from that use case a few years ago. Since then, we've realized that there are many adjacent problems. The first one was around, how can you set up that scalable cloud infrastructure in an optimized way? Given the experience that we have had interacting with the community and working with some of the organizations doing that, we felt that there was almost an opinionated way in which we could go about creating those environments. So, we have products for that, to create that infrastructure for folks. The other one is around collaboration. If you go from that single-user experience, and you want to trigger a pipeline, these pipelines last from 10 minutes to sometimes 10 days, and over that time, you need to be able to monitor them, to be able to create alerts, to be able to get information based on them.
Importantly, you need to track all of the information that's going on. So, if you're running a pipeline which needs to be certified, or needs to have some kind of validation on it, you want to have a full history and a record of that. Having a database tied to the execution, with all of the task information, helps a lot. That workflow execution piece is what we've been developing for the last couple of years. Going out from that, we realized that there were some scalability challenges that existed when running this stuff, particularly with regards to the file problem that I mentioned before. When you have very large files and they're stored on object storage, and you need to run against them inside a virtual machine, there is this need to simply transfer all of that across, run the analysis, and then transfer all of it back. We've been working on a technology called Fusion, which essentially mounts the object storage inside of the container, and removes many of the challenges associated with using object storage for these kinds of workloads. Obviously, it has some scalability and cost savings which come as part of that. Likewise, on the container development side, realizing there are challenges for folks to maintain and develop all of these containers, particularly if you've got a pipeline with 70 different steps, we have a service called Wave, which is built for that. Then, going out from the workflow execution, we're thinking more about the data management piece, as well as the interactive analysis. So, you can think of things like Jupyter Notebooks or studio environments: how can we link the pipeline execution to that, and close that circle on what we like to call the data analysis lifecycle, which links everything from development of the tools and development of the workflows all the way through.
And this is really where we have a new product coming out called Studios, which links the execution with something like a VS Code environment. So, you can run and develop your pipelines in the same place where your data is in production and get that native experience. [0:28:15] SF: Then, as a company, how do you balance the open source projects that you have versus investment in resourcing the products that are actually making money for the company? [0:28:29] EF: I think it comes from a long commitment to open science itself. The majority of our engineering effort still goes into that open source base, whether that's development of pipelines, supporting that community, development of Nextflow, and even some of the things like Wave, which are open source technologies. Thinking about the balance, there is obviously some pragmatism, particularly with what I mentioned before about enterprise features. But it is a balance which goes into many of the conversations that we have around resource planning and the like. I think the balance is about being pragmatic there. You obviously want to grow the user base. You want to support them. But realistically, like many open-source projects, we will only ever monetize a small percentage of the user base. In our case, that's an acceptable trade-off to make, particularly when we think about all the fantastic things that people are using the software for, whether this is contributing pipelines, or reading documentation and providing feedback on it. There's just such a wide range of positive activities that you get from open source engagement. They're a little bit tricky to put a number on in terms of how they affect top-line revenue, but we think it's a worthwhile investment.
[0:29:43] SF: I mean, I think one of the big values of open source is that you're increasing the number of eyeballs that are actually using your system, so you increase your feedback cycles. I think you said you had 120 customers. If those were your entire base, you might not get as much feedback as you would from thousands of individuals who are contributing or interacting with the product. They're not necessarily paying for it with money, but they're paying with their time investment, and giving back feedback that helps improve the product that you can then monetize.

[0:30:14] EF: Yes. So much of that is around the eyeballs on it, whether that's trying out beta versions or early features and providing feedback. Even just finding bugs, quite frankly, is always great. The Nextflow user base has around 20,000 people who look at the documentation every month, constantly running the software and providing feedback. It gives you so much in that regard. It also, in some sense, provides distribution. Having that distribution has allowed us to think about investments into other products and other open-source projects which are adjacent, like Wave, the container service, which is quite tightly linked to the workflow execution but separate in itself. That open-source engagement provides an audience for new expansion.

[0:31:05] SF: Who's the actual buyer of the product?

EF: When we look at our customer base, it's very much focused in the life sciences industry. Although, I should point out, there's really nothing in Nextflow, or even Seqera, which is intentionally tied to life sciences, genomics, or bioinformatics analysis itself. I think there's just a real need for a product like this, which is driving a lot of the adoption there.
The buyers themselves are typically bioinformatics teams which are running services for other teams, or users who are running drug development programs. We have a lot of interest coming from folks in clinical analysis, so running routine analysis. Personalized medicine is a big one as well. In personalized medicine, you've got examples where, say, you take a tumor sample and a normal sample, look at the differences between the two, and predict new antigens on the surface of the tumor cells, and all of that requires a lot of sequencing to do. There's this idea that we're sequencing patients all the time as part of routine analysis, which means a lot of sequencing analysis has to take place, a lot of computation, and a lot of custom workflows. The drug approval process is becoming more and more data-driven, and much of that data comes from sequencing information. Imaging as well is obviously a big space. Of course, all of the work that's coming about with AI is just driving a lot of the adoption of these pipelines.

[0:32:32] SF: What do you see as some of the current challenges in the field of bioinformatics? And what is Seqera doing, essentially, to try to address those things as you continue to develop the product?

[0:32:45] EF: For sure. Ever-increasing data volumes, really driven by the decreasing cost of sequencing, which is putting even more pressure on folks to analyze the data. Obviously, that creates challenges around hardware as well, and how we can do that. The resource requirements are becoming more varied. As I mentioned before, the work on acceleration, which is becoming prevalent in the field, makes it more difficult to acquire those resources, but also to manage them in an effective way.
Then there's cost management: being able to assign cost structures, particularly as you go to the cloud and you've got a whole bunch of different users with different experience and requirements. It's not as simple as running a software shop, necessarily, when you have these different users interacting with it. I think the final one is just around personnel who can really straddle the boundary between the technology and the biology. People who are able to either learn very fast, or who can converse in both worlds, are a major asset, and when we talk to our users and customers, they're always looking for more folks who can do that.

[0:33:50] SF: Yes. Is finding the right talent for a company like this more challenging than finding talent for a traditional tech company? Because you do need people that, even if they don't have the biology training, are smart and curious and can learn some portion of that. Or, on the flip side, they have the biology training, but not necessarily the computer science training, and they can learn the tech skills they need to be successful.

[0:34:16] EF: We're probably not the best example here. We have a bit of an unfair advantage, in that the folks we were able to hire have a ton of experience in the community. It gives us access to a pool of really talented people who we can hire from. There is, of course, like always in tech, a need to hire very experienced infrastructure engineers, or security engineers, for example. That's a more common challenge, one which exists in many other companies. I think where we see it more is with folks who are moving into computational biology and need to find the skills to work with scientists in those spaces.
It's something we're betting on: that the workforce is becoming more skilled, and that there's this idea of a knowledge economy where everyone has the ability and the skills to work with software, and to develop software in this way. It's something we're betting on when we think about enabling scientists with modern software engineering. We think that this trend will continue.

[0:35:20] SF: What do you see as the future of workflow management systems, orchestration, and bioinformatics tools? What's the evolution that's happening? Is everything that's going on in Gen AI something that's going to start to play a larger role in some of these products?

[0:35:36] EF: Yes. So, we see a big demand for things like GPU integration. We already have that in some aspects, but we're really increasing that and improving the support there, for example, being able to get the metrics off the GPUs exactly like we can with CPUs currently. Increased automation as well is a big piece, where these pipelines and workflows become much more a part of standard production. How can they integrate with other systems so they can very easily just connect? There's an orchestration layer, probably a little more similar to the traditional ETL orchestration layer, which almost sits above these pipelines and says: based off data coming in from the sequencer, I'm going to run this pipeline automatically, and it's going to generate this result over here. That's a kind of top-level piece which we see coming in. Then a big increase in just usability, so that the many people who have to interact with these pipelines are able to write them in a more user-friendly way, I would say.
I think there's work we're doing around the syntax of Nextflow, about integrating AI to help you code better, being able to have code generators which help a lot there and make the software more accessible. I mean, it's just an ever-changing environment, where the science is pushing the technology, and those technologies are also changing with what's being offered by cloud providers and technology providers. Constantly living in the middle between these two things means it's going to keep changing. And the next 10 years of Nextflow look even more exciting than the previous 10.

[0:37:08] SF: Do you think that there's been a change in expectation when it comes to the user experience of tools like this in the world of science, from where it used to be? Even if you look at enterprise versus consumer software: there was a time on the enterprise side when it was just kind of accepted that the UX was not going to be that great. But then people got better at building consumer products, user experience expectations got higher, and those expectations ended up being reflected in the enterprise as well. Is a similar trend happening in this space?

[0:37:38] EF: I would add to that that different users are joining, and they maybe have those different expectations themselves from the beginning. There is obviously an adoption-curve aspect to this, where you've got early adopters who are willing to put up with some rougher edges. As it becomes more popular and the user base increases, those kinds of expectations come with it. We certainly see that; we've built the company from a very small base in a very small amount of time.
As we think about the investment in things like the UI, it's something we're putting a lot of effort into, particularly for the user who's going through an interface rather than the programmatic user. Expectations are definitely a lot higher than when we began. But I think it's great. It pushes us to deliver a product which ultimately can serve more people, and hopefully that allows scientists to really focus on their work.

[0:38:31] SF: Fantastic. Evan, as we start to wrap up, is there anything else you'd like to share?

[0:38:36] EF: Yes, I'd just like to give a quick shout-out to anyone from the community who's listening, or anyone who's thinking about joining the community. We think about the work that's gone on across Nextflow and across open source, and we're really excited about what's going to happen next, particularly as we think about containers and interactive environments. So, if you're interested, please shout out to the folks on the nf-core and Nextflow Slack channels. We've created a community forum as well, so head over to that and we'd love to help you out. Thanks a lot for the time, Sean.

[0:39:04] SF: Yes, awesome. And we'll have to include some of the links to those in the show notes. But, Evan, thanks so much for being here. This was a really enjoyable conversation, and I think those listening are going to enjoy it as well.

[0:39:14] EF: Yes. Thanks so much.

[0:39:15] SF: Cheers.

[END]