EPISODE 1682 [INTRO] [0:00:00] ANNOUNCER: Anaconda is a popular platform for data science, machine learning, and AI. It provides trusted repositories of Python and R packages and has over 35 million users worldwide. Rob Futrick is the CTO at Anaconda, and he joins the show to talk about the platform, the concept of an OS for AI, and more. This episode is hosted by Lee Atchison. Lee Atchison is a software architect, author, and thought leader on cloud computing and application modernization. His bestselling book, Architecting for Scale, is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee is the host of his podcast, Modern Digital Business, produced for people looking to build and grow their digital business. Listen at mdb.fm. Follow Lee at softwarearchitectureinsights.com and see all his content at leeatchison.com. [EPISODE] [0:01:10] LA: Rob, welcome to Software Engineering Daily. [0:01:12] RF: Thank you, Lee. It is very exciting to be here. I actually have listened to this podcast in the past, and so there was a little bit of extra excitement in addition to looking forward to a great conversation. [0:01:21] LA: So, why don't you start out by just telling me what Anaconda is? I'm sure most people listening to our podcast have at least heard of it, but there are probably many people who haven't yet. Let's start out by laying the groundwork and telling everyone exactly what Anaconda is. [0:01:40] RF: Oh, man. There are quite a few different answers to that question. I guess, at its core, Anaconda is a company that is really focused on helping people innovate, and doing that by giving them a way of connecting to the broader open-source ecosystem, specifically around Python, but not just Python. Anaconda was originally named Continuum Analytics, I think, when it was founded. They did a lot of work in the data science space and recognized the need to really empower Python programmers, specifically in data science and other areas. In order to do that, basically, they were solving their own problems. These were Python developers that were actually trying to help enterprises run their numerical computing workloads and other kinds of workloads, and they realized, especially in Windows environments and other environments, that there was a need to standardize how people got access to the broader open source ecosystem - the Python packages that millions of people around the world were producing. So, they created this distribution of Python packages and Python libraries. Not all of them are written in Python, by the way - a lot of this stuff is written in Fortran, C++, and other languages - which is kind of why getting the right bits onto people's computers, and helping them do that simply, manageably, et cetera, matters. They came up with the Anaconda distribution, which is actually where the name came from. You want a big distribution of Python - oh, big Python, Anaconda. Then suddenly, everybody knows Continuum Analytics through the Anaconda distribution, so that's how the name got changed to Anaconda. So, at its heart, we produce trusted repositories of Python and R packages. We provide products and services around security and governance of those packages, and, again, that connection to the broader open source ecosystem. We also have kind of enterprise products, as you'd imagine, around AI, data science, et cetera. That's our Workbench product, and others.
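To make the package and environment workflow Rob describes concrete, here is a minimal, hypothetical example of using conda; the environment name and package list are illustrative, not something prescribed by Anaconda.

```
# Create an isolated environment. Conda solves for a mutually
# compatible set of packages, including non-Python pieces such as
# BLAS and compiled extensions, for this OS and architecture.
conda create -n sci-env python=3.11 numpy pandas scikit-learn

# Switch into the environment and list exactly what was resolved.
conda activate sci-env
conda list
```

The same request resolves to different concrete builds on Windows, Linux, or macOS, which is the "right bits onto people's computers" problem described above.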
And we also do a tremendous amount around the open-source ecosystem. So, we either actively participate in, or fund, or develop many different open source projects - everything from Numba, to BeeWare, to Panel, and on and on - and we donate quite a bit of money to the various open source projects as well. But again, the real goal of the company is to connect scientists, data scientists, engineers, programmers - knowledge workers, actually. We can talk about that later. But Microsoft announced last year that you can now actually use the Python programming language inside of Excel. [0:03:51] LA: That's really exciting to hear. I remember hearing that. [0:03:55] RF: When you have the broader knowledge workers, that kind of advanced knowledge worker that wants to have access, again, not just to the Python programming language, but really to that broader ecosystem and all the innovation in those packages that the community provides - Anaconda wants to connect people to that. [0:04:09] LA: I think that's one of the problems that lots of languages have. I'm more familiar with the problem of how packages work in Ruby and Rails, especially when there are third-party packages that are not all in Ruby, and how it has to build in the background, and the installation of some of those packages can be quite problematic. Python, it's a lot easier in general. But a lot of that is because of the Anaconda packaging system. Is that correct? Is that the best way to think of it? [0:04:37] RF: Yes, I think so. Yes. Actually, there are open source efforts that Anaconda created - Conda, specifically - and that's what the majority of the user base actually uses. Actually, it's funny, our last measurement was actually 45 million users, and I think a lot of that comes from the explosion in AI, Python kind of being the lingua franca of AI, and people using Conda to set up their Python environment - not just to get the right bits on your computer in terms of, is this NumPy? Is this [name inaudible 0:05:03] or PyTorch? But getting all the dependencies, all the things those things rely on, and doing that solving in order to make sure that those environments run. Yes, that's exactly the problem you're talking about. [0:05:13] LA: So, quickly deploy small-scale apps to share. I think I got that phrasing from your website, and that sounds a little bit like what you're describing, but it's probably - it is a lot more than that. But I know one of the major use cases for your platform - and you just mentioned it briefly here now - is the growing field of artificial intelligence and AI tooling, and data science in particular. You've kind of taken it a step further, and you're actually pushing a concept that you call the OS for AI. Can you talk a little bit about that? You meaning Anaconda, I'm sorry. [0:05:47] RF: Yes. I guess it's more of a conceptual or philosophical point than a literal operating system. I mean, it does confuse some people. But the idea here is, you have this - an operating system kind of takes all the hardware bits. Especially if you go back to the seventies - well, I guess people assemble their own computers up to this day, but you had to back then - you needed the operating system in order to take all of that complexity and actually make this platform that people can then use. People could develop applications for it, people could actually use the hardware, and it kind of enabled that interchange. Conceptually speaking, that's what we feel we do.
We provide that middle layer that connects people to that broader open source ecosystem and makes it simple and easy for them to develop their applications, to deploy their applications, and to share those applications - I mean everything from traditional data science to modern AI-integrated models and applications. [0:06:40] LA: That's not specific to AI, though. But are there specific things that you do towards the AI use case that make it better for AI? [0:06:48] RF: Yes, absolutely. So, it's all an extension of what we did. If you look at Anaconda and say, "Hey, I get my Python packages from you. I get my packages and you provide ways of doing this." It's the individual user - that's what they care about. I want to get my environment set up. I want to get access to what I need so I can train a model, I can deploy a model, I can leverage that model inside of my application. Then you have the organizations that employ these people that want to make sure that they have some control over the Wild West that their users are entering into. I mean, there are hundreds of thousands, if not more, different packages that can be used. It's not a very secure world. So, traditionally, that was the data science space. Now, if you just think, okay, well, Anaconda is really good at getting bits onto your computer - the right bits for you - simply, securely, auditably, et cetera. But we've changed the definition of those bits. It's no longer just a Python or R package. Now, maybe it's datasets that you want to use in order to train your models or fine-tune your models, or it's the models themselves that you need to get installed and up and running very easily, again, for your specific environment. Are you running some kind of a Linux server? Are you running a Windows desktop? What kind of GPU do you have? Do you even have a GPU? Is it a phone? A mobile device? Edge computing? So, if you kind of broaden what Anaconda does to say, it's not just Python or R packages, it's getting the right bits - and those bits happen to be models, they happen to be datasets, et cetera - then you can see why the kind of AI explosion is a natural extension of what Anaconda does. [0:08:09] LA: So, you're also the data delivery for use by AI? [0:08:14] RF: I would say some of this stuff is today. Some of this stuff will be. [0:08:17] LA: Okay. Fair enough. [0:08:20] RF: Yes. So, there are a lot of rapidly changing improvements in the industry right now, and I'll be honest - I think everybody can relate to what I'm about to say - I can't even keep on top of it at all. It's that fast. We have the millions of users. We have people who understand how we've already helped them get their environments up and develop their applications with each other and whatnot. So, it's a very natural extension for them to come to us and say, "Well, help us with these problems, too," and for us to do that. So yes, some of this stuff we do today, some of this stuff we'll do in the near future, and some of this stuff is going to be coming. [0:08:47] LA: So, some of those datasets, though, are giant. Are you still packaging those for delivery? Or are you also looking at tooling to get access to datasets in other locations? There's a whole world of possibilities there, and a whole world of complexity involved. [0:09:03] RF: Yes, there is. It's not just access to the data itself. It's everything around the actual provenance of the data, too. So, you have the ongoing copyright discussions and, kind of, do I have access to the right data?
Do I even know that the data I'm using to train is data that I have the license to train on, or that I have the right to train on? So, helping people solve those problems. It's not just us. It's a very common refrain that people are holding off on some of these technologies because they don't understand the exposure. They don't understand what they can do and how they can do it. Then you'll even hear things like, what data should they even be training on? What data should they be using? Actually, I went and attended a very interesting presentation on the Phi model from Microsoft at GTC. They were making the point that sometimes training these models is a bit of garbage in, garbage out. You want to use high-quality data. So, it's one thing to train your models on all the data on the Internet. It's another thing to say we're going to actually take a subset of that data and get better output, because we don't need every possible piece of fanfiction that was ever written incorporated into our model when we really want it to be an expert in science or physics or something else. It's not just about the size of the datasets. It's about provenance. It's about security. It's about auditability, reproducibility, and then collaboration and sharing. [0:10:12] LA: Cool, cool. So, two different topics, just taking two words that you said, and I'd like to talk about them independently. We could probably have a whole conversation on both of them. But one of them is security, and the other one was copyright. So, let's take whichever one you want to cover first, but I'd like to talk about both of those areas of what you do. [0:10:35] RF: I mean, flip a coin. They're both tremendously interesting. [0:10:36] LA: Then, let's start with security. You mentioned securely delivering the package to the desktop - whether the package is data, code, whatever. So, that implies a lot of things, and perhaps a lot more than what you actually do. I'm not exactly sure. There's a wide spectrum of different things that "securely delivered to your desktop" can mean. Why don't you tell me what you mean by that? [0:11:02] RF: Yes, again, it could absolutely be a lot of things. But I guess at its heart, people talk a lot about software bills of materials. You want to think of it as chain of custody - like when you watch Law and Order, or any of those kinds of police procedurals. With evidence, you want to make sure you understand what happened to it every step of the way, so that there's nothing that was corrupted, no kind of broken trust or misuse. You can think of packages and whatnot the same way. So, how do I know that this open source code that I'm pulling onto my own computer, and I'm trying to use or trying to share with others, doesn't have anything nefarious in it? It wasn't the wrong code. It wasn't something where some malicious actor injected something bad into it. There have been plenty of very high-profile security incidents over the years where - I won't name any names, but you can look them up - lots of big-name companies had a supplier that provided them software, provided them something, and the attack actually came through the supplier to that major company. I didn't talk much about my background, but I spent 17 years in the high-performance computing space. I've started several companies over the years. The last one that I founded was acquired by Microsoft in 2017.
I joined Azure and I led the HPC and AI software infrastructure team, from the product side actually, as opposed to the dev side. That's where I was when I left and joined Anaconda. While at Microsoft, it was a very, very, very major concern that for anything we were pulling in from anywhere else, we understood exactly what we were doing, because Microsoft was such a huge target for various obvious reasons, and our clients were such huge targets. The last thing we wanted was for people to be able to kind of backdoor into our clients through us, through our suppliers, and so on. All those same concepts apply to getting bits on your computer. So, whether it's Python packages, whether it's a model that comes from somewhere, whether it's a dataset that's being used, you want that same chain-of-custody knowledge, you want that same provenance, you want that same control. That could be for security reasons, like I said, in terms of malicious actors. But it could also even just be for reproducibility later. If you're going to use this stuff in pharmaceutical industries, or financial industries, or other industries that have regulatory needs, or other kinds of - especially - life or death situations. If your model makes certain decisions, if your system causes certain things, and something goes wrong, they're going to want to understand why, and you had better be able to reproduce your results. So, even just being able to track your data, track your models, track your packages, and control them over time is very important. [0:13:14] LA: So, kind of putting a fence around what it is that you do and what it is that you don't do. What you do is things like security validation - making sure the chain of custody of the bits is correct and validated, so that people can have assurance that the code coming in is what they intended it to be. But what you're not doing is things like packaging the code into separate containers and running them in virtual environments and that sort of thing. You're not managing that process. [0:13:43] RF: Not necessarily. Yes, that tends to be more what our users want to do, how they want to deploy this stuff. You can think of it as, we do things like CVE curation in some of our paid-for products. So, we can give people the ability to say, "Look, these packages or these other bits, they have known vulnerabilities, but we're okay with those vulnerabilities." Those are things we can mitigate, or they don't apply to us, or we understand them. So, go ahead and use those. But these other CVEs, these other problems - do not use those. This gets back to giving organizations the ability to just have a bit of control over the Wild West. They want to empower their users, they want to empower their scientists and engineers, but they have to have some level of control over that. Yes, it's everything from building from source - building it ourselves so that we know that what you're getting is what we have actually produced, the artifacts from the source code itself - to CVE curation, to making sure that our stuff integrates with other security providers' tools. As a general philosophy, I very much like to force myself to not think competitively and not think in a zero-sum way. So, anytime there's a sense of, "Oh, well, this other person has this thing that we'd like to use too," it's great. How can we work together to give you a better experience, a better solution?
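As a rough sketch of the kind of control and reproducibility being described here - using generic conda features, not Anaconda's CVE-curation products - an organization might pin which channels an environment can pull from and capture the exact builds it contains for later audit:

```
# Restrict this environment to an approved channel and make channel
# priority strict, so packages only come from the vetted source.
conda config --env --add channels conda-forge
conda config --env --set channel_priority strict

# Record the exact package versions and builds in use, so the
# environment can be audited and recreated later.
conda env export > environment.lock.yml
```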
When it comes to things like security and other capabilities, I don't view what Anaconda does as having to - I don't view products that partner with Anaconda, or that are used alongside Anaconda, as threats. Instead, it's kind of growing the pie for everyone. So, we do look and say, "We don't need to do every possible thing for security. We just need to make sure that the pieces that we handle, we handle very, very well. And then we integrate with that other tooling so that people can get the solution that they need." [0:15:09] LA: Great. Well, let's go to that second word, then. I'm beginning to think maybe they're more closely related than I initially thought. By copyright, you imply that you do things to help with copyright validation. I imagine there you're talking about the same sorts of things as far as tracking the source of information and things like that, and making sure that it comes from the original source and is credited back to the original source. But can you elaborate a little bit more on what you meant by copyright? [0:15:37] RF: Yes, so we don't - today, if you look at that kind of stuff, that's not something we're doing today. It's more something that we're very, very interested in and leaning into, and I feel comfortable saying that we're not going to solve that problem, but we want to be part of the solution to that problem. It's really the same thing. If we want to help people innovate, if we want to connect them to open source, if we want to help them do it simply and securely, then we have to help with that problem as well. We actually recently - Peter Wang, who was the original co-founder of Anaconda and its CEO for a long time - again, big shoes to fill - recently stepped over into the role of Chief AI Officer. The copyright and kind of data provenance questions, and making sure that we're all doing the right thing with how we use intellectual property and data in terms of training these models and whatnot - that is kind of a passionate subject for him. It was actually one of the reasons I even came to Anaconda: listening to him talk about those things with such intensity and kind of deep thoughtfulness. So, he is also leading some kind of AI-specific technology and other developments within Anaconda. A lot of this is just around being aware of that problem space and leaning in - everything from Anaconda joining the kind of recently formed AI consortium, to collaborating with people in the legal industry and others, to, again, being aware of this problem and helping it evolve in a direction that we think is going to be best for innovation. I don't think it's right to look at this and say, "Lock it down, because we don't know how to handle it." At the same time, we have to do this responsibly, so that we can keep this kind of commons around for everybody to benefit from. [0:17:05] LA: So, is it safe to say that you are, as a company, skewing towards AI, versus the general package delivery process? Is that a fair statement? [0:17:17] RF: Yes. But we're not skewing in the sense that we are moving away. Our existing business, our existing community, our existing efforts and user base - that is not being abandoned in any way, shape, or form. A lot of this is, as I mentioned, kind of a natural extension of what our users have been asking us to help them solve, or in many cases, what we've already been helping them solve. So, I know you know this, and a lot of listeners are going to know this.
But AI reminds me of those bands that have been toiling for 10 or 15 years, and then they get a hit, and people are like, "Whoa, where did this overnight success come from?" It's like, "This has been going on for a long time." [0:17:47] LA: A long time, yes. [0:17:49] RF: We've been doing machine learning and artificial intelligence almost from the beginning of the company. It's just that the new kind of impact that LLMs have had, and some of the advances they have enabled in conversational programming and chat interfaces and that kind of revolution - I think it's broadened what people can apply it to. It's broadened the number of scenarios, the problems people can actually solve, and stuff like that. So, it feels a bit like an overnight shift. It is, in the sense of those breakthroughs. But it's not like the field of AI hasn't existed for decades. In 1997, I worked as a summer intern for a company called BioComp Systems in Seattle that used neural nets to help people do industrial optimization. People used it for, like, financial stuff, which the CEO was really unhappy with. I remember leaving that job and thinking, "Oh, neural nets are so cool. I'll never see those again." Obviously, I was very wrong about that. Anyway, this stuff has been around for a while, and Anaconda has been working on it and helping people with these problems for a very long time. So, it's really just, I think, the explosion of interest, the kind of addressable market, if you will, the number of problems, that makes it seem like we're, I guess, moving over there. I think instead, it's really more adding these capabilities and bringing them to our community, and then adding all that energy and kind of growing it for everyone. [0:18:59] LA: I've got a story to go along with your early AI story. In '89, I was working for Hewlett Packard, and my boss at the time, along with his boss and a couple other people from the group I was working in, had not long before moved from working on a Lisp machine, which was part of the HP AI strategy. It was all Lisp. Lisp was the AI language of choice. That was back in the eighties. It wasn't long before you didn't hear about that anymore. It's like, "Well, AI obviously isn't going to happen. It's never going to come. This is unrelated to anything that has any value anytime in the future." Then, bam. Well, it hasn't really been a bam. It's just been worked on little by little in the background. Now, suddenly, it's come into its stride, and it really has made some great progress in recent years. [0:19:48] RF: A good friend of mine, a gentleman named Ian Finder, he actually has a Lisp machine. [0:19:53] LA: Oh, really? [0:19:54] RF: Yes. Ian collects computers, and he has a mind-blowing collection of technology. I worked with him for years previously, and he's a good friend. But yes, the Lisp world is definitely a world I'm familiar with. One thing I've tried to keep in mind with anything is that things are never quite as good as they seem, but they're never as bad as they seem either. With hype cycles, it's the same thing. So, AI is going to take over the world? Sure. How do I say it - am I super worried about the impending doom or the impending transformation? No. I don't think it's going to be as good or as bad. But I think you're right. People are going to get caught up in those cycles like they did in the seventies, and the eighties, and the nineties. The disappointment that it isn't as good as they thought it was going to be leads to that disillusionment.
When in reality, it's just a long, steady march towards progress. [0:20:35] LA: Exactly. Yes. AI is going to be a major part of that, just like other technology has been, but no more, no less than that, eventually. We just don't know exactly where yet. So, I was speaking to a group of interns the other day, and their number one question was, "Will we have jobs anymore?" Software interns. Well, a couple of things to keep in mind. One, AI is going to change jobs, but it's not going to eliminate jobs. It's going to change jobs, just like any other piece of technology. But the other thing is, if any job is going to be even more important than it was before in the world of AI, it's software developers. So, don't worry. That actually helped a lot, I think. But I think it's amazing how many people are actually worried about AI at this point. And really, for valid reasons - well, they're valid reasons, of course, but they're not reasons that are going to come to fruition. We just don't know yet. It's just too early to tell for sure what's going to happen. But history has shown us what will likely happen. RF: I think it's that uncertainty. That is what drives the worry. If you have a high-trust environment, if you can rely on your community, your friends, your co-workers, whatever group you're kind of thinking about, and you have that support and you have that trust, I think you can face that uncertainty and that adversity together, and it removes some of the worry. I think when you have a low-trust environment, then you feel much more responsible for solving those problems yourself, and I think that leads to that anxiety of, what if I do lose my job? How am I going to pay my mortgage and feed my family? But I do try to maintain a positive attitude that technological, industrial, and societal progress has made the world better for everyone. I would not trade my life today to go back and be a Roman Emperor. I absolutely would not. As a result, again, I try to keep in mind that, yes, there's going to be change, but we're going to navigate it together. My own purpose, my own kind of goal in life, is to make the world a better place, and hopefully capture a bit of that betterment for me, too. I am a capitalist. But I do want to make the world a better place, and that means - not to be too grandiose - helping the world through this change. Trying to make it so that as we go through this AI revolution, it is better for everyone. But to your point, yes, I absolutely think there are going to be jobs in the future. And I think one mistake I made early on: I took a lot of pride in being like a hardcore C++ developer, and having really low-level knowledge and all this kind of hard stuff. I actually shied away from Python for a long time, because to me, it was like, if the tool wasn't hard, was it a real tool? I had kind of lost the purpose, which is to solve a problem. Why are you doing what you're doing? All of a sudden it was like, "Hey, we're actually trying to solve these other problems. I just want the best tool for the job. Hey, if AI is going to remove grunt work, if it's going to be this network of experts that I can have helping me solve problems and answer questions and broaden my creativity and learn things, like how is that not better for me?"
So, yes, I think as long as you view what you're doing as helping people solve problems, address challenges, do things like that, and you don't get caught up in a specific skill set or tool, or in kind of esoteric knowledge of something being the reason why you have value, then you'll be able to adapt, you'll be able to learn. Whatever the latest tool is, learn it, and help people keep solving problems. [0:23:36] LA: Yes. Sounds like we have a similar background. I spent most of my early career in C++ as well, and I was the expert in C++. Matter of fact - someone actually wrote an article about some of the work we were doing at HP in a C++ journal, because we were one of the first ones to use C++ code in a Unix kernel. That was revolutionary back then. It's not that hard, because it's just, this is how this works. Of course, it's all pretty common nowadays. But it was kind of interesting. I always imagined I would never move away from C++. Now, the thing that moved me away from C++ was a project that ended up pulling me into learning Ruby. I always thought Ruby was such a fake language until I really got to know it. Now, of all the languages I've used, Ruby is by far my favorite, just because it's so easy to do what you want to do. It's not the best for a lot of environments, a lot of projects that I work on nowadays. But at the time, it was really interesting. But it was the language - it's like your Python - that brought me away from C++. [0:24:40] RF: Yes, I totally agree. I remember, I took such pride. I did a ton of that template metaprogramming back in the early aughts, and I was so proud. In fairness, if I may not so humbly say, I actually did come up with a great design for a particularly challenging problem that I was working on at the time. I read all the [Andreescu inaudible 0:24:55] books and all that stuff, and there was so much pride in the difficulty of it. It really was a revelation to say that it's just the right tool for that job, but there are plenty of other tools out there. Yes, I was introduced to Ruby too, and it was funny. Actually, early on, we had Python vs Lisp arguments. But then we used Chef, back when the company was still called Opscode - the Chef automation language - and that was my introduction to Ruby. It was the same thing of like, "Oh, again, right tool for the right job." It really kind of broadened my perspective. [0:25:22] LA: Yes. Makes perfect sense. So, when I was doing research on Anaconda, I found a phrase that I found very enlightening - I shouldn't say enlightening, very interesting - that I don't think really applies now that we've had this conversation. But I think a better phrase might apply. Setting that context, let me tell you what the phrase was. It was "Anaconda is low-code AI development." Now, I don't see that in what you're saying here. But maybe a better phraseology for that same comment might be "faster on-ramp to AI." Get started with a quicker, easier - is that a better example of what you really are doing here? [0:26:04] RF: I think those are two aspects. I don't quite think they're the same thing, although they are related. The low-code part is - so we acquired a company last year called EduBlocks, and it's like a visual programming tool. That was originally around helping people understand and learn Python. It's aimed at students and whatnot. EduBlocks - it's kind of in the name.
But the real focus there - taking the abstraction out of it - is that low-code, no-code kind of composition of capabilities to produce results. If you look at things like, I want to have models and I want to incorporate them in my application, that's just more LEGO blocks that you're kind of using to assemble and build your final structure. But yes, faster on-ramp to AI - that is both what we're trying to do today and, I guess, a guiding principle of the company. That is everything from making it simple - I don't want to repeat myself - to making it simple for people to incorporate, and you're going to see some stuff coming from us later this year, I think, that'll embody those things much more clearly. [0:26:51] LA: Great. I'm very much looking forward to seeing that now. That's great. Talk about standardization. How does standardization fit into your strategy? [0:27:01] RF: In what way? What do you mean? [0:27:03] LA: So, is AI closed or open? [0:27:09] RF: I'm obviously biased in this perspective. But you know what, I'll put it this way. I think open-source ecosystems have clearly demonstrated over the last couple of decades that they really drive innovation. Look at the change in Microsoft. Look at Microsoft under Gates and then under Ballmer - definitely under Ballmer, open source was absolutely a feared term there, and all that kind of fun at the time. Now, it's one of the largest, if not the largest, contributors to open source projects in the world. And it's not just in words. It's actually throughout that company's DNA now. And I think being able to say, "Look, there are billions of people in this world, and we want to empower all those people. Talent is evenly distributed. Opportunity is not." So, let's make opportunity be equally distributed, and let's give people the ability to contribute all that innovation. I don't see how any closed ecosystem, any one company, could possibly hope to compete with that. Obviously, I work at a company. I want companies to be [inaudible 0:28:02] again. I'm a capitalist. But I think the world is a better place when you have that open innovation, and there is absolutely a role to play there. Anaconda does look at this. We say, "Okay, what are the kinds of problems the community doesn't want to solve? What are the kinds of problems that communities aren't good at solving? What are the kinds of problems that we can collaborate with those communities on? And then what are the kinds of problems those communities run into in organizing, and collaborating, and maturing, and operating over time? And how can we help address those problems?" When it comes to that, I guess, I'm a true believer that open source and open collaboration, that's the main driver. Standardization, I think, really helps when it comes to helping to enable that. If you want to get together with a friend and go to dinner, that's trivial. Now, it's maybe a birthday party and you've got 10, 20 people. Takes a little bit more effort. You might have to schedule a time. You have to make a reservation. You have to put [inaudible 0:28:51]. Now, you're doing a wedding, and you've got 100, 200 people. That takes real planning. Then, now, you have like a major concert. You have your Taylor Swift that wants to come to a city and take over. It's just orders of magnitude more collaboration. I think, when you want to have innovation happen at scale, and when you want to enable people to build and to, again, put those LEGO blocks together, you have to have some kind of definition of a LEGO block, right?
If you had 50 different kinds of sizes and connectors and whatnot, no one's building whatever the latest hot LEGO model is. So, I think there's a role for standardization, which is really just people coming together and saying, "Look, let's define the" - you're a programmer, so you get it. You have an API. Let's define the interfaces, and let's define those interfaces so that we are all free to innovate within them, but now we have those ways of collaborating together. So, that's where I believe standardization can play a role. Where I don't like standardization - and I think you and definitely listeners would agree with this - is when it's used for any kind of capture. When it's used for any kind of - in the old days, companies would try to get into standards bodies and make it so that their technology was the only compliant one, or things like that. But again, getting back to supporting open source and supporting the commons, I think there are ways of having those standards that foster innovation and foster collaboration without locking people out. [0:29:59] LA: So, you can imagine open interfaces for AI - that general comment can apply to a couple of different layers within the AI ecosystem. Certainly, companies like OpenAI are trying to create open interfaces to give you access to AI. But there are also interfaces for consistent use of the same datasets, and making datasets available for AIs, and making large language models available for multiple use cases in different ways, and standardization that can happen in those areas - or openness, I should say; maybe that's a better term than standardization given the language you're using. I'm not sure. We can talk about that. But there are different layers there. Where do you think we are in that hierarchy? As far as, we know how to do open software, and we know how to do open APIs. Do we know how to do open datasets? Do we know how to do open large language models? Do we know how to do whatever the next level is there? Are we good at that yet? Are we just starting out? Or do we just not know where we're going yet? [0:31:04] RF: That's an interesting question. I've never been asked that question with that phrasing and framing, which is really cool. I would say, on one hand, I think we absolutely do know what to do. What I mean by that is, you look at how software has been distributed with the different licenses and the kind of evolution of how people open up their code, and the variety of ways that people can do it. You say, "Okay, conceptually, we're probably going to do the exact same thing for data. We're going to do the exact same thing for models." So, I think in that sense, there's a bit - I don't want to say it's common sense, I'm not trying to oversimplify - but I do think that there is a way of saying, "Look, these problems are solved in this domain." Let's just broaden our perspective and say, "Okay, they're probably going to be solved in a similar way." The devil is in the details, though. The part that we haven't gotten right is, what do those data licenses look like? How do we make sure that people who contribute are maybe compensated, or credited, and whatnot? It's not that those technical challenges and those legal challenges and those philosophical challenges aren't there to be solved. But I do think that we can say we've solved these problems before in a different domain, and we can almost just apply those approaches in this new domain.
But I think there's a tendency to look at something new and say, "This is new. So, therefore, it's entirely new. We don't know how to do anything in this area," when in reality, it's just a different view of a problem you've already solved. So, you just kind of have to have that adaptation. But it's going to take a while to shake through all that. Given, again, the massive uncertainty and the kind of changes there, and honestly, the economic impact, it's probably going to be a complicated discussion. [0:32:22] LA: I 100% agree with everything you said there. But there is one aspect with things like data, and large language models that use data, that's different - typically, not all the time, but typically - compared to just code. I hate saying the words "just code", but you know what I'm saying. That is privacy. Whether we're talking about PII or whatever, there's information in data that is specific and valuable just by having the data. Independent of whether you have the right to use it or not, having that information may or may not be appropriate, and there are privacy aspects that go with that. That's different than with code, though. There's rarely a time where code itself - open source code - has to be private. [0:33:09] RF: No, you're absolutely right. [0:33:10] LA: Rarely is that the situation. But that wouldn't be the situation with data. Open-source data might still be private. [0:33:17] RF: Yes. That's exactly it. There are also the emergent risks or problems that come when you - any one dataset may be fine. But you could put three or four of them together, and suddenly now you can identify people. And suddenly, now you can do attribution. So, there is an issue where you get these emergent things that come out of, "Okay, well, I've released my dataset, because it's fine." By itself, there's no PII, or there's no way of tying it back. But then, other people release their datasets, and suddenly, somebody realizes, "Well, if I get datasets A, B, C, and D, now I can do horrible things." Well, how do you deal with that problem? Because at the source, each one of the datasets is fine. How do you - yes, that's exactly a challenge. I remember the first time I learned that, and this is obviously before AI. This is just when people were like, "Oh, I can take IP address information. I can take health information. I can buy datasets from credit card companies, and your shoppers club card for your grocery store, and I can put these things all together, and I can learn all kinds of interesting stuff about you that you did not intend." But again, every dataset provider, in that sense, wasn't doing anything wrong. So, how you then deal with that in this world, where the models can be trained on that data and people can actually do stuff with it, is a very interesting question. I don't have an answer for that one. This is why I love being in this space. Every day, somebody asks me a question and I get excited. [0:34:26] LA: Yes, that's great. I think you would agree that this is an important area that has to be dealt with in open data. [0:34:34] RF: Well, I do. I also wonder how much data is actually not going to be public. I think there are a couple of guiding principles that, I'll just say, I have at the moment - always open to evolving them. But one of them is that small open-source models are going to be great, if not perfect, for many, many people. But they're going to want to be able to control the data. I think the AI revolution has really changed people's view of the data, right?
Like, you see people putting all their content behind paywalls, or locking it behind agreements that you're not going to use it to train models, because they suddenly realized, "Hey, all that data is actually incredibly valuable" in a way that - not that it wasn't before, but there wasn't that direct connection. So, I think you're going to have people keep those datasets private, and they're going to want to train their models internally. They're going to want to govern them internally, and probably run them privately, or at the edge, or in some hybrid fashion. So, I think that is a change. I think you're going to see that what people previously kind of gave away, or at least didn't necessarily govern, they aren't going to do that anymore. That will actually hinder innovation. It will be very interesting to see - does that mean that only companies that have massive resources, the OpenAIs and the Microsofts and the Googles of the world that can license content, are going to be able to train? Or how is that going to evolve over time? I don't know. [0:35:41] LA: It's a great question. I think that's one of the fundamental questions that's going to come with the whole AI revolution. Is AI going to be, in general, giant datasets? Or are there going to be many, many, many small datasets and training that comes from that? I think there's a big world in small-dataset AI. Let me give you a simple example. I write a lot of content - I've got books and articles and everything that I've written. I would love nothing more than to take all my content, put it into an AI, and have a chatbot on my website so people can ask questions and it responds with things that I know. Now, I know I can do this today. I just haven't taken the time to do it. I know there are companies that are looking at, or have started to do, that sort of thing. But that's really what I want. That's not a large dataset problem. It's a large learning problem. But it's not a large dataset problem. Should we be separating the learning, and the ability to learn, in an AI model from the data, to be able to do things like that more easily? [0:36:51] RF: To answer that question, I think so. Because I like to think of these models as almost like expert assistants. A grad student, maybe, that is fresh out of college, or even just a computer science student fresh out of college. If I'm building - okay, I'm a technical co-founder of a startup and I'm trying to build a team out. I don't look for one person that can be the product manager, the technical lead, the system architect, the IT person, the security person, the front-end developer, the back-end developer, et cetera. I look for a team of people and I combine them together. The sum is greater than the individual parts. That's the power there. I feel the same way about AI models. So, I think you're right - being able to say, "Oh, we have these" - you'll see that through mixture of experts and other systems that say, "Let's take these individual pieces and put them together, and actually generate it that way." I mentioned earlier, if you train your model on the entirety of the data on the Internet, you're going to get every piece of fanfiction, and you're going to get celebrity birthdays, and celebrity obituaries, and you name it, all in that model. Do I need those in order to ask it to help me improve the flow of my story, or the grammar, or to explain physics concepts to me?
I'm a big fan of Khan Academy, so I'm trying to brush up on different math concepts. I don't need all that. So, I think you're exactly right. Taking small datasets and small models and putting them together, so I have this team of experts that really enables me and empowers me - I do think that is a large part of the future. [0:38:06] LA: That's great. So, normally about this time in an interview, I ask, what's next? Or, what's the future? But we've been talking a lot about that. So maybe the best way for me to rephrase that question this time is to say: we've talked a lot about where AI is going, but where is Anaconda going next? [0:38:23] RF: Perfect. Yes. So, we did touch on this earlier, but it's taking what we've traditionally done - getting the right bits on your computer, making it manageable and governable, and then helping people solve kind of higher-level AI, machine learning, and data science problems - and expanding that into data and models. I mentioned serverless Python, things like PyScript and whatnot, that allow you to actually execute this stuff using the cheapest hardware, which is the hardware you already own, i.e. your laptop or your phone, things like that. We're also very, very heavily focused on high-performance Python. People often talk about, "Oh, Python is not as fast as C++," or, "It's inefficient," or whatever. And I think, getting back to the earlier point about why - what are you trying to solve? Why is this not the right tool for the job? That wasn't the limiting factor. The limiting factor was getting it into people's hands, making it understandable - again, collaboration, security, all this stuff that Anaconda traditionally focused on. But Python is the lingua franca of AI, AI is central to the world, and those NVIDIA GPUs aren't cheap. They're in short supply. So, helping people get the most out of their investment in their infrastructure is actually a core concern, and I think Anaconda is uniquely suited to help solve Python performance problems. So, it's a category of problems, and it's a category of technologies and approaches. You're going to see a lot of stuff from Anaconda around that, both directly from Anaconda, but also what we're going to foster in the open source ecosystem, in the Python ecosystem - everything from the interpreter, to the language, to the libraries, you name it. Then, you kind of combine those all together, and it's really about making it easy for people to build these models, to incorporate these models, and to deploy these applications fast, efficiently, and effectively at scale. [0:39:53] LA: This has been a great conversation, Rob. I really appreciate it. We're so close to being out of time, but I want to thank you so much for coming on. It's been a great conversation. [0:40:03] RF: Thank you. [0:40:04] LA: My guest today has been Rob Futrick, the CTO - not the EVP of engineering, but the CTO - at Anaconda. Rob, thank you for joining me on Software Engineering Daily. [0:40:14] RF: Thank you, Lee. I always love an exciting conversation. Thank you for absolutely providing one. It's been fantastic talking to you. [END]