EPISODE 1692 [INTRODUCTION] [0:00:01] ANNOUNCER: The size of ML models is growing into the many billions of parameters. This poses a challenge for running inference on non-dedicated hardware, like phones and laptops. Argmax is a startup focused on developing methods to run large models on commodity hardware. A key observation behind their strategy is that the largest models are getting larger, but the smallest models that are commercially relevant are getting smaller. The company was started in 2023 and has raised money from General Catalyst and other industry leaders. Atila Orhon is the Founder of Argmax and he previously worked at Apple and NVIDIA. He joins the show to talk about working in computer vision, building ML tooling at Apple, optimizing ML models and more. This episode of Software Engineering Daily is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him. [INTERVIEW] [0:01:02] SF: Atila, welcome to the show. [0:01:03] AO: Glad to be here. [0:01:04] SF: Yeah, thanks for being here. You're the founder of Argmax. Prior to that, you were working in generative AI at Apple. Maybe that experience led you to starting your own company, or maybe it wasn't a direct result of that, but what, I guess, inspired you to go off on your own and try to create something new? [0:01:20] AO: Yeah. It was definitely a progression. Maybe I need to roll back a bit more. I was doing my PhD at USC, and I was working on data science. At the time, it was deep learning adjacent, so it was interesting enough. Then I looked at the industry and a lot of computer vision applications were taking off. For example, I interned at NVIDIA and there were a lot of video frame prediction, deep learning super sampling, and other amazing technologies that people were working on. When you get exposure to that, it had me thinking like, I should actually go to the industry and start shipping this stuff, because it looks like the technology is starting to become ready to be used in practice. That was just the background thinking in my head. When some recruiter from Apple reached out, I just said, okay, I'll just join Apple and see how this technology can be commercialized. In the first few years, what we did is exactly that. We were this applied research and development team, where we trained some models, we showed them to the product teams and we partnered with the product teams to put them into production. What does production mean? There are a lot of first-party apps, like Camera and Photos on iOS and macOS. You can imagine a lot of technologies that power things like better image quality when you take a picture. For example, if you know where people are, where the sky is and where people are looking, things like this, if you can predict them, the camera will be able to take advantage of these predictions and enrich the image quality. Similar with Photos: if you're trying to search for something and you don't know when you took that picture, or where you took that picture, you can intelligently search for your photos and things like that. At least in the last few years, I focused a lot more on inference, and especially inference on device. Because if you consider the Apple context, it's a privacy-first company. It's in the branding, it's in the marketing and it's in the philosophy. You'll hear Tim Cook talking about it as well. Naturally, when we build ML technologies, especially the inference aspect of it, on-device is a perfect fit for that.
Obviously, there are other aspects like, if you're training, there's data privacy and there's a whole different set of things there, so I'm not even getting into that. If that's the context in which we're building engineering teams, trying to scope projects and ship, I sort of pushed my team and myself towards a scope where we're building these tools for internal developers, as well as using these tools ourselves to build models and ship them in the operating system. I would say, in my technical time, I was much more immersed in taking a look at what open source is doing. For example, in late 2022, Stable Diffusion came out and everybody was so excited. If you look at the internet, everybody was trying to deploy it. There were a lot of very interesting attempts. I realized like, okay, there's actually a lot of attention here and there's a lot of developer tooling needed to make it much easier for these people to do what they want to do. At the time, we were starting to actually be able to share some of the projects that we did internally as open-source code as well, because it's actually very useful for Apple's ecosystem, if you can create a happy path in developing for Apple platforms and share that happy path, which is tested to be working and performant, with external developers, right? That's what we did initially with ANE Transformers, which is a reference implementation of the transformer architecture for the accelerators that everybody has on their iPhones, or Macs, now that Apple Silicon is also available on Mac. It's almost like this accelerator that you should be using if you want to do energy-efficient, on-device inference on these consumer devices that a lot of people have now. That was the starting point. That was the first project we put out there as developer tooling, and we looked at what the feedback was like. It turned out to be interesting that people read the blog post and looked at the code, but it wasn't necessarily obvious, as far as I could tell in hindsight, how to leverage that information and tooling to go all the way and deploy these complex, large scale, bleeding edge foundation models, right? Think about the time: I think Llama had probably come out by then, Stable Diffusion was much more in vogue, and Whisper from OpenAI had just come out. That was also getting traction. We picked Stable Diffusion because I think that was, I don't know, maybe it was the zeitgeist. I don't remember how we picked it specifically. But we optimized it, using the same tooling, and I guess, we added a bit more tooling to it. There were some issues that we had to fix, and I think in two or three months, we released another project, where we deployed Stable Diffusion on Mac, iPad and eventually, iPhone. That got much more traction, because now this was this out-of-the-box available tooling that you could just depend on and build applications on top of, as opposed to using it as a building block first. Maybe the first thing that we released was the engine, but the second thing that we released is like a car that's ready for branding, where you just put the user experience on top and start experimenting in production. Yeah, when we did that, because of the added visibility on the project, a lot more developers took notice, as well as a lot of enterprises took notice. We started getting inbound like, "Oh, I didn't know this was possible."
All these large, complex architectures are really runnable on device with this much tooling provided by Apple. We started talking to different companies. You can imagine, I shouldn't name any names, but some anonymous number of millions of users. This is actually being considered in production at scale. That's when I realized there's commercial value here, as opposed to just being useful within the Apple ecosystem context for apps, because, I guess, that was my decision point of like, "Oh, okay. There's actually a real market shaping up." Even though it's relatively early, if you build the technology now and build a platform now, we can both help create that market and make it bigger than it might otherwise be. [0:07:44] SF: Yeah. It sounds like, essentially, based on some of these experiences at Apple and some of these preliminary projects, you saw that there was value in being able to bring some of these generative AI models to run directly on a device, both because that plays into Apple's stance and philosophy around privacy, since it's going to be privately run on the device. You can keep inference there and so forth. Then on top of that, it sounds like there was also a commercial interest. I guess, that combination of factors led you to starting Argmax. What is the vision, essentially? Like, let's figure out a way to create a market around running models on device? [0:08:22] AO: Yeah, exactly. The vision is, some developers and some companies already see the value, but there are other things that are holding them back. For example, maybe they're waiting for the software stack to mature. They're waiting for the consumer hardware fleet to become a bit more capable. They're waiting to see what the commercial spectrum of models, like the commercially relevant spectrum of models, will be in terms of the smallest size and the largest size over time, because it's also very dynamic, and so there are multiple moving parts. I think the reason not a lot of people decided early on to take a bet on the technology and directly put it into production is that it's unclear, at steady state, whether it will be a feasible problem. For example, if the smallest commercially relevant model is 100 billion parameters and the devices are not getting stronger at the rate that everybody anticipated, there's going to be a mismatch. In that scenario, it's not great. Our thesis is the commercially relevant model spectrum, like, let's say, Llama 3 that just came out yesterday, is going to expand in both directions. The largest models are getting larger and the smallest models that are commercially relevant are getting smaller. One good anecdote from yesterday's Llama 3 release is that Llama 3 8 billion, which has, like, 8 billion parameters, is almost as good, if not better. I don't recall the exact ranking, but it's pretty much in the same ballpark as Llama 2 70 billion, which is almost 10X larger and came out less than a year ago. With that rate of progress, we're seeing capabilities emerge on the frontier, which is more than 100 billion parameters. That's for the cloud. You should probably deploy those on the cloud. Even if you can deploy it on the device, I wouldn't - that wouldn't be a portable thing from an energy efficiency perspective and things like that. On the lower end, we see that 7 to 8 billion parameter models are getting even more capable and almost outperforming last year's frontier models.
We think it's going to go in that direction and maybe converge towards probably more than 1 billion parameters, but probably less than 7 billion parameters as well. There's a huge intersection between what the devices are already capable of and where the commercial spectrum is expanding. We want to accelerate that convergence by means of advanced software. The software that we built can create an arbitrage, where what is currently available today out of the box in open source is the baseline of what you can do if you wanted to deploy your model on device. You want it to be more performant, frictionless, more energy efficient, using less memory, running faster, so all these axes. You want to improve a lot of that, so that the adoption of on-device inference accelerates. I think in steady state, though, just to maybe call it out early, I don't think it's pure on-device, or pure cloud. We think it's going to be a hybrid future, where based on the model scale, or the application, or the deployment context - where maybe it's very low-end devices, because regardless of how many years pass, the lowest-end consumer devices will not be as powerful - I think it will be hybrid based on the deployment context, but we just want to make sure on-device can reach its full potential. [0:11:52] SF: A lot also depends on use case, because I would think that for specific use cases, for a specific application domain you're trying to solve for, you don't necessarily need the 70 billion parameter model. It's just more power than you need to solve for certain things. That way, if that's the problem you're trying to solve, and it's a fairly small, self-contained problem, you can get away with doing purely on-device. But then, there's potentially other applications where you do need this heftier model and then you have to go to the cloud. Or maybe, like you're saying, there's this hybrid model where for certain aspects of your application, or certain aspects of trying to do something with AI, you can keep that on device, get some of the value that you get by doing that on device. Then for things that maybe require a larger model, more compute, you can go to the cloud, so you have this hybrid system. [0:12:43] AO: No, totally, totally. Actually, I can share a few anecdotes along these lines. For example, let's talk about an application context where on-device might be advantageous, right? Think about code autocomplete, right? A lot of these co-pilot products - there's one from Microsoft and GitHub, and there are a lot more from Replit, Codeium, and Tabnine and Supermaven, which I use. There are different features that are powering these co-pilots. There are chat features where you talk about the repository and you ask certain questions for debugging, and the baseline feature set is, as you're typing your code, it predicts and suggests autocompletions of multiple lines, or a single line, and things like that. The latency budget for being able to submit a prediction, get the result back and present that to the user is in between two keystrokes, right? Because there's a very high frequency triggering of the model, and these models are usually not that big either. I think some of these companies publicly talk about their technology. I think Replit's is open source, actually. These are 7 billion parameter models, or less. The latency budget, again, is probably less than 50 to 100 milliseconds for you to be able to receive these predictions in between two keystrokes.
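To make that latency budget concrete, here is a minimal back-of-the-envelope sketch in Python. The typing speed and network round-trip numbers are illustrative assumptions, not figures from the conversation; only the 50 to 100 millisecond budget comes from the discussion above.

```python
# Back-of-the-envelope check of the inter-keystroke latency budget for code autocomplete.
# All inputs below are illustrative assumptions.

TYPING_SPEED_WPM = 90    # assumed fast typist, words per minute
CHARS_PER_WORD = 5       # common convention when converting WPM to keystrokes

keystrokes_per_second = TYPING_SPEED_WPM * CHARS_PER_WORD / 60
inter_keystroke_ms = 1000 / keystrokes_per_second
print(f"Time between keystrokes: ~{inter_keystroke_ms:.0f} ms")  # ~133 ms

# A cloud completion has to fit its entire round trip into that window.
NETWORK_RTT_MS = 60      # assumed mobile/Wi-Fi round trip
budget_for_inference_ms = inter_keystroke_ms - NETWORK_RTT_MS
print(f"Remaining budget for queuing + inference: ~{budget_for_inference_ms:.0f} ms")
```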
If you can't fulfill that - for example, prediction one was submitted, let's say it's on the cloud and there's a round trip over the network. Let's say the GPU gave you the result in zero milliseconds, ideal scenario. It's infinite speed, right? Then you have to come back and present that to the user. Even before you could do that, there's a new keystroke and there's a new prediction, so your previous one is invalid. Some of these companies are actually building request invalidation infrastructure, so that you don't waste GPU time by fulfilling predictions that are already stale. I think Codeium's - I heard on a podcast from a few Codeium people that they invalidate more than 90% of the requests that reach their server. In hyper-latency-critical applications, where on-device can actually give you faster-than-network-round-trip latency, this is a context in which it's useful. [0:14:57] SF: Now, would you also, if you're using something like a public model there, or you're running on a GPU, I mean, that's going to get pretty expensive, if you were doing that round trip for every keystroke and actually running it. Then if it gets invalidated, then it's like, throwaway costs, essentially, as part of that inference process. [0:15:15] AO: Right. If you actually fulfill stale requests, yes, it's very wasteful. I think people started figuring out how to skip those stale requests. This is actually an extreme scenario, where a product is actually triggering inference with every keystroke. It's probably somewhere in between, where it's not every keystroke, but it's something close to that. I don't know their exact deployment context, but this is something I heard from Codeium people that was interesting to me. [0:15:40] SF: Yeah. With relying on the cloud, you have these challenges around variable speed, potentially reliability, user privacy, the latency, essentially, of the round trip. When you're doing this on device, what are some of the new problems that you're introducing by having to do it on device? [0:15:57] AO: Yeah. There are a few ways to look at it. If you look at it from our roadmap perspective, the hardest thing is multi-platform, because you have a lot of different devices running different operating systems, based on different silicon vendors and different silicon generations. Even though there are some huge chunks of the user slice that you can address - for example, if you support Apple Silicon, it may amount to somewhere between 20% and 40%, if not, for some companies that are single-platform apps, 100% of their user base. But when we go talk to an enterprise, it's seldom the case that by selling them an Apple Silicon point solution, we'd support more than 50% of their user base. It's a matter of, how do we build the performance required? If there's a latency target, can we hit that target on all the platforms and on the devices that are out there? For example, let's say it's iOS. iOS 17 supports all the way back to, I think, iPhone XS, which is by now six years old or so. We have to hit the latency budget on that device, as well as the newest device. That's one of the challenges, where even if iPhone 15 is fast enough, it's actually irrelevant, because we have to be fast enough on iPhone XS. [0:17:16] SF: Basically, you don't have control over the compute environment, because you're relying on the person's device. You've got to, essentially, cater to whatever the lowest grade performance potentially is going to be. [0:17:25] AO: Exactly.
That actually separates a POC from a production system, where a lot of demos can be shared on iPhone 15, which signals the future, because at some point, iPhone 15 will be the oldest device and that will be an interesting thing to look at. For now, you have to care about the older devices more. Also, cross-platform. When you go to Android, can you get similar performance? If not, your product is heterogeneous in terms of what you're offering. Cloud's main benefit is uniform performance across the board. It doesn't matter if it's a $10 Android, or a $3,000 Mac, you get the same performance. [0:18:00] SF: If you can get a model to run on a device, does that create new use cases for AI that are maybe not possible through more of a traditional server-side AI deployment? [0:18:11] AO: Yes. I think there are multiple facets to this. Maybe we can explore one or two. One is real time. Really, some of these applications - code autocomplete was one, and real-time transcription and translation is another - where latency is critical, it's just a different user experience. If there's on-device inference being offered to the developer, they will architect their product in a certain way, versus cloud inference, where they might make some concessions and there's a different user experience. Another example is, obviously, true data privacy. There's a lot of personal context on device. If you're not exfiltrating data, you can stay on the device and personalize these models by means of either conditioning on it in the prompt, or fine-tuning overnight, things like that. It actually enables different tiers of personalization, I would say. That's the second perspective. I'm sure there's more, but, yeah, that might become an exploration exercise. [0:19:16] SF: How do you actually get one of these models on the device? Does it require compressing it in some way? Are you giving up certain things, like accuracy, or breadth of the model by getting it to go onto the device? [0:19:30] AO: Right. I think, again, it will depend on which model we're looking at and which device fleet we're looking at, because the pristine model version is considered to be 16 bits of precision, called float16. As long as you keep the same precision and you don't drop any part of the model, there's a reference way of implementing it and there's a reference way of running inference, right? Any deviation from that has to be requalified. Here's what I mean by that. Let's say we're trying to deploy a 7 billion parameter model on a phone, and 7 billion parameters will be roughly 14 gigabytes in float16 precision. That's not the only memory that you need. Obviously, as you're running it, there will be runtime activation memory that you have to account for, but just for the purpose of a simple discussion. You don't have 14 gigabytes on a phone, right? You have to somehow reduce your memory consumption. The total amount of memory that a phone has is not even your budget, because a lot of it is being used by the operating system and other apps. In the context of a single app for a single use case, even if you have access to more, you should probably assume that more than 2 gigabytes is not good for system health, on a recent iPhone, let's say. If you're trying to get below 2 gigabytes, it's a huge compression factor, right? There are different ways of compressing models. If you look at the literature, they'll tell you, oh, there's quantization, there's pruning, maybe distill into a different model.
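A minimal sketch of the memory arithmetic described above: the 7 billion parameters and the roughly 2 gigabyte per-app budget are the figures from the conversation, and the rest is standard bytes-per-parameter math. Runtime activation and cache memory are ignored here, as in the discussion.

```python
# Rough weight footprint of a 7B-parameter model at different precisions,
# compared against an assumed ~2 GB safe per-app budget on a recent phone.

PARAMS = 7e9
BUDGET_GB = 2.0  # figure from the discussion above

for bits in (16, 8, 4, 2):
    weight_gb = PARAMS * bits / 8 / 1e9
    print(f"{bits:>2}-bit weights: ~{weight_gb:5.2f} GB "
          f"({weight_gb / BUDGET_GB:.1f}x the 2 GB budget)")

# float16 comes out to ~14 GB, so roughly a 7x reduction is needed just to fit
# the weights, which is why sub-8-bit quantization comes up for on-device LLMs.
```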
All of these things actually change the behavior of the model. It's a last resort that we don't do by default, but we develop techniques that are as lossless as possible. That's a bit of a misnomer, because there's a single definition of lossless, right? So, as close to lossless as possible. There are different tiers of qualifying quality. For example, I think one common practice that I don't really agree with is, after compressing the model, you compute something called perplexity, which is a summary of how well the model is still doing language modeling. If that didn't change as much from the original model, people are like, "Okay, maybe this is still good enough. I should put this into production." In reality, there are dozens of different capabilities of the model that are impacted in varying degrees. For example, there's in-context learning, question answering, retrieval capabilities, right? All of these things, even more that I'm not enumerating, are impacted, or not impacted, to different degrees, and providing this observability is also very important. With Whisper, for example, when we were compressing it, we compressed it by, I think, 3 to 4X by means of some quantization techniques. We actually re-validated the model on the test sets that it was initially benchmarked on. Well, Whisper is relatively straightforward, because it has a certain task. It's supposed to translate, or transcribe. There's one correct answer and there's one capability. We had a relatively easy time re-qualifying it. If it's a language model and you're still relying on its general-purpose capabilities, then it's a different story. If it's a single-task language model, then you know how to re-benchmark that single task, and then I think you should have a relatively easy time as well. It's a matter of, is it a general-purpose model that's being used across different use cases, or is it a single-purpose model? In which case, I think, compressing it and squeezing it into certain hardware is going to be much more productionizable, I would say. [0:23:11] SF: Yeah. Let's talk about WhisperKit, which was the first released project from Argmax. You alluded to some of the stuff that you were doing there, but it's basically an on-device version of the Whisper model, which, for those that aren't aware, is a translation and transcription model from OpenAI. I'm assuming you focused on that because, one, it's probably easier to test whether the compression is actually still accurate. Then also, back to what we were discussing before around what are some of the use cases that play better on device: real-time translation. Like, if you want to do real-time translation, you have to do that on device. There's no way to do that otherwise, and that's a great use case. If I could be speaking in English to you and you hear it in German, or something like that, then that would be tremendously valuable and super useful for the future of communication in the world. I'm assuming there are reasons why you focused on it, but what were you hoping to learn from that project? Also, I'd love to get into the specifics of how you got that to actually run on device. [0:24:10] AO: Of course. But before I answer that question, there's a funny thing I need to tell you. With WhisperKit, we also built an example app. We show how fast it is in practice, and then it's also to encourage developers to take it up. We put that app on TestFlight, which is Apple's beta testing platform. It's just a test app, right?
It's clear, we signal that, and it's meant for developers. Then, I think, one of our tweets said something like, "Oh, we improved WhisperKit XYZ, and this is doing real-time transcription and translation." Then it went viral in Japan, and a lot of people thought it was a product, and a free product. They downloaded it, and they shared all kinds of videos saying, "Oh, this is very useful. This is the product I've been looking for." We were trying to clarify in the comments, saying, "This is a test app. This is for developers. Please, go to our repository and check out our on-device inference." And it was like, "No. I'm a consumer. I'm going to use this." [0:25:06] SF: They were finding that through the TestFlight implementation and getting it running on device? [0:25:11] AO: Yeah, yeah. I think some channels got mixed up on Twitter. Yeah, anyways, that was just a fun anecdote that happened a couple of weeks ago. Yeah, the reason we picked Whisper is it's a very useful model to begin with. A lot of products already rely on it. It's actually many more products than I realized, because I was tracking the existing Whisper implementations. I think there's a really nice implementation by Georgi Gerganov, the GGML project, and the recipes built on top of it. There are other server-side implementations that are really fast. It seemed to us that almost all of them focused on batch offline transcription, and their performance metric was how many seconds of audio can you transcribe in one second, right? That's this high-throughput scenario, where they're not talking about latency, because you don't get an answer right away. You submit thousands of audio files, and how fast do they come back? On average, if it's 10 hours of audio, how many seconds, or how many minutes did it take before I got all the results? It's not, how many milliseconds did it take for me to get the current result, right? It's a very different problem. We thought there needs to be a solution, using all the best practices that we established so far and building on top of them. There needs to be an easy-to-use framework that leverages an optimized Whisper implementation and makes it easy for developers to put it on device and integrate it into their apps, right? We also use it ourselves. That was the first reason: Whisper being interesting, and a lot of the existing implementations focusing on offline versus online streaming applications. The second one was, if you don't consider the particular name Whisper, under the hood, it's just a transformer encoder-decoder, right? We were interested in building generic tooling to deploy transformer models, because we see a lot of these image generation models, transcription models, LLMs and other modalities, even a lot of the computer vision models, converging towards this architecture, where any optimization that we build on top of the transformer is amortized across all these domains, or application areas, that we can go into next. We built a lot of tooling while developing WhisperKit, and we're already seeing that it's useful for a company where, for example, we get a customer model and it's an LLM. We use the same tooling that we built for Whisper and apply it to that.
It's a huge efficiency gain, and the platform that we're trying to build is this platform where you onboard your model, and as long as it can be canonicalized into a transformer, then a lot of our tooling is going to work out of the box and give you the same performance that you saw with WhisperKit, for example, even though you have nothing to do with Whisper, right? That was the second reason. The third reason is, to be honest, it was small enough for us to deploy it in two months. The first version we put out was two months after founding the company. We wanted to put something out fast to get feedback and also to build some developer community, our presence on GitHub and things like that, as you can imagine. If we had picked a large model, and if we wanted to be proud of what we did, I think we would have needed much more than two months to go to market with that. But with Whisper, I think it's still large enough for it to be interesting. I think the largest model is 1.5 billion parameters. It's near the commercial scale, even if you consider the LLMs. But it was also small enough for us to not have to invent something new to deploy it. Even though we have some novel compression techniques on top of it, we didn't have to - I would say, if it was 5X larger, I think we would have had to work on it for several more months. That's pretty much the summary. [0:29:04] SF: What were some of the ways, the techniques that you used in order to compress it? I think before, you had mentioned that you were doing some quantization, but also, that can potentially change the model in some way. I guess, through testing, you can make sure that you're not giving up a certain amount of performance with the model through that technique, but what are some of the other techniques that you employed? [0:29:26] AO: We actually used it as a testing ground. We developed a few techniques, which were giving us different trade-offs. For example, we built a technique where, I mean, I should probably gloss over the details here, but you remove the outliers and you compress with very low bit precision. For example, 1, 2 and 4 bits. Those are natively accelerated by the hardware that we're looking at. That was pretty accurate, but it was relatively slow, because we had to do all that filtering online, where outliers are processed differently and inliers are processed differently. It's almost like two models in parallel. I think it was 20%, or 40% slower. We decided to say, okay, even though that's accurate, we're trying to make a point on speed. Let's put that aside. One of our teammates, Arda, who was an intern at the time, developed a technique where - there's a distinction between post-training compression, which is you take a model and you compress it and you don't correct for your compression, and that's your artifact, right? That's relatively easy. There's no training involved, or further fine-tuning involved. That's what a lot of people do. Then there's the other approach, which requires more investment, where you actually compress the model and fine-tune the model to be robust to that specific compression. Because you introduce some compression artifacts and compression error, and the model can adjust to that if you let it see more data and readjust itself. We wanted to be in the position where, as a developer platform, we wanted to be self-service. If it's a fine-tuning-based workflow, even though it's going to give you better results, it's involved. You have to have your own training data. You have to set up a fine-tuning job.
Maybe it requires tuning the fine-tuning job itself, because there are a lot of hyperparameters. We don't want that to be an involved process. What Arda did is - there's this known technique called QLoRA from the University of Washington, Tim Dettmers, from last year, where you take a model, you quantize it, and you add small adapters, like low-rank adapters, which turn out to be much smaller than the original network. You only train those small adapters, as opposed to the model itself. The model is frozen, and you end up having these tiny layers on top, where the model is actually recovering the original quality. Even for that, you need the data. What we did is, what if you don't need data? What if you remove the data set, and just train it on random noise? What is the target now, right? Because if you have data, you know what the target is, and you can try to reach that target. If you have text, and you know the next word, the next word is your target. In random noise, there's no next word. What we did is take the random noise, give it to the original model, and the original model will give you some prediction. It's not meaningful, because its input is not meaningful, so its output won't be meaningful, but it's still the same input-output mapping that we're interested in. If those are your input-output pairs, then you can automatically apply this fine-tuning-based workflow on different models. Without the platform having to request any training data from the developer, without having to request any tuning from the developer, there's just a fine-tuning workflow that the model automatically goes through, and it gives you a high-quality compressed model back. [0:32:54] SF: To make sure I'm understanding this correctly: you're compressing through quantization. But then, to make adjustments for the loss of accuracy, you're doing, basically, a round of training, or fine-tuning, to match the random noise input to what output the original model would generate. Then basically, you're fine-tuning the compressed model, so that it generates a similar output now that the parameters have been quantized. [0:33:20] AO: Correct. Once we do that, we go back to the original test set, because we can't trust the performance on random data, right? It doesn't necessarily tell us anything about the real data performance. We go back to real data, and it turns out, you actually recover the performance on real data. That was surprising to me, at least, because he was doing that as our intern, and I was doing some other technique, and his model turned out to be better, so we're just using his instead, for example. [0:33:46] SF: Good intern. [0:33:47] AO: Yeah. Full-time now. [0:33:51] SF: You've also got significant speedups in both the audio encoder and the text decoder. Can you walk through some of the optimizations that you discovered there and implemented? [0:34:01] AO: Right. That is the weird part, where those things are specific to the Apple Neural Engine. Early on, that's an interesting bit of the project. But as time goes on, we want that to be less interesting, because as we go into more platforms, these techniques that are specific to the Apple platform will be less amortizable. I'll go into the details, but just to contextualize these improvements: the compression techniques will be useful on Android, Windows, Linux, Apple, whatever, because those are on the model itself.
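Going back to the data-free recovery idea described a moment ago, here is a minimal PyTorch sketch of the concept: freeze a (simulated) quantized copy of the model, attach small low-rank adapters, and train only the adapters so the compressed model matches the original model's outputs on random inputs. This is an illustration, not Argmax's pipeline; the toy quantizer, module sizes, and the simple output-matching loss are all assumptions made for the sketch (the conversation describes matching the original model's predictions in a fine-tuning-style workflow).

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """LoRA-style adapter: a small trainable correction around a frozen linear layer."""
    def __init__(self, frozen: nn.Linear, rank: int = 4):
        super().__init__()
        self.frozen = frozen
        for p in self.frozen.parameters():
            p.requires_grad_(False)
        self.down = nn.Linear(frozen.in_features, rank, bias=False)
        self.up = nn.Linear(rank, frozen.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op correction

    def forward(self, x):
        return self.frozen(x) + self.up(self.down(x))

def fake_quantize(linear: nn.Linear, bits: int = 4) -> nn.Linear:
    """Simulate low-bit weight quantization (round-to-nearest, per-tensor scale)."""
    q = nn.Linear(linear.in_features, linear.out_features, bias=linear.bias is not None)
    with torch.no_grad():
        w = linear.weight
        scale = w.abs().max() / (2 ** (bits - 1) - 1)
        q.weight.copy_((w / scale).round().clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale)
        if linear.bias is not None:
            q.bias.copy_(linear.bias)
    return q

# Tiny stand-in for the original, uncompressed network.
original = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 64)).eval()

# Compressed model: quantized copies of the weights, each wrapped with an adapter.
compressed = nn.Sequential(
    LowRankAdapter(fake_quantize(original[0])), nn.GELU(),
    LowRankAdapter(fake_quantize(original[2])),
)

opt = torch.optim.Adam([p for p in compressed.parameters() if p.requires_grad], lr=1e-3)
for step in range(200):
    x = torch.randn(32, 64)          # random noise instead of a real dataset
    with torch.no_grad():
        target = original(x)         # the original model defines the target
    loss = nn.functional.mse_loss(compressed(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final output-matching loss on random inputs: {loss.item():.4f}")
```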
The way in which we accelerated the text decoder and the audio encoder was through working with the Apple stack, where, as a byproduct of having worked on the Apple Neural Engine and on how to accelerate models on it, we had a mental model of how changes in the user code - which is your PyTorch code, how you define the model, even before you consider deployment; let's say, that's the developer's interface to the model - if you change it in a certain way, we know what kinds of impact it will have on the underlying final model asset. Because you start with a PyTorch model, then you convert it to CoreML through Apple-provided tooling, and then CoreML under the hood does a lot of different conversions that are not transparent to the developer, and that eventually creates a binary that happens to run fast on Apple Silicon, right? What we did is we changed the user code where we know the underlying binary that will be generated by Apple tooling will be as fast as possible. You can consider that the V2 of ANE Transformers, which was this reference implementation of the transformer that we put out there in 2022 as part of, I think, WWDC22. [0:35:51] SF: You also did some work around pre-computation for the KV cache to reduce some latency there as well. I guess, first, can you explain what the KV cache is? Because I'm not sure that everybody listening is going to necessarily be familiar with its role in inference. But then, how did you come up with the optimization solution and some of the things that you did there to help with the real-time transcription performance? [0:36:13] AO: This one is actually a bit funny. This is very specific to Whisper, where you have to tell Whisper what you're looking to do if you want it to do a particular task. For example, which language you want to translate from, or do you want it to translate, or do you want it to transcribe? Because, for example, if you're speaking in Japanese and you want to transcribe in Japanese, you have to tell Whisper, "Oh, I just want transcription. I don't want translation." That's a token. There's a special token for that. Even if you don't tell Whisper, it will try to predict which language you're speaking and then start doing its own task, but you can force it as well. By forcing it, you're actually enabling it to skip computation, right? There are three tokens that Whisper has to generate before it starts producing actual text tokens that are the transcriptions, or translations, of the audio that's coming in. Those actually had to be recomputed every time the audio is updated. If you're trying to do real-time transcription, the audio should be updated very frequently. Ideally, every second, or even more frequently than that. That means you're recomputing these three tokens every second, and you're wasting that, because you could have been computing text tokens instead and catching up with the transcription, right? If you're already lagging, this will add to your lag. What we wanted to do is, can we actually pre-compute this, because it doesn't look like it should be contextual? The fact that we have to recompute it with every new piece of audio implies that the embeddings for these specific tokens that you're computing are contextual. We thought they're probably not contextual. What we did is, on random data again - we like random data - we passed random data through Whisper and computed average embeddings for these three special tokens that describe the task that you're telling Whisper to solve.
We realized that they're actually consistent with the real data embeddings for these tokens as well. We said, there's no reason we should compute this on real data. We should just compute it offline on random data and read this data, instead of computing it. It's just a space-time trade-off, I guess. You asked about the KV cache and what that means as well. I think that's a good call-out, because I usually take it for granted that people implement KV caching, because that's the first optimization you can do for these transformer decoder models that generate data, whether it's text, or other things. The way the architecture works is, because you're sequentially generating new data, every time you generate a token, you predict a token, and that's your output of the model, right? The next time you call the model, it's going to be part of the input as well, because you're trying to build up your sequence over time. Part of the model actually has to do redundant computation if you don't save the results, because you're always recomputing results from historical tokens to be able to create a new token. KV cache is pretty much: whatever you can reuse from previous predictions that are returned from the model, you put them on the side, and the next time you call the model, you're actually referencing previous results, as opposed to recomputing them. I would say, that's the lowest hanging fruit for any transformer decoder inference. Yeah, in this case, we pre-compute part of the KV cache, and it turned out that we could actually compute three more text tokens, instead of wasting that on these special tokens. [0:40:01] SF: Some of these techniques that you used in order to get the performance that you want out of Whisper on device, as well as compress it, some of these feel like techniques that you can probably reapply to other types of models, but some of them are maybe specific to Whisper. As you start to think about deploying, working on other models to get them to run on device, how much do you think that you'll need to continually create bespoke ways, or new novel techniques, for getting performance and compression out of the model that are model-specific, versus more general things? [0:40:36] AO: No, exactly. You already have this dichotomy of, what can we do in our underlying tooling that can be used as a generic optimization toolkit, versus what can we do at the framework level, which I consider to be application-specific - things that are bespoke, that are application context-aware, like deployment context-aware, such that we can also leverage what we know about the use case itself. I mean, obviously, any number that I can give you in terms of percentages will be random, but the goal is to accumulate as many techniques as possible under the hood. Today, it's about transformers. We care about transformers. We want amortizable techniques for those that we can develop on Whisper, but apply to customer LLMs, for example. Again, it has to be a mix, because if you don't look at the application context, if you don't look at the deployment context, then you'll be leaving things on the table. You'll probably be doing something redundant, even though you don't have to. Maybe the abstraction is: things that go inside the model are more generalizable. Things that go around the model - like how the model is called, and maybe there are multiple models that are chained in certain ways, and how you can architect that chaining, and maybe even remove some of the model calls, because maybe they're redundant.
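For readers who want to see the KV cache idea, and the pre-computed-prefix trick discussed above, in code, here is a minimal toy sketch. The single-head attention layer and every name in it are hypothetical illustrations, not WhisperKit or Whisper internals.

```python
import math
import torch
import torch.nn as nn

class CachedSelfAttention(nn.Module):
    """Toy single-head self-attention with an explicit key/value cache."""
    def __init__(self, d_model: int = 32):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = 1.0 / math.sqrt(d_model)

    def prefill(self, prefix: torch.Tensor):
        """Compute K/V for a fixed prefix once; this could be done offline and stored."""
        return self.k(prefix), self.v(prefix)

    def step(self, x_t: torch.Tensor, cache):
        """One decoding step: append this token's K/V to the cache, attend over all of it."""
        k_cache, v_cache = cache
        k_cache = torch.cat([k_cache, self.k(x_t)], dim=1)
        v_cache = torch.cat([v_cache, self.v(x_t)], dim=1)
        attn = torch.softmax(self.q(x_t) @ k_cache.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v_cache, (k_cache, v_cache)

d_model = 32
layer = CachedSelfAttention(d_model)

# Stand-in for fixed "task" tokens: because they never change, their K/V entries
# can be pre-computed once instead of being recomputed with every audio update.
task_prefix = torch.randn(1, 3, d_model)
cache = layer.prefill(task_prefix)

# Decode a few tokens, reusing the cache instead of re-running past positions.
x_t = torch.randn(1, 1, d_model)
for _ in range(5):
    x_t, cache = layer.step(x_t, cache)

print("cached keys shape:", cache[0].shape)  # (1, 3 + 5, d_model): prefix + decoded tokens
```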
Those are the bespoke, application context optimizations. [0:42:04] SF: Since you launched WhisperKit, have you seen any interesting uses, or applications of it? [0:42:10] AO: Yeah. Yeah. It's available under the MIT license. It's completely free, as far as the open core goes. Obviously, there are some custom features and better performance offerings that we work with companies on offline. As far as the free offering goes, an interesting user is Detail. Detail is an app. They do video editing on your phone, right? There's an iPhone app where, if you download it, first, you can automatically caption these videos, and then the video will play, and there'll be a caption that's displayed to the user as well. Then you can select the caption and maybe unselect part of the caption. That way, you're able to crop your videos, because all of these words in the captions have timestamps coming from WhisperKit. If those are accurate, you can actually use those timestamps to edit your video. It was an application that I didn't really think about before, but because this technology has both accurate transcriptions, as well as localized transcriptions, you're enabling these different applications coming from video, because the text and image streams are aligned. That was interesting to me. I think a lot of projects and apps exist which are just dictation apps, but they really nail the workflow. They really nail the user experience, the integration with the user's habits, like the user's notetaking habits, or content consumption habits. Those apps are really doing a great job, where they're relying on WhisperKit, or other projects, to run the inference, bring the ML in, and they focus on exactly what I just said, which is the workflows, the integration with the user's habits. For example, SuperWhisper is one where I think they're doing a good job on that. [0:44:05] SF: Then for developers that are listening to this and want to get involved, try it out, maybe make contributions, what's the best way to get involved? [0:44:12] AO: Yes, exactly. Two things. We have Discord, where we're already working with our contributors. We're guiding them, we're doing reviews on their PRs. If there are new people who are trying to understand, "Oh, I'm interested, but what should I contribute? I don't even know what you guys are doing," we actually put out a roadmap on GitHub. It's exactly clear what we're going to be working on for the next three to six months. There are certain involved projects, which might take several months, and there are some getting-started projects, where maybe it takes a week and it gets your feet wet. Those things are definitely there. If you have time and interest, just go to the roadmap. Take one. Come to Discord. Tell us what you're trying to do, and we'll help you. [0:44:57] SF: Awesome. Then what's next for Argmax? Are you focused on just continuing to build out WhisperKit and support that, or are you investing in other on-device model deployments? [0:45:09] AO: Yeah. Definitely other models, because this question was coming from a few other people as well, where they're asking us whether we are a Whisper company now. Not at all. Today, what I can tell you is we're working on DiffusionKit, for example, right now, because Stable Diffusion 3 is going to come out, and we're working with Stability AI to become the on-device reference implementation for that. Again, we're seeing the amortization of the tooling we built for Whisper apply to Stable Diffusion. That's great.
But we need to build some other application-specific optimizations on it as well. Actually, to be honest, an increasing portion of our time is going towards our customers, where we're taking a look at their models, and we're trying to hit certain performance goals that they have. We're definitely committed to open source, so we'll keep pushing on WhisperKit, we'll keep pushing on DiffusionKit, and maybe an LLM kit, but I don't know yet. It doesn't make sense yet. Maybe at some point, when it makes sense, we will also tackle that. It's a great question. I'm glad you asked that, because we don't want to be perceived as this Whisper company. [0:46:16] SF: What are you seeing when you're working with customers, and just based on your own experience, in terms of the way people are working with models? Are a lot of people doing multi-model, or public model, private model? What's the trend that you're seeing in the industry right now? [0:46:31] AO: I can tell you two things. One is, there are open models, and there are proprietary models, right? They sound like completely different sets of models. In reality, because open-source models are developed in the open, and their recipes, and their learnings, and all the techniques are out there, it turns out that a lot of the proprietary models are refinements on top of open-source models, as opposed to reimagining things from the ground up. That's not to say anything about novelty, or anything like that. It has to do with efficiency and expediency. It makes a lot of sense. Why is that good for us? Because we get to deploy open-source models, and we get to show that they are deployable, and this deployability actually transfers over to proprietary models, because of the shared recipes. Let's say, we deploy Stable Diffusion 3 and Stable Diffusion XL, which we did, I guess, last year. A lot of companies building products around diffusion technology are actually doing something akin to Stable Diffusion. The performance numbers and the optimizations that we share are usually directly applicable. The other thing I would say is, there's all this notion of, oh, frontier models, hundreds of billions of parameters, and they should be used everywhere, and things like that. What I'm seeing is, if the customer knows what to do, in terms of there's a single problem they're trying to solve and that's their product, they're aware that they should actually fine-tune a model that's probably 100X or so smaller than what these frontier models are. If they don't know, like during prototyping, it makes a lot of sense to rely on frontier APIs and pay a lot of money and do the POC, understand what the limitations are. As soon as you have your own data that you accumulated in production, or maybe even curated or paid for, people actually understand that smaller models, when fine-tuned on the right data, can become viable, productionizable systems that you control. [0:48:37] SF: That's a similar trend to what I've seen and heard from other folks as well. It seems like, at the moment, a lot of people use maybe a proprietary model API to prototype and build demoware. But then, when they start to actually get into building something real for a company deployment, then they're getting into having, essentially, more control. They want more control over the model, and they need to be able to modify it for their purposes, more than what they may be able to do through the proprietary models.
Maybe one nuance is to think about the spectrum of features that can be enabled with, I guess, the default example is an LLM and text applications. The frontier models will increase the set of features that people have in mind, because some things were previously impossible, or not even thought of as something people can do, right? Last year's frontier capabilities become this year's commodity capabilities. We see that there will always be a sliver of features that are probably only powered by the frontier models. Then over time, which is ideally less than a year, the next generation of these foundation models will actually include those capabilities in the smaller end of the spectrum. Whatever last year's 70 billion model can do, next year's 7 billion model should be able to do. [0:49:58] SF: Yeah. I mean, it's like, I don't know, the analogy I think of is the luxury car market, compared to maybe the more commodity car market, where a lot of features start at the luxury car, like heated seats and cameras and all that sort of stuff. Then over time, those become commodity products that are part of a car that you can buy at a reasonable price point and stuff like that. I think we see a similar trend when it comes to things in technology. They start at one level, where it's maybe more proprietary, more expensive, but then through popularity and scale, the cost comes down and they become something that is available to everyone. If you think about the economic incentives, this might be a bit technical, but there's this notion of Chinchilla optimality, which talks about, given a budget of, let's say, compute - you have access to this many GPUs for this many hours - what is the model size that you should pick, so that you optimally spend this compute budget to train the best model possible, right? That, by definition, doesn't consider deployment, because that's just a training budget. Whatever model you get, you haven't traded off the model scale with how much it costs to serve it. That was the original way of looking at it. If you look at common practice today, like the Llama 3 release from yesterday, this recipe of optimally spending your training budget would have told Meta to train Llama 3 8 billion, which is the smallest model, about 75X less. They should have stopped much earlier if they wanted to spend their training budget optimally. Because Meta is not just optimizing for training, but also for deployment at scale, now there's a very different trade-off, where training small models for tens, if not hundreds, of times longer makes a lot of economic sense as well. [0:51:56] SF: That's interesting. As we start to wrap up, is there anything else you'd like to share? [0:52:01] AO: Yeah. I mean, we're looking to - the reason we do open source is that we want to get a lot of developers using our tooling, improving our tooling, and us getting to interact with them. It's great that you called out earlier how to get involved. I'll just call that out again. We have a roadmap on GitHub. We've got these projects on GitHub. You can talk to us on Discord, Twitter. We are everywhere. Don't hesitate to reach out and get involved if you're interested in this stuff. [0:52:31] SF: I feel like, I go to a lot of events, speak at meetups, or organize meetups and stuff like that, and I see a lot of people at those things that are sometimes looking for jobs. Of course, they're interested in getting into gen AI. They'll have questions like, how do I do it?
There's a tremendous amount of these projects that exist right now, where I think a very natural place to start is to start contributing to them, get involved, get your hands dirty and actually learn the craft. No one is going to, essentially, just hand you a job in the AI space. You've got to earn it and get out there and contribute. [0:53:01] AO: Actually, that's a great call-out, because one of my co-founders, I met them through an open-source project. When we open sourced a project at Apple, he started contributing. After three months, I was like, "Okay. When I start a company, I'll work with this person." [0:53:14] SF: There you go. [0:53:15] AO: We're also hiring. That's a good segue. [0:53:17] SF: Okay. Yeah. [0:53:18] AO: We can maybe put some links in the show notes, or? [0:53:21] SF: Yeah. Yeah, we can do that. [0:53:22] AO: Okay. All right. Sounds good. [0:53:24] SF: Awesome. Well, Atila, thank you so much for being here. This was a really fascinating, interesting discussion. [0:53:28] AO: Yeah, it was great. Thanks for having me. [0:53:30] SF: Cheers. [END]