EPISODE 1811 [INTRODUCTION] [0:00:00] ANNOUNCER: LiveKit is a platform that provides developers with tools to build real-time audio and video applications at scale. It offers an open-source WebRTC stack for the creation of live, interactive experiences, like video conferencing, streaming, and virtual events. LiveKit has gained significant attention for its partnership with OpenAI for the advanced voice feature. Russ d'Sa is the founder of LiveKit and has an extensive career in startups. In this episode, he joins Sean Falconer to talk about his startup journey, the early days of Y Combinator, LiveKit, WebRTC, LiveKit's partnership with OpenAI, voice and vision as the future paradigm for computer interaction, and more. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him. [INTERVIEW] [0:01:01] SF: Russ, welcome to the show. [0:01:02] Rd: Hey, Sean. Thanks for having me. [0:01:04] SF: Yeah, absolutely. I'm looking forward to this. I'm excited about all the things that you're doing at LiveKit. I know that LiveKit isn't your first company. You had a couple of swings at the entrepreneurial bat, so to speak. You were even part of an early YC batch. Can you talk a little bit about your background and perhaps, your experience with YC? [0:01:23] Rd: Yeah, for sure. Background-wise, my dad was an entrepreneur as well in technology. He was starting companies in the semiconductor era, in early GPUs, in the late '80s and early '90s. I grew up in the Bay Area. I've been around people starting companies and around technology for pretty much my whole life. Y Combinator was interesting. I was in college and a friend of mine ended up dropping out and joining a company that was just down the street from YC's Mountain View office. At this point, YC had started in Boston, in Cambridge really, and had then opened up an office in Mountain View. My friend walked over one day and said, "Hey, how do we be part of this?" They invited us to a dinner. I think that was for the second batch. We went to dinner and met a bunch of founders and decided we were going to apply to YC. We didn't get in our first time, but then the second time we applied, I think it was for the fifth batch, summer '07 was when we ended up getting accepted into YC and joined a group. I think it was 18 companies in that batch. Everyone was scared, because it was a jump from the previous batch size. I think the previous batches, one through four, were all maybe 10 to 12 companies. Maybe even the first one was eight. This was 18, and so we weren't sure how that was going to work. We moved to Boston, to Cambridge. I think we were the second last batch in Cambridge and started a company there. In my batch, who was in that batch? Drew and Arash from Dropbox were in that batch. Immad from Mercury was in that batch. I think, YC today and YC back then, both valuable propositions for an entrepreneur, but also very different. This was in 2007. There wasn't a lot of information on the Internet, but also, it just wasn't normalized, outside of maybe Zuck, being a young founder coming out of school and figuring out how you actually start a company and build a company from the ground up. At that time, the iPhone platform was just announced, sorry, actually the iPhone just came out that summer. [0:03:36] SF: It wasn't even a - [0:03:38] Rd: Yeah. There wasn't even a development platform for it yet. You couldn't build your own apps for mobile. Everyone at that time was just working on a website. 
I guess, Drew was working on a desktop app for storing files, but everyone else was pretty much working on a website. It was just a different time. I think what was interesting about my YC experience that maybe isn't necessarily the same as it is now, just because it's a different beast, but back then, you found your tribe in YC. It's 18 companies. When you're in college, there weren't a lot of people at that time who wanted to start companies, or were thinking in that way. Everyone was, "I'm going to go get a job at Google. I'm going to go get a job at Microsoft." Being part of YC, it was much more like a small family of like-minded individuals who were trying to go and do this thing against the grain of what young kids out of college are supposed to do. Amazing experience, one of the best in my life, but yeah, YC now is different. Valuable, but different. [0:04:32] SF: Yeah. Absolutely. I vividly remember those days. I remember Dropbox in the YC batch. That was a time when you could really - if you were deep in this world, it was easy to pay attention to everything that was going on in YC, because it wasn't that many companies. That was also not a hot time, but 2006, 2007 to 2011, 2012 was not a hot time to do a startup. I started a company in 2009, and it was really a time where you're like, "Really? Is that a good idea?" [0:05:00] Rd: Yeah. We were hitting a startup winter in 2007, and I remember that as well. I want to say, one of the top VC firms, maybe it was Sequoia, they definitely published a RIP The Good Times, but that was similar sentiment in 2007 as well that was happening. I think another interesting part about YC during that time was the biggest hit that YC had during that period was Reddit. It was a 10-million-dollar sale to Conde Nast. That was the big exit for YC. I would say that there was more of a sentiment back then around, oh, YC is this startup school that is taking bets on kids, but who knows if this actually works? Maybe it's actually a negative signal that you're funded by YC. Positive signal that it's Paul Graham, but negative that it's this new, unproven thing. Then over time, of course, you had Heroku and other companies that came out that started to do a bit better in the market and have bigger exits, and then YC became legitimized over time. Dropbox, of course, helped with that too, and Airbnb. But it was a very different time. Another interesting tidbit about YC was that we were on the East Coast, and we were pitching all these East Coast investors, and I remember Paul saying like, "Okay. Well, this is a warm-up for the real pitch, which is the real demo day, and that's on the West Coast." After the East Coast demo day, we all flew to the West Coast, and Paul recommended that founders stay on the West Coast. Move to California and build your companies there. At that time, Silicon Valley was still, I would say, largely South Bay Peninsula focused. There was, of course, Twitter in the city, and Salesforce was in the city, but a lot of the smaller startups weren't. I think Justin Kan and friends from Justin TV, they had this apartment in, I think it was called Crystal Tower, somewhere in the city. I don't exactly remember where. But from my batch, all of these startups, I think Drew and I think Daniel had discussed, and a bunch of them moved into Crystal Tower, and it became called, I think, YC Tower, I think was the name, or something like that. 
I've thought about it now in retrospect, and in some ways, I feel like our batch of YC created the San Francisco startup ecosystem. Not intentionally. It just incidentally happened, because all of these founders from our batch were moving to the city and all living in the same place, and everyone knew about this being the YC startup epicenter after you graduated from YC. So, interesting to look back. [0:07:36] SF: Yeah, and I think that, of course, with companies that did well, that maybe had their start in San Francisco, it creates a situation where people take the wrong interpretation of that and think, okay, well, if I want to do well, I also have to be here, and stuff like that. It becomes this thing where people are cloning what they believe to be the recipe for success and so forth. Now, you started LiveKit. How did that come about? What led to you starting that company? [0:08:02] Rd: Well, LiveKit is my fifth company, and the one I did in YC was the first one, and I've had a few in between. It's interesting, because my YC company back in 2007 was trying to do real-time streaming of video over the Internet, introducing two people to one another that had never met before. I don't like to use the Chatroulette analogy, since this was three years before Chatroulette, but effectively, that's what it was. Very few people were doing real-time audio and video through a web browser at that time. We rigged something together with glue and tape, and there wasn't great support within the browser for it. Fast forward to 2012, WebRTC, which is this protocol that was designed specifically for streaming real-time media, started to get built into Chrome and then slowly expanded into other browser implementations as well. Now in 2021, when I started LiveKit, I was returning to a similar type of technology that I was working on in 2007. Now much more mature, and I think the use cases are a lot more clear, too. We started off trying to connect people over the Internet in 2007, and then the pandemic happens in 2020, and everyone's building software to try to connect people over the Internet, because that's the only place you could go to find other people and interact with other people that weren't in your house with you. In that world, it's shocking how little had happened in terms of having scalable infrastructure that made it easy to build an application that connected people in real-time. Yes, you had something like Zoom, which they'd been working on for a decade, and it was very mature and performed very well, but that was at the application level. If you wanted to be able to build something like Zoom, or take Zoom's features and put them into your application, that still wasn't easy to do. There was actually no open-source infrastructure for doing it in 2020, 2021. I was working on a side project that was trying to do something similar, which is connect people over the Internet in real-time, this time, though, for audio only, no video. I struggled with the same thing. There was not really great infrastructure to do it. There were commercial providers, but they were really expensive, and they didn't scale to really large numbers of people. My application needed to be able to run economically, and it needed to be able to have thousands of people in a session together, potentially. That's when I started to look at open source, realized there was nothing there. Pinged my old co-founder from the previous company, company number four. Fun fact, we met during company number one. 
We started separate companies in YC, and we met in that batch. But we started LiveKit together as an open-source project to build infrastructure that any developer could use to create real-time experiences. [0:10:48] SF: In terms of that infrastructure and that open-source project, I guess, where did you start with that from an engineering perspective? Then, how did you actually go from this open-source project to turning it, essentially, into a company? [0:11:02] Rd: The way that it happened was that the original impetus for working on it was a side project. My previous company had been acquired by Medium, and I was now leading product at Medium. I was working on a side project that was best described as Clubhouse for Companies, where you could have drop-in casual audio conversations with your co-workers during the pandemic. When I went to actually integrate the streaming audio piece of it, what we found when we were looking at open source to integrate into the application was that there wasn't anything that was easy to deploy, that had SDKs on every platform, and that really felt consumer or production grade to use. I liken what LiveKit was trying to do, and is trying to do, to what Stripe did for payments: what Stripe did for payments, LiveKit is trying to do for communications. Stripe didn't invent payment processing. There's already an underlying network of payment processors and gateways. What Stripe did was they effectively took all of this infrastructure, which is very difficult and a bit obtuse to integrate into your application, and made it into a very simple piece of software that you can go and write, and they handle all of the complexity, or the undifferentiated problems of handling payments in an app. LiveKit is doing the same thing, or an analogous thing, for communications. Handling the undifferentiated pieces of how you actually connect one or multiple people, or machines, together with ultra-low latency, where those nodes in this graph are located anywhere in the world. The interesting part about the technology that existed before LiveKit existed is that the protocol itself is already there. I mentioned it earlier. It's called WebRTC, Web Real-Time Communication. The way to think about that protocol is that it's a higher-level abstraction built on top of UDP. Most of the Internet, when you use a browser, when you interact with applications, what you're interacting with underneath is really TCP. TCP and UDP are these two different protocols. TCP is not really designed for real-time media streaming. The main reason for that is one key difference between TCP and UDP. There are a lot of differences, but one key one. The key one is that TCP requires every packet that is sent over a network to be delivered to the application in order. The receiver must wait for missing packets before it can hand later ones to the application. That's the main difference. Why is that a problem for real-time media streaming? Because with real-time streaming, you only really care about the latest packet. Something that someone said a second ago, or a video stream capturing someone from five seconds ago doesn't really matter. You really just care about the absolute right-now edge packet. With a protocol that requires that all packets arrive and are handed to the application in order, if a packet gets delayed while it's transported through a network, or if it gets lost somehow and not delivered, you have to halt the entire application before you can actually touch any of the packets that are coming in real-time. 
You have to go and get that old packet before you can do any of that. TCP is not well-suited to this, and the Internet in general is not really designed for real-time audio and video. UDP gives the infrastructure provider, or the lower-level software layer, full control over what to do when you miss a packet, or when a packet's delayed. You can decide to actually fetch that old packet if you want. You can decide to try to conceal it by interpolating between packets you have and closing the gaps that way. You can add error correction, so that if you have a damaged packet, or missing information, you can reconstruct the information that you were missing from the pieces that you do have. There are all kinds of techniques that you can use to deal with packet loss, or delayed packets over the network. UDP gives you that control. You don't have that control in TCP. WebRTC is just another layer on top of UDP that gives you nicer facilities, but at the same time, the problem with WebRTC by itself is that it's peer-to-peer. You're sending your data over the public Internet, and there's no server in the middle when you use vanilla WebRTC. If I'm talking to three people in a video call, I'm sending three copies of HD video, one to each of those people, at the same time over my home Internet connection. It just doesn't really scale. What LiveKit does is build server infrastructure and client SDKs, where you take the server, you deploy it somewhere in a data center, and then the client SDKs that you integrate into your app all connect to that server. The server is like a router: everyone sends only one copy of their information to that one server, and that server determines who should get what, at what frame rate and resolution. It's the mediator here. LiveKit Cloud is the next step above that, and that's the commercial thing you were asking about. Like, when did we start a company from this? I think we were actually on a more accelerated timeline than most companies are in the commercial open source space. Folks like Elastic and Mongo have had a longer amount of time to build out communities and continue to improve the open source by itself, in isolation, before they have the pressure to commercialize. For LiveKit, it was a little bit different. The reason it was a little bit different is we put out the open-source project in July of 2021. It became a top 10 repo on GitHub across all languages for six months, very quickly, I think within three weeks, and then we had companies like Reddit and Spotify and X and Oracle who were already deploying it internally and starting to experiment with it. When we had conversations with them, they said, "Hey, look. We're coming from Twilio, or we're coming from another one of the larger real-time streaming companies. We love your code. We can see your code. You guys designed a good system, but we don't really want to deploy and scale this ourselves, even though we know we can. We would rather you do that for us and operate your own network, and we're happy to pay you for that." I think a combination of large companies that were pinging us and offering to pay us right out of the gate was one piece of pressure that caused us to commercialize very rapidly. The second was it was 2021, and so the demand for real-time audio and video infrastructure was very, very high because of the state of the world in the macro environment. 
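To make the TCP-versus-UDP point concrete, here is a minimal Python sketch of a UDP receiver that applies the "only the right-now packet matters" policy described above: anything that arrives late or out of order is simply dropped instead of stalling the stream. The port, the four-byte sequence-number header, and the render stub are invented for illustration; real WebRTC media rides on RTP over UDP with far more machinery on top (jitter buffers, NACK-based retransmission, forward error correction).

```python
import socket
import struct

# Toy packet layout (not RTP): a 4-byte big-endian sequence number
# followed by an opaque media payload.
HEADER = struct.Struct("!I")


def render(payload: bytes) -> None:
    # Placeholder for "decode and play this frame right now."
    print(f"rendering {len(payload)} bytes")


sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 50000))  # arbitrary port for the sketch

latest_seq = -1
while True:
    datagram, _addr = sock.recvfrom(2048)
    (seq,) = HEADER.unpack_from(datagram)
    payload = datagram[HEADER.size:]

    # UDP hands packets over in whatever order they arrive. Nothing stalls
    # waiting for a missing packet, so the application chooses the policy:
    # here, late or duplicate packets are dropped outright.
    if seq <= latest_seq:
        continue
    latest_seq = seq
    render(payload)
```

With TCP, that choice does not exist: a lost segment holds up everything behind it until it is retransmitted, which is exactly the head-of-line blocking described above.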
We made this decision that we were going to continue building the open source and supporting it, giving free support and helping people in the community build, but we also had to raise some money and build the team up a bit bigger and start to actually work on a commercial product, which is called LiveKit Cloud. [0:18:02] SF: Yeah. I mean, that's a fantastic story. Actually, I had a similar conversation with the CEO and founder of CrewAI recently, where, similar idea, he built something for a side project that he was interested in, open sourced it, and suddenly, Fortune 500 companies were calling him and asking him questions about how they manage it and so forth. I mean, that's the dream scenario for anybody that wants to start a company. [0:18:25] Rd: Totally. [0:18:26] SF: In terms of going from this peer-to-peer WebRTC world to then making it so that you can have this, essentially, server-side router that is going to proxy these calls to the various people who are trying to interconnect, what were some of the engineering challenges that you ran into with changing from this peer-to-peer model into this more client-server model? [0:18:48] Rd: I think that there are kind of two pieces to this, or two scaling mechanisms that you have to tackle when you do this. Going from a peer-to-peer model to a server-mediated model, that I would say is relatively straightforward. There's this term used in this world, not super important to know, but just for completeness, it's called an SFU. A Selective Forwarding Unit, just a fancy name for a router, a router of media. The Selective Forwarding Unit is your single server system. You deploy it somewhere in a data center. Then, what it's doing is it's acting like a peer in WebRTC. Just like my client device is a peer, this server acts like a peer as well. It speaks the same language as if it's just another human somewhere else in the world. A client device sends its media to that server, and then that server is aware of who else is connected in the session with that user and starts making routing decisions: okay, send this to them, send them only audio, because they only want the audio track and they muted that user's video, etc. Building that server is fairly straightforward; there are other people who have implemented SFUs. LiveKit was not the first one. Now it's the most popular one, but it definitely wasn't the first one. That system works to surmount the first scaling challenge of peer-to-peer WebRTC, where everyone's sending multiple copies of their media to everyone else. The problem with that model, though, and the scaling wall you hit relatively quickly with this single-server-mediated WebRTC, we call it single-home architecture, is that there are three problems: a reliability problem, a latency problem, and a scaling problem. The latency problem is that when you send your packets from point A to point B, what you really want to do is you want to minimize the amount of time that your packets spend on the public Internet. When I send data from San Francisco to Germany in a peer-to-peer world, I am effectively sending my packets over the road system of the world. Networks are like the road system. There are all kinds of bridges and ditches and traffic and rush hour happening along the way. Routers and ISPs can hold your packets for certain amounts of time. It depends on what other data people are sending. It's a pretty complicated web that a packet has to traverse to get to where it needs to go. 
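The Selective Forwarding Unit just described can be sketched as plain bookkeeping. In the toy model below, each publisher uploads one copy of each track and the SFU fans it out to every other participant, skipping video for anyone who has muted that publisher or asked for a smaller layer. The class and field names are invented for illustration, and there is no networking or congestion control here, just the routing decision.

```python
from dataclasses import dataclass, field


@dataclass
class Subscription:
    """What one receiver wants from one publisher (illustrative)."""
    want_audio: bool = True
    want_video: bool = True
    max_height: int = 720  # simulcast layer cap, e.g. 180/360/720


@dataclass
class Participant:
    name: str
    subs: dict = field(default_factory=dict)  # keyed by publisher name

    def deliver(self, publisher: str, kind: str, payload: bytes) -> None:
        print(f"{self.name} <- {kind} from {publisher} ({len(payload)} bytes)")


class SelectiveForwardingUnit:
    """Toy SFU: every publisher sends one copy; the SFU decides who gets what."""

    def __init__(self) -> None:
        self.participants: dict[str, Participant] = {}

    def join(self, p: Participant) -> None:
        self.participants[p.name] = p

    def on_media(self, publisher: str, kind: str, height: int, payload: bytes) -> None:
        for p in self.participants.values():
            if p.name == publisher:
                continue  # don't echo media back to its sender
            sub = p.subs.get(publisher, Subscription())
            if kind == "audio" and not sub.want_audio:
                continue
            if kind == "video" and (not sub.want_video or height > sub.max_height):
                continue  # subscriber muted this video or wants a smaller layer
            p.deliver(publisher, kind, payload)


# Usage: three participants, one of whom has muted Alice's video.
sfu = SelectiveForwardingUnit()
alice, bob, carol = Participant("alice"), Participant("bob"), Participant("carol")
for p in (alice, bob, carol):
    sfu.join(p)
carol.subs["alice"] = Subscription(want_video=False)

sfu.on_media("alice", "audio", 0, b"\x00" * 160)     # Bob and Carol both get this
sfu.on_media("alice", "video", 720, b"\x00" * 1200)  # Bob gets it; Carol is skipped
```

The forwarding decision itself is the easy part; the harder problems he turns to next are the public-Internet "road system" between clients and that single server, and what happens when the server crashes or fills up.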
What you want to do is you want to be able to bypass that web. It's not a big deal if someone in San Francisco is sending packets to someone else in San Luis Obispo. But sending packets from San Francisco to the UK, it's a much longer haul and you're going to encounter more of this massive road system on your way between those two points. What you want to do is you want to try to terminate the user's connection as close as possible to them and then use the private Internet backbone to route the packets. What is a private Internet backbone? All the data centers of the world and all the cloud providers, they basically have interconnects and they have these private fiber networks on the backend that are wired up and not as noisy as the public Internet. You want your packets to spend as much time as possible on these super-fast routes. I don't want to say information superhighways, but you want them to be on that good network for as long as possible and then - [0:22:18] SF: The autobahn of the Internet. [0:22:20] Rd: Exactly. I love that. The autobahn of the Internet, and you want them to spend as little time as possible on the residential streets. You want to have a server as close as possible to San Francisco and a server as close as possible to the UK and then route between those two servers over private backbone. That's something in a single server system that you don't get. Everyone connects to that single server wherever it is located in the world and you're going public Internet to the server and then public Internet out of the server to wherever the destination is. Latency penalty. Reliability penalty, because that server is on commodity hardware in a data center. It's going to crash. How do you resume a session seamlessly without a user interruption, or with minimal user interruption? Unclear how you do that. The failover recoverability, or reliability, is another issue with the single-server model. Then the third issue is scale. You can vertically scale that machine, right, add more and more connections to it, more and more users, but ultimately, you will run out of resources on the physical device underneath, and you need to be able to split and scale out horizontally. These are the three weaknesses of the single-server system and you have to be able to scale past that. How do you scale past that? Well, you scale past that by creating a multi-server mesh network. What you want to do is you want to have servers all around the world, be able to spin up as many nodes as you want anywhere in the world. Then any user that is trying to connect, regardless of where they are, connects to the closest server to them, so you minimize that time on the public Internet. Then, in this mesh network on the backend, all of these servers and data centers communicate with one another and form a fabric for routing information between them. You have software that is doing all of that, connecting the different points, and packets go over that mesh on the backend and then exit out and travel along the shortest path to their destination. That's the system that we built with LiveKit Cloud. To go back to what these larger companies were asking us for: the open-source piece is all of the SDKs and our single server. 
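The multi-home routing just described can be sketched in a few lines: each client terminates at whichever edge server is closest to it, and media then crosses between edges over the private backbone rather than riding the public Internet end to end. The edge names and latency figures below are made up for illustration; a real system measures them continuously rather than hard-coding them.

```python
# Toy "multi-home" routing: each client terminates at its closest edge,
# and edges relay to each other over the private backbone.
# All one-way latency numbers are invented for illustration.

PUBLIC_MS = {                # client -> candidate edges (public Internet, noisy)
    "user_sf": {"edge_sjc": 4, "edge_fra": 148, "edge_lon": 140},
    "user_uk": {"edge_sjc": 145, "edge_fra": 18, "edge_lon": 9},
}
BACKBONE_MS = {              # edge <-> edge over private interconnect (stable)
    frozenset({"edge_sjc", "edge_lon"}): 62,
    frozenset({"edge_sjc", "edge_fra"}): 70,
    frozenset({"edge_fra", "edge_lon"}): 11,
}


def closest_edge(client: str) -> str:
    """Attach the client to the edge with the lowest measured public latency."""
    candidates = PUBLIC_MS[client]
    return min(candidates, key=candidates.get)


def path_latency(sender: str, receiver: str) -> tuple[list[str], int]:
    """Public Internet only for the first and last hop; backbone in between."""
    src_edge, dst_edge = closest_edge(sender), closest_edge(receiver)
    hops = [sender, src_edge, dst_edge, receiver]
    total = PUBLIC_MS[sender][src_edge] + PUBLIC_MS[receiver][dst_edge]
    if src_edge != dst_edge:
        total += BACKBONE_MS[frozenset({src_edge, dst_edge})]
    return hops, total


hops, ms = path_latency("user_sf", "user_uk")
print(" -> ".join(hops), f"~{ms} ms one way")
```

The numbers are invented, but the shape is the point: in this toy example, San Francisco to the UK costs roughly 4 + 62 + 9 ms of one-way transport, versus roughly 166 ms of public Internet if everyone had to reach a single server in Frankfurt. Stitching many such edges together is the job of the orchestration layer described next.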
Then the part that is not open source with LiveKit Cloud is this orchestration system on the backend that allows you to spin up these LiveKit servers just infinitely as long as you can find machines, and they all get registered as part of this fabric and there's software that does all the routing. The challenges that you have to solve there, there's a lot of them. One in particular that I think is quite interesting is state synchronization, right? What you don't want to have in a system like this, where it's a mesh and all of these data centers are connected to one another is you don't want to have a single point of failure. If you have a single point of failure, or one single coordinator or a few of them, when those coordinators go down, all of a sudden, a session is severed if the session spans multiple people all around the world. You need to have this almost a truly distributed system where any data center can run independently. Then there's a type of a quorum that is formed when they want to intercommunicate. How do you manage state synchronization properly across all of these data centers and you're aware and measuring the network in real time for, are there connectivity issues between two data centers and does a connection get severed? We had to deal in the early days with these undersea cables getting severed between Europe and Asia. How do you deal with situations like that? There was another issue, I remember, where in India, in Bangalore, somebody in a data center somewhere went and just disconnected a line in a router. All of a sudden, Bangalore and San Francisco could not talk to each other anymore directly. You had to go through another link. How do you detect those issues and deal with them? For us, it's one of those things where I tell people in the same way that with SpaceX, Elon in the early days said, "Go explode rockets in the desert as fast as possible, so you can figure out what makes rockets explode, and mitigate." What kinds of situations in networking and connectivity happen between two, or multiple users around the world when they're trying to talk to one another? Figure out what breaks and what situations cause things to break and what you need to mitigate and then mitigate them as fast as possible. In the early days of LiveKit's commercial life, we had outage after outage after outage after outage. It was painful, but we ended up building pretty sophisticated software that now understands the, I would say, 97%, 98% of the scenarios that you can run into with streaming media around the world and we have software that can deal with it. [0:27:04] SF: Where's this all run? Are you running this on, essentially, off a cloud and then you're deploying your own proprietary software there to help manage the routing of this based on your deep understanding of where network issues could happen? [0:27:19] Rd: Yeah, it's a great question. The way that we went about it from the very start was, I think a default answer for how - Well, so backing up, we take our software and we deploy it on public clouds, right? Public clouds can be AWS, or GCP, or Azure, or DigitalOcean, Linode and Akamai, they all have servers all around the world. We go and deploy our software there and then we have at the software level, these servers are interacting with one another and speak a language or a protocol between one another. I think a default answer, even for myself before I was working on LiveKit was I'm just going to use AWS, right? It's the most popular cloud. 
A lot of people use it, and their network is great. But one thing that we realized with AWS, well, we realized a couple of things, but one important one was that their product is really good and they charge accordingly. It's very expensive, too. When you build on them, you get insulated from a variety of networking issues, because they're running their platform quite well. And so, you don't really get a viewpoint, or visibility, into some of the things that can go wrong. If you were to, say, build on your own cloud, spin up your own data centers all around the world, which eventually we want to get to, you would see those things, but you insulate yourself from that when you build on AWS. There are other reasons to not build on AWS, too. I mean, it is expensive and all of that. But, well, two things. One, we didn't want to depend on any one cloud provider, because an entire cloud provider could go down and we wanted to make sure we were resilient to that. The second was that we wanted to, from the very start, understand all the different types of issues that might happen, and we didn't want to insulate ourselves from those. What we did was we said, okay, we're going to go build on other clouds that maybe don't have as mature of a product and we're also going to build on multiple of them at the same time. What we do is we run our own overlay network that treats them all like one massive cloud, but spans different hardware providers underneath, and then we have software that deals with, okay, I need to send data from here to here, and it goes across clouds. That transparently happens due to this overlay network that we have. Then, we're always measuring the connections between different providers and different data centers in real-time. Software is automatically taking some out of rotation, slotting other ones in. We have this multi-cloud fabric that is impervious not just to a single server going down, but to an entire data center going down in a region, or an entire cloud going down. Without naming names, there was actually a cloud provider where all of their data centers became impacted at the same time, maybe about two years ago. We had never seen that problem before. And so, we built in software that allows us to deal with that scenario and we never run across just a single cloud now. [0:30:16] SF: Presumably, you end up having to route these packets across multiple cloud providers hidden behind this fabric that you've created. Have you seen issues there that you have to deal with? [0:30:27] Rd: Very minimal. Now, as I said, there are different qualities of interconnect between these cloud providers. They have peering agreements and stuff like that for routing packets between them. Yes, we've definitely seen different performance levels across them as well. But in general, the added delay is minimal. I would say, on the order of a millisecond, or maybe two at most for this transfer piece, so it does not add an appreciable amount of latency to the transport. [0:31:02] SF: What about when you're dealing with multiple modalities across real-time streaming? Even us talking to each other right now, we have audio, we have video, for example. How do you synchronize across those different modalities? [0:31:17] Rd: Yeah, it's an interesting question. I'll say a couple of things about synchronization. 
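One step back before the synchronization discussion: the continuous link measurement described a moment ago, probing every link in the multi-cloud fabric and pulling unhealthy ones out of rotation, can be sketched roughly like this. The thresholds, region names, and data structures are invented for illustration and stand in for whatever the real overlay network tracks.

```python
import time
from collections import defaultdict, deque
from typing import Optional

# Rolling probe history per (src, dst) link across the multi-cloud fabric.
HISTORY = defaultdict(lambda: deque(maxlen=20))
LOSS_THRESHOLD = 0.3        # pull the link if >30% of recent probes failed
LATENCY_THRESHOLD_MS = 250  # or if its typical latency degrades badly


def record_probe(src: str, dst: str, rtt_ms: Optional[float]) -> None:
    """rtt_ms is None when the probe timed out (e.g. a severed cable)."""
    HISTORY[(src, dst)].append((time.time(), rtt_ms))


def link_healthy(src: str, dst: str) -> bool:
    samples = HISTORY[(src, dst)]
    if not samples:
        return True  # no data yet: assume healthy until proven otherwise
    rtts = [r for _, r in samples if r is not None]
    loss = 1 - len(rtts) / len(samples)
    if loss > LOSS_THRESHOLD:
        return False
    median = sorted(rtts)[len(rtts) // 2]
    return median < LATENCY_THRESHOLD_MS


def routable_links(links):
    """Only healthy links stay in rotation; routing runs over this subset."""
    return [link for link in links if link_healthy(*link)]


# Example: the direct Bangalore <-> San Francisco path goes dark, so traffic
# has to detour through another region instead.
for _ in range(10):
    record_probe("blr", "sfo", None)     # direct path: probes timing out
    record_probe("blr", "fra", 95.0)
    record_probe("fra", "sfo", 78.0)

print(routable_links([("blr", "sfo"), ("blr", "fra"), ("fra", "sfo")]))
# -> [('blr', 'fra'), ('fra', 'sfo')]
```

Routing then only considers links still in rotation, which is how a severed undersea cable or an unplugged router in one data center becomes a detour rather than an outage.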
When you're in a Google Meet session, or you're in a LiveKit session, or a Zoom one, too, synchronization is an interesting problem, because for some use cases, you don't actually want to synchronize. The reason you don't want to synchronize is because, well, let me just use a UX example. You're talking to someone that is located halfway around the world and their network connection, maybe their router, or something like that, their ISP hits some connectivity issue where packets are getting delayed. Video packets and audio packets are different bandwidths; video carries more information than audio, maybe 10X or more. The amount of information you have to send is larger and the packets are actually bigger. When you're synchronizing these packets, let's imagine that the network connection isn't good enough such that it can transport high-quality video, or video period, but it can still sustain audio. If you're synchronizing those packets, that means that when you hit a network blip, you actually can't render audio, or video until you receive both. But if you have them separated in independent tracks or streams, then you can actually do things like say, "Hey, you know what? I'm just going to freeze frame the video, or I'm going to shut the video off, but I'm still going to allow audio to be delivered." What that does is it doesn't break the sense of presence that you have with someone else when they're halfway around the world. By default, we don't actually synchronize the packets, but, well, let's say this, we don't enforce pure synchronization of the packets, but we do things on the backend using sequence numbers and timestamps to try as much as possible to synchronize the packets, because also from a user experience perspective with video, especially paired with audio, you want the mouth to match the voice, or what they're saying. You definitely try your best to do this, but you don't have an invariant that the audio must be synchronized with the video. You let them jitter a little bit independently, or flex a little bit independently. Now, for AI use cases, it's a little different, because you have an AI model that might be saying something to you, speaking to you, and you want the transcription of what it's saying to line up with the speech. The transcription is sent over a data channel, so it's not audio or video, but it's text. It's sent over a data channel with LiveKit, and then the audio is sent over the audio stream and you want those things to line up. Another use case there is those transcriptions, or let's call it metadata. I'll use an example with an avatar. You might have an avatar that you render on the client, and what you want is you want the manipulations of that avatar, like I need to move the mouth this way, or I need to move it down. You want those XY coordinates to line up with the speech itself, so that the avatar, which is rendered on the client, can actually move its mouth in sync with the audio, so it looks like it's actually saying the words. That's another case where you want synchronization between the audio track and the data stream. Another scenario is where you're doing video avatar generation. 
There are companies like Simli and Tavus, for example, who use generative AI to generate a video-based avatar that can speak, and you want to line up the audio and the video such that there's perfect synchronization between the two, because humans are very sensitive to seeing this and you don't want to be in that uncanny valley, where it's like, okay, this is obviously fake. Those are scenarios where you do want to actually do true synchronization between the streams. And so, we have mechanisms built into LiveKit that will actually enforce that and make sure that the packets are held in a buffer on the receiver side, such that once they both come in then you can actually hand them to the application and allow them to play out. [0:35:19] SF: You mentioned some AI-related use cases there. How does OpenAI use LiveKit? [0:35:25] Rd: OpenAI uses LiveKit - I think I'll describe the way the architecture works, and then it'll probably become apparent as to how OpenAI is using us. There are two fundamental modes within the ChatGPT application. One is where you're in this text chat and you're texting with ChatGPT, and that's using a traditional HTTP request response. I type something, I hit send, it makes an HTTP request to OpenAI's servers. The model responds with some text and uses HTTP server-sent events (SSE) to stream the response, those text tokens, back down to the client and render them. This model of using an HTTP request doesn't work for real-time audio and video, for the reason I mentioned at the start of our conversation, where HTTP is called hypertext transfer protocol, not hyper audio, not hyper video. It's built on top of TCP. But for audio and video and advanced voice and advanced vision, you want to actually use UDP. WebRTC is the layer on top of UDP and LiveKit is WebRTC infrastructure. What OpenAI does is when you tap on that advanced voice mode button in the ChatGPT application, you enter into a different view of the app where there's a LiveKit client SDK on your iOS device, or on your Android device, embedded within the ChatGPT app, even in the desktop apps, too, and on web. When you tap on that button, there's a LiveKit client SDK that connects to LiveKit's network. LiveKit Cloud, our cloud servers all around the world, you connect at the closest point on the edge to you, and then at the same time, there's an AI agent on the backend that is getting taken out of a pool and connected to you. You say, "I want to talk to ChatGPT," there's a ChatGPT agent on the backend using LiveKit's framework, our agents framework. That agent gets dequeued, taken out of a pool, and connected to the user, and it's also connected through LiveKit Cloud. Now when the user speaks, their audio is traveling through our network to that agent on the backend, and that agent, also connected to our network, is taking the audio, processing it in GPT, and then as the audio or text streams out of GPT, it is getting passed from LiveKit's agents framework on the backend through the network again, received on the user's device where it's played out. Similar thing for the vision feature, where you can screen share, or ChatGPT can see what you're looking at through your device's camera. Similar type of thing, except instead of audio, it's now video, where that video is traveling over the network, arriving at the agent, and then the agent is processing it and then the video is getting transported back over our network to the client device when it's generated. Yeah. 
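The enforced synchronization described above, holding packets in a receiver-side buffer until the paired streams line up, can be sketched as a small matching buffer. The timestamps, tolerance window, and stream names below are illustrative; real WebRTC synchronization is driven by RTP timestamps and RTCP sender reports rather than anything this simple.

```python
from collections import deque

SYNC_TOLERANCE_MS = 20  # how far apart two timestamps may be and still "match"


class SyncBuffer:
    """Hold audio frames and companion data (video frames, transcript chunks,
    avatar mouth cues) until both sides have something for roughly the same
    capture timestamp, then release them together."""

    def __init__(self):
        self.audio = deque()  # (timestamp_ms, payload)
        self.data = deque()

    def push_audio(self, ts_ms, payload):
        self.audio.append((ts_ms, payload))
        return self._drain()

    def push_data(self, ts_ms, payload):
        self.data.append((ts_ms, payload))
        return self._drain()

    def _drain(self):
        """Release matched pairs; skip whichever side has fallen behind."""
        released = []
        while self.audio and self.data:
            (a_ts, a), (d_ts, d) = self.audio[0], self.data[0]
            if abs(a_ts - d_ts) <= SYNC_TOLERANCE_MS:
                released.append((a_ts, a, d))  # hand both to the app together
                self.audio.popleft()
                self.data.popleft()
            elif a_ts < d_ts:
                self.audio.popleft()  # stale audio frame with no partner
            else:
                self.data.popleft()   # stale data frame with no partner
        return released


buf = SyncBuffer()
print(buf.push_audio(1000, b"pcm..."))         # [] -> nothing to pair with yet
print(buf.push_data(1005, {"mouth": "open"}))  # released: timestamps match
```

For the looser default mode described earlier, you would simply bypass a buffer like this and play each stream out as it arrives, letting audio keep flowing even when video stalls.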
[0:38:14] SF: What are your thoughts on this? Historically, when we look at how people use computers, we've built a lot of devices that we've gotten good at using to interface with a computer, like a mouse, a keyboard, and stuff like that, but that's not naturally how we interact with people. Or even if you look at the history of search, we've over the last 20 years figured out the mechanics of how to manipulate searches on Google to get what you want, and it's not really how you talk to people. I think one of the things that has changed over the last couple of years with large language models, and now multi-modal models, is that I can essentially talk to something, or at least type to something, in the way that I might talk to you. And it talks back in a way that you might speak to me. Now, multi-modal, I might not even need the mouse and the keyboard, I can just simply speak to it. What are your thoughts on where maybe some of this stuff is going and how it might actually change the way that we interact with computers? [0:39:14] Rd: I think we're already seeing this interface presented to us, right, in the ChatGPT application, and there are other voice-based applications that present this interface as well. I also think in other areas, like telephony, for example, which is a native voice interface to a system: I pick up the phone and call a customer support line, or something like that, and I invariably talk to an IVR system, an automated, dumber telephone system. We're starting to see AI get integrated into these use cases, and that's a glimpse, I would say, of where computing interfaces in general are going to go over the long term. I think that there are going to be some bigger catalysts to, well, so at a high level, as the computer gets smarter, the inputs and outputs to that computer become more natural and human-like. Where the computer is starting to become more human, that means the senses also become more human. What it takes in, what it can output. Naturally, I think over time, the keyboard and the mouse won't go away completely, or it will take a long time before that happens, if it happens completely. But I think that voice and vision will become predominantly the interface for how you interact with the computer and give it information and how it gives information back to you. It's not to say that a screen won't exist anymore. There are still certain types of things, like, if I'm having a computer order food for me, I don't want it to read the entire menu to me. That would take way too long. It's much faster for me to just look at the menu and scroll through it and tap on what I want, or maybe tell it what I want with my voice. Screens won't go away as a visual mechanism for presenting information, but we will increasingly leverage cameras and microphones as the peripherals, or sensors, for how we interact with the computer. I think the catalyst for this, so as you mentioned, we've become really good at how to use computers in the old paradigm. We're really good at keyword search and using that to find what we want on Google. But I personally have been using Google less and less, now that I've gotten accustomed to using ChatGPT. There's this talk about copilots and whether copilots are going to be the interface. I think that they will definitely be an interface. I think Jarvis is going to be an interface for creators to do things and express themselves. 
But I think that for us, we've gotten so good at, or so comfortable with the flows that we're used to, I think the real proliferation of voice and computer vision as the dominant interface is going to come from one, a younger generation. I don't know if you have kids. I don't, but I have friends that have kids, and they all talk about how they just interact with their voice all the time. They're very used to sending voice memos and they're used to interacting with Alexa, or Google Home, etc. It's a normalized behavior for them to talk to the computer. I think in younger generations, they're not going to have the baggage that we bring in of specializing ourselves for how to interact with the computer using a keyboard and a mouse. I think that's one thing. I think the second thing is once we start to have real agents that can do work, not just the copilot where it's pairing with you over your shoulder, but where you can tell it to go do stuff and it just goes and does stuff, that's coming pretty quickly. Once that's the relationship you have with the computer, it's going to be much more similar to how you work with another human being, where you have a meeting, you talk about what you want to accomplish, and then people go off and start to work on stuff on their own independently and then you come together again to sync up, or to refine. That's the predominant way that people work. They don't pair as much as they work independently. I think that's going to be another catalyst for voice and just naturally speaking and interacting with a computer to become more mainstream. Then, I think the third one which is the huge one is when the models get smarter and smarter, there's this increasing pressure to embody them within robotics. When we have humanoid robots, that really is going to be like interacting with another human, it's just made out of, I don't know what they're going to make them out of, but carbon fiber. I don't know. I'm not a material scientist. But those humanoids walking around, you're not going to walk up to it and type on its keyboard. You're going to interact with it like you would with another human being in a physical space. I think that's going to be the third major catalyst that really takes the voice and computer vision interface into the mainstream. Computers in the future, they're not going to look like the computers of the present. They're going to be on any surface, or you'll be able to just call them up onto any surface and they're going to be walking around and interacting with you as well. [0:44:06] SF: Yeah. I think even that asynchronous work mode for a consumer-facing application is a new type of workflow as well, because other than when you send something that's supposed to be a communication to a person, most of the things that we do with computers from a business perspective is immediately, I get some output from the machine, essentially. I'm not telling it to go do some deep research and then come back to me five days later with the results and stuff. I think that's also, and you alluded to this, will be a real shift in the way that people interact with computers. We're just not used to these asynchronous workflows, where we're telling the machine a task and it comes back with a result at some point. [0:44:46] Rd: Exactly. Yeah. 
We're really in the early innings of it, but we're starting to iterate towards that model, especially with o1 and o3 and this test-time compute stuff, where we're starting to get accustomed to the model taking some more time, thinking about things, doing maybe some more complex work, and then coming back to us with the result. I think the interface for how you interact with it is going to evolve. Today, you sit there and watch it think. It says, "Oh, I'm doing this and now I'm doing this." Another step away from that is going to be like, "Okay. Hey, I'm going to go do some research and tackle this for 10 minutes, or 15 minutes, or 45 minutes and I'll ping you back and let's talk about the results and review what I came up with." You can see the direction that this is going and it's pretty exciting. [0:45:36] SF: Yeah. I think this is really fascinating stuff. Russ, thanks so much for being here. [0:45:39] Rd: Oh, yeah. Thanks for having me and I appreciate the conversation. [0:45:42] SF: Cheers. [END]