EPISODE 1839 [INTRODUCTION] [0:00:00] ANNOUNCER: ByteDance is a global technology company operating a wide range of content platforms around the world, and is best known for creating TikTok. The company operates at a massive scale, which naturally presents challenges in ensuring performance and stability across its data centers. It has over a million servers running containerized applications, and this required the company to find a networking solution that could handle high throughput while maintaining stability. eBPF is a technology for dynamically and safely reprogramming the Linux kernel. ByteDance leveraged eBPF to successfully implement a decentralized networking solution that improved efficiency, scalability, and performance. Chen Tang is an engineer at ByteDance, where he worked on redesigning the company's container networking stack using eBPF. In this episode, Chen joins the show with Kevin Ball to talk about eBPF, the problems it solves, and how it was used at ByteDance. Kevin Ball, or KBall, is the Vice President of Engineering at Mento and an independent coach for engineers and engineering leaders. He co-founded and served as CTO for two companies, founded the San Diego JavaScript Meetup, and organizes the AI in Action discussion group through Latent Space. Check out the show notes to follow KBall on Twitter or LinkedIn, or visit his website, kball.llc. [INTERVIEW] [0:01:36] KB: Chen, welcome to the show. [0:01:38] CT: Hi, Kevin. [0:01:40] KB: I'm excited to get to talk to you. [0:01:42] CT: Yeah, me too. [0:01:43] KB: Let's maybe start with, you can introduce yourself. Just tell me a little bit about your background and how you got involved doing cloud native stuff and networking. [0:01:52] CT: Okay. My name is Chen, and I'm currently a software engineer at ByteDance. My job focuses on networking products in our data centers: keeping every service, especially the ones deployed in containers, running, and making sure of their connectivity and stability. In this field, we use a lot of different technologies. We use kernel technologies and hardware technologies. On the kernel side, we use eBPF. It's a customized program that you write, and you can run it inside the kernel without loading a kernel module. eBPF has developed rapidly over the recent decade, and now it's very popular. You can use it for more than just networking; basically, you can do everything with eBPF. You just write the program you want to run inside the kernel, and you don't have to be afraid that your code might jeopardize the entire system. [0:02:55] KB: I think that's worth digging into, especially for older developers like me. I remember, to get anything into the kernel when I started, it was this months- or years-long process, where you would go back and forth on email and all these different things. eBPF, as I understand it, lets you essentially run sandboxed code, similar to how you might run JavaScript in a browser. Is that a fair analogy? [0:03:21] CT: I'm not that familiar with JavaScript, but yes, I think you're right. You run a virtual machine inside the kernel, the environment is prepared for you, and you just write your code, your program, and get it to run inside the system. [0:03:37] KB: That is really cool. Before we dive into some of the specific ways that you're using it, maybe we could talk a little bit more about eBPF. What does the programming environment look like for these virtual machines?
[0:03:50] CT: First, BPF means Berkeley Packet Filter. It is a technology developed at UC Berkeley. At first, it was used to filter packets inside the kernel. Then, when people found that this mechanism for running something inside the kernel could be used in different ways, they expanded the whole technology and called it eBPF. The e means extended: Extended Berkeley Packet Filter. We can do more than just filter packets; we can run different kinds of code inside the kernel. You can filter packets, you can monitor the entire system, and you can trigger your custom functions when somebody opens a file. You can do, basically, a lot of things. The interesting thing is that when we talk about the kernel, we know it's a big system, and it's fragile. If you do something wrong, you'll probably blow up the entire system. But eBPF provides you with a mechanism: it checks every line of your code and makes sure your code can run safely inside the kernel. This gives developers a powerful tool, because they don't need to worry about checking and verifying everything themselves. They just run their code inside the kernel. If the verifier decides your code is not safe and might blow up the system, it will tell you that you cannot load it into the kernel. But once you pass the check, you don't need to be afraid of that. Basically, this is what eBPF is about. Yeah. [0:05:22] KB: To make sure that I understand, it essentially provides a set of APIs, or hooks into the kernel, that are more stable than kernel internals. You can hook into the networking stack, you can hook into system calls, things like that. On top of that, it does a static analysis verification upfront to make sure that this is going to be safe to run. [0:05:47] CT: Yes, exactly. [0:05:49] KB: That makes a ton of sense. I think this is super cool, because there's a lot of times when you need to get down into the kernel, and I know writing application code, crossing the kernel interface, doing a system call, it's expensive. [0:06:04] CT: Yes. [0:06:05] KB: What was the motivator for you to start working in eBPF? [0:06:09] CT: Yes. Let's go back to the networking part. Why do we need eBPF in networking? It's simple; it comes back to containers and cloud native. We run multiple containers on a server, and each container owns a unique network namespace, meaning the network environment of the container is isolated from the host. Each container has its own IP address, and probably its own network interface. That causes a problem. If your application is inside the container and you want to send a packet to the outside world, there is a wall between the container and the physical NIC on the host. You need something to connect the container to the host interface. That's where we use eBPF. The mechanism is simple. Let's assume we run a custom eBPF program in our kernel stack. When we receive a packet, the eBPF program captures it, we analyze it and see, okay, this packet belongs to this container, and we send it to the container. That's it. That's why we need eBPF in container networking; it's a very simple goal. [0:07:24] KB: Got it. To make sure that I understand, when you're in this container environment, inside the container, it doesn't know it's in a container. It wants to treat its network like a network stack. Outside, you can look at the kernel layer that's running all of these containers. You can say, hey, this packet actually is going to another machine, or a container inside of my same virtual machine. There's no need to go through an expensive interrupt stack to go through an actual physical device. I know in software, I can just route it straight to that other container. [0:07:55] CT: Exactly. It's a lightweight virtual machine. You can think of a container as a lightweight virtual machine.
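As a rough illustration of what Chen is describing, here is a minimal sketch of an eBPF program attached at the kernel's tc (traffic control) hook that redirects packets to a local container's interface. The map name, layout, and the use of bpf_redirect are illustrative assumptions, not ByteDance's actual code; the explicit bounds checks are the kind of thing the verifier insists on before it will load the program.

```c
// Sketch only: a tc-ingress eBPF program that redirects IPv4 packets
// destined to a local container straight to that container's veth.
// Map name and layout are illustrative, not ByteDance's implementation.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 4096);
    __type(key, __u32);   /* container IPv4 address (network byte order) */
    __type(value, __u32); /* host-side ifindex of that container's veth */
} container_ifindex SEC(".maps");

SEC("tc")
int redirect_to_container(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    /* Bounds checks like these are required, or the verifier rejects the program. */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return TC_ACT_OK;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return TC_ACT_OK;

    __u32 dst = ip->daddr;
    __u32 *ifindex = bpf_map_lookup_elem(&container_ifindex, &dst);
    if (!ifindex)
        return TC_ACT_OK; /* not one of our containers; let the stack handle it */

    /* Hand the packet to the container's interface, skipping the rest of the host path. */
    return bpf_redirect(*ifindex, 0);
}

char LICENSE[] SEC("license") = "GPL";
```

A user-space component would populate container_ifindex as containers come and go; the program itself only does the per-packet lookup and redirect.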
[0:08:01] KB: Yeah, that's really interesting. What were you doing before, and how did you realize you needed to do something like that? [0:08:11] CT: Let's go back to network virtualization. Before we ran eBPF for container networking, what did we use? We still needed network virtualization for virtual machines, right? We had the virtual switch. I'm not sure if you're familiar with the virtual switch. It's like a switch running in software on your node. When you receive a packet, the virtual switch captures it and determines where to send it. Basically, these things need dedicated cores to run on in order to achieve maximum performance. It's heavy. The virtual switch is heavy. The virtual machine is heavy. That's how it is for most cloud providers, because the virtual machine is a technology we've had for maybe 20 years. The container is something new. The point of the container is that it's lightweight. We don't want those heavy things, but we want to achieve the same purpose. We want isolation. We want security. We want resource allocation to make sure the container runs in a safe environment with, say, two cores and maybe 4 gigabytes of RAM, and we want to make sure of the connectivity between the container and the outside world. But we don't want that virtual switch. So what do we do next? Let's go inside the kernel and see what the kernel can provide. The kernel provides us with eBPF. At the very beginning, eBPF was not functioning the way it does today. There was still a lot of work to do. Many developers gathered together to think about how to make eBPF more powerful for container networking. It's been developed for five years. Now, I think it's a mature technology for container networking. Basically, there are many popular open-source projects aiming to use eBPF to serve container networking in every aspect. For example, not just connectivity, but security. People have brought a lot of cloud native concepts into eBPF. [0:10:23] KB: That's interesting. Actually, let's maybe walk through that journey a little bit. When you started working on this and you said, hey, we have these heavy virtual switches, we want to get rid of those, you looked at eBPF and you said it was not ready. What was missing, and what were the things that you needed to add, or develop, to get this to work for you? [0:10:45] CT: Yeah. The thing is, you use eBPF, but fundamentally, you are leveraging abilities from the kernel. eBPF exposes some of the key functions of the kernel to the user. At the very beginning, the functions exposed were not enough. You would have these problems: the performance was not good enough, and you could write a very simple program, but if you wrote a more complicated one, then, like I said before, there is a verifier, and the verifier would reject the program because it was too complicated for it to recognize whether it was safe or not. You would encounter many problems. People kept working on this, kept optimizing the entire system. Basically, this is what all of the eBPF developers have been doing over these five years.
[0:11:40] KB: Yeah. To make sure I understand, they were, one, extending the surface area of what you could do, what the hooks were, things around that. Two, improving the static analysis verifier to be able to handle more complex cases and still prove them to be safe. [0:11:56] CT: Exactly. [0:11:57] KB: Got it. Okay. You mentioned pulling a lot of different cloud native tooling into eBPF. What are some of the different ways in which you are using this in your stack today? [0:12:12] CT: Yeah. Basically, what we are doing now focuses on two parts. The first part is networking. We do packet redirection, and we have to enforce different networking security policies. It works as a data plane. We receive policies from the remote controller, and eBPF helps you implement those policies. The second part is kernel tracing. Like we said before, eBPF provides a lot of hook points inside the kernel, and we call them trace points. For each trace point, when it is triggered, you can run your custom eBPF program to help you understand what is going on inside the kernel. For example, when a function is called and you suspect some bug has happened and you want to know what is going on inside that function, you can bury your eBPF program at this hook point, and when the event is triggered, eBPF will print everything you want, especially the context at that moment, so you can deeply understand what is going on inside the kernel. Because most of the time, it's a black box. But with the help of an eBPF program, you can really understand the kernel. [0:13:23] KB: It's almost like, you can insert your observability stack down into the kernel and say, oh, we think there's something going on with this function. Print me the context on entry. Print me the context on exit. How much overhead is there in this tracing? [0:13:38] CT: For kernel tracing, actually, the cost is very high. We don't use it often. We use it only when a bug is reported and we need to analyze the kernel. Packet filtering is different, because there are different trace points inside the kernel. For some, the cost is low, and for some it is high; especially when you want to bury your own customized trace point inside the kernel, the cost is very high. But for most of the hook points, especially the ones in the networking part - those are not trace points, they are hook points - the cost of triggering them is low. We use those for networking. We redirect tens of millions of packets, and the cost stays limited; let's say, the cost is acceptable. [0:14:25] KB: Yeah. Now, what does it take to roll out, say, you want to turn on tracing? Can you do that live on a running container? Do you need to restart the server? What does that life cycle look like? [0:14:38] CT: Yes. This is the attractive part. You can run the eBPF kernel tracing on a live container. You don't need to restart anything. You just inject your code into a running system and you gather everything you need. Once it's done, you just cancel it; you can remove the program, and the system will go back to normal. [0:15:02] KB: That's wild to me, that you can inject observability code down into your kernel on a live system with knowledge of safety, run it as long as you need it and just pull it out. [0:15:13] CT: Yes, exactly. That's why eBPF has become so popular today.
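For a sense of what that kind of tracing program looks like, here is a minimal sketch: a kprobe that fires whenever a chosen kernel function runs and prints which process triggered it. The target function here (tcp_retransmit_skb) is just an example, not something Chen mentioned; output shows up in the kernel's trace_pipe while the program is attached, and detaching removes it entirely, which is exactly the load-it-when-a-bug-comes-in, pull-it-out-when-done lifecycle Chen describes.

```c
// Sketch only: trace calls to one kernel function with a kprobe and
// print which process triggered it. The target function is an example.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("kprobe/tcp_retransmit_skb")
int BPF_KPROBE(trace_retransmit)
{
    __u32 pid = bpf_get_current_pid_tgid() >> 32;

    char comm[16];
    bpf_get_current_comm(&comm, sizeof(comm));

    /* Appears in /sys/kernel/debug/tracing/trace_pipe while the program is attached. */
    bpf_printk("tcp_retransmit_skb: pid=%d comm=%s", pid, comm);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```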
[0:15:17] KB: That's cool. Diving in maybe a little bit, you mentioned also, there's a set of open-source projects that have been building and bringing these cloud native technologies into eBPF. Are you working on any of those? Are those areas that you're connected to? [0:15:32] CT: Yes. For now, I've been working on some of the open-source projects, but I'm mainly a user, because I work for a company. My purpose is, first, to serve what the company wants me to do. I use a lot of open-source tools. When I find things I can modify, then yes, I will do it in my own way to, how to say, contribute back. This is what the community does, right? [0:15:58] KB: Yes, absolutely. Which projects are you using? [0:16:02] CT: Since I work for the company first, my top priority is to help the company get things done, to get the job done. During this process, I use a lot of open-source projects and I get help from the communities. In particular, there's an eBPF library called eBPF Go. It is now the most popular eBPF library for Go developers. I use this library to help me load eBPF programs into the kernel. I think this is the open-source project I use the most. [0:16:37] KB: That brings up an interesting area, which is what does the development environment for eBPF look like? Because you're compiling down, as I understand it, to essentially, a bytecode that is what gets analyzed and loaded. What is the environment that eBPF Go, for example, exposes to you? [0:16:56] CT: Okay, so most developers like me write a C program. We want the C program to run inside the kernel. Once the C program is done, first, we compile it to bytecode, and then we load it into the kernel. Before it goes into the kernel, it needs to be verified to make sure all of the code can run safely inside the kernel. We run a syscall to the kernel to load the program, the bytecode, into the kernel. After that, once the program is inside the kernel, it's still not functioning. The kernel just, somehow, stores the program for you. If you want it to run, you have to attach the program to a hook point. This is where those libraries help you. You develop your program in C code, but the library helps you do the loading, the verifying, the attaching. Everything related. Yeah. [0:17:52] KB: Got it. The core eBPF program is developed in C. But all of the different attach, detach, load, all of this stuff is what you're using eBPF Go to do. [0:18:04] CT: Exactly.
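That compile, load, verify, attach flow is the same whichever loader library you use. Chen's team uses the eBPF Go library; as a rough sketch of the same steps in C, here is what they look like with libbpf. The object file name and program name are placeholders tied to the earlier tracing sketch, not anything from the conversation.

```c
// Sketch only: the load/verify/attach lifecycle Chen describes, shown with
// libbpf in C. (ByteDance uses the eBPF Go library; the steps are the same.)
// "trace_retransmit.bpf.o" is a placeholder for your compiled bytecode.
#include <stdio.h>
#include <bpf/libbpf.h>

int main(void)
{
    /* Open the bytecode produced by something like: clang -O2 -g -target bpf -c prog.bpf.c */
    struct bpf_object *obj = bpf_object__open_file("trace_retransmit.bpf.o", NULL);
    if (!obj) {
        fprintf(stderr, "failed to open object file\n");
        return 1;
    }

    /* This issues the bpf() syscall; the in-kernel verifier checks every instruction here. */
    if (bpf_object__load(obj)) {
        fprintf(stderr, "verifier rejected the program\n");
        bpf_object__close(obj);
        return 1;
    }

    /* Loaded but inert: the program does nothing until attached to its hook point. */
    struct bpf_program *prog = bpf_object__find_program_by_name(obj, "trace_retransmit");
    struct bpf_link *link = prog ? bpf_program__attach(prog) : NULL;
    if (!link) {
        fprintf(stderr, "attach failed\n");
        bpf_object__close(obj);
        return 1;
    }

    printf("attached; press Enter to detach\n");
    getchar();

    bpf_link__destroy(link);  /* detach: the kernel goes back to normal */
    bpf_object__close(obj);
    return 0;
}
```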
[0:18:05] KB: Awesome. I'd love to go back a little bit to the way you use it in the network. What was the impact of moving from this big virtual switch approach to the eBPF networking stack? [0:18:18] CT: Okay. That's a very interesting story, because it is related to the company's history. Let's take a look at companies like Meta or Amazon. Those companies started running their data centers probably 20 years ago. At the time, the technologies they had for virtualization were virtual machines and virtual switches. This is what they used at the very beginning. Now, we have to know one thing: if you migrate from one technology to another, especially if you have to upgrade the entire data center, the cost is huge. So they will probably keep using virtual machines. They keep using virtual switches for their data centers. I work for ByteDance. I don't know if you know ByteDance, but I think you know TikTok, which is one of the apps developed by our company. ByteDance was founded in 2012. That's only about 10 years ago. When ByteDance kept growing and we needed to build our data centers, cloud native technology had emerged. At that time, we had a better choice for managing our data centers. That's why we use cloud native technologies. We use Kubernetes, and everything runs as a container in our data centers. I think this is luck, in a way, because at that time, we had better options and we still had the chance to make a choice, and we chose Kubernetes and containers. [0:19:50] KB: That's really an interesting point of how coming later, you can often jump over the learnings that the earlier companies had to do. Maybe we can expand a little bit now into some of these other cloud native areas. What other places do you feel like going cloud native first has made a difference for ByteDance relative to, say, Amazon, AWS, or something like it? [0:20:16] CT: Okay. Now, most companies have embraced cloud native, but let's make a distinction. Companies like Meta or Amazon run their data centers in a different way. When cloud native technologies emerged, of course, everyone wanted to use them, but we use them for different purposes. Amazon or Meta use cloud native technology as a service, especially Amazon. They are a cloud provider. They use cloud native technologies mainly as a service for their customers. They don't use it themselves; they provide it. We use it ourselves. Since we have Kubernetes running inside our own clusters, I think the biggest difference is the problem of scalability. Amazon uses cloud native as a service for its customers, and those customers mainly have small clusters, because they cannot afford to build their own data centers. That's why they want to buy machines from Amazon. The scale of their clusters is, let's assume, 1,000 machines. I think that's a lot. For us, we have over a million. Scalability is the major problem. We found that cloud native technology is very powerful, but it has a fatal problem, which is scalability. If you run Kubernetes on a cluster with 1,000 machines, that's fine. You have a powerful cluster management tool with basically everything you want. But if you have 100,000 machines, Kubernetes becomes a performance bottleneck. At that point, if you still want to use Kubernetes, you have to do a lot of modification. Some of the concepts from the cloud native ecosystem are powerful, but they're just not so efficient. We have to optimize them. I think this is the major difference in how ByteDance uses these cloud native technologies compared to the others. [0:22:26] KB: Yeah. There's a scale factor there that is really, blows my mind. You mentioned finding progressive bottlenecks in Kubernetes and cloud native abstractions, as you scaled up to 100,000 and beyond. Where were those and what have you replaced, or improved? [0:22:48] CT: Yes. For example, there is a very important concept in container networking that we call a service. Once you have a client container, you want to talk to your server container. You cannot just ask for the IP address of the destination and send the message to that destination IP address. In Kubernetes, they built a concept called a service. You ask the service, and the service returns a target and says, you can connect to this destination. But the total cost of running a service is huge, because for each service, you have to store all of those backend containers, and you have to choose one of the real servers.
If you have 100 servers, that's okay. But if you have 100,000, the cost of running this service concept is huge. It's not acceptable. We had to remove it. We use a different framework, something like service discovery. We developed our own service discovery framework to help the client container discover its target. That's one difference. Also, because we run our own data centers, cost matters. In recent years, for all of these IT companies, things have not been running the way they were; you see layoffs everywhere. Cost has become a big problem for all of these companies. They want to save money. They want to buy fewer servers, because servers are the biggest cost for our company. Let's see, if you have a 1% improvement on the total cost of the machines, and we have over a million servers, that 1% will be a lot. This is what drives us to seek new solutions to optimize the entire system. We know Kubernetes is powerful. The cloud native concepts are very useful, but sometimes they're just not what we want the most at this very moment. [0:24:51] KB: Yeah. Let me make sure I understand the service example. In Kubernetes, you have the service abstraction, and it assumes a global view. It's going to index all of the different servers, or containers, that might respond for this service type. When a client asks, it looks up in its index who's free and sends them the target. That assumes a bounded set that is small enough to perform quickly in that type of index. When you scaled up large enough, you needed to say, oh, we actually need a much more contained version of service discovery that's not going to be trying to index 100,000 machines. [0:25:36] CT: Exactly. [0:25:36] KB: That makes a lot of sense, I think, and points towards potentially a whole class of problems, where Kubernetes provides this nice abstraction, where it's going to do all the management for you of all these different things, package it up in a single location. That's going to run into scalability concerns as you go far enough. Maybe you bump out of a cache, or a memory, or something like that. Were there other areas that you found needing to almost scope down the abstraction exposed to be able to deal with that level of scale? [0:26:15] CT: Actually, I'm not an expert in this field, because we have a different team that manages resource orchestration. They are the team that builds the Kubernetes systems. I'm focusing on the networking parts. The service example is about where my knowledge boundary is. [0:26:35] KB: Okay. Looking within networking, then, again, we talked a bit about using eBPF to make a much more efficient networking stack within the physical machine. Can you connect it to other areas of the networking stack where being cloud native from the beginning has really made a difference? [0:26:57] CT: Yes. We have iptables, an old technology that was used first in cloud native. Actually, at the very beginning, when container networking became a problem to be solved, people used iptables first. But iptables is slow. Then we used eBPF to replace iptables. Now, eBPF has become the majority, and iptables is just where things started at the very beginning. Then it got replaced, because of its bad performance. [0:27:30] KB: I understood the eBPF approach to the network device replacing the virtual switch. Instead, you just drop into this kernel hook that looks in different places. What does it look like for iptables? [0:27:43] CT: iptables is also a function inside the kernel stack.
But iptables works in a chain format. For example, you write a chain of rules; each rule has a match and an action, and when a packet matches and hits a rule, it is processed with that action. It is a chain. You can pass the packet on to another chain. The entire iptables system is just a bunch of chains; it's a chain system. We can say it is not efficient. Especially once the rules expand, the cost is no longer acceptable. eBPF is just a single program. You write your own program, and one program is enough. That's why people replaced iptables and chose to use eBPF.
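To make that contrast concrete, here is a small sketch of the "one program, one lookup" style Chen is describing: instead of walking chains of rules, the program extracts the flow's 5-tuple and does a single hash-map lookup to get a verdict. The map, struct, and default behavior are illustrative assumptions, not a real ByteDance policy format, and the sketch assumes plain TCP over IPv4 with no IP options to keep it short.

```c
// Sketch only: replace an iptables-style chain walk with a single map lookup.
// One hash lookup on the flow's 5-tuple returns an allow/deny verdict.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct flow_key {
    __u32 saddr;
    __u32 daddr;
    __u16 sport;
    __u16 dport;
    __u8  proto;
} __attribute__((packed));

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, struct flow_key);
    __type(value, __u8);   /* 1 = allow, 0 = deny */
} policy SEC(".maps");

SEC("tc")
int enforce_policy(struct __sk_buff *skb)
{
    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_TCP)
        return TC_ACT_OK;

    /* Assumes no IP options, to keep the sketch simple. */
    struct tcphdr *tcp = (void *)(ip + 1);
    if ((void *)(tcp + 1) > data_end)
        return TC_ACT_OK;

    struct flow_key key = {
        .saddr = ip->saddr, .daddr = ip->daddr,
        .sport = tcp->source, .dport = tcp->dest,
        .proto = ip->protocol,
    };

    __u8 *verdict = bpf_map_lookup_elem(&policy, &key);
    if (verdict && *verdict == 0)
        return TC_ACT_SHOT;   /* deny: drop the packet */

    return TC_ACT_OK;         /* allow by default in this sketch */
}

char LICENSE[] SEC("license") = "GPL";
```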
When it comes to the virtual switch, I remember you also asked how eBPF replaced the virtual switch. No, eBPF doesn't replace the virtual switch. They just work in different scenarios, for different purposes. [0:28:46] KB: Got it. Let's maybe go in on there, because I misunderstood. In which scenarios are you using the eBPF networking approach, versus the virtual switch? [0:28:57] CT: Why do we need both a virtual switch and eBPF? Because they serve different purposes. eBPF focuses on container networking. Why can eBPF be used in container networking? Because a container is a lightweight, isolated environment. You still have only one system, one kernel. Even though the container runs in an isolated environment, the kernel is actually the same as the host's. eBPF just plays its tricks on the host kernel to help you redirect the packet from the physical NIC to the container. A virtual machine is a fully isolated environment. It uses a different kernel than the host. It's impossible for the host kernel to talk to the guest machine. You need a different mechanism to redirect the packet to the guest machine. That's why we use the virtual switch. Basically, these are two technologies that serve different purposes. [0:29:59] KB: To make sure that I understand it, the way that I'm seeing it right now, it's almost like there's layering. If you have containers within a single machine, or virtual machine, you can route between those containers purely with eBPF. As soon as you start to go outside of that machine, you need to go over an actual physical NIC, or a virtual switch if you're going to a virtual machine. [0:30:24] CT: Yes. [0:30:26] KB: Okay, that makes sense. I'm curious then, from a performance gains standpoint, how much of the traffic that you're directing stays within that single machine and is able to leverage eBPF, and how much still ends up going past? What types of performance gains do you end up with on an aggregate basis? [0:30:49] CT: You cannot make a complete comparison between eBPF and the virtual switch, because the performance varies with how you use them. The main advantage of eBPF is easy management. If you want to run a virtual machine, you have to do a lot of preparation and reserve a lot of resources just for the virtual machine, resources you can no longer allocate to other workloads. With eBPF, you don't have to consider that, because all eBPF programs run inside the same kernel. All of this makes it much easier to manage a container than a virtual machine. I think this is the biggest advantage of using eBPF. But when it comes to comparing performance between eBPF and the virtual machine, it's hard to say which one is better. In some scenarios, let's say when the machine is fully loaded, I think the virtual machine can have better performance than eBPF. That is a fact. [0:31:58] KB: Interesting. What are the scenarios in which you still want to have virtual machines? It feels like containers are the cloud native way to do it. [0:32:07] CT: Okay. When do we want to use a virtual machine? Let's go back to AWS. As a cloud provider, you run the virtual machine not for your own services; you run the virtual machine for your guests. The people who buy your virtual machines run their own applications. What they want is full isolation and security. That is the top priority. Even though they run their service on AWS, they don't want their data to be accessed by the cloud provider. AWS still has to build a fully isolated environment for their customers. That's why they still choose virtual machines. At ByteDance, across the entire data center, we run our own services. We don't have that requirement for security and isolation. A lightweight method is enough. What we want to achieve is easy management. [0:33:08] KB: That makes sense. Okay. What do you see as the next frontiers in this space? What are you working on for eBPF, or within the networking stack, that you think is taking this to the next level? [0:33:26] CT: Yes. I think what we are considering at this moment is still the cost. eBPF brings a lot of advantages, like easy management, but the cost of the kernel stack is still inevitable. Because if you look into the kernel stack, we have multiple interrupts and memory copies. When we receive a packet from the NIC, we first copy the packet from the NIC to the kernel stack. When we go through the kernel stack, we have to copy the packet from the kernel to the user application. This cost is inevitable. When we want to optimize the entire system, there is no way for us to ignore this cost. This is the bottleneck for eBPF technology. It's popular because it's easier to use, but if we want to save more resources, we have to optimize further. That's why we have several solutions at the moment. The first is netkit, which we mentioned before. It helps us remove one interrupt when a packet is transmitted between the container and the host. Netkit uses, let's say, a special mechanism that helps us eliminate that interrupt. But that is not enough. We are asking for help from the hardware. What we are doing next is combining eBPF with hardware offloading. We know eBPF is powerful: we can write our own custom program in the kernel, but the cost is higher. Once we leverage the capability of the hardware, especially now with the smart NIC, we have hardware interfaces from different vendors that help us offload packet processing from the kernel to the hardware. The problem is that it's difficult to use. You cannot write your own program inside the hardware. You can just inject rules or policies, like what I mentioned with iptables. So, somehow, we need to translate the eBPF program into hardware rules and load those rules into the hardware, so that the smart NIC helps us process the packets. This is what we are doing at the moment to, let's say, achieve the best performance. [0:35:55] KB: Curious there. Is there a standard for how those rules are defined for the hardware offloading? Could you create a compiler, essentially, that takes your eBPF rules into them? [0:36:06] CT: It's predefined. Yes. Let's see. If we write an eBPF program, everything can be defined by the code, so we can write whatever we want. Now let's look at typical hardware offloading rules.
They're just like iptables rules. You have a match: you match on the header of the packet, and you have an action. Match, action, match, action. You have multiple rules with different priorities, and each rule has a match field and an action. Then you need to write a separate program to translate the eBPF program into rules that behave the same way. This is the difficult part. What we are doing now is combining these two technologies, because the rules cannot all be predefined. If you tried to predefine everything you want as rules, first of all, we don't think it's possible. It's just too difficult. So what we do now is separate a slow path and a fast path. The fast path is hardware offloading. If a packet misses the rules, the packet goes back to the kernel, and an eBPF program processes it. When the eBPF program decides, okay, this packet is allowed to proceed, we inject a rule. The match of the rule is the header of this packet: the source and destination IP addresses and the source and destination ports. The action is, let's say, redirect to container A. We inject this rule into the hardware, and the hardware will recognize the following packets. This is what we are doing. [0:37:58] KB: That's interesting. To make sure that I understand, you're in some ways treating the hardware as almost a caching layer of rules, where a packet comes in - starting from a blank slate, a packet comes in, you don't have a rule for it. It goes to the networking stack. Your eBPF code picks it up, analyzes it, says, okay, here's where this needs to go. Furthermore, here's the rule that the hardware can use to do that fast next time. It loads that up into the hardware, which then, for subsequent packets with similar patterns, or following that rule, knows what to do. [0:38:38] CT: Exactly.
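As a rough sketch of that slow path, here is one way the kernel-side eBPF program could report a newly approved flow, so that user space can install the corresponding hardware rule: it pushes the flow's addresses, ports, and target interface into a BPF ring buffer. This is a fragment rather than a full program, and the event layout and names are assumptions for illustration; the transcript doesn't specify ByteDance's actual mechanism or format.

```c
// Sketch only: a slow-path fragment that reports a newly approved flow to
// user space through a ring buffer, so a matching rule can be pushed into
// the NIC. Names and the event layout are illustrative assumptions.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct flow_event {
    __u32 saddr, daddr;
    __u16 sport, dport;
    __u32 ifindex;      /* where the hardware should redirect this flow */
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 20);
} flow_events SEC(".maps");

/* Called from the tc program after it has parsed the headers and decided
 * the packet is allowed and should go to interface `ifindex`. */
static __always_inline void report_new_flow(__u32 saddr, __u32 daddr,
                                            __u16 sport, __u16 dport,
                                            __u32 ifindex)
{
    struct flow_event *e = bpf_ringbuf_reserve(&flow_events, sizeof(*e), 0);
    if (!e)
        return;             /* buffer full: the flow simply stays on the slow path */

    e->saddr = saddr;
    e->daddr = daddr;
    e->sport = sport;
    e->dport = dport;
    e->ifindex = ifindex;
    bpf_ringbuf_submit(e, 0);   /* a user-space program picks this up and programs the NIC */
}

char LICENSE[] SEC("license") = "GPL";
```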
[0:38:41] KB: What's the lifespan of those rules? Are they durable, or is there a timeout? Or, how does that work? [0:38:49] CT: Every time a packet matches a rule, a counter is updated. When a rule has been idle for about 30 seconds, we recycle it, because there's no way for the rules to delete themselves; we have to delete the rules. If a rule is in the hardware, the kernel cannot capture those packets anymore, because the kernel has been bypassed. So how could we know, okay, the session is over, we need to delete it? There's no way for us to know that from the kernel. What we're doing is running another program on the host that periodically fetches all the rules from the hardware and analyzes them. If a rule has been idle for 30 seconds or one minute - we have a timeout setting - we just recycle the rule to make sure there are no rule leaks inside the hardware. [0:39:45] KB: That makes sense. That's cool. Are you able to get all the data you need from the hardware itself, or does there need to be some communication between the eBPF that sets them and the program that's clearing them? [0:39:58] CT: Well, for now, we use user space. We run another agent to fetch all of that data from the hardware, since there is no way for eBPF to communicate with the hardware; that part is still missing. Maybe somewhere in the future we can find a way for the eBPF program to talk to the hardware directly, but for now there's no way for us to do that. [0:40:19] KB: How does it set the rules then, if it can't talk directly? [0:40:23] CT: We first let the eBPF program talk to our agent. There is a channel for the eBPF program to communicate with the host application. The agent analyzes the message from eBPF, the agent translates the message into our hardware rules, and the agent injects the rules into the hardware. This is what we're currently doing. So you see, there is two-way communication from the kernel to user space, and from user space to the hardware, and the user space application also periodically fetches data from the hardware. It does look a bit ugly, but there's no better way for us right now. Let's see in the future. I hope we can find a way for the eBPF program to talk to the hardware. [0:41:11] KB: Honestly, that makes sense though, because the agent needs to be somewhat durable. It needs to run every 30 seconds, check things, keep track of it. Whereas, eBPF, as I understand it, is event driven. It's always happening just as a thing comes in, we do it. To make sure that I understand the whole thing then, you have this durable agent that is responsible for keeping track of what rules are currently on the hardware and translating when there's an eBPF rule that triggers, move it into a hardware rule. Network packet comes in, misses the hardware-cached logic, goes into eBPF. eBPF applies its logic, puts a rule on, and sends a message to your agent. Your agent then says, "Ah, here's a new rule. Let me push that up into hardware." [0:42:03] CT: Yes. It's a bit complicated.
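To pin the moving parts down, here is a minimal sketch of what such an agent's main loop might look like: it consumes flow messages from the eBPF program's channel (a ring buffer, as in the earlier kernel-side sketch), installs a matching rule, and sweeps out rules that have gone idle. The pinned map path, the event layout, and the nic_* functions are placeholders; a real agent would talk to the NIC through the vendor's own interface, and the transcript doesn't say which channel or timeout handling ByteDance actually uses.

```c
// Sketch only: the shape of a user-space agent like the one Chen describes.
// It consumes flow events from the eBPF ring buffer, "installs" a hardware
// rule for each one, and recycles rules that have been idle too long.
#include <stdio.h>
#include <time.h>
#include <linux/types.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

struct flow_event {          /* must match the eBPF program's event layout */
    __u32 saddr, daddr;
    __u16 sport, dport;
    __u32 ifindex;           /* where the NIC should redirect this flow */
};

#define MAX_RULES 1024
#define IDLE_TIMEOUT_SEC 30

static struct { struct flow_event flow; time_t last_seen; int in_use; } rules[MAX_RULES];

/* Placeholders for the vendor SDK: program a rule, read its packet counter, remove it. */
static void nic_install_rule(const struct flow_event *f) { printf("install rule -> if %u\n", f->ifindex); }
static int  nic_rule_is_idle(int slot)                   { (void)slot; return 1; /* stub */ }
static void nic_remove_rule(int slot)                    { printf("remove rule %d\n", slot); }

static int handle_event(void *ctx, void *data, size_t len)
{
    const struct flow_event *f = data;
    for (int i = 0; i < MAX_RULES; i++) {
        if (!rules[i].in_use) {
            rules[i].flow = *f;
            rules[i].last_seen = time(NULL);
            rules[i].in_use = 1;
            nic_install_rule(f);      /* fast path now handles this flow */
            break;
        }
    }
    return 0;
}

static void recycle_idle_rules(void)
{
    time_t now = time(NULL);
    for (int i = 0; i < MAX_RULES; i++) {
        if (rules[i].in_use && nic_rule_is_idle(i) &&
            now - rules[i].last_seen > IDLE_TIMEOUT_SEC) {
            nic_remove_rule(i);       /* the rule cannot delete itself; the agent must */
            rules[i].in_use = 0;
        }
    }
}

int main(void)
{
    /* "/sys/fs/bpf/flow_events" is a placeholder pin path for the ring buffer map. */
    int map_fd = bpf_obj_get("/sys/fs/bpf/flow_events");
    if (map_fd < 0) {
        fprintf(stderr, "could not open pinned ring buffer\n");
        return 1;
    }

    struct ring_buffer *rb = ring_buffer__new(map_fd, handle_event, NULL, NULL);
    if (!rb)
        return 1;

    for (;;) {
        ring_buffer__poll(rb, 1000 /* ms */);   /* new flows from the eBPF slow path */
        recycle_idle_rules();                   /* periodic sweep of the hardware table */
    }
}
```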
[0:42:06] KB: It is, but I think it's quite clever. That's cool. What else are you doing in this space? [0:42:13] CT: Besides this, since we have rule offloading in the hardware, we can now leverage the RDMA technique. I'm not sure if you're familiar with RDMA. [0:42:23] KB: You essentially map user space memory to the NIC, and it can send it directly? [0:42:28] CT: Yes, exactly. RDMA is a technology developed by a NIC manufacturer called Mellanox, which is now part of NVIDIA. Since we're using eBPF to help us leverage the hardware offloading ability, now, on top of this, we can use RDMA. [0:42:48] KB: Uh-huh. Let's talk real quickly about what that looks like. You send, eBPF figures out where it needs to send the network packet. Does it know the memory that is mapped for the NIC to be able to go? Or does that go through your agent? Or what is the flow? [0:43:06] CT: No, eBPF doesn't know anything about this. eBPF only knows about the packet redirect: we need to pass this packet, or drop it. We translate the eBPF policy into a hardware rule and push it into the NIC, and then the subsequent packets can go directly through the hardware and bypass the kernel. At this point, we can use RDMA, because RDMA provides a different mechanism of packet transmission that, most importantly, bypasses the kernel. Since the hardware knows where to route the packet, you can use RDMA to talk directly to the destination. You don't need to involve the kernel anymore. [0:43:57] KB: It's been many years since I did anything with RDMA. If I understand, you need to give it not just a network address, but you need to actually give it the location in memory. Is that correct, or? [0:44:08] CT: Yes. This is what RDMA does. Let's see, if we have two physical machines with NICs that support RDMA, you can use RDMA directly, because the NICs know exactly where the destination is located. In container networking, the NIC doesn't know that at the very beginning. This is the trick. You need eBPF to help the NIC identify the destination. Once you provide that connectivity for the NIC, you can leverage RDMA. [0:44:39] KB: Got it, got it, got it. [0:44:40] CT: Yeah. Because previously, RDMA was used on host machines, for applications deployed on the host. It had never been used with a container, because once you have your application in a container, there's no way for the NIC to locate the destination. It doesn't know where the destination container is. But with the help of eBPF, yes, it works. [0:45:04] KB: That's clever. When you do that fallback, eBPF locates it, sends this information also to your agent, which can then load all of that back up into the hardware, and you can bypass the container boundary and do RDMA straight to a container. [0:45:19] CT: Yes. [0:45:20] KB: That is super cool. [0:45:22] KB: Well, I think we've covered a lot. I actually really like the example with the hardware, because it connects not just this kernel piece, but shows how this can be a bridge between container technology and, essentially, the old world. Anything that was living outside of the container world didn't understand it. You can intercept with eBPF, pass that data off to an agent that understands both sides, and make this bridging connection. [0:45:56] CT: Yes. You also mentioned one thing I want to share, and that is the relationship between the old world and the new world. We see that the Linux kernel is very big and has been developed for decades. For us, for most people, it's old. Nowadays, more technologies have emerged from different places, from open-source communities or from hardware manufacturers. People will say that the kernel is big and there's no way for us to change the kernel by our own will, but we can find a way to bypass it. So you see, we are actually at a crossroads about what we will choose next: do we embrace the technologies that bypass the kernel, or do we embrace the technologies that stick with the kernel? Actually, we are facing a choice. I think it's a very interesting topic. Actually, I don't have the answer yet, because we are still wondering if they can coexist in the future, or if they have to be enemies. I think these are very interesting questions. Yeah, I'm just bringing it up. Actually, I don't have answers for that. [0:47:08] KB: It is a really interesting question, I feel like. It's been around for years of how much can you do in user space? How much can you do without having to jump down into the kernel and face the performance costs of going back and forth over a system call? It does seem like eBPF is a nice middle ground, where you can write your own custom code, dynamically load it without having to restart, or do anything with your kernel. Yet, it runs inside the kernel and has access to all the privileged information. [0:47:43] CT: Yes. [END]