I’ve been thinking about writing a blog post around the notion of “Engineering for Scale” until that happens, here we go.
When we did our internal testing with LXC’s in late 2011, we were really excited because they allowed us to produce super cheap, root-enabled VMs for our users. Due to the low cost, to us this discovery had the potential to solve the world’s computing problem as we could make VMs affordable to everyone (for development, sandboxing and collaboration).
So we dug in and our internal tests were super positive. In fact, in early 2012 I sat down with Solomon Hykes and we were thinking about bringing Dotcloud and Koding closer by integrating the two platforms. At that time Dotcloud was also tinkering with LXCs, a few months later on they came up with Docker which is based on LXCs.
Docker is an amazingly awesome and successful product. LXC was and is a super interesting piece of technology. But when we used LXC to give 100s of thousands of people a place to do development work, things did not pan out. This is not a problem with LXC per se or a problem with Docker, this is a problem of scale. In general, Docker or LXC do not have orchestration software like Xen and so Koding had to come up with one.
Consider a scenario where you have 16GB/8CPU/1TB host machines, you can basically deploy any number of containers in each. Say we deploy a 100 containers in one of them. Assumption is, they are not production servers, and they will not all use CPU/RAM at once. We had used LXCs CPU and RAM limits (cgroups) to make sure 1 container didn’t block the entire host. In theory this was great but in practice – we saw LXC limits didn’t work all that well. One container could take the entire host down for various reasons which then resulted in the other 99 people on that host complaining about the service not being reliable.
Furthermore, we needed to proxy username.kd.io domains (we give free vms to our free users, they assign these domains to them) with our own proxy software written in Go, otherwise we needed to write our proper DNS servers with DNS protocol, we had to give Public IPs to those containers but this was full of edge cases, newest DNS software was years old. In the meanwhile, we wrote our own SSH and FTP proxy software to let our users reach their containers from outside world via hostnames. (Let me not get into details of that, you certainly do not want to write SSH or FTP proxies, deploy and maintain them in your startup.)
On top of that, the 100 people on that host used files, cloned git repos and over time their LXC based VMs grew by gigabytes. When these users left their sessions, we needed to efficiently (and cheaply) store all those gigabytes of files.Therefore we needed a network file system to be able to move containers from one host to another. In total, our user files exceeded Petabyte levels. We evaluated GlusterFS and some others but chose CEPH as our network file system. It is the best out there to solve the problem we had because it allows you to mount a file system directly to your container. CEPH is a great software, however, when we used it to mount 20,000 boot drives at once, it resulted in a system where you couldn’t predict where it will fail. Let me reiterate here, this is also not the problem of CEPH, it’s a problem of scale. Your servers run out of I/O, you try to mount new drives when your network cables can’t carry anymore data, TCP packets are little soldiers now ramming the gates of your network cards. And things fail. This time, it fails for ALL hosts, because your storage is unusable.
All of this was happening on a hosted service so in order to eliminate the “hosted” variable, we then decided to deploy our own servers in a data center and directly own the network cables involved. We finalized on a network architecture that employed over 2TB of RAM and a multiple petabyte storage OpenStack. This worked great but again, when Koding had a spike of 50,000 concurrent people coming and requesting accounts/VM, our deployed network architecture did not scale at the pace we needed it to.
Our architecture had more than 64 very large hosts and at any given moment, 2-3 of them were down for bitcoin mining, fork-bombing and kernel panics because one cannot control what one’s users will (ab)use the system for. As a fairly technical CEO, having written the first openvz based hypervisor to Koding and using Go as our backend even before we get funded in 2010, I can tell you complexity of this system was beyond me at this point. The amount of things that needed to happen for this system to continue to operate was also beyond any single team member of Koding. When things failed, it usually required 3-4 people to come together and tackle. Our sysadmins would utter words that none of us would understand, we started talking about patched kernels, aufs, tcp dumps, better routers, network switches, data centers, connections between them, Openstack problems, heck we bought Fusion-io drives because obviously SSDs were not fast enough (true story). To add to the complexity, we had team members across the globe, just one simple issue almost always required a coordination between people in New Zealand, Istanbul, Berlin and San Francisco.
We went back to drawing board and realized that Koding was becoming an infrastructure company. Yes, we could have solved how to orchestrate those LXCs, how to assign public IPs to each LXC, move them from one container to another, etc. etc. We could probably work more with the core LXC team to close those security gaps – however, all those problems had already been solved by Amazon and they were called, EC2, EBS, S3, Route53, Cloudwatch, Snapshots and others. We were re-inventing all those, just to have LXCs. Whereas our users just asked for reliable VMs that they can work with, and while re-inventing all those, we were breaking our users workflows. The other important part of this decision was based on the fact that a container provided by LXC is not providing a real VM experience. For example, inside a LXC container it’s not easy to run another container (like Docker). Additionally, with the new strategy, our VMs are truly independent of each other, they are only responsible for themselves and there is a real isolation between host system and the VM itself.
We decided to stop re-inventing. It was hurting us and taking us away from what we needed to do which is to give our users the best development environment that they can get their hands on.
In short, until one day a company will provide LXCs instead of Xen VMs, and that company will promise Koding that it can give 10s of thousands at a time, securely and reliably, Koding will use VMs. Right now that company is Amazon. They are amazing. We have spun up 150,000 VMs just in November with 99.5% uptime. This uptime number is real for our users and surreal to Koding. Something we could never achieve or even come close to with LXCs.
Koding’s mission is to enable developers to have reliable work environments, as users don’t really care about the underlying technology (well, I know some of you do!). Plus with the amazing onset of Docker containers, our decision to use Amazon VMs, enabled our users to be able to use Dockers inside the VMs that we provide, something that is not (easily) possible on a LXC based VM.
This is a story of our decision to use Amazon VMs vs LXCs. Hope it answers your question. Comments are welcome.