The Rise of Platform Engineering

Article Thursday, February 13 2020

The rise of microservices, container orchestration, and the like has introduced novel engineering challenges. Platform engineering teams have formed at a number of organizations to shoulder these responsibilities. In some respects, the role of a platform engineer hasn’t drastically changed from that of other DevOps-related roles. There is some truth to the claim that “Platform Engineer” is nothing but a new title. However, a number of factors have caused, and continue to cause, the traditional responsibilities of a Site Reliability Engineer (SRE) to shift.

These factors include the increased popularity and extensibility of cloud providers, Kubernetes, and infrastructure as code. Paradigms introduced by these factors unlock many superpowers for an organization, such as service discovery and the ability to horizontally scale with ease, which could potentially lead to more money in the bank.

Mature companies with legacy infrastructure are mobilizing in preparation for the great migration to the cloud, and cloud providers are ready to accept them with open arms. But, with this migration comes a need for expertise in the cloud and container orchestration. So, organizations are beginning to question whether they should form a platform engineering team. Companies born shortly before, or during, the cloud era don’t have as many of these concerns; there are fewer, if any, legacy systems to wrestle with. It’s very common for companies to begin, and remain, on cloud providers without ever managing on-prem systems.

As mentioned, the role of “Platform Engineer” is considered by some to be just a different title for a job that has traditionally been performed by an infrastructure team. To understand why this is not exactly true, let’s take a closer look at what platform engineering looks like today.

What is platform engineering?

This is a loaded question. Asking ten engineers this same question would likely yield ten different answers. That said, there would probably be a number of similar themes. The most prominent theme would likely be similar to the idea of bridging the gap between software and hardware. In other words, platform engineers enable application developers to put software into the hands of users in an easier manner. This broad stroke manifests itself in a number of different ways. Some of these ways could be standardizing an organization’s Kubernetes deployments, ensuring infrastructure is auditable, automating various deployment processes, and writing documentation for application developers.

The responsibilities of a platform engineering team should not be confused with those of a DevOps team. They’re similar in some respects, though they vary in others. Examining where platform engineering and DevOps diverge can help to explain the growing popularity of this new team. For one, the concept of DevOps predates that of platform engineering and has matured in sync with technological progress. Originally, DevOps was fairly ad hoc. For example, if a team within an organization wanted to host a new website, coordination between this team and a DevOps team was necessary. Contrast this with the notion of platform engineering. Platform engineers build systems that other teams can build on. To continue the example, if the same team had a platform that would take care of hosting the website, no coordination would be necessary between this team and the platform engineering team.

Another significant difference is the role of an API boundary, as well as how explicit this boundary is, within the context of each role’s responsibilities. This ties in with the suggestion that DevOps tends to be more ad hoc than platform engineering. Both DevOps and platform engineering teams are concerned with deployments, service accounts, and infrastructure. However, DevOps teams aren’t building platforms with explicit APIs and abstractions that offer flexibility for application developers; platform teams are.
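To make the idea of an explicit API boundary more concrete, here is a minimal sketch of what a self-service hosting interface might look like. Everything in it is hypothetical and exists only for illustration: the `HostingPlatform` and `WebsiteRequest` names, the `internal.example.com` domain, and the replica limit are assumptions, not a description of any real platform.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class WebsiteRequest:
    """What an application team supplies. Nothing about servers or networks."""
    team: str
    name: str
    replicas: int = 2


@dataclass
class HostingPlatform:
    """Hypothetical platform API owned by the platform engineering team."""
    domain: str = "internal.example.com"  # assumed, illustrative domain
    max_replicas: int = 10                # assumed organizational guardrail
    sites: dict[str, WebsiteRequest] = field(default_factory=dict)

    def host_website(self, request: WebsiteRequest) -> str:
        """Validate the request, record it, and return the site's URL."""
        if not 1 <= request.replicas <= self.max_replicas:
            raise ValueError(
                f"replicas must be between 1 and {self.max_replicas}"
            )
        url = f"https://{request.name}.{request.team}.{self.domain}"
        # A real platform would provision infrastructure here; this sketch
        # only records the request to show where the boundary sits.
        self.sites[url] = request
        return url


platform = HostingPlatform()
url = platform.host_website(WebsiteRequest(team="claims", name="quotes"))
print(url)  # https://quotes.claims.internal.example.com
```

The application team only fills in a request; the guardrails and the provisioning details stay on the platform team’s side of the boundary, so no ad hoc coordination is needed for each new website.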

To further describe the role platform engineering plays in an organization, let’s consider an example. Suppose an insurance company that was founded in the 1980s has started to shift its infrastructure to the cloud. Now, suppose that within this organization, the software engineers are split into two categories: application development and infrastructure. Before the cloud era, it was common for the infrastructure engineers to resemble a backend team that offered APIs.

These responsibilities are most often fulfilled through the use of infrastructure as code (IaC). Some common infrastructure as code tools are Terraform, Vagrant, Chef, Puppet, and AWS CloudFormation. A number of these tools are open source. Generally, the platform built by platform engineers is composed of these open-source tools. An organization’s platform engineers tailor infrastructure as code tools to the needs of the organization’s application developers. Below is a figure that illustrates how infrastructure as code and platform engineers fit into a development team, as well as how these tools, ultimately, lead to more features.

Infrastructure as code

Infrastructure as code is one of a number of factors that have helped boost the notion of a platform engineer within the collective consciousness. But, it underpins many of these additional factors, so it deserves a closer examination. Before the era of infrastructure as code, a human had to manually configure infrastructure.

In retrospect, manually configuring infrastructure is problematic; the element of human error is always a risk. Humans are more error-prone and expensive than computers. Oh, and humans are literally billions of times slower. Infrastructure as code reduces the risk of human error, reduces cost, and improves the speed at which teams within an organization can iterate. The fewer humans involved in a systematic process, the better.

One notable benefit of infrastructure as code is its ability to be checked into version control. This is especially beneficial for enterprises that may be ramping up cloud infrastructure quickly. Version control platforms, like GitHub, provide context for a system’s infrastructure by facilitating and keeping records of changes and, arguably just as important, the review process. GitHub’s pull request review process is a great example of this; it’s a place for discussion to take place. Whether this type of review is ideal for all kinds of pull requests is arguable. But, it is a huge upside for infrastructure as code.

As with many technologies, there are different approaches to infrastructure as code. The most common are declarative and imperative models. Declarative frameworks, such as that offered by Kubernetes, require users to define a desired state. Users don’t specify how this state should be achieved. In a declarative model, the system develops a plan to reach and maintain the specified state. Imperative frameworks require users to specify commands in a particular order, in order to reach a desired state.

At first, the imperative model may be more intuitive, as a number of popular programming languages are considered procedural, like Go. However, it is not the popular approach to infrastructure as code. For one, the imperative model does not scale. Complexity scales exponentially, at best, in relation to the number of components in a system; users have to execute the correct commands in the correct order for more machines. Contrast this with the declarative model: users describe the desired end state. Then, it is the responsibility of the framework to develop and execute a plan to reach this desired state. Complexity scales logarithmically, or at least far better than linearly, in relation to the number of components in a system; the framework takes care of all the heavy lifting.
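The difference between the two models can be sketched in a few lines. This is a simplified illustration, not how Kubernetes or any particular tool is implemented; the `Server` type and the `plan` and `reconcile` functions are hypothetical names chosen for the example. The user supplies only the desired state, and the framework works out which actions are needed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """A hypothetical piece of infrastructure, identified by name."""
    name: str
    size: str


def plan(current: dict[str, Server], desired: dict[str, Server]) -> list[tuple[str, str]]:
    """Work out which actions move the current state to the desired state."""
    actions = []
    for name in sorted(desired):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def reconcile(current: dict[str, Server], desired: dict[str, Server]) -> dict[str, Server]:
    """Execute the plan and return the resulting state."""
    state = dict(current)
    for action, name in plan(current, desired):
        if action == "delete":
            del state[name]
        else:  # "create" and "update" both end with the desired definition
            state[name] = desired[name]
    return state


current = {
    "web-1": Server("web-1", "small"),
    "db-1": Server("db-1", "large"),
}
# The user only describes the end state, never the steps to reach it.
desired = {
    "web-1": Server("web-1", "medium"),
    "web-2": Server("web-2", "medium"),
}

print(plan(current, desired))
# [('update', 'web-1'), ('create', 'web-2'), ('delete', 'db-1')]
print(plan(reconcile(current, desired), desired))
# []
```

In an imperative model, the user would have to issue those three commands by hand, in a valid order, for every machine. Here, running the reconciliation a second time produces an empty plan, which is what makes the declarative approach safe to repeat.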

In some cases, the flexibility an imperative model offers is preferable to a declarative model that abstracts it away. For most infrastructure work, though, there are tools that offer a declarative approach to infrastructure as code that minimizes complexity, such as Terraform, Vagrant, and CloudFormation. If you’re interested in learning more about these technologies, check out this episode of Software Engineering Daily. This episode is a conversation with Mitchell Hashimoto, a co-founder of HashiCorp, about application development and why the importance of automation scales with the complexity of infrastructure.

When does an organization need a platform engineering team?

There are tradeoffs organizations often weigh when considering building a platform engineering team. On one hand, building a platform engineering team diverts resources from building business logic and developing features. On the other hand, a platform engineering team may build tooling and infrastructure that increases engineering productivity. Without a platform team in place, it’s likely that some engineers have taken it upon themselves to assume a platform engineer-like role. Organizationally, this can become a challenge and place a burden on all engineers in an organization. Without a definitive set of engineers responsible for an organization’s platform, rogue engineers acting in a platform engineering capacity will probably not be effective. Put simply, deciding whether to build a platform engineering team can be thought of as weighing short-term gains against long-term gains.

Building a platform engineering team is easier said than done. The following may come as a surprise to readers living in the Bay Area: legacy infrastructure is common within enterprise organizations and can result in a lot of confusion about platform engineering. Platform engineering at a startup founded in the cloud era looks very different from platform engineering at a pre-cloud era enterprise. Unlike startups born in the cloud era, many pre-cloud era enterprises have on-prem systems that have yet to be migrated to the cloud. Additionally, as if migrating to the cloud weren’t challenging enough, pre-cloud era enterprises tend to have more red tape and bureaucracy standing in the way of organizational changes. The short-term downsides to creating a platform engineering team can be magnified by these types of organizational barriers.

Organizations should consider the short-term losses that creating a platform engineering team may cause. A strong indication that an organization would benefit from a platform engineering team is the observation of different product teams building similar features or trying to accomplish similar tasks. Product teams could experience an increase in productivity if a platform team is formed. Platform engineering is interesting because it can cause an entire organization’s efficiency to increase. This should be taken into consideration before writing off the need for a platform engineering team.

If an enterprise organization does form a platform team, an effort to continue, or begin, migrating to the cloud is almost inevitable. Migrating to the cloud forces organizations to choose which cloud vendor, or vendors, to use. Let’s examine how the choice between a single cloud vendor and multiple cloud vendors may affect an organization.

To multi-cloud, or not to multi-cloud, that is the question

The first step in developing a cloud strategy is deciding if a multi-cloud strategy is needed. The best multi-cloud strategy may be to not use services from different vendors and instead embrace all that one vendor has to offer. Sure, using a single cloud provider has drawbacks, but it can prove to be vastly simpler than any multi-cloud approach. This isn’t to say the benefits of going multi-cloud can never outweigh the simplicity of using a single cloud vendor. Examining both the benefits and drawbacks of multi-cloud architecture can shed light on how going multi-cloud may affect a given organization.

The benefits of going multi-cloud include being able to use the best services available for a given task and limiting the risk of outages in any given geographical region. The benefit most frequently touted by multi-cloud advocates is the freedom it provides: you’re not locked into any single vendor’s ecosystem. Vendor lock-in is the idea that an organization depends so heavily on a vendor that it wouldn’t be able to substitute alternative solutions. The fear is that the work associated with substituting a current dependency for another would be far too costly.

There are different ways one could interpret the notion of cloud-vendor lock-in. On one hand, there’s the risk that an organization using a single vendor deems a critical service sub-par and wants another option. Or, suppose this critical service begins to cost more than anticipated. In this scenario, it would be difficult to decide what the next step should be. However, it could be argued that this service isn’t actually critical for the organization. Popular cloud vendors offer a wide array of products; it’s possible that the allegedly critical service could simply be replaced. Additionally, these cloud vendors have guides on migrating to and from their platforms.

Yes, there are some benefits to going multi-cloud. There are also some downsides. One downside is the fact that there’s more organizational complexity; this is difficult to deny. For example, each platform’s security accounts must be managed. Aside from organizational complexity, it can be difficult to find engineers who are knowledgeable about multiple cloud platforms. More ramp-up time might need to be allocated for new hires if a multi-cloud system is in place. Another downside to a multi-cloud environment is the increased surface area for potential threats. Organizations may weigh this downside differently, depending on the kinds of data an organization deals with. But, generally, it’s safe to assume your organization values security.

To conclude this examination of the multi-cloud option, let’s take a closer look at using only a single cloud. Sure, there are risks of lock-in and not being able to use the best services for a given task. However, in many cases, using one cloud vendor can greatly simplify a system. Going all-in on a cloud vendor will inevitably have less organizational and engineering overhead. A cloud’s unique products and capabilities are a primary reason to choose a vendor, rather than opt for the lowest common denominator of multiple cloud vendors. Finally, there’s little point in worrying whether a particular cloud will fail to meet your organization’s needs at some point in the future. There’s fierce competition amongst the top cloud vendors. Competitive pressures minimize the risk of any single cloud vendor not delivering top-notch products.

Looking ahead

In this episode of Software Engineering Daily, Abby Fuller, a principal technologist at Amazon, noted that “…folks [organizations] should always have a modernization plan.” Broadly, plans for modernization are ubiquitous across industries, and they often involve upgrading the technology in use. The need for modernization amongst enterprise organizations stems from a number of factors, all of which have strong ties to the rise of cloud computing. These factors include the rise of Big Data, increased popularity of microservices, and momentum behind Kubernetes in the container orchestration space.

The notion of container orchestration and its difficulties gave rise to Kubernetes. Kubernetes won the orchestration wars. This has allowed engineers to develop transferable skills. Mindshare continues to build around Kubernetes, and a number of pre-cloud organizations have started their migrations to Kubernetes. So, many organizations that are migrating to the cloud and using microservices are considering Kubernetes as their container orchestration framework. Not only does it simplify the container orchestration process, but it can attract more developers who are interested in technologies closer to the bleeding edge. This demographic of developers tends to be attractive to pre-cloud organizations that are interested in moving more of their infrastructure to the cloud.

Forming a platform engineering team is one way an organization could begin modernizing its engineering culture. The aforementioned trends, like the prominence of cloud vendors and the shift to a microservice architecture, make forming a platform engineering team even more appealing. But, modernizing engineering culture can be achieved in other ways, as well. A platform engineering team should not be created for the sake of having a platform engineering team. As with most things in the technology space, it depends.

One article will most likely not be enough for an organization to form a strong opinion about whether a platform engineering team would be beneficial. Building, and maintaining, awareness of the technology landscape can help an organization form a stronger opinion. If you’re interested in learning more about the role of software platforms within organizations, check out this episode of Software Engineering Daily covering Cruise’s approach to platform engineering.