Shoreline: Pod Debugging and Management Automation
Kubernetes has come to dominate the cloud application development landscape. First released in 2014, Kubernetes won the container orchestration wars against rivals such as Mesosphere and Docker Swarm to become the de facto standard for container orchestration. However, as Kubernetes has become that standard, tooling for debugging and fixing issues in cloud-native Kubernetes applications has lagged behind. When developers encounter a familiar issue with a known solution across many pods, they have two primary options: go pod by pod to find and fix the problem, much as developers of yore used to SSH into individual machines, or invest significant time and manual labor in automating that process. Shoreline provides a real-time debugging and automated repair tool that lets developers efficiently find and debug errors across thousands of pods, then build automation so that those errors are fixed automatically in the future.
As Kubernetes has gained popularity, more and more tooling has been developed to create and manage production-level Kubernetes clusters and extend their functionality. However, these tools often focus on preventive measures within a specific area of expertise. Deployment tools like Helm create an IaC (Infrastructure as Code) style environment in which deploying and managing applications on Kubernetes is repeatable and debuggable. Visualization tools like Lens and K9s increase the observability of your cluster.
However, none of these focus on incident management and resolution. Shoreline targets exactly that, the "Day 2" operations, by solving two problems.
The first problem Shoreline solves is creating an intuitive, powerful, Kubernetes-native environment that lets developers monitor and maintain specific pods in aggregate to diagnose and resolve issues.
The second is enabling developers to easily create automation, so that issues that reappear or require ongoing maintenance are handled without human intervention.
Shoreline had to solve several hard technical challenges. Five in particular give Shoreline its uniqueness.
1. Data has to be trusted. Existing tools like Splunk collect data through batch processing, which introduces time delays and duplicate logs. By collecting data in real time, Shoreline avoids these problems and provides more value. Shoreline leverages the Prometheus exporter ecosystem, which is both open source and best in class, combined with custom integration code to get accurate, per-second metrics with no lag.
2. Queries and automated repairs have to be easy to define. Shoreline does this primarily through its Op language, which is covered below.
3. It needs to be fast. With large Kubernetes clusters containing thousands of nodes and even more pods, Shoreline has been designed to turn automations into a distributed execution graph that runs in parallel.
4. It needs to work even when other things are broken. Shoreline has been designed from the ground up to be fault-tolerant, and it can even operate locally when connectivity is lost.
5. It needs to be safe. Shoreline provides controls to limit the scope of both manual and automated commands.
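To make the parallel-execution idea in point 3 concrete, here is a minimal Python sketch, not Shoreline's actual implementation, of fanning a diagnostic out across many pods at once and aggregating the results. The pod names, the fake metric, and the threshold are all illustrative assumptions; in practice the pod list would come from the Kubernetes API and the metric from a real command's output.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pod inventory; in a real cluster this would come from the
# Kubernetes API (e.g. a label selector), not a hard-coded list.
PODS = ["app-0", "app-1", "app-2", "app-3"]

def run_diagnostic(pod: str) -> tuple[str, float]:
    """Stand-in for running a command on one pod and parsing out a metric.
    Here we simply derive a fake CPU percentage from the pod's index."""
    cpu_percent = 10.0 * (int(pod.split("-")[1]) + 1)
    return pod, cpu_percent

# Fan out across all pods in parallel rather than visiting them one by one.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(run_diagnostic, PODS))

# Aggregate: flag any pod whose metric crosses a threshold.
hot_pods = [pod for pod, cpu in results.items() if cpu > 30.0]
print(hot_pods)  # → ['app-3']
```

The point of the pattern is that adding pods widens the fan-out rather than lengthening a serial loop, which is what makes per-second diagnosis feasible at thousand-pod scale.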
Shoreline has created a DSL (domain-specific language) called Op. Op fluently integrates three primitives: resources, Linux commands, and metrics.
Resources are simply the pods you want to act upon; they can be defined by namespaces, regular expressions, or Kubernetes tags. Linux commands are exactly that: top, kubectl, netstat, or any other command you might find useful in debugging a pod. Metrics aggregate the outputs of many pods into a single number, which can be as simple as averaging numeric outputs or something more exotic.

A fairly simple example that many enterprises may find useful is selecting all application pods by tag, running top on them, stripping the output, and checking whether the NodeJS process is consuming more than 90% CPU. Though simple, this example gives you an idea of how Ops can be used. By building on familiar Linux commands, the Op DSL is intuitive yet powerful. Once you have constructed a query, you can trigger a "bot" when that query's condition is met. The bot then executes an action, also defined in Op, to fix the issue. All of this together creates a simple and powerful remediation loop that checks for issues, collects diagnostics, and automatically applies repairs. This lets Ops teams meaningfully eliminate repetitive incidents: once an issue is fixed, it can stay fixed.
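The remediation loop above can be sketched in plain Python. To be clear, this is NOT Shoreline's Op language, just an illustration of the pattern it expresses: select resources by tag, compute a metric per pod, and have a "bot" fire a repair action when the condition is met. The pod dictionaries, the `nodejs_cpu` field, and the restart action are invented for the example.

```python
CPU_THRESHOLD = 90.0  # trigger when the NodeJS process exceeds 90% CPU

def select_pods(pods, tag):
    """Resources primitive: pick the pods carrying a given tag."""
    return [p for p in pods if tag in p["tags"]]

def cpu_metric(pod):
    """Metric primitive: stand-in for parsing `top` output on one pod."""
    return pod["nodejs_cpu"]

def restart_process(pod):
    """Action: a stand-in repair, e.g. restarting the offending process."""
    pod["nodejs_cpu"] = 5.0
    return f"restarted nodejs on {pod['name']}"

def remediation_loop(pods):
    """Bot: when the query condition fires on a pod, apply the repair."""
    actions = []
    for pod in select_pods(pods, "application"):
        if cpu_metric(pod) > CPU_THRESHOLD:
            actions.append(restart_process(pod))
    return actions

pods = [
    {"name": "web-0", "tags": ["application"], "nodejs_cpu": 95.0},
    {"name": "web-1", "tags": ["application"], "nodejs_cpu": 40.0},
]
print(remediation_loop(pods))  # → ['restarted nodejs on web-0']
```

Because the query, metric, and action are each separate pieces, the same loop can be re-pointed at a different tag, threshold, or repair without rewriting the whole thing, which is the flexibility the Op primitives are aiming for.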
Shoreline is continuing to grow and develop features. Their goal is to radically increase system availability and reduce operations toil through incident automation. They currently support AWS, with Azure and GCP support to come, and they have a large library of "Op Packs" that solve commonplace issues seen by Kubernetes operators.
All things break. The question is how quickly you can get back to normal. For those looking for an integrated solution that lets you quickly restore critical functions at scale, look no further than Shoreline. If you're an up-and-coming SaaS company whose ticket pain is growing, and growing fast, take a look at Shoreline at shoreline.io. Shoreline is currently in early GA, so they can partner with your company, take in feedback, and design a product with you that reduces your ops pain.
This sponsored article is based on an interview with Anurag Gupta, Founder and CEO of Shoreline. If you are interested in the full interview, you can find it here.