Episode Summary for Data Mechanics: Data Engineering with Jean-Yves Stephan
Apache Spark is a unified analytics engine for large-scale data processing. Spark applications play a crucial role in computing environments such as data warehouses, and Spark holds a central place as the general framework for big data and distributed computing. It is a general-purpose tool for data science and data engineering workflows; by volume, its main use cases are Spark Streaming and data engineering procedures such as extract, transform, and load (ETL). Spark has been very popular in the industry since its 1.0 release in May 2014. More recently, it has also become possible to run Spark on top of Kubernetes, which allows a standard architecture for big data workflows. Processing data sets over 100 gigabytes is the main use case that makes it such a useful tool for data science.
That brings us to the question: what are the main differences between the applications of Spark and those of data warehouses? Data warehouses generally let users interact with data using only SQL, and the main problem with this approach is that it becomes hard to manage. Spark, on the other hand, gives engineers the flexibility they strive for.
A data lake is a centralized storage repository that holds a massive amount of structured and unstructured data. Using Spark on a data lake provides the flexibility of a programming language, with support for Java, Scala, Python, and R. When business logic is somewhat complex, Spark gives you additional flexibility, especially once you start managing a lot of ETL jobs and do not want to implement everything in SQL, because past a certain point SQL queries become hard to manage. Spark lets users create functions and modularize their code, and that is one of the main reasons why people choose Spark over data warehouses. In terms of cost-effectiveness, no one would want to store all their data in a data warehouse; instead, one would choose Amazon Simple Storage Service (Amazon S3), Google Cloud Storage (GCS), or Microsoft Azure Data Lake. A data warehouse can still be used for low-latency use cases, for instance when you want to store not the entire data set there but a subset of pre-aggregated data. In some applications both a data warehouse and a data lake make sense, such as Spark plus Snowflake or BigQuery.
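The modularity argument can be illustrated with a minimal sketch. It is not taken from the interview: plain Python dictionaries stand in for a Spark DataFrame so the example runs without a cluster, and the step names (`drop_invalid`, `add_revenue`, `run_pipeline`) and the sample orders are hypothetical.

```python
from functools import reduce
from typing import Callable, Dict, List

Row = Dict[str, float]
Step = Callable[[List[Row]], List[Row]]


def drop_invalid(rows: List[Row]) -> List[Row]:
    """Keep only rows with a positive quantity."""
    return [r for r in rows if r["quantity"] > 0]


def add_revenue(rows: List[Row]) -> List[Row]:
    """Derive revenue = quantity * unit_price without mutating the input."""
    return [{**r, "revenue": r["quantity"] * r["unit_price"]} for r in rows]


def run_pipeline(rows: List[Row], steps: List[Step]) -> List[Row]:
    """Apply each step in order; every step can be tested and reused alone."""
    return reduce(lambda acc, step: step(acc), steps, rows)


# Hypothetical sample data standing in for a table in the data lake.
orders = [
    {"quantity": 2, "unit_price": 5.0},
    {"quantity": 0, "unit_price": 9.0},
    {"quantity": 1, "unit_price": 3.5},
]
result = run_pipeline(orders, [drop_invalid, add_revenue])
print(result)
```

Each step is a small named function rather than a clause buried in one long SQL statement, so it can be unit-tested and shared across ETL jobs. In PySpark, the same shape is commonly expressed by chaining functions that take and return a DataFrame.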
Databricks is widely known as a Spark company. This raises the question: why is there a need for another company built around Spark, like Data Mechanics? Firstly, the Spark ecosystem is getting bigger. Secondly, the two companies serve different needs. In general, Databricks is great for heavy data science use cases, since Databricks is not only Spark infrastructure: it also sells development environments with hosted notebooks and hosted job schedulers. Data Mechanics, on the other hand, sits on the data engineering side and focuses on workflows that are easy to develop and cost-effective. One of the most important things that differentiates the two is that Data Mechanics' solutions run on top of Kubernetes, while Databricks' do not. Spark's support for Kubernetes was the big technological change that allowed Data Mechanics to start its business, even though it competes with big players in the market. It can therefore be said that the average user of Data Mechanics is more technical than the average user of Databricks. Data Mechanics is the Spark backend for its customers and provides a tool of choice for data engineering. It offers flexibility together with low cost and low latency.
The starting point of Databricks comes from data engineering use cases, such as people or companies that manage their pipelines properly. What these users are usually not experts at, however, is managing Spark, and that is what Data Mechanics provides. It can be called a sub-segment of the market. Data Mechanics is tiny compared to Databricks, and focusing on a particular sub-segment is fundamentally the only way a startup can compete with such big players. Data Mechanics has customers all over the world, including in the US and Europe, most of which are startups and mid-market companies. It does not target enterprise companies yet. Many people are not interested in Databricks because they do not like the development environment it sells, finding it very difficult to manage. As a result, these companies end up with Amazon EMR or Google Dataproc. However, these platforms are also pretty painful to manage.
Data Mechanics offers a Spark platform without the bells and whistles of the Databricks platform. The Data Mechanics system runs Spark pipelines on a schedule. After each run it processes the logs, and if it flags a memory issue, an infrastructure problem, or a lack of parallelism, it automatically tunes some Spark configurations for the next run of the pipeline, such as the size of the containers, shuffle settings, and parallelism. This has a big impact on stability and performance, and it is something Databricks does not have. Similarly, because Data Mechanics deploys Spark on top of Kubernetes, users control the overall image, which makes it much more flexible than the Databricks platform. Of course, the scope of Data Mechanics is not as broad as that of Databricks: it does not have hosted notebooks or MLflow integration, since its main purpose is easy and cost-effective serverless integration.
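The interview does not describe the tuning logic in detail, so the following is only a rule-based sketch of the general idea: read signals from the previous run and adjust a few settings for the next one. The `RunMetrics` fields, the thresholds, and the growth factors are assumptions made for illustration; `spark.executor.memory` and `spark.sql.shuffle.partitions` are standard Spark configuration keys.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RunMetrics:
    """Hypothetical signals extracted from the logs of the previous run."""
    executor_oom: bool             # an executor was killed for exceeding memory
    peak_core_utilization: float   # fraction of cores busy at peak, 0.0 to 1.0


def tune_next_run(conf: Dict[str, str], metrics: RunMetrics) -> Dict[str, str]:
    """Return a new config for the next run; the input config is not modified."""
    tuned = dict(conf)
    if metrics.executor_oom:
        # Assumed rule: after an out-of-memory failure, grow executor memory
        # by 50%. Memory is assumed to be written in whole gigabytes, e.g. "4g".
        gigabytes = int(conf["spark.executor.memory"].rstrip("g"))
        tuned["spark.executor.memory"] = f"{int(gigabytes * 1.5)}g"
    if metrics.peak_core_utilization < 0.5:
        # Assumed rule: if most cores sat idle, double the number of shuffle
        # partitions so that more tasks can run in parallel.
        partitions = int(conf["spark.sql.shuffle.partitions"])
        tuned["spark.sql.shuffle.partitions"] = str(partitions * 2)
    return tuned


conf = {"spark.executor.memory": "4g", "spark.sql.shuffle.partitions": "200"}
next_conf = tune_next_run(
    conf, RunMetrics(executor_oom=True, peak_core_utilization=0.3)
)
print(next_conf)
```

A real system would draw on many more signals and would need guardrails, such as upper bounds on memory, but the feedback loop from one run to the next has this general shape.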
Spark infrastructure is often not used efficiently, for various reasons. For the average Spark application, 80% of the computing infrastructure is idle and not used by Spark. Hence, 80% of the company's bill pays for servers that sit idle. Data Mechanics applies operations research tools to make the system highly optimized for Spark. Automation, autoscaling, and autotuning help reduce its customers' costs by around 50% to 75%. Cost savings are therefore possible while using Spark, thanks to these advanced optimization techniques.
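The two figures above can be related with simple arithmetic. The sketch below assumes that the amount of useful work stays the same and that all savings come from removing idle capacity; both are simplifying assumptions rather than claims from the interview, and the function name is hypothetical.

```python
def utilization_after_savings(current_utilization: float, cost_reduction: float) -> float:
    """Utilization implied by shrinking the bill while doing the same work."""
    useful_spend = current_utilization   # share of the old bill doing real work
    new_bill = 1.0 - cost_reduction      # new bill as a share of the old bill
    return useful_spend / new_bill


# Start from 20% utilization (80% idle) and apply 50% and 75% savings.
for reduction in (0.50, 0.75):
    implied = utilization_after_savings(0.20, reduction)
    print(f"cost reduction {reduction:.0%} -> utilization {implied:.0%}")
```

Under these assumptions, a 50% saving corresponds to utilization rising from 20% to 40%, and a 75% saving to 80%, which shows how much headroom an 80% idle rate leaves.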
Most newcomers to Data Mechanics are already using Spark: two-thirds of customers already use it for some apps, while the last third start from scratch. The Data Mechanics platform is available on Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. Customers of Data Mechanics must therefore be on the cloud; they cannot deploy the system to on-premises infrastructure. If a company is migrating from Amazon EMR, Data Mechanics runs a two-to-three-week proof of concept in which the platform is deployed in the company's own AWS account. Data Mechanics helps with the migration as well. Since Spark is used on both sides, the migration is handled without much hassle, and the Data Mechanics platform automatically tunes some configurations. So in two to three weeks, a company can go from zero to a couple of pipelines running in production.
Data Mechanics' infrastructure is built entirely on Kubernetes. Until 2018, people could not deploy Spark on top of Kubernetes. Spark 2.3 was the initial release with Kubernetes support, and many improvements have followed since. The biggest difference from Databricks, or from competitors built on YARN, is that users have a single Kubernetes cluster on which they can run many apps with different Spark versions or Python versions, each with its own dependencies. Hence, customers stop thinking in terms of clusters and simply submit Dockerized applications. The remaining job of Data Mechanics is to scale automatically and to bin-pack containers onto the nodes efficiently, which increases infrastructure efficiency and decreases customers' costs.
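Bin packing, in general terms, means fitting containers onto as few nodes as possible. The sketch below uses first-fit decreasing, a classic textbook heuristic chosen here for illustration; it is not a description of what Data Mechanics or the Kubernetes scheduler actually does. The function name and the CPU figures are made up for the example.

```python
from typing import List


def first_fit_decreasing(container_cpus: List[float], node_cpus: float) -> List[List[float]]:
    """Place containers on equally sized nodes, largest containers first."""
    nodes: List[List[float]] = []
    for cpu in sorted(container_cpus, reverse=True):
        if cpu > node_cpus:
            raise ValueError(f"container needing {cpu} CPUs exceeds node size")
        for node in nodes:
            # Put the container on the first node that still has room.
            if sum(node) + cpu <= node_cpus:
                node.append(cpu)
                break
        else:
            # No existing node can host it, so start a new node.
            nodes.append([cpu])
    return nodes


# Six hypothetical Spark containers, each requesting some CPUs, on 8-CPU nodes.
placement = first_fit_decreasing([4, 2, 2, 1, 3, 4], node_cpus=8)
print(placement)
```

Fewer, fuller nodes mean fewer idle CPUs on the bill. A production scheduler would also have to consider memory, pod affinity, and spot interruptions, which this sketch ignores.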
The biggest challenge for Data Mechanics was getting Spark and Kubernetes to perform well together, which means, for instance, paying attention to Kubernetes versions, spot nodes, and the differences between clouds. The engineers at Data Mechanics do not manage just their own system; they manage it for many customers, and monitoring all of those deployments is challenging. Customers mostly use Data Mechanics for data engineering and machine learning applications.
Customers do not even need experience with Kubernetes to make this happen. Once the cluster is deployed, the Data Mechanics service starts running. Algorithms optimize how it runs, and compared to Databricks it is a much better developer experience. However, the working principle of Data Mechanics is not just a machine learning algorithm. Instead of a single automated recommendation, its engineers use approaches such as Bayesian algorithms or heuristics. Accordingly, Data Mechanics does not tune hundreds of configurations; it is more a list of about 5 to 10.
In the last couple of years, Spark has evolved very quickly and improved enormously. More generally, the software engineering world has evolved a lot, with DevOps, Kubernetes, Docker containers, and CI/CD, while the data side has stayed much the same for the last ten years. Data Mechanics wants to help this transition by bringing DevOps best practices to data engineering work. People would like to use Spark with Kubernetes, so Data Mechanics is in a good position in the market. With this kind of advancement, we can say goodbye to inefficiencies in the cloud. Data Mechanics is the simplest way to run Apache Spark.
This summary is based on an interview with Jean-Yves Stephan, Co-Founder and CEO at Data Mechanics.