Software Engineering Best Practices Applied to Data

Sherlock Holmes once said, “It is a capital mistake to theorize before one has data.” Though the adage comes from a fictional character of the late 1800s, it rings true today. Companies are increasingly leveraging data to make and guide decisions, and rightly so: data-driven decisions tend to be more consistent, more reliable, and more accurate. As data becomes more integral to decision making, data management becomes a critical competency. The evolution of the data team is a signal that businesses are placing more and more importance on data management strategies and on scaling data usage.

The modern data team has evolved beyond analysts leveraging Excel to create fancy dashboards for executives, but it is not just a group of software engineers either. It contains a range of increasingly specialized individuals across the technical spectrum, and it now sits alongside other business teams to enable more effective decision making and more efficient products. Technical roles like data engineers and software engineers are needed to manage a data warehouse or data lake at scale. Data scientists and analysts derive insights from the data collected and present them in dashboards and charts that drive smarter decisions. Business analysts translate business needs into data requirements.

6 Functions of a Data Team: Sisense

However, the demands and requirements of a data team are still rapidly evolving, and a variety of models have emerged for differing circumstances. Organizationally, a company can choose between a centralized, federated, or hybrid model. Technologically, specific tools and stacks can change the data team’s composition and purpose. When it comes to customer data, vertically integrated solutions aim to enable non-technical users and reduce the need for dedicated data engineers or for borrowing software engineering time to support data management. Developer-friendly tooling like RudderStack, by contrast, enables a more fully featured data platform. This may require more development resources, especially during initial implementation, but a robust solution owned by engineering unlocks modern use cases and can scale as requirements become more sophisticated. For a more detailed explanation of why IT and Engineering should own the data platform, check out this article by RudderStack.

Rudderstack Data Platform Diagram: RudderStack

As the industry matures and companies become more proficient with data, they naturally migrate to more complex architectures. Rather than using one solution to meet all their needs, they construct a data infrastructure by combining several different products. This can be seen in the proliferation of new tooling filling specific data niches. Rather than a product being described simply as “data analytics software,” there are now data science platforms, data visualization platforms, data monitoring software, and data loaders.

Data Infrastructure: BSP

This segmentation of the data stack mirrors how software engineering matured. Originally, everything was written in HTML, CSS, and JS. Then full-stack web applications became popular and the LAMP stack became dominant. Now software engineering is so diverse that we have frameworks for everything from queuing software for distributed applications to serverless platforms for cloud-native apps to an ever-growing list of front-end frameworks. A similar thing is happening in the data industry: legacy players provided all-in-one solutions, while new entrants focus on differentiation and interoperability.

Components of the Customer Data Stack: The Future of Customer Data Platforms: To Bundle or Not 

While there is room to debate the “unbundling” of the data stack, the conversation itself is proof that the industry is maturing. 

As data stacks grow in complexity, maintaining uptime and reliability becomes key, and the data industry is moving to meet this need. Data monitoring and data quality software is now available to ensure that data quality can be maintained and that organizations can respond quickly if a problem arises. Common practices from software engineering like CI/CD, Infrastructure as Code (IaC), and reusable components are making their way to data teams and infrastructure. Data transformations and ETL pipelines can be written in code, versioned, and reused with tools like RudderStack, improving reliability and making these pipelines easier to test and maintain.
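To make the idea concrete, here is a minimal sketch of what an event transformation written as ordinary, versioned code might look like. This is an illustrative example, not any particular tool’s API; the function name and event fields are assumptions for the sketch:

```python
# A minimal sketch of a data transformation kept in version control.
# The event schema and field names here are hypothetical examples.
from typing import Optional


def scrub_event(event: dict) -> Optional[dict]:
    """Drop internal test traffic and normalize the email field.

    Returning None signals that the event should be filtered out,
    a common convention in transformation pipelines.
    """
    if event.get("userId", "").startswith("test_"):
        return None  # filter out internal test users
    cleaned = dict(event)
    if "email" in cleaned:
        cleaned["email"] = cleaned["email"].strip().lower()
    return cleaned


# Because the transformation is plain code, it can be unit tested
# and code reviewed like any other software change.
assert scrub_event({"userId": "test_123"}) is None
assert scrub_event({"userId": "u1", "email": " Ada@Example.COM "})["email"] == "ada@example.com"
```

Checking a function like this into a repository is what brings CI/CD to the pipeline: every change to the transformation runs through the same tests and review process as application code.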

Data as code is a newer trend, modeled on Infrastructure as Code, that applies version control to data sets as they grow incrementally. This helps ensure data quality and enables data teams to quickly and efficiently resolve data poisoning issues by tracing and rolling back bad revisions. All of these trends show that the data discipline is maturing in step with the growing need to manage data at scale.
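One way to picture “data as code” is to pin each data set revision with a content hash, much as a Git commit pins a source tree. The sketch below is a simplified illustration under that assumption, not the API of any specific versioning tool:

```python
# Hypothetical sketch: fingerprint a data set revision so changes
# can be tracked and rolled back, in the spirit of "data as code".
import hashlib
import json


def snapshot_hash(records: list) -> str:
    """Compute a deterministic fingerprint for a data set revision.

    Serializing with sorted keys makes the hash stable regardless of
    dict key order, so identical data always yields the same version id.
    """
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


v1 = snapshot_hash([{"id": 1, "value": 10}])
v2 = snapshot_hash([{"id": 1, "value": 10}, {"id": 2, "value": 20}])

# Any change to the data yields a new version id, so a bad batch
# (e.g. poisoned records) can be pinpointed and rolled back by hash.
assert v1 != v2
assert v1 == snapshot_hash([{"value": 10, "id": 1}])  # key order does not matter
```

Real data versioning systems add storage, branching, and lineage on top, but the core idea is the same: every incremental change to the data gets an identifiable, reversible revision.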

Organizing complex data teams, developing more complex data infrastructure, and increasing the reliability of data platforms are all hard challenges, but the fact that these challenges are arising is a good thing. It is a sign that the data space is maturing and thriving. The industry has solved many of its initial problems and is now moving on to new ones that unlock more complex use cases. As the industry progresses, we expect to see an acceleration in the adoption and adaptation of software engineering principles.

 

Ashvin Nihalani

San Francisco, CA
Education: B. Eng, EECS, University of California

Originally from Texas. Graduated from Berkeley with a B.Eng in EECS. Interested in basically anything, or at least anything interesting. More recently focused on Machine Learning, Blockchain, and Embedded Systems.