Databases: Fundamental Answers

Databases Week began with a set of fundamental questions.

What is a database?

One definition: “an application component for storing and retrieving data”. All of the different databases companies have this functionality.

But similarities end there.

  • RethinkDB pushes data to the application
  • MemSQL is a faster, proprietary version of MySQL
  • VoltDB wants to replace your disintegrated ZooKeeper-Kafka-Storm-Cassandra data pipeline
  • PipelineDB augments PostgreSQL with continuous views
  • InfluxDB’s event data model focuses entirely on time-series data
  • Neo4j represents the database as a graph, allowing algorithmic queries like Dijkstra and Ford-Fulkerson
  • IPFS leverages blockchain for distributed file storage

A more generous definition of a database could include any of the above functionality. But then–what is an application?

Why is functionality moving from the application to the database?

The scope of responsibilities for a programmer keeps increasing.

To counterbalance this, we notice recurrences over time and apply compaction to take advantage of that repetition:

  • Design patterns codify loose architectural themes (examples: MVC, CQRS)
  • Developer tools bundle up and tightly couple recurrent functionality that was previously a series of ad hoc processes (examples: SVN, Jenkins CI)

New databases fit in the second camp.

Paul Dix from InfluxDB implemented time-series databases a few times prior to starting a company around the idea.

Slava Akhmechet from RethinkDB encountered the broken request/response model enough times to realize people were repeatedly solving this problem.

New databases are data-centric developer tools.

What are the database types desired by forward-thinking platforms like Meteor?

Consumer webapps are real-time and don’t want the request/response paradigm. Meteor rejects the request/response paradigm in favor of caching layers and database watchers.

RethinkDB sings with Meteor. Slava from RethinkDB writes:

Meteor, Volt, and webalchemy frameworks run on top of databases, so they’re ultimately constrained by the realtime functionality and scalability of existing database systems. We’ve been collaborating with the Meteor team to ensure our design will work well with these and other similar projects.

But this says nothing about data driven backends, which are no less “forward-thinking” platforms than Meteor.

VoltDB has push functionality as well.

Do we need new databases to support the growth of data science?

DevOps engineers and data scientists translate data from a dashboard or Tableau into value by proposing business actions inducted from that data.

The more integrity that data has, the more value the data scientists produce.

Data integrity is a function of the speed of that data pipeline and the structure of it. The presentation layer (dashboard or Tableau) can render the data with more fidelity if the data has better in-database shape.

The databases covered this week exemplified the above in several ways.

  • providing new data models (event data in InfluxDB)
  • abstract away the whole data pipeline into the database (VoltDB)
  • expand existing database functionality to accommodate dashboards (PipelineDB’s continuous views)
  • pushing received changes from the database to the view layer (RethinkDB)
  • being fast (MemSQL)

Is onboarding or migrating to a new database always something scary that has a risk of large technical debt?

Regarding databases, technical debt seems closely related to lock-in.

Databases from this week fell into two* camps:

  • augmentative functionality: InfluxDB doesn’t usually replace anything, it adds a new data lake to wire plumbing to
  • smooth, compatible replacement: PipelineDB offers a superset of PostgreSQL functionality; MemSQL is fully over-the-wire compliant with MySQL

The augmentative systems have lock-in only to the degree that their added functionality becomes something unique which people build off of.

The replacement systems have lock-in to the degree that they often provide obvious business functionality, so in some sense they pay for themselves.

If these systems begin to provide value, that value-add is probably going to be obvious.

*ignoring IPFS

How does the movement from batch to streaming affect database architecture?

From the MemSQL blog:

While MemSQL and Hadoop are both data stores, they fill different roles in the data processing and analytics stack. The Hadoop Distributed File System (HDFS) enables businesses to store large volumes of immutable data, but by design, it is used almost exclusively for batch processing. Moreover, newer execution frameworks, that are faster and storage agonistic, are challenging MapReduce as businesses’ batch processing interface of choice.

Lambda Architecture

A number of MemSQL customers have implemented systems using the Lambda Architecture (LA). LA is a common design pattern for stream-based workloads where the hot, recent data requires fast updates and analytics, while also maintaining long-term history on cheaper storage. Using MemSQL as the real-time path and HDFS as the historical path has been a winning combination for many companies. MemSQL serves as a real-time analytics serving layer, ingesting and processing millions of streaming data points a second. MemSQL gives analysts immediate access to operational data via SQL. Long-term analytics and longer running, batch-oriented workflows are pushed to Hadoop.

The VoltDB blog discusses the implementation of a batch + streaming data pipeline.

Are databases only ever written in highly performant languages like Rust, Go, and C++?

Garbage collection is both a blessing and a curse.

InfluxDB is written in Go, and leverages GC. RethinkDB is written in C++ partially to avoid GC. Neo4j is written in Java.

Slava from RethinkDB mentioned Rust as an interesting language for writing databases. Rust provides memory safety without a garbage collector.

What are the biggest database pain points for high throughput, data driven application developers?

VoltDB’s John Hugg writes about the multiplicity of problems stemming from the distributed systems:

A recent talk from a Twitter engineer described a problem his team solves with Storm. They count unique users of mobile apps for given time periods. The tricky part is the volume: reportedly 800,000 messages per second, making this a poor fit for more traditional systems. The stack they use involves Kafka, Storm, Cassandra, and of course, ZooKeeper.

We can assume ZooKeeper, Kafka, Storm and Cassandra all use at least three nodes each to run with reasonable safety. Three is a bit of a magic number in distributed systems; two node clusters have a harder time agreeing on the state of the cluster in failure scenarios. So now we’re operating and monitoring four systems, at least twelve nodes, and the interops/glue between all of the above. If a network between two Cassandra nodes fails, are the symptoms the same as if the network between Storm nodes failed, or ZooKeeper nodes failed? Each of these systems has different failure semantics, and can cause different symptoms or cascading failures in other parts of the stack.

While four systems wouldn’t be odd in an HDFS-based batch system, the crucial difference here is that user data ownership is passed between systems, rather than being wholly the responsibility of HDFS. Sometimes state only has meaning if data from multiple systems are combined. Development and testing stacks with four systems isn’t 33% harder than stacks with three systems either; it can be as much as four times as hard, depending on how the systems connect with each other.