What are the differences between Druid and AWS Redshift?

From Eric Tschetter’s answer via Quora:

The difference you are asking about, though, is ParAccel vs. Druid.  ParAccel is the software that Amazon licenses for Redshift.

Aside from potential differences in performance, there are some functional differences.  (These are all based on a cursory understanding of what ParAccel does: I’ve read what I could find on it, but a lot of my understanding is extracted from interpretations of marketing text, which can be a mixed bag.)

1) ParAccel is a full-on database with all kinds of SQL support, including things like joins and insert/update statements.  Druid is intended as an analytical data store: its write semantics aren’t as fluid, and it doesn’t do joins.

2) Data distribution model

ParAccel’s data distribution model is hash-based.  Expanding the size of your cluster requires re-hashing the data across the nodes, making it difficult to do without taking downtime.  From Amazon’s text, scaling up your Redshift cluster is actually a multi-step process:

a) set cluster into read-only mode
b) copy data from cluster to new cluster that exists in parallel
c) redirect traffic to new cluster

They do not indicate whether they are nice enough not to charge you for the extra machines consumed during the copy due to their own software’s limitations.  But even if you were scaling up to 100 of the big nodes and copying 20TB at 2GB/s across the cluster, that would only be about 3 hours or ~$2k, so they probably figure that said cost doesn’t really matter.
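The back-of-the-envelope math above is easy to check.  A sketch; the 2GB/s copy rate and the per-node hourly price are assumptions of this answer, not official AWS figures:

```python
# Rough check of the cluster-copy cost estimate from the text.
# Assumptions (from the answer, not published AWS pricing):
#   20 TB of data, 2 GB/s aggregate copy throughput, 100 big nodes.

data_tb = 20
copy_rate_gb_per_s = 2

copy_seconds = (data_tb * 1024) / copy_rate_gb_per_s  # ~10,240 s
copy_hours = copy_seconds / 3600                      # ~2.8 hours

# Hypothetical per-node hourly rate, chosen only so that
# 100 nodes * ~3 hours lands near the ~$2k figure in the text.
nodes = 100
hourly_rate = 6.80  # assumed, for illustration only
extra_cost = nodes * copy_hours * hourly_rate

print(f"copy time: {copy_hours:.1f} hours, extra cost: ${extra_cost:,.0f}")
```

So the ~3 hours is really about 2.8 hours of double-billed machines, consistent with the ~$2k figure.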

Druid’s data distribution, on the other hand, is based on segments that already exist on some sort of highly available “deep” storage, like S3.  You can lose all of your compute nodes and still reload everything (as long as your deep storage is still there).
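The recovery property described above can be sketched in a few lines.  This is an illustration of the idea, not Druid’s actual coordinator logic; all names here are hypothetical:

```python
# Minimal sketch of segment-based recovery: segments live permanently in
# "deep" storage (e.g. S3); compute nodes hold only cached copies.

deep_storage = {"seg-2013-01", "seg-2013-02", "seg-2013-03"}  # durable

def rebuild_cluster(nodes):
    """Assign every segment in deep storage across a fresh set of nodes."""
    assignment = {node: set() for node in nodes}
    for i, segment in enumerate(sorted(deep_storage)):
        node = nodes[i % len(nodes)]   # simple round-robin placement
        assignment[node].add(segment)  # node would download from deep storage
    return assignment

# Lose every compute node, bring up new ones: nothing is lost, because
# all segments can be reloaded from deep storage.
new_cluster = rebuild_cluster(["node-a", "node-b"])
loaded = set().union(*new_cluster.values())
assert loaded == deep_storage
```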

3) Replication strategy

ParAccel’s hash-based distribution also generally means that the replication strategy is necessarily via hot spares.  From what I can tell, when you lose one node in ParAccel, you are covered by a hot spare that can come in and take its place, but when you lose that node as well, I’m not sure if there is some mechanism to protect you from losing data.  They probably re-load from a backup on S3 or something, which does significantly mitigate your risk.  Aside from that, though, a hot-spare replication strategy often doesn’t lend itself to serving read queries from the spare copy, meaning that you only have one node serving the data at any one point in time, which can become a hotspot/bottleneck.  Allowing reads on all of the replicas at the same time greatly complicates mutations; they might have that base covered, but I do not know.

Druid’s distribution is at the segment level, meaning that you can add more nodes and have the data rebalance without doing a staged swap.  The replication strategy also makes all replicas available for querying, so if you have a base replication factor of 2, you have two machines serving read queries against that segment.  Druid doesn’t implement this yet, but hopefully by the end of the year we will be automatically adjusting replication factors of specific segments based on demand for the segment (i.e. dynamically responding to hotspots by adding replicas).  There’s nothing stopping different segments from being replicated at different levels; the thing left to implement is the communication mechanism to figure out when a segment should be scaled up/down.
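The demand-driven replication idea above might look something like this.  A hedged sketch: the thresholds, names, and policy are all illustrative, since (as noted) the real mechanism isn’t implemented yet:

```python
# Sketch of per-segment replication adjustment: raise the replication
# factor of hot segments, lower it for cold ones. All parameters are
# hypothetical, for illustration only.

def adjust_replication(query_counts, base_factor=2, hot=1000, cold=10,
                       max_factor=5):
    """Return a desired replication factor per segment from query load."""
    factors = {}
    for segment, count in query_counts.items():
        if count >= hot:
            factors[segment] = min(base_factor + 1, max_factor)  # hotspot
        elif count <= cold:
            factors[segment] = max(base_factor - 1, 1)           # cold data
        else:
            factors[segment] = base_factor
    return factors

demand = {"seg-a": 5000, "seg-b": 200, "seg-c": 3}
print(adjust_replication(demand))
# seg-a gains a replica, seg-b keeps the base factor, seg-c drops to one
```

Because every replica serves reads, adding a replica to a hot segment directly adds read capacity for it.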

4) Indexing strategy

I’m not sure if they’ve added it, but last I looked at ParAccel they didn’t have indexing strategies in place for the data; instead they relied only on column orientation and brute force to process queries.  Indexing structures do increase the storage overhead of the data (and make it more difficult to allow for mutation), but they can also significantly speed up queries.  Druid uses indexing structures to speed up query execution when a filter is provided.
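The contrast with brute force can be made concrete with a tiny inverted index.  A sketch of the general technique only: Druid’s actual indexes are compressed bitmaps over dimension values, not Python sets:

```python
# Inverted (bitmap-style) index per dimension value: a filter selects
# matching rows directly instead of scanning the whole column.

rows = [
    {"country": "US", "clicks": 3},
    {"country": "UK", "clicks": 7},
    {"country": "US", "clicks": 2},
    {"country": "FR", "clicks": 9},
]

# Build the index once: dimension value -> set of row ids.
index = {}
for row_id, row in enumerate(rows):
    index.setdefault(row["country"], set()).add(row_id)

# A filtered aggregation now touches only the matching rows; a brute-force
# scan would have to test the filter against every row.
us_rows = index.get("US", set())
us_clicks = sum(rows[i]["clicks"] for i in us_rows)
print(us_clicks)  # 3 + 2 = 5
```

The trade-off mentioned above shows up here too: the index is extra storage, and every mutation to `rows` would have to update it.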

5) Pluggable query execution engine

I’m not sure to what degree ParAccel allows for UDFs, but if you wanted to, you can plug different query types into Druid and have it run some completely different set of functionality in a scatter-gather across the segments in your cluster.  Right now, the open source offering has timeseries queries, groupBy, “search”, timeBoundary and segmentMetadata.  Timeseries produces results that groupBy could also produce, but it is optimized for queries that just return a timeseries and don’t need to include any dimensions (it runs in half the time of the equivalent groupBy query).  segmentMetadata just looks at the various segments and reports back statistics like segment-local cardinalities of dimensions and the expected input data size assuming TSV input.  The mechanisms for extending this are completely pluggable (Metamarkets maintains some proprietary query extensions that support our dashboard product, implemented as modular plugins).
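The scatter-gather pattern these query types run on can be sketched simply.  Illustrative only; function names and the per-segment data shape are assumptions, not Druid’s API:

```python
# Scatter-gather: the query is "scattered" to each segment, partial
# results are computed locally, and a broker "gathers" and merges them.

segments = [
    {"2013-01-01": 10, "2013-01-02": 4},  # segment held by node 1
    {"2013-01-02": 6, "2013-01-03": 8},   # segment held by node 2
]

def scatter(segments, query_fn):
    """Run the query against each segment; one partial result per segment."""
    return [query_fn(seg) for seg in segments]

def gather(partials):
    """Merge partial timeseries results by summing values per timestamp."""
    merged = {}
    for partial in partials:
        for timestamp, value in partial.items():
            merged[timestamp] = merged.get(timestamp, 0) + value
    return merged

# A toy "timeseries" query: the per-segment work is the identity here;
# a real engine would aggregate raw rows inside each segment first.
result = gather(scatter(segments, lambda seg: dict(seg)))
print(result)  # {'2013-01-01': 10, '2013-01-02': 10, '2013-01-03': 8}
```

Plugging in a new query type amounts to supplying a different per-segment function and a matching merge step.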

Similarly, Druid allows for implementing your own storage types.  None of these are included in the open source offering yet, but one example of something we maintain at Metamarkets for our dashboard product is a HyperLogLog column type that can be used for approximating unique counts.
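To give a feel for what such a column type does, here is a toy HyperLogLog.  This is the textbook algorithm, not the (proprietary) Metamarkets implementation; register count and hash choice are illustrative:

```python
import hashlib

P = 10                              # 2^10 = 1024 registers
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)    # bias-correction constant

def _hash(value):
    """Deterministic 64-bit hash of a value."""
    digest = hashlib.sha1(str(value).encode()).digest()[:8]
    return int.from_bytes(digest, "big")

def hll_add(registers, value):
    h = _hash(value)
    idx = h & (M - 1)                        # low P bits pick a register
    rest = h >> P                            # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1  # leading-zero run length + 1
    registers[idx] = max(registers[idx], rank)

def hll_count(registers):
    """Raw cardinality estimate (harmonic mean of register values)."""
    return ALPHA * M * M / sum(2.0 ** -r for r in registers)

def hll_merge(a, b):
    """Union of two sketches is the register-wise max."""
    return [max(x, y) for x, y in zip(a, b)]

regs = [0] * M
for i in range(10000):
    hll_add(regs, i)
estimate = hll_count(regs)
print(round(estimate))  # close to 10000, within a few percent
```

The merge property is what makes it work as a column type: per-segment sketches can be combined in the scatter-gather merge step without ever shipping the raw values.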

I’m guessing that ParAccel only allows you to do things that are representable via SQL (or maybe even some PL/SQL-style “stored procedure” library).  Their website does indicate that they have an “Extensibility Framework” but I haven’t really read the details of what that does and does not allow you to do.

6) Real-time data ingestion

Druid supports loading and aggregating data in real time, separating the concerns of that load from the “historical” processing concerns of the “data warehouse.”  I do not know how ParAccel loads data, but I assume that for high-volume streams they would recommend doing a batch ingestion at a regular interval.
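The separation of concerns described above can be sketched as a real-time node that aggregates events in memory and periodically hands a sealed segment off to the historical tier.  Names and the handoff policy are illustrative assumptions, not Druid’s API:

```python
# Real-time ingestion sketch: aggregate at ingest time, then hand the
# sealed result off to the "historical" side of the system.

class RealtimeNode:
    def __init__(self):
        self.current = {}     # in-flight aggregates, keyed by dimension
        self.historical = []  # immutable segments handed off so far

    def ingest(self, key, value):
        # Aggregate at ingestion time rather than storing raw rows.
        self.current[key] = self.current.get(key, 0) + value

    def hand_off(self):
        # Seal the in-memory buffer into an immutable segment; the
        # historical tier now serves it, freeing the real-time node.
        self.historical.append(dict(self.current))
        self.current = {}

node = RealtimeNode()
for key, value in [("page_a", 1), ("page_b", 1), ("page_a", 1)]:
    node.ingest(key, value)
node.hand_off()
print(node.historical)  # [{'page_a': 2, 'page_b': 1}]
```

Queries can span both tiers, so freshly ingested data is visible without waiting for a batch load.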

So, in summary, the differences in functionality come down to the use case.  Druid was built as a front-office system to power an always-on SaaS offering.  Downtime is a naughty word in our world, and we’ve architected the system so that we never have to take downtime.  We’ve also made design decisions that favor operating the infrastructure as a service instead of optimizing for complex joins: things like allowing us to replicate more to handle hotspots, and scaling by adding/removing machines rather than staging a new parallel cluster.  ParAccel is built as a back-office system to power internal BI.  It offers more “database” functionality and integration with tools built on top of ODBC/JDBC+SQL.

Druid’s sweet spot is its ability to power a data-based product and scale out in the ways you need a service that is directly visible to your customers to scale.  ParAccel is a direct competitor to the likes of Vertica, Teradata, Greenplum, etc.  Redshift adds a lot of operational simplification to that equation, but it is too early to say whether those simplifications will make it a contender as a proper customer-facing service, or whether they are just tools that greatly simplify the management and operation of a back-office data warehouse.