Someone could write a 5000 page book on this subject but I’ll do my best at a high level.
I break these down into four basic groups: Traditional, MPP, Columnar, and an emerging technology called NewSQL, plus the broad family of NoSQL datastores covered further down.
Traditional
These are the usual databases that we’ve seen for years. Vendors include MySQL, PostgreSQL, SQL Server, Sybase, Oracle Database, etc. They comply with SQL standards and provide full ACID transactions. That is, you can do a lot of data transformation directly in the platform, and it is guaranteed that different people looking at the same data will see the same values. They are also relational, in that data in different tables is typically joined together. Most often these databases run on a single node or a few nodes in a grid (with some exceptions). They are generally used for business applications that process transactions at the atomic (lowest granularity) level, and when processing and storing data that involves something of physical value. For example, your company’s ERP and supply chain systems will use these databases for their data storage. They’re also configurable enough to serve as good data warehousing solutions when tuned, configured, and modeled correctly.
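To make the ACID point concrete, here is a minimal sketch using Python’s built-in sqlite3 module (a small relational engine). The accounts table and the transfer amounts are made up for illustration:

```python
import sqlite3

# Illustrative only: the table and values are invented; sqlite3 stands in
# for any ACID-compliant relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

try:
    with conn:  # one transaction: both updates commit together or not at all
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    pass  # on error the whole transfer rolls back automatically

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 70, 2: 80}
```

If either UPDATE failed, the `with conn:` block would roll the whole transfer back, so no reader ever sees money debited from one account but not credited to the other.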
MPP (Massively Parallel Processing)
Generally speaking, these provide most of the same features as traditional databases but scale to hundreds or thousands of nodes. Vendors in this area include Oracle Exadata, Teradata, EMC Greenplum, and IBM Netezza. These are commonly used within large enterprises as the backbone of their data warehouses. Some vendors also provide specialized approaches to MPP databases. For example, platforms like Vertica, Actian Matrix, and Sybase IQ provide column-oriented storage (see below) in an MPP solution. SAP HANA and EXASOL are examples of MPP databases that operate primarily in memory.
NewSQL
Again, very similar to MPP and Traditional in that they are ACID compliant and support SQL standards, except that these solutions tend to rely heavily or entirely on in-memory data storage, as opposed to the other solutions, which primarily use disk. They also tend to scale horizontally. The trade-off here is that NewSQL systems tend to store less data, since disks can hold more data than RAM. (In reality, even the MPP solutions tend to use a mixture of magnetic disk, solid state, and memory.) These tend to be very popular for OLTP solutions, as they can make changes to large data sets very quickly. Examples of NewSQL are Oracle TimesTen and VoltDB. In some cases, database vendors are blending these capabilities into their existing platform; Oracle Database 12c, for example, allows you to “pin” tables to memory.
Columnar
Basically like it sounds: data is stored in columns. If you think about how data is typically stored in a traditional SQL DB, you have a single row of data with different kinds of data types. Taking a user table as an example, you may have an id, a first name, a last name, address info, demographics, etc. When storing this data on disk, it’s hard to compress it in a consistent way. When data is stored in columns rather than rows, using the same user table example, all IDs are stored together, all names are stored together, and so on. This allows the database to greatly compress the data, since similar types of data are stored together, resulting in fewer physical disk reads when querying large volumes of records.
The trade-off is that the more columns you select, the slower your query becomes, as more data has to be read from disk. This is in contrast to traditional SQL, where adding more columns to a select query typically has only a small impact on performance. They are also not well suited for use cases involving high query concurrency, as a single query will use a large share of the available resources. They are also designed to be batch loaded, and DML operations are very slow, which makes them ill-suited for OLTP-type workloads. Examples of columnar databases are MonetDB, Sybase IQ, and Actian Vector. Additionally, most MPP and other database vendors provide a mechanism for columnar compression.
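The compression benefit of the column layout can be sketched in a few lines of Python. The user records are fabricated, and an in-memory JSON encoding stands in for a database’s actual on-disk format:

```python
import json
import zlib

# Hypothetical user rows; the point is the storage layout, not the data.
rows = [{"id": i, "first": "Ann", "last": "Lee", "city": "Austin"}
        for i in range(1000)]

# Row-oriented layout: records serialized one after another, mixing types.
row_bytes = json.dumps(rows).encode()

# Column-oriented layout: all values of each column stored together.
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_bytes = json.dumps(columns).encode()

# Similar values stored contiguously compress better.
print(len(zlib.compress(row_bytes)), len(zlib.compress(col_bytes)))
```

On this toy data the column layout compresses noticeably smaller, because the long runs of repeated names and cities collapse to almost nothing, which is the same effect a real columnar engine exploits to cut physical disk reads.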
NoSQL
There are so many of these that it’s hard to classify them generally, but I try to do it by describing five main “types”: Key-Value, Document, Graph, Search, and Time Series. The common denominator of NoSQL databases is that they are designed to scale horizontally across multiple servers using an indexed and sharded key, often governed by a “gateway” or routing node. Just about all NoSQL datastores are based on this concept and then apply some level of data structuring, indexing, or storage strategy to further provide specific advantages for certain use cases.
They are generally used when you need to be able to read and write a lot of data concurrently. This is different from traditional databases, which you can usually only tune to be good at one or the other (again, in very broad terms). It’s also important to note that “NoSQL” refers to the fact that these databases don’t use SQL as their query language; it isn’t intended as a rejection of traditional SQL.
You have to keep in mind that without SQL compliance, most of these databases can’t transform data within their engine. You have to extract it somewhere else, write some code, and then put the transformed data wherever you need it. Pretty much all NoSQL databases also lack ACID compliance and instead rely on “eventual consistency”. At a high level, this means that different people querying the exact same data may see different values. You also can’t join data together directly in these databases.
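The indexed-and-sharded-key idea above can be sketched as a toy router in Python. The shard count, key format, and “gateway” functions are all made-up illustrations, not any particular product’s API:

```python
import hashlib

# Toy sketch: a "gateway" routes each key to one of N shards by hashing it.
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(key: str) -> int:
    # Stable hash so the same key always lands on the same shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Ada"})
print(get("user:42"))  # {'name': 'Ada'}
```

A real datastore adds replication, rebalancing, and persistence on top, but the routing principle (hash the key to pick a node) is the same one the sections below all build on.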
Key-Value
These basically provide a very simple method of storing and retrieving binary data. For each record, you define its key and store a value. Usually people will create large programming objects, serialize them, and then store them. Examples of these are Oracle NoSQL Database, BerkeleyDB, Dynamo, and Riak. I also put Cassandra in this group, though in reality it provides more structure within each value than just an open binary store. These can be useful for storing something like a video game’s user state, such as your FarmVille farm (especially if that state is many megabytes or more).
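The serialize-and-store pattern above looks roughly like this in Python; the in-memory dict stands in for the key-value store, and the game state is invented:

```python
import json

# Toy sketch: a dict plays the role of the key-value store, and the farm
# state is a made-up example of a large application object.
kv_store = {}

farm_state = {"player": "p123", "crops": ["corn", "wheat"], "coins": 250}

# Serialize the object to bytes and store it under its key.
kv_store["farm:p123"] = json.dumps(farm_state).encode()

# Retrieval is a single lookup by key, then deserialize.
restored = json.loads(kv_store["farm:p123"])
print(restored["coins"])  # 250
```

The store itself knows nothing about the value’s contents; it only moves bytes by key, which is what lets these systems scale reads and writes so easily.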
Document
These are similar to Key-Value stores, except that instead of the value being a simple binary object, there is more structure behind the data. Most commonly data is stored in a JSON format, and a few use XML. Again, for each key, you retrieve the document value. Documents tend to be very large objects and are usually accessed one at a time. Examples of this are MongoDB, CouchDB, and BaseX. These are useful for storing data with some structure, such as legal documents or your entire online profile for a website.
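A document store differs from plain key-value in that the stored value has structure you can read fields out of. A minimal sketch, with a made-up profile document and a dict standing in for the store:

```python
import json

# Toy sketch: each key maps to a structured JSON document; the profile
# data is invented for illustration.
doc_store = {}

doc_store["user:ada"] = json.dumps({
    "name": "Ada",
    "addresses": [{"city": "London", "primary": True}],
    "preferences": {"newsletter": False},
})

# Documents are fetched one at a time by key, then fields read off the structure.
profile = json.loads(doc_store["user:ada"])
print(profile["addresses"][0]["city"])  # London
```

Unlike the opaque value in a key-value store, the nested fields here are what a real document database indexes and queries against.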
Graph
Somewhat different from the Key-Value and Document stores, these databases are designed to create “graphs” of data. This basically means you can define very complex relationships between different points of data. You can also easily traverse the data along those relationships, even if the data you want is many nodes away. Think about it as a social network, where you have friends, interests, check-ins, likes, etc., and you want to be able to define the relationships between these things and then query them. Examples include Neo4j and RDF triplestores queried with SPARQL.
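A toy sketch of the idea in Python: nodes with typed edges, and a breadth-first traversal that follows one relationship type across multiple hops. The people and edge types are invented:

```python
from collections import deque

# Toy graph: adjacency lists of (edge_type, target) pairs; all data invented.
edges = {
    "alice": [("friend", "bob"), ("likes", "jazz")],
    "bob": [("friend", "carol")],
    "carol": [("likes", "jazz")],
}

def traverse(start, edge_type, max_hops):
    # Collect nodes reachable via edges of one type within max_hops.
    seen, frontier = {start}, deque([(start, 0)])
    found = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for etype, target in edges.get(node, []):
            if etype == edge_type and target not in seen:
                seen.add(target)
                found.append(target)
                frontier.append((target, depth + 1))
    return found

print(traverse("alice", "friend", 2))  # ['bob', 'carol']
```

Finding friends-of-friends here is one call; in a relational database the same question grows into one self-join per hop, which is exactly the workload graph databases optimize away.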
Search
These types of NoSQL databases tend to get a lot of attention, along with misinformation as to how they can be used. In the most basic sense, these datastores index “words” and provide searching capabilities similar to typing something into Google. They’re very good at finding things based on a keyword search over a very large distributed set of data, a kind of “enterprise search”. There is an upfront cost to building the index, and they aren’t designed for operations involving large scans of data (i.e., analytical workloads). Examples include Apache Solr and Elasticsearch (with Kibana commonly layered on top for visualization).
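At the core of these systems is an inverted index: each word maps to the set of documents containing it. A minimal sketch with made-up documents:

```python
from collections import defaultdict

# Toy inverted index; the documents are invented for illustration.
docs = {
    1: "contract signed by both parties",
    2: "parties agree to the terms",
    3: "terms of service update",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)   # the upfront indexing cost, paid at load time

def search(query):
    # Return ids of documents containing every query word (AND semantics).
    results = [index.get(w, set()) for w in query.lower().split()]
    return sorted(set.intersection(*results)) if results else []

print(search("parties terms"))  # [2]
```

Lookups are fast because they only touch the postings for the query words; scanning or aggregating every document, the analytical case noted above, gets no help from this structure.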
Time Series
These datastores are based on the same basic principle of an indexed and sharded key. The difference, however, is that a timestamp is always part of a concatenated key, and records are appended using an “insert” operation only. The primary use case for this type of platform is providing a persisted datastore for cases where you will primarily want to retrieve data as a trend, for example, a sensor with a unique ID that is constantly streaming data in. The data for any given unique ID is stored on the same node and sorted by timestamp. This makes it efficient to retrieve data for a specific ID at a point in time or over a time range. Many of these datastores also provide additional time-series capabilities such as summary rollups, time-based aggregations, and visual trending analysis. Examples include Druid, InfluxDB, and OpenTSDB. I could also make a strong argument for putting HBase in this category.
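The concatenated-key layout can be sketched with a sorted list in Python. The sensor IDs, timestamps, and readings are invented, and the `bisect` module stands in for the datastore’s real index:

```python
import bisect

# Toy sketch: each record's key is (sensor_id, timestamp), kept sorted so a
# time-range scan for one sensor reads one contiguous slice. Data is invented.
store = []  # sorted list of ((sensor_id, timestamp), value)

def insert(sensor_id, ts, value):
    bisect.insort(store, ((sensor_id, ts), value))  # insert-only writes

def range_query(sensor_id, start_ts, end_ts):
    # Binary-search the contiguous slice for one sensor's time range.
    lo = bisect.bisect_left(store, ((sensor_id, start_ts),))
    hi = bisect.bisect_right(store, ((sensor_id, end_ts), float("inf")))
    return [(k[1], v) for k, v in store[lo:hi]]

insert("sensor-7", 100, 20.5)
insert("sensor-7", 110, 21.0)
insert("sensor-9", 105, 3.3)
insert("sensor-7", 120, 21.4)

print(range_query("sensor-7", 100, 115))  # [(100, 20.5), (110, 21.0)]
```

Because all of one sensor’s records sit contiguously in key order, a time-range query is a binary search plus a sequential read, which is what makes trend retrieval cheap on these platforms.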