Why Has The Number of Database Products Exploded?
The high number of database products is a result of the amount of data we generate.
Yad Faeq via Quora
You’ve hinted at the term “long tail” for databases, which leads to a very interesting discussion. Chris Anderson explains the long tail in the entertainment industry in this talk; the same basis may apply to technology, and specifically to data.
Here are just a few applications of data that I can think of at the moment which might have led to the increase in data:
- Turning old text and physical data into digital data. (Books, public records, etc.)
- Using available data to generate more meaningful data. (Machine learning and analytics products)
- Keeping a record of the present for future applications of that data. (User experience and integrity)
- From the perspective of Moore’s law, we have amplified data generation as well with each chip or piece of technology built.
Here is an example of that Long Tail graph
The long tail applies here when different types of data need different sorts of databases. “Moore’s Law removes space as a constraint”: every kind of day-to-day work now has a home on the web along with its data, and there is an audience that will likely require easy access to that data.
This reveals that when there is a new database solution around the corner, it most likely offers functionality that meets a demand in a specific niche. Of course, there is a point where the market just becomes flooded with clones of products with little difference between them. We are not there yet, though.
The databases that have been mentioned:
All three fall into different categories of data storage:
- RethinkDB: Sits more on the document data model side, and it provides a neat way to do real-time querying. It brings in features such as table joins and group-by within documents.
- PipelineDB: Data streaming is a hot commodity for different manipulations of data, and PipelineDB brings serious data streaming to available data. An interesting fact to quote here:
PipelineDB can also make the infrastructure that surrounds it more efficient. After meeting with over a hundred different data-driven companies to learn about their pain points, we discovered that one of the most beneficial aspects of PipelineDB’s design is that it eliminates the necessity of an ETL stage for many data pipelines. ETL was often described as the most complex and burdensome part of these companies’ infrastructure, so they were eager to imagine how it could be simplified or removed altogether. It is indeed difficult to envision a future in which leading organizations are still primarily running periodic, batch ETL jobs.
Source: Blog-To Be Continuous
- InfluxDB: Falls into the time-series data storage type. In my third point earlier, I mentioned keeping records of data for future applications, and in the second, analytics products. InfluxDB is an example of supply for that sort of demand. Company A needs to store click-throughs, website crashes, traffic anomalies, etc. Sure, you can store that in a regular SQL or NoSQL database, but it becomes 10x better when the product specifically offers distributed time-series storage.
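To make the time-series point a bit more concrete, here is a minimal sketch in plain Python of what "storage shaped for time-stamped events" means. It is not InfluxDB's API or design; the class `TinySeriesStore`, its methods, and the `clicks` series are all hypothetical names invented for illustration. The idea is only that points are kept ordered by time, so range queries and per-minute rollups are cheap.

```python
from bisect import bisect_left, insort
from collections import defaultdict


class TinySeriesStore:
    """Toy in-memory store (hypothetical): points stay sorted by timestamp."""

    def __init__(self):
        # series name -> sorted list of (timestamp_seconds, value)
        self._series = defaultdict(list)

    def write(self, name, timestamp, value):
        # insort keeps the list ordered even if points arrive out of order
        insort(self._series[name], (timestamp, value))

    def range(self, name, start, end):
        """Return points with start <= timestamp < end."""
        points = self._series[name]
        lo = bisect_left(points, (start, float("-inf")))
        hi = bisect_left(points, (end, float("-inf")))
        return points[lo:hi]

    def downsample(self, name, start, end, bucket_seconds):
        """Sum values into fixed-width time buckets keyed by bucket start."""
        buckets = defaultdict(float)
        for ts, value in self.range(name, start, end):
            buckets[ts - (ts % bucket_seconds)] += value
        return dict(sorted(buckets.items()))


store = TinySeriesStore()
for ts, clicks in [(60, 5), (0, 1), (125, 1), (30, 2)]:
    store.write("clicks", ts, clicks)

# Clicks per minute over the first three minutes
clicks_per_minute = store.downsample("clicks", 0, 180, 60)
print(clicks_per_minute)
```

A general-purpose database can answer the same question, but a product built around this access pattern can optimize ordering, rollups and retention for it from the start.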
See where I’m going with this here?
- Lots of data; we need a variety of data products for these different types of data.
- In some cases we need to access the data in a matter of seconds.
Dremel (BigQuery) offers:
Dremel Can Scan 35 Billion Rows Without an Index in Tens of Seconds
source: BigQuery Paper
- Sometimes the data needs to conform to a specific standard, shape, or form, as in the case of time-series or map databases, hence we need those types as well.
The list goes on.
Some might argue about the need for these solutions, the same way the rise of different programming frameworks frustrated lots of folks a few years back. But now, in 2015, looking back on those days, one can tell that competition was low and did not set the bar high enough for the products offered.
Joydeep Sen has a spot-on point on this as well:
The second factor here is directly related to the explosive growth and funding of the internet/mobile/social sector.
YC alone has funded about a dozen database-related startups in the past 3 years, if you look them up.
If you want to be mind = blown, check out this GitHub repository listing most of the existing Big Data databases. It’s just daunting to know that this many products exist, because it somehow correlates with the amount of data being manipulated with these solutions.
On a different note, Stephen Pimentel earlier this year talked about the concept of:
The rise of the multimodel database
By mapping documents, graphs, and relational tables to a collection of keys and values, a single data store can support multiple data models
source: The rise of the multimodel database
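The idea in that quote can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not how any particular multimodel product lays out its data: the plain `dict` stands in for an ordered key-value store, and the key prefixes (`doc/`, `row/`, `edge/`) and helper names are hypothetical.

```python
import json

store = {}  # stand-in for an ordered key-value store (assumption)


def put_document(collection, doc_id, doc):
    # Document model: one key per document, value is the serialized document
    store[f"doc/{collection}/{doc_id}"] = json.dumps(doc)


def put_row(table, primary_key, row):
    # Relational model: one key per (table, primary key, column) cell
    for column, value in row.items():
        store[f"row/{table}/{primary_key}/{column}"] = json.dumps(value)


def put_edge(source, label, target):
    # Graph model: the edge is encoded entirely in the key
    store[f"edge/{source}/{label}/{target}"] = ""


def scan(prefix):
    # Prefix scan, the basic range read an ordered key-value store provides
    return {k: v for k, v in sorted(store.items()) if k.startswith(prefix)}


def neighbors(source, label):
    prefix = f"edge/{source}/{label}/"
    return [key[len(prefix):] for key in scan(prefix)]


put_document("users", "u1", {"name": "Ada"})
put_row("orders", 42, {"user": "u1", "total": 9.5})
put_edge("u1", "follows", "u2")
put_edge("u1", "follows", "u3")

followed = neighbors("u1", "follows")
order_cells = scan("row/orders/42/")
print(followed, order_cells)
```

All three models end up as keys and values in the same store, and each model's typical query becomes a key lookup or a prefix scan.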
Pimentel explains how these different SQL and NoSQL models can come together, and how that could be an alternative, or an already existing, solution that offers hybrid data storage and manipulation. It’s interesting to see that we are still looking ahead to create more database products, so this is not stopping or slowing down anytime soon.