Big Data: Fundamental Answers

Fundamental questions as big as data itself loomed at the beginning of Big Data Week.

Some answers:

How do customers of multiple managed big data companies deal with the heterogeneity?

Confluent provides Kafka, Rocana provides ops, Databricks gives you data science, Cloudera and Hortonworks give you everything else.

Each company has a proprietary layer meshed with open-source software. Generally, the more proprietary software you are running, the more you will need to consult with tech support.

My hypothesis coming into the week was that the integration of multiple proprietary systems leads to nasty problems and lots of calls with tech support. It was hard for me to discern if this issue is real or dreamed up by me. I asked most of my guests about it, and the answers varied.

I still don’t have a good answer to this, but I suspect that my hypothesis was off base. Systems integration always sucks and it will get worse as we move towards a Dockerized, Mesospheric world.

If you use the commercial solution, at least you will have tech support.

If you roll a data center with your own Kafka, your own Hadoop, your own Spark-to-Tableau communication layer, you may find yourself out of luck when Stack Overflow fails to answer your problem.

The optimistic interpretation: these companies are built to integrate with each other. Everyone is cut from the same open-source cloth.

It seems that the Age of Collaboration extends to the world of managed open-source.

Are there enough knowledgeable support technicians at managed big data companies to handle the customers?

I asked this question a couple times and the consensus response seemed to be “wtf are you talking about?”

Such is the consequence of reporting on Big Data stacks that I haven’t worked closely with. Sometimes I ask the wrong questions.

Why is this a bad question?

I’m not entirely sure. I don’t understand the relationship between Cloudera/Hortonworks and their customers.

“Support technician” in some cases would be an engineer at a managed big data vendor’s Customer Company (e.g., Citibank) who has gone through Cloudera/Hortonworks training. In other cases, it would be a solutions architect from Cloudera/Hortonworks who shows up at Customer Company to debug some horrendous data center Byzantium.

Are there enough of these “support technicians” to debug the numerous problems of tomorrow/today? IDK, but I am bullish on DevOps and bullish on Rocana for this reason.

How does a big data customer augment a batch pipeline with streaming?

A Big Data architecture is a set of user-visible values (UVV) {A, B, C…}. These values are abstractions above the lowest database layer.

A Big Data system’s streaming and/or batch workflow dictates how the diff between the UVV layer and the database layer gets resolved.

Some UVVs need to be updated frequently; others can tolerate a longer, eventually consistent update window. A UVV’s optimal update aggressiveness is related to the return-on-byte for the users who can see that UVV.

In the past, all UVV resolutions were batch Hadoop processes. Now we have the option to run some of that batch work as streaming. The biggest cost of doing so is translating Hadoop code into Spark/Storm/Samza/etc. code.

“How do you convert your batch Hadoop queries to streaming queries?” is the more eloquent way to put this question.

Spark appears to be the best API for this migration. Spark generalizes MapReduce.
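
To make that concrete, here is a minimal word-count sketch of my own (not from any interview) using PySpark’s RDD API, with placeholder HDFS paths. The map and reduce phases from classic MapReduce survive, but Spark plans the shuffle and job chaining itself.

    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount-sketch")

    # The "map" phase: split lines into (word, 1) pairs.
    pairs = (sc.textFile("hdfs:///logs/input.txt")   # placeholder input path
               .flatMap(lambda line: line.split())
               .map(lambda word: (word, 1)))

    # The "shuffle" and "reduce" phases collapse into one call;
    # Spark handles the data movement between stages.
    counts = pairs.reduceByKey(lambda a, b: a + b)

    counts.saveAsTextFile("hdfs:///logs/wordcounts")  # placeholder output path
    sc.stop()

Chaining several transformations like this in one driver program is what replaces a pipeline of separate MapReduce jobs.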

An example: Hive running on Hadoop batch takes a SQL query and translates and expands it into a series of Java MapReduce jobs. Each MapReduce job takes 3 steps: map, shuffle, and reduce. What a pain! 3N operations, where N is the number of MapReduce subproblems derived from the query.

Run the Hive job on a Storm or Spark API and it goes much faster: instead of 3N operations, the overall job simplifies to roughly 3N/X, where X is the generalization factor for your streaming system of choice.

Many older Hadoop systems currently have Hive configured with Hadoop, for batch queries.

These systems want to migrate away from Hadoop batch to a faster query interpretation tier (Spark/Storm/Samza).

Overwhelmingly, they are choosing Spark, because Spark has the most user-friendly API and migration process. I don’t know the details of this migration process. It probably involves calling up Hortonworks/Cloudera.
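
I can only guess at the mechanics, but a common low-friction path is pointing Spark SQL at the existing Hive metastore, so the old tables and HiveQL keep working while the execution engine changes underneath. A hedged sketch, with a made-up table name:

    from pyspark.sql import SparkSession

    # enableHiveSupport() lets Spark read the existing Hive metastore,
    # so tables defined for Hive-on-MapReduce stay queryable as-is.
    spark = (SparkSession.builder
             .appName("hive-migration-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # The same HiveQL that used to expand into MapReduce jobs now runs
    # on Spark's engine. 'page_views' is a hypothetical table.
    daily_counts = spark.sql("""
        SELECT dt, COUNT(*) AS views
        FROM page_views
        GROUP BY dt
    """)

    daily_counts.show()
    spark.stop()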

Are Hadoop queries mostly written in Pig or Hive?

A more appropriate question: “Do people write raw MapReduce jobs?” The answer is “mostly no.”

Hadoop jobs are the bytecode of the backend. Presto, Pig, and Hive get you the same answers with much less pain.
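
For contrast, here is what a hand-written mapper/reducer pair looks like, sketched with the mrjob library purely for illustration (the log format is hypothetical). In Hive or Presto this whole file collapses to roughly SELECT user_id, COUNT(*) FROM requests GROUP BY user_id.

    from mrjob.job import MRJob

    class MRRequestsPerUser(MRJob):
        """Count log lines per user id -- a one-line GROUP BY in Hive or Presto."""

        def mapper(self, _, line):
            # Hypothetical log format: "timestamp user_id url"
            fields = line.split()
            if len(fields) >= 2:
                yield fields[1], 1

        def reducer(self, user_id, counts):
            yield user_id, sum(counts)

    if __name__ == "__main__":
        MRRequestsPerUser.run()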

Is a measurement of Big Data throughput the new Moore’s Law?

Moore’s Law says that the transistor density on a chip doubles every 18 months. This isn’t happening any more.

But processors are getting more cores, and our software is getting better at distributing work across those cores. Shouldn’t we have a new metric for describing advances at the multi-core tier?

Maybe, but it would be an equally contrived (albeit useful) benchmark for technological progress. Let’s just stick to Moore’s Law.

The Eli Collins interview contains a good discussion of this.

Where does Kafka fit in?

Apache Kafka is the plumbing of a big data system, enabling cross-platform data transfer and low-latency stream processing:

We built Apache Kafka at LinkedIn with a specific purpose in mind: to serve as a central repository of data streams. But why do this? There were two motivations.

The first problem was how to transport data between systems. We had lots of data systems: relational OLTP databases, Hadoop, Teradata, a search system, monitoring systems, OLAP stores, and derived key-value stores. Each of these needed reliable feeds of data in a geographically distributed environment. I’ll call this problem “data integration”, though we could also call it ETL.

The second part of this problem was the need to do richer analytical data processing—the kind of thing that would normally happen in a data warehouse or Hadoop cluster—but with very low latency. I call this “stream processing” though others might call it “messaging” or CEP or something similar.

-Jay Kreps, “Putting Apache Kafka to Use: A Practical Guide to Building a Stream Data Platform”
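
A minimal sketch of that plumbing role using the kafka-python client; the broker address, topic name, and consumer group are placeholders.

    import json
    from kafka import KafkaProducer, KafkaConsumer

    BROKERS = "localhost:9092"   # placeholder broker address
    TOPIC = "user-activity"      # hypothetical stream of events

    # Producer side: an application publishes events to the central log.
    producer = KafkaProducer(
        bootstrap_servers=BROKERS,
        value_serializer=lambda event: json.dumps(event).encode("utf-8"),
    )
    producer.send(TOPIC, {"user": "u42", "action": "page_view"})
    producer.flush()

    # Consumer side: a Hadoop loader, a search indexer, and a stream processor
    # can each read the same feed at their own pace, tracked per consumer group.
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKERS,
        group_id="hadoop-loader",       # placeholder group
        auto_offset_reset="earliest",
    )
    for message in consumer:            # blocks, polling for new records
        print(message.value)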

What is the difference between Spark, Storm, Flink, Samza, and other streaming technologies?

From Fabian Hueske on Stack Overflow (emphasis mine):

Apache Flink is a framework for unified stream and batch processing. Flink’s runtime natively supports both domains due to pipelined data transfers between parallel tasks which includes pipelined shuffles. Records are immediately shipped from producing tasks to receiving tasks (after being collected in a buffer for network transfer). Batch jobs can be optionally executed using blocking data transfers.

Apache Spark is a framework that also supports batch and stream processing. Flink’s batch API looks quite similar and addresses similar use cases as Spark but differs in the internals. For streaming, both systems follow very different approaches (mini-batches vs. streaming) which makes them suitable for different kinds of applications. I would say comparing Spark and Flink is valid and useful, however Spark is not the most similar stream processing engine to Flink.

Coming to the original question, Apache Storm is a data stream processor without batch capabilities. In fact, Flink’s pipelined engine internally looks a bit similar to Storm, i.e., the interfaces of Flink’s parallel tasks are similar to Storm’s bolts. Storm and Flink have in common that they aim for low latency stream processing by pipelined data transfers. However, Flink offers a more high-level API compared to Storm.

I didn’t hear much about Samza during the reporting, but Apache describes it as “a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.”
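
To ground Hueske’s mini-batches-versus-streaming distinction, here is a hedged sketch of Spark’s DStream API, which chops a live stream into small batches and runs a job on each one; Flink and Storm instead push records through the pipeline one at a time. The socket source and port are placeholders.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="mini-batch-sketch")
    # Every 5 seconds, Spark closes off a mini-batch and runs the job below on it.
    ssc = StreamingContext(sc, batchDuration=5)

    # Placeholder source: lines of text arriving on a local socket.
    lines = ssc.socketTextStream("localhost", 9999)

    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    counts.pprint()          # print each mini-batch's word counts

    ssc.start()
    ssc.awaitTermination()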

How has the database world been affected by Big Data?

The volume of data has driven a reframing of what it means to be a database.

  • RethinkDB: sits more on the document data model side, since it provides a neat way to do real-time querying. It brings in features such as table joins and group-by within documents.
  • PipelineDB: data streaming is a hot commodity for different manipulations of data, and PipelineDB brings serious streaming to available data. An interesting fact to quote here:

    PipelineDB can also make the infrastructure that surrounds it more efficient. After meeting with over a hundred different data-driven companies to learn about their pain points, we discovered that one of the most beneficial aspects of PipelineDB’s design is that it eliminates the necessity of an ETL stage for many data pipelines. ETL was often described as the most complex and burdensome part of these companies’ infrastructure, so they were eager to imagine how it could be simplified or removed altogether. It is indeed difficult to envision a future in which leading organizations are still primarily running periodic, batch ETL jobs.
    Source: blog post “To Be Continuous”

  • InfluxDB: falls into the time-series data storage category. In my 3rd point earlier, I mentioned using data records for future applications, and in the second, analytics products. InfluxDB is an example of supply for that sort of demand. Company A needs to store click-throughs, website crashes, traffic anomalies, etc. Sure, you can store that in a regular SQL or NoSQL database, but it becomes 10x better when the product specifically offers distributed time-series storage.

See where I’m going with this here?

  1. Lots of data: we need a variety of data products for these different types of data.
  2. In some cases we need to access the data in a matter of seconds.
    Dremel (BigQuery) offers:

    Dremel Can Scan 35 Billion Rows Without an Index in Tens of Seconds

    source: BigQuery Paper

  3. Sometimes the data needs a specific standard or shape and form applied to it, as in the case of time-series or map databases, hence we need those types as well.

The list goes on.

-Yad Faeq discusses new database technology
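
As a footnote to the point above about accessing data in seconds, here is a hedged sketch of an interactive query against BigQuery (the productized Dremel) using Google’s Python client; the project, dataset, and table names are hypothetical, and it assumes GCP credentials are already configured.

    from google.cloud import bigquery

    client = bigquery.Client()   # assumes GOOGLE_APPLICATION_CREDENTIALS is set

    # Hypothetical table; BigQuery scans it column-by-column with no index.
    query = """
        SELECT country, COUNT(*) AS hits
        FROM `my_project.web_logs.page_views`
        GROUP BY country
        ORDER BY hits DESC
        LIMIT 10
    """

    for row in client.query(query):   # iterating waits for the job to finish
        print(row.country, row.hits)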

What is at the intersection of JavaScript and Big Data?

Piers Hollott has a good answer to this on Quora (emphasis mine):

Data Visualization is very important in both of these areas. The Big Data ecosystem uses visualization tools like Tableau to make sense out of massive amounts of data; the JavaScript ecosystem uses visualization libraries like d3.js for a similar purpose; in the Cloud ecosystem, Office 365/Excel does something similar.

Mobile Web and Social Graphs. A major reason we create massive amounts of data is because we have mobile and ubiquitous access in social settings like Facebook or Instagram, and mobile access breaks roughly down into native applications and RWD or hybridized access, which use JavaScript. One approach to information is to expose everything; an alternative is to control the flow of information at the source, in the device, so there is a strong connection between Mobile Development as an information source and Big Data applications as an information target. Big Data would not exist as it does if not for Ajax.

People. This is something I have observed, which I think really cuts to the heart of this question – people who like declarative programming and JavaScript often also appreciate Information Architecture, Ontologies, Semantic concerns, and therefore Big Data. The same people I know who love highly scalable XML and JSON databases are also excellent JavaScript developers.

NoSQL Databases. Okay, Hadoop, Scala, Netezza, MapReduce and so forth are all important buzzwords related to Big Data; and so are NoSQL databases like MongoDB and CouchBase, and when you are working with these, you are working with JavaScript, at least as much as you are working with JSON.

What is at the intersection of Bitcoin and Big Data?

Andreas Antonopoulos answered this question for me during my Bitcoin interview with him. Paraphrasing him, Bitcoin is the financial manifestation of Big Data. The blockchain is a lot of data, distributed to a lot of people.

This will be discussed more in the following week.
