Is Scala a better choice than Python for Apache Spark?

From Marcin Mejran’s answer via Quora:

If you mean the API, then it depends.

First of all, performance most likely won't matter, since it's almost all Scala under the hood for Spark anyway, and you can always use more machines to make up for any remaining difference. Learning curves can be overcome, and Spark's Scala API is rather simple. Ease of use is honestly a toss-up, and probably the key point to consider.
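For a sense of what "rather simple" means in practice, here's a minimal word count against Spark's Scala RDD API. This is just a sketch: the `input.txt` path and the `local[*]` master are placeholders, not anything from the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Local-mode context for illustration; on a cluster you'd set a real master.
    val sc = new SparkContext(new SparkConf().setAppName("wordcount").setMaster("local[*]"))

    val counts = sc.textFile("input.txt")   // hypothetical input file
      .flatMap(_.split("\\s+"))             // split lines into words
      .map(word => (word, 1))               // pair each word with a count of 1
      .reduceByKey(_ + _)                   // sum counts per word across the cluster

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```

That's about as much ceremony as the equivalent PySpark script, which is the point: the API itself isn't the hurdle.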

Spark is written in Scala, so knowing Scala lets you understand and modify what Spark does internally. Without Scala, once you hit the wall of what's possible out of the box, you're done. Need a new type of RDD? Too bad. Spark's documentation is not perfect, and as with much of open source, the code is the real documentation. So unless you can read Scala, you'll never really understand what is happening when you run some code.
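To give a feel for what that wall looks like, here's a rough sketch of defining a new RDD type in Scala: you subclass `RDD` and implement `getPartitions` and `compute`. The `RangeRDD` and `RangePartition` names here are made up for illustration, not anything Spark ships.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: each one just remembers its index and sub-range.
private case class RangePartition(index: Int, lo: Int, hi: Int) extends Partition

// A toy RDD producing the integers [start, end), split into numSlices partitions.
class RangeRDD(sc: SparkContext, start: Int, end: Int, numSlices: Int)
    extends RDD[Int](sc, Nil) { // Nil: no parent RDDs, this is a data source

  // Describe how the data is split across the cluster.
  override protected def getPartitions: Array[Partition] = {
    val span = math.max(1, (end - start) / numSlices)
    (0 until numSlices).map { i =>
      val lo = start + i * span
      val hi = if (i == numSlices - 1) end else math.min(end, lo + span)
      RangePartition(i, lo, hi): Partition
    }.toArray
  }

  // Produce the records for one partition.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.lo until p.hi).iterator
  }
}
```

With that in place, `new RangeRDD(sc, 0, 100, 4).sum()` works like any other RDD. Nothing comparable is possible from the Python side; PySpark only exposes what the Scala core already provides.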

That said, Scala doesn't have anywhere near the breadth of data science libraries and tools that Python does. There isn't good visualization, there aren't good local data transformation libraries, there aren't good local tools, etc., etc. People have ported most of the core parts of R to Python, and there are easy ways to call R directly from Python. Scala/Java have not had anywhere near as much effort on that front. There are some things, but they're not very comprehensive or cohesive. For example, Databricks seems to be basing its proprietary offerings on Python rather than Scala (basically an IPython notebook for Spark).

So if you want to do simple to moderately complex analysis using Spark, then Python seems the way to go. If you want to build production systems, then Scala is the way to go. And if you want to do really complicated things, then Scala, with possibly a final Python layer on top.
