“I normally try to sit together or very close to a product team or engineering team. And by doing so, I get very close to the source of all kinds of challenging problems.”
“We want people to be able to pick up whatever tool it is and really push themselves to get something done with it in a short amount of time, because that’s ultimately what they need to do as a data engineer in the industry.”
“Changing anything changes everything.”
Technical debt, referring to the compounding cost of changes to software architecture, can be especially challenging in machine learning systems.
“[As adults] we get overly serious, and a lot of the fun goes out of learning, so I think what we’re trying to do is make Treehouse delightful.”
“There’s not enough data scientists out there, and every company wants them to do everything. So, you really have to focus on ‘How can I be most impactful with the limited time and resources I have?’ ”
“Sometimes there’s a misconception that Genie is a job scheduling platform… Genie really represents our extraction layer, from what our computational resources are, to our end user jobs.”
Genie is an open-source tool that provides job and resource management for the Hadoop ecosystem in the cloud.
Current infrastructure makes it difficult for data scientists to share analytical models with the software engineers who need to integrate them. Yhat is an enterprise software company tackling the challenge of how data science gets done. Their products enable companies and users to easily deploy data science environments and translate analytical models into production code.
Data science competitions are an effective way to crowdsource the best solutions for challenging datasets. Kaggle is a platform for data scientists to collaborate and compete on machine learning problems with the opportunity to win money from the competitions’ sponsors.
“A lot of data science teams – if you ask them what their ten most important questions are… a lot of people can’t even come up with those.”
Many companies find themselves drowning in data. In the pursuit of actionable insights, asking the right questions matters far more than the quantity of data.
Data science is a broad topic with numerous subfields such as data engineering and machine learning. Yad Faeq returns to the podcast to discuss data science at a high level, and rescue Software Engineering Daily from the threat of the hype vortex.
There is a need for more data scientists to make sense of the vast amounts of data we produce and store. Dataquest is an in-browser platform for learning data science that is tackling this problem.
Vik Paruchuri is the founder of Dataquest. He was previously a machine learning engineer at edX and, before that, a U.S. diplomat.
Data science is saving and improving lives by leveraging sensor data and machine learning. Pivotal makes software platforms and database products to enable enterprises to make use of their data.
Sarah Aerni is principal data scientist at Pivotal.
Dima Korolev, Engineer and Data Scientist, via Quora
Here are the two approaches to data science, which I call the Sysadmin approach and the Scientist approach. Sysadmin approach: use the knowledge obtained by reading Apache logs, nginx logs, systemd logs, cron logs, etc. A good sysadmin would open the log file, press Page Down and watch it, stopping and scrolling back on anomalies. A great sysadmin would make a couple iterations of…
http://traffic.libsyn.com/sedaily/matei_spark.mp3
Apache Spark is a fast and general engine for big data processing. Matei Zaharia created Spark and is a co-founder of Databricks, a company using Spark to power data science.
Questions:
What was the motivation behind creating Spark?
How much faster is a Spark job than a Hadoop job?
What is the relationship between streaming and batch processing?
Is Spark’s core advantage over Storm…