“I normally try to sit together or very close to a product team or engineering team. And by doing so, I get very close to the source of all kinds of challenging problems.”
“We want people to be able to pick up whatever tool it is and really push themselves to get something done with it in a short amount of time, because that’s ultimately what they need to do as a data engineer in the industry.”
“Changing anything changes everything.”
Technical debt, referring to the compounding cost of changes to software architecture, can be especially challenging in machine learning systems.
“There’s not enough data scientists out there, and every company wants them to do everything. So, you really have to focus on ‘How can I be most impactful with the limited time and resources I have?’ ”
“Sometimes there’s a misconception that Genie is a job scheduling platform… Genie really represents our abstraction layer, from what our computational resources are, to our end user jobs.”
Genie is an open-source tool that provides job and resource management for the Hadoop ecosystem in the cloud.
Current infrastructure makes it difficult for data scientists to share analytical models with the software engineers who need to integrate them. Yhat is an enterprise software company tackling the challenge of how data science gets done. Their products enable companies and users to easily deploy data science environments and translate analytical models into production code.
“A lot of data science teams – if you ask them what their ten most important questions are… a lot of people can’t even come up with those.”
Many companies find themselves drowning in data. In the pursuit of actionable insights, asking the right questions matters far more than the quantity of data.
Data science is a broad topic with numerous subfields such as data engineering and machine learning. Yad Faeq returns to the podcast to discuss data science at a high level, and rescue Software Engineering Daily from the threat of the hype vortex.
There is a need for more data scientists to make sense of the vast amounts of data we produce and store. Dataquest is an in-browser platform for learning data science that is tackling this problem.
Vik Paruchuri is the founder of Dataquest. He was previously a machine learning engineer at edX and, before that, a U.S. diplomat.
Apache Kafka is a publish-subscribe messaging system rethought as a distributed commit log. Kafka serves as the central repository for data streams in a distributed system. Guozhang Wang is an engineer at Confluent, which offers a stream data platform built using Kafka. Questions include: What is a central repository for data streams? How does Kafka improve transportation between systems? How does Kafka allow for richer