Episode Summary for Data Exploration with a New Python Library with Doris Lee
Doris Jung-Lin Lee is currently a graduate research assistant and a Ph.D. student in the Information Management and Systems department at the University of California, Berkeley. Her research sits at the intersection of databases, data management, and human-computer interaction. She works on developing Lux, a Python library for accelerating and simplifying the process of data exploration.
Data exploration uses visualization to understand what is in a data set and what its characteristics are. Data scientists explore data to understand things like customer behavior and resource utilization. Common programming languages for data exploration include Python, R, and MATLAB. Data scientists use many automated assistance tools for interactive data exploration, which has become an area of interest in the field of machine learning. One could imagine automated assistance in machine learning development as a fully automated robot. That is not the case, however, because automation comes in different phases.
The three main phases can be explained with an example from cars. Cars can be fully automated, half automated, or, like our current cars, mostly manual. Even our current cars have some level of automation built in. For instance, a driver does not need to think about how the pistons in the engine work or how the gas pedal works. Current cars are a good analogy for where current machine learning tools stand, such as the scikit-learn Python library and similar packages. People still build these tools and their pipelines manually for a particular end objective, just as car manufacturers add a little more automation with every new model. The end goal is a fully automated machine learning system.
The H2O framework and other automated machine learning tools are good examples of this trend. They introduce more levels of automation and work at a very high level. They let users specify the objective they are trying to achieve. Is it a classification task or a prediction task? Which variables are you interested in predicting? Once the user has answered these questions, the system performs some form of automated search to find the best machine learning pipeline or workflow for the task.
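The search step can be illustrated with a toy sketch. This is not H2O's API or algorithm: the candidate "pipelines", the data, and every function name below are hypothetical, and plain Python stands in for a real library. The user supplies labeled data and an objective (here, classification accuracy), and the search fits every candidate and keeps the best scorer.

```python
# Toy sketch of automated pipeline search (hypothetical; not H2O's API).
from collections import Counter


def fit_majority(rows, labels):
    """Baseline: always predict the most common training label."""
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda row: most_common


def fit_threshold(rows, labels):
    """Predict 1 when the first feature exceeds its training mean."""
    mean = sum(r[0] for r in rows) / len(rows)
    return lambda row: 1 if row[0] > mean else 0


def fit_nearest_neighbor(rows, labels):
    """Predict the label of the closest training row (1-NN)."""
    def predict(row):
        dists = [sum((a - b) ** 2 for a, b in zip(row, r)) for r in rows]
        return labels[dists.index(min(dists))]
    return predict


def accuracy(model, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def search_best_pipeline(candidates, train, valid):
    """Fit each candidate on train, score it on valid, return (best name, all scores)."""
    scores = {}
    for name, fit in candidates.items():
        model = fit(*train)
        scores[name] = accuracy(model, *valid)
    best = max(scores, key=scores.get)
    return best, scores


CANDIDATES = {
    "majority": fit_majority,
    "threshold": fit_threshold,
    "nearest_neighbor": fit_nearest_neighbor,
}

# Made-up data: two numeric features and a binary label.
train_rows = [(1.0, 5.0), (2.0, 4.0), (8.0, 1.0), (9.0, 2.0), (1.5, 4.5), (8.5, 1.5)]
train_labels = [0, 0, 1, 1, 0, 1]
valid_rows = [(1.2, 4.8), (9.1, 1.1), (2.2, 4.1), (7.9, 2.0)]
valid_labels = [0, 1, 0, 1]

best, scores = search_best_pipeline(
    CANDIDATES, (train_rows, train_labels), (valid_rows, valid_labels)
)
print(best, scores)
```

Real automated machine learning systems search a far larger space (models, preprocessing steps, hyperparameters) under time budgets, but the shape of the loop is the same: propose, fit, score, keep the best.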
One of the projects Doris Lee works on is Lux. Lux is a platform for easy data exploration and a Python application programming interface for visual discovery. Automated, intelligent data discovery is a significant problem. People face numerous decisions and questions when they want to learn more about their data set. Some of these questions are:
- What are the relevant paths of exploration I should take?
- How can I process my data correctly?
- How do I visualize my data?
- How do I look at my data in a way that allows me to extract meaningful insights?
Much of Lee’s research aims to provide a level of assistance or automation that helps people discover these insights more easily, without thinking too much about the sequence of operations they need to perform on their data to get there.
With Lux, users can simply print a DataFrame in a Jupyter notebook, and Lux recommends a set of interesting visualizations that might be useful for the analysis. These visualizations are displayed as a Jupyter widget directly inside the notebook, which provides many advantages. They are essentially recommended for free: users do not need to write any additional code or change any pandas DataFrame commands they are already using. When users print the DataFrame, they get an alternative, visual way of looking at it.
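In Lux itself, the only change to a notebook is adding `import lux` alongside pandas; printing the DataFrame then shows the widget. The idea of recommending charts from the data alone can be sketched in plain Python. The heuristic below is an illustrative assumption, not Lux's actual recommendation logic: it guesses whether each column is quantitative or categorical and proposes a histogram, a bar chart, or a scatter plot accordingly. All function names and the sample table are hypothetical.

```python
# Hypothetical heuristic for suggesting charts; not Lux's real algorithm.
from itertools import combinations


def infer_kind(values):
    """Call a column quantitative if every value is a number, else categorical."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if numeric else "categorical"


def recommend_charts(table):
    """table maps column name -> list of values; returns (chart, columns) pairs."""
    kinds = {name: infer_kind(values) for name, values in table.items()}
    suggestions = []
    # One univariate chart per column.
    for name, kind in kinds.items():
        chart = "histogram" if kind == "quantitative" else "bar"
        suggestions.append((chart, (name,)))
    # One scatter plot per pair of quantitative columns.
    quantitative = [n for n, k in kinds.items() if k == "quantitative"]
    for x, y in combinations(quantitative, 2):
        suggestions.append(("scatter", (x, y)))
    return suggestions


# Made-up sample data standing in for a DataFrame.
table = {
    "age": [34, 51, 27, 45],
    "income": [52000.0, 88000.0, 39000.0, 71000.0],
    "region": ["west", "east", "east", "south"],
}
for chart, columns in recommend_charts(table):
    print(chart, columns)
```

The point of the sketch is that the user asks for nothing explicitly: the suggestions follow from the shape of the data, which is what lets a tool surface them whenever the data is printed.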
According to Doris Lee, visualization shouldn’t be something that happens only at the end of your analysis. People often find anomalies or unexpected behaviors in their data simply by looking at visualizations. The goal of Lux is to help you think about your overall notebook workflow and to provide an alternative, visual way of experimenting with and understanding data.
Numerous businesses, including retail, insurance, media, and healthcare companies, use Lux in their workflows. People generally use Lux alongside their favorite plotting tools such as Matplotlib or Seaborn, and sometimes alongside plotting directly from pandas DataFrames. Lux is built on top of ipywidgets, which are interactive HTML widgets for Jupyter notebooks and the IPython kernel. Ipywidgets can be used to build things like sliders or buttons, and they also handle some communication with the notebook itself. Lux’s design principle has been to get users to visualizations as early as possible during exploration, minimizing the activation energy required.
Lux also lets users take an automatically recommended visualization and export it to code, so that they can fine-tune it in libraries like Altair and Matplotlib. The design principle of Lux, however, is not to produce the best visualization a user could ideally build in a business intelligence tool. The goal is to get something good enough for exploration that communicates a quick insight from your data. Lux is a high-level way of guiding users toward relevant analysis.
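Exporting a chart to code amounts to turning a chart description into source text that the user can then edit. A minimal sketch of that idea follows, assuming a hypothetical chart spec (a dict with `chart`, `x`, and `y` keys) and a hypothetical `to_matplotlib_code` helper. Lux's own export functions and the code they emit will differ.

```python
# Hypothetical sketch: turn a chart spec into editable Matplotlib source.
# Lux's own export functions are not shown here.

PLOT_CALLS = {"bar": "ax.bar", "scatter": "ax.scatter", "line": "ax.plot"}


def to_matplotlib_code(spec, frame_name="df"):
    """Return Matplotlib source text for a spec with 'chart', 'x', and 'y' keys."""
    call = PLOT_CALLS[spec["chart"]]
    x, y = spec["x"], spec["y"]
    lines = [
        "import matplotlib.pyplot as plt",
        "",
        "fig, ax = plt.subplots()",
        f"{call}({frame_name}[{x!r}], {frame_name}[{y!r}])",
        f"ax.set_xlabel({x!r})",
        f"ax.set_ylabel({y!r})",
        "plt.show()",
    ]
    return "\n".join(lines)


code = to_matplotlib_code({"chart": "scatter", "x": "age", "y": "income"})
# compile() only parses the text; it raises SyntaxError if it is not valid Python.
compile(code, "<exported>", "exec")
print(code)
```

Because the output is ordinary source text, the user can paste it into a cell and adjust titles, colors, or axes by hand, which is the fine-tuning step described above.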
Data science is constantly changing, and it is leaning toward interactive data science. Interactive data science comprises the data analysis, data cleaning, and machine learning that users perform interactively. Doris Lee thinks there’s a lot of potential in that. We can already see how the Jupyter community has contributed to the open-source ecosystem of tools that allow end users and data scientists to work with their data interactively in an accessible and intuitive way.
The other change is that the computational notebook itself is becoming a window to data science. It’s highly interactive and accessible, and it is generally an entry point for people who are starting out in data science and want to learn quickly. Standard tools like pandas and scikit-learn are entry points as well. There is a shift toward consolidation, essentially a convergence on notebooks as the interactive computing platform. Enterprise notebook offerings in the cloud are good examples of that.
It is exciting to see the effects of these changes on data science workflows. They can enable domain experts, small and mid-size enterprises (SMEs), and others who have domain knowledge but are not necessarily well trained in computer or data science to derive meaningful insights from their data more easily through these accessible and intuitive computational notebooks and platforms.
This summary is based on an interview with Doris Jung-Lin Lee, Graduate Research Assistant at the University of California, Berkeley.