Introduction to Automated Machine Learning (AutoML)

Article Wednesday, May 15 2019

Machine learning is undoubtedly one of the biggest strides in technology. Its methods are employed in fields ranging from the biomedical industry to agriculture, and from personalized assistants to self-driving vehicles. Ranked by LinkedIn as the 2nd most important hard skill to have, machine learning and AI require careful study and understanding of different algorithms, model types, their advantages and disadvantages, and their use cases.

In what is called a machine learning pipeline, there are several steps:

Data preprocessing: scaling, missing value imputation
Feature engineering: feature selection, feature encoding
Model selection
Hyperparameter optimization

Source: https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b

A machine learning engineer or data scientist building a machine learning pipeline for a specific task has to carefully design each of these steps. The steps are usually co-dependent. For example, consider a problem where an SVM is the desired model. Since SVMs cannot work natively with categorical features, these have to be transformed into numerical features in some way, for example by one-hot encoding. In this case, the model selection affects how certain features are encoded.
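The co-dependence of these steps can be made concrete with a small scikit-learn pipeline. The following is only a minimal sketch: the toy data is made up, the categorical feature is assumed to be stored as integer category ids, and the specific choices (mean imputation, standard scaling, an RBF-kernel SVM) are illustrative rather than recommended.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Toy data (made up): column 0 is numeric with one missing value,
# column 1 is a categorical feature stored as category ids 0, 1, 2.
X = np.array([
    [1.0, 0], [2.0, 1], [np.nan, 0],
    [4.0, 2], [5.0, 1], [6.0, 2],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Data preprocessing for the numeric column: imputation, then scaling.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Feature encoding: the SVM needs numeric inputs, so the categorical
# column is one-hot encoded.
preprocess = ColumnTransformer([
    ("num", numeric_steps, [0]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), [1]),
])

# Model selection: an SVM is placed at the end of the pipeline.
model = Pipeline([
    ("preprocess", preprocess),
    ("svm", SVC(kernel="rbf", C=1.0)),
])
model.fit(X, y)

transformed = model.named_steps["preprocess"].transform(X)
print(transformed.shape)  # 1 scaled numeric column + 3 one-hot columns
```

Swapping the SVM for a model that handles categorical features natively would make the one-hot encoding step unnecessary, which is exactly the kind of interaction a pipeline designer has to keep track of.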

Designing and optimizing these steps requires deep knowledge of a wide range of algorithms, their strengths and weaknesses, their hyperparameters, and the data encodings these algorithms need to work well. In a technological landscape where AI is being integrated into many fields, there is a shortage of data scientists with enough expertise to analyze diverse sets of data and build machine learning models.

In an effort to make machine learning more accessible, to reduce the human expertise required, and to improve model performance, automated machine learning emerged as an exciting new area of active research.

Figure from Microsoft Azure Machine Learning AutoML

Automated machine learning, or AutoML, is an umbrella term for a particular approach to machine learning that aims to automate any part of the process of building a machine learning model from raw data.

AutoML caught the spotlight after Google announced its AutoML suite, Google Cloud AutoML, and Microsoft announced AutoML in Azure Machine Learning. Google’s start with AutoML came in the form of AutoML Vision for image recognition. As the first tech giant to offer AutoML to developers around the world, Google is continuing to expand on AutoML, with new tools around Cloud AutoML announced at Google Next ‘19.

Current AutoML tools like Auto-WEKA and auto-sklearn focus on automating the steps of model selection and hyperparameter optimization. This subproblem is known as CASH, the Combined Algorithm Selection and Hyperparameter optimization problem. Given a set of algorithms and their hyperparameters, the aim of CASH is to find the joint choice of algorithm and hyperparameter settings that minimizes the loss on held-out (for example, cross-validation) data after training on the training data.

CASH problem from Efficient and Robust Automated Machine Learning by Feurer et al.
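For readers without access to the figure, the CASH objective can be written out as follows, using the notation of Feurer et al.: the set of algorithms is A = {A^(1), ..., A^(R)}, Λ^(j) is the hyperparameter domain of algorithm A^(j), L is the loss, and the dataset is split into K cross-validation folds with training and validation parts.

```latex
A^{\star}, \lambda^{\star} \in
  \operatorname*{argmin}_{A^{(j)} \in \mathcal{A},\ \lambda \in \Lambda^{(j)}}
  \frac{1}{K} \sum_{i=1}^{K}
  \mathcal{L}\left(A^{(j)}_{\lambda},\ D^{(i)}_{\mathrm{train}},\ D^{(i)}_{\mathrm{valid}}\right)
```

Each term is the loss that algorithm A^(j) with hyperparameters λ achieves on the validation part of fold i after being trained on the training part of that fold.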

An important point to consider in AutoML applications is the budget: the developer has to specify limits on the resources used by the AutoML optimization process. This budget usually consists of one or a combination of CPU/GPU usage, running time, and memory usage.

Hyperparameter Optimization

In numerous machine learning models and algorithms, there exist two sets of parameters that are sometimes confused: model parameters and hyperparameters. Model parameters can also be known as weights in linear regression and deep learning. These model parameters are learned by the model from the data during training.

Hyperparameters, on the other hand, are different. Their values are set by the developer before the training stage starts. They are not learned from the data during training, like model parameters, and so hyperparameters are usually constant during the training phase.

To give some concrete examples for hyperparameters:

Learning rate (η)
Hidden layers and hidden units in deep learning models
Number of neighbors k in kNN
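The difference between the two kinds of parameters can be seen in a few lines of code. This is a minimal sketch with a made-up one-dimensional model: the function name fit_ridge_1d and the penalty strength alpha are assumptions chosen for illustration, with alpha standing in for any hyperparameter and w for any learned weight.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Fit y ~ w * x with an L2 penalty and return the learned weight w.

    alpha is a hyperparameter: it is fixed before training starts.
    w is a model parameter: it is computed from the data.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


# Toy data (made up) that follows y = 2x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# The same data yields a different learned weight for each hyperparameter value.
for alpha in (0.0, 1.0, 10.0):
    w = fit_ridge_1d(xs, ys, alpha)
    print(f"alpha={alpha}: learned w={w:.3f}")
```

The training procedure never changes alpha; it only changes w. Choosing alpha well is the job of the developer, or of a hyperparameter optimizer.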

Hyperparameter selection is crucial to the performance of a machine learning model. For example, in a neural network model, if the learning rate is set too high, gradient descent might overshoot a local minimum; if the learning rate is set too low, training might take a long time, since the steps taken during gradient descent are too small.

Source: https://www.jeremyjordan.me/nn-learning-rate/
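The effect of the learning rate can be reproduced on a toy problem. The sketch below minimizes f(x) = x² with plain gradient descent; the function, the starting point, and the three learning rates are assumptions picked only to show the three regimes.

```python
def gradient_descent(lr, steps=50, x0=1.0):
    """Minimize f(x) = x**2 (gradient 2*x) and return the final x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # one gradient descent step
    return x


# Too high, reasonable, and too low learning rates.
for lr in (1.1, 0.1, 0.001):
    print(f"lr={lr}: x after 50 steps = {gradient_descent(lr):.4g}")
```

With the largest learning rate every step overshoots the minimum at x = 0 and the iterate moves further away; with the smallest one the iterate has barely moved from its starting point after 50 steps.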

Hyperparameter optimization is the process of searching for the best hyperparameter combinations for a model to achieve desired performance and accuracy. In an AutoML perspective, hyperparameter optimization is the most basic, fundamental task to be completed.

The problem is not easy, however. For any given machine learning model, there can be numerous hyperparameters. Each hyperparameter can have a different domain: real-valued, binary, categorical, or integer-valued. In the case of real- and integer-valued hyperparameters, the feasible ranges are often not known in advance: the number of layers in a deep learning model, an integer-valued hyperparameter, can range from 1 to hundreds.

The configuration space becomes exceedingly complex as the number of hyperparameters increases. For an exhaustive search, every value of a hyperparameter has to be combined with every configuration of the other hyperparameters. Another problem that arises when more hyperparameters are considered is selecting which hyperparameters to optimize. Not all hyperparameters have the same effect on the performance of a model, and we don’t want to waste time optimizing hyperparameters that will give us only a marginal performance increase.

Thankfully, the optimization problem has been studied, and feasible solutions exist.

The first solution is quite straightforward: grid search. In grid search, the developer declares a set of values to be considered for each hyperparameter to be optimized. The model is then trained once for every combination of these values, that is, for every element of the Cartesian product of the sets, and the hyperparameter configuration of the best performing model is selected.

However, grid search suffers from the curse of dimensionality, as each additional hyperparameter exponentially increases the number of times the loss function must be evaluated. Another problem is the initialization: if the developer has not specified the optimal values in the set of each hyperparameter, the optimum can never be reached.
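Grid search fits in a few lines of Python. In this minimal sketch, validation_loss is a hypothetical stand-in for training a model and measuring its validation loss, and the hyperparameter names and value sets are made up for illustration.

```python
from itertools import product


def validation_loss(config):
    """Hypothetical stand-in for training a model and measuring validation loss."""
    return (config["learning_rate"] - 0.03) ** 2 + 1e-5 * abs(config["hidden_units"] - 100)


# Values declared by the developer for each hyperparameter.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "hidden_units": [16, 64, 256],
}

# Cartesian product of the value sets: 3 * 3 = 9 configurations.
names = list(grid)
candidates = [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

best = min(candidates, key=validation_loss)
print(len(candidates), "configurations evaluated; best:", best)
```

The toy loss is smallest at a learning rate of 0.03 and 100 hidden units, neither of which is in the grid, so the search can only return the closest declared combination. Adding a third hyperparameter with three values would triple the number of evaluations.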

An improvement is random search. As the name suggests, random search samples random configurations of hyperparameters and records the results until a specified budget is exhausted. Random search mitigates the curse of dimensionality, since we do not need to increase the number of search points whenever a new dimension is added. Random search performs better than grid search when some hyperparameters matter more than others for the performance of the model, which results in a low effective dimensionality. In theory, given enough budget, random search can find the optimal configuration.

Grid search and random search, from Random Search for Hyper-Parameter Optimization by Bergstra and Bengio
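A budgeted random search can be sketched as follows. As before, validation_loss is a hypothetical stand-in for training and evaluating a model, and the sampling ranges (a log-uniform learning rate, an integer number of hidden units) are assumptions made for the example.

```python
import random


def validation_loss(config):
    """Hypothetical stand-in for training a model and measuring validation loss."""
    return (config["learning_rate"] - 0.03) ** 2 + 1e-5 * abs(config["hidden_units"] - 100)


def random_search(budget, seed=0):
    """Evaluate `budget` random configurations and return the best one found."""
    rng = random.Random(seed)  # fixed seed so that a run can be replicated
    best_config, best_loss = None, float("inf")
    for _ in range(budget):
        config = {
            # Log-uniform sample between 1e-4 and 1.
            "learning_rate": 10 ** rng.uniform(-4, 0),
            "hidden_units": rng.randint(1, 512),
        }
        loss = validation_loss(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


for budget in (10, 100, 1000):
    config, loss = random_search(budget)
    print(f"budget={budget}: best loss={loss:.6f}")
```

The budget is the only thing that grows here: adding another hyperparameter means adding one more sampled entry to the configuration, not multiplying the number of evaluations.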

However, random search has its downsides as well. Reaching the optimum is not guaranteed, and replicability depends on the random seed. Is there a better and more rigorous method?

The answer, and the most widely used solution to the hyperparameter optimization problem, is Bayesian optimization. Bayesian optimization is a sequential model-based approach to finding the optimal configuration of an objective function, whether the task is posed as an argmax or an argmin. It consists of two main parts: a probabilistic surrogate model and an acquisition (or loss) function. The surrogate model starts from a prior distribution that we believe is close to the unknown objective function and is updated as observations arrive, while the acquisition function allows us to decide which point to evaluate next.

Bayesian optimization starts by taking a point in the multi-dimensional space of hyperparameter configurations and evaluating the objective function at that point. It then selects a new point that minimizes the acquisition function. Once evaluated, this point is added to our data set and becomes a historical observation to be used in future point selections.

Bayesian optimization algorithm, from Taking the Human Out of the Loop: A Review of Bayesian Optimization by Shahriari et al.
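The loop can be sketched in a few dozen lines of NumPy. This is a simplified illustration, not the algorithm as implemented in any AutoML library: the one-dimensional objective is made up, the surrogate is a Gaussian process with a fixed RBF kernel, the acquisition function is a lower confidence bound (mean minus kappa times standard deviation) that is minimized over a fixed grid of candidates, and all names and constants are assumptions.

```python
import numpy as np


def objective(x):
    """Hypothetical expensive function to minimize (stand-in for validation loss)."""
    return (x - 0.3) ** 2 + 0.1 * np.sin(15 * x)


def rbf_kernel(a, b, length_scale=0.1):
    """Squared-exponential kernel between two 1-D arrays of points."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """Posterior mean and standard deviation of the GP surrogate at x_new."""
    y_mean = y_obs.mean()  # center the observations around their mean
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_on = rbf_kernel(x_obs, x_new)
    solved = np.linalg.solve(k_oo, k_on)
    mean = y_mean + solved.T @ (y_obs - y_mean)
    var = 1.0 - np.sum(k_on * solved, axis=0)  # prior variance is 1
    return mean, np.sqrt(np.clip(var, 0.0, None))


def bayesian_optimization(n_iter=15, kappa=2.0, seed=0):
    rng = np.random.default_rng(seed)
    candidates = np.linspace(0.0, 1.0, 201)
    x_obs = rng.uniform(0.0, 1.0, size=2)  # two random initial points
    y_obs = objective(x_obs)
    for _ in range(n_iter):
        mean, std = gp_posterior(x_obs, y_obs, candidates)
        # Low where the predicted loss is low (exploitation) or where the
        # surrogate is uncertain (exploration).
        acquisition = mean - kappa * std
        x_next = candidates[np.argmin(acquisition)]
        # The evaluated point becomes a historical observation.
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, objective(x_next))
    best = np.argmin(y_obs)
    return x_obs[best], y_obs[best]


best_x, best_y = bayesian_optimization()
print(f"best x = {best_x:.3f}, objective = {best_y:.4f}")
```

Each iteration costs one evaluation of the objective plus a cheap refit of the surrogate, which is the trade that makes the method attractive when a single evaluation means training a model.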

Bayesian optimization is designed to trade off exploration and exploitation. The acquisition function takes lower values where uncertainty in the surrogate model is large, which encourages exploration. It also takes lower values where the surrogate model’s prediction is low, using the historical knowledge we have of the true objective function’s behavior, which encourages exploitation.

The performance of Bayesian optimization rests on selecting an appropriate surrogate model and acquisition function. The traditional surrogate model is a Gaussian process, but alternatives have been proposed, such as random forests, as in the SMAC framework, or the Tree-structured Parzen Estimator (TPE).

Bayesian optimization through 3 iterations for an argmax task, from Taking the Human Out of the Loop: A Review of Bayesian Optimization by Shahriari et al.

While Bayesian optimization is harder to wrap your head around and visualize compared to grid and random searches, it’s the most common hyperparameter optimization method used in the current AutoML libraries.

Case Study: auto-sklearn

From Efficient and Robust Automated Machine Learning by Feurer et al.

auto-sklearn is a popular automated machine learning toolkit, built on the widely used scikit-learn library for machine learning. auto-sklearn as a project, inspired by Auto-WEKA, expands upon the methods used by AutoML frameworks.

The core of the model is straightforward: taking into consideration 15 classification algorithms, 14 feature preprocessing methods, and 4 data preprocessing methods from scikit-learn, and taking suitable combinations, a parameter space of 110 hyperparameters is created. The 14 * 15 * 4 = 840 possible component combinations are not enumerated separately. Instead, because the hyperparameters of a preprocessing method or classifier are only active when that component is selected, the whole space is described by a structured, conditional set of 110 hyperparameters. This core ML framework is then optimized using Bayesian optimization to find the best possible combination of preprocessors, classifier, and hyperparameters.
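The idea of a conditional hyperparameter space can be illustrated with a heavily reduced, hypothetical example. The component names, hyperparameter names, and ranges below are made up and are not auto-sklearn's actual search space; the point is only that hyperparameters belonging to a component exist in a configuration only when that component is chosen.

```python
import random

# Hypothetical, heavily reduced search space. Each option lists the
# hyperparameters that only exist when that option is selected.
SEARCH_SPACE = {
    "classifier": {
        "svm": {"C": (0.01, 100.0), "gamma": (1e-4, 1.0)},
        "knn": {"n_neighbors": (1, 50)},
    },
    "feature_preprocessor": {
        "none": {},
        "pca": {"keep_variance": (0.5, 0.99)},
    },
}


def count_hyperparameters(space):
    """One categorical choice per slot plus every conditional hyperparameter."""
    return sum(
        1 + sum(len(hps) for hps in options.values())
        for options in space.values()
    )


def sample(space, rng):
    """Draw one configuration; only the chosen components' hyperparameters are set."""
    config = {}
    for slot, options in space.items():
        choice = rng.choice(sorted(options))
        config[slot] = choice
        for name, (low, high) in options[choice].items():
            if isinstance(low, int):
                config[f"{choice}:{name}"] = rng.randint(low, high)
            else:
                config[f"{choice}:{name}"] = rng.uniform(low, high)
    return config


print(count_hyperparameters(SEARCH_SPACE), "hyperparameters in total")
print(sample(SEARCH_SPACE, random.Random(0)))
```

Here two slots with two options each give four component combinations, while the space as a whole has six hyperparameters, of which only a subset is active in any single configuration.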

The innovative part of auto-sklearn comes in two methods: using meta-learning to warmstart Bayesian optimization for increased performance, and using ensemble methods with the resulting top classifiers to increase robustness and reduce overfitting.

Meta-learning is a field of machine learning that focuses on learning to learn. It’s based on the approach of systematically observing how ML approaches perform on a wide range of learning tasks, and using this knowledge in the form of meta-data on approaches to learn new tasks much faster.

In auto-sklearn, meta-learning is used to collect meta-features and performance metrics on datasets to identify the characteristics of the dataset that can suggest efficient algorithm and hyperparameter instantiation. Any newly encountered dataset goes through a stage of computation for its meta-features, and the result is compared with stored dataset meta-features to select k ML framework instantiations to be considered in the Bayesian optimization stage.
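The warm-start step can be sketched as a nearest-neighbor lookup over stored meta-features. Everything in this example is hypothetical: the dataset names, the three meta-features, the stored configurations, and the use of an L1 distance are assumptions made for illustration and do not reproduce auto-sklearn's actual meta-feature set.

```python
# Hypothetical meta-knowledge base: meta-features of previously seen datasets
# (log10 of samples, log10 of features, number of classes) and the
# configuration that performed best on each. All values are made up.
KNOWN_DATASETS = {
    "dataset_a": {"meta": [3.0, 1.0, 2.0], "best_config": {"classifier": "svm", "C": 1.0}},
    "dataset_b": {"meta": [5.0, 2.0, 10.0], "best_config": {"classifier": "random_forest"}},
    "dataset_c": {"meta": [3.2, 1.1, 2.0], "best_config": {"classifier": "knn", "n_neighbors": 5}},
    "dataset_d": {"meta": [6.0, 3.0, 2.0], "best_config": {"classifier": "sgd"}},
}


def l1_distance(a, b):
    """Sum of absolute differences between two meta-feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def warm_start_configs(new_meta, k):
    """Return the best configurations of the k most similar known datasets."""
    ranked = sorted(
        KNOWN_DATASETS,
        key=lambda name: l1_distance(KNOWN_DATASETS[name]["meta"], new_meta),
    )
    return [KNOWN_DATASETS[name]["best_config"] for name in ranked[:k]]


# Meta-features computed for a newly encountered dataset (made up).
new_dataset_meta = [3.1, 1.0, 2.0]
initial_configs = warm_start_configs(new_dataset_meta, k=2)
print(initial_configs)
```

The returned configurations would be evaluated first, so that Bayesian optimization starts from points that worked well on similar data instead of from arbitrary ones.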

The other improvement, automated ensemble construction, takes advantage of the fact that there might be more than one model that performs well on the given dataset. If there are models that are close to the best performing model, instead of discarding them, they can be used to construct an ensemble.
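One way to build such an ensemble is greedy ensemble selection, in which models are added one at a time (with repetition allowed) so that the averaged prediction has the lowest validation error. The sketch below is a simplified illustration with made-up validation predictions and mean squared error as the metric; it is not auto-sklearn's implementation.

```python
def mse(pred, target):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def ensemble_selection(predictions, target, size):
    """Greedily pick `size` models, with replacement, to minimize validation MSE."""
    chosen = []
    running_sum = [0.0] * len(target)

    def loss_if_added(name):
        averaged = [
            (s + p) / (len(chosen) + 1)
            for s, p in zip(running_sum, predictions[name])
        ]
        return mse(averaged, target)

    for _ in range(size):
        best = min(sorted(predictions), key=loss_if_added)
        chosen.append(best)
        running_sum = [s + p for s, p in zip(running_sum, predictions[best])]
    return chosen


# Made-up validation targets and per-model validation predictions.
target = [1.0, 2.0, 3.0]
predictions = {
    "model_a": [1.2, 2.2, 3.2],  # best single model, predicts slightly too high
    "model_b": [0.7, 1.7, 2.7],  # close runner-up, predicts slightly too low
    "model_c": [2.0, 2.0, 2.0],  # poor model
}

members = ensemble_selection(predictions, target, size=3)
ensemble_pred = [
    sum(predictions[m][i] for m in members) / len(members)
    for i in range(len(target))
]
print(members, mse(ensemble_pred, target))
```

In this toy case the runner-up model errs in the opposite direction from the best model, so keeping it in the ensemble cancels part of the error instead of throwing that information away.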

auto-sklearn is extremely simple to use, as one would expect from an AutoML library. The only thing necessary is a dataset. An example using auto-sklearn for a regression task:

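import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection
import autosklearn.regression

# Written against the library versions current in 2019;
# load_boston was removed in scikit-learn 1.2.
X, y = sklearn.datasets.load_boston(return_X_y=True)
# 13 features; the fourth one (CHAS) is categorical, the rest are numerical.
feature_types = (['numerical'] * 3) + ['categorical'] + (['numerical'] * 9)
X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=1)

# Budget: 120 seconds for the whole search, at most 30 seconds per model.
automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder='/tmp/autosklearn_regression_example_tmp',
    output_folder='/tmp/autosklearn_regression_example_out',
)
# Run the search: model selection, hyperparameter optimization, ensembling.
automl.fit(X_train, y_train, dataset_name='boston',
           feat_type=feature_types)

# Inspect the final ensemble and evaluate it on the held-out test set.
print(automl.show_models())
predictions = automl.predict(X_test)
print("R2 score:", sklearn.metrics.r2_score(y_test, predictions))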

You can see the full example here, and more examples here. Using auto-sklearn is as simple as calling the appropriate classifier or regressor with a specified budget and input/output folders. With a single call to the fit method of auto-regressor, auto-sklearn finds the best performing model for the dataset and task at hand.

Open-Source Libraries for AutoML

Auto-WEKA: Based on the open-source WEKA project, Auto-WEKA is the first open source AutoML tool to be developed, dating back to 2013. Auto-WEKA 2.0 was released subsequently in 2016, adding support for regression, parallelism, and optimization for new metrics. Auto-WEKA 2.0 is available in WEKA as a package, and is quite accessible to tinkerers and developers alike.

Auto-Keras: An open source library for automated neural network learning. Trying to tackle the neural architecture search (NAS) problem, Auto-Keras utilizes network morphism and Bayesian optimization.

TPOT (Tree-based Pipeline Optimization Tool): Taking a different approach from the aforementioned tools, TPOT is an AutoML tool that uses genetic programming for its optimization procedure. Implemented upon scikit-learn, TPOT is offered as a Python library.

Limitations and Conclusion

AutoML is still an active research area, and there’s progress to be made. Current approaches include solving tasks like classification and regression, and can configure neural networks. However, AutoML solutions have their limitations:

Problems such as semi-supervised learning, unsupervised learning, and reinforcement learning are not yet tackled by the AutoML community.
AutoML algorithms rely on the data being clean and relevant. Data cleaning and feature engineering are not yet supported by any of the AutoML approaches.
AutoML jobs can take quite a long time, on the order of days, to come up with a well-performing solution, even with a warm-start step.

Machine learning and AI are becoming more accessible with each passing year. While high-level libraries like Keras hide the underlying complexity of deep learning models, AutoML approaches take one step further, and are able to provide feasible machine learning models just from a dataset as an input. This provides a smooth pathway into machine learning for non-experts. AutoML can provide production-ready models for small startups that cannot dedicate enough budget to hiring ML experts.

This does not mean that AutoML is only directed towards non-experts. Techniques used in AutoML libraries can provide powerful tools for automated optimization for developers, and the results of AutoML searches can provide valuable intuition towards model choices and hyperparameter configurations. AutoML also does not mean that there’ll be no need for machine learning experts – collecting data, ingesting data, cleaning and preprocessing, monitoring and evaluating are important parts of any ML pipeline, and require expertise.

In the end, given the aim of making AI more available to the general public, developments in AutoML constitute a huge stride in the right direction. With its recent rise as a research interest, AutoML can revolutionize the way ML is practiced.

Gokhan Simsek

Eindhoven, The Netherlands

Gokhan is a computer science graduate, currently pursuing an MSc degree in Data Science at Eindhoven University of Technology. He’s interested in big data, NLP, and machine learning.