Tuning Machine Learning Models
Tuning is the process of maximizing a model’s performance without overfitting or creating too high of a variance. In machine learning, this is accomplished by selecting appropriate “hyperparameters.”
Hyperparameters can be thought of as the “dials” or “knobs” of a machine learning model. Choosing an appropriate set of hyperparameters is crucial for model accuracy, but can be computationally challenging. Hyperparameters differ from other model parameters in that they are not learned by the model automatically through training methods. Instead, these parameters must be set manually.
Many methods exist for selecting appropriate hyperparameters. This post focuses on three:
- Grid Search
- Random Search
- Bayesian Optimization
Grid Search, also known as parameter sweeping, is one of the most basic and traditional methods of hyperparametric optimization. This method involves manually defining a subset of the hyperparametric space and exhausting all combinations of the specified hyperparameter subsets. Each combination’s performance is then evaluated, typically using cross-validation, and the best performing hyperparametric combination is chosen.
For example, say you have two continuous parameters α and β, where manually selected values for the parameters are the following:
Then the pairing of the selected hyperparametric values, H, can take on any of the following:
Grid search will examine each pairing of α and β to determine the best performing combination. The resulting pairs, H, are simply each output that results from taking the Cartesian product of α and β. While straightforward, this “brute force” approach for hyperparameter optimization has some drawbacks. Higher-dimensional hyperparametric spaces are far more time consuming to test than the simple two-dimensional problem presented here. Also, because there will always be a fixed number of training samples for any given model, the model’s predictive power will decrease as the number of dimensions increases. This is known as Hughes phenomenon.
Random search methods resemble grid search methods but tend to be less expensive and time consuming because they do not examine every possible combination of parameters. Instead of testing on a predetermined subset of hyperparameters, random search, as its name implies, randomly selects a chosen number of hyperparametric pairs from a given domain and tests only those. This greatly simplifies the analysis without significantly sacrificing optimization. For example, if the region of hyperparameters that are near optimal occupies at least 5% of the grid, then random search with 60 trials will find that region with high probability (95%).
To illustrate, imagine a 15 x 30 grid of two hyperparameter values and their resulting scores ranging from 0-10, where 10 is the most optimal hyperparametric pairing (Table 1).
Table 1 – Grid of Hyperparameter Values and Scores
Highlighted in green are the 21 pairings with the highest scores out of the 450 total combinations. Let’s take these 21 pairings to be our desired target range. What if we were to sample points from this grid to see if any lands within the target? Each random draw has a 21/450 ≈ 4.67% of doing so. If we randomly select 60 points, all independent of one another, then the probability that none of them land in the target, or in other words all of them miss, is
Therefore, the probability that at least one of them succeeds in hitting the desired interval is 1 minus that quantity.
In this particular example, sampling just 60 points from our hyperparameter space yields over a 94% chance of selecting a hyperparameter value within our desired interval near the maximum value. In other words, in a scenario with a 5% desired interval around the true maximum, sampling just 60 points will yield a sufficient hyperparameter pairing 95% of the time.
There are two main benefits to using the random search method. The first is that a budget can be chosen independent of the number of parameters and possible values. Based on how much time and computing resources you have available, random search allows you to choose a sample size that conforms to a budget but still allows for a representative sample of the hyperparameter space. The second benefit is that adding parameters that do not influence performance does not decrease efficiency.
The idea behind Bayesian Optimization is fundamentally different from grid and random search. This process builds a probabilistic model for a given function and analyzes this model to make decisions about where to next evaluate the function. There are two main components under the Bayesian optimization framework.
- A prior function that captures the behavior of the unknown objective function and an observation model that describes the data generation mechanism.
- A loss function, or an acquisition function, that describes how optimal a sequence of queries are, usually taking the form of regret.
The most common selection for a prior function in Bayesian Optimization is the Gaussian process (GP) prior. This is a particular kind of statistical model where observations occur in a continuous domain. In a Gaussian process, every point in the defined continuous input space is associated with a normally distributed random variable. Additionally, every finite linear combination of those random variables has a multivariate normal distribution.
There are a number of options when choosing an acquisition function. Choosing an acquisition function requires choosing a trade-off in exploration of the entire search space vs. exploitation of current promising areas.
Probability of Improvement
One approach is to choose an improvement-based acquisition function, which favors points that are likely to improve upon an incumbent target. This strategy involves maximizing the probability of improving (PI) over the best current value. If using a Gaussian posterior distribution, this can be calculated as follows:
In each iteration, the probability of improving is maximized to select the next query point. Although the probability of improvement can perform very well when the target is known, using this method for an unknown target causes the PI to lose reliability.
Another strategy involves the case of attempting to maximize the expected improvement (EI) over the current best. Unlike the probability of improvement function, the expected improvement also incorporates the amount of improvement. Assuming a Gaussian process, this can be calculated as follows:
Gaussian Process Upper Confidence Bound
Another method takes the idea of exploiting lower confidence bounds (upper when considering the maximization) to construct acquisition functions that minimize regret over the course of their optimization. This requires the user to define an additional tuning value, . This lower confidence bound (LCB) for a Gaussian process is defined as follows:
There are a few limitations to consider when choosing Bayesian Optimization over other hyperparameter optimization methods. The power of the Gaussian process depends highly on the covariance function, and it is not always clear what the appropriate covariance function choice should be. Another factor to consider is that the function evaluation itself may involve a time-consuming optimization procedure. It’s important to find the best hyperparameters for your model, but in many cases, the complexity associated with finding the best hyperparameters using Bayesian Optimization may exceed the project’s established budget. If possible, one should always consider utilizing parallel computing when performing this technique to maximize computing resources and cut back on time.
Choosing an appropriate set of hyperparameters is crucial for machine learning model accuracy. We have discussed three different approaches for selecting hyperparameter values and the trade-offs associated with choosing one optimization method over another. Time, budget, and computing abilities are all factors to consider when choosing a method. Small hyperparameter spaces and lax restraints for budget and computing resources may make Grid Search the best option. For larger hyperparameter spaces or more computing constraints, a simple random search with a sufficient sample size or a Bayesian optimization technique may be more appropriate.