Trianam's notes
a blog about machine learning and artificial intelligence

Hyperparameter optimization

April 13, 2018

Stefano Martina



The construction of machine learning models demands a big effort in hyperparameter tuning. Usually the hyperparameters are engineered manually, as the product of time-consuming experimentation. The choice of hyperparameters has a large impact on model performance.

We can overcome this manual intervention in two ways:

  1. automatic hyperparameter optimization;
  2. development of models with fewer hyperparameters.

Hyperparameter optimization

Problem definition

Given a machine learning algorithm $A$ with hyperparameters $\lambda_1,\dots,\lambda_n$ taking values in $\Lambda_1,\dots,\Lambda_n$, the hyperparameter space is defined as $\Lambda=\Lambda_1\times\cdots\times\Lambda_n$. For each hyperparameter setting $\vec{\lambda}\in\Lambda$, we use $A_\vec{\lambda}$ to indicate $A$ using this setting. We use $\loss(A_\vec{\lambda}, \data_{train}, \data_{valid})$ to denote the validation loss of $A_\vec{\lambda}$ on the validation set $\data_{valid}$ when it is trained on the training set $\data_{train}$.

Using $k$-fold cross validation we divide the dataset into $k$ splits $\data_{train}^{(1)},\dots,\data_{train}^{(k)}$ and $\data_{valid}^{(1)},\dots,\data_{valid}^{(k)}$. The hyperparameter optimization problem is:

$$\vec{\lambda}^*\in\operatorname*{argmin}_{\vec{\lambda}\in\Lambda} f(\vec{\lambda}), \label{eq:problem}$$

where $f$ is the loss:

$$f(\vec{\lambda})=\frac{1}{k}\sum_{i=1}^{k}\loss(A_\vec{\lambda}, \data_{train}^{(i)}, \data_{valid}^{(i)}). \label{eq:loss}$$
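As a concrete illustration, here is a minimal Python sketch of the cross-validated loss $f$. It assumes a scikit-learn-style estimator class $A$ whose constructor accepts the hyperparameters as keyword arguments; the $1-\text{accuracy}$ loss is an illustrative stand-in, not part of the formulation above.

    import numpy as np
    from sklearn.model_selection import KFold

    def f(lam, A, X, y, k=5):
        """Cross-validated loss f(lambda): mean validation loss of A_lambda
        over the k train/validation splits."""
        losses = []
        for train_idx, valid_idx in KFold(n_splits=k).split(X):
            model = A(**lam)                  # instantiate A with setting lambda
            model.fit(X[train_idx], y[train_idx])
            # validation loss; here 1 - accuracy as a generic stand-in
            losses.append(1.0 - model.score(X[valid_idx], y[valid_idx]))
        return float(np.mean(losses))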

Hyperparameters can be continuous, integer or categorical. A hyperparameter $\lambda_i$ is conditional on another hyperparameter $\lambda_j$ if $\lambda_i$ is active only when $\lambda_j$ takes values in a specific set $V_i(j)\subset\Lambda_j$. For example, the degree of an SVM's polynomial kernel is active only when the kernel hyperparameter takes the value poly.

Review

Historically, the standard approach to hyperparameter optimization was grid search. This method consists of:

  1. define a grid $V\subset\Lambda$ on the hyperparameter space, with respect to certain ranges;
  2. evaluate \eqref{eq:loss} for all $\vec{\lambda}\in V$;
  3. take the $\vec{\lambda}$ with the smallest loss value.

A more advanced approach consists in starting with a coarse grid and then refining it with a finer grid around the best result, as in the sketch below.
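The basic grid search of steps 1-3 fits in a few lines of Python; the hyperparameter names in the usage example are illustrative assumptions:

    import itertools

    # Exhaustive grid search over the Cartesian product V of the
    # per-hyperparameter value lists; f is the loss defined above.
    def grid_search(ranges, f):
        names = list(ranges)
        best_lam, best_loss = None, float("inf")
        for values in itertools.product(*(ranges[n] for n in names)):
            lam = dict(zip(names, values))    # one setting of the grid V
            loss = f(lam)
            if loss < best_loss:
                best_lam, best_loss = lam, loss
        return best_lam, best_loss

    # e.g.: grid_search({"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2, 1e-1]}, f)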

Other approaches include:

  • gradient search, when some differentiability and continuity conditions of the training criterion are satisfied [Bengio2000];
  • evolutionary computation techniques [Guo2006];
  • racing algorithms [Maron1994].

It has been shown that random search is more effective than grid search on models with low effective dimensionality [Bergstra2012].

Fig 1: Grid search versus random search in $\Lambda$ with two dimensions.

This improvement is visible in Fig 1, where an example two-dimensional parameter space is shown. The green area identifies an important parameter, the yellow one an unimportant one; thus the effective dimensionality of the example is closer to one than to two. In both the grid and random search settings we use a cardinality of nine for $V\subset\Lambda$, but in the grid layout we test the important parameter on only three values instead of nine.
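A minimal random search sketch, assuming each hyperparameter comes with a sampling function (the names and log-uniform distributions in the usage example are illustrative assumptions):

    import random

    # Random search: draw n_trials settings from the given
    # per-hyperparameter sampling functions and keep the best one.
    def random_search(space, f, n_trials=9, seed=0):
        rng = random.Random(seed)
        best_lam, best_loss = None, float("inf")
        for _ in range(n_trials):
            lam = {name: sample(rng) for name, sample in space.items()}
            loss = f(lam)
            if loss < best_loss:
                best_lam, best_loss = lam, loss
        return best_lam, best_loss

    # Nine trials test nine distinct values of every hyperparameter, while a
    # 3x3 grid tests only three values of the important one:
    # random_search({"C": lambda r: 10 ** r.uniform(-1, 1),
    #                "gamma": lambda r: 10 ** r.uniform(-3, -1)}, f)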

Recently, more sophisticated Sequential Model-based Bayesian Optimization (SMBO) [Hutter2011] methods have been developed. These methods give better results on \eqref{eq:problem} and are capable of optimizing when $\Lambda$ has far more dimensions than the other methods can handle.

Sequential Model-based Bayesian Optimization (SMBO)

SMBO is a generic framework for optimizing black-box functions. It uses a probabilistic model $\model$ of $f$, built from point evaluations and any prior knowledge. It uses $\model$ to select promising inputs to evaluate, and updates $\model$ with the results.
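Here is a minimal Python sketch of this generic loop (the scheme that Fig 2 below presents in pseudocode). The model-fitting and acquisition-maximisation steps are passed in as callables, since they depend on the concrete method chosen; all names are illustrative:

    def smbo(f, init_lams, n_iter, fit_model, maximise_acquisition):
        """Generic SMBO loop: fit a probabilistic model M of f on the
        evaluation history, pick the candidate with the highest acquisition
        value, evaluate f there, and repeat."""
        history = [(lam, f(lam)) for lam in init_lams]    # initial design
        for _ in range(n_iter):
            M = fit_model(history)                 # probabilistic model of f
            lam = maximise_acquisition(M, history) # candidate selection
            history.append((lam, f(lam)))          # evaluate f and update
        return min(history, key=lambda h: h[1])    # best setting found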

Fig 2: SMBO algorithm.

Fig 2 shows the algorithm that summarizes the optimization. In order to determine the candidate $\vec{\lambda}$ in line 3, SMBO uses an acquisition function $a_\model:\Lambda\rightarrow\mR$. The purpose of this function is to quantify the usefulness of choosing a configuration $\vec{\lambda}\in\Lambda$. The objective is to maximise it:

$$\vec{\lambda}^{new}\in\operatorname*{argmax}_{\vec{\lambda}\in\Lambda} a_\model(\vec{\lambda}).$$
The function $a_\model$ must be selected according to the trade-off between exploitation (concentrating on regions where performance is good) and exploration (trying unexplored regions) of the hyperparameter space $\Lambda$. The most used function is the Expected Improvement (EI):

$$a_{EI}(\vec{\lambda})=\int_{-\infty}^{f_{min}}\max\{f_{min}-f,\,0\}\; p_\model(f\mid\vec{\lambda})\,df, \label{eq:ei}$$
which at each value $\vec{\lambda}$ integrates the improvements over $f_{min}$ according to the posterior of the model $\model$.

Using a Gaussian posterior with mean $\mu_\vec{\lambda}$ and variance $\sigma_\vec{\lambda}^2$, \eqref{eq:ei} simplifies to:

$$a_{EI}(\vec{\lambda})=\sigma_\vec{\lambda}\left[u\,\Phi(u)+\phi(u)\right],$$
where $u=\frac{f_{min}-\mu_\vec{\lambda}}{\sigma_\vec{\lambda}}$ and $\phi$ and $\Phi$ are the PDF¹ and CDF² of the standard normal distribution.
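This closed form is straightforward to compute; a minimal Python sketch using scipy (the function name is illustrative):

    from scipy.stats import norm

    # Closed-form EI under a Gaussian posterior N(mu, sigma^2) at a given
    # configuration, for minimization of f.
    def expected_improvement(mu, sigma, f_min):
        if sigma == 0.0:                      # degenerate posterior
            return max(f_min - mu, 0.0)
        u = (f_min - mu) / sigma
        return sigma * (u * norm.cdf(u) + norm.pdf(u))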

The following Bayesian optimization packages have been used in the literature:

  • Sequential Model-based Algorithm Configuration (SMAC) [Hutter2011] uses random forests to model $p_\model(f\mid\vec{\lambda})$ as a Gaussian distribution whose mean and variance are the empirical mean and variance of the predictions of the forest's trees.
  • Spearmint [Snoek2012] uses a Gaussian Process to model $p_\model(f\mid\vec{\lambda})$.
  • Tree Parzen Estimator (TPE) [Bergstra2011] models $p(\vec{\lambda}\mid f)$ and $p(f)$ instead of $p(f\mid\vec{\lambda})$. That is:

    $$p(\vec{\lambda}\mid f)=\begin{cases}l(\vec{\lambda}) & \text{if } f<f^*\\ g(\vec{\lambda}) & \text{if } f\geq f^*,\end{cases}$$

    where $l(\vec{\lambda})$ is the PDF formed by using the observations such that $f<f^*$, and $g(\vec{\lambda})$ the one formed by using the remaining observations. The threshold $f^*$ is chosen to be a quantile $\gamma$ of the observed $f$ values, such that $p(f<f^*)=\gamma$; the candidate maximising the ratio $l(\vec{\lambda})/g(\vec{\lambda})$ also maximises EI. See the sketch below.
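A minimal sketch of the TPE split step only, assuming losses are stored alongside their configurations; fitting the Parzen estimators $l$ and $g$ and maximising their ratio are omitted:

    import numpy as np

    # Split the observed configurations at the gamma-quantile f* of the
    # losses: the "good" ones (f < f*) feed the density l, the rest feed g.
    def tpe_split(lams, fs, gamma=0.15):
        f_star = np.quantile(fs, gamma)
        good = [lam for lam, f in zip(lams, fs) if f < f_star]
        bad = [lam for lam, f in zip(lams, fs) if f >= f_star]
        return good, bad, f_star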

Spearmint works best for problems with few continuous hyperparameters, since Gaussian Processes scale poorly as dimensionality grows.



References

[Hutter2015] Hutter F., Lücke J., Schmidt-Thieme L. (2015). Beyond Manual Tuning of Hyperparameters. KI - Künstliche Intelligenz, Volume 29, Issue 4, pp. 329-337. DOI
[Bengio2000] Bengio Y. (2000). Gradient-Based Optimization of Hyperparameters. Neural Computation, Volume 12, Issue 8, pp. 1889-1900. DOI
[Guo2006] Guo X.C., Liang Y.C., Wu C.G., Wang C.Y. (2006). PSO-Based Hyper-Parameters Selection for LS-SVM Classifiers. Neural Information Processing. ICONIP 2006. DOI
[Maron1994] Maron O., Moore A.W. (1994). Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation. Advances in Neural Information Processing Systems 6, pp. 59-66. link
[Bergstra2012] Bergstra J., Bengio Y. (2012). Random Search for Hyper-Parameter Optimization. The Journal of Machine Learning Research, Volume 13, Issue 1, pp. 281-305. link
[Hutter2011] Hutter F., Hoos H.H., Leyton-Brown K. (2011). Sequential Model-Based Optimization for General Algorithm Configuration. Learning and Intelligent Optimization. LION 2011. DOI
[Snoek2012] Snoek J., Larochelle H., Adams R.P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. NIPS'12 Proceedings of the 25th International Conference on Neural Information Processing Systems, Volume 2, pp. 2951-2959. link
[Bergstra2011] Bergstra J., Bardenet R., Bengio Y., Kégl B. (2011). Algorithms for Hyper-Parameter Optimization. NIPS'11 Proceedings of the 24th International Conference on Neural Information Processing Systems, pp. 2546-2554. link



Footnotes

  1. Probability Density Function ↩︎

  2. Cumulative Distribution Function ↩︎