A glimpse of Bayesian Optimisation



Jevgenij Gamper
TIA Lab Meeting 10/11/2017

Slides, with all the links and useful resources, are available at jgamper.github.io/GettingFeetWetInBayesianOptimisation

Video lectures, in order of decreasing awesomeness:

Bayesian Optimisation Software (Python)

  • scikit-optimize
    • Most straightforward to use, see the third video above
    • Allows custom regressors (in addition to GPs) as surrogate models of the underlying function: Random Forests, Neural Networks, etc. (a minimal usage sketch follows this list)
  • GPflowOpt
    • Built on top of GPflow [which was inspired by GPy(Sheffield)]; built on top of TensorFlow
    • Implements sparse GP approximations, hence scalable
    • More customisation: kernel design
    • Early stages; only continuous parameters (categorical ones can easily be added via a Hamming distance kernel)
  • GPyOpt
    • Lots of examples
    • More customisation
    • Supports discrete and categorical parameters
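
A minimal usage sketch for scikit-optimize, assuming it is installed and imported as `skopt`; the toy objective and search space below are illustrative placeholders, not from the slides:

```python
# Minimal scikit-optimize sketch: optimise a toy 1-D black-box objective.
from skopt import gp_minimize

def objective(x):
    # stand-in for an expensive, noisy black-box evaluation
    return (x[0] - 2.0) ** 2

result = gp_minimize(
    objective,                  # black-box function to minimise
    dimensions=[(-5.0, 5.0)],   # search space: one continuous parameter
    acq_func="EI",              # acquisition function: Expected Improvement
    n_calls=20,                 # total evaluation budget
    random_state=0,
)
print(result.x, result.fun)     # best input found and its objective value
```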

Interesting literature:

"Civilisation advances by the number of important operations which we can perform without thinking of them" (Alfred North Whitehead)

We are interested in smarter automation!

  • Automatic model configuration
  • Automate the design of physical experiment

Motivating examples

Baltz et al., “Achievement of Sustained Net Plasma Heating in a Fusion Experiment with the Optometrist Algorithm.”

Motivating examples

Snoek, Larochelle, and Adams, “Practical Bayesian Optimization of Machine Learning Algorithms.”

$$ x^{*} = \underset{x}{\arg\max} \ \mathcal{f}(x)$$


  • $\mathcal{f}$ is a black box function (no closed form, no gradients)
  • $\mathcal{f}$ is expensive to evaluate
  • most likely we only have access to noisy observations

$$ y = \mathcal{f}(x) + \epsilon$$
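
For concreteness, a tiny sketch of what a noisy black-box evaluation looks like in code; the "true" function and the noise level are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # hidden "true" function: no closed form available to the optimiser
    return np.sin(3 * x) + 0.1 * x ** 2

def evaluate(x, noise_std=0.1):
    # all we ever observe: y = f(x) + epsilon
    return f(x) + rng.normal(0.0, noise_std)
```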

How do people usually optimise?


  • Grid search
  • Random Search
  • Grad student descent
  • Other black magic, and heuristics

What do we do when we have no access to the underlying function?

We try to fit a model to it! Regression!

Gaussian Process (or other regressor) as surrogate model to $\mathcal{f}$

$p(\begin{bmatrix} \mathbf{y} \\ \mathbf{f}^{*} \end{bmatrix} | \sigma^{2}) = \mathcal{N}(0, \begin{bmatrix} \mathbf{C} + \sigma^{2}\mathbf{I} & \mathbf{R}\\\mathbf{R}^{T} & \mathbf{C}^{*}\end{bmatrix}),$

$p(\mathbf{f}^{*} | \mathbf{y}, \sigma^{2}) = \mathcal{N}(\mathbf{\mu}^{*}, \mathbf{\Sigma}^{*}),$

$\mathbf{\mu}^{*} = \mathbf{R}^{T} (\mathbf{C} + \sigma^{2}\mathbf{I})^{-1} \mathbf{y}$

$\mathbf{\Sigma}^{*} = \mathbf{C}^{*} -\mathbf{R}^{T} (\mathbf{C} + \sigma^{2}\mathbf{I})^{-1} \mathbf{R}$
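
A NumPy sketch of the posterior equations above, assuming a squared-exponential kernel on 1-D inputs (the kernel choice, lengthscale, and noise level are illustrative):

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    # squared-exponential kernel on 1-D inputs
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=0.01):
    C = rbf(x_train, x_train)                 # C: covariance of training inputs
    R = rbf(x_train, x_test)                  # R: train/test cross-covariance
    C_star = rbf(x_test, x_test)              # C*: covariance of test inputs
    K_inv = np.linalg.inv(C + noise_var * np.eye(len(x_train)))
    mu_star = R.T @ K_inv @ y_train           # posterior mean  mu*
    Sigma_star = C_star - R.T @ K_inv @ R     # posterior covariance  Sigma*
    return mu_star, Sigma_star
```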

Okay. Regression, so what?

Bold move - switch to GPSS slides!

  1. Use surrogate model of $\mathcal{f}$ to carry out the optimisation

  2. Define a utility/loss/acquisition function $\mathcal{u}(.)$ to collect new data points satisfying some optimality criterion: optimisation as decision

  3. Study the decision problem as inference using the surrogate model

Uncertainty quantification: Making informed decisions

Utility function

Utility should represent our study design goal:

  1. Active Learning and experimental design: reduce the uncertainty in the model

  2. Optimisation: minimise the cumulative loss over a sequence of evaluations $x_{1}, \dots, x_{N}$:

$$r_{N} = \sum_{n=1}^{N} \mathcal{f}(x_{n}) - N\mathcal{f}(x_{M}),$$

where $x_{M}$ is the location of the true minimum of $\mathcal{f}$

(1) does a lot of exploration, whereas (2) encourages exploitation around the minimum of the surrogate model of $\mathcal{f}$

GP upper (lower) confidence bound

Direct balance between exploration and exploitation:

$$\alpha_{LCB}(\mathbf{x}; \theta, \mathcal{D}) = - \mu(\mathbf{x}; \theta, \mathcal{D}) + \beta_{t} \sigma(\mathbf{x}; \theta, \mathcal{D})$$
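
A direct transcription of the LCB acquisition above; here $\mu$ and $\sigma$ would come from the surrogate's posterior (e.g. the `gp_posterior` sketch earlier), and $\beta_{t}$ is a user-chosen trade-off:

```python
import numpy as np

def lcb_acquisition(mu, sigma, beta=2.0):
    # low predicted mean (exploitation) plus an exploration bonus
    # proportional to the predictive standard deviation
    return -mu + beta * sigma

# the next point to evaluate is the maximiser of the acquisition:
# x_next = candidates[np.argmax(lcb_acquisition(mu, sigma))]
```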

Bayesian Optimisation - methodology to perform global optimisation of multi-modal black box functions

  1. Choose some prior over the space of possible functions $\mathcal{f}$

  2. Combine prior and the likelihood to get posterior

  3. Use the posterior to decide where to take the next sample according to some acquisition function

  4. Augment the data

Iterate over steps 2 to 4 until the budget is exhausted (a minimal loop sketch follows below)
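
Putting steps 1 to 4 together, a minimal loop sketch; it reuses the hypothetical `evaluate`, `gp_posterior`, and `lcb_acquisition` sketches from earlier slides, follows the minimisation convention of the LCB acquisition, and all settings are illustrative:

```python
import numpy as np

# step 0: a few initial observations over a 1-D candidate grid
candidates = np.linspace(-5.0, 5.0, 200)
x_obs = np.array([-4.0, 0.0, 4.0])
y_obs = np.array([evaluate(x) for x in x_obs])

for _ in range(20):                                    # evaluation budget
    # steps 1-2: condition the surrogate on the data seen so far
    mu, Sigma = gp_posterior(x_obs, y_obs, candidates)
    sigma = np.sqrt(np.clip(np.diag(Sigma), 0.0, None))
    # step 3: pick the next point via the acquisition function
    x_next = candidates[np.argmax(lcb_acquisition(mu, sigma))]
    # step 4: augment the data with the new (noisy) observation
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, evaluate(x_next))

print(x_obs[np.argmin(y_obs)])                         # best point found
```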

Acquisition function $u(x)$ guides the optimisation by determining which $x_{t+1}$ to observe next

  • Exploration: prefer high variance regions
  • Exploitation: prefer high mean regions

Is there a free lunch though?

Choice of $u(x)$ heavily affects optimisation results

Scalability in number of parameters

Up to about 10 parameters is okay.

But there are solutions!

  • Use ensemble of acquisition functions (see resources)
  • Use a random forest regressor (Fusion example; see the sketch below)
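
As one concrete option for the random-forest route, scikit-optimize exposes a forest-based surrogate via `forest_minimize`; this sketch reuses the toy `objective` from the earlier scikit-optimize example, and the search space is again a placeholder:

```python
# Random-forest surrogate instead of a GP: scales better with dimension
# and handles categorical parameters naturally.
from skopt import forest_minimize

result = forest_minimize(
    objective,                  # same toy black-box objective as before
    dimensions=[(-5.0, 5.0)],   # search space
    n_calls=30,                 # evaluation budget
    random_state=0,
)
```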

Swersky, Snoek, and Adams, “Freeze-Thaw Bayesian Optimization.”


Snoek, Larochelle, and Adams, “Practical Bayesian Optimization of Machine Learning Algorithms.”

  • Multi-task learning
  • Expected Improvement per second