A glimpse of Bayesian Optimisation



Jevgenij Gamper
TIA Lab Meeting 10/11/2017

Slides, with all the links and useful resources, are available at jgamper.github.io/GettingFeetWetInBayesianOptimisation

Video lectures, in order of decreasing awesomeness:

Bayesian Optimisation Software (Python)

  • scikit-optimize
    • Most straightforward to use, see the third video above
    • Allows custom regressors (in addition to GPs) as surrogate models of the underlying function: Random Forests, Neural Networks, etc. (a minimal usage sketch follows this list)
  • GPflowOpt
    • Built on top of GPflow [which was inspired by GPy(Sheffield)]; built on top of TensorFlow
    • Implements sparse GP approximations, hence scalable
    • More customisation: kernel design
    • Early stages; only continuous parameters (categorical ones can easily be added via a Hamming distance kernel)
  • GPyOpt
    • Lots of examples
    • More customisation
    • Supports discrete and categorical parameters
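
A minimal usage sketch for scikit-optimize, assuming it is installed and imported as `skopt`; the toy objective and search space below are illustrative placeholders, not from the slides:

```python
# Minimal scikit-optimize sketch: optimise a toy 1-D black-box objective.
from skopt import gp_minimize

def objective(x):
    # stand-in for an expensive, noisy black-box evaluation
    return (x[0] - 2.0) ** 2

result = gp_minimize(
    objective,                  # black-box function to minimise
    dimensions=[(-5.0, 5.0)],   # search space: one continuous parameter
    acq_func="EI",              # acquisition function: Expected Improvement
    n_calls=20,                 # total evaluation budget
    random_state=0,
)
print(result.x, result.fun)     # best input found and its objective value
```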

Interesting literature:

"Civilisation advances by the number of important operations which we can perform without thinking of them" (Alfred North Whitehead)

We are interested in smarter automation!

  • Automatic model configuration
  • Automate the design of physical experiment

Motivating examples

Baltz et al., “Achievement of Sustained Net Plasma Heating in a Fusion Experiment with the Optometrist Algorithm.”

Motivating examples

Snoek, Larochelle, and Adams, “Practical Bayesian Optimization of Machine Learning Algorithms.”

$$ x^{*} = \underset{x}{\arg\max} \ \mathcal{f}(x)$$


  • $\mathcal{f}$ is a black box function (no closed form, no gradients)
  • $\mathcal{f}$ is expensive to evaluate
  • most likely we only have access to noisy observations

$$ y = \mathcal{f}(x) + \epsilon$$
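
For concreteness, a tiny sketch of what a noisy black-box evaluation looks like in code; the "true" function and the noise level are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # hidden "true" function: no closed form available to the optimiser
    return np.sin(3 * x) + 0.1 * x ** 2

def evaluate(x, noise_std=0.1):
    # all we ever observe: y = f(x) + epsilon
    return f(x) + rng.normal(0.0, noise_std)
```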

How do people usually optimise?


  • Grid search
  • Random Search
  • Grad student descent
  • Other black magic, and heuristics

What do we do when we have no access to the underlying function?

We try to fit a model to it! Regression!

Gaussian Process (or other regressor) as surrogate model to $\mathcal{f}$

$p(\begin{bmatrix} \mathbf{y} \\ \mathbf{f}^{*} \end{bmatrix} | \sigma^{2}) = \mathcal{N}(0, \begin{bmatrix} \mathbf{C} + \sigma^{2}\mathbf{I} & \mathbf{R}\\\mathbf{R}^{T} & \mathbf{C}^{*}\end{bmatrix}),$

$p(\mathbf{f}^{*} | \mathbf{y}, \sigma^{2}) = \mathcal{N}(\mathbf{\mu}^{*}, \mathbf{\Sigma}^{*}),$

$\mathbf{\mu}^{*} = \mathbf{R}^{T} (\mathbf{C} + \sigma^{2}\mathbf{I})^{-1} \mathbf{y}$

$\mathbf{\Sigma}^{*} = \mathbf{C}^{*} -\mathbf{R}^{T} (\mathbf{C} + \sigma^{2}\mathbf{I})^{-1} \mathbf{R}$
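
A NumPy sketch of the posterior equations above, assuming a squared-exponential kernel on 1-D inputs (the kernel choice, lengthscale, and noise level are illustrative):

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    # squared-exponential kernel on 1-D inputs
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=0.01):
    C = rbf(x_train, x_train)                 # C: covariance of training inputs
    R = rbf(x_train, x_test)                  # R: train/test cross-covariance
    C_star = rbf(x_test, x_test)              # C*: covariance of test inputs
    K_inv = np.linalg.inv(C + noise_var * np.eye(len(x_train)))
    mu_star = R.T @ K_inv @ y_train           # posterior mean  mu*
    Sigma_star = C_star - R.T @ K_inv @ R     # posterior covariance  Sigma*
    return mu_star, Sigma_star
```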

Okay. Regression, so what?

Bold move - switch to GPSS slides!

  1. Use surrogate model of $\mathcal{f}$ to carry out the optimisation

  2. Define a utility/loss/acquisition function $\mathcal{u}(.)$ to collect new data points satisfying some optimality criterion: optimisation as decision

  3. Study the decision problem as inference using the surrogate model

Uncertainty quantification: Making informed decisions

Utility function

Utility should represent our study design goal:

  1. Active Learning and experimental design: reduce the uncertainty in the model

  2. Optimisation: minimise the cumulative loss over a sequence of evaluations $x_{1}, \dots, x_{N}$:

$$r_{N} = \sum_{n=1}^{N} \mathcal{f}(x_{n}) - N\mathcal{f}(x_{M}),$$

where $x_{M}$ is the location of the true minimum of $\mathcal{f}$

(1) does a lot of exploration, whereas (2) encourages exploitation around the minimum of the surrogate model of $\mathcal{f}$

GP upper (lower) confidence bound

Direct balance between exploration and exploitation:

$$\alpha_{LCB}(\mathbf{x}; \theta, \mathcal{D}) = - \mu(\mathbf{x}; \theta, \mathcal{D}) + \beta_{t} \sigma(\mathbf{x}; \theta, \mathcal{D})$$
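
A direct transcription of the LCB acquisition above; here $\mu$ and $\sigma$ would come from the surrogate's posterior (e.g. the `gp_posterior` sketch earlier), and $\beta_{t}$ is a user-chosen trade-off:

```python
import numpy as np

def lcb_acquisition(mu, sigma, beta=2.0):
    # low predicted mean (exploitation) plus an exploration bonus
    # proportional to the predictive standard deviation
    return -mu + beta * sigma

# the next point to evaluate is the maximiser of the acquisition:
# x_next = candidates[np.argmax(lcb_acquisition(mu, sigma))]
```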

Bayesian Optimisation - methodology to perform global optimisation of multi-modal black box functions

  1. Choose some prior over the space of possible functions $\mathcal{f}$

  2. Combine prior and the likelihood to get posterior

  3. Use the posterior to decide where to take the next sample according to some acquisition function

  4. Augment the data

Iterate over steps 2 to 4 until the budget is exhausted (a minimal loop sketch follows below)
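
Putting steps 1 to 4 together, a minimal loop sketch; it reuses the hypothetical `evaluate`, `gp_posterior`, and `lcb_acquisition` sketches from earlier slides, follows the minimisation convention of the LCB acquisition, and all settings are illustrative:

```python
import numpy as np

# step 0: a few initial observations over a 1-D candidate grid
candidates = np.linspace(-5.0, 5.0, 200)
x_obs = np.array([-4.0, 0.0, 4.0])
y_obs = np.array([evaluate(x) for x in x_obs])

for _ in range(20):                                    # evaluation budget
    # steps 1-2: condition the surrogate on the data seen so far
    mu, Sigma = gp_posterior(x_obs, y_obs, candidates)
    sigma = np.sqrt(np.clip(np.diag(Sigma), 0.0, None))
    # step 3: pick the next point via the acquisition function
    x_next = candidates[np.argmax(lcb_acquisition(mu, sigma))]
    # step 4: augment the data with the new (noisy) observation
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, evaluate(x_next))

print(x_obs[np.argmin(y_obs)])                         # best point found
```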

Acquisition function $u(x)$ guides the optimisation by determining which $x_{t+1}$ to observe next

  • Exploration: prefer high variance regions
  • Exploitation: prefer high mean regions

Is there a free lunch though?

Choice of $u(x)$ heavily affects optimisation results

Scalability in number of parameters

Up to about 10 parameters is okay.

But there are solutions!

  • Use ensemble of acquisition functions (see resources)
  • Use a random forest regressor (Fusion example; see the sketch below)
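
As one concrete option for the random-forest route, scikit-optimize exposes a forest-based surrogate via `forest_minimize`; this sketch reuses the toy `objective` from the earlier scikit-optimize example, and the search space is again a placeholder:

```python
# Random-forest surrogate instead of a GP: scales better with dimension
# and handles categorical parameters naturally.
from skopt import forest_minimize

result = forest_minimize(
    objective,                  # same toy black-box objective as before
    dimensions=[(-5.0, 5.0)],   # search space
    n_calls=30,                 # evaluation budget
    random_state=0,
)
```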

Swersky, Snoek, and Adams, “Freeze-Thaw Bayesian Optimization.”


Snoek, Larochelle, and Adams, “Practical Bayesian Optimization of Machine Learning Algorithms.”

  • Multi-task learning
  • Expected Improvement per second