## Practical Bayesian optimization of machine learning algorithms (2012)

Citations: 117 (15 self)

### Citations

251 | Learning multiple layers of features from tiny images
- Krizhevsky
- 2009
Citation Context: ...ly exploring model architectures, numerous hyperparameters, such as regularisation parameters, remain. In this empirical analysis, we tune nine hyperparameters of a three-layer convolutional network [22] on the CIFAR-10 benchmark dataset using the code provided. This model has been carefully tuned by a human expert [22] to achieve a highly competitive result of 18% test error on the unaugmented da...

228 | A taxonomy of global optimization methods based on response surfaces
- Jones
- 2001
Citation Context: ...blems. A good choice is Bayesian optimization [1], which has been shown to outperform other state-of-the-art global optimization algorithms on a number of challenging optimization benchmark functions [2]. For continuous functions, Bayesian optimization typically works by assuming the unknown function was sampled from a Gaussian process and maintains a posterior distribution for this function as obser...

209 | Learning structural SVMs with latent variables. ICML
- Yu, Joachims
- 2009
Citation Context: ...Structured Support Vector Machines In this example, we consider optimizing the learning parameters of Max-Margin Min-Entropy (M3E) Models [18], which include Latent Structured Support Vector Machines [19] as a special case. Latent structured SVMs outperform SVMs on problems where they can explicitly model problem-dependent hidden variables. A popular example task is the binary classification of protei...

193 | Online learning for latent Dirichlet allocation
- Hoffman, Blei, et al.
- 2010
Citation Context: ...cted graphical model for documents in which words are generated from a mixture of multinomial “topic” distributions. Variational Bayes is a popular paradigm for learning and, recently, Hoffman et al. [17] proposed an online learning approach in that context. Online LDA requires 2 learning parameters, τ0 and κ, that control the learning rate ρt = (τ0 + t)^(−κ) used to update the variational parameters of...
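The learning-rate schedule quoted in this excerpt is straightforward to reproduce. A minimal sketch (the values of τ0 and κ below are illustrative only; the paper treats them as hyperparameters to be tuned):

```python
def online_lda_learning_rate(t, tau0, kappa):
    """Step size rho_t = (tau0 + t) ** (-kappa) used for the online variational updates."""
    return (tau0 + t) ** (-kappa)

# Illustrative values only; Snoek et al. tune tau0 and kappa by Bayesian optimization.
rates = [online_lda_learning_rate(t, tau0=1.0, kappa=0.6) for t in range(5)]
```

For κ ∈ (0.5, 1] this satisfies the Robbins-Monro conditions, which is why the excerpt calls ρt a learning rate rather than a fixed step size.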

190 | Bayesian calibration of computer models
- Kennedy, O’Hagan
- 2001
Citation Context: ...el. Gaussian processes have proven to be useful surrogate models for computer experiments, and good practices have been established in this context for sensitivity analysis, calibration and prediction [6]. While these strategies are not considered in the context of optimization, they can be useful to researchers in machine learning who wish to understand better the sensitivity of their models to vario...

144 | Multi-column deep neural networks for image classification
- Ciresan, Meier, et al.
Citation Context: ...d translations, similarly improving on the expert from 11% to 9.5% test error. To our knowledge this is the lowest error reported, compared to the 11% state of the art and a recently published 11.21% [24] using similar methods, on the competitive CIFAR-10 benchmark. 5 Conclusion We presented methods for performing Bayesian optimization for hyperparameter selection of general machine learning algorithm...

141 | Multi-task Gaussian process prediction
- Bonilla, Chai, et al.
- 2008
Citation Context: ...el ln c(x) alongside f(x). In this work, we assume that these functions are independent of each other, although their coupling may be usefully captured using GP variants of multi-task learning (e.g., [14, 15]). Under the independence assumption, we can easily compute the predicted expected inverse duration and use it to compute the expected improvement per second as a function of x. 3.3 Monte Carlo Acquis...
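Under the independence assumption described in this excerpt, the per-second acquisition is the product of expected improvement and the predicted inverse duration. A hedged sketch (function names are illustrative; it assumes the GP on log cost yields a lognormal predictive distribution for c(x), so E[1/c] has a closed form):

```python
import math

def expected_inverse_duration(mean_log_cost, var_log_cost):
    """If ln c(x) ~ N(m, s^2) under a GP on log duration, then E[1/c(x)] = exp(-m + s^2/2)."""
    return math.exp(-mean_log_cost + var_log_cost / 2.0)

def ei_per_second(ei_value, mean_log_cost, var_log_cost):
    """Expected improvement per second, assuming f(x) and c(x) are modeled independently."""
    return ei_value * expected_inverse_duration(mean_log_cost, var_log_cost)
```

The effect is to steer the search toward hyperparameter settings that are cheap to evaluate when their expected improvement is comparable.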

117 | Gaussian process optimization in the bandit setting: No regret and experimental design
- Srinivas, Krause, et al.
- 2010
Citation Context: ...s are observed. To pick the hyperparameters of the next experiment, one can optimize the expected improvement (EI) [1] over the current best result or the Gaussian process upper confidence bound (UCB) [3]. EI and UCB have been shown to be efficient in the number of function evaluations required to find the global optimum of many multimodal black-box functions [4, 3]. Machine learning algorithms, how...
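Both acquisition functions named in this excerpt have simple closed forms given the GP posterior mean µ(x) and standard deviation σ(x). A minimal sketch for a minimization problem (the confidence-bound weight `beta` is an illustrative choice, not a value from the paper):

```python
import math

def _phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for minimization: sigma * (gamma * Phi(gamma) + phi(gamma))."""
    if sigma <= 0.0:
        return 0.0
    gamma = (f_best - mu) / sigma
    return sigma * (gamma * _Phi(gamma) + _phi(gamma))

def lower_confidence_bound(mu, sigma, beta=2.0):
    """Minimization analogue of GP-UCB: pick x minimizing mu - sqrt(beta) * sigma."""
    return mu - math.sqrt(beta) * sigma
```

EI trades off improvement magnitude against its probability; the confidence bound instead adds an explicit exploration bonus scaled by the posterior uncertainty.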

99 | Sequential model-based optimization for general algorithm configuration
- Hutter, Hoos, et al.
- 2011
Citation Context: ...onsidered in the context of optimization, they can be useful to researchers in machine learning who wish to understand better the sensitivity of their models to various hyperparameters. Hutter et al. [7] have developed sequential model-based optimization strategies for the configuration of satisfiability and mixed integer programming solvers using random forests. The machine learning algorithms we co...

98 | Algorithm 659: Implementing Sobol's quasirandom sequence generator
- Bratley, Fox
- 1988
Citation Context: ...imum of the multimodal acquisition function a(x ; {xn, yn}, θ) in a continuous domain, first discrete candidate points are densely sampled in the unit hypercube using a low-discrepancy Sobol sequence [2]. Each of these candidates is then subjected to a bounded optimization over the integrated acquisition function. Precisely, the minimum of the acquisition function, averaged over GP hyperparameter sam...
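The candidate-generation step described here needs a dense low-discrepancy sample of the unit hypercube. Sobol generators are normally taken from a library (e.g. `scipy.stats.qmc.Sobol`); as a self-contained stand-in, the sketch below uses a Halton sequence, a related low-discrepancy construction that illustrates the same idea:

```python
def _van_der_corput(n, base):
    """Radical inverse of the positive integer n in the given base; result lies in [0, 1)."""
    q, denom = 0.0, 1.0
    while n:
        n, rem = divmod(n, base)
        denom *= base
        q += rem / denom
    return q

def halton_candidates(num_points, dims):
    """Dense, evenly spread candidate points in the unit hypercube [0, 1)^dims."""
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
    if dims > len(primes):
        raise ValueError("add more prime bases for higher dimensions")
    return [[_van_der_corput(i, b) for b in primes[:dims]]
            for i in range(1, num_points + 1)]
```

Each candidate would then seed a bounded local optimization of the integrated acquisition function, as the excerpt describes.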

73 | Semiparametric latent factor models
- Teh, Seeger, et al.
- 2005
Citation Context: ...el ln c(x) alongside f(x). In this work, we assume that these functions are independent of each other, although their coupling may be usefully captured using GP variants of multi-task learning (e.g., [14, 15]). Under the independence assumption, we can easily compute the predicted expected inverse duration and use it to compute the expected improvement per second as a function of x. 3.3 Monte Carlo Acquis...

54 | Slice sampling covariance hyperparameters of latent Gaussian models
- Murray, Adams
- 2010
Citation Context: ...by projecting the observations to the unit hypercube, as defined by the bounds of the optimization. Gaussian process hyperparameters, θ, are sampled using the slice sampling algorithm of Murray and Adams [1]. In order to find the maximum of the multimodal acquisition function a(x ; {xn, yn}, θ) in a continuous domain, first discrete candidate points are densely sampled in the unit hypercube using a low d...

48 | Self-paced learning for latent variable models
- Kumar, Packer, et al.
- 2010
Citation Context: ...se. Latent structured SVMs outperform SVMs on problems where they can explicitly model problem-dependent hidden variables. A popular example task is the binary classification of protein DNA sequences [18, 20, 19]. The hidden variable to be modeled is the unknown location of particular subsequences, or motifs, that are indicators of positive sequences. Setting the hyperparameters, such as the regularisation te...

43 | On random weights and unsupervised feature learning
- Saxe, Koh, et al.
Citation Context: ...rameters. Multi-layer convolutional neural networks are an example of such a model for which a thorough exploration of architectures and hyperparameters is beneficial, as demonstrated in Saxe et al. [21], but often computationally prohibitive. While Saxe et al. [21] demonstrate a methodology for efficiently exploring model architectures, numerous hyperparameters, such as regularisation parameters, r...

42 | Selecting receptive fields in deep networks
- Coates, Ng
Citation Context: ...ed. This model has been carefully tuned by a human expert [22] to achieve a highly competitive result of 18% test error on the unaugmented data, which matches the published state-of-the-art result [23] on CIFAR-10. The parameters we explore include the number of epochs to run the model, the learning rate, four weight costs (one for each layer and the softmax output weights), and the width, scale and...

35 | Convergence rates of efficient global optimization algorithms
- Bull
- 2011
Citation Context: ...an process upper confidence bound (UCB) [3]. EI and UCB have been shown to be efficient in the number of function evaluations required to find the global optimum of many multimodal black-box functions [4, 3]. Machine learning algorithms, however, have certain characteristics that distinguish them from other black-box optimization problems. First, each function evaluation can require a variable amount o...

34 | A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise
- Kushner
- 1964
Citation Context: ...e standard normal, and φ(·) will denote the standard normal density function. Probability of Improvement One intuitive strategy is to maximize the probability of improving over the best current value [12]. Under the GP this can be computed analytically as aPI(x ; {xn, yn}, θ) = Φ(γ(x)), where γ(x) = (f(xbest) − µ(x ; {xn, yn}, θ)) / σ(x ; {xn, yn}, θ). (1) Expected Improvement Alternatively, one could choose t...
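The probability-of-improvement acquisition quoted here is evaluated directly from the GP posterior mean µ(x) and standard deviation σ(x). A minimal sketch for minimization, using the error function for the normal CDF:

```python
import math

def probability_of_improvement(mu, sigma, f_best):
    """a_PI(x) = Phi(gamma(x)) with gamma(x) = (f_best - mu(x)) / sigma(x)."""
    if sigma <= 0.0:
        # Degenerate posterior: improvement is certain or impossible.
        return 1.0 if mu < f_best else 0.0
    gamma = (f_best - mu) / sigma
    return 0.5 * (1.0 + math.erf(gamma / math.sqrt(2.0)))
```

Because it ignores the magnitude of improvement, PI tends to exploit near the incumbent, which is why the excerpt goes on to introduce expected improvement as an alternative.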

31 | Handling sparsity via the horseshoe
- Carvalho, Polson, et al.
- 2009
Citation Context: ...rior for the mean, m, and width-2 top-hat priors for each of the D length scale parameters. As we expect the observation noise generally to be close to or exactly zero, ν is given a horseshoe prior [3]. The covariance amplitude θ0 is given a zero-mean, unit-variance lognormal prior, θ0 ∼ ln N(0, 1). Algorithm 1 Selecting the next point to evaluate Input: Observations {xn, yn}, n = 1, …, N {Generate a set...

20 | Random search for hyper-parameter optimization
- Bergstra, Bengio
- 2012
Citation Context: ...ly, Bergstra et al. [5] have explored various strategies for optimizing the hyperparameters of machine learning algorithms. They demonstrated that grid search strategies are inferior to random search [9], and suggested the use of Gaussian process Bayesian optimization, optimizing the hyperparameters of a squared-exponential covariance, and proposed the Tree Parzen Algorithm. 2 Bayesian Optimization w...

20 | A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning
- Brochu, Cora, de Freitas
- 2009
Citation Context: ...algorithm, then it is easy to justify some extra computation to make better decisions. For an overview of the Bayesian optimization formalism and a review of previous work, see, e.g., Brochu et al. [10]. In this section we briefly review the general Bayesian optimization approach, before discussing our novel contributions in Section 3. There are two major choices that must be made when performing Ba...

19 | Algorithms for hyper-parameter optimization
- Bergstra, Bardenet, Bengio, Kégl
Citation Context: ...machine learning algorithms. We argue that a fully Bayesian treatment of the underlying GP kernel is preferred to the approach based on optimization of the GP hyperparameters, as previously proposed [5]. Our second contribution is the description of new algorithms for taking into account the variable and unknown cost of experiments or the availability of multiple cores to run experiments in parallel...

16 | Max-margin min-entropy models
- Miller, Kumar, et al.
- 2012
Citation Context: ...fraction of the number of experiments. 4.3 Motif Finding with Structured Support Vector Machines In this example, we consider optimizing the learning parameters of Max-Margin Min-Entropy (M3E) Models [18], which include Latent Structured Support Vector Machines [19] as a special case. Latent structured SVMs outperform SVMs on problems where they can explicitly model problem-dependent hidden variables...

9 | The application of Bayesian methods for seeking the extremum. Towards Global Optimization
- Mockus, Tiešis, Žilinskas
- 1978
Citation Context: ...to be automated. Specifically, we could view such tuning as the optimization of an unknown black-box function and invoke algorithms developed for such problems. A good choice is Bayesian optimization [1], which has been shown to outperform other state-of-the-art global optimization algorithms on a number of challenging optimization benchmark functions [2]. For continuous functions, Bayesian optimizat...

6 | Dealing with asynchronicity in parallel Gaussian process based global optimization. http://hal.archives-ouvertes.fr/hal-00507632
- Ginsbourger, Riche
- 2010
Citation Context: ...sition and use this to select the next point. Figure 2 shows how this procedure would operate with queued evaluations. We note that a similar approach is touched upon briefly by Ginsbourger and Riche [16], but they view it as too intractable to warrant attention. We have found our Monte Carlo estimation procedure to be highly effective in practice, however, as will be discussed in Section 4. 4 Empiric...

3 | Adaptive MCMC with Bayesian optimization
- Mahendran, Wang, Hamze, de Freitas
- 2012
Citation Context: ...eatment as their expensive nature necessitates minimizing the number of evaluations. Bayesian optimization strategies have also been used to tune the parameters of Markov chain Monte Carlo algorithms [8]. Recently, Bergstra et al. [5] have explored various strategies for optimizing the hyperparameters of machine learning algorithms. They demonstrated that grid search strategies are inferior to random...