Approximate Bayesian Gaussian process (GP) classification techniques are powerful nonparametric learning methods, similar in appearance and performance to support vector machines. Based on simple probabilistic models, they render interpretable results and can be embedded in Bayesian frameworks for model selection, feature selection, etc. In this paper, by applying the PAC-Bayesian theorem of McAllester (1999a), we prove distribution-free generalisation error bounds for a wide range of approximate Bayesian GP classification techniques. We also provide a new and much simplified proof of this powerful theorem, making use of the concept of convex duality, which is a backbone of many machine learning techniques. We instantiate and test our bounds for two particular GP classification (GPC) techniques, including a recent sparse method which circumvents the unfavourable scaling of standard GP algorithms. As shown in experiments on a real-world task, the bounds can be very tight for moderate training sample sizes. To the best of our knowledge, these results provide the tightest known distribution-free error bounds for approximate Bayesian GPC methods, giving a strong learning-theoretical justification for the use of these techniques.
|
4514
|
Statistical Learning Theory
– Vapnik
- 1998
|
|
1890
|
Matrix Analysis
– Horn, Johnson
- 1985
|
|
1410
|
Convex Analysis
– Rockafellar
- 1970
|
|
1364
|
A theory of the learnable
– Valiant
- 1984
|
|
1103
|
A Tutorial on Support Vector Machines for Pattern Recognition
– Burges
- 1998
|
|
961
|
Learning with kernels
– Schölkopf, Smola
- 2002
|
|
727
|
Spline Models for Observational Data
– Wahba
- 1990
|
|
630
|
An introduction to Support Vector Machines and other Kernel-based learning methods
– Cristianini, Shawe-Taylor
- 2000
|
|
589
|
Information Theory and Statistics
– Kullback
- 1959
|
|
428
|
A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations
– Chernoff
- 1952
|
|
389
|
The perceptron: A probabilistic model for information storage and organization in the brain
– Rosenblatt
- 1958
|
|
374
|
Information Theory: Coding Theorems for Discrete Memoryless Systems
– Csiszár, Körner
- 1982
|
|
267
|
Nonparametric regression and Generalized Linear Models
– Green, Silverman
- 1994
|
|
265
|
Stochastic Simulation
– Ripley
- 1987
|
|
208
|
Structural risk minimization over data-dependent hierarchies
– Shawe-Taylor, Bartlett, et al.
- 1996
|
|
204
|
Sparse Bayesian learning and the relevance vector machine
– Tipping
|
|
137
|
Prediction with Gaussian processes: From linear regression to linear prediction and beyond
– Williams
- 1997
|
|
123
|
Using the Nyström Method to Speed Up Kernel Machines
– Williams, Seeger
- 2001
|
|
113
|
A Family of Algorithms for Approximate Bayesian Inference
– Minka
- 2001
|
|
92
|
Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension
– Haussler, Kearns, et al.
- 1991
|
|
78
|
Maximum entropy discrimination
– Jaakkola, Meila, et al.
- 1999
|
|
74
|
Monte Carlo implementation of Gaussian process models for Bayesian regression and classification
– Neal
- 1997
|
|
70
|
Stability and generalization
– Bousquet, Elisseeff
|
|
70
|
Empirical margin distributions and bounding the generalization error of combined classifiers
– Koltchinskii, Panchenko
|
|
67
|
Some PAC-Bayesian theorems
– McAllester
- 1998
|
|
64
|
Bayesian Gaussian Processes for Regression and Classification
– Gibbs
- 1997
|
|
58
|
A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations
– Chernoff
- 1952
|
|
56
|
Bayes factors and model uncertainty
– Kass, Raftery
- 1995
|
|
54
|
Sparse greedy Gaussian process regression
– Smola, Bartlett
- 2001
|
|
51
|
A Bayesian committee machine
– Tresp
|
|
50
|
Fast sparse Gaussian process methods: The informative vector machine
– Lawrence, Seeger, et al.
- 2002
|
|
50
|
Directional statistics
– Mardia, Jupp
- 2000
|
|
47
|
PAC-Bayesian model averaging
– McAllester
- 1999
|
|
46
|
Sparse on-line Gaussian processes
– Csató, Opper
- 2002
|
|
41
|
Relating data compression and learnability
– Littlestone, Warmuth
- 1986
|
|
38
|
Hybrid adaptive splines
– Luo, Wahba
- 1997
|
|
38
|
Learning Kernel Classifiers
– Herbrich
- 2002
|
|
31
|
PAC-Bayesian stochastic model selection
– McAllester
- 2003
|
|
30
|
Bayesian model selection for support vector machines, Gaussian processes and other kernel classifiers
– Seeger
- 2000
|
|
25
|
Mutual information, metric entropy and cumulative relative entropy risk. The Annals of Statistics
– Haussler, Opper
- 1997
|
|
19
|
Algorithmic luckiness
– Herbrich, Williamson
- 2002
|
|
16
|
From margin to sparsity
– Graepel, Herbrich, et al.
- 2001
|
|
12
|
Learning curves for Gaussian processes
– Sollich
- 1999
|
|
10
|
Information Theory for Continuous Systems, World Scientific, Singapore
– Ihara
- 1993
|
|
9
|
Learning Kernel Classifiers
– Herbrich
- 2001
|
|
9
|
Bounds for averaging classifiers
– Langford, Seeger
|
|
8
|
Stability and generalization
– Bousquet, Elisseeff
- 2002
|
|
6
|
A PAC-Bayesian margin bound for linear classifiers: Why SVMs work
– Herbrich, Graepel
- 2001
|
|
6
|
Bounds for averaging classifiers
– Langford, Seeger
- 2001
|
|
6
|
Generalized Linear Models. Number 37
– McCullagh, Nelder
- 1983
|