
## Gaussian processes for ordinal regression (2004)


### Download Links

- [www.gatsby.ucl.ac.uk]
- [www.jmlr.org]
- [jmlr.csail.mit.edu]
- [jmlr.org]
- [mlg.eng.cam.ac.uk]
- DBLP

### Other Repositories/Bibliography

Venue: Journal of Machine Learning Research

Citations: 114 (4 self)

### Citations

12893 | The Nature of Statistical Learning Theory
- Vapnik
- 1995

Citation Context: ...based on binary classifiers. Cohen et al. (1999) considered general ranking problems in the form of preference judgements. Herbrich et al. (2000) applied the principle of Structural Risk Minimization (Vapnik, 1995) to ordinal regression, leading to a new distribution-independent learning algorithm based on a loss function between pairs of ranks. Shashua a...

3039 | Generalized Linear Models
- McCullagh, Nelder
- 1983

Citation Context: ...improvement on the performance compared with the on-line algorithm proposed by Crammer and Singer (2002). In the statistics literature, most of the approaches are based on generalized linear models (McCullagh and Nelder, 1983). The cumulative model (McCullagh, 1980) is well-known in classical statistical approaches for ordinal regression, in which they rely on a specific distributional assumption on the unobservable laten...

2379 | Generalized Additive Models
- Hastie, Tibshirani
- 1990

Citation Context: ...described Bayesian inference on parametric models for ordinal data using sampling techniques. Tutz (2003) presented a general framework for semiparametric models that extends generalized additive models (Hastie and Tibshirani, 1990) by incorporating nonparametric parts. The nonparametric components of the regression model are fitted by maximizing penalized log likelihood, and model selection is carried out using AIC. Gaussian p...

791 | Bayesian Learning for Neural Networks - Neal - 1996 |

764 | Learning the kernel matrix with semi-definite programming - Lanckriet, Cristianini, et al. - 2004 |

686 | Learning with Kernels: Support Vector
- Schölkopf, Smola
- 2001

Citation Context: ...covariance matrix for any finite set of zero-mean random variables {f(x_i)}. The covariance between the functions corresponding to the inputs x_i and x_j can be defined by Mercer kernel functions (Wahba, 1990; Schölkopf and Smola, 2001), e.g. the Gaussian kernel, which is defined as Cov[f(x_i), f(x_j)] = K(x_i, x_j) = exp(−(κ/2) Σ_{ς=1}^{d} (x_i^ς − x_j^ς)²) (1), where κ > 0 and x_i^ς denotes the ς-th element of x_i. Thus, the prior probability of these...
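The Gaussian kernel in equation (1) of the quoted passage can be sketched in a few lines of NumPy; the inputs and the value of κ below are illustrative, not taken from the paper's experiments:

```python
import numpy as np

def gaussian_kernel(X, kappa=1.0):
    """Covariance matrix from equation (1):
    K(x_i, x_j) = exp(-(kappa/2) * sum_s (x_i^s - x_j^s)^2), kappa > 0."""
    # pairwise squared Euclidean distances via broadcasting
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * kappa * sq_dists)

# Toy inputs: three points in a 2-D input space (made-up values).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_kernel(X, kappa=1.0)
# K is symmetric with unit diagonal; K[0, 1] == exp(-0.5) since ||x_0 - x_1||^2 = 1
```

Any κ > 0 yields a valid (positive semi-definite) covariance matrix, which is what makes this a usable Gaussian process prior.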

548 | A limited memory algorithm for bound constrained minimization
- Byrd, Lu, et al.
- 1995

Citation Context: ...(2003) and the two versions of our approach, the MAP approach with Laplace approximation (MAP) and the EP algorithm with variational methods (EP). In our implementation, we used the routine L-BFGS-B (Byrd et al., 1995) as the gradient-based optimization package, and started from the initial values of hyperparameters to infer the optimal values in the criterion of the approximate evidence (11) for MAP or the variat...
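As a minimal illustration of the optimization setup described above: SciPy exposes the Byrd et al. L-BFGS-B routine through `scipy.optimize.minimize` with `method='L-BFGS-B'`. The objective here is a toy stand-in for a hyperparameter criterion, not the paper's approximate evidence (11):

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective over one positive hyperparameter kappa: a smooth function
# with a single minimum inside the bound-constrained region (illustrative only).
def objective(theta):
    kappa = theta[0]
    return (np.log(kappa) - 1.0) ** 2 + 0.1 * kappa

# Bound constraint kappa >= 1e-6 keeps the log well-defined, as L-BFGS-B allows.
res = minimize(objective, x0=[1.0], method="L-BFGS-B", bounds=[(1e-6, None)])
# res.x[0] holds the optimized hyperparameter value
```

The bound support is the relevant feature here: kernel hyperparameters such as κ must stay positive during gradient-based search.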

489 | A practical Bayesian framework for backpropagation networks
- MacKay
- 1992

Citation Context: ...for ordinal variables which can be regarded as a generalization of the probit function. Two Bayesian inference techniques are applied to implement model adaptation by using the Laplace approximation (MacKay, 1992) and the expectation propagation (Minka, 2001) respectively. Comparisons of the generalization performance against the support vector approach (Shashua and Levin, 2003) on some benchmark and real-wor...

445 | Multivariate Statistical Modelling Based on Generalized Linear Models, Springer Series in Statistics
- Fahrmeir, Tutz
- 1994

Citation Context: ...model parameters such as those that control the kernel shape and the noise level. The GPs are also different from the semiparametric approach of Tutz (2003) in several ways. First, the additive models (Fahrmeir and Tutz, 2001) are defined by functions in each input dimension, whereas the GPs can have more general non-additive covariance functions; second, the kernel trick allows the use of infinite basis function expansions; ...

426 | Large margin rank boundaries for ordinal regression - Herbrich, Graepel, et al. - 2000 |

417 | Gene expression correlates of clinical prostate cancer behavior - Singh, Febbo, et al. - 2002 |

404 | Learning to order things - Cohen, Schapire, et al. - 1999 |

358 | A family of algorithms for approximate Bayesian inference
- Minka
- 2001

Citation Context: ...a generalization of the probit function. Two Bayesian inference techniques are applied to implement model adaptation by using the Laplace approximation (MacKay, 1992) and the expectation propagation (Minka, 2001) respectively. Comparisons of the generalization performance against the support vector approach (Shashua and Levin, 2003) on some benchmark and real-world data sets, such as movie ranking and gene e...

326 | Regression Models for Ordinal Data
- McCullagh
- 1980

Citation Context: ...on-line algorithm proposed by Crammer and Singer (2002). In the statistics literature, most of the approaches are based on generalized linear models (McCullagh and Nelder, 1983). The cumulative model (McCullagh, 1980) is well-known in classical statistical approaches for ordinal regression, in which they rely on a specific distributional assumption on the unobservable latent variables and a stochastic ordering of...
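The cumulative model mentioned in this passage maps a latent score through ordered thresholds to rank probabilities. A minimal sketch with a probit link; the thresholds and the number of ranks are made-up values for illustration:

```python
import math

# Cumulative model sketch: P(y <= j | x) = Phi(b_j - score), so P(y = j | x)
# is the difference of consecutive cumulative probabilities. Thresholds must
# be increasing; the values here are illustrative, not fitted.
def phi(z):  # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

b = [-1.0, 0.5, 2.0]  # r - 1 = 3 thresholds, so r = 4 ordinal ranks

def rank_probs(score):  # score plays the role of the latent mean w.x
    cum = [phi(t - score) for t in b] + [1.0]  # cumulative probs, last is 1
    probs, prev = [], 0.0
    for c in cum:
        probs.append(c - prev)
        prev = c
    return probs  # one probability per rank, summing to 1
```

Because the thresholds are shared across all inputs, only the latent score moves with x, which is exactly the "stochastic ordering" property the passage refers to.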

265 | Improvements to Platt's SMO algorithm for SVM classifier design
- Keerthi, Shevade, et al.
- 2001

Citation Context: ...values of hyperparameters to infer the optimal values in the criterion of the approximate evidence (11) for MAP or the variational lower bound (13) for EP respectively. The improved SMO algorithm (Keerthi et al., 2001) was adapted to implement the SVM approach (refer to Chu and Keerthi (2005) for a detailed description and extensive discussion), and 5-fold cross validation was used to determine the optimal values ...

217 | Pranking with ranking
- Crammer, Singer
- 2001

Citation Context: ...adding multiple thresholds to define parallel discriminant hyperplanes for ordinal scales, and reported that the performance of the support vector approach is better than that of the on-line algorithm (Crammer and Singer, 2002). The problem size in the large-margin ranking algorithm of Herbrich et al. (2000) is a quadratic function of the training data size, making the algorithmic complexity O(n^4)–O(n^6). This makes the ...
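With parallel thresholds as described above, prediction reduces to locating a latent value among ordered cut points. A minimal sketch; the threshold values are hypothetical, not learned:

```python
import numpy as np

# Parallel-threshold prediction: r - 1 increasing thresholds partition the
# latent axis into r intervals, one per ordinal rank. Values are illustrative.
b = np.array([-1.0, 0.5, 2.0])  # r = 4 ordinal ranks

def predict_rank(latent_value):
    # rank = number of thresholds strictly below the latent value, plus one
    return int(np.searchsorted(b, latent_value) + 1)

# predict_rank(-2.0) -> 1, predict_rank(0.0) -> 2, predict_rank(3.0) -> 4
```

Sharing one latent function across all thresholds is what keeps the discriminant hyperplanes parallel.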

192 | Weighted low rank approximation - Srebro, Jaakkola - 2003 |

189 | Spline Models of Observational Data, volume 59
- Wahba
- 1990

Citation Context: ...g the covariance matrix for any finite set of zero-mean random variables {f(x_i)}. The covariance between the functions corresponding to the inputs x_i and x_j can be defined by Mercer kernel functions (Wahba, 1990; Schölkopf and Smola, 2001), e.g. the Gaussian kernel, which is defined as Cov[f(x_i), f(x_j)] = K(x_i, x_j) = exp(−(κ/2) Σ_{ς=1}^{d} (x_i^ς − x_j^ς)²) (1), where κ > 0 and x_i^ς denotes the ς-th element of x_i. Thus, the...

175 | Bayesian classification with Gaussian processes
- Williams, Barber
- 1998

Citation Context: ...out using AIC. Gaussian processes (O'Hagan, 1978; Neal, 1997) have provided a promising non-parametric Bayesian approach to metric regression (Williams and Rasmussen, 1996) and classification problems (Williams and Barber, 1998). The important advantage of Gaussian process models (GPs) over other non-Bayesian models is the explicit probabilistic formulation. This not only provides probabilistic predictions but also gives th...

171 | On extensions of the Brunn–Minkowski and Prékopa–Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation
- Brascamp, Lieb
- 1976

Citation Context: ...by Pratt (1981), the convexity of the loss function follows, because the integral of a log concave function with respect to some of its arguments is a log concave function of its remaining arguments (Brascamp and Lieb, 1976, Cor. 3.5). 2.3 Posterior probability. Based on Bayes' theorem, the posterior probability can then be written as P(f|D) = (1/P(D)) ∏_{i=1}^{n} P(y_i|f(x_i)) P(f) (8), where the prior probability P(f) is defined...

169 | Fast sparse gaussian process method: Informative vector machines - Lawrence, Seeger, et al. |

146 | Monte Carlo implementation of Gaussian process models for Bayesian regression and classification
- Neal
- 1997

Citation Context: ...nonparametric parts. The nonparametric components of the regression model are fitted by maximizing penalized log likelihood, and model selection is carried out using AIC. Gaussian processes (O'Hagan, 1978; Neal, 1997) have provided a promising non-parametric Bayesian approach to metric regression (Williams and Rasmussen, 1996) and classification problems (Williams and Barber, 1998). The important advantage of Gau...

107 | Unifying collaborative and content-based filtering
- Basilico, Hofmann
- 2004

Citation Context: ...training, and then tested on the remaining movies. At each size, the random selection was carried out 20 times independently. The Pearson correlation coefficient is the most popular correlation measure (Basilico and Hofmann, 2004), which corresponds to a dot product between normalized rating vectors. For instance, if applied to the movies, we can define the so-called z-scores as z(v,u) = (r(v,u) − µ(v)) / σ(v), where u indexes use...
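The z-score construction quoted above turns the Pearson coefficient into a dot product of normalized rating vectors. A small sketch with made-up ratings:

```python
import math

# Pearson correlation as a dot product of z-scored vectors, following the
# quoted definition z = (r - mu) / sigma. The rating vectors are toy values.
def zscores(ratings):
    n = len(ratings)
    mu = sum(ratings) / n
    sigma = math.sqrt(sum((r - mu) ** 2 for r in ratings) / n)
    return [(r - mu) / sigma for r in ratings]

def pearson(a, b):
    # average of the elementwise product of the two z-score vectors
    return sum(x * y for x, y in zip(zscores(a), zscores(b))) / len(a)

# pearson([1, 2, 3, 4], [2, 4, 6, 8]) -> 1.0 (identical z-scores)
```

Normalizing first removes per-user (or per-movie) rating bias and scale, which is why this measure is popular in collaborative filtering.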

94 | Ranking with large margin principle: Two approaches
- Shashua, Levin
- 2003

Citation Context: ...adaptation by using the Laplace approximation (MacKay, 1992) and the expectation propagation (Minka, 2001) respectively. Comparisons of the generalization performance against the support vector approach (Shashua and Levin, 2003) on some benchmark and real-world data sets, such as movie ranking and gene expression analysis, verify the usefulness of this approach. The paper is organized as follows: in Section 2, we describe t...

93 | A Bayesian committee machine - Tresp |

86 | Constraint classification: A new approach to multiclass classification. Algorithmic Learning Theory - Har-Peled, Roth, et al. - 2002 |

85 | Bayesian methods for backpropagation networks - MacKay - 1994 |

83 | A simple approach to ordinal classification - Frank, Hall - 2001 |

73 | New approaches to support vector ordinal regression - Chu, Keerthi - 2005 |

72 | Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and sparse approximations
- Seeger
- 2005

Citation Context: ...basis vectors are determined on-line for sparse representation. Lawrence et al. (2003) proposed a greedy selection with criteria based on information-theoretic principles for sparse Gaussian processes (Seeger, 2003). Tresp (2000) proposed the Bayesian committee machines to divide and conquer large datasets, while using infinite mixtures of Gaussian Processes (Rasmussen and Ghahramani, 2002) is another promising...

39 | Curve fitting and optimal design for prediction (with discussion) - O'Hagan - 1985 |

38 | Prediction of ordinal classes using regression trees - Kramer, Widmer, et al. - 2001 |

32 | Efficient approaches to Gaussian process classification
- Csató, Fokoué, et al.
- 1999

Citation Context: ...P(D|f)P(f) df. A popular idea for computing the evidence is to approximate the posterior distribution P(f|D) as a Gaussian, and then the evidence can be calculated by an explicit formula (MacKay, 1992; Csató et al., 2000; Minka, 2001). In this section, we describe two Bayesian techniques for model adaptation by using the Laplace approximation and the expectation propagation respectively. 3.1 MAP approach with Laplace...

28 | Warped Gaussian processes
- Snelson, Rasmussen, et al.
- 2004

Citation Context: ...parameters control the covariance length-scale of the Gaussian process along each input dimension. In metric regression, warped Gaussian processes (Snelson et al., 2004) assume that there is a nonlinear, monotonic, and continuous warping function relating the observed targets and some latent variables in a Gaussian process. The warping function, which is learned fro...

21 | Concavity of the log-likelihood - Pratt - 1981 |

20 | Sparse on-line Gaussian processes. Neural Computation 14(3):641–668 - Csató, Opper |

17 | The EM-EP algorithm for Gaussian process classification
- Kim, Ghahramani
- 2004

Citation Context: ...can be regarded as an extension of assumed-density filtering (ADF). The EP algorithm has been applied in Gaussian process classification along with variational methods for model selection (Seeger, 2002; Kim and Ghahramani, 2003). In the setting of Gaussian processes, EP attempts to approximate P(f|D) as a product distribution of the form Q(f) = ∏_{i=1}^{n} ˜t_i(f(x_i)) P(f), where ˜t_i(f(x_i)) = s_i exp(−(p_i/2)(f(x_i) − m_i)²). The pa...

12 | Notes on Minka's expectation propagation for Gaussian process classification (Technical Report)
- Seeger
- 2002

Citation Context: ...2001), which can be regarded as an extension of assumed-density filtering (ADF). The EP algorithm has been applied in Gaussian process classification along with variational methods for model selection (Seeger, 2002; Kim and Ghahramani, 2003). In the setting of Gaussian processes, EP attempts to approximate P(f|D) as a product distribution of the form Q(f) = ∏_{i=1}^{n} ˜t_i(f(x_i)) P(f), where ˜t_i(f(x_i)) = s_i ...
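For fixed site parameters, the product form Q(f) quoted above is itself Gaussian, and its moments follow from combining the prior covariance with the diagonal site precisions. A NumPy sketch with illustrative numbers (not a full EP loop, which would iterate the site updates):

```python
import numpy as np

# Product of a zero-mean Gaussian prior N(0, K) with Gaussian site terms
# t_i(f_i) = s_i * exp(-p_i * (f_i - m_i)^2 / 2) is Gaussian with
# covariance (K^{-1} + Pi)^{-1} and mean Sigma @ Pi @ m, where Pi = diag(p).
# The site precisions/means below are made up, not learned EP sites.
K = np.array([[1.0, 0.5],
              [0.5, 1.0]])      # prior covariance over two latent values
p = np.array([2.0, 0.5])        # site precisions p_i
m = np.array([1.0, -1.0])       # site means m_i

Pi = np.diag(p)
Sigma = np.linalg.inv(np.linalg.inv(K) + Pi)  # covariance of Q(f)
mu = Sigma @ (Pi @ m)                         # mean of Q(f)
```

EP's actual work is choosing the s_i, p_i, m_i so that Q matches the moments of the true posterior one likelihood term at a time; the Gaussian algebra above is the part that stays fixed.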

6 | Learning the kernel matrix with semidefinite programming - Jordan - 2004 |

6 | Generalized semiparametrically structured ordinal models - Tutz - 2003 |

5 | Ordinal data modeling. Statistics for social science and public policy - Johnson, Albert - 1999 |

1 | Ranking with large margin principle: two approaches - Shashua, Levin - 2003 |