Results 1–10 of 15
Network Information Criterion: Determining the Number of Hidden Units for an Artificial Neural Network Model
IEEE Transactions on Neural Networks, 1994
"... The problem of model selection, or determination of the number of hidden units, can be approached statistically, by generalizing Akaike's information criterion (AIC) to be applicable to unfaithful (i.e., unrealizable) models with general loss criteria including regularization terms. The relatio ..."
Cited by 182 (8 self)
Abstract:
The problem of model selection, or determination of the number of hidden units, can be approached statistically, by generalizing Akaike's information criterion (AIC) to be applicable to unfaithful (i.e., unrealizable) models with general loss criteria including regularization terms. The relation between the training error and the generalization error is studied in terms of the number of training examples and the complexity of a network, which reduces to the number of parameters in the ordinary statistical theory of the AIC. This relation leads to a new Network Information Criterion (NIC) which is useful for selecting the optimal network model based on a given training set.
IEEE Transactions on Neural Networks, Vol. 5, No. 6, pp. 865–872, November 1994. Department of Mathematical Engineering and Information Physics, Faculty of Engineering, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113, Japan.
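The criterion described here generalizes AIC-style penalized model selection. As a rough illustration of the underlying idea only (not the NIC formula itself, which handles unrealizable models and regularized losses), the sketch below selects a model size by plain AIC, using polynomial degree as a hypothetical stand-in for the number of hidden units:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.linspace(-1, 1, n)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(n)  # noisy nonlinear target

def aic(degree):
    # Gaussian-noise AIC: n*log(RSS/n) + 2k, with k = degree + 1 parameters.
    coef = np.polyfit(x, y, degree)
    rss = np.sum((np.polyval(coef, x) - y) ** 2)
    return n * np.log(rss / n) + 2 * (degree + 1)

scores = {d: aic(d) for d in range(1, 10)}
best = min(scores, key=scores.get)  # complexity chosen by the criterion
```

The penalty term grows with the parameter count, so the criterion stops rewarding extra complexity once the residual is mostly noise.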
Input-Dependent Estimation of Generalization Error under Covariate Shift
Statistics & Decisions, Vol. 23, No. 4, pp. 249–279, 2005
"... A common assumption in supervised learning is that the training and test input points follow the same probability distribution. However, this assumption is not fulfilled, e.g., in interpolation, extrapolation, active learning, or classification with imbalanced data. The violation of this assumption— ..."
Cited by 61 (32 self)
Abstract:
A common assumption in supervised learning is that the training and test input points follow the same probability distribution. However, this assumption is not fulfilled, e.g., in interpolation, extrapolation, active learning, or classification with imbalanced data. The violation of this assumption, known as covariate shift, causes a heavy bias in standard generalization error estimation schemes such as cross-validation or Akaike's information criterion, and thus they result in poor model selection. In this paper, we propose an alternative estimator of the generalization error for the squared loss function when training and test distributions are different. The proposed generalization error estimator is shown to be exactly unbiased for finite samples if the learning target function is realizable, and asymptotically unbiased in general. We also show that, in addition to the unbiasedness, the proposed generalization error estimator can accurately estimate the difference in generalization error among different models, which is a desirable property in model selection. Numerical studies show that the proposed method compares favorably with existing model selection methods in regression for extrapolation and in classification with imbalanced data.
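The estimator itself is specific to the paper, but the covariate-shift bias it targets is easy to reproduce. The sketch below, with an assumed Gaussian training/test shift and a fixed predictor, shows how weighting training losses by the density ratio corrects the naive training-set estimate of test error:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Assumed shift: training inputs ~ N(0,1), test inputs ~ N(0.5,1); the
# conditional p(y|x) is unchanged. We estimate the squared-loss test
# error of a fixed predictor without using any test outputs.
target = np.sin
predictor = lambda x: 0.8 * x

x_tr = rng.normal(0.0, 1.0, n)
x_te = rng.normal(0.5, 1.0, n)

def normal_pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

loss_tr = (predictor(x_tr) - target(x_tr)) ** 2
w = normal_pdf(x_tr, 0.5) / normal_pdf(x_tr, 0.0)  # density ratio p_te/p_tr

naive = loss_tr.mean()           # biased under covariate shift
weighted = (w * loss_tr).mean()  # consistent for the test-distribution error
true = ((predictor(x_te) - target(x_te)) ** 2).mean()
```

The weighted average should land much closer to the Monte Carlo ground truth than the unweighted training error does.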
Subspace information criterion for model selection
Neural Computation, 2001
"... The problem of model selection is considerably important for acquiring higher levels of generalization capability in supervised learning. In this paper, we propose a new criterion for model selection called the subspace information criterion (SIC), which is a generalization of Mallows ’ C L. It is a ..."
Cited by 58 (31 self)
Abstract:
The problem of model selection is of considerable importance for acquiring higher levels of generalization capability in supervised learning. In this paper, we propose a new criterion for model selection called the subspace information criterion (SIC), which is a generalization of Mallows' C_L. It is assumed that the learning target function belongs to a specified functional Hilbert space, and the generalization error is defined as the Hilbert space squared norm of the difference between the learning result function and the target function. SIC gives an unbiased estimate of the generalization error so defined. SIC assumes the availability of an unbiased estimate of the target function and the noise covariance matrix, which are generally unknown. A practical calculation method of SIC for least mean squares learning is provided under the assumption that the dimension of the Hilbert space is less than the number of training examples. Finally, computer simulations in two examples show that SIC works well even when the number of training examples is small.
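Since SIC is described as a generalization of Mallows' C_L, a minimal C_L example may help fix ideas. This sketch (ordinary polynomial least squares with an assumed known noise level, all toy choices) scores each candidate model by residual error plus the complexity penalty 2*sigma^2*trace(H):

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 50, 0.3                      # assumed known noise level
x = np.linspace(-1, 1, n)
y = np.sin(np.pi * x) + sigma * rng.standard_normal(n)

def cl_score(degree):
    # Polynomial least squares; for a linear smoother the C_L penalty is
    # 2*sigma^2*trace(H), and trace(H) here equals the parameter count.
    X = np.vander(x, degree + 1)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return resid @ resid + 2 * sigma**2 * (degree + 1)

scores = {d: cl_score(d) for d in range(1, 11)}
best = min(scores, key=scores.get)
```

SIC replaces the Euclidean residual of C_L with a Hilbert space norm and an unbiased estimate of the target; this toy keeps only the penalized-residual skeleton.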
The Subspace Information Criterion for Infinite Dimensional Hypothesis Spaces
Journal of Machine Learning Research, 2002
"... A central problem in learning is selection of an appropriate model. This is typically done by estimating the unknown generalization errors of a set of models to be selected from and then choosing the model with minimal generalization error estimate. In this article, we discuss the problem of mode ..."
Cited by 15 (14 self)
Abstract:
A central problem in learning is selection of an appropriate model. This is typically done by estimating the unknown generalization errors of a set of models to be selected from and then choosing the model with minimal generalization error estimate. In this article, we discuss the problem of model selection and generalization error estimation in the context of kernel regression models, e.g., kernel ridge regression, kernel subset regression or Gaussian process regression.
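As context for the model family the abstract names, a minimal kernel ridge regression fit might look as follows; the Gaussian kernel, its width, and the regularization strength are all illustrative assumptions, not the paper's choices:

```python
import numpy as np

rng = np.random.default_rng(7)

def gauss_kernel(a, b, width=0.3):
    # Gaussian RBF kernel matrix between two 1-D sample vectors.
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * width**2))

n = 60
x = np.sort(rng.uniform(-1, 1, n))
y = np.sinc(2 * x) + 0.1 * rng.standard_normal(n)

lam = 1e-2                                        # assumed regularization
K = gauss_kernel(x, x)
alpha = np.linalg.solve(K + lam * np.eye(n), y)   # dual coefficients

x_new = np.linspace(-1, 1, 200)
pred = gauss_kernel(x_new, x) @ alpha             # predictions on a grid
train_mse = float(np.mean((K @ alpha - y) ** 2))
```

Model selection in this setting means choosing the kernel width and lam, which is exactly where a generalization error estimate such as SIC would be applied.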
Performance Prediction for Exponential Language Models
"... We investigate the task of performance prediction for language models belonging to the exponential family. First, we attempt to empirically discover a formula for predicting test set crossentropy for ngram language models. We build models over varying domains, data set sizes, and ngram orders, an ..."
Cited by 14 (3 self)
Abstract:
We investigate the task of performance prediction for language models belonging to the exponential family. First, we attempt to empirically discover a formula for predicting test set cross-entropy for n-gram language models. We build models over varying domains, data set sizes, and n-gram orders, and perform linear regression to see whether we can model test set performance as a simple function of training set performance and various model statistics. Remarkably, we find a simple relationship that predicts test set performance with a correlation of 0.9997. We analyze why this relationship holds and show that it holds for other exponential language models as well, including class-based models and minimum discrimination information models. Finally, we discuss how this relationship can be applied to improve language model performance.
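The regression methodology described can be sketched on synthetic data. The generating rule below (test cross-entropy equal to training cross-entropy plus a term in the parameter-to-token ratio) is an assumption made purely for illustration, not the paper's fitted formula:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 40   # number of hypothetical trained models

train_ce = rng.uniform(4.0, 7.0, m)            # training cross-entropy (nats)
ratio = rng.uniform(1e5, 1e7, m) / rng.uniform(1e6, 1e8, m)  # params/tokens

# Assumed generating rule for the toy data, NOT the paper's result:
test_ce = train_ce + 2.0 * ratio + 0.01 * rng.standard_normal(m)

# Linear regression of test performance on training performance + model stats.
X = np.column_stack([np.ones(m), train_ce, ratio])
coef, *_ = np.linalg.lstsq(X, test_ce, rcond=None)
corr = float(np.corrcoef(X @ coef, test_ce)[0, 1])
```

The point of the exercise is the workflow (train many models, regress test performance on cheap statistics, check the correlation), not the particular coefficients.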
Trading Variance Reduction with Unbiasedness: The Regularized Subspace Information Criterion for Robust Model Selection in Kernel Regression
Neural Computation, 2004
"... A wellknown result by Stein (1956) shows that in particular situations, biased estimators can yield better parameter estimates than their generally preferred unbiased counterparts. This paper follows the same spirit as we will stabilize the unbiased generalization error estimates by regularizati ..."
Cited by 12 (10 self)
Abstract:
A well-known result by Stein (1956) shows that in particular situations, biased estimators can yield better parameter estimates than their generally preferred unbiased counterparts. This paper follows the same spirit: we stabilize unbiased generalization error estimates by regularization and finally obtain more robust model selection criteria for learning. We trade a small bias against a larger variance reduction, which has the beneficial effect of being more precise on a single training set. We focus on the subspace information criterion (SIC), which is an unbiased estimator of the expected generalization error measured by the reproducing kernel Hilbert space norm. SIC can be applied to kernel regression, and it was shown in earlier experiments that a small regularization of SIC has a stabilization effect. However, ...
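The bias-variance trade invoked here can be demonstrated with a classic shrinkage toy example, offered in the spirit of Stein's result rather than as the regularized SIC itself: deliberately biasing an unbiased estimate toward zero reduces its mean squared error in this setup.

```python
import numpy as np

rng = np.random.default_rng(4)

theta = np.full(10, 0.5)     # true parameter vector (toy choice)
trials = 5000
est = theta + rng.standard_normal((trials, 10))  # unbiased, unit variance
shrunk = 0.5 * est                               # biased toward zero

mse_unbiased = float(np.mean(np.sum((est - theta) ** 2, axis=1)))
mse_shrunk = float(np.mean(np.sum((shrunk - theta) ** 2, axis=1)))
# Expected values: 10.0 for the unbiased estimate, about 3.1 for the
# shrunk one (variance drops to 0.25 per coordinate, small bias added).
```

The same logic motivates regularizing SIC: accept a small bias in the criterion in exchange for much lower variance across training sets.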
Learning under Nonstationarity: Covariate Shift Adaptation by Importance Weighting
In J. E. Gentle, W. Härdle, Y. Mori (Eds.), Handbook of Computational Statistics: Concepts and Methods, 2nd Edition, Chapter 31, pp. 927–952, Springer, Berlin, 2012
"... The goal of supervised learning is to estimate an underlying inputoutput function from its inputoutput training samples so that output values for unseen test input points can be predicted. A common assumption in supervised learning is that the training input points follow the same probability dist ..."
Cited by 3 (1 self)
Abstract:
The goal of supervised learning is to estimate an underlying input-output function from its input-output training samples so that output values for unseen test input points can be predicted. A common assumption in supervised learning is that the training input points follow the same probability distribution as the test input points. However, this assumption is not satisfied, for example, when we extrapolate outside of the training region. The situation where the training and test input points follow different distributions while the conditional distribution of output values given input points is unchanged is called covariate shift. Since almost all existing learning methods assume that the training and test samples are drawn from the same distribution, their fundamental theoretical properties such as consistency or efficiency no longer hold under covariate shift. In this chapter, we review recently proposed techniques for covariate shift adaptation.
Statistical Theory of Learning Curves
"... Behaviors of a learning machine depends on its complexity and the number of training examples. A learning curve shows how fast a neural network or a general learning machine can improve its behavior as the number of training examples increases. This is also related to the complexity of a learning ma ..."
Cited by 1 (0 self)
Abstract:
The behavior of a learning machine depends on its complexity and the number of training examples. A learning curve shows how fast a neural network or a general learning machine can improve its behavior as the number of training examples increases; this is also related to the complexity of the learning machine. The characteristics of the learning curve are studied from the statistical-mechanical, information-theoretic, and statistical points of view. The present paper summarizes universal as well as specific properties of learning curves of both deterministic and stochastic pattern classifiers from the statistical point of view.
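An empirical learning curve is easy to simulate. The sketch below (a realizable linear model with Gaussian noise, chosen purely for illustration) estimates the expected parameter error of least squares at increasing sample sizes, which in this setup decays roughly like d/n:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 5
beta_true = rng.standard_normal(d)   # realizable linear target (toy choice)

def expected_error(n, trials=200):
    # Average squared parameter error of least squares at sample size n;
    # theory gives roughly sigma^2 * d / n for this Gaussian design.
    errs = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        y = X @ beta_true + 0.5 * rng.standard_normal(n)
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        errs.append(np.sum((beta - beta_true) ** 2))
    return float(np.mean(errs))

curve = {n: expected_error(n) for n in (20, 40, 80, 160)}  # decreasing in n
```

Plotting curve against n would trace the kind of learning curve whose universal shape the paper analyzes.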
Improving Precision of the Subspace Information Criterion
"... Evaluating the generalization performance of learning machines without using additional test samples is one of the most important issues in the machine learning community. The subspace information criterion (SIC) is one of the methods for this purpose, which is shown to be an unbiased estimator of t ..."
Abstract:
Evaluating the generalization performance of learning machines without using additional test samples is one of the most important issues in the machine learning community. The subspace information criterion (SIC) is one of the methods for this purpose, and it is shown to be an unbiased estimator of the generalization error with finite samples. Although the mean of SIC agrees with the true generalization error even in small sample cases, the scatter of SIC can be large under some severe conditions. In this paper, we therefore investigate the causes of the degraded precision of SIC and discuss how its precision could be improved.