Results 1 - 10
of
156
Boosting the margin: A new explanation for the effectiveness of voting methods
- In Proceedings International Conference on Machine Learning
, 1997
"... Abstract. One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. In this paper, we show ..."
Abstract
-
Cited by 606 (49 self)
- Add to MetaCart
Abstract. One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. In this paper, we show that this phenomenon is related to the distribution of margins of the training examples with respect to the generated voting classification rule, where the margin of an example is simply the difference between the number of correct votes and the maximum number of votes received by any incorrect label. We show that techniques used in the analysis of Vapnik’s support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error. We also show theoretically and experimentally that boosting is especially effective at increasing the margins of the training examples. Finally, we compare our explanation to those based on the bias-variance decomposition. 1
Selective sampling using the Query by Committee algorithm
- Machine Learning
, 1997
"... We analyze the "query by committee" algorithm, a method for filtering informative queries from a random stream of inputs. We show that if the two-member committee algorithm achieves information gain with positive lower bound, then the prediction error decreases exponentially with the number of queri ..."
Abstract
-
Cited by 256 (6 self)
- Add to MetaCart
We analyze the "query by committee" algorithm, a method for filtering informative queries from a random stream of inputs. We show that if the two-member committee algorithm achieves information gain with positive lower bound, then the prediction error decreases exponentially with the number of queries. We show that, in particular, this exponential decrease holds for query learning of perceptrons.
Scale-sensitive Dimensions, Uniform Convergence, and Learnability
, 1997
"... Learnability in Valiant's PAC learning model has been shown to be strongly related to the existence of uniform laws of large numbers. These laws define a distribution-free convergence property of means to expectations uniformly over classes of random variables. Classes of real-valued functions enjoy ..."
Abstract
-
Cited by 175 (1 self)
- Add to MetaCart
Learnability in Valiant's PAC learning model has been shown to be strongly related to the existence of uniform laws of large numbers. These laws define a distribution-free convergence property of means to expectations uniformly over classes of random variables. Classes of real-valued functions enjoying such a property are also known as uniform Glivenko-Cantelli classes. In this paper we prove, through a generalization of Sauer's lemma that may be interesting in its own right, a new characterization of uniform Glivenko-Cantelli classes. Our characterization yields Dudley, Gin'e, and Zinn's previous characterization as a corollary. Furthermore, it is the first based on a simple combinatorial quantity generalizing the Vapnik-Chervonenkis dimension. We apply this result to obtain the weakest combinatorial condition known to imply PAC learnability in the statistical regression (or "agnostic") framework. Furthermore, we show a characterization of learnability in the probabilistic concept model, solving an open problem posed by Kearns and Schapire. These results show that the accuracy parameter plays a crucial role in determining the effective complexity of the learner's hypothesis class.
A Model of Inductive Bias Learning
- Journal of Artificial Intelligence Research
, 2000
"... A major problem in machine learning is that of inductive bias: how to choose a learner's hypothesis space so that it is large enough to contain a solution to the problem being learnt, yet small enough to ensure reliable generalization from reasonably-sized training sets. Typically such bias is suppl ..."
Abstract
-
Cited by 100 (0 self)
- Add to MetaCart
A major problem in machine learning is that of inductive bias: how to choose a learner's hypothesis space so that it is large enough to contain a solution to the problem being learnt, yet small enough to ensure reliable generalization from reasonably-sized training sets. Typically such bias is supplied by hand through the skill and insights of experts. In this paper a model for automatically learning bias is investigated. The central assumption of the model is that the learner is embedded within an environment of related learning tasks. Within such an environment the learner can sample from multiple tasks, and hence it can search for a hypothesis space that contains good solutions to many of the problems in the environment. Under certain restrictions on the set of all hypothesis spaces available to the learner, we show that a hypothesis space that performs well on a sufficiently large number of training tasks will also perform well when learning novel tasks in the same environment. Exp...
Bounds on the Sample Complexity of Bayesian Learning Using Information Theory and the VC Dimension
- Machine Learning
, 1994
"... In this paper we study a Bayesian or average-case model of concept learning with a twofold goal: to provide more precise characterizations of learning curve (sample complexity) behavior that depend on properties of both the prior distribution over concepts and the sequence of instances seen by the l ..."
Abstract
-
Cited by 98 (12 self)
- Add to MetaCart
In this paper we study a Bayesian or average-case model of concept learning with a twofold goal: to provide more precise characterizations of learning curve (sample complexity) behavior that depend on properties of both the prior distribution over concepts and the sequence of instances seen by the learner, and to smoothly unite in a common framework the popular statistical physics and VC dimension theories of learning curves. To achieve this, we undertake a systematic investigation and comparison of two fundamental quantities in learning and information theory: the probability of an incorrect prediction for an optimal learning algorithm, and the Shannon information gain. This study leads to a new understanding of the sample complexity of learning in several existing models. 1 Introduction Consider a simple concept learning model in which the learner attempts to infer an unknown target concept f , chosen from a known concept class F of f0; 1g-valued functions over an instance space X....
Sphere Packing Numbers for Subsets of the Boolean n-Cube with Bounded Vapnik-Chervonenkis Dimension
, 1992
"... : Let V ` f0; 1g n have Vapnik-Chervonenkis dimension d. Let M(k=n;V ) denote the cardinality of the largest W ` V such that any two distinct vectors in W differ on at least k indices. We show that M(k=n;V ) (cn=(k + d)) d for some constant c. This improves on the previous best result of ((cn ..."
Abstract
-
Cited by 84 (4 self)
- Add to MetaCart
: Let V ` f0; 1g n have Vapnik-Chervonenkis dimension d. Let M(k=n;V ) denote the cardinality of the largest W ` V such that any two distinct vectors in W differ on at least k indices. We show that M(k=n;V ) (cn=(k + d)) d for some constant c. This improves on the previous best result of ((cn=k) log(n=k)) d . This new result has applications in the theory of empirical processes. 1 The author gratefully acknowledges the support of the Mathematical Sciences Research Institute at UC Berkeley and ONR grant N00014-91-J-1162. 1 1 Statement of Results Let n be natural number greater than zero. Let V ` f0; 1g n . For a sequence of indices I = (i 1 ; . . . ; i k ), with 1 i j n, let V j I denote the projection of V onto I, i.e. V j I = f(v i 1 ; . . . ; v i k ) : (v 1 ; . . . ; v n ) 2 V g: If V j I = f0; 1g k then we say that V shatters the index sequence I. The Vapnik-Chervonenkis dimension of V is the size of the longest index sequence I that is shattered by V [VC71] (t...
Which Problems Have Strongly Exponential Complexity?
- Journal of Computer and System Sciences
, 1998
"... For several NP-complete problems, there have been a progression of better but still exponential algorithms. In this paper, we address the relative likelihood of sub-exponential algorithms for these problems. We introduce a generalized reduction which we call Sub-Exponential Reduction Family (SERF) t ..."
Abstract
-
Cited by 78 (4 self)
- Add to MetaCart
For several NP-complete problems, there have been a progression of better but still exponential algorithms. In this paper, we address the relative likelihood of sub-exponential algorithms for these problems. We introduce a generalized reduction which we call Sub-Exponential Reduction Family (SERF) that preserves sub-exponential complexity. We show that CircuitSAT is SERF-complete for all NP-search problems, and that for any fixed k, k-SAT, k-Colorability, k-Set Cover, Independent Set, Clique, Vertex Cover, are SERF--complete for the class SNP of search problems expressible by second order existential formulas whose first order part is universal. In particular, sub-exponential complexity for any one of the above problems implies the same for all others. We also look at the issue of proving strongly exponential lower bounds for AC 0 ; that is, bounds of the form 2 \Omega\Gamma n) . This problem is even open for depth-3 circuits. In fact, such a bound for depth-3 circuits with even l...
On Linear-Time Deterministic Algorithms for Optimization Problems in Fixed Dimension
, 1992
"... We show that with recently developed derandomization techniques, one can convert Clarkson's randomized algorithm for linear programming in fixed dimension into a lineartime deterministic one. The constant of proportionality is d O(d) , which is better than for previously known such algorithms. We s ..."
Abstract
-
Cited by 76 (11 self)
- Add to MetaCart
We show that with recently developed derandomization techniques, one can convert Clarkson's randomized algorithm for linear programming in fixed dimension into a lineartime deterministic one. The constant of proportionality is d O(d) , which is better than for previously known such algorithms. We show that the algorithm works in a fairly general abstract setting, which allows us to solve various other problems (such as finding the maximum volume ellipsoid inscribed into the intersection of n halfspaces) in linear time.

