Results 1–10 of 100
On a theory of learning with similarity functions
 In International Conference on Machine Learning, 2006
Abstract

Cited by 63 (8 self)
Abstract. Kernel functions have become an extremely popular tool in machine learning, with an attractive theory as well. This theory views a kernel as implicitly mapping data points into a possibly very high-dimensional space, and describes a kernel function as being good for a given learning problem if data is separable by a large margin in that implicit space. However, while quite elegant, this theory does not necessarily correspond to the intuition of a good kernel as a good measure of similarity, and the underlying margin in the implicit space usually is not apparent in “natural” representations of the data. Therefore, it may be difficult for a domain expert to use the theory to help design an appropriate kernel for the learning task at hand. Moreover, the requirement of positive semidefiniteness may rule out the most natural pairwise similarity functions for the given problem domain. In this work we develop an alternative, more general theory of learning with similarity functions (i.e., sufficient conditions for a similarity function to allow one to learn well) that does not require reference to implicit spaces, and does not require the function to be positive semidefinite (or even symmetric). Instead, our theory talks in terms of more direct properties of how the function behaves as a similarity measure. Our results also generalize the standard theory in the sense that any good kernel function under the usual definition can be shown to also be a good similarity function under our definition (though with some loss in the parameters). In this way, we provide the first steps towards a theory of kernels and more general similarity functions that describes the effectiveness of a given function in terms of natural similarity-based properties.
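The similarity-based learning this abstract describes can be illustrated with the standard landmark trick: map each point to its vector of similarities with a few reference points, then learn a linear separator in that explicit space. The 1-D data, Gaussian similarity, and perceptron below are illustrative choices, not the paper's construction — a minimal sketch:

```python
import math
import random

def landmark_embed(x, landmarks, sim):
    """Map a point to the vector of its similarities with a few landmarks."""
    return [sim(x, z) for z in landmarks]

def perceptron(data, epochs=50):
    """Train a plain perceptron on (feature vector, +/-1 label) pairs."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for v, y in data:
            if y * (sum(wi * vi for wi, vi in zip(w, v)) + b) <= 0:
                w = [wi + y * vi for wi, vi in zip(w, v)]
                b += y
    return w, b

random.seed(0)
# Toy 1-D task: the label is the sign of the point; the similarity is a
# Gaussian bump (any reasonable similarity would do for this illustration).
points = [random.uniform(-1.0, 1.0) for _ in range(200)]
labels = [1 if p >= 0 else -1 for p in points]
sim = lambda a, b: math.exp(-(a - b) ** 2)

landmarks = points[:10]  # a handful of reference points
embedded = [(landmark_embed(p, landmarks, sim), y) for p, y in zip(points, labels)]
w, b = perceptron(embedded)

errors = sum(1 for v, y in embedded
             if y * (sum(wi * vi for wi, vi in zip(w, v)) + b) <= 0)
```

No positive semidefiniteness is ever checked here: the learner only needs the similarity values themselves, which is the point of the more general theory.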
New results for learning noisy parities and halfspaces
 In Proceedings of the 47th Annual Symposium on Foundations of Computer Science (FOCS), 2006
Abstract

Cited by 56 (12 self)
We address well-studied problems concerning the learnability of parities and halfspaces in the presence of classification noise. Learning of parities under the uniform distribution with random classification noise, also called the noisy parity problem, is a famous open problem in computational learning. We reduce a number of basic problems regarding learning under the uniform distribution to learning of noisy parities, thus highlighting the central role of this problem for learning under the uniform distribution. We show that under the uniform distribution, learning parities with adversarial classification noise reduces to learning parities with random classification noise. Together with the parity learning algorithm of Blum et al. [BKW03], this gives the first nontrivial algorithm for learning parities with adversarial noise. We show that learning of DNF expressions reduces to learning noisy parities of just a logarithmic number of variables. We show that learning of k-juntas reduces to learning noisy parities of k variables. These reductions work even in the presence of random classification noise in the original DNF or junta. We then consider the problem of learning halfspaces with adversarial noise, or finding a halfspace that maximizes the agreement rate with a given set of examples. Finding the best halfspace is known to be NP-hard [GJ79, PV88], and many inapproximability results are known for this problem [ABSS97, HSH95, AK95, BDEL00, BB02]. We show that even if there is a halfspace that correctly classifies a (1 − ε) fraction of the given examples, it is hard to find a halfspace that is correct on a (1/2 + δ) fraction, for any ε, δ > 0, assuming P ≠ NP.
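For contrast with the noisy case the abstract calls famously open, noise-free parities are easy: Gaussian elimination over GF(2) recovers the secret set exactly, and random label noise is precisely what defeats this approach. A minimal sketch on toy data:

```python
import random

def learn_parity(samples):
    """Recover the secret subset of a parity function from noise-free labeled
    examples by Gaussian elimination over GF(2). Each sample is (x, label)
    with x a 0/1 list and label = <x, s> mod 2 for a hidden s."""
    n = len(samples[0][0])
    rows = [x[:] + [y] for x, y in samples]  # augmented matrix over GF(2)
    pivots, r = [], 0
    for c in range(n):
        piv = next((i for i in range(r, len(rows)) if rows[i][c]), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    s = [0] * n
    for i, c in enumerate(pivots):
        s[c] = rows[i][n]
    return s

random.seed(1)
n = 16
secret = [random.randint(0, 1) for _ in range(n)]
samples = []
for _ in range(100):
    x = [random.randint(0, 1) for _ in range(n)]
    y = sum(a & b for a, b in zip(x, secret)) % 2
    samples.append((x, y))

recovered = learn_parity(samples)
```

With random classification noise, some equations in the system are simply wrong, and elimination propagates those errors everywhere — hence the appeal to the very different algorithm of [BKW03].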
Margin based active learning
 Proc. of the 20th Conference on Learning Theory, 2007
Abstract

Cited by 56 (9 self)
Abstract. We present a framework for margin-based active learning of linear separators. We instantiate it for a few important cases, some of which have been previously considered in the literature. We analyze the effectiveness of our framework both in the realizable case and in a specific noisy setting related to the Tsybakov small noise condition.
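In the realizable case, the margin-based loop the abstract describes amounts to: fit the labels queried so far, then request labels only for points inside the current hypothesis's margin. A toy 1-D threshold version — the pool size, stopping rule, and oracle below are illustrative, not the paper's instantiation:

```python
import random

random.seed(2)
true_threshold = 0.37            # unknown to the learner
pool = sorted(random.uniform(0.0, 1.0) for _ in range(1000))
queries = 0

def query_label(x):
    """Label oracle; every call counts as one (expensive) label request."""
    global queries
    queries += 1
    return 1 if x >= true_threshold else -1

# Maintain an interval [lo, hi] that must contain the threshold and only
# request labels for pool points inside it -- the margin of the current
# hypothesis. Points outside the interval are already implied by past labels.
lo, hi = 0.0, 1.0
while hi - lo > 1e-3:
    inside = [x for x in pool if lo < x < hi]
    if not inside:
        break
    x = inside[len(inside) // 2]   # most informative unlabeled point
    if query_label(x) == 1:
        hi = x
    else:
        lo = x
estimate = (lo + hi) / 2
```

The label complexity is logarithmic in the pool size here, versus linear for passive learning — the exponential savings that motivate the margin-based framework.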
Bounded Independence Fools Halfspaces
 In Proc. 50th Annual Symposium on Foundations of Computer Science (FOCS), 2009
Abstract

Cited by 46 (18 self)
We show that any distribution on {−1, +1}^n that is k-wise independent fools any halfspace (a.k.a. linear threshold function) h: {−1, +1}^n → {−1, +1}, i.e., any function of the form h(x) = sign(∑_{i=1}^n w_i x_i − θ) where w_1, …, w_n, θ are arbitrary real numbers, with error ε for k = O(ε^{−2} log^2(1/ε)). Our result is tight up to log(1/ε) factors. Using standard constructions of k-wise independent distributions, we obtain the first explicit pseudorandom generators G: {−1, +1}^s → {−1, +1}^n that fool halfspaces. Specifically, we fool halfspaces with error ε and seed length s = k · log n = O(log n · ε^{−2} log^2(1/ε)). Our approach combines classical tools from real approximation theory with structural results on halfspaces by Servedio (Comput. Complexity 2007).
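The "standard constructions of k-wise independent distributions" invoked here are typically polynomial evaluation over a finite field: a random degree-(k−1) polynomial evaluated at distinct points yields k-wise independent uniform values, and taking one bit of each value gives k-wise independent unbiased signs. The sketch below uses GF(2^4) and an exhaustive check of the k = 2 case; the field size and the check are illustrative choices, not the paper's:

```python
from collections import Counter

def gf16_mul(a, b):
    """Multiply two elements of GF(2^4), reducing by x^4 + x + 1 (0b10011)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x10:
            a ^= 0x13
        b >>= 1
    return r

def gen_bits(seed, n):
    """Stretch a seed of k field elements (coefficients of a degree-(k-1)
    polynomial p over GF(2^4)) into n signs: sign i is the low bit of p(i).
    Evaluations of a random low-degree polynomial at distinct points are
    k-wise independent and uniform, so the signs are exactly k-wise
    independent and unbiased (for up to 15 distinct nonzero points)."""
    out = []
    for i in range(1, n + 1):
        v = 0
        for c in reversed(seed):          # Horner evaluation of p at i
            v = gf16_mul(v, i) ^ c
        out.append(1 if v & 1 else -1)
    return out

# Exhaustive sanity check of the k = 2 case on n = 4 coordinates: over all
# 16^2 seeds, a fixed pair of output coordinates takes each of the four
# sign patterns equally often -- exact pairwise independence.
counts = Counter()
for c0 in range(16):
    for c1 in range(16):
        x = gen_bits([c0, c1], 4)
        counts[(x[0], x[1])] += 1
```

The seed is k field elements, i.e., k · log n bits up to the field-size choice, which is exactly the s = k · log n seed length the abstract plugs its bound on k into.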
Hardness of learning halfspaces with noise
 In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, 2006
Abstract

Cited by 44 (3 self)
Learning an unknown halfspace (also called a perceptron) from labeled examples is one of the classic problems in machine learning. In the noise-free case, when a halfspace consistent with all the training examples exists, the problem can be solved in polynomial time using linear programming. However, under the promise that a halfspace consistent with a (1 − ε) fraction of the examples exists (for some small constant ε > 0), it was not known how to efficiently find a halfspace that is correct on even 51% of the examples. Nor was a hardness result known that ruled out getting agreement on more than 99.9% of the examples. In this work, we close this gap in our understanding, and prove that even a tiny amount of worst-case noise makes the problem of learning halfspaces intractable in a strong sense. Specifically, for arbitrary ε, δ > 0, we prove that given a set of example-label pairs from the hypercube, a (1 − ε) fraction of which can be explained by a halfspace, it is NP-hard to find a halfspace that correctly labels a (1/2 + δ) fraction of the examples. The hardness result is tight, since it is trivial to get agreement on 1/2 of the examples. In learning-theory parlance, we prove that weak proper agnostic learning of halfspaces is hard. This settles a question that was raised by Blum et al. in their work on learning halfspaces in the presence of random classification noise [10], and in some more recent works as well. Along the way, we also obtain a strong hardness result for another basic computational problem: solving a linear system over the rationals.
Some Topics in Analysis of Boolean Functions
Abstract

Cited by 44 (0 self)
This article accompanies a tutorial talk given at the 40th ACM STOC conference. In it, we give a brief introduction to Fourier analysis of boolean functions and then discuss some applications: Arrow’s Theorem and other ideas from the theory of Social Choice; the Bonami–Beckner Inequality as an extension of Chernoff/Hoeffding bounds to higher-degree polynomials; and hardness for approximation algorithms.
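The Fourier analysis the tutorial introduces expands a boolean function f over the parity functions χ_S, with coefficient f̂(S) = E_x[f(x)·χ_S(x)]; for tiny n the coefficients can be computed by brute force. A sketch using 3-bit majority, whose expansion Maj3 = (1/2)(x_1 + x_2 + x_3) − (1/2)x_1x_2x_3 is known in closed form:

```python
from itertools import product

def fourier_coefficients(f, n):
    """Brute-force Fourier expansion of f: {-1,1}^n -> {-1,1}.
    For each subset S (encoded as a 0/1 tuple), the coefficient is
    f_hat(S) = E_x[f(x) * prod_{i in S} x_i] under the uniform distribution."""
    points = list(product([-1, 1], repeat=n))
    coeffs = {}
    for S in product([0, 1], repeat=n):
        total = 0
        for x in points:
            chi = 1
            for i in range(n):
                if S[i]:
                    chi *= x[i]       # character chi_S(x)
            total += f(x) * chi
        coeffs[S] = total / len(points)
    return coeffs

maj3 = lambda x: 1 if sum(x) > 0 else -1
coeffs = fourier_coefficients(maj3, 3)
```

The computed coefficients are 1/2 on each singleton, −1/2 on the full set, and 0 elsewhere, and their squares sum to 1 — Parseval's identity for a ±1-valued function.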
Bounded Independence Fools Degree-2 Threshold Functions
Abstract

Cited by 31 (9 self)
Let x be a random vector coming from any k-wise independent distribution over {−1, 1}^n. For an n-variate degree-2 polynomial p, we prove that E[sgn(p(x))] is determined up to an additive ε for k = poly(1/ε). This answers an open question of Diakonikolas et al. (FOCS 2009). Using standard constructions of k-wise independent distributions, we obtain a broad class of explicit generators that ε-fool the class of degree-2 threshold functions with seed length log n · poly(1/ε). Our approach is quite robust: it easily extends to yield that the intersection of any constant number of degree-2 threshold functions is ε-fooled by poly(1/ε)-wise independence. Our results also hold if the entries of x are k-wise independent standard normals, implying for example that bounded independence derandomizes the Goemans–Williamson hyperplane rounding scheme. To achieve our results, we introduce a technique we dub multivariate FT-mollification, a generalization of the univariate form introduced by Kane et al. (SODA 2010) in the context of streaming algorithms. Along the way we prove a generalized hypercontractive inequality for quadratic forms which takes the operator norm of the associated matrix into account. These techniques may be of independent interest.
Agnostic Learning of Monomials by Halfspaces is Hard
Abstract

Cited by 27 (10 self)
We prove the following strong hardness result for learning: given a distribution on labeled examples from the hypercube such that there exists a monomial (or conjunction) consistent with a (1 − ϵ) fraction of the examples, it is NP-hard to find a halfspace that is correct on a (1/2 + ϵ) fraction of the examples, for arbitrary constant ϵ > 0. In learning-theory terms, weak agnostic learning of monomials by halfspaces is NP-hard. This hardness result bridges between and subsumes two previous results which showed similar hardness results for the proper learning of monomials and halfspaces. As immediate corollaries of our result, we give the first optimal hardness results for weak agnostic learning of decision lists and majorities. Our techniques are quite different from previous hardness proofs for learning. We use an invariance principle and sparse approximation of halfspaces from recent work on fooling halfspaces to give a new natural list decoding of a halfspace in the context of dictatorship tests/label cover reductions. In addition, unlike previous invariance-principle-based proofs, which are only known to give Unique Games hardness, we give a reduction from a smooth version of Label Cover that is known to be NP-hard.
A Complete Characterization of Statistical Query Learning with Applications to Evolvability
, 2009
Abstract

Cited by 27 (14 self)
The statistical query (SQ) learning model of Kearns is a natural restriction of the PAC learning model in which a learning algorithm is allowed to obtain estimates of statistical properties of the examples but cannot see the examples themselves [18]. We describe a new and simple characterization of the query complexity of learning in the SQ learning model. Unlike the previously known bounds on SQ learning [7, 9, 33, 3, 28], our characterization preserves the accuracy and the efficiency of learning. The preservation of accuracy implies that our characterization gives the first characterization of SQ learning in the agnostic learning framework of Haussler and Kearns, Schapire and Sellie [15, 20]. The preservation of efficiency allows us to derive a new technique for the design of evolutionary algorithms in Valiant’s model of evolvability [32]. We use this technique to demonstrate the existence of a large class of monotone evolutionary learning algorithms based on square-loss fitness estimation. These results differ significantly from the few known evolutionary algorithms and give evidence that evolvability in Valiant’s model is a more versatile phenomenon than there had been previous reason to suspect.
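An SQ oracle of the kind the abstract describes returns expectations of query functions over the example distribution, up to an adversarial tolerance, and the learner never sees individual examples. A toy simulation — the target concept, tolerance, and sample size below are illustrative, not from the paper:

```python
import random

def sq_oracle(query, tau, examples):
    """Toy statistical-query oracle: returns E[query(x, f(x))] over a sample,
    perturbed within the tolerance tau, as the SQ model permits. Individual
    examples are never exposed to the caller."""
    avg = sum(query(x, y) for x, y in examples) / len(examples)
    return avg + random.uniform(-tau, tau)

random.seed(3)
n, relevant = 8, 5
examples = []
for _ in range(2000):
    x = [random.choice([-1, 1]) for _ in range(n)]
    examples.append((x, x[relevant]))    # target concept: f(x) = x_5

# One statistical query per coordinate: estimate the correlation E[x_i f(x)].
correlations = [sq_oracle(lambda x, y, i=i: x[i] * y, 0.1, examples)
                for i in range(n)]
best = max(range(n), key=lambda i: correlations[i])
```

The query complexity of a learner in this model — how many such calls it needs, at what tolerance — is exactly the quantity the paper's characterization pins down.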
Agnostically Learning Decision Trees
, 2008
Abstract

Cited by 24 (10 self)
We give a query algorithm for agnostically learning decision trees with respect to the uniform distribution on inputs. Given black-box access to an arbitrary binary function f on the n-dimensional hypercube, our algorithm finds a function that agrees with f on almost (within an ε fraction) as many inputs as the best size-t decision tree, in time poly(n, t, 1/ε). This is the first polynomial-time algorithm for learning decision trees in a harsh noise model. We also give a proper agnostic learning algorithm for juntas, a subclass of decision trees, again using membership queries. Conceptually, the present paper parallels recent work towards agnostic learning of halfspaces [13]; algorithmically, it is significantly more challenging. The core of our learning algorithm is a procedure to implicitly solve a convex optimization problem over the L1 ball in 2^n dimensions using an approximate gradient projection method.
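The gradient projection method mentioned at the end repeatedly projects iterates onto an L1 ball; the paper's contribution is doing this implicitly in 2^n dimensions, but the explicit building block is the standard sort-based Euclidean projection — a generic sketch, not the paper's implicit procedure:

```python
def project_l1(v, radius=1.0):
    """Euclidean projection of v onto the L1 ball of the given radius.
    Sort-based soft-thresholding: find the threshold theta such that
    shrinking every coordinate toward 0 by theta lands exactly on the ball."""
    if sum(abs(x) for x in v) <= radius:
        return list(v)                      # already inside the ball
    u = sorted((abs(x) for x in v), reverse=True)
    cumsum, theta = 0.0, 0.0
    for i, ui in enumerate(u):
        cumsum += ui
        t = (cumsum - radius) / (i + 1)
        if ui > t:                          # coordinate i survives shrinkage
            theta = t
    return [max(abs(x) - theta, 0.0) * (1.0 if x >= 0 else -1.0) for x in v]
```

For example, project_l1([3.0, 1.0], 1.0) returns [1.0, 0.0]: the smaller coordinate is shrunk all the way to zero, the sparsity effect that makes the L1 ball attractive in this setting.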