Results 1 - 5 of 5
Online Learning of Noisy Data with Kernels
Abstract

Cited by 13 (4 self)
We study online learning when individual instances are corrupted by adversarially chosen random noise. We assume the noise distribution is unknown, and may change over time with no restriction other than having zero mean and bounded variance. Our technique relies on a family of unbiased estimators for nonlinear functions, which may be of independent interest. We show that a variant of online gradient descent can learn functions in any dot-product (e.g., polynomial) or Gaussian kernel space with any analytic convex loss function. Our variant uses randomized estimates that need to query a random number of noisy copies of each instance, where with high probability this number is upper bounded by a constant. Allowing such multiple queries cannot be avoided: indeed, we show that online learning is in general impossible when only one noisy copy of each instance can be accessed.
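As a toy illustration of the unbiased-estimator idea in this abstract (not the paper's actual construction), the product of two independent noisy copies of x is an unbiased estimate of the nonlinear quantity x^2, whereas squaring a single noisy copy is biased upward by the noise variance. The Gaussian noise model, the parameters, and all names below are illustrative assumptions:

```python
import random

def noisy_copy(x, sigma=0.5):
    """Return x corrupted by zero-mean noise (Gaussian, for illustration)."""
    return x + random.gauss(0.0, sigma)

def unbiased_square_estimate(query, n_trials):
    """Average of products of two independent noisy copies.

    E[(x + n1)(x + n2)] = x^2 when n1, n2 are independent and zero-mean,
    whereas E[(x + n1)^2] = x^2 + Var[noise] is biased upward.
    """
    total = 0.0
    for _ in range(n_trials):
        total += query() * query()   # two fresh noisy copies per trial
    return total / n_trials

random.seed(0)
x = 2.0
est = unbiased_square_estimate(lambda: noisy_copy(x), 200_000)
print(est)  # close to x^2 = 4.0
```

Note that each estimate consumes two noisy copies of the same instance, the simplest case of the multiple-query access the paper shows is unavoidable.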
On the Noise Sensitivity of Monotone Functions
, 2003
Abstract

Cited by 11 (4 self)
It is known that for all monotone functions f : {0, 1}^n → {0, 1}, if x ∈ {0, 1}^n is chosen uniformly at random and y is obtained from x by flipping each of the bits of x independently with probability ɛ = n^(−α), then P[f(x) ≠ f(y)] < c·n^(−α+1/2), for some c > 0. Previously, the best construction of monotone functions satisfying P[f_n(x) ≠ f_n(y)] ≥ δ, where 0 < δ < 1/2, required ɛ ≥ c(δ)·n^(−α), where α = 1 − ln 2 / ln 3 = 0.36907..., and c(δ) > 0. We improve this result by achieving, for every 0 < δ < 1/2, P[f_n(x) ≠ f_n(y)] ≥ δ, with:
• ɛ = c(δ)·n^(−α) for any α < 1/2, using the recursive majority function with arity k = k(α);
• ɛ = c(δ)·n^(−1/2)·log^t n for t = log_2 √(π/2) = .3257..., using an explicit recursive majority function with increasing arities; and,
• ɛ = c(δ)·n^(−1/2), nonconstructively, following a probabilistic CNF construction due to Talagrand.
We also study the problem of achieving the best dependence on δ in the case that the noise rate ɛ is at least a small constant; the results we obtain are tight to within logarithmic factors.
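The noise-sensitivity quantity P[f(x) ≠ f(y)] from this abstract can be probed empirically. The sketch below (fixed arity 3 and depth 3, not the paper's k(α) construction; all parameters are illustrative) estimates it for recursive majority under ɛ-flip noise:

```python
import random

def maj3(a, b, c):
    """Majority of three 0/1 bits."""
    return 1 if a + b + c >= 2 else 0

def rec_maj3(bits):
    """Recursive 3-ary majority on 3^h bits."""
    while len(bits) > 1:
        bits = [maj3(bits[i], bits[i + 1], bits[i + 2])
                for i in range(0, len(bits), 3)]
    return bits[0]

def noise_sensitivity(f, n, eps, trials):
    """Estimate P[f(x) != f(y)]: x uniform, y = x with each bit flipped w.p. eps."""
    disagreements = 0
    for _ in range(trials):
        x = [random.getrandbits(1) for _ in range(n)]
        y = [b ^ (random.random() < eps) for b in x]
        disagreements += f(x) != f(y)
    return disagreements / trials

random.seed(1)
ns = noise_sensitivity(rec_maj3, 27, 0.1, 20_000)
print(ns)  # noticeably larger than the per-bit flip rate 0.1
```

Each majority level amplifies the effective flip probability, which is why recursion drives the sensitivity up toward the constant δ with a small per-bit noise rate.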
Online Learning of Noisy Data
Abstract

Cited by 5 (0 self)
We study online learning of linear and kernel-based predictors, when individual examples are corrupted by random noise, and both examples and noise type can be chosen adversarially and change over time. We begin with the setting where some auxiliary information on the noise distribution is provided, and we wish to learn predictors with respect to the squared loss. Depending on the auxiliary information, we show how one can learn linear and kernel-based predictors, using just one or two noisy copies of each example. We then turn to discuss a general setting where virtually nothing is known about the noise distribution, and one wishes to learn with respect to general losses and using linear and kernel-based predictors. We show how this can be achieved using a random, essentially constant number of noisy copies of each example. Allowing multiple copies cannot be avoided: indeed, we show that the setting becomes impossible when only one noisy copy of each instance can be accessed. To obtain our results we introduce several novel techniques, some of which might be of independent interest.
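A minimal sketch of the two-noisy-copies, squared-loss setting this abstract describes: when the noise in the two copies is independent and zero-mean, (⟨w, x1⟩ − y)·x2 is an unbiased estimate of the clean gradient (⟨w, x⟩ − y)·x. The Gaussian noise model, step size, dimensions, and all names are assumptions for illustration, not the paper's algorithm:

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def noisy(x, sigma):
    """One noisy copy of x with independent zero-mean Gaussian noise."""
    return [xi + random.gauss(0.0, sigma) for xi in x]

def train(w_true, dim, rounds, eta, sigma):
    """Online gradient descent for squared loss from two noisy copies per round.

    Using x1 inside the residual and the independent copy x2 as the
    multiplier keeps the gradient estimate unbiased, since the two
    noise vectors are independent with mean zero.
    """
    w = [0.0] * dim
    for _ in range(rounds):
        x = [random.uniform(-1, 1) for _ in range(dim)]
        y = dot(w_true, x)                      # clean label
        x1, x2 = noisy(x, sigma), noisy(x, sigma)
        err = dot(w, x1) - y
        w = [wi - eta * err * x2i for wi, x2i in zip(w, x2)]
    return w

random.seed(2)
w_true = [1.0, -2.0, 0.5]
w = train(w_true, dim=3, rounds=50_000, eta=0.01, sigma=0.3)
print(w)  # approaches w_true
```

Squaring a single noisy copy inside the gradient would instead add a bias term proportional to the noise variance, which is exactly what the second copy removes.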
Learning Juntas in the Presence of Noise
, 2005
Abstract
The combination of two major challenges in machine learning is investigated: dealing with large amounts of irrelevant information and learning from noisy data. It is shown that large classes of Boolean concepts that depend on a small number of variables (so-called juntas) can be learned efficiently from random examples corrupted by random attribute and classification noise. To accomplish this goal, a two-phase algorithm is presented that copes with several problems arising from the presence of noise: firstly, a suitable method for approximating Fourier coefficients in the presence of noise is applied to infer the relevant variables. Secondly, as one cannot simply read off a truth table from the examples as in the noise-free case, an alternative method to build a hypothesis is established and applied to the examples restricted to the relevant variables. In particular, for the class of monotone juntas depending on d out of n variables, the sample complexity is polynomial in log(n/δ), 2^d, γ_a^(−d), and γ_b^(−1), where δ is the confidence parameter and γ_a, γ_b > 0 are noise parameters bounding the noise rates away from 1/2. The running time is bounded by the sample complexity times a polynomial in n. So far, all results hold for the case of uniformly distributed examples, the only case that (apart from side notes) has been studied in the literature yet. We show how to extend our methods to non-uniformly distributed examples and derive new results for monotone juntas. For the attribute noise, we have to assume that it is generated by a product distribution, since otherwise fault-tolerant learning is in general impossible: we construct a noise distribution P and a concept class C such that it is impossible to learn C under P-noise.