## The tradeoffs of large scale learning (2008)

Venue: | Advances in Neural Information Processing Systems 20 |

Citations: | 251 - 4 self |

### Citations

4769 | Pattern Classification and scene analysis
- Duda, Hart
- 1973
Citation Context ... best learning algorithms. Maybe more surprisingly, certain algorithms perform well regardless of the assumed rate for the statistical estimation error. 2 Approximate Optimization 2.1 Setup Following [6, 2], we consider a space of input-output pairs (x,y) ∈ X × Y endowed with a probability distribution P(x,y). The conditional distribution P(y|x) represents the unknown relationship between inputs and out... |

1960 | A theory of the learnable
- Valiant
- 1984
Citation Context ...xity of the underlying optimization algorithms in non-trivial ways. 1 Motivation The computational complexity of learning algorithms has seldom been taken into account by the learning theory. Valiant [1] states that a problem is “learnable” when there exists a probably approximatively correct learning algorithm with polynomial complexity. Whereas much progress has been made on the statistical aspect ... |

952 | Estimation of Dependences Based on Empirical Data
- Vapnik
- 1982
Citation Context ... that a problem is “learnable” when there exists a probably approximatively correct learning algorithm with polynomial complexity. Whereas much progress has been made on the statistical aspect (e.g., [2, 3, 4]), very little has been told about the complexity side of this proposal (e.g., [5].) Computational complexity becomes the limiting factor when one envisions large amounts of training data. Two importa... |

570 | Shallow Parsing with Conditional Random Fields
- Sha, Pereira
- 2003
Citation Context ...ng a F1 measure that takes into account both the segment boundaries and the segment classes. The chunking task has been successfully approached using Conditional Random Fields (Lafferty et al., 2001; Sha and Pereira, 2003) to tag the words with labels indicating the class and the boundaries of each segment. Our baseline is the Conditional Random Field model provided with the CRF++ software (Kudo, 2007). Our CRFSGD implementatio... |

519 | Pegasos: Primal Estimated sub-GrAdient SOlver for SVM - Shalev-Shwartz, Singer, et al. - 2007 |

466 | Numerical Methods for Unconstrained Optimization and Nonlinear Equations
- Dennis, Schnabel
- 1983
Citation Context ...er than $1-\eta$: $\operatorname{tr}(G H^{-1}) \le \nu$ and $\operatorname{EigenSpectrum}(H) \subset [\lambda_{\min}, \lambda_{\max}]$ (1.10). The condition number $\kappa = \lambda_{\max}/\lambda_{\min}$ provides a convenient measure of the difficulty of the optimization problem (e.g. Dennis and Schnabel, 1983.) The assumption $\lambda_{\min} > 0$ avoids complications with stochastic gradient algorithms. This assumption is weaker than strict convexity because it only applies in the vicinity of the optimum. For instanc... |

164 | Convexity, classification, and risk bounds
- Bartlett, Jordan, et al.
- 2006
Citation Context ...ation, estimation, and optimization errors (1.2). It is often accepted that these upper bounds give a realistic idea of the actual convergence rates (Vapnik et al., 1994; Bousquet, 2002; Tsybakov, 2004; Bartlett et al., 2006). Another way to find comfort in this approach is to say that we study guaranteed convergence rates instead of the possibly pathological special cases. We are studying the asymptotic properties of th... |

156 | Statistical Behavior and Consistency of Classification Methods based on Convex Risk Minimization
- Zhang
- 2001
Citation Context ...estimation errors. This tradeoff has been extensively discussed in the literature [2, 3] and lead to excess errors that scale between the inverse and the inverse square root of the number of examples [7, 8]. 2.2 Optimization Error Finding fn by minimizing the empirical risk En(f) is often a computationally expensive operation. Since the empirical risk En(f) is already an approximation of the expected ri... |

140 | Accelerated training of conditional random fields with stochastic gradient methods - Vishwanathan, Schraudolph, et al. |

85 | SVM optimization: Inverse dependence on training set size - Shalev-Shwartz, Srebro - 2008 |

69 | A few notes on statistical learning theory
- Mendelson
- 2003
Citation Context ...unds on the complexity of classes of linear functions (e.g., Bousquet, 2002) yields the following result: $\mathcal{E} = \mathcal{E}_{\mathrm{app}} + \mathcal{E}_{\mathrm{est}} + \mathcal{E}_{\mathrm{opt}} = \mathbb{E}\left[ E(\tilde{f}_n) - E(f^*) \right] \le c \left( \mathcal{E}_{\mathrm{app}} + \frac{d}{n} \log \frac{n}{d} + \rho \right)$. (1.7) See also (Mendelson, 2003; Bartlett and Mendelson, 2006) for more bounds taking into account the optimization accuracy. 3.2 Gradient Optimization Algorithms We now discuss and compare the asymptotic learning properties of four gradient o... |

62 | The importance of convexity in learning with squared loss - Lee, Bartlett, et al. - 1998 |

55 | Measuring the VC dimension of a learning machine
- Vapnik, Levin, et al.
- 1994
Citation Context ...implifications: We are studying upper bounds of the approximation, estimation, and optimization errors (1.2). It is often accepted that these upper bounds give a realistic idea of the actual convergence rates (Vapnik et al., 1994; Bousquet, 2002; Tsybakov, 2004; Bartlett et al., 2006). Another way to find comfort in this approach is to say that we study guaranteed convergence rates instead of the possibly pathological special... |

37 | On the complexity of loading shallow neural networks
- Judd
- 1988
Citation Context ...ning algorithm with polynomial complexity. Whereas much progress has been made on the statistical aspect (e.g., [2, 3, 4]), very little has been told about the complexity side of this proposal (e.g., [5].) Computational complexity becomes the limiting factor when one envisions large amounts of training data. Two important examples come to mind: • Data mining exists because competitive advantages can ... |

37 | Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms, Ph.D. dissertation, Ecole Polytechnique
- Bousquet
- 2002
Citation Context ...studying upper bounds of the approximation, estimation, and optimization errors (1.2). It is often accepted that these upper bounds give a realistic idea of the actual convergence rates (Vapnik et al., 1994; Bousquet, 2002; Tsybakov, 2004; Bartlett et al., 2006). Another way to find comfort in this approach is to say that we study guaranteed convergence rates instead of the possibly pathological special cases. We are s... |

29 | Introduction to the CoNLL-2000 Shared Task: Chunking - Tjong Kim Sang, Buchholz - 2000 |

17 | A statistical study of on-line learning
- Murata
- 1998
Citation Context ...−1 t ∂w ℓ(fw(t)(xt), yt). Unlike standard gradient algorithms, using the second order information does not change the influence of ρ on the convergence rate but improves the constants. Using again (Murata, 1998, theorem 4), accuracy ρ is reached after ν/ρ + o(1/ρ) iterations. For each of the four gradient algorithms, the first three columns of table 1.2 report the time for a single iteration, the number of ite... |

16 | CRF++: Yet another CRF toolkit - Kudo |

14 | Empirical minimization. Probability Theory and Related
- Bartlett, Mendelson
Citation Context ... that a problem is “learnable” when there exists a probably approximatively correct learning algorithm with polynomial complexity. Whereas much progress has been made on the statistical aspect (e.g., [2, 3, 4]), very little has been told about the complexity side of this proposal (e.g., [5].) Computational complexity becomes the limiting factor when one envisions large amounts of training data. Two importa... |

11 | Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics - Boucheron, Bousquet, Lugosi - 2005 |

6 | QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines - Hush, Kelly, et al. - 2006 |

3 | Fast rates for support vector machines
- Scovel, Steinwart
- 2005
Citation Context ...estimation errors. This tradeoff has been extensively discussed in the literature [2, 3] and lead to excess errors that scale between the inverse and the inverse square root of the number of examples [7, 8]. 2.2 Optimization Error Finding fn by minimizing the empirical risk En(f) is often a computationally expensive operation. Since the empirical risk En(f) is already an approximation of the expected ri... |

2 | Trust region Newton method for logistic regression - Lin, Weng, Keerthi - 2008 |

1 | RCV1: A new benchmark collection for text categorization research - Lewis, Yang, et al. |