
## Efficient large-scale distributed training of conditional maximum entropy models (2009)


Venue: Advances in Neural Information Processing Systems

Citations: 54 (3 self)

### Citations

3217 | Numerical Optimization
- Nocedal, Wright
- 1999
Citation Context: ...itional maxent models using a single processor. These include generalized iterative scaling [7], improved iterative scaling [8], gradient descent, conjugate gradient methods, and second-order methods [15, 18]. This paper examines distributed methods for training conditional maxent models that can scale to very large samples of up to 1B instances. Both batch algorithms and on-line training algorithms such ...

2679 | Building a large annotated corpus of English: The Penn treebank, Computational Linguistics 19:313–330
- Marcus
- 1993
Citation Context: ...ter than w_m. The convergence bound for w_μ contains two terms, one somewhat more favorable, one somewhat less than its counterpart term in the bound for w_m. [Table 2 fragment; columns: m, |Y|, |X|, sparsity, p. English POS [16]: 1 M, 24, 500 K, 0.001, 10; Sentiment: 9 M, 3, 500 K, 0.001, 10; RCV1-v2 [14]: 26 M, 103, 10 K, 0.08, 10; Speech: 50 M, 129, 39, 1.0, 499; Deja News Archive: 306 M, 8, 50 K, 0.002, 200; Deja News Archive 250K: 306 M, 8, 250 K, 0.0004...]

1335 | A Maximum Entropy Approach to Natural Language Processing
- Berger, Della Pietra, et al.
- 1996
Citation Context: ...benefits of the mixture weight method: this method consumes fewer resources, while achieving performance comparable to that of standard approaches. (Section 1, Introduction) Conditional maximum entropy models [1, 3], conditional maxent models for short, also known as multinomial logistic regression models, are widely used in applications, most prominently for multiclass classification problems with a large numbe...
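As the context notes, a conditional maxent model is multinomial logistic regression: p_w[y|x] ∝ exp(w · Φ(x, y)) with a per-input normalizer Z(x). A minimal sketch (the feature map `phi`, weights, and inputs below are illustrative, not taken from the paper):

```python
import math

def conditional_maxent(w, phi, x, labels):
    """p_w[y | x] = exp(w . phi(x, y)) / Z(x): the conditional maxent model."""
    scores = {y: sum(wi * fi for wi, fi in zip(w, phi(x, y))) for y in labels}
    z = sum(math.exp(s) for s in scores.values())   # partition function Z(x)
    return {y: math.exp(s) / z for y, s in scores.items()}

# Toy joint feature map Phi(x, y): class-conditional copies of the input.
def phi(x, y):
    return [x[0] if y == 0 else 0.0, x[1] if y == 1 else 0.0]

p = conditional_maxent(w=[1.0, 2.0], phi=phi, x=[0.5, 0.3], labels=[0, 1])
# p is a proper conditional distribution over the two labels
```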

1125 | Information theory and statistical mechanics
- Jaynes
- 1957
Citation Context: ...problems with a large number of classes in natural language processing [1, 3] and computer vision [12] over the last decade or more. These models are based on the maximum entropy principle of Jaynes [11], which consists of selecting, among the models approximately consistent with the constraints, the one with the greatest entropy. They benefit from a theoretical foundation similar to that of standard ...

657 | Inducing Features of Random Fields
- Della Pietra, Della Pietra, et al.
- 1997
Citation Context: ...tely consistent with the constraints, the one with the greatest entropy. They benefit from a theoretical foundation similar to that of standard maxent probabilistic models used for density estimation [8]. In particular, a duality theorem for conditional maxent models shows that these models belong to the exponential family. As shown by Lebanon and Lafferty [13], in the case of two classes, these model...

640 | RCV1: A New Benchmark Collection for Text Categorization Research
- Lewis, Yang, et al.
Citation Context: ...e somewhat more favorable, one somewhat less than its counterpart term in the bound for w_m. [Table 2 fragment; columns: m, |Y|, |X|, sparsity, p. English POS [16]: 1 M, 24, 500 K, 0.001, 10; Sentiment: 9 M, 3, 500 K, 0.001, 10; RCV1-v2 [14]: 26 M, 103, 10 K, 0.08, 10; Speech: 50 M, 129, 39, 1.0, 499; Deja News Archive: 306 M, 8, 50 K, 0.002, 200; Deja News Archive 250K: 306 M, 8, 250 K, 0.0004, 200; Gigaword [10]: 1,000 M, 96, 10 K, 0.001, 1000] Table 2: Description...

548 | On the method of bounded differences
- McDiarmid
- 1989
Citation Context: ...inequality holds: ‖w − w⋆‖ ≤ (R / (λ√(m/2))) (1 + √(log 1/δ)). (9) Proof. Let S and S′ be as before samples of size m differing by a single point. To derive this bound, we apply McDiarmid's inequality [17] to Ψ(S) = ‖w − w⋆‖. By the triangle inequality and Theorem 1, the following Lipschitz property holds: |Ψ(S′) − Ψ(S)| = |‖w′ − w⋆‖ − ‖w − w⋆‖| ≤ ‖w′ − w‖ ≤ 2R/(λm). (10) Thus, by McDiarmid'...
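For reference, the bounded-differences inequality being applied here (a standard statement, not quoted from the paper): if replacing any single point of the m-point sample S changes Ψ by at most c, then

```latex
\Pr\big[\Psi(S) - \mathbb{E}[\Psi(S)] \geq \epsilon\big]
  \;\leq\; \exp\!\left(-\frac{2\epsilon^2}{m\,c^2}\right),
\qquad c = \frac{2R}{\lambda m}.
```

Setting the right-hand side to δ and solving for ε gives ε = (2R/λ)√(log(1/δ)/(2m)) = (R/(λ√(m/2)))√(log 1/δ), which is exactly the deviation term in (9).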

503 | Generalized iterative scaling for log-linear models
- Darroch, Ratcliff
- 1972
Citation Context: ...r data sets of several million points. A number of algorithms have been described for batch training of conditional maxent models using a single processor. These include generalized iterative scaling [7], improved iterative scaling [8], gradient descent, conjugate gradient methods, and second-order methods [15, 18]. This paper examines distributed methods for training conditional maxent models that c...

478 | Prediction and entropy of printed English
- Shannon
- 1951
Citation Context: ...ixture weight and 40GB for distributed gradient method when we discard machine-to-disk traffic. For the largest experiment, we examined the task of predicting the next character in a sequence of text [19], which has implications for many natural language processing tasks. As a training and evaluation corpus we used the English Gigaword corpus [10] and used the full ASCII output space of that corpus of...

284 | A comparison of algorithms for maximum entropy parameter estimation
- Malouf
- 2002
Citation Context: ...itional maxent models using a single processor. These include generalized iterative scaling [7], improved iterative scaling [8], gradient descent, conjugate gradient methods, and second-order methods [15, 18]. This paper examines distributed methods for training conditional maxent models that can scale to very large samples of up to 1B instances. Both batch algorithms and on-line training algorithms such ...

257 | Stability and generalization
- Bousquet, Elisseeff
- 2002

256 | Logistic regression, AdaBoost and Bregman distances
- Collins, Schapire, et al.
- 2000
Citation Context: ...odels that can scale to very large samples of up to 1B instances. Both batch algorithms and on-line training algorithms such as that of [5] or stochastic gradient descent [21] can benefit from parallelization, but we concentrate here on batch distributed methods. We examine three common distributed training methods: a distributed gradien... (Footnote: This work was conducted while at Google Research, New York.)

216 | Map-reduce for machine learning on multicore
- Chu, Kim, et al.
Citation Context: ...nt descent [21] can benefit from parallelization, but we concentrate here on batch distributed methods. We examine three common distributed training methods: a distributed gradient computation method [4], a majority vote method, and a mixture weight method. We analyze and compare the CPU and network time complexity of each of these methods (Section 2) and present a theoretical analysis of conditional...
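The mixture weight method named in this context can be sketched as: train one model per shard independently, then average the resulting weight vectors. A minimal single-machine simulation (the plain-gradient logistic trainer and toy data below are illustrative, not the paper's setup):

```python
import math

def train_logreg(data, epochs=200, lr=0.5):
    """Binary logistic regression via plain gradient steps; returns weights."""
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for i in range(dim):
                w[i] += lr * (y - p) * x[i]   # log-likelihood gradient step
    return w

def mixture_weight(shards):
    """w_mu = (1/p) * sum_k w_k: average the p independently trained models."""
    ws = [train_logreg(shard) for shard in shards]
    p = len(ws)
    return [sum(w[i] for w in ws) / p for i in range(len(ws[0]))]

# Toy 1-D data (feature plus a bias term), split across 4 "machines".
data = [([u / 10.0, 1.0], 1 if u > 0 else 0) for u in range(-10, 11) if u != 0]
shards = [data[i::4] for i in range(4)]
w_mu = mixture_weight(shards)
acc = sum((sum(wi * xi for wi, xi in zip(w_mu, x)) > 0) == (y == 1)
          for x, y in data) / len(data)
```

Each shard never sees the others' data, so only the final weight vectors cross the network, which is the resource saving the context describes.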

132 | Feature hashing for large scale multitask learning
- Weinberger, Dasgupta, et al.
Citation Context: ...nd the Deja News Archive, a text topic classification problem generated from a collection of Usenet discussion forums from the years 1995-2000. For all text experiments, we used random feature mixing [9, 20] to control the size of the feature space. The results reported in Table 3 show that the accuracy of the mixture weight method consistently matches or exceeds that of the majority vote method. As expe...
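Random feature mixing (feature hashing) keeps the feature space at a fixed size by hashing feature names into buckets instead of maintaining a dictionary. A minimal sketch, assuming a sparse name-to-value representation; the dimension, hash function, and sign trick here are illustrative choices, not the papers' exact construction:

```python
import hashlib

def hash_features(features, dim=1024):
    """Fold sparse {name: value} features into a fixed-size dense vector."""
    v = [0.0] * dim
    for name, value in features.items():
        h = int(hashlib.md5(name.encode("utf8")).hexdigest(), 16)
        sign = 1.0 if (h >> 20) % 2 == 0 else -1.0   # optional sign hash
        v[h % dim] += sign * value                   # collisions simply add up
    return v

vec = hash_features({"token=the": 1.0, "token=cat": 1.0, "bigram=the_cat": 0.5})
```

The model then trains on the fixed-size vector, so memory no longer grows with the vocabulary; occasional collisions trade a little accuracy for that bound.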

114 | Solving Large Scale Linear Prediction Problems Using Stochastic Gradient Descent Algorithms
- Zhang
Citation Context: ...samples of up to 1B instances. Both batch algorithms and on-line training algorithms such as that of [5] or stochastic gradient descent [21] can benefit from parallelization, but we concentrate here on batch distributed methods. We examine three common distributed training methods: a distributed gradient computation method [4], a majority... (Footnote: This work was conducted while at Google Research, New York.)

103 | A survey of smoothing techniques for ME models
- Chen, Rosenfeld
Citation Context: ...benefits of the mixture weight method: this method consumes fewer resources, while achieving performance comparable to that of standard approaches. (Section 1, Introduction) Conditional maximum entropy models [1, 3], conditional maxent models for short, also known as multinomial logistic regression models, are widely used in applications, most prominently for multiclass classification problems with a large numbe...

96 | Boosting and maximum likelihood for exponential models
- Lebanon, Lafferty
- 2002
Citation Context: ...listic models used for density estimation [8]. In particular, a duality theorem for conditional maxent models shows that these models belong to the exponential family. As shown by Lebanon and Lafferty [13], in the case of two classes, these models are also closely related to AdaBoost, which can be viewed as solving precisely the same optimization problem with the same constraints, modulo a normalizatio...

62 | Using maximum entropy for automatic image annotation
- Jeon, Manmatha
- 2004
Citation Context: ...ic regression models, are widely used in applications, most prominently for multiclass classification problems with a large number of classes in natural language processing [1, 3] and computer vision [12] over the last decade or more. These models are based on the maximum entropy principle of Jaynes [11], which consists of selecting, among the models approximately consistent with the constraints, the o...

26 | Sample selection bias correction theory
- Cortes, Mohri, et al.
- 2008
Citation Context: ...ded, that is, there exists R > 0 such that for all (x, y) in X × Y, ‖Φ(x, y)‖ ≤ R. Our bounds are derived using techniques similar to those used by Bousquet and Elisseeff [2], or other authors, e.g., [6], in the analysis of stability. In what follows, for any w ∈ H and z = (x, y) ∈ X × Y, we denote by Lz(w) the negative log-likelihood −log p_w[y|x]. Theorem 1. Let S′ and S be two arbitrary samples of ...

24 | Small statistical models by random feature mixing
- Ganchev, Dredze
Citation Context: ...nd the Deja News Archive, a text topic classification problem generated from a collection of Usenet discussion forums from the years 1995-2000. For all text experiments, we used random feature mixing [9, 20] to control the size of the feature space. The results reported in Table 3 show that the accuracy of the mixture weight method consistently matches or exceeds that of the majority vote method. As expe...

11 | English Gigaword Third Edition, Linguistic Data Consortium
- Graff, Kong, et al.
- 2007
Citation Context: ...[Table 2 fragment; columns: m, |Y|, |X|, sparsity, p. ...0.001, 10; Sentiment: 9 M, 3, 500 K, 0.001, 10; RCV1-v2 [14]: 26 M, 103, 10 K, 0.08, 10; Speech: 50 M, 129, 39, 1.0, 499; Deja News Archive: 306 M, 8, 50 K, 0.002, 200; Deja News Archive 250K: 306 M, 8, 250 K, 0.0004, 200; Gigaword [10]: 1,000 M, 96, 10 K, 0.001, 1000] Table 2: Description of data sets. The column named sparsity reports the frequency of non-zero feature values for each data set. (Section 4, Experiments) We ran a number of experiment...