## Distributed Training Strategies for the Structured Perceptron

Citations: 74 (4 self)

### Citations

3550 | Bagging predictors (Breiman, 1996)
> Context: …erceptron (f-measure of 87.9 vs. 85.8). We suspect this happens for two reasons. First, the parameter mixing has a bagging-like effect which helps to reduce the variance of the per-shard classifiers (Breiman, 1996). Second, the fact that parameter mixing is just a form of parameter averaging perhaps has the same effect as the averaged perceptron. Our second set of experiments looked at the much more computatio…

3376 | Conditional random fields: Probabilistic models for segmenting and labeling sequence data (Lafferty, McCallum, et al., 2001)
> Context: …e structured perceptron has many desirable properties, most notably that there is no need to calculate a partition function, which is necessary for other structured prediction paradigms such as CRFs (Lafferty et al., 2001). Furthermore, it is robust to approximate inference, which is often required for problems where the search space is too large and where strong structural independence assumptions are insufficient, s…

3206 | MapReduce: Simplified Data Processing on Large Clusters (Dean, Ghemawat, et al., 2004)
> Context: …hine learning algorithms are typically designed for a single machine, and designing an efficient training mechanism for analogous algorithms on a computing cluster – often via a map-reduce framework (Dean and Ghemawat, 2004) – is an active area of research (Chu et al., 2007). However, unlike many batch learning algorithms that can easily be distributed through the gradient calculation, a distributed training analog for…

1130 | The perceptron: a probabilistic model for information storage and organization (Rosenblatt, 1958)
> Context: …y parsing – to highlight the efficiency of this method. 1 Introduction One of the most popular training algorithms for structured prediction problems in natural language processing is the perceptron (Rosenblatt, 1958; Collins, 2002). The structured perceptron has many desirable properties, most notably that there is no need to calculate a partition function, which is necessary for other structured prediction para…

584 | Max-margin Markov networks (Taskar, Guestrin, et al., 2003)
> Context: …nkel et al., 2008). However, unlike perceptron, CRFs require the computation of a partition function, which is often expensive and sometimes intractable. Other batch learning algorithms include M3Ns (Taskar et al., 2004) and Structured SVMs (Tsochantaridis et al., 2004). Due to their efficiency, online learning algorithms have gained attention, especially for structured prediction tasks in NLP. In addition to the pe…

512 | Large margin classification using the perceptron algorithm (Freund, Schapire, 1999)
> Context: …inal weight vector is a weighted average of all parameters that occur during training, which he called the averaged perceptron and can be viewed as an approximation to the voted perceptron algorithm (Freund and Schapire, 1999). 4 Distributed Structured Perceptron In this section we examine two distributed training strategies for the perceptron algorithm based on parameter mixing. 4.1 Parameter Mixing Distributed training…
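
The averaged perceptron this snippet describes keeps a running sum of every intermediate weight vector and returns the mean. A minimal single-machine sketch under illustrative assumptions (`feat`, `label_set`, and `dim` are toy stand-ins, not the paper's setup):

```python
# Averaged perceptron: return the mean of all intermediate weight vectors,
# an approximation to the voted perceptron (Freund and Schapire, 1999).
# `feat`, `label_set`, and `dim` are illustrative assumptions.
import numpy as np

def train_averaged(data, feat, label_set, T=10, dim=4):
    """data: list of (x, y) pairs; feat(x, y) -> length-`dim` vector."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)   # running sum of every intermediate weight vector
    n = 0
    for _ in range(T):
        for x, y in data:
            # Inference: best-scoring output under the current weights.
            y_hat = max(label_set, key=lambda yp: w @ feat(x, yp))
            if y_hat != y:
                w += feat(x, y) - feat(x, y_hat)
            w_sum += w      # accumulate after every example, not only errors
            n += 1
    return w_sum / n        # averaged parameters
```

In practice the sum is usually maintained lazily so that only features touched by an update are accumulated; the dense version above is simply the clearest statement of the idea.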

441 | Support vector machine learning for interdependent and structured output spaces (Tsochantaridis, Hofmann, et al., 2004)
> Context: …tron, CRFs require the computation of a partition function, which is often expensive and sometimes intractable. Other batch learning algorithms include M3Ns (Taskar et al., 2004) and Structured SVMs (Tsochantaridis et al., 2004). Due to their efficiency, online learning algorithms have gained attention, especially for structured prediction tasks in NLP. In addition to the perceptron (Collins, 2002), others have looked at st…

421 | Online passive-aggressive algorithms (Crammer, Dekel, et al., 2006)
> Context: …[end of Figure 1, "The perceptron algorithm": 3. for t : 1..T; 4. let y′ = argmax_y′ w(k) · f(xt, y′); 5. if y′ ≠ yt; 6. w(k+1) = w(k) + f(xt, yt) − f(xt, y′); 7. k = k + 1; 8. return w(k)] …al., 2005; Crammer et al., 2006), the recently introduced confidence-weighted learning (Dredze et al., 2008) and coordinate descent algorithms (Duchi and Singer, 2009). 3 Structured Perceptron The structured perceptron was introduc…
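
The Figure 1 pseudocode quoted in this snippet can be turned into a short runnable sketch. The toy feature map and exhaustively enumerable output set below are illustrative assumptions; in a real structured task the argmax is task-specific inference:

```python
# Structured perceptron training loop in the style of Figure 1:
# predict the best-scoring output and, on a mistake, move the weights
# toward the gold features and away from the predicted ones.
# `feat`, `label_set`, and `dim` are illustrative assumptions.
import numpy as np

def train_perceptron(data, feat, label_set, T=10, dim=4):
    """data: list of (x, y) pairs; feat(x, y) -> length-`dim` vector."""
    w = np.zeros(dim)
    for _ in range(T):                  # T passes over the training data
        for x, y in data:
            # Figure 1, line 4: y' = argmax_y' w(k) . f(x_t, y')
            y_hat = max(label_set, key=lambda yp: w @ feat(x, yp))
            if y_hat != y:              # lines 5-7: mistake-driven update
                w += feat(x, y) - feat(x, y_hat)
    return w
```

In the structured setting `label_set` is exponentially large, so the `max` over it is replaced by task-specific inference (e.g. Viterbi for sequences), which is why robustness to approximate inference matters.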

389 | Distributed asynchronous deterministic and stochastic gradient optimization algorithms (Tsitsiklis, Bertsekas, et al., 1986)

333 | CoNLL-X Shared Task on Multilingual Dependency Parsing (Buchholz, Marsi, 2006)
> Context: …converged models. language treebank and currently one of the largest dependency treebanks in existence. We used the CoNLL-X training (72703 sentences) and testing splits (365 sentences) of this data (Buchholz and Marsi, 2006) and dependency parsing models based on McDonald and Pereira (2006) which factors features over pairs of dependency arcs in a tree. To parse all the sentences in the PDT, one must use a non-projectiv…

293 | Online large-margin training of dependency parsers (McDonald, Crammer, et al., 2005)

220 | Map-reduce for machine learning on multicore (Chu, Kim, et al., 2006)
> Context: …le machine, and designing an efficient training mechanism for analogous algorithms on a computing cluster – often via a map-reduce framework (Dean and Ghemawat, 2004) – is an active area of research (Chu et al., 2007). However, unlike many batch learning algorithms that can easily be distributed through the gradient calculation, a distributed training analog for the perceptron is less clear cut. It employs online…
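
The map-reduce distribution this snippet describes corresponds to parameter mixing: each map task trains an independent perceptron on one data shard, and the reduce step averages the per-shard weight vectors. A single-process sketch under those assumptions (shard splitting, shard count, and helper names are illustrative):

```python
# Parameter mixing, map-reduce style: the "map" trains one perceptron per
# shard; the "reduce" averages the per-shard weight vectors into one model.
# `feat`, `label_set`, `n_shards`, and `dim` are illustrative assumptions.
import numpy as np

def perceptron_shard(shard, feat, label_set, T, dim):
    """Map step: train an independent perceptron on a single data shard."""
    w = np.zeros(dim)
    for _ in range(T):
        for x, y in shard:
            y_hat = max(label_set, key=lambda yp: w @ feat(x, yp))
            if y_hat != y:
                w += feat(x, y) - feat(x, y_hat)
    return w

def mix_parameters(data, feat, label_set, n_shards=2, T=10, dim=4):
    shards = [data[i::n_shards] for i in range(n_shards)]  # split the data
    shard_ws = [perceptron_shard(s, feat, label_set, T, dim) for s in shards]
    return np.mean(shard_ws, axis=0)                       # reduce: average
```

This one-shot average is the simpler of the two parameter-mixing strategies the paper examines; the second re-broadcasts the mixed weights to the shards between passes rather than averaging only once at the end.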

212 | Online Learning of Approximate Dependency Parsing Algorithms (McDonald, Pereira, 2006)

173 | Incremental parsing with the perceptron algorithm (Collins, Roark, 2004)
> Context: …it is robust to approximate inference, which is often required for problems where the search space is too large and where strong structural independence assumptions are insufficient, such as parsing (Collins and Roark, 2004; McDonald and Pereira, 2006; Zhang and Clark, 2008) and machine translation (Liang et al., 2006). However, like all structured prediction learning frameworks, the structured perceptron can still be cu…

155 | An End-to-End Discriminative Approach to Machine Translation (Liang, Bouchard-Cote, et al., 2006)
> Context: …too large and where strong structural independence assumptions are insufficient, such as parsing (Collins and Roark, 2004; McDonald and Pereira, 2006; Zhang and Clark, 2008) and machine translation (Liang et al., 2006). However, like all structured prediction learning frameworks, the structured perceptron can still be cumbersome to train. This is both due to the increasing size of available training sets as well as…

155 | On convergence proofs on perceptrons (Novikoff, 1962)

95 | Confidence-Weighted Linear Classification (Dredze, Crammer, et al., 2008)
> Context: …yt; 6. w(k+1) = w(k) + f(xt, yt) − f(xt, y′); 7. k = k + 1; 8. return w(k) [Figure 1: The perceptron algorithm] …al., 2005; Crammer et al., 2006), the recently introduced confidence-weighted learning (Dredze et al., 2008) and coordinate descent algorithms (Duchi and Singer, 2009). 3 Structured Perceptron The structured perceptron was introduced by Collins (2002) and we adopt much of the notation and presentation of t…

90 | A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing using beam-search (Zhang, Clark, 2008)

60 | Efficient, feature-based, conditional random field parsing (Finkel, Kleeman, et al., 2008)
> Context: …nal random fields (CRFs) (Lafferty et al., 2001), which is the structured analog of maximum entropy. As such, its training can easily be distributed through the gradient or sub-gradient computations (Finkel et al., 2008). However, unlike perceptron, CRFs require the computation of a partition function, which is often expensive and sometimes intractable. Other batch learning algorithms include M3Ns (Taskar et al., 20…

54 | Efficient large-scale distributed training of conditional maximum entropy models (Mann, McDonald, et al., 2009)

26 | Web-scale named entity recognition (Whitelaw, Kehlenbeck, et al., 2008)

24 | Efficient learning using forward-backward splitting (Duchi, Singer, 2009)
> Context: …k + 1; 8. return w(k) [Figure 1: The perceptron algorithm] …al., 2005; Crammer et al., 2006), the recently introduced confidence-weighted learning (Dredze et al., 2008) and coordinate descent algorithms (Duchi and Singer, 2009). 3 Structured Perceptron The structured perceptron was introduced by Collins (2002) and we adopt much of the notation and presentation of that study. The structured perceptron algorithm – which is id…

12 | Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition (Tjong Kim Sang, De Meulder, 2003)

8 | On-line learning with delayed label feedback (Mesterharm, 2005)
> Context: …that a linear term S in the convergence bound above is similar to convergence/regret bounds for asynchronous distributed online learning, which typically have bounds linear in the asynchronous delay (Mesterharm, 2005; Zinkevich et al., 2009). This delay will be on average roughly equal to the number of shards S. 6 Conclusions In this paper we have investigated distributing the structured perceptron via simple par…