Results 1 - 10 of 4,350
Europarl: A Parallel Corpus for Statistical Machine Translation
"... We collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament, which are published on the web. This corpus has found widespread use in the NLP community. Here, we focus on its acquisition and its application as training data for statistical machine translat ..."
Cited by 519 (1 self)
On Minimizing Training Corpus for Parser Acquisition
- In Proceedings of the Fifth Computational Natural Language Learning Workshop
, 2001
"... Many corpus-based natural language processing systems rely on using large quantities of annotated text as their training examples. Building this kind of resource is an expensive and labor-intensive project. To minimize effort spent on annotating examples that are not helpful to the training process, re ..."
Cited by 8 (0 self)
The Proposition Bank: An Annotated Corpus of Semantic Roles
- Computational Linguistics
, 2005
"... The Proposition Bank project takes a practical approach to semantic representation, adding a layer of predicate-argument information, or semantic role labels, to the syntactic structures of the Penn Treebank. The resulting resource can be thought of as shallow, in that it does not represent corefere ..."
Cited by 556 (22 self)
and to analyze the frequency of syntactic/semantic alternations in the corpus. We describe an automatic system for semantic role tagging trained on the corpus and discuss the effect on its performance of various types of information, including a comparison of full syntactic parsing with a flat representation
Headline generation using a training corpus
- Proceedings of CICLING-2001
, 2001
"... Abstract. This paper discusses fundamental issues involved in word selection for title generation. We review several common methods that have been used for title generation and compare the performance of those methods using an F1 metric. Both a KNN (k nearest neighbor) method, which we are the first ..."
Cited by 5 (0 self)
between the training corpus and the test collection. We also point out ways to improve the performance both from the learning side and from the generation side.
Improving alignment for SMT by reordering and augmenting the training corpus
"... We describe the LIU systems for English-German and German-English translation in the WMT09 shared task. We focus on two methods to improve the word alignment: (i) by applying Giza++ in a second phase to a reordered training corpus, where reordering is based on the alignments from the first phase, an ..."
Cited by 4 (3 self)
Automatic Word Sense Discrimination
- Journal of Computational Linguistics
, 1998
"... This paper presents context-group discrimination, a disambiguation algorithm based on clustering. Senses are interpreted as groups (or clusters) of similar contexts of the ambiguous word. Words, contexts, and senses are represented in Word Space, a high-dimensional, real-valued space in which closen ..."
Cited by 536 (1 self)
closeness corresponds to semantic similarity. Similarity in Word Space is based on second-order co-occurrence: two tokens (or contexts) of the ambiguous word are assigned to the same sense cluster if the words they co-occur with in turn occur with similar words in a training corpus. The algorithm
Probabilistic Latent Semantic Indexing
, 1999
"... Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized ..."
Cited by 1225 (10 self)
Pruning Training Corpus to Speedup Text Classification
- DEXA
, 2002
"... Abstract: With the rapid growth of online text information, efficient text classification has become one of the key techniques for organizing and processing text repositories. In this paper, an efficient text classification approach was proposed based on pruning training-corpus. By using the propos ..."
Cited by 3 (0 self)
Selecting articles from the language model training corpus
- In Proc. ICASSP
, 2005
"... In this paper we study the problem of identifying meaningful patterns (i.e., motifs) from biological data. The general version of this problem is NP-hard. Numerous algorithms have been proposed in the literature to solve this problem. Many of these algorithms fall under the category of heuristics. W ..."
Cited by 7 (2 self)
In this paper we study the problem of identifying meaningful patterns (i.e., motifs) from biological data. The general version of this problem is NP-hard. Numerous algorithms have been proposed in the literature to solve this problem. Many of these algorithms fall under the category of heuristics. We concentrate on exact algorithms in this paper. In particular, we concentrate on two different versions of the motif search problem and offer exact algorithms for them.
A Maximum Entropy Model for Part-Of-Speech Tagging
, 1996
"... This paper presents a statistical model which trains from a corpus annotated with Part-Of-Speech tags and assigns them to previously unseen text with state-of-the-art accuracy (96.6%). The model can be classified as a Maximum Entropy model and simultaneously uses many contextual "features" t ..."
Cited by 580 (1 self)