Two decades of Statistical Language Modeling: Where Do We Go From Here? (2000)
by R. Rosenfeld
Venue: Proceedings of the IEEE
Results 1-10 of 210 citing documents, sorted by citation count.

SRILM -- An extensible language modeling toolkit

by Andreas Stolcke - In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP), 2002
"... SRILM is a collection of C++ libraries, executable programs, and helper scripts designed to allow both production of and experimentation with statistical language models for speech recognition and other applications. SRILM is freely available for noncommercial purposes. The toolkit supports creation ..."
Abstract - Cited by 1218 (21 self)
SRILM is a collection of C++ libraries, executable programs, and helper scripts designed to allow both production of and experimentation with statistical language models for speech recognition and other applications. SRILM is freely available for noncommercial purposes. The toolkit supports creation and evaluation of a variety of language model types based on N-gram statistics, as well as several related tasks, such as statistical tagging and manipulation of N-best lists and word lattices. This paper summarizes the functionality of the toolkit and discusses its design and implementation, highlighting ease of rapid prototyping, reusability, and combinability of tools.
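To make the toolkit's core task concrete, here is a minimal Python sketch of what an N-gram toolkit like SRILM computes: N-gram counts from training text and the perplexity of held-out text under a smoothed model. The tiny corpus and interpolation weight are illustrative assumptions; SRILM itself is a set of C++ command-line tools with far more refined smoothing methods.

    # Sketch of the core computations an N-gram toolkit performs:
    # counting N-grams and scoring held-out text by perplexity.
    import math
    from collections import Counter

    def ngrams(tokens, n):
        return zip(*(tokens[i:] for i in range(n)))

    train = "<s> the cat sat </s> <s> the dog sat </s>".split()
    test = "<s> the cat sat </s>".split()

    uni, bi = Counter(train), Counter(ngrams(train, 2))
    V = len(set(train))

    def p_interp(w, prev, lam=0.7):
        # Jelinek-Mercer interpolation of bigram and (add-one) unigram
        # estimates; lam is a hypothetical weight, tuned in practice.
        p_uni = (uni[w] + 1) / (len(train) + V)
        p_bi = bi[(prev, w)] / uni[prev] if uni[prev] else 0.0
        return lam * p_bi + (1 - lam) * p_uni

    logprob = sum(math.log2(p_interp(w, prev)) for prev, w in ngrams(test, 2))
    print("perplexity:", 2 ** (-logprob / (len(test) - 1)))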

Citation Context

... suspects that important advances are possible, and indeed needed, to bring about significant breakthroughs in the application areas cited above—such breakthroughs just have been very hard to come by [2, 3]. Various software packages for statistical language modeling have been in use for many years—the basic algorithms are simple enough that one can easily implement them with reasonable effort for resea...

Computational Discovery of Gene Modules, Regulatory Networks and Expression Programs

by Georg Kurt Gerber, 2007
"... High-throughput molecular data are revolutionizing biology by providing massive amounts of information about gene expression and regulation. Such information is applicable both to furthering our understanding of fundamental biology and to developing new diagnostic and treatment approaches for diseas ..."
Abstract - Cited by 236 (17 self)
High-throughput molecular data are revolutionizing biology by providing massive amounts of information about gene expression and regulation. Such information is applicable both to furthering our understanding of fundamental biology and to developing new diagnostic and treatment approaches for diseases. However, novel mathematical methods are needed for extracting biological knowledge from high-dimensional, complex and noisy data sources. In this thesis, I develop and apply three novel computational approaches for this task. The common theme of these approaches is that they seek to discover meaningful groups of genes, which confer robustness to noise and compress complex information into interpretable models. I first present the GRAM algorithm, which fuses information from genome-wide expression and in vivo transcription factor-DNA binding data to discover regulatory networks of ...

Cluster-based retrieval using language models

by Xiaoyong Liu - In Proceedings of SIGIR, 2004
"... Previous research on cluster-based retrieval has been inconclusive as to whether it does bring improved retrieval effectiveness over document-based retrieval. Recent developments in the language modeling approach to IR have motivated us to re-examine this problem within this new retrieval framework. ..."
Abstract - Cited by 170 (13 self)
Previous research on cluster-based retrieval has been inconclusive as to whether it does bring improved retrieval effectiveness over document-based retrieval. Recent developments in the language modeling approach to IR have motivated us to re-examine this problem within this new retrieval framework. We propose two new models for cluster-based retrieval and evaluate them on several TREC collections. We show that cluster-based retrieval can perform consistently across collections of realistic size, and significant improvements over document-based retrieval can be obtained in a fully automatic manner and without relevance information provided by humans.

Citation Context

...w for parameter tuning. 3. CLUSTER-BASED RETRIEVAL USING LANGUAGE MODELS A statistical language model is a probability distribution over all possible sentences or other linguistic units in a language [15]. The basic approach for using language models for IR is to model the query generation process [14]. The general idea is to build a language model D for each document in the collection, and rank the d...
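The query-generation approach the context describes can be sketched in a few lines of Python: estimate a unigram model per document, smooth it with the collection model, and rank documents by the probability of generating the query. The toy documents and smoothing weight are illustrative assumptions; the cluster-based models in the paper additionally interpolate a cluster language model.

    # Query-likelihood ranking with Jelinek-Mercer smoothing: each
    # document model is mixed with the collection model, and documents
    # are ranked by log P(query | document).
    import math
    from collections import Counter

    docs = {
        "d1": "language models for retrieval".split(),
        "d2": "clustering documents by topic".split(),
    }
    collection = Counter(w for d in docs.values() for w in d)
    c_total = sum(collection.values())

    def score(query, doc, lam=0.5):
        dm = Counter(doc)
        return sum(
            math.log(lam * dm[w] / len(doc) + (1 - lam) * collection[w] / c_total)
            for w in query
        )

    query = "language retrieval".split()
    ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
    print(ranked)  # d1 should outrank d2 for this query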

A hierarchical Bayesian language model based on Pitman–Yor processes

by Yee Whye Teh - In Proceedings of Coling/ACL, 2006
"... We propose a new hierarchical Bayesian n-gram model of natural languages. Our model makes use of a generalization of the commonly used Dirichlet distributions called Pitman-Yor processes which produce power-law distributions more closely resembling those in natural languages. We show that an approxi ..."
Abstract - Cited by 148 (10 self)
We propose a new hierarchical Bayesian n-gram model of natural languages. Our model makes use of a generalization of the commonly used Dirichlet distributions called Pitman-Yor processes which produce power-law distributions more closely resembling those in natural languages. We show that an approximation to the hierarchical Pitman-Yor language model recovers the exact formulation of interpolated Kneser-Ney, one of the best smoothing methods for n-gram language models. Experiments verify that our model gives cross entropy results superior to interpolated Kneser-Ney and comparable to modified Kneser-Ney.
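For reference, here is a minimal Python sketch of interpolated Kneser-Ney for bigrams, the smoothing method the abstract relates to the hierarchical Pitman-Yor model (whose discount parameter plays the role of d below). The tiny corpus and discount value are illustrative assumptions.

    # Interpolated Kneser-Ney for bigrams: absolute discounting plus a
    # continuation-probability backoff term.
    from collections import Counter, defaultdict

    train = "<s> the cat sat </s> <s> the dog sat </s>".split()
    bi = Counter(zip(train, train[1:]))
    uni = Counter(train)

    # type counts: distinct words following / preceding each word
    followers = defaultdict(set)   # v -> {w : c(v, w) > 0}
    histories = defaultdict(set)   # w -> {v : c(v, w) > 0}
    for v, w in bi:
        followers[v].add(w)
        histories[w].add(v)
    n_bigram_types = len(bi)

    def p_kn(w, v, d=0.75):
        p_cont = len(histories[w]) / n_bigram_types   # continuation prob
        backoff = d * len(followers[v]) / uni[v]      # reserved mass
        return max(bi[(v, w)] - d, 0) / uni[v] + backoff * p_cont

    print(p_kn("cat", "the"))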

Exploiting latent semantic information in statistical language modeling

by Jerome R. Bellegarda - Proceedings of the IEEE, 88, 2000
"... Statistical language models used in large-vocabulary speech recognition must properly encapsulate the various constraints, both local and global, present in the language. While local constraints are readily captured through n-gram modeling, global constraints, such as long-term semantic dependencies ..."
Abstract - Cited by 114 (7 self)
Statistical language models used in large-vocabulary speech recognition must properly encapsulate the various constraints, both local and global, present in the language. While local constraints are readily captured through n-gram modeling, global constraints, such as long-term semantic dependencies, have been more difficult to handle within a data-driven formalism. This paper focuses on the use of latent semantic analysis, a paradigm that automatically uncovers the salient semantic relationships between words and documents in a given corpus. In this approach, (discrete) words and documents are mapped onto a (continuous) semantic vector space, in which familiar clustering techniques can be applied. This leads to the specification of a powerful framework for automatic semantic classification, as well as the derivation of several language model families with various smoothing properties. Because of their large-span nature, these language models are well suited to complement conventional n-grams. An integrative formulation is proposed for harnessing this synergy, in which the latent semantic information is used to adjust the standard n-gram probability. Such hybrid language modeling compares favorably with the corresponding n-gram baseline: experiments conducted on the Wall Street Journal domain show a reduction in average word error rate of over 20%. This paper concludes with a discussion of intrinsic tradeoffs, such as the influence of training data selection on the resulting performance. Keywords: latent semantic analysis, multispan integration, n-grams, speech recognition, statistical language modeling.
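The word-to-vector mapping the abstract describes can be sketched with a truncated SVD of a word-document co-occurrence matrix. The toy matrix and retained rank below are illustrative assumptions; the paper goes further and combines similarities in this space with standard n-gram probabilities.

    # Latent semantic analysis sketch: SVD of a word-document count
    # matrix projects words and documents into a low-rank semantic space.
    import numpy as np

    # rows = words, columns = documents (toy co-occurrence counts)
    W = np.array([[2.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])

    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    k = 2                                  # retained rank
    words_k = U[:, :k] * S[:k]             # word vectors in semantic space
    docs_k = Vt[:k, :].T * S[:k]           # document vectors

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cos(words_k[0], words_k[1]))     # words 0 and 1 share a document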

Citation Context

...n word sequences. In the past two decades, statistical n-grams have steadily emerged as the preferred way to impose such constraints in a wide range of domains [21]. The reader is referred to [61] and [72] for a comprehensive overview of the state-of-the-art in the field, including an insightful perspective on n-grams in light of other techniques, and an excellent tutorial on challenges lying ahead. Som...

Factored language models and generalized parallel backoff

by Jeff A. Bilmes, Katrin Kirchhoff - In Proceedings of HLT/NAACL, 2003
"... We introduce factored language models (FLMs) and generalized parallel backoff (GPB). An FLM represents words as bundles of features (e.g., morphological classes, stems, data-driven clusters, etc.), and induces a prob-ability model covering sequences of bundles rather than just words. GPB extends sta ..."
Abstract - Cited by 114 (13 self)
We introduce factored language models (FLMs) and generalized parallel backoff (GPB). An FLM represents words as bundles of features (e.g., morphological classes, stems, data-driven clusters, etc.), and induces a probability model covering sequences of bundles rather than just words. GPB extends standard backoff to general conditional probability tables where variables might be heterogeneous types, where no obvious natural (temporal) backoff order exists, and where multiple dynamic backoff strategies are allowed. These methodologies were implemented during the JHU 2002 workshop as extensions to the SRI language modeling toolkit. This paper provides initial perplexity results on both CallHome Arabic and on Penn Treebank Wall Street Journal articles. Significantly, FLMs with GPB can produce bigrams with significantly lower perplexity, sometimes lower than highly-optimized baseline trigrams. In a multi-pass speech recognition context, where bigrams are used to create first-pass bigram lattices or N-best lists, these results are highly relevant.
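A rough Python sketch of the parallel-backoff idea: the predicted word is conditioned on a bundle of parent features (here a previous word and a hypothetical morphological class), and when the full context is unseen the model backs off to whichever single parent context has the larger count, rather than following one fixed temporal order. The counts are toy values, and the sketch omits the discounting a real GPB model uses to keep the distribution normalized.

    # Generalized parallel backoff over a two-feature context.
    from collections import Counter

    full = Counter({("sat", ("cat", "NOUN")): 3})   # c(w, prev_word, prev_class)
    by_word = Counter({("sat", "cat"): 4})          # c(w, prev_word)
    by_class = Counter({("sat", "NOUN"): 9})        # c(w, prev_class)
    ctx_full = Counter({("cat", "NOUN"): 5})
    ctx_word = Counter({"cat": 6})
    ctx_class = Counter({"NOUN": 12})

    def p_gpb(w, prev_word, prev_class):
        if ctx_full[(prev_word, prev_class)] and full[(w, (prev_word, prev_class))]:
            return full[(w, (prev_word, prev_class))] / ctx_full[(prev_word, prev_class)]
        # parallel backoff: pick the parent context with the larger count
        if ctx_word[prev_word] >= ctx_class[prev_class]:
            return by_word[(w, prev_word)] / ctx_word[prev_word]
        return by_class[(w, prev_class)] / ctx_class[prev_class]

    print(p_gpb("sat", "cat", "NOUN"))   # full context seen: 3/5
    print(p_gpb("sat", "dog", "NOUN"))   # unseen: backs off to the class parent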

A survey of statistical machine translation

by Adam Lopez, 2007
"... Statistical machine translation (SMT) treats the translation of natural language as a machine learning problem. By examining many samples of human-produced translation, SMT algorithms automatically learn how to translate. SMT has made tremendous strides in less than two decades, and many popular tec ..."
Abstract - Cited by 93 (6 self)
Statistical machine translation (SMT) treats the translation of natural language as a machine learning problem. By examining many samples of human-produced translation, SMT algorithms automatically learn how to translate. SMT has made tremendous strides in less than two decades, and many popular techniques have only emerged within the last few years. This survey presents a tutorial overview of state-of-the-art SMT at the beginning of 2007. We begin with the context of the current research, and then move to a formal problem description and an overview of the four main subproblems: translational equivalence modeling, mathematical modeling, parameter estimation, and decoding. Along the way, we present a taxonomy of some different approaches within these areas. We conclude with an overview of evaluation and notes on future directions.
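For orientation, the decoding problem such surveys formalize is often written in the classical noisy-channel form (a standard formulation in the field, not quoted from this survey): given a source sentence f, find

    \hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e}\, P(f \mid e)\, P(e)

where P(f | e) is the translation model, P(e) is the language model, parameter estimation fits both models to data, and decoding is the search for the maximizing translation \hat{e}.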

Associating Genes with Gene Ontology Codes Using a Maximum Entropy Analysis of Biomedical Literature

by Soumya Raychaudhuri, Jeffrey T. Chang, Patrick D. Sutphin, Russ B. Altman, 2002
"... this paper but has been provided elsewhere (Ratnaparkhi 1997; Manning and Schutze 1999) ..."
Abstract - Cited by 92 (5 self)
... this paper but has been provided elsewhere (Ratnaparkhi 1997; Manning and Schutze 1999)

Offline recognition of unconstrained handwritten texts using HMMs and statistical language models

by Alessandro Vinciarelli, Samy Bengio, Horst Bunke - IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004
"... This paper presents a system for the offline recognition of large vocabulary unconstrained handwritten texts. The only assumption made about the data is that it is written in English. This allows the application of Statistical Language Models in order to improve the performance of our system. Severa ..."
Abstract - Cited by 84 (11 self)
This paper presents a system for the offline recognition of large vocabulary unconstrained handwritten texts. The only assumption made about the data is that it is written in English. This allows the application of Statistical Language Models in order to improve the performance of our system. Several experiments have been performed using both single and multiple writer data. Lexica of variable size (from 10,000 to 50,000 words) have been used. The use of language models is shown to improve the accuracy of the system (when the lexicon contains 50,000 words, the error rate is reduced by ∼50% for single writer data and by ∼25% for multiple writer data). Our approach is described in detail and compared with other methods presented in the literature to deal with the same problem. An experimental setup to correctly deal with unconstrained text recognition is proposed.
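One common way a language model improves a recognizer of this kind is by rescoring candidate transcriptions: the optical/HMM score of each hypothesis is combined with a weighted LM score. The sketch below illustrates this with a toy bigram LM; the scores, weight, and LM are illustrative assumptions, not the paper's exact integration scheme.

    # N-best rescoring: combine recognizer log-likelihoods with a
    # weighted language-model term and pick the best hypothesis.
    import math

    def lm_logprob(words, bigram_logp):
        return sum(bigram_logp.get((v, w), math.log(1e-6))
                   for v, w in zip(words[:-1], words[1:]))

    # hypothetical bigram log-probabilities
    bigram_logp = {("the", "cat"): math.log(0.2), ("the", "cab"): math.log(0.001)}

    # (transcription, HMM/optical log-likelihood) pairs for one text line
    nbest = [("the cab".split(), -10.0), ("the cat".split(), -10.5)]

    alpha = 2.0   # LM weight (tuned on held-out data in practice)
    best = max(nbest, key=lambda h: h[1] + alpha * lm_logprob(h[0], bigram_logp))
    print(" ".join(best[0]))   # LM evidence flips the decision to "the cat"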

Citation Context

...of documents belonging to corpora assumed to reproduce the statistics of average English. This allows us to apply Statistical Language Models (SLM’s) in order to improve the performance of our system [7]. We used N-gram models (the most successful SLM applied until now [8]) of order 1, 2 and 3 (called unigrams, bigrams and trigrams respectively). Previous works typically preferred the application of ...

Passage Retrieval Based On Language Models

by Xiaoyong Liu, W. Bruce Croft - In Proceedings of the Eleventh International Conference on Information and Knowledge Management, 2002
"... Previous research has shown that passage-level evidence can bring added benefits to document retrieval when documents are long or span different subject areas. Recent developments in language modeling approach to IR provided a new effective alternative to traditional retrieval models. These two stre ..."
Abstract - Cited by 80 (6 self)
Previous research has shown that passage-level evidence can bring added benefits to document retrieval when documents are long or span different subject areas. Recent developments in the language modeling approach to IR have provided an effective new alternative to traditional retrieval models. These two streams of research motivate us to examine the use of passages in a language model framework. This paper reports on experiments using passages in a simple language model and a relevance model, and compares the results with document-based retrieval. Results from the INQUERY search engine, which is not based on a language modeling approach, are also given for comparison. Test data include two heterogeneous and one homogeneous document collections. Our experiments show that passage retrieval is feasible in the language modeling context and, more importantly, that it can provide more reliable performance than retrieval based on full documents.
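A simple way to realize passage retrieval in this framework is to rank a document by its best-scoring fixed-length passage rather than by one whole-document model. The window size, smoothing weight, and toy text below are illustrative assumptions; the paper evaluates several passage definitions and models.

    # Max-passage query likelihood: score each fixed-length window with a
    # smoothed unigram model and keep the best window's score.
    import math
    from collections import Counter

    def loglik(query, text, background, lam=0.5):
        tm, btotal = Counter(text), sum(background.values())
        return sum(math.log(lam * tm[w] / len(text)
                            + (1 - lam) * background[w] / btotal)
                   for w in query)

    doc = ("unrelated preamble text . language models score passages well . "
           "more unrelated tail text .").split()
    background = Counter(doc)
    query = "language passages".split()

    window = 6
    passages = [doc[i:i + window] for i in range(0, len(doc), window)]
    best = max(loglik(query, p, background) for p in passages)
    print("max-passage:", best, "vs whole-doc:", loglik(query, doc, background))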