Results 1 - 10 of 368
TextTiling: Segmenting text into multi-paragraph subtopic passages
Computational Linguistics, 1997
"... TextTiling is a technique for subdividing texts into multi-paragraph units that represent passages, or subtopics. The discourse cues for identifying major subtopic shifts are patterns of lexical co-occurrence and distribution. The algorithm is fully implemented and is shown to produce segmentation t ..."
Cited by 458 (2 self)
TextTiling is a technique for subdividing texts into multi-paragraph units that represent passages, or subtopics. The discourse cues for identifying major subtopic shifts are patterns of lexical co-occurrence and distribution. The algorithm is fully implemented and is shown to produce segmentation that corresponds well to human judgments of the subtopic boundaries of 12 texts. Multi-paragraph subtopic segmentation should be useful for many text analysis tasks, including information retrieval and summarization.
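As a rough illustration of the block-comparison mechanism the abstract describes, here is a minimal Python sketch of TextTiling-style segmentation. The pseudo-sentence size w, block size k, and the depth-score threshold mirror the general shape of the published method, but the exact values are illustrative, not the paper's tuned settings.

```python
"""Minimal TextTiling-style sketch: lexical cohesion valleys mark boundaries."""
import re
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def texttile(text: str, w: int = 20, k: int = 6) -> list[int]:
    tokens = re.findall(r"[a-z']+", text.lower())
    # Group tokens into fixed-size pseudo-sentences.
    seqs = [Counter(tokens[i:i + w]) for i in range(0, len(tokens), w)]
    # Cohesion at each gap: similarity of the k pseudo-sentences on each side.
    gaps = [cosine(sum(seqs[max(0, g - k):g], Counter()),
                   sum(seqs[g:g + k], Counter()))
            for g in range(1, len(seqs))]
    # Depth score: how far each valley sits below its surrounding peaks.
    depths = [(max(gaps[:i + 1]) - s) + (max(gaps[i:]) - s)
              for i, s in enumerate(gaps)]
    if not depths:
        return []
    mean = sum(depths) / len(depths)
    sd = sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    # Boundaries where the depth score clears a liberal cutoff.
    return [i + 1 for i, d in enumerate(depths) if d > mean - sd / 2]
```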
Using Lexical Chains for Text Summarization
1997
"... We investigate one technique to produce a summary of an original text without requiring its full semantic interpretation, but instead relying on a model of the topic progression in the text derived from lexical chains. We present a new algorithm to compute lexical chains in a text, merging several r ..."
Cited by 451 (9 self)
We investigate one technique to produce a summary of an original text without requiring its full semantic interpretation, but instead relying on a model of the topic progression in the text derived from lexical chains. We present a new algorithm to compute lexical chains in a text, merging several robust knowledge sources: the WordNet thesaurus, a part-of-speech tagger and shallow parser for the identification of nominal groups, and a segmentation algorithm derived from (Hearst, 1994). Summarization proceeds in three steps: the original text is first segmented, lexical chains are constructed, strong chains are identified, and significant sentences are extracted from the text. We present in this paper empirical results on the identification of strong chains and of significant sentences.
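A sketch of the strong-chain heuristic this abstract outlines, simplified to use only term repetition as the cohesion relation; the real algorithm also merges WordNet relations, a POS tagger, and Hearst-style segmentation, so treat this as a toy approximation.

```python
"""Toy strong-chain summarizer: repetition-only chains, score > mean + 2*sd."""
import re
from collections import defaultdict

def summarize(text: str) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Degenerate lexical chain: all occurrences of one content-like word.
    chains = defaultdict(list)  # word -> sentence indices containing it
    for i, s in enumerate(sentences):
        for w in set(re.findall(r"[a-z]{4,}", s.lower())):
            chains[w].append(i)
    scores = {w: len(ix) for w, ix in chains.items() if len(ix) > 1}
    if not scores:
        return sentences[:1]
    mean = sum(scores.values()) / len(scores)
    sd = (sum((v - mean) ** 2 for v in scores.values()) / len(scores)) ** 0.5
    # "Strong" chains, then the paper's heuristic: take the first sentence
    # containing a representative member of each strong chain.
    strong = [w for w, v in scores.items() if v > mean + 2 * sd]
    picked = sorted({chains[w][0] for w in strong})
    return [sentences[i] for i in picked] or sentences[:1]
```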
TileBars: Visualization of Term Distribution Information in Full Text Information Access
1995
"... The field of information retrieval has traditionally focused on textbases consisting of titles and abstracts. As a consequence, many underlying assumptions must be altered for retrieval from full-length text collections. This paper argues for making use of text structure when retrieving from full te ..."
Cited by 341 (10 self)
The field of information retrieval has traditionally focused on textbases consisting of titles and abstracts. As a consequence, many underlying assumptions must be altered for retrieval from full-length text collections. This paper argues for making use of text structure when retrieving from full text documents, and presents a visualization paradigm, called TileBars, that demonstrates the usefulness of explicit term distribution information in Boolean-type queries. TileBars simultaneously and compactly indicate relative document length, query term frequency, and query term distribution. The patterns in a column of TileBars can be quickly scanned and deciphered, aiding users in making judgments about the potential relevance of the retrieved documents.
KEYWORDS: Information retrieval, Full-length text, Visualization.
INTRODUCTION: Information access systems have traditionally focused on retrieval of documents consisting of titles and abstracts. As a consequence, the underlying assumpt...
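The display itself is simple to approximate. Below is a hypothetical text-mode TileBar: one row per query term, one cell per text tile, with darker glyphs for higher within-tile frequency (the actual system draws grayscale rectangles and groups rows by term set).

```python
"""Text-mode TileBar sketch: rows are query terms, columns are text tiles."""
SHADES = " .:#"  # four frequency buckets, light to dark

def tilebar(tiles: list[str], terms: list[str]) -> str:
    rows = []
    for term in terms:
        counts = [t.lower().count(term.lower()) for t in tiles]
        top = max(counts) or 1
        cells = "".join(SHADES[min(3, 3 * c // top)] for c in counts)
        rows.append(f"{term:>8} |{cells}|")
    return "\n".join(rows)

tiles = ["the kidney filters blood", "dialysis replaces kidney function",
         "diet affects blood pressure", "exercise and a healthy diet"]
print(tilebar(tiles, ["kidney", "diet"]))   # term distribution at a glance
```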
Topic Detection and Tracking Pilot Study Final Report
In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, 1998
"... Topic Detection and Tracking (TDT) is a DARPA-sponsored initiative to investigate the state of the art in finding and following new events in a stream of broadcast news stories. The TDT problem consists of three major tasks: (1) segmenting a stream of data, especially recognized speech, into distinc ..."
Cited by 313 (34 self)
Topic Detection and Tracking (TDT) is a DARPA-sponsored initiative to investigate the state of the art in finding and following new events in a stream of broadcast news stories. The TDT problem consists of three major tasks: (1) segmenting a stream of data, especially recognized speech, into distinct stories; (2) identifying those news stories that are the first to discuss a new event occurring in the news; and (3) given a small number of sample news stories about an event, finding all following stories in the stream.
The Pilot Study ran from September 1996 through October 1997. The primary participants were DARPA, Carnegie Mellon University, Dragon Systems, and the University of Massachusetts at Amherst. This report summarizes the findings of the pilot study.
The TDT work continues in a new project involving larger training and test corpora, more active participants, and a more broadly defined notion of "topic" than was used in the pilot study.
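As a concrete illustration of task (2), new-event detection, here is a toy sketch that flags a story as first-of-its-kind when its similarity to every earlier story stays below a cutoff; the bag-of-words representation and threshold are placeholders, not the systems evaluated in the pilot study.

```python
"""Toy first-story detection: novelty = low similarity to all prior stories."""
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    num = sum(v * b.get(t, 0) for t, v in a.items())
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def first_stories(stream: list[str], threshold: float = 0.2) -> list[int]:
    seen: list[Counter] = []
    hits = []
    for i, story in enumerate(stream):
        vec = Counter(story.lower().split())
        # A story starts a new event if nothing earlier resembles it.
        if all(cosine(vec, old) < threshold for old in seen):
            hits.append(i)
        seen.append(vec)
    return hits
```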
The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity
2001
"... We describe a joint probabilistic model for modeling the contents and inter-connectivity of document collections such as sets of web pages or research paper archives. The model is based on a probabilistic factor decomposition and allows identifying principal topics of the collection as well as autho ..."
Cited by 218 (3 self)
We describe a joint probabilistic model for modeling the contents and inter-connectivity of document collections such as sets of web pages or research paper archives. The model is based on a probabilistic factor decomposition and allows identifying principal topics of the collection as well as authoritative documents within those topics. Furthermore, the relationships between topics are mapped out in order to build a predictive model of link content. Among the many applications of this approach are information retrieval and search, topic identification, query disambiguation, focused web crawling, web authoring, and bibliometric analysis.
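A toy EM routine in the spirit of the factor decomposition described: each latent topic emits both words and link targets, so documents and their citations share one set of topic loadings. The mixing weight, initialization, and corpus below are illustrative simplifications, not the paper's model.

```python
"""Toy joint factorization: P(w|d) = sum_z P(z|d)P(w|z), likewise for links."""
import numpy as np

rng = np.random.default_rng(0)

def joint_plsa(W, L, k=2, alpha=0.5, iters=100):
    """W: doc x word counts; L: doc x doc citation counts."""
    nd, nw = W.shape
    Pz_d = rng.dirichlet(np.ones(k), size=nd)           # P(z|d)
    Pw_z = rng.dirichlet(np.ones(nw), size=k)           # P(w|z)
    Pl_z = rng.dirichlet(np.ones(L.shape[1]), size=k)   # P(l|z)
    for _ in range(iters):
        # E-step: topic responsibilities for word and link observations.
        Rw = Pz_d[:, :, None] * Pw_z[None, :, :]
        Rw /= Rw.sum(1, keepdims=True) + 1e-12
        Rl = Pz_d[:, :, None] * Pl_z[None, :, :]
        Rl /= Rl.sum(1, keepdims=True) + 1e-12
        # M-step: reestimate factors from expected counts.
        Cw = np.einsum("dw,dzw->zw", W, Rw)
        Cl = np.einsum("dl,dzl->zl", L, Rl)
        Pw_z = Cw / (Cw.sum(1, keepdims=True) + 1e-12)
        Pl_z = Cl / (Cl.sum(1, keepdims=True) + 1e-12)
        Cd = alpha * np.einsum("dw,dzw->dz", W, Rw) \
           + (1 - alpha) * np.einsum("dl,dzl->dz", L, Rl)
        Pz_d = Cd / (Cd.sum(1, keepdims=True) + 1e-12)
    return Pz_d, Pw_z, Pl_z

W = np.array([[4, 1, 0], [3, 2, 0], [0, 1, 5], [0, 0, 4]])              # words
L = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]])  # cites
Pz_d, _, _ = joint_plsa(W, L)
print(Pz_d.round(2))  # docs 0-1 and 2-3 should load on different topics
```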
A comparison of classifiers and document representations for the routing problem
Annual ACM Conference on Research and Development in Information Retrieval (ACM SIGIR), 1995
"... In this paper, we compare learning techniques based on statistical classification to traditional methods of relevance feedback for the document routing problem. We consider three classification techniques which have decision rules that are derived via explicit error minimization: linear discriminant ..."
Cited by 196 (2 self)
In this paper, we compare learning techniques based on statistical classification to traditional methods of relevance feedback for the document routing problem. We consider three classification techniques which have decision rules that are derived via explicit error minimization: linear discriminant analysis, logistic regression, and neural networks. We demonstrate that the classifiers perform 10-15% better than relevance feedback via Rocchio expansion for the TREC-2 and TREC-3 routing tasks.
Error minimization is difficult in high-dimensional feature spaces because the convergence process is slow and the models are prone to overfitting. We use two different strategies, latent semantic indexing and optimal term selection, to reduce the number of features. Our results indicate that features based on latent semantic indexing are more effective for techniques such as linear discriminant analysis and logistic regression, which have no way to protect against overfitting. Neural networks perform equally well with either set of features and can take advantage of the additional information available when both feature sets are used as input.
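One of the compared configurations, LSI features feeding a logistic-regression router, can be sketched in a few lines with scikit-learn; the corpus, labels, and dimensionality here are toy placeholders for the TREC routing setup.

```python
"""Sketch: latent semantic indexing features + a logistic-regression router."""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["oil prices rise sharply", "crude futures fall again",
        "team wins the cup final", "coach praises the squad"]
relevant = [1, 1, 0, 0]  # routing profile: energy-markets stories

router = make_pipeline(
    TfidfVectorizer(),             # raw term features
    TruncatedSVD(n_components=2),  # LSI step: reduce to latent dimensions
    LogisticRegression(),          # decision rule via explicit error minimization
)
router.fit(docs, relevant)
print(router.predict_proba(["oil output grows"])[0, 1])  # routing score
```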
Advances in Domain Independent Linear Text Segmentation
2000
"... This paper describes a method for linear text seg- mc. ntation which is twice as accurate and over seven times as fast as the state-of-the-art (Reynar, 1998). Inter-sentence similarity is replaced by rank in the local context. Boundary locations are discovered by divisive clustering. ..."
Cited by 186 (1 self)
This paper describes a method for linear text segmentation which is twice as accurate and over seven times as fast as the state-of-the-art (Reynar, 1998). Inter-sentence similarity is replaced by rank in the local context. Boundary locations are discovered by divisive clustering.
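The rank transform is the core trick and is easy to sketch: each raw inter-sentence similarity is replaced by the fraction of its local neighbourhood that it outranks, which makes the boundary search robust to unreliable absolute cosine values. The window radius here is illustrative.

```python
"""Rank each similarity value within its local window, as the abstract describes."""
import numpy as np

def rank_transform(sim: np.ndarray, radius: int = 5) -> np.ndarray:
    n = sim.shape[0]
    ranked = np.zeros_like(sim, dtype=float)
    for i in range(n):
        for j in range(n):
            window = sim[max(0, i - radius):i + radius + 1,
                         max(0, j - radius):j + radius + 1]
            # Fraction of neighbouring similarities strictly smaller.
            ranked[i, j] = (window < sim[i, j]).mean()
    return ranked  # divisive clustering then splits on this ranked matrix
```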
Functional Analysis and
1957
"... the effect of cosmetic essence by independent component ..."
Cited by 183 (0 self)
the effect of cosmetic essence by independent component
From Discourse Structures to Text Summaries
In Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization, 1997
"... We describe experiments that show that the concepts of rhetorical analysis and nuclearity can be used effectively for determining the most important units in a text. We show how these concepts can be implemented and we discuss results that we obtained with a discourse-based summarization program. 1 ..."
Cited by 178 (2 self)
We describe experiments that show that the concepts of rhetorical analysis and nuclearity can be used effectively for determining the most important units in a text. We show how these concepts can be implemented and we discuss results that we obtained with a discourse-based summarization program.
1 Motivation
The evaluation of automatic summarizers has always been a thorny problem: most papers on summarization describe the approach that they use and give some "convincing" samples of the output. In very few cases, the direct output of a summarization program is compared with a human-made summary or evaluated with the help of human subjects; usually, the results are modest. Unfortunately, evaluating the results of a particular implementation does not enable one to determine what part of the failure is due to the implementation itself and what part to its underlying assumptions. The position that we take in this paper is that, in order to build high-quality summarization programs, one ...
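A minimal sketch of the nuclearity idea: each elementary unit is scored by the shallowest tree level to which it is promoted as a nucleus, so lower scores mark more important units. The tuple encoding of the discourse tree is hypothetical, not the paper's representation.

```python
"""Score leaves of a discourse tree by nucleus promotion depth."""

# A node is either a leaf string or (relation, nucleus_index, children).
tree = ("elaboration", 0, [
    "Mars has a thin atmosphere.",                  # nucleus of the root
    ("evidence", 0, [
        "Surface pressure is under 1% of Earth's.",
        "Landers have measured it directly.",
    ]),
])

def promote(node, depth, scores):
    if isinstance(node, str):
        scores[node] = depth
        return
    _, nucleus, children = node
    for i, child in enumerate(children):
        # The nucleus is promoted: it keeps the parent's shallower depth.
        promote(child, depth if i == nucleus else depth + 1, scores)

scores = {}
promote(tree, 0, scores)
for unit, d in sorted(scores.items(), key=lambda kv: kv[1]):
    print(d, unit)  # depth-0 unit is the summary-worthy nucleus
```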
Anaphora for everyone: Pronominal anaphora resolution without a parser
In Proceedings of COLING-96 (16th International Conference on Computational Linguistics), 1996
"... We present an algorithm for anaphora resolution which is a modified and extended version of that developed by (Lappin and Leass, 1994). In contrast to that work, our algorithm does not require in-depth, full, syntactic parsing of text. Instead, with minimal compromise in output quality, the modifica ..."
Cited by 164 (11 self)
We present an algorithm for anaphora resolution which is a modified and extended version of that developed by (Lappin and Leass, 1994). In contrast to that work, our algorithm does not require in-depth, full, syntactic parsing of text. Instead, with minimal compromise in output quality, the modifications enable the resolution process to work from the output of a part of speech tagger, enriched only with annotations of grammatical function of lexical items in the input text stream. Evaluation of the results of our implementation demonstrates that accurate anaphora resolution can be realized within natural language processing frameworks which do not—or cannot—employ robust and reliable parsing components.
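A toy salience-based resolver over tagger output alone, in the spirit of the parser-free approach described; the grammatical-function annotations, factor weights, and recency handling are illustrative stand-ins (and agreement checks are omitted).

```python
"""Resolve pronouns to the most salient prior noun, no parse tree required."""

# Tagger-style input: (token, POS, grammatical function) per sentence.
sentences = [
    [("John", "NNP", "subj"), ("saw", "VBD", None), ("Bill", "NNP", "obj")],
    [("He", "PRP", "subj"), ("waved", "VBD", None)],
]

WEIGHTS = {"subj": 80, "obj": 50}  # grammatical-function salience factors
RECENCY = 100                      # later sentences get a recency boost

def resolve(sents):
    candidates = []  # (salience, token) for nouns seen so far
    links = []
    for s_idx, sent in enumerate(sents):
        for tok, pos, func in sent:
            if pos == "PRP" and candidates:
                # Link the pronoun to the most salient prior candidate.
                links.append((tok, max(candidates)[1]))
            elif pos.startswith("NN"):
                candidates.append((RECENCY * s_idx + WEIGHTS.get(func, 0), tok))
    return links

print(resolve(sentences))  # [('He', 'John')]
```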