M. Hearst. 1994. Multi-paragraph segmentation of expository text. In Proc. of the ACL.
....of a large number of first uses of open class words marking a new segment. This is a fragment of a transcript of an episode of National Public Radio's show All Things Considered. Words in bold are used for the first time. The gap separates two news stories. 32 using the vector space model [Hearst, 1994b], our optimization algorithm [Reynar, 1994], and a number of other algorithms in the literature approximate the identification of simple lexical cohesion relationships by looking at patterns of word repetition. Figure 3.5 shows the number of word repetitions within an excerpt of NPR's program All ....
....repetition without a statistical model of language. In fact, assuming no preprocessing is done, we could segment text in any language which is not highly agglutinative using our optimization algorithm [Reynar, 1994] or the version of Hearst's TextTiling which does not normalize for term frequency [Hearst, 1994b]. Algorithms which rely on word frequency, such as the language modeling technique developed by Beeferman et al. [Beeferman et al., 1997b], require knowing the language of the text and make assumptions about its content as well. Such assumptions are necessary because the frequency of occurrence ....
[Article contains additional citation context not shown here]
Hearst, M. A. (1994b). Multi-paragraph segmentation of expository text. pages 9--16, Las Cruces, New Mexico.
....of this technique to text structuring uses word repetition information to divide a text into those regions determined to be most coherent by an optimization algorithm. The method has been successfully used to discover the document boundaries in concatenations of Wall Street Journal articles. Hearst [1994; 1997] uses cosine similarity in a word vector space as an indicator of topic similarity. This algorithm, called TextTiling, is a simple, domain independent technique, that assigns a score to each topic boundary candidate (sentence boundaries). Topic boundaries are placed at the locations of ....
Hearst, M. A. 1994. Multi-paragraph Segmentation of Expository Text. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, New Mexico State University, Las Cruces, NM. 9-16.
....them with options, we are also trying to determine the quantities and extent of information that is appropriate. Segmentation independent of the CC topic change marker must be improved. Several methods, such as discourse cues studied by Hirschberg and Litman [7] or Hearst's Text Tiling algorithm [6], will likely increase accuracy. More general goals include providing information via different forms of media, as well as integrating the user interface with the television itself. Finally, the long term focus is to continue to develop an overall theory of viewer interaction with television. For ....
Hearst, M.A. Multi-paragraph Segmentation of Expository Text. Proceedings of the ACL, (1994).
....segment deals with a particular subject while contiguous segments deal with different subjects. In this manner documents relevant to a query can be retrieved from a large database of unformatted (or loosely formatted) text. For an overview of the problem and various methods for its solution see [3, 4, 7, 8, 12, 13, 15, 18, 19, 20]. Department of Math. Phys. and Comp. Sciences, Electrical and Computer Engineering, Faculty of Engineering, Aristotle University of Thessaloniki, Greece. Department of Business Administration, University of Macedonia, Thessaloniki, Greece. Department of Electrical and Computer Engineering, ....
M. Hearst. "Multi-paragraph segmentation of expository text". In Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, 1994.
....proves to be an easy task, one could just make use of the punctuation to solve this problem. Instead, paragraph segmentation is much more difficult, and this is due first of all to the highly unstructured texts that can be found on the Web. Work developed in this direction is presented in [Hearst 1994] and [Callan 1994]. But these methods work only for structured texts, containing a priori known lexical separators (i.e. a tag, an empty line, etc.). Thus, we had to use a method that covers almost all the possible paragraph separators that can occur in the texts on the web. The paragraph separators ....
Hearst, M.A. Multi-paragraph segmentation of expository text. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 9-16, Las Cruces, New Mexico, 1994.
....proves to be an easy task, one could just make use of the punctuation to solve this problem. Instead, paragraph segmentation is much more difficult, and this is due first of all to the highly unstructured texts that can be found on the Web. Work developed in this direction is presented in [Hearst 1994] and [Callan 1994]. But these methods work only for structured texts, containing a priori known lexical separators (i.e. a tag, an empty line, etc.). Thus, we had to use a method that covers almost all the possible paragraph separators that can occur in the texts on the web. The paragraph ....
Hearst, M.A. Multi-paragraph segmentation of expository text. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 9-16, Las Cruces, New Mexico, 1994.
....of this technique to text structuring uses word repetition information to divide a text into those regions determined to be most coherent by an optimization algorithm. The method has been successfully used to discover the document boundaries in concatenations of Wall Street Journal articles. Hearst [1994; 1997] uses cosine similarity in a word vector space as an indicator of topic similarity. This algorithm, called TextTiling, is a simple, domain independent technique, that assigns a score to each topic boundary candidate (sentence boundaries). Topic boundaries are placed at the locations of ....
Hearst, M. A. 1994. Multi-paragraph Segmentation of Expository Text. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, New Mexico State University, Las Cruces, NM. 9-16.
....Story detection segmentation [Allan98] track has also spawned work on identifying topic boundaries in text and spoken audio. For example, Beeferman et al. [Beefer99] use an exponential model based on topicality and cue word features to partition text into coherent segments. Earlier work of Hearst [Hearst94] on TextTiling used a cosine similarity measure as part of an algorithm to subdivide texts into multi paragraph subtopics. We are unaware of any published work on the related problem for document images: performing automatic document separation. Current document image management applications ....
....Layout features can be obtained by image segmentation techniques, and do not require full OCR, although that is how we obtain them in our implementation. If reliable text from OCR is available, we can include a simple word based cohesion measure between two pages. Similar to TextTiling [Hearst94], we use a vector space model where each page is represented by a vector of word frequencies, and the similarity measure is the normalized cosine between the word vectors of the two pages. We exclude very common words, and stem words using Porter's algorithm. Because the text from OCR may contain ....
M. A. Hearst. Multi-paragraph segmentation of expository text. Proceedings of the 32nd Meeting of the Association for Computational Linguistics (ACL '94), June 1994. Las Cruces, NM, USA.
....properties of documents such as sentences, paragraphs, and sections [14, 32, 47, 52]. Each of these individual structures are considered as passages or are used as building blocks for larger passages. Other passage types are based on topics derived by segmenting documents into single topic units [2, 13, 26, 27, 28, 33, 36]. Yet other passage types are based on fixed length blocks [5, 41]. The individual results reported in the literature show that passage level access is of benefit in full text databases. One of the outcomes of this paper is an evaluation of the effectiveness of different passage types in a common ....
....Retrieved passages were used to assess documents as being either relevant or non relevant. The samples of documents judged were as accurate as the official judgments [44, 45], strongly suggesting that short passages can be used to indicate relevance. Hearst and Plaunt's TextTiling algorithm [13, 14] partitions full length documents into multi paragraph units in order to approximate a document's subtopic structure. Such an approach is particularly useful when document structure is absent or does not reflect the text content. Passages can also be used in relevance feedback and automatic query ....
[Article contains additional citation context not shown here]
M. Hearst. Multi-paragraph segmentation of expository texts. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 9--16, Las Cruces, New Mexico, USA, June 1994.
....the same size and thus can systematically detect thematic textual segments of different sizes, ranging from segments slightly smaller than the entire text to segments of about one paragraph. The thematic hierarchy detection algorithm decomposes a text in a similar way as the TextTiling algorithm [2] does. The algorithm calculates a cohesion score at fixed width intervals in a source text. A cohesion score is calculated based on the lexical similarity of two adjacent blocks of a fixed size by the following formula: C(bl, b = Etwt,bt (1) where bl and br are the textual block in the left ....
M. A. Hearst. Multi-paragraph segmentation of expository text. In Proc. of the 32nd Annual Meeting of Association for Computational Linguistics, pages 9-16, 1994.
....of this technique to text structuring uses word repetition information to divide a text into those regions determined to be most coherent by an optimization algorithm. The method has been successfully used to discover the document boundaries in concatenations of Wall Street Journal articles. Hearst [1994; 1997] uses cosine similarity in a word vector space as an indicator of topic similarity. This algorithm, called TextTiling, is a simple, domain independent technique, that assigns a score to each topic boundary candidate (sentence boundaries). Topic boundaries are placed at the locations of ....
Hearst, M. A. 1994. Multi-paragraph Segmentation of Expository Text. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, New Mexico State University, Las Cruces, NM. 9--16.
....of the text. This measure is based on both the frequency of the term in a document and the frequency of the term in all the documents. With the emergence of structured documents, the IR evolved in two directions: (1) retrieving documents taking into account the structural parts relevance ([Wil94] [Hea94]) and (2) enriching the query formats with structural information to retrieve certain parts of documents ([NBY95] [KM93]). Recent works tried to establish some form of relevance ranking in the results ([Lal00] [WFC99] [HTK00] [SN00]) but it is still an open research area. The DASTIR ....
M. Hearst, Multi-paragraph segmentation of expository text, 32nd Annual Meeting of the Association for Computational Linguistics, pages 9-16, New Mexico State University, Las Cruces, New Mexico, 1994.
....of the term sets in the document, and (iii) the distribution of the term sets with respect to the document and to one another. To facilitate display of distribution information, each document is partitioned in advance into a set of subtopical segments using an algorithm called TextTiling [7]. Figure 4 shows an example run on the query (virus) AND (vaccination protection cure) AND (illegal fbi damage police crime) with implicit ORs among the terms within each term set. Each large rectangle indicates a document, and each square within the document represents a coherent text segment ....
Hearst, M.A. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Meeting of the Association for Computational Linguistics, June 1994.
No context found.
M. Hearst. 1994. Multi-paragraph segmentation of expository text. In Proc. of the ACL.
No context found.
M. Hearst. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994.
No context found.
Hearst, M. A. (1994). Multi-Paragraph Segmentation of Expository Text, ACL '94 Proceedings (pp. 9--16).
No context found.
Hearst M., Multi-Paragraph Segmentation of Expository Text, In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics 1994, Las Cruces, New Mexico.
No context found.
Hearst, M. 1994, Multi-paragraph segmentation of expository text, In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 9--16. Las Cruces, New Mexico: Association for Computational Linguistics.
No context found.
Hearst, M.A. 1994. Multi-Paragraph Segmentation of Expository Text. Proceedings of the Annual Conference of the Association for Computational Linguistics (ACL94).
No context found.
Hearst M., Multi-Paragraph Segmentation of Expository Text, In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics 1994.
No context found.
M. Hearst. Multi-paragraph segmentation of expository text. In 32nd Annual meeting of the association for computational linguistics, 1994.
No context found.
M. A. Hearst. Multi-Paragraph Segmentation of Expository Texts. UC Berkeley Computer Science Technical Report Number UCB/CSD-94-790, 1994.
No context found.
M. Hearst. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994.
No context found.
Hearst, M. A., `Multi Paragraph Segmentation of Expository Text', Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 9-16, Las Cruces, New Mexico, 1994.
No context found.
Hearst, M. A. (1994). "Multi-paragraph segmentation of expository texts". In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 9-16.