| S. Sekine, "Automatic Sublanguage Identification for a New Text", Second Annual Workshop on Very Large Corpora, Kyoto, Japan, pp.109-120, August 1994. 86 |
....on the assumption that an entire article comes from a single topic. Starting with single-article clusters, clusters are progressively grouped by computing pairwise similarities and merging the two most similar clusters. The similarity measure is based on the combination of inverse document frequencies [10], specifically $S_{ij} = \sum_{w \in A_i \cap A_j} \frac{N_{ij}}{|A_w|} \frac{1}{|A_i||A_j|}$ (3), where $|A_i|$ is the number of unique words in article $i$, $|A_w|$ is the number of articles containing the word $w$, and $N_{ij} = \sqrt{\frac{N_i + N_j}{N_i \times N_j}}$ (4) is a normalization factor with $N_i$ being the number of articles in ....
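The grouping procedure described in this excerpt can be sketched in Python as follows. This is a minimal illustration rather than the cited authors' implementation: the names `similarity` and `cluster_articles`, the dictionary layout of a cluster, and the stopping criterion `target` are all hypothetical, and the `+` and `×` operators in the normalization factor are an assumed reading of the garbled formula. Document frequencies are counted once over the original articles.

```python
import math
from collections import Counter
from itertools import combinations


def similarity(a, b, doc_freq):
    """Similarity of two clusters: shared words weighted by inverse document frequency."""
    if not a["words"] or not b["words"]:
        return 0.0  # assumption: empty clusters are treated as dissimilar
    shared = a["words"] & b["words"]
    # Normalization factor; the "+" and "*" are an assumed reading of the formula.
    n_ab = math.sqrt((a["size"] + b["size"]) / (a["size"] * b["size"]))
    idf_sum = sum(1.0 / doc_freq[w] for w in shared)
    return n_ab * idf_sum / (len(a["words"]) * len(b["words"]))


def cluster_articles(articles, target):
    """Merge the most similar pair of clusters until `target` clusters remain."""
    word_sets = [set(tokens) for tokens in articles]
    # Number of original articles that contain each word.
    doc_freq = Counter(w for ws in word_sets for w in ws)
    clusters = [
        {"members": [i], "words": ws, "size": 1} for i, ws in enumerate(word_sets)
    ]
    while len(clusters) > target:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: similarity(clusters[p[0]], clusters[p[1]], doc_freq),
        )
        merged = {
            "members": clusters[i]["members"] + clusters[j]["members"],
            "words": clusters[i]["words"] | clusters[j]["words"],
            "size": clusters[i]["size"] + clusters[j]["size"],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c["members"]) for c in clusters]


articles = [
    ["stock", "market", "shares"],
    ["market", "shares", "bank"],
    ["goal", "match", "team"],
    ["team", "match", "coach"],
]
print(sorted(cluster_articles(articles, 2)))
```

The exhaustive pair search on every pass is quadratic in the number of clusters, which is acceptable for a toy example but would likely need caching of pairwise scores on a large corpus.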
S. Sekine, "Automatic Sublanguage Identification for a New Text", Second Annual Workshop on Very Large Corpora, Kyoto, Japan, pp.109-120, August 1994.
....merging the two articles found to be most similar according to a word cooccurrence metric, until a given number of article clusters was reached. Each of these groups of articles was then treated as a distinct topic. The article clustering method employed was that used in [2] as based on [4]. Each article is initially placed in a singleton group, and then, given two article groups $A_a$ and $A_b$, the similarity between the two groups, $S_{ab}$, is defined as $S_{ab} = \sum_{w \in A_a \cap A_b} \frac{N_{ab}}{|A_w|} \frac{1}{|A_a||A_b|}$ (1), where $|A_w|$ is the number of article groups that contain the word ....
S. Sekine, "Automatic Sublanguage Identification for a New Text"; Second Annual Workshop on Very Large Corpora, Kyoto
....the new text. In the mixture approach, the corpus was statically clustered into a small number of very broad topics. We have previously reported on the effectiveness of sublanguage identification measured in terms of the frequency of overlapping words between the article and the mini corpus [6] [7]. This is the first report on the application of the technique to speech recognition. For speech recognition, the scores calculated by the sublanguage component are linearly combined with BBN's scores, with the result used to select the best hypothesis from the N-best sentences. We optimized ....
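The N-best selection step mentioned in this excerpt can be illustrated with a short Python sketch. The excerpt does not give the combination weights or the score scales, so the function name `select_best`, the single `weight` parameter, and the convention that higher scores are better are all assumptions made for illustration.

```python
def select_best(hypotheses, weight):
    """Pick the hypothesis with the highest linearly combined score.

    hypotheses: list of (sentence, recognizer_score, sublanguage_score) tuples.
    weight: hypothetical mixing weight applied to the sublanguage score.
    """
    return max(hypotheses, key=lambda h: h[1] + weight * h[2])[0]


n_best = [
    ("recognize speech", -10.0, -5.0),
    ("wreck a nice beach", -10.5, -1.0),
]
print(select_best(n_best, 0.0))  # recognizer score alone decides
print(select_best(n_best, 1.0))  # sublanguage score can change the ranking
```

In practice the weight would be tuned on held-out data; the values above are arbitrary.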
Satoshi Sekine: "Automatic Sublanguage Identification for a New Text" Second Annual Workshop on Very Large Corpora (1994)
No context found.