  TopCat: data mining for topic identification in a text corpus (1999) [18 citations — 5 self]

by Chris Clifton, Robert Cooley
In Proceedings of the 3rd European Conference of Principles and Practice of Knowledge Discovery in Databases
http://www.cs.umn.edu/research/websift/papers/pkdd99.ps

Abstract:

TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: identifying related groups of items. This paper presents a novel method for identifying related items based on "traditional" data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually-categorized "ground truth" news corpus showing this technique is effective in identifying topics in collections of news articles.
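
The frequent-itemset step mentioned in the abstract can be illustrated with a minimal Apriori-style sketch. This is not the authors' implementation: the function name `frequent_itemsets`, the parameters `min_support` and `max_size`, and the sample entities are all assumptions made for illustration, and the hypergraph partitioning step that follows in the paper is not shown. Each article is assumed to have already been reduced to a set of named entities.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Return {itemset: count} for itemsets found in >= min_support articles.

    `articles` is a list of frozensets of entity names (hypothetical input
    format). Generic level-wise (Apriori-style) search, for illustration only.
    """
    frequent = {}
    # Level 1: count each single entity across articles.
    counts = Counter(frozenset([item]) for art in articles for item in art)
    current = {s: c for s, c in counts.items() if c >= min_support}
    size = 1
    while current:
        frequent.update(current)
        if size == max_size:
            break
        size += 1
        # Candidate generation: keep a size-k set only if every
        # (k-1)-subset was frequent at the previous level.
        items = sorted({i for s in current for i in s})
        candidates = [
            frozenset(c)
            for c in combinations(items, size)
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        ]
        # Support counting: one pass over the articles per level.
        counts = Counter()
        for art in articles:
            for cand in candidates:
                if cand <= art:
                    counts[cand] += 1
        current = {s: c for s, c in counts.items() if c >= min_support}
    return frequent


# Made-up entity sets standing in for NLP output on five articles.
articles = [
    frozenset({"PersonA", "CityB", "OrgC"}),
    frozenset({"PersonA", "CityB", "TopicD"}),
    frozenset({"PersonA", "OrgC", "CityB"}),
    frozenset({"OrgE", "RegionF"}),
    frozenset({"OrgE", "RegionF", "PersonA"}),
]
result = frequent_itemsets(articles, min_support=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this sketch, itemsets such as {PersonA, CityB, OrgC} and {OrgE, RegionF} would be the raw material that a clustering step groups into topics.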

Citations

2217 Introduction to Modern Information Retrieval – Salton, McGill - 1983
1449 Mining association rules between sets of items in large databases – Agrawal, Imielinski, et al. - 1993
981 An algorithm for suffix stripping – Porter - 1980
981 Introduction to WordNet: an on-line lexical database – Miller, Fellbaum, et al. - 1990
961 Text Categorization with Support Vector Machines – Joachims - 1997
512 A comparative study on feature selection in text categorization – Yang, Pedersen - 1997
358 Mining generalized association rules – Srikant, Agrawal - 1995
322 Beyond market basket: Generalizing association rules to correlations – Brin, Motwani, et al.
300 Computational Analysis of Present-day American English – Kučera, Francis - 1967
159 Multilevel hypergraph partitioning: Application – Karypis, Aggarwal, et al. - 1997
100 Information Extraction as a Basis for High-Precision Text Classification – Riloff, Lehnert - 1994
93 Automatic structuring and retrieval of large text files – Salton, Allan, et al. - 1994
76 Clustering based on association rule hypergraphs – Han, Karypis, et al. - 1997
73 Natural Language Processing for Information Retrieval – Lewis, Jones - 1996
65 Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies – Chakrabarti, Dom, et al. - 1998
63 Fast and intuitive clustering of Web documents – Zamir, Etzioni, et al. - 1997
51 Multilevel hypergraph partitioning – Karypis, Aggarwal, et al. - 1997
49 A Method for Word-sense Disambiguation of Unrestricted Text (ACL-1999) – Mihalcea, Moldovan - 1999
43 Discovering trends in text databases – Lent, Agrawal, et al. - 1997
42 Retrieval Performance in FERRET: A Conceptual Information Retrieval System – Mauldin - 1991
25 Maximal association rules: A new tool for mining for keyword co-occurrences in document collections – Feldman, Aumann, et al. - 1997
19 A WordNet-based algorithm for word sense disambiguation – Li, Szpakowicz, et al. - 1995
17 Generating association rules from semi-structured documents using an extended concept hierarchy – Singh, Scheuermann, et al. - 1997
17 Language-oriented information retrieval – Lewis, Croft, et al. - 1989
16 Exploiting background information in knowledge discovery from text – Feldman, Hirsh - 1996
13 Classification of news stories using support vector machines – Cooley - 1999
12 Query flocks: a generalization of association-rule mining – Tsur, Ullman, et al. - 1998
10 Mixed Initiative Development of Language Processing Systems – Day
10 Beyond Market Baskets: Generalizing Association Rules to Dependence Rules. Data Mining and Knowledge Discovery – Silverstein, Brin, Motwani - 1998
7 Mining in the Phrasal Frontier – Ahonen, Heinonen, et al. - 1997
6 An efficient algorithm for the incremental updation of association rules in large databases – Thomas, Bodagala, Alsabti, Ranka - 1997
4 GeoNODE: Visualizing news in geospatial context – Hyland, Clifton, et al.
4 Discovering trends in text databases – Lent, Agrawal, Srikant - 1997
4 Maximal association rules: a new tool for mining for keyword co-occurrences in document collections – Feldman, Aumann, et al. - 1997
3 Efficient algorithms for discovering frequent sets in incremental databases – Feldman, Aumann, et al. - 1997