Abstract. TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques identify key entities in individual articles, allowing us to represent an article as a set of items. This lets us view the problem in a database/data mining context: identifying related groups of items. This paper presents a novel method for identifying related items based on "traditional" data mining techniques. Frequent itemsets are generated from the groups of items, and clusters are then formed with a hypergraph partitioning scheme. We present an evaluation against a manually categorized "ground truth" news corpus showing that this technique is effective in identifying topics in collections of news articles.
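The frequent-itemset step described above can be illustrated with a minimal sketch. It assumes each article has already been reduced to a set of entity strings; the function name `frequent_itemsets`, the parameters `min_support` and `max_size`, and the toy entities are all hypothetical. It uses brute-force enumeration for clarity rather than Apriori or whichever miner the authors used, and it does not attempt the hypergraph partitioning step.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Return {itemset: count} for itemsets found in >= min_support articles.

    Brute-force illustration only: every subset of up to max_size items
    is counted per article, which is feasible only for small item sets.
    """
    counts = Counter()
    for items in articles:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Hypothetical toy corpus: each article is the set of entities found in it.
articles = [
    {"Jane Doe", "Acme Corp", "Springfield"},
    {"Jane Doe", "Acme Corp", "Capital City"},
    {"Jane Doe", "Acme Corp", "Springfield", "Capital City"},
    {"John Roe", "Shelbyville"},
    {"John Roe", "Shelbyville", "Jane Doe"},
]

result = frequent_itemsets(articles, min_support=3)
for itemset, count in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), count)
```

In this toy corpus the only multi-item set that reaches the support threshold is the pair of "Jane Doe" and "Acme Corp", which is the kind of related group a later clustering step would take as input.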