Abstract:
This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words. PMI-IR is empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 synonym test questions from a collection of tests for students of English as a Second Language (ESL). On both tests, the algorithm obtains a score of 74%. PMI-IR is contrasted with Latent Semantic Analysis (LSA), which achieves a score of 64% on the same 80 TOEFL questions. The paper discusses potential applications of the new unsupervised learning algorithm and some implications of the results for LSA and LSI (Latent Semantic Indexing).
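To make the idea concrete, the following is a minimal sketch of how a PMI score computed from search-engine hit counts could be used to pick a synonym among several candidates. It is an illustration under stated assumptions, not the paper's exact scoring function: the helper names (`hits`, `pmi`, `best_synonym`), the document total `TOTAL_DOCS`, and every hit count in `TOY_HITS` are hypothetical stand-ins for real search-engine queries.

```python
import math

# Made-up hit counts standing in for search-engine results (assumption:
# a real system would query a Web search engine instead of this table).
TOY_HITS = {
    frozenset(["levied"]): 1000,
    frozenset(["imposed"]): 5000,
    frozenset(["believed"]): 20000,
    frozenset(["levied", "imposed"]): 400,   # documents containing both words
    frozenset(["levied", "believed"]): 100,
}
TOTAL_DOCS = 1_000_000  # hypothetical size of the indexed collection


def hits(*words):
    """Return the (toy) number of documents containing all given words."""
    return TOY_HITS.get(frozenset(words), 0)


def pmi(word1, word2, total=TOTAL_DOCS):
    """Pointwise mutual information: log2( p(w1, w2) / (p(w1) * p(w2)) )."""
    joint, h1, h2 = hits(word1, word2), hits(word1), hits(word2)
    if joint == 0 or h1 == 0 or h2 == 0:
        return float("-inf")  # no co-occurrence evidence at all
    return math.log2((joint / total) / ((h1 / total) * (h2 / total)))


def best_synonym(problem, choices):
    """Pick the candidate with the highest PMI against the problem word."""
    return max(choices, key=lambda choice: pmi(problem, choice))


print(best_synonym("levied", ["imposed", "believed"]))
```

With these toy counts the sketch selects "imposed", because it co-occurs with "levied" far more often than its overall frequency would predict. Since the problem word is the same for every candidate, its probability is a constant factor and does not affect the ranking.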