Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL (Lecture Notes in Computer Science)

by Peter D. Turney
http://extractor.iit.nrc.ca/publications/ECML2001.pdf

Abstract:

This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words. PMI-IR is empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 synonym test questions from a collection of tests for students of English as a Second Language (ESL). On both tests, the algorithm obtains a score of 74%. PMI-IR is contrasted with Latent Semantic Analysis (LSA), which achieves a score of 64% on the same 80 TOEFL questions. The paper discusses potential applications of the new unsupervised learning algorithm and some implications of the results for LSA and LSI (Latent Semantic Indexing).
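
The idea in the abstract, scoring word pairs by PMI computed from search-engine hit counts, can be illustrated with a minimal sketch. It is not the paper's implementation: the `AND` query syntax, the function names, the example words, and every hit count below are invented for illustration, and the score is the textbook PMI, log2(p(x, y) / (p(x) p(y))), with probabilities estimated as hit count divided by an assumed index size.

```python
import math
from typing import Callable, Sequence


def pmi(hits_xy: int, hits_x: int, hits_y: int, total: int) -> float:
    """Pointwise mutual information, log2(p(x, y) / (p(x) * p(y))).

    Probabilities are estimated as hit count / total documents.
    Returns -inf when any count is zero (no evidence of association).
    """
    if min(hits_xy, hits_x, hits_y) <= 0:
        return float("-inf")
    return math.log2((hits_xy * total) / (hits_x * hits_y))


def pick_synonym(
    problem: str,
    choices: Sequence[str],
    hits: Callable[[str], int],
    total: int,
) -> str:
    """Return the choice with the highest PMI against the problem word.

    `hits` maps a query string to a document count. The "x AND y" query
    form is an assumption, not a real search-engine API.
    """
    return max(
        choices,
        key=lambda c: pmi(hits(f"{problem} AND {c}"), hits(problem), hits(c), total),
    )


# Invented hit counts standing in for search-engine responses.
TOY_COUNTS = {
    "levied": 1_000,
    "imposed": 5_000,
    "believed": 20_000,
    "requested": 15_000,
    "correlated": 3_000,
    "levied AND imposed": 200,
    "levied AND believed": 50,
    "levied AND requested": 40,
    "levied AND correlated": 5,
}
TOY_TOTAL = 1_000_000  # assumed size of the indexed collection


def toy_hits(query: str) -> int:
    """Look up a query in the invented table; unknown queries get 0 hits."""
    return TOY_COUNTS.get(query, 0)


if __name__ == "__main__":
    options = ["imposed", "believed", "requested", "correlated"]
    print(pick_synonym("levied", options, toy_hits, TOY_TOTAL))
```

Because the problem word is the same for every choice, its hit count and the index size only shift all scores by a constant; the ranking depends on the ratio of co-occurrence hits to each choice's own hits.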

Citations

1463 Indexing by Latent Semantic Analysis – Deerwester, Dumais, et al. - 1990
1065 WordNet - An Electronic Lexical Database – Fellbaum - 1998
444 Solutions to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge – Landauer, Dumais - 1997
428 Word association norms, mutual information, and lexicography – Church, Hanks - 1989
342 Dynamic itemset counting and implication rules for market basket data – Brin, Motwani, et al. - 1997
282 Foundations of statistical natural language processing – Manning, Schütze - 1999
256 Automatic retrieval and clustering of similar words – Lin - 1998
187 Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language – Resnik - 1999
186 Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy – Jiang, Conrath - 1997
145 Latent semantic indexing: A probabilistic analysis – Papadimitriou, Raghavan, et al. - 2000
116 Using statistics in lexical analysis – Church, Gale, et al. - 1991
102 EuroWordNet: A Multilingual Database with Lexical Semantic Networks – Vossen (ed.) - 1998
100 Automatic query expansion using SMART: TREC 3. Paper presented at the NIST Special Publication 500–225: The Third Text REtrieval Conference (TREC-3) – Buckley, Salton, et al. - 1995
55 Learning algorithms for keyphrase extraction – Turney
43 Information retrieval based on conceptual distance in IS-A hierarchies – Lee, Kim, et al. - 1993
38 A synopsis of linguistic theory, 1930-1955 – Firth - 1957
35 Computational Methods for Intelligent Information Access – Berry, Dumais, et al. - 1995
32 Using WordNet as a Knowledge Base for Measuring Semantic Similarity between Words – Richardson, Smeaton, et al. - 1994
15 Finding Semantic Similarity in Raw Text: the Deese Antonyms – Grefenstette - 1992
5 Search Engine Sizes. SearchEngineWatch.com, internet.com Corporation – Sullivan - 2000
4 Interlingual BRICO – Haase - 2000
2 Test of English as a Foreign Language (TOEFL) – Educational Testing Service
2 Basic 2000 Words - Synonym Match 1 – Tatsuki - 1998
2 Comparison Between TREC2 and TREC3 – Jones - 1994