  Probabilistic author-topic models for information discovery (2004) [28 citations — 3 self]

by Mark Steyvers
In The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
http://www.datalab.uci.edu/papers/author-topics-kdd04.pdf

Abstract:

We propose a new unsupervised learning technique for extracting information from large text collections. We model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words for that topic. The words in a multi-author paper are assumed to be the result of a mixture of each author's topic mixture. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to a large corpus of 160,000 abstracts and 85,000 authors from the well-known CiteSeer digital library, and learn a model with 300 topics. We discuss in detail the interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, significant trends in the computer science literature between 1990 and 2002, parsing of abstracts by topics and authors and detection of unusual papers by specific authors. An online query interface to the model is also discussed that allows interactive exploration of author-topic models for corpora such as CiteSeer.
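
The two-stage generative process described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the function name `generate_document`, the toy sizes, the symmetric Dirichlet draws used to fill `theta` and `phi`, and the uniform choice among a paper's co-authors are all hypothetical choices made for illustration. It covers only the forward (generative) direction; the Markov chain Monte Carlo inference mentioned in the abstract is not shown.

```python
import numpy as np

def generate_document(doc_authors, theta, phi, n_words, rng):
    """Sample one document under a two-stage author-topic process.

    theta: (n_authors, n_topics) array; row a is author a's topic distribution.
    phi:   (n_topics, vocab_size) array; row t is topic t's word distribution.
    Returns parallel lists of word ids, chosen authors, and chosen topics.
    """
    words, authors, topics = [], [], []
    for _ in range(n_words):
        # Assumption: each word picks one of the paper's authors uniformly.
        a = int(rng.choice(doc_authors))
        # Stage 1: draw a topic from that author's topic distribution.
        t = int(rng.choice(theta.shape[1], p=theta[a]))
        # Stage 2: draw a word from that topic's word distribution.
        w = int(rng.choice(phi.shape[1], p=phi[t]))
        words.append(w)
        authors.append(a)
        topics.append(t)
    return words, authors, topics

# Toy setup (hypothetical sizes): 4 authors, 3 topics, 10-word vocabulary.
rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(3), size=4)   # author-topic distributions
phi = rng.dirichlet(np.ones(10), size=3)    # topic-word distributions

# A 50-word paper co-written by authors 0 and 2.
words, authors, topics = generate_document([0, 2], theta, phi, 50, rng)
```

Because every word carries its own author and topic assignment, a multi-author document ends up as a mixture of its authors' topic mixtures, which is the property the abstract describes.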

Citations

1463 Indexing by Latent Semantic Analysis – Deerwester, Dumais, et al. - 1990
412 Scatter/gather: A cluster-based approach to browsing large document collections – Cutting, Karger, et al. - 1992
373 Latent Dirichlet allocation – Blei, Ng, et al. - 2003
259 Probabilistic latent semantic indexing – Hofmann - 1999
197 Digital libraries and autonomous citation indexing – Lawrence, Giles - 1999
182 Operations for learning with graphical models – Buntine - 1994
138 Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching – McCallum, Nigam, et al. - 2000
111 Referral Web: combining social networks and collaborative filtering – Kautz, Selman, et al. - 1997
85 Finding scientific topics – Griffiths, Steyvers - 2004
57 Applied Bayesian and Classical Inference: The Case of the Federalist Papers – Mosteller, Wallace - 1984
40 The author-topic model for authors and documents – Rosen-Zvi, Griffiths, et al. - 2004
39 Algorithms for estimating relative importance in networks – White, Smyth - 2003
27 Authorship attribution with support vector machines – Diederich, Kindermann, et al.
18 Clustering and identifying temporal trends in document databases – Popescul, Flake, et al. - 2000
17 Software forensics: Extending authorship analysis techniques to computer programs – Gray, Sallis, et al. - 1997
12 WEBSOM for textual data mining – Lagus, Honkela, et al. - 1999
8 Exploring the computing literature using temporal graph visualization (Technical Report TR0304) – Erten, Harding, et al. - 2003
6 Did Shakespeare write a newly discovered poem – Thisted, Efron - 1987
2 Mining networks and central entities in digital libraries: a graph theoretic approach applied to co-author networks, Intelligent Data Analysis 2003 – Mutschke - 2003