Unsupervised Topic Discovery (2001)
Abstract:
This white paper describes a new problem: automatically determining a set of topics or subjects from a corpus. The result is a large number of topics, each with a meaningful name. Each document in the training corpus is assigned several of these topics. Finally, using the topic classification algorithms we have previously developed in OnTopic™, we estimate topic models from this corpus to use in topic classification for new documents from the same language and domain.

What are Topics?

There are many meta-definitions for topics, so it’s worth defining what we mean. By “topics”, we mean subjects that can be used to categorize a document, much like those used by Primary Source Media or by Reuters. For example, a story about the Oklahoma Bombing might be labeled with topics like “Bombings”, “Terrorism”, “Oklahoma”, and “Deaths and Injuries”. Each document is expected to have many topics assigned to it. The set of topics would usually number in the thousands. The topics cannot usually be organized into a strict tree. For example, the somewhat narrow topic “Labor Unions” should go under both “Economics” and “Politics”.
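
The point that topics do not form a strict tree, and that each document carries several topics, can be illustrated with a small data-structure sketch. This is only an illustration under stated assumptions, not the representation used in the paper or in OnTopic: the class name `TopicGraph`, its methods, and the document identifier `oklahoma_story` are all hypothetical. Only the topic names come from the text above.

```python
from collections import defaultdict


class TopicGraph:
    """Hypothetical topic hierarchy stored as a directed acyclic graph.

    Unlike a strict tree, a topic may have more than one parent.
    """

    def __init__(self):
        # Maps each topic name to the set of its direct parent topics.
        self.parents = defaultdict(set)

    def add_topic(self, topic, parents=()):
        """Register a topic and link it to zero or more parent topics."""
        self.parents[topic].update(parents)
        for parent in parents:
            # Make sure every parent is itself a known topic.
            self.parents.setdefault(parent, set())

    def ancestors(self, topic):
        """Return every topic reachable by following parent links upward."""
        seen = set()
        stack = list(self.parents.get(topic, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, ()))
        return seen


# "Labor Unions" sits under both "Economics" and "Politics",
# which a single-parent tree could not express.
graph = TopicGraph()
graph.add_topic("Labor Unions", parents=["Economics", "Politics"])

# Each document is assigned a set of topics (multi-label), not a single class.
# The document identifier is made up for this example.
document_topics = {
    "oklahoma_story": {"Bombings", "Terrorism", "Oklahoma", "Deaths and Injuries"},
}
```

Storing parents as a set per topic keeps the multi-parent case no harder than the single-parent one, and a plain set of labels per document reflects the expectation that each document has many topics.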

