Results 11  20
of
359
Collective entity resolution in relational data
 ACM Transactions on Knowledge Discovery from Data (TKDD
, 2006
"... Many databases contain uncertain and imprecise references to realworld entities. The absence of identifiers for the underlying entities often results in a database which contains multiple references to the same entity. This can lead not only to data redundancy, but also inaccuracies in query proces ..."
Abstract

Cited by 146 (12 self)
 Add to MetaCart
Many databases contain uncertain and imprecise references to realworld entities. The absence of identifiers for the underlying entities often results in a database which contains multiple references to the same entity. This can lead not only to data redundancy, but also inaccuracies in query processing and knowledge extraction. These problems can be alleviated through the use of entity resolution. Entity resolution involves discovering the underlying entities and mapping each database reference to these entities. Traditionally, entities are resolved using pairwise similarity over the attributes of references. However, there is often additional relational information in the data. Specifically, references to different entities may cooccur. In these cases, collective entity resolution, in which entities for cooccurring references are determined jointly rather than independently, can improve entity resolution accuracy. We propose a novel relational clustering algorithm that uses both attribute and relational information for determining the underlying domain entities, and we give an efficient implementation. We investigate the impact that different relational similarity measures have on entity resolution quality. We evaluate our collective entity resolution algorithm on multiple realworld databases. We show that it improves entity resolution performance over both attributebased baselines and over algorithms that consider relational information but do not resolve entities collectively. In addition, we perform detailed experiments on synthetically generated data to identify data characteristics that favor collective relational resolution over purely attributebased algorithms.
A collapsed variational bayesian inference algorithm for latent dirichlet allocation
 In NIPS
"... Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been fo ..."
Abstract

Cited by 137 (8 self)
 Add to MetaCart
(Show Context)
Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking. In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efficient, easy to implement and significantly more accurate than standard variational Bayesian inference for LDA. 1
The nested chinese restaurant process and bayesian inference of topic hierarchies
, 2007
"... We present the nested Chinese restaurant process (nCRP), a stochastic process which assigns probability distributions to infinitelydeep, infinitelybranching trees. We show how this stochastic process can be used as a prior distribution in a Bayesian nonparametric model of document collections. Spe ..."
Abstract

Cited by 128 (15 self)
 Add to MetaCart
(Show Context)
We present the nested Chinese restaurant process (nCRP), a stochastic process which assigns probability distributions to infinitelydeep, infinitelybranching trees. We show how this stochastic process can be used as a prior distribution in a Bayesian nonparametric model of document collections. Specifically, we present an application to information retrieval in which documents are modeled as paths down a random tree, and the preferential attachment dynamics of the nCRP leads to clustering of documents according to sharing of topics at multiple levels of abstraction. Given a corpus of documents, a posterior inference algorithm finds an approximation to a posterior distribution over trees, topics and allocations of words to levels of the tree. We demonstrate this algorithm on collections of scientific abstracts from several journals. This model exemplifies a recent trend in statistical machine learning—the use of Bayesian nonparametric methods to infer distributions on flexible data structures.
DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
"... Probabilistic topic models have become popular as methods for dimensionality reduction in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood or Bayesian methods. In this paper, we discuss an alternative: a discriminativ ..."
Abstract

Cited by 115 (0 self)
 Add to MetaCart
(Show Context)
Probabilistic topic models have become popular as methods for dimensionality reduction in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood or Bayesian methods. In this paper, we discuss an alternative: a discriminative framework in which we assume that supervised side information is present, and in which we wish to take that side information into account in finding a reduced dimensionality representation. Specifically, we present DiscLDA, a discriminative variation on Latent Dirichlet Allocation (LDA) in which a classdependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroups document classification task and show how our model can identify shared topics across classes as well as classdependent topics. 1
Evaluation methods for topic models
 In ICML
, 2009
"... A natural evaluation metric for statistical topic models is the probability of heldout documents given a trained model. While exact computation of this probability is intractable, several estimators for this probability have been used in the topic modeling literature, including the harmonic mean me ..."
Abstract

Cited by 111 (10 self)
 Add to MetaCart
(Show Context)
A natural evaluation metric for statistical topic models is the probability of heldout documents given a trained model. While exact computation of this probability is intractable, several estimators for this probability have been used in the topic modeling literature, including the harmonic mean method and empirical likelihood method. In this paper, we demonstrate experimentally that commonlyused methods are unlikely to accurately estimate the probability of heldout documents, and propose two alternative methods that are both accurate and efficient. 1.
A Latent Dirichlet Model for Unsupervised Entity Resolution
 SIAM INTERNATIONAL CONFERENCE ON DATA MINING
, 2006
"... Entity resolution has received considerable attention in recent years. Given many references to underlying entities, the goal is to predict which references correspond to the same entity. We show how to extend the Latent Dirichlet Allocation model for this task and propose a probabilistic model for ..."
Abstract

Cited by 102 (6 self)
 Add to MetaCart
(Show Context)
Entity resolution has received considerable attention in recent years. Given many references to underlying entities, the goal is to predict which references correspond to the same entity. We show how to extend the Latent Dirichlet Allocation model for this task and propose a probabilistic model for collective entity resolution for relational domains where references are connected to each other. Our approach differs from other recently proposed entity resolution approaches in that it is a) generative, b) does not make pairwise decisions and c) captures relations between entities through a hidden group variable. We propose a novel sampling algorithm for collective entity resolution which is unsupervised and also takes entity relations into account. Additionally, we do not assume the domain of entities to be known and show how to infer the number of entities from the data. We demonstrate the utility and practicality of our relational entity resolution approach for author resolution in two realworld bibliographic datasets. In addition, we present preliminary results on characterizing conditions under which relational information is useful.
Topic Models Conditioned on Arbitrary Features with Dirichletmultinomial Regression
"... Although fully generative models have been successfully used to model the contents of text documents, they are often awkward to apply to combinations of text data and document metadata. In this paper we propose a Dirichletmultinomial regression (DMR) topic model that includes a loglinear prior on ..."
Abstract

Cited by 102 (1 self)
 Add to MetaCart
(Show Context)
Although fully generative models have been successfully used to model the contents of text documents, they are often awkward to apply to combinations of text data and document metadata. In this paper we propose a Dirichletmultinomial regression (DMR) topic model that includes a loglinear prior on documenttopic distributions that is a function of observed features of the document, such as author, publication venue, references, and dates. We show that by selecting appropriate features, DMR topic models can meet or exceed the performance of several previously published topic models designed for specific data. 1
Topic modeling with network regularization
 In Proc. of the 17th WWW Conference
, 2008
"... In this paper, we formally define the problem of topic modeling with network structure (TMN). We propose a novel solution to this problem, which regularizes a statistical topic model with a harmonic regularizer based on a graph structure in the data. The proposed method combines topic modeling and s ..."
Abstract

Cited by 102 (9 self)
 Add to MetaCart
(Show Context)
In this paper, we formally define the problem of topic modeling with network structure (TMN). We propose a novel solution to this problem, which regularizes a statistical topic model with a harmonic regularizer based on a graph structure in the data. The proposed method combines topic modeling and social network analysis, and leverages the power of both statistical topic models and discrete regularization. The output of this model can summarize well topics in text, map a topic onto the network, and discover topical communities. With appropriate instantiations of the topic model and the graphbased regularizer, our model can be applied to a wide range of text mining problems such as authortopic analysis, community discovery, and spatial text mining. Empirical experiments on two data sets with different genres show that our approach is effective and outperforms both textoriented methods and networkoriented methods alone. The proposed model is general; it can be applied to any text collections with a mixture of topics and an associated network structure.
Distributed algorithms for topic models
 THE JOURNAL OF MACHINE LEARNING RESEARCH
, 2009
"... We describe distributed algorithms for two widelyused topic models, namely the Latent Dirichlet Allocation (LDA) model, and the Hierarchical Dirichet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distribute ..."
Abstract

Cited by 101 (3 self)
 Add to MetaCart
We describe distributed algorithms for two widelyused topic models, namely the Latent Dirichlet Allocation (LDA) model, and the Hierarchical Dirichet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting. In this algorithm processors concurrently perform Gibbs sampling over local data followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbssampled LDA. The second version is a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach, and merges newlycreated topics either by matching or by topicid. Using five realworld text corpora we show that distributed learning works well in practice. For both LDA and HDP, we show that the converged testdata log probability for distributed learning is indistinguishable from that obtained with singleprocessor learning. Our extensive experimental results include learning topic models for two multimillion document collections using a 1024processor parallel computer.
Unsupervised prediction of citation influences
 In International Conference on Machine learning
, 2007
"... Publication repositories contain an abundance of information about the evolution of scientific research areas. We address the problem of creating a visualization of a research area that describes the flow of topics between papers, quantifies the impact that papers have on each other, and helps to ..."
Abstract

Cited by 95 (3 self)
 Add to MetaCart
Publication repositories contain an abundance of information about the evolution of scientific research areas. We address the problem of creating a visualization of a research area that describes the flow of topics between papers, quantifies the impact that papers have on each other, and helps to identify key contributions. To this end, we devise a probabilistic topic model that explains the generation of documents; the model incorporates the aspects of topical innovation and topical inheritance via citations. We evaluate the model's ability to predict the strength of influence of citations against manually rated citations. 1.