Results 11-20 of 141
Discovering Important Nodes through Graph Entropy: The Case of Enron Email Database
 KDD, Proceedings of the 3rd international workshop on Link discovery
, 2005
Cited by 39 (0 self)
A major problem in social network analysis and link discovery is the discovery of hidden organizational structure and selection of interesting influential members based on low-level, incomplete and noisy evidence data. To address such a challenge, we exploit an information-theoretic model that combines information theory with statistical techniques from the area of text mining and natural language processing. The Entropy model identifies the most interesting and important nodes in a graph. We show how entropy models on graphs are relevant to the study of information flow in an organization. We review the results of two different experiments which are based on entropy models. The first version of this model has been successfully tested and evaluated on the Enron email dataset.
Audience selection for online brand advertising: privacy-friendly social network targeting
 In KDD ’09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
, 2009
Cited by 34 (2 self)
This paper describes and evaluates privacy-friendly methods for extracting quasi-social networks from browser behavior on user-generated content sites, for the purpose of finding good audiences for brand advertising (as opposed to click maximizing, for example). Targeting social-network neighbors resonates well with advertisers, and online browsing behavior data counterintuitively can allow the identification of good audiences anonymously. Besides being one of the first papers to our knowledge on data mining for online brand advertising, this paper makes several important contributions. We introduce a framework for evaluating brand audiences, in analogy to predictive-modeling holdout evaluation. We introduce methods for extracting quasi-social networks from data on visitations to social networking pages, without collecting any information on the identities of the browsers or the content of the social-network pages. We introduce measures of brand proximity in the network, and show that audiences with high brand proximity indeed show substantially higher brand affinity. Finally, we provide evidence that the quasi-social network embeds a true social network, which along with results from social theory offers one explanation for the increases in audience brand affinity.
Graph embedding for speaker recognition
 in Proc. Interspeech, 2010
Cited by 28 (6 self)
This chapter presents applications of graph embedding to the problem of text-independent speaker recognition. Speaker recognition is a general term encompassing multiple applications. At the core is the problem of speaker comparison: given two speech recordings (utterances), produce a score which measures speaker simi
Email alias detection using social network analysis
 In Proceedings of the ACM SIGKDD Workshop on Link Discovery: Issues, Approaches, and Applications
, 2005
Cited by 28 (0 self)
This research addresses the problem of correctly relating aliases that belong to the same entity. Previous approaches focused on natural language processing and structured data, whereas in this research we analyze the local association, or “social” network in which aliases reside. The network is constructed from email data mined from the Internet. Links in the network represent web pages on which two email addresses are collocated. The problem is defined as follows: given a social network S, constructed from email address collocations, and an email address E, identify any aliases for E that also appear in S. The alias detection methods are evaluated on a data set of over 14,000 University X email addresses for which ground truth relations are known. The results are reported as partial lists of k choices for possible aliases, ranked by predicted relational strength within the network. Given a source email address, a portion of all email addresses, 2%, are correctly linked to another alias that corresponds to the same entity by best rank, which is significantly better than random (0.007%) and a geodesic distance (1%) baseline prediction. Correct linkages increase to 15% and 30% within top-10 (0.07% of all emails) and top-100 rank lists (0.7% of all emails), respectively.
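The geodesic-distance baseline mentioned in this abstract can be illustrated with a small sketch. It is a toy under stated assumptions, not the authors' implementation: the function names `build_network` and `geodesic_ranking`, the alphabetical tie-breaking rule, and the sample addresses are all hypothetical. The sketch links addresses that are collocated on a page and ranks candidate aliases by hop distance from the source address.

```python
from collections import defaultdict, deque


def build_network(pages):
    """Link two addresses whenever they appear on the same page."""
    graph = defaultdict(set)
    for addresses in pages:
        unique = set(addresses)
        for address in unique:
            graph[address] |= unique - {address}
    return graph


def geodesic_ranking(graph, source, k):
    """Return up to k addresses ordered by hop distance from source (BFS)."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(graph[node]):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    candidates = [node for node in distance if node != source]
    # Assumption: ties in distance are broken alphabetically.
    candidates.sort(key=lambda node: (distance[node], node))
    return candidates[:k]


# Hypothetical toy data: each inner list is the set of addresses on one web page.
pages = [
    ["alice@x.edu", "bob@x.edu"],
    ["bob@x.edu", "a.smith@x.edu"],
    ["a.smith@x.edu", "carol@x.edu"],
]
network = build_network(pages)
ranking = geodesic_ranking(network, "alice@x.edu", k=2)

# Share of a 14,000-address pool covered by a top-10 list, in percent.
top10_share = 100 * 10 / 14000
```

With roughly 14,000 addresses, a top-10 list covers about 10 / 14,000 ≈ 0.07% of them, which is consistent with the figure quoted in the abstract.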
Learning Author-Topic Models from Text Corpora
 ACM TRANSACTIONS ON INFORMATION SYSTEMS
, 2008
Cited by 28 (2 self)
We propose a new unsupervised learning technique for extracting information about authors and topics from large text collections. We model documents as if they were generated by a two-stage stochastic process. An author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words. The probability distribution over topics in a multi-author paper is a mixture of the distributions associated with the authors. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to three large text corpora: 150,000 abstracts from the CiteSeer digital library, 1,740 papers from the Neural Information Processing Systems (NIPS) Conferences, and 121,000 emails from the Enron corporation. We discuss in detail the interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. Experiments based on perplexity scores for test documents and precision-recall for document retrieval are used to illustrate systematic differences between the proposed author-topic model and a number of alternatives. Extensions to the model, allowing (for example) generalizations of the notion of an author, are also briefly discussed.
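The generative process described in this abstract can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's code: the function name `generate_document`, the equal-weight choice among a document's authors, and the toy distributions are hypothetical, and the sketch covers only sampling, not the Markov chain Monte Carlo learning step.

```python
import random


def generate_document(authors, author_topics, topic_words, length, rng):
    """Sample words: pick an author, then a topic for that author, then a word."""
    words = []
    for _ in range(length):
        # Assumption: each word is attributed to one author chosen uniformly,
        # which makes the document's topic distribution an equal-weight mixture.
        author = rng.choice(authors)
        topics, topic_probs = zip(*author_topics[author].items())
        topic = rng.choices(topics, weights=topic_probs)[0]
        vocab, word_probs = zip(*topic_words[topic].items())
        words.append(rng.choices(vocab, weights=word_probs)[0])
    return words


# Hypothetical toy distributions: authors over topics, topics over words.
author_topics = {
    "ann": {"graphs": 0.9, "speech": 0.1},
    "raj": {"graphs": 0.2, "speech": 0.8},
}
topic_words = {
    "graphs": {"node": 0.5, "edge": 0.3, "kernel": 0.2},
    "speech": {"speaker": 0.6, "utterance": 0.4},
}
document = generate_document(["ann", "raj"], author_topics, topic_words, 12, random.Random(0))
```

Learning reverses this process: given only the words and the author lists, it estimates the two sets of distributions that the sketch takes as input.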
Mining Scale-free Networks using Geodesic Clustering
 IN PROC. 10TH ACM SIGKDD INT. CONF
, 2004
Cited by 27 (6 self)
Many real-world graphs have been shown to be scale-free: vertex degrees follow power law distributions, vertices tend to cluster, and the average length of all shortest paths is small. We present a new model for understanding scale-free networks based on multilevel geodesic approximation, using a new data structure called a multilevel mesh. Using this
An experimental investigation of graph kernels on a collaborative recommendation task
 Proceedings of the 6th International Conference on Data Mining (ICDM 2006)
, 2006
Cited by 27 (7 self)
This paper presents a survey as well as a systematic empirical comparison of seven graph kernels and two related similarity matrices (simply referred to as graph kernels), namely the exponential diffusion kernel, the Laplacian exponential diffusion kernel, the von Neumann diffusion kernel, the regularized Laplacian kernel, the commute-time kernel, the random-walk-with-restart similarity matrix, and finally, three graph kernels introduced in this paper: the regularized commute-time kernel, the Markov diffusion kernel, and the cross-entropy diffusion matrix. The kernel-on-a-graph approach is simple and intuitive. It is illustrated by applying the nine graph kernels to a collaborative-recommendation task and to a semi-supervised classification task, both on several databases. The graph methods compute proximity measures between nodes that help study the structure of the graph. Our comparisons suggest that the regularized commute-time and the Markov diffusion kernels perform best, closely followed by the regularized Laplacian kernel.
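To make the idea of a kernel on a graph concrete, here is a small sketch of one of the kernels named above, the regularized Laplacian kernel. It assumes the commonly used definition K = (I + αL)^(-1), where L = D - A is the graph Laplacian; the abstract itself does not give the formula, so treat the definition, the function names, and the value of α as assumptions. The inversion is done in plain Python to keep the example self-contained.

```python
def laplacian(adjacency):
    """Graph Laplacian L = D - A for a symmetric adjacency matrix."""
    n = len(adjacency)
    return [
        [(sum(adjacency[i]) if i == j else 0.0) - adjacency[i][j] for j in range(n)]
        for i in range(n)
    ]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def regularized_laplacian_kernel(adjacency, alpha):
    """Assumed definition: K = (I + alpha * L)^(-1), with alpha > 0."""
    lap = laplacian(adjacency)
    n = len(lap)
    shifted = [
        [(1.0 if i == j else 0.0) + alpha * lap[i][j] for j in range(n)]
        for i in range(n)
    ]
    return invert(shifted)


# Path graph 0 - 1 - 2: node 0 should be more similar to node 1 than to node 2.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
K = regularized_laplacian_kernel(path, alpha=0.5)
```

Entry K[i][j] acts as a proximity score between nodes i and j; in a recommendation setting, such scores could be used to rank items by their proximity to a given user node.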
Visualization of social and other scale-free networks
 IN PROC. OF IEEE INFOVIS
, 2008
Cited by 27 (1 self)
This paper proposes novel methods for visualizing specifically the large power-law graphs that arise in sociology and the sciences. In such cases a large portion of edges can be shown to be less important and removed while preserving component connectedness and other features (e.g. cliques) to more clearly reveal the network’s underlying connection pathways. This simplification approach deterministically filters (instead of clustering) the graph to retain important node and edge semantics, and works both automatically and interactively. The improved graph filtering and layout is combined with a novel computer graphics anisotropic shading of the dense crisscrossing array of edges to yield a full social network and scale-free graph visualization system. Both quantitative analysis and visual results demonstrate the effectiveness of this approach.
A family of dissimilarity measures between nodes generalizing both the shortest-path and the commute-time distances
 in Proceedings of the 14th SIGKDD International Conference on Knowledge Discovery and Data Mining
Cited by 25 (11 self)
This work introduces a new family of link-based dissimilarity measures between nodes of a weighted directed graph. This measure, called the randomized shortest-path (RSP) dissimilarity, depends on a parameter θ and has the interesting property of reducing, on one end, to the standard shortest-path distance when θ is large and, on the other end, to the commute-time (or resistance) distance when θ is small (near zero). Intuitively, it corresponds to the expected cost incurred by a random walker in order to reach a destination node from a starting node while maintaining a constant entropy (related to θ) spread in the graph. The parameter θ is therefore biasing gradually the simple random walk on the graph towards the shortest-path policy. By adopting a statistical physics approach and computing a sum over all the possible paths (discrete path integral), it is shown that the RSP dissimilarity from every node to a particular node of interest can be computed efficiently by solving two linear systems of n equations, where n is the number of nodes. On the other hand, the dissimilarity between every couple of nodes is obtained by inverting an n × n matrix. The proposed measure can be used for various graph mining tasks such as computing betweenness centrality, finding dense communities, etc., as shown in the experimental section.
Graph nodes clustering with the sigmoid commute-time kernel: A . . .
 DATA & KNOWLEDGE ENGINEERING
, 2009