Results 11  20
of
332
Clustering data streams: Theory and practice
 IEEE TKDE
, 2003
"... The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, ..."
Abstract

Cited by 157 (5 self)
 Add to MetaCart
The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm’s performance on synthetic and real data streams.
Towards a Theoretical Foundation for LaplacianBased Manifold Methods
, 2007
"... In recent years manifold methods have attracted a considerable amount of attention in machine learning. However most algorithms in that class may be termed “manifoldmotivated” as they lack any explicit theoretical guarantees. In this paper we take a step towards closing the gap between theory and p ..."
Abstract

Cited by 156 (12 self)
 Add to MetaCart
(Show Context)
In recent years manifold methods have attracted a considerable amount of attention in machine learning. However most algorithms in that class may be termed “manifoldmotivated” as they lack any explicit theoretical guarantees. In this paper we take a step towards closing the gap between theory and practice for a class of Laplacianbased manifold methods. These methods utilize the graph Laplacian associated to a data set for a variety of applications in semisupervised learning, clustering, data representation. We show that under certain conditions the graph Laplacian of a point cloud of data samples converges to the LaplaceBeltrami operator on the underlying manifold. Theorem 3.1 contains the first result showing convergence of a random graph Laplacian to the manifold Laplacian in the context of machine learning.
Learning segmentation by random walks
 In Advances in Neural Information Processing
, 2000
"... Abstract We present a new view of image segmentation by pairwise similarities. We interpret the similarities as edge flows in a Markov random walk and study the eigenvalues and eigenvectors of the walk's transition matrix. This interpretation shows that spectral methods for clustering and segme ..."
Abstract

Cited by 141 (6 self)
 Add to MetaCart
(Show Context)
Abstract We present a new view of image segmentation by pairwise similarities. We interpret the similarities as edge flows in a Markov random walk and study the eigenvalues and eigenvectors of the walk's transition matrix. This interpretation shows that spectral methods for clustering and segmentation have a probabilistic foundation. In particular, we prove that the Normalized Cut method arises naturally from our framework. Finally, the framework provides a principled method for learning the similarity function as a combination of features. 1 Introduction Among the most successful methods in image segmentation combine a global optimality segmentation criterion with local similarity features[3]. Similarity between two pixels i; j is defined as a positive function Sij depending on the local image properties of the pixels(e.g. color, texture, edge flow). Local features are not only computationally convenient, they are also supported by neurological evidence about the human perception of shapes.
Graph mining: laws, generators, and algorithms
 ACM COMPUT SURV (CSUR
, 2006
"... How does the Web look? How could we tell an abnormal social network from a normal one? These and similar questions are important in many fields where the data can intuitively be cast as a graph; examples range from computer networks to sociology to biology and many more. Indeed, any M: N relation in ..."
Abstract

Cited by 132 (7 self)
 Add to MetaCart
How does the Web look? How could we tell an abnormal social network from a normal one? These and similar questions are important in many fields where the data can intuitively be cast as a graph; examples range from computer networks to sociology to biology and many more. Indeed, any M: N relation in database terminology can be represented as a graph. A lot of these questions boil down to the following: “How can we generate synthetic but realistic graphs? ” To answer this, we must first understand what patterns are common in realworld graphs and can thus be considered a mark of normality/realism. This survey give an overview of the incredible variety of work that has been done on these problems. One of our main contributions is the integration of points of view from physics, mathematics, sociology, and computer science. Further, we briefly describe recent advances on some related and interesting graph problems.
You Are Who You Know: Inferring User Profiles in Online Social Networks
"... Online social networks are now a popular way for users to connect, express themselves, and share content. Users in today’s online social networks often post a profile, consisting of attributes like geographic location, interests, and schools attended. Such profile information is used on the sites as ..."
Abstract

Cited by 118 (8 self)
 Add to MetaCart
(Show Context)
Online social networks are now a popular way for users to connect, express themselves, and share content. Users in today’s online social networks often post a profile, consisting of attributes like geographic location, interests, and schools attended. Such profile information is used on the sites as a basis for grouping users, for sharing content, and for suggesting users who may benefit from interaction. However, in practice, not all users provide these attributes. In this paper, we ask the question: given attributes for some fraction of the users in an online social network, can we infer the attributes of the remaining users? In other words, can the attributes of users, in combination with the social network graph, be used to predict the attributes of another user in the network? To answer this question, we gather finegrained data from two social networks and try to infer user profile attributes. We find that users with common attributes are more likely to be friends and often form dense communities, and we propose a method of inferring user attributes that is inspired by previous approaches to detecting communities in social networks. Our results show that certain user attributes can be inferred with high accuracy when given information on as little as 20 % of the users.
Beyond streams and graphs: Dynamic tensor analysis
 In KDD
, 2006
"... How do we find patterns in authorkeyword associations, evolving over time? Or in DataCubes, with productbranchcustomer sales information? Matrix decompositions, like principal component analysis (PCA) and variants, are invaluable tools for mining, dimensionality reduction, feature selection, rule ..."
Abstract

Cited by 113 (16 self)
 Add to MetaCart
(Show Context)
How do we find patterns in authorkeyword associations, evolving over time? Or in DataCubes, with productbranchcustomer sales information? Matrix decompositions, like principal component analysis (PCA) and variants, are invaluable tools for mining, dimensionality reduction, feature selection, rule identification in numerous settings like streaming data, text, graphs, social networks and many more. However, they have only two orders, like author and keyword, in the above example. We propose to envision such higher order data as tensors, and tap the vast literature on the topic. However, these methods do not necessarily scale up, let alone operate on semiinfinite streams. Thus, we introduce the dynamic tensor analysis (DTA) method, and its variants. DTA provides a compact summary for highorder and highdimensional data, and it also reveals the hidden correlations. Algorithmically, we designed DTA very carefully so that it is (a) scalable, (b) space efficient (it does not need to store the past) and (c) fully automatic with no need for user defined parameters. Moreover, we propose STA, a streaming tensor analysis method, which provides a fast, streaming approximation to DTA. We implemented all our methods, and applied them in two real settings, namely, anomaly detection and multiway latent semantic indexing. We used two real, large datasets, one on network flow data (100GB over 1 month) and one from DBLP (200MB over 25 years). Our experiments show that our methods are fast, accurate and that they find interesting patterns and outliers on the real datasets. 1.
SybilInfer: Detecting Sybil Nodes using Social Networks
"... SybilInfer is an algorithm for labelling nodes in a social network as honest users or Sybils controlled by an adversary. At the heart of SybilInfer lies a probabilistic model of honest social networks, and an inference engine that returns potential regions of dishonest nodes. The Bayesian inference ..."
Abstract

Cited by 109 (9 self)
 Add to MetaCart
SybilInfer is an algorithm for labelling nodes in a social network as honest users or Sybils controlled by an adversary. At the heart of SybilInfer lies a probabilistic model of honest social networks, and an inference engine that returns potential regions of dishonest nodes. The Bayesian inference approach to Sybil detection comes with the advantage label has an assigned probability, indicating its degree of certainty. We prove through analytical results as well as experiments on simulated and realworld network topologies that, given standard constraints on the adversary, SybilInfer is secure, in that it successfully distinguishes between honest and dishonest nodes and is not susceptible to manipulation by the adversary. Furthermore, our results show that SybilInfer outperforms state of the art algorithms, both in being more widely applicable, as well as providing vastly more accurate results. 1
The Principal Components Analysis of a Graph, and its Relationships to Spectral Clustering
 Proceedings of the 15th European Conference on Machine Learning (ECML 2004). Lecture Notes in Artificial Intelligence
, 2004
"... This work presents a novel procedure for computing (1) distances between nodes of a weighted, undirected, graph, called the Euclidean Commute Time Distance (ECTD), and (2) a subspace projection of the nodes of the graph that preserves as much variance as possible, in terms of the ECTD  a princi ..."
Abstract

Cited by 101 (21 self)
 Add to MetaCart
This work presents a novel procedure for computing (1) distances between nodes of a weighted, undirected, graph, called the Euclidean Commute Time Distance (ECTD), and (2) a subspace projection of the nodes of the graph that preserves as much variance as possible, in terms of the ECTD  a principal components analysis of the graph. It is based on a Markovchain model of random walk through the graph. The model assigns transition probabilities to the links between nodes, so that a random walker can jump from node to node. A quantity, called the average commute time, computes the average time taken by a random walker for reaching node j when starting from node i, and coming back to node i. The square root of this quantity, the ECTD, is a distance measure between any two nodes, and has the nice property of decreasing when the number of paths connecting two nodes increases and when the "length" of any path decreases. The ECTD can be computed from the pseudoinverse of the Laplacian matrix of the graph, which is a kernel. We finally define the Principal Components Analysis (PCA) of a graph as the subspace projection that preserves as much variance as possible, in terms of the ECTD. This graph PCA has some interesting links with spectral graph theory, in particular spectral clustering.
Filtering Spam with Behavioral Blacklisting
, 2007
"... Spam filters often use the reputation of an IP address (or IP address range) to classify email senders. This approach worked well when most spam originated from senders with fixed IP addresses, but spam today is also sent from IP addresses for which blacklist maintainers have outdated or inaccurate ..."
Abstract

Cited by 100 (10 self)
 Add to MetaCart
Spam filters often use the reputation of an IP address (or IP address range) to classify email senders. This approach worked well when most spam originated from senders with fixed IP addresses, but spam today is also sent from IP addresses for which blacklist maintainers have outdated or inaccurate information (or no information at all). Spam campaigns also involve many senders, reducing the amount of spam any particular IP address sends to a single domain; this method allows spammers to stay “under the radar”. The dynamism of any particular IP address begs for blacklisting techniques that automatically adapt as the senders of spam change. This paper presents SpamTracker, a spam filtering system that uses a new technique called behavioral blacklisting to classify email senders based on their sending behavior rather than their identity. Spammers cannot evade SpamTracker merely by using “fresh” IP addresses because blacklisting decisions are based on sending patterns, which tend to remain more invariant. SpamTracker uses fast clustering algorithms that react quickly to changes in sending patterns. We evaluate SpamTracker’s ability to classify spammers using email logs for over 115 email domains; we find that SpamTracker can correctly classify many spammers missed by current filtering techniques. Although our current datasets prevent us from confirming SpamTracker’s ability to completely distinguish spammers from legitimate senders, our evaluation shows that SpamTracker can identify a significant fraction of spammers that current IPbased blacklists miss. SpamTracker’s ability to identify spammers before existing blacklists suggests that it can be used in conjunction with existing techniques (e.g., as an input to greylisting). SpamTracker is inherently distributed and can be easily replicated; incorporating it into existing email filtering infrastructures requires only small modifications to mail server configurations.
A Decentralized Algorithm for Spectral Analysis
, 2004
"... In many large network settings, such as computer networks, social networks, or hyperlinked text documents, much information can be obtained from the network’s spectral properties. However, traditional centralized approaches for computing eigenvectors struggle with at least two obstacles: the data ma ..."
Abstract

Cited by 94 (2 self)
 Add to MetaCart
In many large network settings, such as computer networks, social networks, or hyperlinked text documents, much information can be obtained from the network’s spectral properties. However, traditional centralized approaches for computing eigenvectors struggle with at least two obstacles: the data may be difficult to obtain (both due to technical reasons and because of privacy concerns), and the sheer size of the networks makes the computation expensive. A decentralized, distributed algorithm addresses both of these obstacles: it utilizes the computational power of all nodes in the network and their ability to communicate, thus speeding up the computation with the network size. And as each node knows its incident edges, the data collection problem is avoided as well. Our main result is a simple decentralized algorithm for computing the top k eigenvectors of a symmetric weighted adjacency matrix, and a proof that it converges essentially in O(τmix log 2 n) rounds of communication and computation, where τmix is the mixing time of a random walk on the network. An additional contribution of our work is a decentralized way of actually detecting convergence, and diagnosing the current error. Our protocol scales well, in that the amount of computation performed at any node in any one round, and the sizes of messages sent, depend polynomially on k, but not at all on the (typically much larger) number n of nodes.