Correlation Clustering
- MACHINE LEARNING
, 2002
Cited by 332 (4 self)
We consider the following clustering problem: we have a complete graph on n vertices (items), where each edge (u, v) is labeled either + or − depending on whether u and v have been deemed to be similar or different. The goal is to produce a partition of the vertices (a clustering) that agrees as much as possible with the edge labels. That is, we want a clustering that maximizes the number of + edges within clusters, plus the number of − edges between clusters (equivalently, minimizes the number of disagreements: the number of − edges inside clusters plus the number of + edges between clusters). This formulation is motivated from a document clustering problem in which one has a pairwise similarity function f learned from past data, and the goal is to partition the current set of documents in a way that correlates with f as much as possible; it can also be viewed as a kind of "agnostic learning" problem. An interesting
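The disagreement objective described in this abstract can be illustrated with a minimal sketch. The function name, the dictionary-based input format, and the four-item instance are all hypothetical choices made for illustration; this only evaluates a given clustering and is not the paper's approximation algorithm.

```python
from itertools import combinations

def count_disagreements(labels, cluster_of):
    """Count edges whose label conflicts with a given clustering.

    labels: dict mapping frozenset({u, v}) -> '+' or '-' for every pair.
    cluster_of: dict mapping each vertex to a cluster id.
    """
    disagreements = 0
    for u, v in combinations(sorted(cluster_of), 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        label = labels[frozenset((u, v))]
        # A '-' edge inside a cluster or a '+' edge between clusters disagrees.
        if (label == '-' and same_cluster) or (label == '+' and not same_cluster):
            disagreements += 1
    return disagreements

# Hypothetical instance: a-b and b-c are similar, every other pair is different.
labels = {
    frozenset("ab"): '+', frozenset("bc"): '+', frozenset("ac"): '-',
    frozenset("ad"): '-', frozenset("bd"): '-', frozenset("cd"): '-',
}
merged = {"a": 0, "b": 0, "c": 0, "d": 1}      # clusters {a, b, c}, {d}
singletons = {"a": 0, "b": 1, "c": 2, "d": 3}  # every item alone

print(count_disagreements(labels, merged))
print(count_disagreements(labels, singletons))
```

Note that the number of clusters is not an input here: it is determined by whichever partition is being scored.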
Improved approximation algorithms for large matrices via random projections.
- In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science
, 2006
Toward privacy in public databases
, 2005
Cited by 107 (10 self)
We initiate a theoretical study of the census problem. Informally, in a census individual respondents give private information to a trusted party (the census bureau), who publishes a sanitized version of the data. There are two fundamentally conflicting requirements: privacy for the respondents and utility of the sanitized data. Unlike in the study of secure function evaluation, in which privacy is preserved to the extent possible given a specific functionality goal, in the census problem privacy is paramount; intuitively, things that cannot be learned “safely” should not be learned at all. An important contribution of this work is a definition of privacy (and privacy compromise) for statistical databases, together with a method for describing and comparing the privacy offered by specific sanitization techniques. We obtain several privacy results using two different sanitization techniques, and then show how to combine them via cross training. We also obtain two utility results involving clustering.
A Decentralized Algorithm for Spectral Analysis
, 2004
Cited by 94 (2 self)
In many large network settings, such as computer networks, social networks, or hyperlinked text documents, much information can be obtained from the network’s spectral properties. However, traditional centralized approaches for computing eigenvectors struggle with at least two obstacles: the data may be difficult to obtain (both due to technical reasons and because of privacy concerns), and the sheer size of the networks makes the computation expensive. A decentralized, distributed algorithm addresses both of these obstacles: it utilizes the computational power of all nodes in the network and their ability to communicate, thus speeding up the computation with the network size. And as each node knows its incident edges, the data collection problem is avoided as well. Our main result is a simple decentralized algorithm for computing the top k eigenvectors of a symmetric weighted adjacency matrix, and a proof that it converges essentially in O(τ_mix log² n) rounds of communication and computation, where τ_mix is the mixing time of a random walk on the network. An additional contribution of our work is a decentralized way of actually detecting convergence, and diagnosing the current error. Our protocol scales well, in that the amount of computation performed at any node in any one round, and the sizes of messages sent, depend polynomially on k, but not at all on the (typically much larger) number n of nodes.
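For orientation, the quantity being computed (the top k eigenvectors of a symmetric matrix) can be obtained centrally with standard orthogonal iteration, sketched below. This is a centralized textbook method, not the decentralized protocol of the paper; the function names, the starting matrix, and the fixed round count are assumptions made for the example.

```python
import math

def mat_mul(A, V):
    """Multiply an n x n matrix A by an n x k matrix V (both lists of rows)."""
    n, k = len(A), len(V[0])
    return [[sum(A[i][j] * V[j][c] for j in range(n)) for c in range(k)]
            for i in range(n)]

def orthonormalize(V):
    """Gram-Schmidt on the columns of V; returns orthonormal columns."""
    n, k = len(V), len(V[0])
    cols = []
    for c in range(k):
        col = [V[i][c] for i in range(n)]
        for q in cols:
            dot = sum(a * b for a, b in zip(col, q))
            col = [a - dot * b for a, b in zip(col, q)]
        norm = math.sqrt(sum(a * a for a in col))
        cols.append([a / norm for a in col])
    return [[cols[c][i] for c in range(k)] for i in range(n)]

def orthogonal_iteration(A, k, rounds=200):
    """Approximate the k eigenpairs of symmetric A that are largest in magnitude."""
    n = len(A)
    # Deterministic, linearly independent start (hypothetical choice for the demo).
    V = [[1.0 / (1 + i + c) + (1.0 if i == c else 0.0) for c in range(k)]
         for i in range(n)]
    for _ in range(rounds):
        V = orthonormalize(mat_mul(A, V))
    AV = mat_mul(A, V)
    # Rayleigh quotients of the orthonormal columns estimate the eigenvalues.
    eigenvalues = [sum(V[i][c] * AV[i][c] for i in range(n)) for c in range(k)]
    return eigenvalues, V

# Adjacency matrix of a triangle; its eigenvalues are 2, -1, -1.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
eigenvalues, vectors = orthogonal_iteration(triangle, k=2)
print(eigenvalues)
```

Each round needs only a multiplication by the adjacency matrix followed by an orthonormalization step. The multiplication uses information local to each node and its neighbors, which is why a decentralized variant is plausible at all.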
Spectral techniques applied to sparse random graphs
- Random Structures and Algorithms
, 2003
Cited by 62 (3 self)
We analyze the eigenvalue gap for the adjacency matrices of sparse random graphs. Let λ1 ≥ ... ≥ λn be the eigenvalues of an n-vertex graph, and let λ = max[λ2, |λn|]. Let c be a large enough constant. For graphs of average degree d = c log n it is well known that λ1 ≥ d, and we show that λ = O(√d). For d = c it is no longer true that λ = O(√d), but we show that by removing a small number of vertices of highest degree in G, one gets a graph G′ for which λ = O(√d). Our proofs are based on the techniques of Kahn and Szemerédi from STOC 1989, who proved similar results for regular graphs. Our results are useful for extending the analysis of certain heuristics to sparser instances of NP-hard problems. We illustrate this by removing some unnecessary logarithmic factors in the density of k-SAT formulas that are refuted by the algorithm of Goerdt and Krivelevich from STACS 2001.
Clustering partially observed graphs via convex optimization.
- Journal of Machine Learning Research
, 2014
Cited by 47 (13 self)
This paper considers the problem of clustering a partially observed unweighted graph, i.e., one where for some node pairs we know there is an edge between them, for some others we know there is no edge, and for the remaining we do not know whether or not there is an edge. We want to organize the nodes into disjoint clusters so that there is relatively dense (observed) connectivity within clusters, and sparse across clusters. We take a novel yet natural approach to this problem, by focusing on finding the clustering that minimizes the number of "disagreements", i.e., the sum of the number of (observed) missing edges within clusters, and (observed) present edges across clusters. Our algorithm uses convex optimization; its basis is a reduction of disagreement minimization to the problem of recovering an (unknown) low-rank matrix and an (unknown) sparse matrix from their partially observed sum. We evaluate the performance of our algorithm on the classical Planted Partition/Stochastic Block Model. Our main theorem provides sufficient conditions for the success of our algorithm as a function of the minimum cluster size, edge density and observation probability; in particular, the results characterize the tradeoff between the observation probability and the edge density gap. When there are a constant number of clusters of equal size, our results are optimal up to logarithmic factors.
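One way to read the reduction mentioned in this abstract is written out below. The notation is an assumption introduced here, not taken from the paper: A is the 0/1 adjacency matrix, Ω the set of observed node pairs, and Y the cluster matrix with Y_ij = 1 exactly when i and j share a cluster (so Y_ii = 1).

```latex
\[
  \mathrm{disagreements}(Y)
  \;=\; \sum_{\substack{(i,j) \in \Omega \\ i < j}} \lvert A_{ij} - Y_{ij} \rvert ,
  \qquad
  A_{ij} \;=\; Y_{ij} + B_{ij} \quad \text{for } (i,j) \in \Omega .
\]
```

Under this reading, B = A − Y is nonzero on Ω only at disagreeing pairs, so it is sparse when disagreements are few, while Y is block diagonal after reordering the nodes by cluster, so its rank equals the number of clusters. Recovering Y and B from the observed entries of their sum is then the low-rank plus sparse problem the abstract refers to.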
Spectral clustering of graphs with general degrees in the extended planted partition model
- Journal of Machine Learning Research - Proceedings Track
Cited by 42 (0 self)
In this paper, we examine a spectral clustering algorithm for similarity graphs drawn from a simple random graph model, where nodes are allowed to have varying degrees, and we provide theoretical bounds on its performance. The random graph model we study is the Extended Planted Partition (EPP) model, a variant of the classical planted partition model. The standard approach to spectral clustering of graphs is to compute the bottom k singular vectors or eigenvectors of a suitable graph Laplacian, project the nodes of the graph onto these vectors, and then use an iterative clustering algorithm on the projected nodes. However, a challenge with applying this approach to graphs generated from the EPP model is that unnormalized Laplacians do not work, and normalized Laplacians do not concentrate well when the graph has a number of low degree nodes. We resolve this issue by introducing the notion of a degree-corrected graph Laplacian. For graphs with many low degree nodes, degree correction has a regularizing effect on the Laplacian. Our spectral clustering algorithm projects the nodes in the graph onto the bottom k right singular vectors of the degree-corrected random-walk Laplacian, and clusters the nodes in this subspace. We show guarantees on the performance of this algorithm, demonstrating that it outputs the correct partition under a wide range of parameter values. Unlike some previous work, our algorithm does not require access to any generative parameters of the model.
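The abstract does not give the formula for the degree-corrected Laplacian. The sketch below assumes one plausible form, I − (D + τI)⁻¹A, in which a constant τ > 0 is added to every degree before normalizing. The function name, the parameter tau, and the example graph are hypothetical, and the subsequent singular-vector projection and clustering steps are not shown.

```python
def degree_corrected_rw_laplacian(adj, tau):
    """Build I - (D + tau*I)^{-1} A for a 0/1 adjacency matrix (list of rows).

    Assumed form of degree correction: tau > 0 is added to each node's degree,
    which keeps the scaling bounded for low-degree (even isolated) nodes.
    """
    n = len(adj)
    degrees = [sum(row) for row in adj]
    laplacian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        scale = 1.0 / (degrees[i] + tau)
        for j in range(n):
            identity = 1.0 if i == j else 0.0
            laplacian[i][j] = identity - scale * adj[i][j]
    return laplacian

# Path graph 0 - 1 - 2 plus an isolated node 3.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
L = degree_corrected_rw_laplacian(adj, tau=1.0)
for row in L:
    print(row)
```

With τ = 0 the row for the isolated node would require dividing by zero; with τ > 0 it is simply a row of the identity, which is one concrete sense in which the correction regularizes low-degree nodes.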
Graph partitioning via adaptive spectral techniques
- Comb. Probab. Comput
Cited by 38 (0 self)
In this paper we study the use of spectral techniques for graph partitioning. Let G = (V, E) be a graph whose vertex set has a “latent” partition V_1, ..., V_k. Moreover, consider a “density matrix” E = (E_vw)_{v,w∈V} such that for v ∈ V_i and w ∈ V_j the entry E_vw is the fraction of all possible V_i-V_j edges that are actually present in G. We show that on input (G, k) the partition V_1, ..., V_k can (almost) be recovered in polynomial time via spectral methods, provided that the following holds: E approximates the adjacency matrix of G in the operator norm, for vertices v ∈ V_i, w ∈ V_j ≠ V_i the corresponding column vectors E_v, E_w are separated, and G is sufficiently “regular” w.r.t. the matrix E. This result in particular applies to sparse graphs with bounded average degree as n = #V → ∞, and it yields interesting consequences on partitioning random graphs.
Modularity-Maximizing Graph Communities via Mathematical Programming
Cited by 38 (1 self)
In many networks, it is of great interest to identify communities, unusually densely knit groups of individuals. Such communities often shed light on the function of the networks or underlying properties of the individuals. Recently, Newman suggested modularity as a natural measure of the quality of a network partitioning into communities. Since then, various algorithms have been proposed for (approximately) maximizing the modularity of the partitioning determined. In this paper, we introduce the technique of rounding mathematical programs to the problem of modularity maximization, presenting two novel algorithms. More specifically, the algorithms round solutions to linear and vector programs. Importantly, the linear programming algorithm comes with an a posteriori approximation guarantee: by comparing the solution quality to the fractional solution of the linear program, a bound on the available “room for improvement” can be obtained. The vector programming algorithm provides a similar bound for the best partition into two communities. We evaluate both algorithms using experiments on several standard test cases for network partitioning algorithms, and find that they perform comparably to or better than past algorithms, while being more efficient than exhaustive techniques.
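The objective being maximized can be made concrete with a small sketch that evaluates Newman's modularity for a given partition of a simple undirected, unweighted graph, using the standard per-community form Q = Σ_c [e_c / m − (d_c / 2m)²], where e_c is the number of edges inside community c, d_c the total degree of its nodes, and m the number of edges. The function name and the example graph are hypothetical, and this only scores a partition; it is not the rounding algorithm of the paper.

```python
def modularity(edges, community_of):
    """Newman's modularity of a partition of a simple undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping each node to a community id.
    """
    m = len(edges)
    degree = {}
    internal_edges = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community_of[u] == community_of[v]:
            c = community_of[u]
            internal_edges[c] = internal_edges.get(c, 0) + 1
    # Total degree of the nodes in each community.
    degree_sum = {}
    for node, d in degree.items():
        c = community_of[node]
        degree_sum[c] = degree_sum.get(c, 0) + d
    return sum(internal_edges.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
by_triangle = {0: "left", 1: "left", 2: "left", 3: "right", 4: "right", 5: "right"}
all_together = {node: "all" for node in range(6)}

print(modularity(edges, by_triangle))
print(modularity(edges, all_together))
```

Splitting along the bridge scores 5/14, while placing every node in one community scores 0, which matches the intuition that modularity rewards partitions with more internal edges than a degree-preserving random graph would give.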