Results 1-10 of 18
Clustering partially observed graphs via convex optimization.
 Journal of Machine Learning Research,
, 2014
Cited by 47 (13 self)
Abstract This paper considers the problem of clustering a partially observed unweighted graph, i.e., one where for some node pairs we know there is an edge between them, for some others we know there is no edge, and for the remaining we do not know whether or not there is an edge. We want to organize the nodes into disjoint clusters so that there is relatively dense (observed) connectivity within clusters, and sparse across clusters. We take a novel yet natural approach to this problem, by focusing on finding the clustering that minimizes the number of "disagreements", i.e., the sum of the number of (observed) missing edges within clusters and (observed) present edges across clusters. Our algorithm uses convex optimization; its basis is a reduction of disagreement minimization to the problem of recovering an (unknown) low-rank matrix and an (unknown) sparse matrix from their partially observed sum. We evaluate the performance of our algorithm on the classical Planted Partition/Stochastic Block Model. Our main theorem provides sufficient conditions for the success of our algorithm as a function of the minimum cluster size, edge density and observation probability; in particular, the results characterize the tradeoff between the observation probability and the edge density gap. When there are a constant number of clusters of equal size, our results are optimal up to logarithmic factors.
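The partially observed planted partition setting described in this abstract is easy to simulate. The sketch below (parameter values are illustrative, not taken from the paper) samples such a graph with numpy and counts the "disagreements" objective for a candidate clustering; it illustrates the model and the objective only, not the paper's convex-optimization solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: 2 equal clusters of size 20, in-cluster edge
# density p, across-cluster density q < p, observation probability rho.
k, size, p, q, rho = 2, 20, 0.7, 0.2, 0.8
n = k * size
labels = np.repeat(np.arange(k), size)
same = labels[:, None] == labels[None, :]

# Sample the unweighted graph from the planted partition model.
edges = rng.random((n, n)) < np.where(same, p, q)
edges = np.triu(edges, 1)
edges = edges | edges.T

# Each node pair is observed independently with probability rho.
observed = rng.random((n, n)) < rho
observed = np.triu(observed, 1)
observed = observed | observed.T

def disagreements(edges, observed, labels):
    """Observed missing edges within clusters plus observed present edges across."""
    same = labels[:, None] == labels[None, :]
    within_missing = observed & same & ~edges
    across_present = observed & ~same & edges
    # Count each unordered pair once (upper triangle only).
    return int(np.triu(within_missing | across_present, 1).sum())

true_cost = disagreements(edges, observed, labels)
shuffled = rng.permutation(labels)  # a random clustering, for comparison
rand_cost = disagreements(edges, observed, shuffled)
print(true_cost, rand_cost)  # the planted clustering should cost far less
```

With these densities the planted clustering incurs far fewer disagreements than a random one, which is exactly the gap the paper's recovery guarantees exploit.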
Clustering Sparse Graphs
, 2012
Cited by 25 (6 self)
We develop a new algorithm to cluster sparse unweighted graphs, i.e. partition the nodes into disjoint clusters so that there is higher edge density within clusters and lower density across clusters. By sparsity we mean the setting where both the in-cluster and across-cluster edge densities are very small, possibly vanishing in the size of the graph. Sparsity makes the problem noisier, and hence more difficult to solve. Any clustering involves a tradeoff between minimizing two kinds of errors: missing edges within clusters and present edges across clusters. Our insight is that in the sparse case, these must be penalized differently. We analyze our algorithm's performance on the natural, classical and widely studied "planted partition" model (also called the stochastic block model); we show that our algorithm can cluster sparser graphs, and with smaller clusters, than all previous methods. This is seen empirically as well.
Improved Algorithms for the Random Cluster Graph Model
 Proceedings of the 7th Scandinavian Workshop on Algorithm Theory
, 2002
Cited by 17 (0 self)
The following probabilistic process models the generation of noisy clustering data: Clusters correspond to disjoint sets of vertices in a graph. Each two vertices from the same set are connected by an edge with probability p, and each two vertices from different sets are connected by an edge with probability r < p. The goal of the clustering problem is to reconstruct the clusters from the graph. We give algorithms that solve this problem with high probability. Compared to previous studies, our algorithms have lower time complexity and a wider parameter range of applicability. In particular, our algorithms can handle O(n / log n) clusters in an n-vertex graph, while all previous algorithms require that the number of clusters is constant.
Breaking the Small Cluster Barrier of Graph Clustering Supplementary Material
Cited by 9 (1 self)
In this supplementary material, we present proof details.

1 Notation and Conventions

We use the following notation and conventions throughout the supplement. For a real n × n matrix M, we use the unadorned norm ‖M‖ to denote its spectral norm. The notation ‖M‖F refers to the Frobenius norm, ‖M‖1 is ∑_{i,j} |M(i, j)| and ‖M‖∞ is max_{i,j} |M(i, j)|. We will also study operators on the space of matrices. To distinguish them from the matrices studied in this work, we will simply call these objects "operators", and will denote them using a calligraphic font, e.g. P. The norm ‖P‖ of an operator is defined as ‖P‖ = sup_{M : ‖M‖F = 1} ‖PM‖F, where the supremum is over matrices M. For a fixed, real n × n matrix M, we define the matrix linear subspace T(M) as follows: T(M) := {YM + MX : X, Y ∈ R^{n×n}}. In words, this subspace is the set of matrices spanned by matrices each row of which is in the row space of M, and matrices each column of which is in the column space of M. For any given subspace of matrices S ⊆ R^{n×n}, we let P_S denote the orthogonal projection onto S with respect to the inner product 〈X, Y〉 = ∑_{i,j=1}^n X(i, j)Y(i, j) = tr(X^T Y). This means that for any matrix M, P_S M = argmin_{X∈S} ‖M − X‖F. For a matrix M, we let Γ(M) denote the set of matrices supported on a subset of the support of M. Note that for any matrix X,
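The norms and projections in this notation section can be spelled out numerically. The sketch below uses numpy; as a simpler stand-in for T(M), the subspace S here is the set of matrices whose columns lie in col(M) (an illustrative choice, not the supplement's exact T(M)).

```python
import numpy as np

rng = np.random.default_rng(1)

# A rank-2 matrix M, so the projection below is non-trivial.
M = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 6))

# The matrix norms defined above, written out with numpy.
spectral = np.linalg.norm(M, 2)        # ||M||: largest singular value
frobenius = np.linalg.norm(M, 'fro')   # ||M||_F
entrywise_l1 = np.abs(M).sum()         # ||M||_1 = sum_ij |M(i,j)|
entrywise_inf = np.abs(M).max()        # ||M||_inf = max_ij |M(i,j)|

# Orthogonal projection onto S = {matrices with columns in col(M)} under
# the trace inner product <X, Y> = tr(X^T Y): P_S A = U U^T A, where U
# holds an orthonormal basis of col(M).
U, s, _ = np.linalg.svd(M, full_matrices=False)
U = U[:, s > 1e-10]  # keep only the directions spanning col(M)

def project(A):
    """P_S A = argmin over X in S of ||A - X||_F, for this choice of S."""
    return U @ (U.T @ A)

A = rng.standard_normal((6, 6))
residual = A - project(A)
# The residual is orthogonal to S under the trace inner product.
inner = np.trace(project(A).T @ residual)
print(spectral, frobenius, inner)
```

The printed inner product is (numerically) zero, which is the defining property of an orthogonal projection; the same argmin characterization is what the supplement uses for P_S.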
A rigorous analysis of population stratification with limited data
 In Proceedings of the 18th ACM-SIAM SODA
, 2007
Cited by 7 (4 self)
Abstract Finding the genetic factors of complex diseases such as cancer, currently a major effort of the international community, will potentially lead to better treatment of these diseases. One of the major difficulties in these studies is the fact that the genetic components of an individual depend not only on the disease, but also on their ethnicity. Therefore, it is crucial to find methods that could reduce the population structure effects on these studies. This can be formalized as a clustering problem, where the individuals are clustered according to their genetic information. Mathematically, we consider the problem of clustering bit "feature" vectors, where each vector represents the genetic information of an individual. Our model assumes that this bit vector is generated according to a prior probability distribution specified by the individual's membership in a population. We present methods that can cluster the vectors while attempting to optimize the number of features required. The focus of the paper is not on the algorithms, but on showing that optimizing certain objective functions on the data yields the right clustering under the random generative model. In particular, we prove that some of the previous formulations for clustering are effective. We consider two different clustering approaches. The first approach forms a graph, and then clusters the data using a connected-components algorithm or a max-cut algorithm. The second approach tries to estimate simultaneously the feature frequencies in each of the populations and the classification of vectors into populations. We show that using the first approach Θ(log N / γ²) data (i.e., total number of features times number of vectors) is sufficient to find the correct classification, where N is the number of vectors of each population, and γ is the average ℓ2² distance between the feature probability vectors of the two populations.
Using the second approach, we show that O(log N / α⁴) data is enough, where α is the average ℓ1 distance between the populations.
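The two distance parameters in these bounds are cheap to compute once the population frequency vectors are known. The sketch below is a minimal illustration under one plausible reading of the abstract (average taken per feature; the frequency vectors and all parameter values are made up, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical feature-frequency vectors for the two populations: entry i
# is the probability that bit feature i equals 1 in that population.
m = 1000                                       # number of features
p = rng.uniform(0.1, 0.9, m)                   # population A frequencies
q = np.clip(p + rng.normal(0, 0.1, m), 0, 1)   # population B frequencies

# One reading of the abstract's quantities (an assumption, not the paper's
# exact definitions): gamma is the average squared l2 distance per feature,
# alpha the average l1 distance per feature.
gamma = np.mean((p - q) ** 2)
alpha = np.mean(np.abs(p - q))

N = 100  # vectors per population
# Data requirements scale like log(N)/gamma^2 for the graph-based approach
# and log(N)/alpha^4 for the frequency-estimation approach.
data_graph = np.log(N) / gamma ** 2
data_freq = np.log(N) / alpha ** 4
print(gamma, alpha, data_graph, data_freq)
```

Since each per-feature difference lies in [0, 1], the squared differences are no larger than the absolute ones, so gamma never exceeds alpha here.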
Potential networks, contagious communities, and understanding social network structure
 In WWW 2013
Cited by 4 (0 self)
In this paper we provide evidence that digital social networks look fundamentally different from social networks. We show that online social networks look like a contagion spread over traditional models for social networks. Thus, if these traditional models are correct, then digital social networks and social networks differ in key properties, and we will need different models to describe each. This also indicates that using data from digital social networks may mislead us if we try to use it to directly infer the structure of social networks. Additionally, we describe a framework that we call "potential networks" that may help to use information from digital networks to infer the structure of social networks. Potential networks is a two-phase model of social networks. The first phase is the "potential" network. However, this network may not be directly observed and might not even exist in any normal manner. A random process is run over a potential network to produce a behavioral network, the second phase, which can be observed. We then discuss applications of this two-phase framework.
A Study on Performance of the (1+1) Evolutionary Algorithm
 FOUNDATIONS OF GENETIC ALGORITHMS, 7
, 2003
Cited by 2 (1 self)
The first contribution of this paper is a theoretical comparison of the (1+1) evolutionary algorithm ((1+1) EA) to other evolutionary algorithms in the case of a so-called monotone reproduction operator, which indicates that the (1+1) EA is an optimal search technique in this setting. After that we study the expected optimization time for the (1+1) EA and show two set covering problem families where it is superior to certain general-purpose exact algorithms. Finally, some pessimistic estimates of mutation operators, in terms of upper bounds on evolvability, are suggested for NP-hard optimization problems.
Improved graph clustering
Cited by 1 (1 self)
Graph clustering involves the task of partitioning nodes so that the edge density is higher within partitions than across partitions. A natural, classic and popular statistical setting for evaluating solutions to this problem is the stochastic block model, also referred to as the planted partition model. In this paper we present a new algorithm, a convexified version of maximum likelihood, for graph clustering. We show that, in the classic stochastic block model setting, it outperforms all existing methods by polynomial