A local algorithm for finding well-connected clusters
CoRR, 2013
Cited by 7 (2 self)
Motivated by applications of large-scale graph clustering, we study random-walk-based local algorithms whose running times depend only on the size of the output cluster, rather than on the entire graph. In particular, we develop a method with better theoretical guarantees than all previous work, both in terms of the clustering accuracy and the conductance of the output set. We also prove that our analysis is tight, and perform an empirical evaluation on both synthetic and real data to support our theory. More specifically, our method outperforms prior work when the cluster is well-connected. In fact, the better connected the cluster is internally, the more significant the improvement we can obtain. Our results shed light on why, in practice, some random-walk-based algorithms perform better than their previous theory predicts, and help guide future research on local clustering.
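The abstract above measures output quality by the conductance of a set. As a minimal sketch of that quantity only (not the paper's local algorithm), the following assumes an undirected, unweighted graph stored as a hypothetical dict of neighbour sets, and uses the standard definition: edges leaving the set divided by the smaller of the two volumes.

```python
def conductance(adj, cluster):
    """Conductance of `cluster` in an undirected, unweighted graph.

    adj: dict mapping each node to the set of its neighbours
         (an assumed input format, chosen for illustration).
    """
    cluster = set(cluster)
    # Number of edges with exactly one endpoint inside the cluster.
    cut = sum(1 for u in cluster for v in adj[u] if v not in cluster)
    # Volume = sum of degrees on each side of the cut.
    vol_in = sum(len(adj[u]) for u in cluster)
    vol_out = sum(len(adj[u]) for u in adj) - vol_in
    denom = min(vol_in, vol_out)
    return cut / denom if denom > 0 else float("inf")


# Two triangles joined by the single edge 2-3.
graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
phi_triangle = conductance(graph, {0, 1, 2})  # one cut edge, volume 7 on each side
phi_single = conductance(graph, {0})          # both edges of node 0 leave the set
```

A low value such as the one for the triangle indicates a set that is well separated from the rest of the graph; a single node always has conductance 1.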
Analyzing the Harmonic Structure in Graph-Based Learning
Cited by 3 (2 self)
We find that various well-known graph-based models exhibit a common important harmonic structure in their target functions: the value of a vertex is approximately the weighted average of the values of its adjacent neighbors. Understanding this structure and analyzing the loss defined over it help reveal important properties of the target function over a graph. In this paper, we show that the variation of the target function across a cut can be upper and lower bounded by the ratio of its harmonic loss and the cut cost. We use this to develop an analytical tool and analyze five popular graph-based models: absorbing random walks, partially absorbing random walks, hitting times, the pseudoinverse of the graph Laplacian, and eigenvectors of the Laplacian matrices. Our analysis sheds new light on several open questions related to these models, and provides theoretical justifications and guidelines for their practical use. Simulations on synthetic and real datasets confirm the potential of the proposed theory and tool.
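The harmonic structure described above (a vertex value equal to the weighted average of its neighbours) can be illustrated with a small sketch. It covers only the exactly harmonic case with a few vertices held at fixed values, not the paper's analysis of approximate harmonicity; the function name, the dict-of-dicts weight format, and the iteration count are assumptions made for illustration.

```python
def harmonic_extension(weights, fixed, iters=2000):
    """Fill in values so each free vertex equals the weighted mean of its neighbours.

    weights: dict node -> dict neighbour -> positive edge weight (assumed format).
    fixed:   dict node -> value held constant (e.g. labeled vertices).
    """
    f = {u: fixed.get(u, 0.0) for u in weights}
    for _ in range(iters):
        for u in weights:
            if u in fixed:
                continue  # boundary values never change
            total = sum(weights[u].values())
            # Gauss-Seidel style update: replace f[u] by its neighbourhood average.
            f[u] = sum(w * f[v] for v, w in weights[u].items()) / total
    return f


# Path graph 0 - 1 - 2 - 3 - 4 with unit weights; the endpoints are fixed.
path = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0},
    4: {3: 1.0},
}
values = harmonic_extension(path, fixed={0: 0.0, 4: 1.0})
```

On this path the values interpolate linearly between the two fixed endpoints, which is the harmonic solution for unit weights.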
Coinciding walk kernels: Parallel absorbing random walks for learning with graphs and few labels
In Asian Conference on Machine Learning, 2013
Cited by 2 (1 self)
Exploiting autocorrelation for node-label prediction in networked data has led to great success. However, when dealing with sparsely labeled networks, common in present-day tasks, the autocorrelation assumption is difficult to exploit. Taking a step beyond, we propose the coinciding walk kernel (cwk), a novel kernel leveraging label-structure similarity (the idea that nodes with similarly arranged labels in their local neighbourhoods are likely to have the same label) for learning problems on partially labeled graphs. Inspired by the success of random-walk-based schemes for the construction of graph kernels, cwk is defined in terms of the probability that the labels encountered during parallel random walks coincide. In addition to having an intuitive probabilistic interpretation, coinciding walk kernels outperform existing kernel and walk-based methods on the task of node-label prediction in sparsely labeled graphs with high label-structure similarity. We also show that computing cwks is faster than computing many state-of-the-art kernels on graphs. We evaluate cwks on several real-world networks, including co-citation and co-author graphs, as well as a graph of interlinked populated places extracted from the DBpedia knowledge base.
Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch
Cited by 2 (0 self)
Graph-based semi-supervised learning (SSL) algorithms have been successfully used in a large number of applications. These methods classify initially unlabeled nodes by propagating label information over the structure of the graph, starting from seed nodes. Graph-based SSL algorithms usually scale linearly with the number of distinct labels (m), and require O(m) space on each node. Unfortunately, there exist many applications of practical significance with very large m over large graphs, demanding better space and time complexity. In this paper, we propose MAD-Sketch, a novel graph-based SSL algorithm which compactly stores the label distribution on each node using Count-min Sketch, a randomized data structure. We present a theoretical analysis showing that, under mild conditions, MAD-Sketch can reduce the space complexity at each node from O(m) to O(log m), and achieve similar savings in time complexity as well. We support our analysis through experiments on multiple real-world datasets. We observe that MAD-Sketch achieves performance similar to that of existing state-of-the-art graph-based SSL algorithms, while requiring a smaller memory footprint and at the same time achieving up to a 10x speedup. We find that MAD-Sketch is able to scale to datasets with one million labels, which is beyond the scope of existing graph-based SSL algorithms.
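The abstract above relies on Count-min Sketch to store per-node label scores compactly. The following is a minimal sketch of that data structure alone, not of MAD-Sketch; the class name, the default width and depth, and the use of BLAKE2b as the row hash are assumptions chosen for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate score table: depth rows of width counters each."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One hash per row, derived by prefixing the row number to the key.
        data = f"{row}:{key}".encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, key, amount=1):
        # Non-negative updates only; every row receives the increment.
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += amount

    def estimate(self, key):
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest available upper estimate of the true total.
        return min(self.table[row][self._index(row, key)]
                   for row in range(self.depth))


# Hypothetical label scores accumulated on one node.
sketch = CountMinSketch()
sketch.add("label_a", 3)
sketch.add("label_b", 1)
sketch.add("label_a", 2)
score_a = sketch.estimate("label_a")  # at least 5, the true accumulated score
```

The memory used is width times depth counters regardless of how many distinct labels are added, at the cost of possible overestimates when labels collide.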
Coinciding Walk Kernels
Cited by 1 (1 self)
Exploiting autocorrelation for node-label prediction in networked data has led to great success. However, when dealing with sparsely labeled networks, common in present-day tasks, the autocorrelation assumption is difficult to exploit. Taking a step beyond, we propose the coinciding walk kernel (cwk), a novel kernel leveraging label-structure similarity (the idea that nodes with similarly arranged labels in their local neighbourhoods are likely to have the same label) for learning problems on partially labeled graphs. Inspired by the success of random-walk-based schemes for the construction of graph kernels, cwk is defined in terms of the probability that the labels encountered during parallel random walks coincide. In addition to having an intuitive probabilistic interpretation, coinciding walk kernels outperform state-of-the-art kernel and walk-based methods on the task of node-label prediction in sparsely labeled graphs. We also show that computing cwks is faster than computing many state-of-the-art kernels on graphs. We evaluate cwks on several real-world networks, including co-citation and co-author graphs, as well as a network of interlinked populated places extracted from the DBpedia knowledge base.
Graph-based Semi-supervised Learning: Realizing Pointwise Smoothness Probabilistically
Yuan Fang
Cited by 1 (1 self)
As the central notion in semi-supervised learning, smoothness is often realized on a graph representation of the data. In this paper, we study two complementary dimensions of smoothness: its pointwise nature and probabilistic modeling. While no existing graph-based work exploits them in conjunction, we encompass both in a novel framework of Probabilistic Graph-based Pointwise Smoothness (PGP), building upon two foundational models of data closeness and label coupling. This new form of smoothness axiomatizes a set of probability constraints, which ultimately enables class prediction. Theoretically, we provide an error and robustness analysis of PGP. Empirically, we conduct extensive experiments to show the advantages of PGP.
Σ-optimality for active learning on Gaussian random fields
In Advances in Neural Information Processing Systems 26, 2013
Cited by 1 (1 self)
A common classifier for unlabeled nodes on undirected graphs uses label propagation from the labeled nodes, equivalent to the harmonic predictor on Gaussian random fields (GRFs). For active learning on GRFs, the commonly used V-optimality criterion queries nodes that reduce the L2 (regression) loss. V-optimality satisfies a submodularity property showing that greedy reduction produces a solution within a factor (1 − 1/e) of the global optimum. However, L2 loss may not characterise the true nature of 0/1 loss in classification problems and thus may not be the best choice for active learning. We consider a new criterion we call Σ-optimality, which queries the node that minimizes the sum of the elements in the predictive covariance. Σ-optimality directly optimizes the risk of the surveying problem, which is to determine the proportion of nodes belonging to one class. In this paper we extend submodularity guarantees from V-optimality to Σ-optimality using properties specific to GRFs. We further show that GRFs satisfy the suppressor-free condition in addition to the conditional independence inherited from Markov random fields. We test Σ-optimality on real-world graphs with both synthetic and real data and show that it outperforms V-optimality and other related methods on classification.
Local Network Community Detection with Continuous Optimization of Conductance and Weighted Kernel K-Means
Twan van Laarhoven, 2016
Local network community detection is the task of finding a single community of nodes concentrated around a few given seed nodes in a localized way. Conductance is a popular objective function used in many algorithms for local community detection. This paper studies a continuous relaxation of conductance. We show that continuous optimization of this objective still leads to discrete communities. We investigate the relation of conductance with weighted kernel k-means for a single community, which leads to the introduction of a new objective function, σ-conductance. Conductance is obtained by setting σ to 0. Two algorithms, EMc and PGDc, are proposed to locally optimize σ-conductance and automatically tune the parameter σ. They are based on expectation maximization and projected gradient descent, respectively. We prove locality and give performance guarantees for EMc and PGDc for a class of dense and well-separated communities centered around the seeds. Experiments are conducted on networks with ground-truth communities, comparing to state-of-the-art graph diffusion algorithms for conductance optimization. On large graphs, results indicate that EMc and PGDc stay localized and produce communities most similar to the ground truth, while graph diffusion algorithms generate large communities of lower quality.
Walking in the Cloud: Parallel SimRank at Scale
Despite its popularity, SimRank is computationally costly, in both time and space. In particular, its recursive nature poses a great challenge for using modern distributed computing power, and also prevents querying similarities individually. Existing solutions suffer greatly from these practical issues. In this paper, we break this dependency to achieve the maximum possible efficiency. Our method consists of offline and online phases. In the offline phase, a length-n indexing vector is derived by solving a linear system in parallel. At online query time, the similarities are computed instantly from the index vector. Throughout, the Monte Carlo method is used to maximally reduce time and space. Our algorithm, called CloudWalker, is highly parallelizable, with only linear time and space. Remarkably, it responds to both single-pair and single-source queries in constant time. CloudWalker is orders of magnitude more efficient and scalable than existing solutions for large-scale problems. Implemented on Spark with 10 machines and tested on the web-scale ClueWeb graph with 1 billion nodes and 43 billion edges, it takes 110 hours for offline indexing, 64 seconds for a single-pair query, and 188 seconds for a single-source query. To the best of our knowledge, our work is the first to report results on ClueWeb, which is 10x larger than the largest graph ever reported for SimRank computation.
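To make the recursive nature mentioned in the abstract concrete, here is a minimal sketch of the classic all-pairs SimRank iteration. It is the naive baseline whose cost motivates the work, not CloudWalker itself; the function name, the in-neighbour dict format, and the decay value of 0.8 are assumptions chosen for illustration.

```python
def simrank(in_neighbors, decay=0.8, iters=10):
    """Naive SimRank: every pair is recomputed from all pairs of in-neighbours.

    in_neighbors: dict node -> list of nodes with an edge into it (assumed format).
    """
    nodes = list(in_neighbors)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iters):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0  # a node is maximally similar to itself
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    new[a][b] = 0.0  # no in-neighbours means no evidence
                    continue
                # Average similarity over all in-neighbour pairs, damped by decay.
                total = sum(sim[i][j] for i in ia for j in ib)
                new[a][b] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim


# Hypothetical graph: u points to both a and b.
scores = simrank({"u": [], "a": ["u"], "b": ["u"]})
sim_ab = scores["a"]["b"]  # a and b share their only in-neighbour
```

Each iteration touches every pair of nodes and every pair of their in-neighbours, and a single score cannot be obtained without the whole table, which is the dependency the abstract says it breaks.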