Albatross Sampling: Robust and Effective Hybrid Vertex Sampling for Social Graphs
Cited by 10 (0 self)
Online Social Networks (OSNs) have become dramatically popular, and the study of social graphs attracts the interest of a large number of researchers. One critical challenge is the huge size of the social graph, which makes graph analysis, or even data crawling, incredibly time consuming and sometimes impossible to complete. Graph sampling algorithms have therefore been introduced to obtain a smaller subgraph that reflects the properties of the original graph well. Breadth-First Sampling (BFS) is widely used in graph sampling, but it is biased towards high-degree vertices. Metropolis-Hastings Random Walk (MHRW), which was proposed to obtain unbiased samples of the social graph, requires the graph to be well connected. In this paper, we propose a vertex sampling algorithm, called Albatross Sampling (AS), which introduces a random jump strategy into MHRW. The embedded random jump makes the sampling procedure more flexible and keeps it from being trapped in a locally well-connected part of the graph. In our evaluation, on both tightly and loosely connected graphs, AS performs significantly better than MHRW and BFS. On the one hand, AS estimates the degree distribution with much lower Normalized Mean Square Error (NMSE) for the same resource budget. On the other hand, AS requires a much smaller resource budget to reach an acceptable estimate of the degree distribution.
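The idea of adding a random jump to MHRW can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name, the `jump_prob` parameter, and the assumption that a vertex can be drawn uniformly at random for the jump are all illustrative choices made here. The MHRW step uses the standard acceptance probability min(1, deg(u)/deg(w)) for a uniform target distribution.

```python
import random

def mhrw_with_jumps(adj, start, steps, jump_prob, rng):
    """Metropolis-Hastings random walk with occasional uniform random jumps.

    adj: dict mapping each vertex to a list of its neighbours
         (undirected graph, every vertex has at least one neighbour).
    jump_prob: probability of jumping instead of walking (hypothetical knob).
    rng: a random.Random instance, passed in for reproducibility.
    """
    vertices = list(adj)
    current = start
    samples = []
    for _ in range(steps):
        if rng.random() < jump_prob:
            # Random jump: move to a vertex chosen uniformly at random.
            current = rng.choice(vertices)
        else:
            # MHRW step: propose a neighbour, accept with min(1, deg(u)/deg(w)).
            proposal = rng.choice(adj[current])
            accept = min(1.0, len(adj[current]) / len(adj[proposal]))
            if rng.random() < accept:
                current = proposal
        samples.append(current)
    return samples
```

With `jump_prob=0` the walk is plain MHRW and can never leave the connected component of `start`; a positive `jump_prob` lets it reach other components, which is the behaviour the abstract attributes to the random jump.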
Understanding Graph Sampling Algorithms for Social Network Analysis
Cited by 8 (2 self)
By keeping the graph scale small while capturing the properties of the original social graph, graph sampling provides an efficient yet inexpensive solution for social network analysis. The challenge is how to create a small but representative sample out of a massive social graph with millions or even billions of nodes. Several sampling algorithms have been proposed in previous studies, but a fair evaluation and comparison among them is lacking. In this paper, we analyze the state-of-the-art graph sampling algorithms and evaluate their performance on some widely recognized graph properties of directed graphs, using large-scale social network datasets. We evaluate not only the commonly used node degree distribution, but also the clustering coefficient, which quantifies how well connected the neighbors of a node are. Through the comparison we have found that none of the algorithms obtains satisfactory sampling results for both of these properties, and that the performance of each algorithm differs considerably across different kinds of datasets.
Estimating Clustering Coefficients and Size of Social Networks via Random Walk
Cited by 7 (0 self)
Online social networks have become a major force in today’s society and economy. The largest of today’s social networks may have hundreds of millions to more than a billion users. Such networks are too large to be downloaded or stored locally, even if terms of use and privacy policies were to permit doing so. This limitation complicates even simple computational tasks. One such task is computing the clustering coefficient of a network. Another task is to compute the network size (number of registered users) or a subpopulation size. The clustering coefficient, a classic measure of network connectivity, comes in two flavors, global and network average. In this work, we provide efficient algorithms for estimating these measures which (1) assume no prior knowledge about the network; and (2) access the network using only the publicly
Network Sampling via Edge-based Node Selection with Graph Induction
, 2011
Cited by 7 (4 self)
In order to efficiently study the characteristics of network domains and support development of network systems (e.g. algorithms, protocols that operate on networks), it is often necessary to sample a representative subgraph from a large complex network. While prior research has shown that topological (e.g. random-walk based) sampling methods produce more accurate samples than approaches based on node or edge sampling, they still do not produce samples that closely match the distributions of graph properties (e.g., degree) found in the original graph. In this paper, we observe that part of the problem is that any sampling process fundamentally biases the structure of the sampled subgraph, since not all neighbors of a sampled node may be included in the sampled subgraph. We address this problem using a novel sampling algorithm called TIES that (1) aims to offset this bias by using edge-based node selection, which favors selection of high-degree nodes, and (2) uses a graph induction step to select additional edges between sampled nodes to restore connectivity and bring the structure closer to that of the original graph. To understand the properties of TIES, we compare it analytically to random node and edge sampling. We also evaluate the efficacy of TIES empirically using several real-world data sets. Across all datasets, we found that TIES produces samples that better match the original distributions. In terms of two distributional distance metrics, KS distance and skew divergence, we found that samples produced by TIES consistently outperform those of other sampling algorithms, with up to a 2× reduction in KS distance and up to a 37× reduction in skew divergence compared to the current state-of-the-art algorithms.
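The two steps named in the abstract, edge-based node selection followed by graph induction, can be sketched as follows. This is one plausible reading of the description, not the published TIES code: the function name, sampling edges with replacement, and stopping once a target node count is reached are assumptions made for illustration.

```python
import random

def edge_sample_then_induce(edges, target_nodes, rng):
    """Pick nodes via uniformly sampled edges, then induce the subgraph.

    edges: list of (u, v) pairs of an undirected graph.
    target_nodes: minimum number of nodes to collect (assumed stopping rule).
    rng: a random.Random instance.
    """
    available = {x for edge in edges for x in edge}
    if target_nodes > len(available):
        raise ValueError("target_nodes exceeds the number of nodes in the graph")

    sampled = set()
    # Step 1: edge-based node selection. A node of degree d is an endpoint
    # of d edges, so high-degree nodes are more likely to be picked.
    while len(sampled) < target_nodes:
        u, v = rng.choice(edges)
        sampled.update((u, v))

    # Step 2: graph induction. Keep every original edge whose two endpoints
    # were both sampled, not only the edges drawn in step 1.
    induced = [(u, v) for u, v in edges if u in sampled and v in sampled]
    return sampled, induced
```

Because each draw adds up to two nodes, the returned node set may exceed `target_nodes` by one.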
Online estimating the k central nodes of a network
 In Proc. of the IEEE Network Science Workshop (NSW
, 2011
Cited by 6 (4 self)
Estimating the most influential nodes in a network is a fundamental problem in network analysis. Influential nodes may be important spreaders of diseases in biological networks, key actors in terrorist networks, or marketing targets in social
2.5K-Graphs: from Sampling to Generation
Cited by 5 (2 self)
Understanding network structure and having access to realistic graphs play a central role in computer and social networks research. In this paper, we propose a complete, practical methodology for generating graphs that resemble a real graph of interest. The metrics of the original topology that we aim to match are the joint degree distribution (JDD) and the degree-dependent average clustering coefficient ($\bar{c}(k)$). We start by developing efficient estimators for these two metrics based on a node sample collected via either independence sampling or random walks. Then, we process the output of the estimators to ensure that the target metrics are realizable. Finally, we propose an efficient algorithm for generating topologies that have the exact target JDD and a $\bar{c}(k)$ close to the target. Extensive simulations using real-life graphs show that the graphs generated by our methodology are similar to the original graph with respect to not only the two target metrics, but also a wide range of other topological metrics. Furthermore, our generator is orders of magnitude faster than state-of-the-art techniques.
Distributed size estimation in anonymous networks
Cited by 5 (1 self)
Knowledge of the size of a network, i.e. of the number of nodes composing it, is important for maintenance and organization purposes. In networks where the identity of the nodes is not unique or cannot be disclosed for privacy reasons, the size-estimation problem is particularly challenging since the exchanged messages cannot be uniquely associated with a specific node. In this work, we propose a totally distributed anonymous strategy based on statistical inference concepts. In our approach, each node first generates a vector of independent random numbers from a known distribution. The nodes then compute a common function via a distributed consensus algorithm, and finally they compute the Maximum Likelihood (ML) estimate of the network size using appropriate statistical inference. We study the performance that can be obtained with this computational scheme when the consensus strategy is either the maximum or the average. In the max-consensus scenario, when data come from absolutely continuous distributions, we provide a complete characterization of the ML estimator. In particular, we show that the squared estimation error decreases as 1/M, where M is the number of random values locally generated by each node, independently of the chosen probability distribution. In contrast, in the average-consensus scenario, we show that if the locally generated data are independent Bernoulli trials, then the probability that the ML estimator returns a wrong answer decreases exponentially in M. Finally, we discuss how numerical errors may affect the estimators' performance under different scenarios.
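The max-consensus scheme can be illustrated with a small centralized simulation. The sketch below assumes Uniform(0, 1) draws, which is a choice made here for concreteness (the abstract only says "a known distribution"), and it computes the consensus directly rather than through a distributed protocol. Under that assumption, the maximum of N uniform draws has density N x^(N-1), so for M independent maxima x_1, ..., x_M the log-likelihood M ln N + (N - 1) Σ ln x_i is maximized at N = -M / Σ ln x_i.

```python
import math
import random

def estimate_size_max_consensus(num_nodes, m, rng):
    """Simulate anonymous size estimation via max-consensus.

    num_nodes: true network size (used only to simulate the nodes).
    m: number of random values each node generates (M in the abstract).
    rng: a random.Random instance.
    """
    # Each node draws m independent Uniform(0, 1) numbers (assumed distribution).
    local = [[rng.random() for _ in range(m)] for _ in range(num_nodes)]

    # Max-consensus: every node ends up holding the component-wise maximum.
    # No node identities are needed to compute it.
    consensus = [max(vec[i] for vec in local) for i in range(m)]

    # ML estimate of the network size from the m maxima.
    return -m / sum(math.log(x) for x in consensus)
```

For example, `estimate_size_max_consensus(50, 2000, random.Random(42))` should return a value close to 50, and the spread of the estimate shrinks as `m` grows.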
Space-efficient sampling from social activity streams
 In BigMine
, 2012
Cited by 4 (2 self)
In order to efficiently study the characteristics of network domains and support development of network systems (e.g. algorithms, protocols that operate on networks), it is often necessary to sample a representative subgraph from a large complex network. Although recent subgraph sampling methods have been shown to work well, they focus on sampling from memory-resident graphs and assume that the sampling algorithm can access the entire graph in order to decide which nodes/edges to select. Many large-scale network datasets, however, are too large and/or dynamic to be processed using main memory (e.g., email, tweets, wall posts). In this work, we formulate the problem of sampling from large graph streams. We propose a streaming graph sampling algorithm that dynamically maintains a representative sample in a reservoir-based setting. We evaluate the efficacy of our proposed methods empirically using several real-world data sets. Across all datasets, we found that our method produces samples that better preserve the original graph distributions.
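As background for the reservoir-based setting mentioned above, the following is a minimal sketch of classic reservoir sampling applied to a stream of edges. It is the textbook technique, not the paper's streaming algorithm, and the function and parameter names are chosen here for illustration. It keeps a fixed-size uniform sample while reading the stream once, without holding the whole graph in memory.

```python
import random

def reservoir_sample_edges(edge_stream, capacity, rng):
    """Keep a uniform random sample of at most `capacity` edges from a stream.

    edge_stream: any iterable of edges, consumed in a single pass.
    capacity: reservoir size (memory budget, in edges).
    rng: a random.Random instance.
    """
    reservoir = []
    for seen, edge in enumerate(edge_stream, start=1):
        if len(reservoir) < capacity:
            # Fill phase: the first `capacity` edges are always kept.
            reservoir.append(edge)
        else:
            # Replacement phase: the new edge enters with probability
            # capacity / seen, evicting a uniformly chosen resident edge.
            slot = rng.randrange(seen)
            if slot < capacity:
                reservoir[slot] = edge
    return reservoir
```

A graph-aware streaming sampler would typically replace the uniform eviction rule with one that accounts for the nodes already in the sample; the uniform rule is used here only because it is the simplest correct baseline.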
Distributed size estimation of dynamic anonymous networks.
 In Decision and Control (CDC), 2012 IEEE 51st Annual Conference on,
, 2012
Cited by 3 (2 self)
We consider the problem of estimating the size of dynamic anonymous networks, motivated by network maintenance. The proposed algorithm is based on max-consensus information exchange protocols and extends a previous algorithm for static anonymous networks. A regularization term accounts for a priori assumptions on the smoothness of the estimate; we specifically consider quadratic regularization terms since they lead to closed-form solutions and intuitive design laws. We derive an explicit estimation scheme for a particular peer-to-peer service network, starting from its statistical model. To validate the accuracy of the algorithm, we perform numerical experiments and show how the algorithm can be implemented using finite-precision arithmetic and a small communication burden.
On the Estimation Accuracy of Degree Distributions from Graph Sampling
Cited by 3 (1 self)
Estimating characteristics of large graphs via sampling is vital in the study of complex networks. In this work, we study the Mean Squared Error (MSE) associated with different sampling methods for the degree distribution. These sampling methods include independent random vertex (RV) and random edge (RE) sampling, and crawling methods such as random walks (RWs) and the widely used Metropolis-Hastings algorithm for uniformly sampling vertices (MHRWu). We see that the RW MSE is proportional to the RE MSE and inversely proportional to the spectral gap of the RW transition probability matrix. We also determine conditions under which RW is preferable to RV. Finally, we present an approximation of the MHRWu MSE. We evaluate the accuracy of our approximations and bounds through simulations on large real-world graphs.
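To make the difference between the sampling methods concrete, the sketch below shows the standard degree-distribution estimators for two of them. It is an illustration written for this listing, not code from the paper, and the function names are hypothetical. Uniform vertex samples (RV) are counted directly, while samples drawn with probability proportional to degree (as in RE sampling or a simple RW at stationarity) are re-weighted by 1/degree before normalizing.

```python
from collections import Counter

def degree_dist_from_uniform(degrees):
    """Estimate P(degree = k) from degrees of uniformly sampled vertices."""
    n = len(degrees)
    return {k: count / n for k, count in Counter(degrees).items()}

def degree_dist_from_random_walk(degrees):
    """Estimate P(degree = k) from degrees of vertices sampled with
    probability proportional to degree, re-weighting each sample by 1/degree."""
    weights = Counter()
    for d in degrees:
        weights[d] += 1.0 / d
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}
```

As a check, take a star graph with one center of degree 3 and three leaves of degree 1. A degree-proportional sample sees the center half of the time, for example `[1, 1, 1, 3, 3, 3]`, and the re-weighted estimator recovers the true distribution of 3/4 for degree 1 and 1/4 for degree 3.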