Albatross Sampling: Robust and Effective Hybrid Vertex Sampling for Social Graphs
Cited by 10 (0 self)
Online Social Networks (OSNs) have become dramatically popular, and the study of social graphs attracts the interest of a large number of researchers. One critical challenge is the huge size of the social graph, which makes graph analysis, or even data crawling, incredibly time consuming and sometimes impossible to complete. Thus, graph sampling algorithms have been introduced to obtain a smaller subgraph that reflects the properties of the original graph well. Breadth-First Sampling (BFS) is widely used in graph sampling, but it is biased towards high-degree vertices. Metropolis-Hastings Random Walk (MHRW), which was proposed to obtain unbiased samples of the social graph, requires the graph to be well connected. In this paper, we propose a vertex sampling algorithm called Albatross Sampling (AS), which introduces a random jump strategy into MHRW. The embedded random jump makes the sampling procedure more flexible and prevents it from being trapped in a locally well-connected part of the graph. In our evaluation, on both tightly and loosely connected graphs, AS performs significantly better than MHRW and BFS. On the one hand, AS estimates the degree distribution with much lower Normalized Mean Square Error (NMSE) for the same resource budget. On the other hand, AS requires a much smaller resource budget to reach an acceptable estimate of the degree distribution.
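The idea of adding random jumps to MHRW can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `albatross_sample`, the `jump_prob` parameter and its default value, and the assumption that a uniformly random vertex can be drawn for each jump are all hypothetical choices made for illustration. The graph is assumed to be undirected and given as a symmetric adjacency dictionary.

```python
import random

def albatross_sample(adj, num_samples, jump_prob=0.1, seed=0):
    """Sketch of MHRW with random jumps (hypothetical parameters).

    adj maps each vertex to a list of its neighbors (undirected graph).
    """
    rng = random.Random(seed)
    nodes = list(adj)
    current = rng.choice(nodes)
    samples = []
    while len(samples) < num_samples:
        samples.append(current)
        # Random jump: restart from a uniformly chosen vertex
        # (assumes uniform vertex selection is possible).
        if rng.random() < jump_prob or not adj[current]:
            current = rng.choice(nodes)
            continue
        candidate = rng.choice(adj[current])
        # Metropolis-Hastings acceptance rule targeting a uniform
        # distribution over vertices: accept with min(1, d(u)/d(v)).
        if rng.random() <= len(adj[current]) / len(adj[candidate]):
            current = candidate
    return samples
```

With `jump_prob=0.0` the sketch reduces to plain MHRW and never leaves the component it starts in, which is the trapping behavior the abstract describes.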
Understanding Graph Sampling Algorithms for Social Network Analysis
Cited by 8 (2 self)
Being able to keep the graph scale small while capturing the properties of the original social graph, graph sampling provides an efficient yet inexpensive solution for social network analysis. The challenge is how to create a small but representative sample out of a massive social graph with millions or even billions of nodes. Several sampling algorithms have been proposed in previous studies, but a fair evaluation and comparison among them is lacking. In this paper, we analyze the state-of-the-art graph sampling algorithms and evaluate their performance on some widely recognized graph properties of directed graphs using large-scale social network datasets. We evaluate not only the commonly used node degree distribution, but also the clustering coefficient, which quantifies how well connected the neighbors of a node are. Through the comparison we have found that none of the algorithms obtains satisfactory sampling results for both of these properties, and that the performance of each algorithm differs considerably across different kinds of datasets.
Estimating Clustering Coefficients and Size of Social Networks via Random Walk
Cited by 7 (0 self)
Online social networks have become a major force in today’s society and economy. The largest of today’s social networks may have hundreds of millions to more than a billion users. Such networks are too large to be downloaded or stored locally, even if terms of use and privacy policies were to permit doing so. This limitation complicates even simple computational tasks. One such task is computing the clustering coefficient of a network. Another task is to compute the network size (number of registered users) or a subpopulation size. The clustering coefficient, a classic measure of network connectivity, comes in two flavors, global and network average. In this work, we provide efficient algorithms for estimating these measures which (1) assume no prior knowledge about the network; and (2) access the network using only the publicly
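To make the two flavors concrete, the sketch below computes both exactly on a small in-memory graph. It is not the random-walk estimator of the paper, only an illustration of the standard definitions: the global coefficient is the fraction of connected triples that are closed, and the network average is the mean of the per-node local coefficients. The function name and the convention that nodes with fewer than two neighbors contribute a local coefficient of 0 are assumptions.

```python
from itertools import combinations

def clustering_coefficients(adj):
    """Return (global, network-average) clustering coefficients.

    adj maps each node to a set of neighbors (undirected graph).
    """
    closed = 0    # connected triples whose two end nodes are linked
    triples = 0   # all connected triples (paths of length two)
    local = []
    for v, nbrs in adj.items():
        pairs = list(combinations(nbrs, 2))
        linked = sum(1 for a, b in pairs if b in adj[a])
        triples += len(pairs)
        closed += linked
        # Assumed convention: local coefficient is 0 when degree < 2.
        local.append(linked / len(pairs) if pairs else 0.0)
    global_cc = closed / triples if triples else 0.0
    average_cc = sum(local) / len(local) if local else 0.0
    return global_cc, average_cc
```

On a triangle with one pendant node attached, the two measures already differ (0.6 versus 7/12), which is why the abstract treats them as separate estimation targets.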
Network Sampling via Edge-based Node Selection with Graph Induction
2011
Cited by 7 (4 self)
In order to efficiently study the characteristics of network domains and support development of network systems (e.g., algorithms and protocols that operate on networks), it is often necessary to sample a representative subgraph from a large complex network. While prior research has shown that topological (e.g., random-walk based) sampling methods produce more accurate samples than approaches based on node or edge sampling, they still do not produce samples that closely match the distributions of graph properties (e.g., degree) found in the original graph. In this paper, we observe that part of the problem is that any sampling process fundamentally biases the structure of the sampled subgraph, since not all neighbors of a sampled node may be included in the sampled subgraph. We address this problem using a novel sampling algorithm called TIES that (1) aims to offset this bias by using edge-based node selection, which favors selection of high-degree nodes, and (2) uses a graph induction step to select additional edges between sampled nodes to restore connectivity and bring the structure closer to that of the original graph. To understand the properties of TIES we compare it analytically to random node and edge sampling. We also evaluate the efficacy of TIES empirically using several real-world data sets. Across all datasets, we found that TIES produces samples that better match the original distributions. In terms of two distributional distance metrics, KS distance and skew divergence, we found that samples produced by TIES consistently outperform other sampling algorithms, with up to 2× reduction in KS distance and up to 3-7× reduction in skew divergence compared to the current state-of-the-art algorithms.
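The two steps described for TIES, edge-based node selection followed by graph induction, can be sketched as follows. This is a simplified reading of the abstract, not the authors' code: the function name `ties_sample`, the `target_nodes` stopping rule, and sampling edges with replacement are assumptions.

```python
import random

def ties_sample(edges, target_nodes, seed=0):
    """Sketch of edge-based node selection plus graph induction.

    edges is a list of (u, v) pairs; returns (sampled_nodes, induced_edges).
    """
    rng = random.Random(seed)
    edges = list(edges)
    all_nodes = {u for edge in edges for u in edge}
    target = min(target_nodes, len(all_nodes))
    sampled_nodes = set()
    # Step 1: pick edges uniformly at random and keep both endpoints.
    # High-degree nodes sit on more edges, so they are favored.
    while len(sampled_nodes) < target:
        u, v = rng.choice(edges)
        sampled_nodes.update((u, v))
    # Step 2: graph induction, i.e. keep every original edge whose
    # endpoints were both sampled, not only the edges that were drawn.
    induced = [(u, v) for u, v in edges
               if u in sampled_nodes and v in sampled_nodes]
    return sampled_nodes, induced
```

Because each draw adds up to two nodes, the returned node set may exceed `target_nodes` by one.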
Online estimating the k central nodes of a network
In Proc. of the IEEE Network Science Workshop (NSW), 2011
Cited by 6 (4 self)
Estimating the most influential nodes in a network is a fundamental problem in network analysis. Influential nodes may be important spreaders of diseases in biological networks, key actors in terrorist networks, or marketing targets in social
2.5K-Graphs: from Sampling to Generation
Cited by 5 (2 self)
Understanding network structure and having access to realistic graphs play a central role in computer and social networks research. In this paper, we propose a complete, practical methodology for generating graphs that resemble a real graph of interest. The metrics of the original topology that we aim to match are the joint degree distribution (JDD) and the degree-dependent average clustering coefficient ($\bar{c}(k)$). We start by developing efficient estimators for these two metrics based on a node sample collected via either independence sampling or random walks. Then, we process the output of the estimators to ensure that the target metrics are realizable. Finally, we propose an efficient algorithm for generating topologies that have the exact target JDD and a $\bar{c}(k)$ close to the target. Extensive simulations using real-life graphs show that the graphs generated by our methodology are similar to the original graph with respect to not only the two target metrics, but also a wide range of other topological metrics. Furthermore, our generator is orders of magnitude faster than state-of-the-art techniques.
Distributed size estimation in anonymous networks
Cited by 5 (1 self)
Knowledge of the size of a network, i.e., of the number of nodes composing it, is important for maintenance and organization purposes. In networks where the identity of the nodes either is not unique or cannot be disclosed for privacy reasons, the size-estimation problem is particularly challenging, since the exchanged messages cannot be uniquely associated with a specific node. In this work, we propose a totally distributed anonymous strategy based on statistical inference concepts. In our approach, each node starts by generating a vector of independent random numbers from a known distribution. Then nodes compute a common function via some distributed consensus algorithm, and finally they compute the Maximum Likelihood (ML) estimate of the network size by exploiting appropriate statistical inferences. In this work we study the performance that can be obtained following this computational scheme when the consensus strategy is either the maximum or the average. In the max-consensus scenario, when data come from absolutely continuous distributions, we provide a complete characterization of the ML estimator. In particular, we show that the squared estimation error decreases as 1/M, where M is the number of random values locally generated by each node, independently of the chosen probability distribution. In contrast, in the average-consensus scenario, we show that if the locally generated data are independent Bernoulli trials, then the probability that the ML estimator returns a wrong answer decreases exponentially in M. Finally, we discuss how numerical errors may affect the estimators' performance under different scenarios.
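The max-consensus scheme can be illustrated under one concrete choice of distribution. The sketch below assumes each node draws M values from Uniform(0, 1), which the abstract does not specify; the consensus step is simulated centrally rather than by message passing, and the function names are hypothetical. Under this assumption the component-wise maximum over N nodes has density $N x^{N-1}$, so maximizing the likelihood gives $\hat{N} = -M / \sum_{i=1}^{M} \log x_i$.

```python
import math
import random

def ml_size_estimate(consensus):
    """ML estimate of N from M component-wise maxima of Uniform(0,1) draws."""
    m = len(consensus)
    return -m / sum(math.log(x) for x in consensus)

def estimate_size_max_consensus(num_nodes, m, seed=0):
    """Simulate the scheme centrally (assumption: no real message passing)."""
    rng = random.Random(seed)
    # Each node draws m independent values; 1 - random() lies in (0, 1],
    # which keeps the logarithm well defined.
    local = [[1.0 - rng.random() for _ in range(m)] for _ in range(num_nodes)]
    # Max-consensus: every node ends up holding the component-wise maximum.
    consensus = [max(column) for column in zip(*local)]
    return ml_size_estimate(consensus)
```

No node identifiers appear in the computation, which is what makes the approach usable in anonymous networks; accuracy is controlled by M alone.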
Space-efficient sampling from social activity streams
In BigMine, 2012
Cited by 4 (2 self)
In order to efficiently study the characteristics of network domains and support development of network systems (e.g., algorithms and protocols that operate on networks), it is often necessary to sample a representative subgraph from a large complex network. Although recent subgraph sampling methods have been shown to work well, they focus on sampling from memory-resident graphs and assume that the sampling algorithm can access the entire graph in order to decide which nodes/edges to select. Many large-scale network datasets, however, are too large and/or dynamic to be processed using main memory (e.g., email, tweets, wall posts). In this work, we formulate the problem of sampling from large graph streams. We propose a streaming graph sampling algorithm that dynamically maintains a representative sample in a reservoir-based setting. We evaluate the efficacy of our proposed methods empirically using several real-world data sets. Across all datasets, we found that our methods produce samples that better preserve the original graph distributions.
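The reservoir-based setting can be illustrated with classic reservoir sampling applied to an edge stream. This is the textbook algorithm, not the streaming graph sampler proposed in the paper; the function name and the representation of edges as tuples are assumptions.

```python
import random

def reservoir_sample_edges(edge_stream, k, seed=0):
    """Keep a uniform random sample of k edges from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, edge in enumerate(edge_stream):
        if i < k:
            # Fill the reservoir with the first k edges.
            reservoir.append(edge)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint bounds are inclusive
            if j < k:
                reservoir[j] = edge
    return reservoir
```

Memory use is bounded by `k` regardless of stream length, and each edge is seen exactly once, which matches the constraint that the full graph cannot be held in main memory.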
Distributed size estimation of dynamic anonymous networks
In 2012 IEEE 51st Annual Conference on Decision and Control (CDC), 2012
Cited by 3 (2 self)
We consider the problem of estimating the size of dynamic anonymous networks, motivated by network maintenance. The proposed algorithm is based on max-consensus information exchange protocols, and extends a previous algorithm for static anonymous networks. A regularization term accounts for a priori assumptions on the smoothness of the estimate, and we specifically consider quadratic regularization terms since they lead to closed-form solutions and intuitive design laws. We derive an explicit estimation scheme for a particular peer-to-peer service network, starting from its statistical model. To validate the accuracy of the algorithm, we perform numerical experiments and show how the algorithm can be implemented using finite-precision arithmetic and a small communication burden.
On the Estimation Accuracy of Degree Distributions from Graph Sampling
Cited by 3 (1 self)
Estimating characteristics of large graphs via sampling is vital in the study of complex networks. In this work, we study the Mean Squared Error (MSE) associated with different sampling methods for the degree distribution. These sampling methods include independent random vertex (RV) and random edge (RE) sampling, and crawling methods such as random walks (RWs) and the widely used Metropolis-Hastings algorithm for uniformly sampling vertices (MHRWu). We see that the RW MSE is proportional to the RE MSE and inversely proportional to the spectral gap of the RW transition probability matrix. We also determine conditions under which RW is preferable to RV. Finally, we present an approximation of the MHRWu MSE. We evaluate the accuracy of our approximations and bounds through simulations on large real-world graphs.
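As a hedged illustration of how a degree distribution can be estimated from a random walk, the sketch below uses the standard re-weighting approach: a simple random walk on a connected undirected graph visits a vertex with probability proportional to its degree, so each visit is weighted by the inverse degree. This is a common estimator and is not necessarily the exact one analyzed in the paper; the function name and arguments are hypothetical.

```python
import random
from collections import defaultdict

def rw_degree_distribution(adj, steps, seed=0):
    """Estimate the fraction of vertices with each degree via a random walk.

    adj maps each vertex to a non-empty list of neighbors
    (connected undirected graph assumed).
    """
    rng = random.Random(seed)
    current = rng.choice(list(adj))
    weights = defaultdict(float)
    total = 0.0
    for _ in range(steps):
        degree = len(adj[current])
        # Inverse-degree weight cancels the walk's bias toward hubs.
        weights[degree] += 1.0 / degree
        total += 1.0 / degree
        current = rng.choice(adj[current])
    return {degree: w / total for degree, w in weights.items()}
```

On a star graph with one hub and four leaves, the walk spends half its time at the hub, yet the re-weighted estimate recovers the true shares of 0.2 for degree 4 and 0.8 for degree 1.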