Practical recommendations on crawling online social networks
- IEEE Journal on Selected Areas in Communications, 2011
- Cited by 37 (1 self)
Our goal in this paper is to develop a practical framework for obtaining a uniform sample of users in an online social network (OSN) by crawling its social graph. Such a sample allows us to estimate any user property and some topological properties as well. To this end, first, we consider and compare several candidate crawling techniques. Two approaches that can produce approximately uniform samples are the Metropolis-Hastings random walk (MHRW) and a re-weighted random walk (RWRW). Both have pros and cons, which we demonstrate through a comparison to each other as well as to the “ground truth.” In contrast, using Breadth-First-Search (BFS) or an unadjusted Random Walk (RW) leads to substantially biased results. Second, and in addition to offline performance assessment, we introduce online formal convergence diagnostics to assess sample quality during the data collection process. We show how these diagnostics can be used to effectively determine when a random walk sample is of adequate size and quality. Third, as a case study, we apply the above methods to Facebook and we collect the first, to the best of our knowledge, representative sample of Facebook users. We make it publicly available and employ it to characterize several key properties of Facebook.
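As a rough illustration of the two approximately uniform techniques named in this abstract, the Python sketch below builds exact MHRW transition probabilities for a small undirected graph, simulates the walk, and applies a re-weighted estimator to degree-biased samples. The toy graph, the function names, and the choice of a 1/degree-weighted ratio estimator for RWRW are assumptions made for illustration; they are not taken from the paper.

```python
from fractions import Fraction
import random

# Hypothetical toy graph: adjacency lists of an undirected, connected graph.
GRAPH = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a"],
}

def mhrw_transition(adj):
    """Exact MHRW transition probabilities targeting the uniform distribution."""
    P = {}
    for u, nbrs in adj.items():
        row = {}
        for v in nbrs:
            # Propose v uniformly among the neighbors of u, then
            # accept with probability min(1, deg(u) / deg(v)).
            accept = min(Fraction(1), Fraction(len(nbrs), len(adj[v])))
            row[v] = Fraction(1, len(nbrs)) * accept
        # Rejected proposals keep the walk at u (a self-loop).
        row[u] = 1 - sum(row.values())
        P[u] = row
    return P

def mhrw_walk(adj, start, steps, rng):
    """Simulate an MHRW; every step, including a rejection, yields a sample."""
    current, samples = start, []
    for _ in range(steps):
        candidate = rng.choice(adj[current])
        if rng.random() < len(adj[current]) / len(adj[candidate]):
            current = candidate
        samples.append(current)
    return samples

def rwrw_mean(samples, adj, value):
    """Estimate the uniform mean of `value` from simple-random-walk samples.

    A simple random walk visits nodes in proportion to their degree, so each
    sample is weighted by 1/degree before averaging.
    """
    num = sum(Fraction(value[v]) / len(adj[v]) for v in samples)
    den = sum(Fraction(1, len(adj[v])) for v in samples)
    return num / den
```

Because the MHRW transition probabilities are symmetric in this construction, the uniform distribution is stationary, which can be checked by confirming that every column of the transition table sums to 1.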
Improving random walk estimation accuracy with uniform restarts
- In Algorithms and Models for the Web-Graph, 2010
- Cited by 36 (11 self)
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Multigraph Sampling of Online Social Networks
- IEEE J. Sel. Areas Commun. on Measurement of Internet Topologies, 2011
- Cited by 26 (8 self)
State-of-the-art techniques for probability sampling of users of online social networks (OSNs) are based on random walks on a single social relation (typically friendship). While powerful, these methods rely on the social graph being fully connected. Furthermore, the mixing time of the sampling process strongly depends on the characteristics of this graph. In this paper, we observe that there often exist other relations between OSN users, such as membership in the same group or participation in the same event. We propose to exploit the graphs these relations induce, by performing a random walk on their union multigraph. We design a computationally efficient way to perform multigraph sampling by randomly selecting the graph on which to walk at each iteration. We demonstrate the benefits of our approach through (i) simulation in synthetic graphs, and (ii) measurements of Last.fm, an Internet website for music with social networking features. More specifically, we show that multigraph sampling can obtain a representative sample and faster convergence, even when the individual graphs fail, i.e., are disconnected or highly clustered.
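The abstract does not say how the graph is selected at each iteration. The Python sketch below assumes one natural rule: at the current node, pick a relation graph with probability proportional to the node's degree in that graph, then move to a uniformly chosen neighbor within it. Under that assumption the step has the same distribution as picking uniformly among all edges incident to the node in the union multigraph, without ever building the union. The relation graphs and all names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
import random

# Hypothetical relation graphs over the same users (undirected adjacency lists).
# FRIENDSHIP alone is disconnected; the union with SAME_GROUP is connected.
FRIENDSHIP = {"u1": ["u2"], "u2": ["u1"], "u3": [], "u4": []}
SAME_GROUP = {
    "u1": ["u2", "u3"],
    "u2": ["u1", "u3"],
    "u3": ["u1", "u2", "u4"],
    "u4": ["u3"],
}
RELATIONS = [FRIENDSHIP, SAME_GROUP]

def multigraph_step(node, relations, rng):
    """One random-walk step on the union multigraph without materializing it."""
    degrees = [len(g[node]) for g in relations]
    # Assumption: choose the relation graph in proportion to the node's degree in it.
    graph = rng.choices(relations, weights=degrees, k=1)[0]
    return rng.choice(graph[node])

def step_distribution(node, relations):
    """Exact next-node distribution implied by multigraph_step."""
    total = sum(len(g[node]) for g in relations)
    dist = Counter()
    for g in relations:
        if g[node]:
            p_graph = Fraction(len(g[node]), total)
            for nbr in g[node]:
                dist[nbr] += p_graph * Fraction(1, len(g[node]))
    return dist

def union_distribution(node, relations):
    """Next-node distribution of a walk on the explicit union multigraph."""
    edges = [nbr for g in relations for nbr in g[node]]
    counts = Counter(edges)
    return {nbr: Fraction(c, len(edges)) for nbr, c in counts.items()}
```

In this toy setup a walk restricted to FRIENDSHIP could never leave the pair u1, u2, whereas the multigraph walk can reach every user.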
Towards unbiased BFS sampling
- IEEE Journal on Selected Areas in Communications, 2011
- Cited by 26 (4 self)
Breadth First Search (BFS) is a widely used approach for sampling large graphs. However, it has been empirically observed that BFS sampling is biased toward high-degree nodes, which may strongly affect the measurement results. In this paper, we quantify and correct the degree bias of BFS. First, we consider a random graph RG(p_k) with an arbitrary degree distribution p_k. For this model, we calculate the node degree distribution expected to be observed by BFS as a function of the fraction f of covered nodes. We also show that, for RG(p_k), all commonly used graph traversal techniques (BFS, DFS, Forest Fire, Snowball Sampling, RDS) have exactly the same bias. Next, we propose a practical BFS-bias correction procedure that takes as input a collected BFS sample together with the fraction f. Our correction technique is exact (i.e., leads to unbiased estimation) for RG(p_k). Furthermore, it performs well when applied to a broad range of Internet topologies and to two large BFS samples of Facebook and Orkut networks.
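The degree bias described here can be seen on a deliberately extreme toy case. The Python sketch below runs BFS until a fraction f of the nodes is covered and compares the mean degree of the sample with the true mean degree. The star graph and the function names are hypothetical, and the paper's correction procedure is not reproduced; the sketch only illustrates that a partial BFS sample can overstate degrees and that the bias vanishes as f approaches 1.

```python
from collections import deque
import math

def bfs_sample(adj, start, fraction):
    """Return the first ceil(fraction * n) nodes reached by BFS from `start`."""
    target = math.ceil(fraction * len(adj))
    visited, order, queue = {start}, [], deque([start])
    while queue and len(order) < target:
        node = queue.popleft()
        order.append(node)
        for nbr in adj[node]:
            if nbr not in visited:
                visited.add(nbr)
                queue.append(nbr)
    return order

def mean_degree(adj, nodes):
    """Average degree over the given nodes."""
    return sum(len(adj[v]) for v in nodes) / len(nodes)

def star_graph(leaves):
    """Hypothetical toy graph: hub 0 connected to leaves 1..leaves."""
    adj = {0: list(range(1, leaves + 1))}
    for leaf in range(1, leaves + 1):
        adj[leaf] = [0]
    return adj

star = star_graph(9)
half = bfs_sample(star, start=1, fraction=0.5)   # starts at a leaf, reaches the hub next
full = bfs_sample(star, start=1, fraction=1.0)
print(mean_degree(star, half), mean_degree(star, full))
```

Starting from a leaf, the hub is the second node visited, so the half-coverage sample contains the single high-degree node together with only four of the nine leaves.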
Walking on a Graph with a Magnifying Glass: Stratified Sampling via Weighted Random Walks
- In Proc. ACM SIGMETRICS, 2011
- Cited by 23 (7 self)
Our objective is to sample the node set of a large unknown graph via crawling, to accurately estimate a given metric of interest. We design a random walk on an appropriately defined weighted graph that achieves high efficiency by preferentially crawling those nodes and edges that convey greater information regarding the target metric. Our approach begins by employing the theory of stratification to find optimal node weights, for a given estimation problem, under an independence sampler. While optimal under independence sampling, these weights may be impractical under graph crawling due to constraints arising from the structure of the graph. Therefore, the edge weights for our random walk should be chosen so as to lead to an equilibrium distribution that strikes a balance between approximating the optimal weights under an independence sampler and achieving fast convergence. We propose a heuristic approach (stratified weighted random walk, or S-WRW) that achieves this goal, while using only limited information about the graph structure and the node properties. We evaluate our technique in simulation, and experimentally, by collecting a sample of Facebook college users. We show that S-WRW requires 13-15 times fewer samples than the simple re-weighted random walk (RW) to achieve the same estimation accuracy for a range of metrics.
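The general mechanism behind a weighted random walk can be sketched briefly. In the Python example below, the walk moves along an edge with probability proportional to its weight, its equilibrium distribution is proportional to each node's total incident weight, and samples are corrected by the inverse of that weight. The edge weights here are arbitrary hypothetical values; the S-WRW heuristic for choosing them from the stratification targets is not reproduced, and all function names are assumptions.

```python
from fractions import Fraction

# Hypothetical symmetric edge weights on a small undirected graph.
WEIGHTS = {("a", "b"): 4, ("a", "c"): 1, ("b", "c"): 1, ("c", "d"): 2}

def weighted_adjacency(weights):
    """Build a symmetric weighted adjacency map from an edge-weight table."""
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def wrw_transition(adj):
    """Move from u to v with probability w(u, v) / sum of weights at u."""
    return {
        u: {v: Fraction(w, sum(nbrs.values())) for v, w in nbrs.items()}
        for u, nbrs in adj.items()
    }

def equilibrium(adj):
    """Equilibrium distribution: proportional to total incident edge weight."""
    strength = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    total = sum(strength.values())
    return {u: Fraction(s, total) for u, s in strength.items()}

def reweighted_mean(samples, adj, value):
    """Correct the walk's bias by weighting each sample by 1 / node strength."""
    strength = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    num = sum(Fraction(value[v]) / strength[v] for v in samples)
    den = sum(Fraction(1, strength[v]) for v in samples)
    return num / den
```

Raising the weights around nodes that matter for the target metric makes the walk visit them more often, and the final re-weighting step removes the resulting bias from the estimate.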
Sampling directed graphs with random walks
- 2011
- Cited by 16 (5 self)
Despite recent efforts to characterize complex networks such as citation graphs or online social networks (OSNs), little attention has been given to developing tools that can be used to characterize directed graphs in the wild, where no pre-processed data is available. The presence of hidden incoming edges but observable outgoing edges poses a challenge to characterize large directed graphs through crawling, as existing sampling methods cannot cope with hidden incoming links. The driving principle behind our random walk (RW) sampling method is to construct, in real-time, an undirected graph from the directed graph such that the random walk on the directed graph is consistent with one on the undirected graph. We then use the RW on the undirected graph to estimate the outdegree distribution. Our algorithm accurately estimates outdegree distributions of a variety of real world graphs. We also study the hardness of indegree distribution estimation when indegrees are latent (i.e., incoming links are only observed as outgoing edges). We observe that, in the same scenarios, indegree distribution estimates are highly inaccurate unless the directed graph is highly symmetrical.
Beyond random walk and Metropolis-Hastings samplers: Why you should not backtrack for unbiased graph sampling
- 2012
Network Sampling: From Static to Streaming Graphs
- 2013
- Cited by 12 (3 self)
Network sampling is integral to the analysis of social, information, and biological networks. Since many real-world networks are massive in size, continuously evolving, and/or distributed in nature, the network structure is often sampled in order to facilitate study. For these reasons, a more thorough and complete understanding of network sampling is critical to support the field of network science. In this paper, we outline a framework for the general problem of network sampling, by highlighting the different objectives, population and units of interest, and classes of network sampling methods. In addition, we propose a spectrum of computational models for network sampling methods, ranging from the traditionally studied model based on the assumption of a static domain to a more challenging model that is appropriate for streaming domains. We design a family of sampling methods based on the concept of graph induction that generalize across the full spectrum of computational models (from static to streaming) while efficiently preserving many of the topological properties of the input graphs. Furthermore, we demonstrate how traditional static sampling algorithms can be modified for graph streams for each of the three main classes of sampling methods: node, edge, and topology-based sampling. Experimental results indicate that our proposed family of sampling methods more accurately preserves the underlying properties of the graph in both static and streaming domains. Finally, we study the impact of network sampling algorithms on the parameter estimation and performance evaluation of relational classification algorithms.
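To make the streaming setting concrete, the Python sketch below applies standard reservoir sampling to an edge stream, so that a fixed-size uniform sample of edges is maintained in a single pass without knowing the stream length in advance. This is a swapped-in textbook technique used only to illustrate edge sampling over a stream; it is not the graph-induction family proposed in the paper, and the function names and toy stream are hypothetical.

```python
import random

def reservoir_edge_sample(edge_stream, k, rng):
    """Keep a uniform random sample of k edges from a stream in one pass."""
    reservoir = []
    for i, edge in enumerate(edge_stream):
        if i < k:
            # Fill the reservoir with the first k edges.
            reservoir.append(edge)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = edge
    return reservoir

def sampled_nodes(edges):
    """Node set touched by the sampled edges."""
    return {v for edge in edges for v in edge}

# Hypothetical stream: edges of a path graph arriving one at a time.
stream = ((i, i + 1) for i in range(1000))
sample = reservoir_edge_sample(stream, k=10, rng=random.Random(42))
print(len(sample), len(sampled_nodes(sample)))
```

The memory used is bounded by k regardless of how many edges arrive, which is the property that separates the streaming model from the static one.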
Coarse-Grained Topology Estimation via Graph Sampling
- 2012
- Cited by 10 (4 self)
In many online networks, nodes are partitioned into categories (e.g., countries or universities in OSNs), which naturally defines a weighted category graph, i.e., a coarse-grained version of the underlying network. In this paper, we show how to efficiently estimate the category graph from a probability sample of nodes. We prove consistency of our estimators and evaluate their efficiency via simulation. We also apply our methodology to a sample of Facebook users to obtain a number of category graphs, such as the college friendship graph and the country friendship graph. We share and visualize the resulting data at www.geosocialmap.com.
Counting YouTube videos via random prefix sampling
- In SIGCOMM, 2011
- Cited by 10 (1 self)
Leveraging the characteristics of YouTube video id space and exploiting a unique property of YouTube search API, in this paper we develop a random prefix sampling method to estimate the total number of videos hosted by YouTube. Through theoretical modeling and analysis, we demonstrate that the estimator based on this method is unbiased, and provide bounds on its variance and confidence interval. These bounds enable us to judiciously select sample sizes to control estimation errors. We evaluate our sampling method and validate the sampling results using two distinct collections of YouTube video IDs (namely, treating each collection as if it were the “true” collection of YouTube videos). We then apply our sampling method to the live YouTube system, and estimate that there were a total of roughly 500 million YouTube videos as of May 2011. Finally, using an unbiased collection of YouTube videos sampled by our method, we show that YouTube video view count statistics collected by prior methods (e.g., through crawling of related video links) are highly skewed, significantly under-estimating the number of videos with very small view counts (< 1000); we also shed light on the bounds for the total storage YouTube must have and the network capacity needed to deliver YouTube videos.
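The idea of estimating a total from random prefixes can be sketched on a toy ID space. In the Python example below, prefixes of a fixed length are drawn uniformly at random, a stand-in oracle reports how many IDs start with each prefix, and the total is estimated as the number of possible prefixes times the mean count. The two-letter alphabet, the oracle, and every name are hypothetical; the real YouTube ID format and search API behavior are not modeled.

```python
import itertools
import random

ALPHABET = "ab"        # hypothetical tiny alphabet
PREFIX_LENGTH = 2      # hypothetical prefix length

def all_prefixes(alphabet, length):
    """Enumerate every possible prefix of the given length."""
    return ["".join(p) for p in itertools.product(alphabet, repeat=length)]

def make_oracle(ids):
    """Stand-in for a search interface that reveals all IDs under a prefix."""
    def count_with_prefix(prefix):
        return sum(1 for video_id in ids if video_id.startswith(prefix))
    return count_with_prefix

def estimate_total(prefixes, count_with_prefix, num_possible_prefixes):
    """Scale the mean per-prefix count by the number of possible prefixes."""
    counts = [count_with_prefix(p) for p in prefixes]
    return num_possible_prefixes * sum(counts) / len(counts)

# Hypothetical population of 7 IDs of length 3.
ids = {"aaa", "aab", "aba", "abb", "bab", "bba", "bbb"}
oracle = make_oracle(ids)
prefixes = all_prefixes(ALPHABET, PREFIX_LENGTH)

rng = random.Random(0)
drawn = [rng.choice(prefixes) for _ in range(3)]   # uniform prefixes, with replacement
print(estimate_total(drawn, oracle, len(prefixes)))
```

Each ID falls under exactly one prefix, so the per-prefix counts average to the population size divided by the number of prefixes, which is why the scaled mean is unbiased when prefixes are drawn uniformly.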