Estimating and sampling graphs with multidimensional random walks. (2010)

by B Ribeiro, D Towsley
Venue: In Proc. IMC

Results 1 - 10 of 68

Practical recommendations on crawling online social networks

by Minas Gjoka, Maciej Kurant, Carter T. Butts, Athina Markopoulou - IEEE Journal on Selected Areas in Communications, 2011
Abstract - Cited by 37 (1 self)
Our goal in this paper is to develop a practical framework for obtaining a uniform sample of users in an online social network (OSN) by crawling its social graph. Such a sample allows us to estimate any user property and some topological properties as well. To this end, first, we consider and compare several candidate crawling techniques. Two approaches that can produce approximately uniform samples are the Metropolis-Hastings random walk (MHRW) and a re-weighted random walk (RWRW). Both have pros and cons, which we demonstrate through a comparison to each other as well as to the “ground truth.” In contrast, using Breadth-First-Search (BFS) or an unadjusted Random Walk (RW) leads to substantially biased results. Second, and in addition to offline performance assessment, we introduce online formal convergence diagnostics to assess sample quality during the data collection process. We show how these diagnostics can be used to effectively determine when a random walk sample is of adequate size and quality. Third, as a case study, we apply the above methods to Facebook and we collect the first, to the best of our knowledge, representative sample of Facebook users. We make it publicly available and employ it to characterize several key properties of Facebook.

Citation Context

...mpared with MHRW in the context of peer-to-peer sampling by Rasti et al. [17]. Further improvements or variants of random walks include random walk with jumps [29,33], multiple dependent random walks [34], weighted random walks [35], or multigraph sampling [36]. Our work is most closely related to the random walk techniques. We obtain unbiased estimators of user properties in Facebook using MHRW and R...
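
The contrast drawn in the abstract above between an unadjusted random walk, MHRW, and RWRW can be illustrated with a minimal sketch. It is not the authors' code: the toy graph `ADJ` and all function names are hypothetical. It assumes an undirected, connected, non-bipartite graph, the standard Metropolis-Hastings acceptance ratio for a uniform target distribution, and the usual 1/degree re-weighting of a simple random walk.

```python
import random

# Hypothetical toy graph (undirected, connected, non-bipartite), as adjacency lists.
ADJ = {
    0: [1, 2, 3, 4],
    1: [0, 2],
    2: [0, 1],
    3: [0, 4],
    4: [0, 3, 5],
    5: [4],
}

def simple_rw(adj, start, steps, rng):
    """Unadjusted random walk: visits nodes proportionally to their degree."""
    node, walk = start, []
    for _ in range(steps):
        node = rng.choice(adj[node])
        walk.append(node)
    return walk

def mhrw(adj, start, steps, rng):
    """Metropolis-Hastings random walk targeting the uniform distribution."""
    node, walk = start, []
    for _ in range(steps):
        cand = rng.choice(adj[node])
        # Accept with probability min(1, deg(node) / deg(cand)); otherwise stay.
        if rng.random() < len(adj[node]) / len(adj[cand]):
            node = cand
        walk.append(node)
    return walk

def plain_mean(walk, f):
    """Unweighted sample mean of f over the visited nodes."""
    return sum(f(v) for v in walk) / len(walk)

def rwrw_mean(walk, adj, f):
    """Re-weighted mean: each sample is weighted by 1/degree to undo the degree bias."""
    num = sum(f(v) / len(adj[v]) for v in walk)
    den = sum(1.0 / len(adj[v]) for v in walk)
    return num / den
```

On this toy graph, averaging node degree over a plain `simple_rw` sample overestimates the true mean degree, whereas `plain_mean` over an `mhrw` sample and `rwrw_mean` over a `simple_rw` sample both approach it as the walk grows.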

Improving random walk estimation accuracy with uniform restarts

by Konstantin Avrachenkov, Bruno Ribeiro, Don Towsley - In Algorithms and Models for the Web-Graph, 2010
Abstract - Cited by 36 (11 self)

Citation Context

...g to valid users [10], i.e., only one in every ten queries successfully finds a valid MySpace account. Within crawl-based sampling methods, random walk (RW) sampling is among the most popular methods [5, 11, 12, 18, 20, 23]. Let G = (V,E) be an undirected, non-bipartite graph with n nodes. RW sampling is preferred because it requires few resources and, when G is connected, can be shown to produce asymptotically unbiased...

Multigraph Sampling of Online Social Networks

by Minas Gjoka, Carter T. Butts, Maciej Kurant, Athina Markopoulou - IEEE J. SEL. AREAS COMMUN. ON MEASUREMENT OF INTERNET TOPOLOGIES, 2011
Abstract - Cited by 26 (8 self)
State-of-the-art techniques for probability sampling of users of online social networks (OSNs) are based on random walks on a single social relation (typically friendship). While powerful, these methods rely on the social graph being fully connected. Furthermore, the mixing time of the sampling process strongly depends on the characteristics of this graph. In this paper, we observe that there often exist other relations between OSN users, such as membership in the same group or participation in the same event. We propose to exploit the graphs these relations induce, by performing a random walk on their union multigraph. We design a computationally efficient way to perform multigraph sampling by randomly selecting the graph on which to walk at each iteration. We demonstrate the benefits of our approach through (i) simulation in synthetic graphs, and (ii) measurements of Last.fm, an Internet website for music with social networking features. More specifically, we show that multigraph sampling can obtain a representative sample and faster convergence, even when the individual graphs fail, i.e., are disconnected or highly clustered.

Citation Context

...ly Friendster, Twitter and Facebook. Random walks have also been used to sample peer-to-peer networks [28]–[30] and other large graphs [31]. Design of random walk techniques to improve mixing include [18,32]–[34]. Boyd et al. [18] pose the problem of finding the fastest mixing Markov Chain on a known graph as an optimization problem. However, in our case such an exact optimization is not possible since w...
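
The idea in the abstract above of "randomly selecting the graph on which to walk at each iteration" can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the graphs `FRIENDS` and `GROUPS` and both function names are hypothetical, and the selection rule (pick a relation with probability proportional to the current node's degree in it, then a uniform neighbor in that relation) is an assumed rule chosen so that each step matches a uniform edge choice in the union multigraph.

```python
import random

# Two hypothetical relations over the same users; each one alone is disconnected.
FRIENDS = {0: [1], 1: [0], 2: [3], 3: [2]}
GROUPS = {1: [2], 2: [1]}

def multigraph_step(graphs, node, rng):
    """One step on the union multigraph without ever materializing it."""
    degrees = [len(g.get(node, [])) for g in graphs]
    if sum(degrees) == 0:
        raise ValueError("node has no neighbors in any relation")
    # Assumed rule: choose a relation proportionally to the node's degree in it.
    graph = rng.choices(graphs, weights=degrees, k=1)[0]
    return rng.choice(graph[node])

def multigraph_walk(graphs, start, steps, rng):
    """Random walk that may switch relation at every iteration."""
    node, walk = start, []
    for _ in range(steps):
        node = multigraph_step(graphs, node, rng)
        walk.append(node)
    return walk
```

A walk restricted to `FRIENDS` and started at node 0 can never leave the pair {0, 1}, while `multigraph_walk([FRIENDS, GROUPS], 0, ...)` can reach all four nodes because the union of the two relations is connected.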

Towards unbiased BFS sampling

by Maciej Kurant, Athina Markopoulou, Patrick Thiran - IEEE Journal on Selected Areas in Communications, 2011
Abstract - Cited by 26 (4 self)
Breadth First Search (BFS) is a widely used approach for sampling large graphs. However, it has been empirically observed that BFS sampling is biased toward high-degree nodes, which may strongly affect the measurement results. In this paper, we quantify and correct the degree bias of BFS. First, we consider a random graph RG(p_k) with an arbitrary degree distribution p_k. For this model, we calculate the node degree distribution expected to be observed by BFS as a function of the fraction f of covered nodes. We also show that, for RG(p_k), all commonly used graph traversal techniques (BFS, DFS, Forest Fire, Snowball Sampling, RDS) have exactly the same bias. Next, we propose a practical BFS-bias correction procedure that takes as input a collected BFS sample together with the fraction f. Our correction technique is exact (i.e., leads to unbiased estimation) for RG(p_k). Furthermore, it performs well when applied to a broad range of Internet topologies and to two large BFS samples of Facebook and Orkut networks.

Walking on a Graph with a Magnifying Glass: Stratified Sampling via Weighted Random Walks

by Maciej Kurant, Minas Gjoka, Carter T. Butts, Athina Markopoulou - in Proc. ACM SIGMETRICS, 2011
Abstract - Cited by 23 (7 self)
Our objective is to sample the node set of a large unknown graph via crawling, to accurately estimate a given metric of interest. We design a random walk on an appropriately defined weighted graph that achieves high efficiency by preferentially crawling those nodes and edges that convey greater information regarding the target metric. Our approach begins by employing the theory of stratification to find optimal node weights, for a given estimation problem, under an independence sampler. While optimal under independence sampling, these weights may be impractical under graph crawling due to constraints arising from the structure of the graph. Therefore, the edge weights for our random walk should be chosen so as to lead to an equilibrium distribution that strikes a balance between approximating the optimal weights under an independence sampler and achieving fast convergence. We propose a heuristic approach (stratified weighted random walk, or S-WRW) that achieves this goal, while using only limited information about the graph structure and the node properties. We evaluate our technique in simulation, and experimentally, by collecting a sample of Facebook college users. We show that S-WRW requires 13-15 times fewer samples than the simple re-weighted random walk (RW) to achieve the same estimation accuracy for a range of metrics.

Citation Context

...ainly interested in measuring the relative sizes $f_{\text{tiny}}$ and $f_{\text{big}}$ of categories $C_{\text{tiny}}$ and $C_{\text{big}}$, respectively. We use Normalized Root Mean Square Error (NRMSE) to assess the estimation error, defined as [37]: $\mathrm{NRMSE}(\hat{x}) = \frac{\sqrt{E[(\hat{x} - x)^2]}}{x}$ (29), where $x$ is the real value and $\hat{x}$ is the estimated one. Footnote 5: The term “community” refers to cluster and is defined purely based on topology. The term “category” is a ...
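
The NRMSE defined in the citation context above can be approximated numerically. The sketch below is an assumption-laden illustration rather than anything from the cited paper: the function name `nrmse` is hypothetical, and the expectation is replaced by an empirical average over estimates obtained from independent runs.

```python
import math

def nrmse(estimates, true_value):
    """Empirical NRMSE: root of the mean squared error, divided by the true value.

    `estimates` holds one estimate per independent run (an assumption; the
    definition itself uses an expectation rather than a finite average).
    """
    mse = sum((e - true_value) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / true_value
```

For example, two runs that return 9 and 11 for a true value of 10 have a mean squared error of 1, so `nrmse([9, 11], 10)` evaluates to 0.1.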

Sampling directed graphs with random walks

by Bruno Ribeiro, Pinghui Wang, Fabricio Murai, Don Towsley, 2011
Abstract - Cited by 16 (5 self)
Despite recent efforts to characterize complex networks such as citation graphs or online social networks (OSNs), little attention has been given to developing tools that can be used to characterize directed graphs in the wild, where no pre-processed data is available. The presence of hidden incoming edges but observable outgoing edges poses a challenge to characterize large directed graphs through crawling, as existing sampling methods cannot cope with hidden incoming links. The driving principle behind our random walk (RW) sampling method is to construct, in real-time, an undirected graph from the directed graph such that the random walk on the directed graph is consistent with one on the undirected graph. We then use the RW on the undirected graph to estimate the outdegree distribution. Our algorithm accurately estimates outdegree distributions of a variety of real world graphs. We also study the hardness of indegree distribution estimation when indegrees are latent (i.e., incoming links are only observed as outgoing edges). We observe that, in the same scenarios, indegree distribution estimates are highly inaccurate unless the directed graph is highly symmetrical.

Citation Context

...e fraction of nodes with indegree j, R is the largest outdegree, and W is the largest indegree. The degree distribution of a large undirected graph can be estimated using random walks (RW) [7], [11], [14]. But these RW methods cannot be readily applied to directed graphs with hidden incoming edges, which is the case of a number of interesting directed networks, e.g., the WWW, Wikipedia, and Flickr. To...

Beyond random walk and Metropolis-Hastings samplers: Why you should not backtrack for unbiased graph sampling

by Chul-ho Lee, Xin Xu, Do Young Eun, 2012
Abstract - Cited by 14 (0 self)
Abstract not found

Citation Context

...$= \mathbf{1}\{d(i) = d\}$, $i \in N$, for the corresponding estimators. Similarly, we choose $f(i) = \mathbf{1}\{d(i) > d\}$ for $P\{D_G > d\}$. To measure the estimation accuracy, we use the following normalized root mean square error (NRMSE) [5, 30, 20]: $\sqrt{E\{(\hat{x}(t) - x)^2\}}/x$, where $\hat{x}(t)$ is the estimated value out of $t$ samples and $x$ is the (ground-truth) real value. ($x = \lim_{t \to \infty} \hat{x}(t)$ from unbiasedness.) In all simulations, an initial position of each r...

Network Sampling: From Static to Streaming Graphs

by Nesreen K. Ahmed, Jennifer Neville, Ramana Kompella, 2013
Abstract - Cited by 12 (3 self)
Network sampling is integral to the analysis of social, information, and biological networks. Since many real-world networks are massive in size, continuously evolving, and/or distributed in nature, the network structure is often sampled in order to facilitate study. For these reasons, a more thorough and complete understanding of network sampling is critical to support the field of network science. In this paper, we outline a framework for the general problem of network sampling, by highlighting the different objectives, population and units of interest, and classes of network sampling methods. In addition, we propose a spectrum of computational models for network sampling methods, ranging from the traditionally studied model based on the assumption of a static domain to a more challenging model that is appropriate for streaming domains. We design a family of sampling methods based on the concept of graph induction that generalize across the full spectrum of computational models (from static to streaming) while efficiently preserving many of the topological properties of the input graphs. Furthermore, we demonstrate how traditional static sampling algorithms can be modified for graph streams for each of the three main classes of sampling methods: node, edge, and topology-based sampling. Experimental results indicate that our proposed family of sampling methods more accurately preserve the underlying properties of the graph in both static and streaming domains. Finally, we study the impact of network sampling algorithms on the parameter estimation and performance evaluation of relational classification algorithms.

Coarse-Grained Topology Estimation via Graph Sampling

by Maciej Kurant, Minas Gjoka, Yan Wang, Zack W. Almquist, Carter T. Butts, Athina Markopoulou, 2012
Abstract - Cited by 10 (4 self)
In many online networks, nodes are partitioned into categories (e.g., countries or universities in OSNs), which naturally defines a weighted category graph, i.e., a coarse-grained version of the underlying network. In this paper, we show how to efficiently estimate the category graph from a probability sample of nodes. We prove consistency of our estimators and evaluate their efficiency via simulation. We also apply our methodology to a sample of Facebook users to obtain a number of category graphs, such as the college friendship graph and the country friendship graph. We share and visualize the resulting data at www.geosocialmap.com.

Citation Context

...e-of-the-art crawling-based node sampling techniques use variants of random walks (RW), such as the classic RW [20,27,41,51,56], Metropolis-Hastings RW (MHRW) [18,20,42,51,60], multiple dependent RW [52], multigraph RW [19], RW with jumps [6,30,38,53], and weighted RW [35]. Based on the resulting (uniform or non-uniform) sample of nodes, there exist principled methods to estimate local graph properti...

Counting YouTube videos via random prefix sampling

by Jia Zhou, Yanhua Li, Vijay Kumar Adhikari, Zhi-li Zhang - In SIGCOMM, 2011
Abstract - Cited by 10 (1 self)
Leveraging the characteristics of the YouTube video ID space and exploiting a unique property of the YouTube search API, in this paper we develop a random prefix sampling method to estimate the total number of videos hosted by YouTube. Through theoretical modeling and analysis, we demonstrate that the estimator based on this method is unbiased, and provide bounds on its variance and confidence interval. These bounds enable us to judiciously select sample sizes to control estimation errors. We evaluate our sampling method and validate the sampling results using two distinct collections of YouTube video IDs (namely, treating each collection as if it were the “true” collection of YouTube videos). We then apply our sampling method to the live YouTube system, and estimate that there were a total of roughly 500 million YouTube videos as of May 2011. Finally, using an unbiased collection of YouTube videos sampled by our method, we show that YouTube video view count statistics collected by prior methods (e.g., through crawling of related video links) are highly skewed, significantly under-estimating the number of videos with very small view counts (< 1000); we also shed light on the bounds for the total storage YouTube must have and the network capacity needed to deliver YouTube videos.