Results 1–10 of 19
Temporal Analysis of the Wikigraph
In Proc. of Web Intelligence, Hong Kong, 2006
Cited by 40 (0 self)
Abstract — Wikipedia (www.wikipedia.org) is an online encyclopedia, available in more than 100 languages and comprising over 1 million articles in its English version. If we consider each Wikipedia article as a node and each hyperlink between articles as an arc, we have a “Wikigraph”, a graph that represents the link structure of Wikipedia. The Wikigraph differs from other Web graphs studied in the literature in that there are timestamps associated with each node. The timestamps indicate the creation and update dates of each page, which allows us to do a detailed analysis of Wikipedia's evolution over time. In the first part of this study we characterize this evolution in terms of users, editions and articles; in the second part, we depict the temporal evolution of several topological properties of the Wikigraph. The insights obtained from the Wikigraph can be applied to large Web graphs for which such temporal data is usually not available.
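The time-aware graph the abstract describes can be sketched minimally as follows. This is an illustrative toy, not the paper's data model: the `Article` type, its fields, and the sample articles are all assumptions; the point is only that per-node timestamps let the link structure be replayed at any moment.

```python
from dataclasses import dataclass, field

@dataclass
class Article:
    title: str
    created: int                              # creation timestamp (e.g. Unix time)
    updated: int                              # last-update timestamp
    links: set = field(default_factory=set)   # outgoing hyperlinks (arcs)

def snapshot(articles, t):
    """Arcs of the Wikigraph restricted to articles already created at time t."""
    alive = {a.title for a in articles if a.created <= t}
    return {(a.title, b) for a in articles if a.title in alive
            for b in a.links if b in alive}

articles = [
    Article("Graph", created=1, updated=9, links={"PageRank"}),
    Article("PageRank", created=5, updated=8, links={"Graph", "Web"}),
    Article("Web", created=7, updated=7, links=set()),
]

print(sorted(snapshot(articles, 4)))  # -> [] (only "Graph" exists yet)
print(sorted(snapshot(articles, 6)))  # -> [('Graph', 'PageRank'), ('PageRank', 'Graph')]
```

Iterating `snapshot` over a range of timestamps yields the sequence of graphs whose topological properties the paper tracks over time.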
Web graph similarity for anomaly detection
Journal of Internet Services and Applications, 2010
Cited by 30 (4 self)

Abstract
Web graphs are approximate snapshots of the web, created by search engines. They are essential to monitor the evolution of the web and to compute global properties like PageRank values of web pages. Their continuous monitoring requires a notion of graph similarity to help measure the amount and significance of changes in the evolving web. As a result, these measurements provide means to validate how well search engines acquire content from the web. In this paper we propose five similarity schemes: three of them we adapted from existing graph similarity measures, and two we adapted from well-known document and vector similarity methods (namely, the shingling method and the random-projection-based method). We empirically evaluate and compare all five schemes using a sequence of web graphs from Yahoo!, and study whether the schemes can identify anomalies that may occur due to hardware or other problems.
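The flavor of such similarity schemes can be sketched with a deliberately crude baseline: treat each graph as its set of arcs and take the Jaccard similarity of the two sets. The paper's actual schemes (shingling, random projection) are more refined; the function names and toy graphs below are assumptions for illustration only.

```python
def edge_shingles(graph):
    """Represent a web graph by the set of its arcs (a crude 'shingle' set).
    graph: dict mapping node -> iterable of successor nodes."""
    return {(u, v) for u, succs in graph.items() for v in succs}

def jaccard_similarity(g1, g2):
    """Jaccard similarity of the two graphs' arc sets: |A ∩ B| / |A ∪ B|."""
    a, b = edge_shingles(g1), edge_shingles(g2)
    return len(a & b) / len(a | b) if (a | b) else 1.0

g_today = {"a": ["b", "c"], "b": ["c"], "c": []}
g_tomorrow = {"a": ["b", "c"], "b": [], "c": ["a"]}  # one arc lost, one gained

print(jaccard_similarity(g_today, g_today))               # -> 1.0
print(round(jaccard_similarity(g_today, g_tomorrow), 2))  # -> 0.5
```

A sudden drop in such a score between consecutive snapshots is exactly the kind of signal the paper uses to flag anomalies in graph acquisition.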
RankMass Crawler: A Crawler with High Personalized PageRank Coverage Guarantee
2007
Cited by 18 (1 self)

Abstract
Crawling algorithms have been the subject of extensive research and optimization, but some important questions remain open. In particular, given the unbounded number of pages available on the Web, search-engine operators constantly struggle with the following vexing questions: When can I stop downloading the Web? How many pages should I download to cover “most” of the Web? How can I know I am not missing an important part when I stop? In this paper we provide an answer to these questions by developing, in the context of a system that is given a set of trusted pages, a family of crawling algorithms that (1) provide a theoretical guarantee on how much of the “important” part of the Web they will download after crawling a certain number of pages and (2) give high priority to important pages during a crawl, so that the search engine can index the most important part of the Web first. We prove the correctness of our algorithms by theoretical analysis and evaluate their performance experimentally based on 141 million URLs obtained from the Web. Our experiments demonstrate that even our simple algorithm is effective in downloading important pages early on and provides high “coverage” of the Web with a relatively small number of pages.
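The core idea of crawling by "important pages first" can be sketched as a greedy crawl driven by accumulated link mass flowing from trusted seeds. This is only a toy in the spirit of the paper (closer to an OPIC-style heuristic than to RankMass's exact formulas and guarantees); the toy web, function name, and parameters are assumptions.

```python
def crawl_by_mass(web, seeds, damping=0.85, budget=4):
    """Greedy crawl: repeatedly download the uncrawled page holding the most
    accumulated mass; each crawled page pushes damping * (its mass) to its
    outlinks, split equally. web: dict page -> list of linked pages."""
    mass = {s: 1.0 / len(seeds) for s in seeds}  # trust starts at the seeds
    crawled, order = set(), []
    while len(order) < budget:
        frontier = {p: m for p, m in mass.items() if p not in crawled}
        if not frontier:
            break
        page = max(frontier, key=frontier.get)   # richest uncrawled page first
        crawled.add(page)
        order.append(page)
        links = web.get(page, [])
        for nxt in links:                        # propagate damped mass onward
            mass[nxt] = mass.get(nxt, 0.0) + damping * mass[page] / len(links)
    return order

web = {"seed": ["hub", "leaf"], "hub": ["a", "b"], "a": [], "b": [], "leaf": []}
order = crawl_by_mass(web, seeds=["seed"], budget=4)
print(order)  # seed first, then the pages richest in inherited mass
```

The paper's contribution beyond such a heuristic is the bookkeeping that turns accumulated mass into a provable lower bound on the PageRank coverage achieved at any stopping point.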
Determining factors behind the PageRank log-log plot
Mathematics and Computer Science, University of Twente

Cited by 11 (5 self)

Abstract
We study the relation between PageRank and other parameters of information networks such as indegree, outdegree, and the fraction of dangling nodes. We model this relation through a stochastic equation inspired by the original definition of PageRank. Further, we use the theory of regular variation to prove that PageRank and indegree follow power laws with the same exponent. The difference between these two power laws is a multiplicative coefficient, which depends mainly on the fraction of dangling nodes, the average indegree, the power-law exponent, and the damping factor. The outdegree distribution has a minor effect, which we explicitly quantify. Our theoretical predictions show good agreement with experimental data on three different samples of the Web.
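A stochastic equation of the kind the abstract refers to can be sketched as a distributional fixed point; the symbols below are assumptions chosen for illustration, not necessarily the paper's exact notation.

```latex
% Distributional fixed-point equation relating PageRank to indegree (sketch):
%   R   = suitably scaled PageRank of a randomly chosen page
%   N   = its indegree,  D_j = outdegree of its j-th in-neighbor
%   R_j = i.i.d. copies of R,  c = damping factor
\[
  R \;\stackrel{d}{=}\; \sum_{j=1}^{N} \frac{c}{D_j}\, R_j \;+\; (1-c)
\]
% Regular-variation arguments then yield
% P(R > x) \sim C \, P(N > x) as x \to \infty,
% i.e. the same power-law exponent for PageRank and indegree,
% with the difference confined to the multiplicative constant C.
```

Intuitively, a page's rank is a damped sum of the ranks its in-neighbors pass along, plus the teleportation term, so a heavy tail in the number of summands N reproduces itself in R.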
Indegree and PageRank: Why do they follow similar power laws?
Cited by 7 (5 self)

Abstract
PageRank is a popularity measure designed by Google to rank Web pages. Experiments confirm that PageRank values obey a power law with the same exponent as InDegree values. This paper presents a novel mathematical model that explains this phenomenon. The relation between PageRank and InDegree is modeled through a stochastic equation, which is inspired by the original definition of PageRank, and is analogous to the well-known distributional identity for the busy period in the M/G/1 queue. Further, we employ the theory of regular variation and Tauberian theorems to analytically prove that the tail distributions of PageRank and InDegree differ only by a multiplicative constant, for which we derive a closed-form expression. Our analytical results are in good agreement with experimental data.
InDegree and PageRank of Web Pages: Why Do They Follow Similar Power Laws?
Memorandum 1807, Dept. of Applied Math., Univ. of Twente, 2006
Cited by 5 (1 self)

Abstract
The PageRank is a popularity measure designed by Google to rank Web pages. Experiments confirm that the PageRank obeys a ‘power law’ with the same exponent as the InDegree. This paper presents a novel mathematical model that explains this phenomenon. The relation between the PageRank and InDegree is modelled through a stochastic equation, which is inspired by the original definition of the PageRank, and is analogous to the well-known distributional identity for the busy period in the M/G/1 queue. Further, we employ the theory of regular variation and Tauberian theorems to analytically prove that the tail behaviors of the PageRank and the InDegree differ only by a multiplicative factor, for which we derive a closed-form expression. Our analytical results are in good agreement with experimental data.
FrogWild! – Fast PageRank Approximations on Graph Engines
In NIPS Workshop on Distributed Machine Learning and Matrix Computations, 2014
Cited by 1 (0 self)

Abstract
We propose FrogWild, a novel algorithm for fast approximation of high-PageRank vertices, geared towards reducing the network costs of running traditional PageRank algorithms. Our algorithm can be seen as a quantized version of power iteration that performs multiple parallel random walks over a directed graph. One important innovation is that we introduce a modification to the GraphLab framework that only partially synchronizes mirror vertices. This partial synchronization vastly reduces the network traffic generated by traditional PageRank algorithms, thus greatly reducing the per-iteration cost of PageRank. On the other hand, this partial synchronization also creates dependencies between the random walks used to estimate PageRank. Our main theoretical innovation is the analysis of the correlations introduced by this partial synchronization process and a bound establishing that our approximation is close to the true PageRank vector. We implement our algorithm in GraphLab and compare it against the default PageRank implementation. We show that our algorithm is very fast, performing each iteration in less than one second on the Twitter graph, and can be up to 7× faster than the standard GraphLab PageRank implementation.
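The random-walk view of PageRank that FrogWild builds on can be sketched serially in a few lines: run many short damped walks and read off the empirical distribution of their endpoints. This toy omits everything distributed (GraphLab, mirror vertices, partial synchronization); the graph and parameter values are assumptions for illustration.

```python
import random

def approx_pagerank(graph, walks=20000, damping=0.85, seed=0):
    """Monte Carlo PageRank: the empirical distribution of the endpoints of
    many short random walks approximates the PageRank vector.
    graph: dict node -> list of successors."""
    rng = random.Random(seed)
    nodes = list(graph)
    counts = {v: 0 for v in nodes}
    for _ in range(walks):
        v = rng.choice(nodes)               # teleport: uniform starting node
        while rng.random() < damping:       # continue the walk w.p. `damping`
            succs = graph[v]
            v = rng.choice(succs) if succs else rng.choice(nodes)  # dangling: jump
        counts[v] += 1
    return {v: c / walks for v, c in counts.items()}

# Toy graph where "hub" receives links from every other page.
g = {"hub": ["a"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
pr = approx_pagerank(g)
print(max(pr, key=pr.get))  # the hub should come out on top
```

Because only walk endpoints matter, such estimators can trade accuracy on the tail of the ranking for far less communication, which is the trade-off the paper quantifies.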
Approximating Eigenvectors by Subsampling
2009
Cited by 1 (0 self)

Abstract
We show that averaging eigenvectors of randomly sampled submatrices efficiently approximates the true eigenvectors of the original matrix under certain conditions on the incoherence of the spectral decomposition. This incoherence assumption is typically milder than those made in matrix completion and allows eigenvectors to be sparse. We discuss applications to spectral methods in dimensionality reduction and information retrieval.
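The averaging scheme the abstract describes can be sketched numerically: draw random principal submatrices, take each one's top eigenvector, embed it back into full dimension, and average. Everything below (matrix, sizes, incoherence setup) is an assumed toy, not the paper's experiments; the sign fix exploits the fact that the toy's true top eigenvector has positive entries.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_eigvec(m):
    """Unit eigenvector of a symmetric matrix for its largest eigenvalue."""
    _, vecs = np.linalg.eigh(m)
    return vecs[:, -1]

def subsampled_eigvec(a, k, trials, rng):
    """Average top eigenvectors of random k x k principal submatrices,
    embedded back into n dimensions (zeros outside the sampled indices)."""
    n = a.shape[0]
    acc = np.zeros(n)
    for _ in range(trials):
        idx = rng.choice(n, size=k, replace=False)
        v = top_eigvec(a[np.ix_(idx, idx)])
        if v.sum() < 0:                      # resolve the eigenvector sign ambiguity
            v = -v
        acc[idx] += v
    return acc / np.linalg.norm(acc)

# Toy symmetric matrix whose top eigenvector is incoherent (spread out).
n = 30
u = np.ones(n) / np.sqrt(n)
a = 10.0 * np.outer(u, u) + 0.1 * rng.standard_normal((n, n))
a = (a + a.T) / 2

v_true = top_eigvec(a)
if v_true.sum() < 0:
    v_true = -v_true
v_est = subsampled_eigvec(a, k=10, trials=200, rng=rng)
print(round(abs(float(v_true @ v_est)), 2))  # near 1 when the approximation works
```

The incoherence condition matters here: if the true eigenvector were concentrated on a few coordinates, most random submatrices would miss it entirely and the average would be uninformative.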
RankMass Crawler: A Crawler with High PageRank Coverage Guarantee
Abstract
Crawling algorithms have been the subject of extensive research and optimization, but some important questions remain open. In particular, given the infinite number of pages available on the Web, search-engine operators constantly struggle with the following vexing questions: When can I stop downloading the Web? How many pages should I download to cover “most” of the Web? How can I know I am not missing an important part when I stop? In this paper we provide an answer to these questions by developing a family of crawling algorithms that (1) provide a theoretical guarantee on how much of the “important” part of the Web they will download after crawling a certain number of pages and (2) give high priority to important pages during a crawl, so that the search engine can index the most important part of the Web first. We prove the correctness of our algorithms by theoretical analysis and evaluate their performance experimentally based on 141 million URLs obtained from the Web. Our experiments demonstrate that even our simple algorithm is effective in downloading important pages early on and provides high “coverage” of the Web with a relatively small number of pages.