Results 1-10 of 167
Combating web spam with TrustRank
 In VLDB
, 2004
Abstract

Cited by 413 (3 self)
Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine’s results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be evaluated by an expert. Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good. In this paper we discuss possible ways to implement the seed selection and the discovery of good pages. We present results of experiments run on the World Wide Web indexed by AltaVista and evaluate the performance of our techniques. Our results show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.
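The trust-propagation idea above can be sketched as a PageRank iteration whose teleport distribution is biased toward the hand-labeled seed set, so trust originates at seeds and decays as it flows along out-links. This is a minimal illustrative sketch, not the paper's exact algorithm; the graph, decay factor, and iteration count are all assumed.

```python
# Minimal sketch of TrustRank-style trust propagation (hypothetical toy graph).
# Trust starts at hand-labeled good seed pages and flows along out-links,
# damped by a decay factor, as in a PageRank iteration biased toward seeds.

def trustrank(out_links, seeds, decay=0.85, iters=50):
    """out_links: dict page -> list of pages it links to."""
    pages = list(out_links)
    # Seed-biased teleport distribution: uniform over trusted seeds only.
    seed_dist = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    t = dict(seed_dist)
    for _ in range(iters):
        nxt = {p: (1.0 - decay) * seed_dist[p] for p in pages}
        for p, outs in out_links.items():
            if outs:
                share = decay * t[p] / len(outs)
                for q in outs:
                    nxt[q] += share
        t = nxt
    return t

# Toy web: a "good" cluster reachable from the seed, and a spam cluster
# that the seed never links to and that therefore accumulates no trust.
graph = {"seed": ["a", "b"], "a": ["b"], "b": ["a"],
         "spam": ["spam2"], "spam2": ["spam"]}
scores = trustrank(graph, seeds={"seed"})
```

Because teleportation only returns to seeds, pages unreachable from the seed set end up with zero trust, which is the filtering effect the abstract describes.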
Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search
 IEEE Transactions on Knowledge and Data Engineering
, 2003
Abstract

Cited by 237 (2 self)
Abstract—The original PageRank algorithm for improving the ranking of search-query results computes a single vector, using the link structure of the Web, to capture the relative “importance” of Web pages, independent of any particular search query. To yield more accurate search results, we propose computing a set of PageRank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. For ordinary keyword search queries, we compute the topic-sensitive PageRank scores for pages satisfying the query using the topic of the query keywords. For searches done in context (e.g., when the search query is performed by highlighting words in a Web page), we compute the topic-sensitive PageRank scores using the topic of the context in which the query appeared. By using linear combinations of these (precomputed) biased PageRank vectors to generate context-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic PageRank vector. We describe techniques for efficiently implementing a large-scale search system based on the topic-sensitive PageRank scheme. Index Terms—Web search, web graph, link analysis, PageRank, search in context, personalized search, ranking algorithm.
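The precompute-then-mix scheme described above can be sketched in a few lines: one biased PageRank vector per topic is computed offline (by putting the teleport mass on that topic's pages), and at query time the vectors are combined with query-dependent topic weights. The toy graph, topic assignments, and weights below are assumptions for illustration.

```python
# Sketch of topic-sensitive PageRank (hypothetical topics and toy graph).
# One biased PageRank vector is precomputed per topic; at query time a
# query-specific mixture of these vectors gives context-aware scores.

def biased_pagerank(out_links, teleport, c=0.85, iters=50):
    """PageRank with teleportation to the given topic distribution."""
    pages = list(out_links)
    r = {p: teleport.get(p, 0.0) for p in pages}
    for _ in range(iters):
        nxt = {p: (1 - c) * teleport.get(p, 0.0) for p in pages}
        for p, outs in out_links.items():
            if outs:
                share = c * r[p] / len(outs)
                for q in outs:
                    nxt[q] += share
        r = nxt
    return r

graph = {"sports1": ["sports2"], "sports2": ["sports1"],
         "tech1": ["tech2"], "tech2": ["tech1"]}
topics = {"sports": {"sports1": 1.0}, "tech": {"tech1": 1.0}}

# Offline: precompute one biased vector per topic.
vectors = {t: biased_pagerank(graph, tp) for t, tp in topics.items()}

# Query time: weight topics, e.g. by P(topic | query), and mix the vectors.
weights = {"sports": 0.9, "tech": 0.1}
score = {p: sum(w * vectors[t][p] for t, w in weights.items()) for p in graph}
```

The key design point from the abstract is that the expensive vectors are precomputed per topic, so query-time work is only the cheap linear combination.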
Authority-based keyword search in databases
 TODS
Abstract

Cited by 220 (13 self)
The ObjectRank system applies authority-based ranking to keyword search in databases modeled as labeled graphs. Conceptually, authority originates at the nodes (objects) containing the keywords and flows to objects according to their semantic connections. Each node is ranked according to its authority with respect to the particular ...
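The authority-flow idea can be sketched as a PageRank-like iteration on a typed graph where each edge type carries its own authority-transfer rate. The schema, edge weights, and node names below are assumptions for illustration, not ObjectRank's actual parameters.

```python
# Sketch of ObjectRank-style authority flow (hypothetical schema weights).
# Authority originates at nodes matching the keyword and flows along typed
# edges, each edge type carrying its own assumed authority-transfer rate.

edge_weights = {"cites": 0.7, "authored_by": 0.2}  # assumed transfer rates

def object_rank(edges, base_nodes, nodes, d=0.85, iters=50):
    """edges: list of (src, dst, edge_type); base_nodes: keyword matches."""
    base = {n: (1.0 / len(base_nodes) if n in base_nodes else 0.0)
            for n in nodes}
    r = dict(base)
    for _ in range(iters):
        nxt = {n: (1 - d) * base[n] for n in nodes}
        for src, dst, etype in edges:
            # Authority transferred is scaled by the edge type's weight.
            nxt[dst] += d * edge_weights[etype] * r[src]
        r = nxt
    return r

nodes = ["paper1", "paper2", "author1"]
edges = [("paper1", "paper2", "cites"), ("paper1", "author1", "authored_by")]
ranks = object_rank(edges, base_nodes={"paper1"}, nodes=nodes)
```

Objects reached over high-weight edge types (here, citations) accumulate more authority than those reached over low-weight ones, which is the "semantic connections" effect the abstract describes.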
Deeper Inside PageRank
 INTERNET MATHEMATICS
, 2004
Abstract

Cited by 208 (4 self)
This paper serves as a companion or extension to the “Inside PageRank” paper by Bianchini et al. [Bianchini et al. 03]. It is a comprehensive survey of all issues associated with PageRank, covering the basic PageRank model, available and recommended solution methods, storage issues, existence, uniqueness, and convergence properties, possible alterations to the basic model, suggested alternatives to the traditional solution methods, sensitivity and conditioning, and finally the updating problem. We introduce a few new results, provide an extensive reference list, and speculate about exciting areas of future research.
Exploiting the Block Structure of the Web for Computing PageRank
, 2003
Abstract

Cited by 158 (4 self)
The web link graph has a nested block structure: the vast majority of hyperlinks link pages on a host to other pages on the same host, and many of those that do not nevertheless link pages within the same domain. We show how to exploit this structure to speed up the computation of PageRank by a 3-stage algorithm whereby (1) the local PageRanks of pages for each host are computed independently using the link structure of that host, (2) these local PageRanks are then weighted by the "importance" of the corresponding host, and (3) the standard PageRank algorithm is then run using as its starting vector the weighted concatenation of the local PageRanks. Empirically, this algorithm speeds up the computation of PageRank by a factor of 2 in realistic scenarios. Further, we develop a variant of this algorithm that efficiently computes many different "personalized" PageRanks, and a variant that efficiently recomputes PageRank after node updates.
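The 3-stage algorithm can be sketched directly from the abstract: local PageRanks per host, weighting by host importance, then a global run from the weighted concatenation. The two-host toy graph and host weights below are assumptions; the paper derives host importance from a host-level graph rather than fixing it by hand.

```python
# Sketch of the 3-stage block algorithm (toy two-host graph, assumed names).
# (1) local PageRank within each host, (2) weight by host importance,
# (3) global PageRank started from the weighted concatenation.

def pagerank(out_links, start, c=0.85, iters=50):
    """Standard power iteration with a uniform teleport, from a given start."""
    pages = list(out_links)
    n = len(pages)
    r = dict(start)
    for _ in range(iters):
        nxt = {p: (1 - c) / n for p in pages}
        for p, outs in out_links.items():
            if outs:
                share = c * r[p] / len(outs)
                for q in outs:
                    nxt[q] += share
        r = nxt
    return r

hosts = {"h1": {"h1/a": ["h1/b"], "h1/b": ["h1/a"]},
         "h2": {"h2/a": ["h2/a"]}}
host_weight = {"h1": 0.7, "h2": 0.3}  # assumed host-level importance

# Stages 1 and 2: local PageRanks per host, scaled by host importance.
start = {}
for h, g in hosts.items():
    local = pagerank(g, {p: 1.0 / len(g) for p in g})
    for p, v in local.items():
        start[p] = host_weight[h] * v

# Stage 3: global run over the full graph, seeded with the weighted vector.
full_graph = {"h1/a": ["h1/b"], "h1/b": ["h1/a", "h2/a"], "h2/a": ["h1/a"]}
ranks = pagerank(full_graph, start)
```

The speedup comes from the starting vector: because it is already close to the true PageRank, stage 3 needs far fewer global iterations than a cold start.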
Ranking the Web Frontier
, 2004
Abstract

Cited by 114 (0 self)
The celebrated PageRank algorithm has proved to be a very effective paradigm for ranking results of web search algorithms. In this paper we refine this basic paradigm to take into account several evolving prominent features of the web, and propose several algorithmic innovations. First, we analyze features of the rapidly growing "frontier" of the web, namely the part of the web that crawlers are unable to cover for one reason or another. We analyze the effect of these pages and find it to be significant. We suggest ways to improve the quality of ranking by modeling the growing presence of "link rot" on the web as more sites and pages fall out of maintenance. Finally we suggest new methods of ranking that are motivated by the hierarchical structure of the web, are more efficient than PageRank, and may be more resistant to direct manipulation.
A survey of eigenvector methods for web information retrieval
 SIAM Rev.
, 2005
Abstract

Cited by 95 (6 self)
Abstract Web information retrieval is significantly more challenging than traditional well-controlled, small document collection information retrieval. One main difference between traditional information retrieval and Web information retrieval is the Web's hyperlink structure. This structure has been exploited by several of today's leading Web search engines, particularly Google and Teoma. In this survey paper, we focus on Web information retrieval methods that use eigenvector computations, presenting the three popular methods of HITS, PageRank, and SALSA.
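Of the three eigenvector methods the survey covers, HITS is the easiest to sketch: hub and authority scores reinforce each other, and the iteration converges to the dominant eigenvectors of AᵀA and AAᵀ. The toy graph below is assumed for illustration.

```python
# Minimal HITS sketch (toy graph). Hub and authority scores are the dominant
# eigenvectors of A^T A and A A^T, computed here by mutual reinforcement
# with L2 normalization each round.

def hits(out_links, iters=50):
    nodes = set(out_links) | {q for outs in out_links.values() for q in outs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # Authority of a page: sum of hub scores of pages pointing to it.
        auth = {n: 0.0 for n in nodes}
        for p, outs in out_links.items():
            for q in outs:
                auth[q] += hub[p]
        norm = sum(v * v for v in auth.values()) ** 0.5
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score of a page: sum of authority scores of pages it points to.
        hub = {p: sum(auth[q] for q in out_links.get(p, [])) for p in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

graph = {"h1": ["a"], "h2": ["a"], "h3": ["b"]}
hub, auth = hits(graph)
```

Unlike PageRank, HITS is query-dependent in its original formulation: the iteration runs on a subgraph built around the query's result set rather than on the whole web graph.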
Connections: using context to enhance file search
 In Proceedings of the 20th ACM Symposium on Operating Systems Principles (SOSP ’05
, 2005
Abstract

Cited by 91 (6 self)
Connections is a file system search tool that combines traditional content-based search with context information gathered from user activity. By tracing file system calls, Connections can identify temporal relationships between files and use them to expand and reorder traditional content search results. Doing so improves both recall (reducing false negatives) and precision (reducing false positives). For example, Connections improves the average recall (from 13% to 22%) and precision (from 23% to 29%) on the first ten results. When averaged across all recall levels, Connections improves precision from 17% to 28%. Connections provides these benefits with only modest increases in average query time (2 seconds), indexing time (23 seconds daily), and index size (under 1% of the user’s data set).
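The core mechanism can be sketched as follows: accesses that fall within a short time window of each other create a "temporal relationship" edge, and content-search hits are then expanded with their temporal neighbors. The file names and the window length below are hypothetical; the real system builds its graph from traced system calls with more careful weighting.

```python
# Sketch of context-enhanced search in the spirit of Connections (all file
# names and the time window are hypothetical). Files accessed within a short
# window of each other get a temporal edge; content-search results are then
# expanded with temporally related files.

from collections import defaultdict

WINDOW = 30.0  # seconds; assumed relation window

def temporal_edges(trace):
    """trace: list of (timestamp, filename) access events, time-sorted."""
    edges = defaultdict(set)
    for i, (t1, f1) in enumerate(trace):
        for t2, f2 in trace[i + 1:]:
            if t2 - t1 > WINDOW:
                break  # trace is sorted, so later events are farther away
            if f1 != f2:
                edges[f1].add(f2)
                edges[f2].add(f1)
    return edges

def expand(content_hits, edges):
    """Add temporally related files to the content-search result set."""
    out = set(content_hits)
    for f in content_hits:
        out |= edges.get(f, set())
    return out

trace = [(0.0, "paper.tex"), (5.0, "figure.py"), (500.0, "unrelated.txt")]
results = expand({"paper.tex"}, temporal_edges(trace))
```

Expansion is what improves recall: a file with no matching content (here a script edited alongside the document) can still surface because it was used in the same working session.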
The second eigenvalue of the Google matrix.
, 2003
Abstract

Cited by 90 (7 self)
Abstract. We determine analytically the modulus of the second eigenvalue for the web hyperlink matrix used by Google for computing PageRank. Specifically, we prove the following statement: "For any matrix A = [cP + (1 − c)E]^T, where P is an n × n row-stochastic matrix, E is a nonnegative n × n rank-one row-stochastic matrix, and 0 ≤ c ≤ 1, the second eigenvalue of A has modulus |λ2| ≤ c. Furthermore, if P has at least two irreducible closed subsets, the second eigenvalue is λ2 = c." This statement has implications for the convergence rate of the standard PageRank algorithm as the web scales, for the stability of PageRank to perturbations to the link structure of the web, for the detection of Google spammers, and for the design of algorithms to speed up PageRank.

Theorem 1. Let P be an n × n row-stochastic matrix. Let c be a real number such that 0 ≤ c ≤ 1. Let E be the n × n rank-one row-stochastic matrix E = ev^T, where e is the n-vector whose elements are all e_i = 1, and v is an n-vector that represents a probability distribution (i.e., a vector whose elements are nonnegative and whose L1 norm is 1). Define the matrix A = [cP + (1 − c)E]^T. Then its second eigenvalue satisfies |λ2| ≤ c.

Theorem 2. Further, if P has at least two irreducible closed subsets (which is the case for the web hyperlink matrix), then the second eigenvalue of A is given by λ2 = c.

Notation and preliminaries. P is an n × n row-stochastic matrix. E is the n × n rank-one row-stochastic matrix E = ev^T, where e is the n-vector whose elements are all e_i = 1. A = [cP + (1 − c)E]^T is the n × n column-stochastic matrix. We denote the i-th eigenvalue of A as λ_i, and the corresponding eigenvector as x_i. By convention, we choose eigenvectors x_i such that ‖x_i‖_1 = 1. Since A is column-stochastic, λ1 = 1 ≥ |λ2| ≥ ... ≥ |λn| ≥ 0.
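The theorem can be checked numerically on a small example. For a 2 × 2 stochastic matrix the spectrum is easy to read off: λ1 = 1 and λ2 = trace(A) − 1 (since the eigenvalues sum to the trace), and the trace is unchanged by the transpose. This is a sanity-check sketch with assumed matrices, not part of the paper's proof.

```python
# Numerical check of the second-eigenvalue theorem on 2x2 examples.
# For a 2x2 stochastic matrix, lambda_1 = 1 and lambda_2 = trace(A) - 1,
# so the bound |lambda_2| <= c can be verified directly.

def second_eigenvalue(P, v, c):
    """A = (cP + (1-c) e v^T)^T; return lambda_2 = trace(A) - 1.
    trace(E) = trace(e v^T) = sum_i v_i, and transposition keeps the trace."""
    n = len(P)
    trace = sum(c * P[i][i] + (1 - c) * v[i] for i in range(n))
    return trace - 1.0

c = 0.85

# P = identity has two irreducible closed subsets ({0} and {1}),
# so Theorem 2 predicts lambda_2 = c exactly.
P_reducible = [[1.0, 0.0], [0.0, 1.0]]
lam2 = second_eigenvalue(P_reducible, v=[0.5, 0.5], c=c)

# An irreducible (well-mixing) P still satisfies |lambda_2| <= c.
P_mixing = [[0.3, 0.7], [0.6, 0.4]]
lam2_mix = second_eigenvalue(P_mixing, v=[0.5, 0.5], c=c)
```

The reducible case is the interesting one for the web: because the real web graph has many irreducible closed subsets, PageRank's power iteration converges at rate exactly c, which is why c trades off convergence speed against fidelity to the link structure.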
Adaptive Methods for the Computation of PageRank
 STANFORD UNIVERSITY
, 2003
Abstract

Cited by 62 (0 self)
We observe that the convergence patterns of pages in the PageRank algorithm have a non-uniform distribution. Specifically, many pages converge to their true PageRank quickly, while relatively few pages take a much longer time to converge. Furthermore, we observe that these slow-converging pages are generally those pages with high PageRank. We use this observation to devise a simple algorithm to speed up the computation of PageRank, in which the PageRanks of pages that have converged are not recomputed at each iteration after convergence.