Results 1  10
of
42
LinkBased Characterization and Detection of Web Spam
 In AIRWeb
, 2006
"... We perform a statistical analysis of a large collection of Web pages, focusing on spam detection. We study several metrics such as degree correlations, number of neighbors, rank propagation through links, TrustRank and others to build several automatic web spam classifiers. This paper presents a stu ..."
Abstract

Cited by 56 (8 self)
 Add to MetaCart
We perform a statistical analysis of a large collection of Web pages, focusing on spam detection. We study several metrics such as degree correlations, number of neighbors, rank propagation through links, TrustRank and others to build several automatic web spam classifiers. This paper presents a study of the performance of each of these classifiers alone, as well as their combined performance. Using this approach we are able to detect 80.4% of the Web spam in our sample, with only 1.1% of false positives.
Using Rank Propagation and Probabilistic Counting for LinkBased Spam Detection
 In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD
, 2006
"... This paper describes a technique for automating the detection of Web link spam, that is, groups of pages that are linked together with the sole purpose of obtaining an undeservedly high score in search engines. The problem of Web spam is widespread and di#cult to solve, mostly due to the large size ..."
Abstract

Cited by 41 (15 self)
 Add to MetaCart
This paper describes a technique for automating the detection of Web link spam, that is, groups of pages that are linked together with the sole purpose of obtaining an undeservedly high score in search engines. The problem of Web spam is widespread and di#cult to solve, mostly due to the large size of web collections that makes many algorithms unfeasible in practice.
Link analysis for web spam detection
 ACM Transactions on the Web
, 2007
"... We propose linkbased techniques for automating the detection of Web spam, a term referring to pages which use deceptive techniques to obtain undeservedly high scores in search engines. The issue of Web spam is widespread and difficult to solve, mostly due to the large size of the Web which means th ..."
Abstract

Cited by 35 (3 self)
 Add to MetaCart
(Show Context)
We propose linkbased techniques for automating the detection of Web spam, a term referring to pages which use deceptive techniques to obtain undeservedly high scores in search engines. The issue of Web spam is widespread and difficult to solve, mostly due to the large size of the Web which means that, in practice, many algorithms are infeasible. We perform a statistical analysis of a large collection of Web pages. In particular, we compute statistics of the links in the vicinity of every Web page applying rank propagation and probabilistic counting over the entire Web graph in a scalable way. We build several automatic web spam classifiers using different techniques. This paper presents a study of the performance of each of these classifiers alone, as well as their combined performance. Based on these results we propose spam detection techniques which only consider the link structure of Web, regardless of page contents. These statistical features are used to build a classifier that is tested over a large collection of Web link spam. After tenfold crossvalidation, our best classifiers have a performance comparable to that of stateoftheart spam classifiers that use content attributes, and orthogonal to their methods.
PageRank: Functional Dependencies
"... PageRank is defined as the stationary state of a Markov chain. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor α that spreads uniformly part of the rank. The choice of α is eminently empirical, and in most cases the original suggestion α = 0.85 ..."
Abstract

Cited by 15 (5 self)
 Add to MetaCart
(Show Context)
PageRank is defined as the stationary state of a Markov chain. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor α that spreads uniformly part of the rank. The choice of α is eminently empirical, and in most cases the original suggestion α = 0.85 by Brin and Page is still used. In this paper, we give a mathematical analysis of PageRank when α changes. In particular, we show that, contrarily to popular belief, for realworld graphs values of α close to 1 do not give a more meaningful ranking. Then, we give closedform formulae for PageRank derivatives of any order, and by proving that the kth iteration of the Power Method gives exactly the value obtained by truncating the PageRank power series at degree k, we show how to obtain an approximation of the derivatives. Finally, we view PageRank as a linear operator acting on the preference vector and show a tight connection between iterated computation and derivation.
Transductive Link Spam Detection
, 2007
"... Web spam can significantly deteriorate the quality of search engines. Early web spamming techniques mainly manipulate page content. Since linkage information is widely used in web search, linkbased spamming has also developed. So far, many techniques have been proposed to detect link spam. Those ap ..."
Abstract

Cited by 13 (0 self)
 Add to MetaCart
Web spam can significantly deteriorate the quality of search engines. Early web spamming techniques mainly manipulate page content. Since linkage information is widely used in web search, linkbased spamming has also developed. So far, many techniques have been proposed to detect link spam. Those approaches are basically variants of linkbased web ranking methods. In contrast, we cast the link spam detection problem into a machine learning problem of classification on directed graphs. We develop discrete analysis on directed graphs, and construct a discrete analogue of classical regularization theory via discrete analysis. A classification algorithm for directed graphs is then derived from the discrete regularization. We have applied the approach to realworld link spam detection problems, and encouraging results have been obtained.
Survey on web spam detection: principles and algorithms
 SIGKDD Explor. Newsl
"... Search engines became a de facto place to start information acquisition on the Web. Though due to web spam phenomenon, search results are not always as good as desired. Moreover, spam evolves that makes the problem of providing high quality search even more challenging. Over the last decade research ..."
Abstract

Cited by 11 (0 self)
 Add to MetaCart
(Show Context)
Search engines became a de facto place to start information acquisition on the Web. Though due to web spam phenomenon, search results are not always as good as desired. Moreover, spam evolves that makes the problem of providing high quality search even more challenging. Over the last decade research on adversarial information retrieval has gained a lot of interest both from academia and industry. In this paper we present a systematic review of web spam detection techniques with the focus on algorithms and underlying principles. We categorize all existing algorithms into three categories based on the type of information they use: contentbased methods, linkbased methods, and methods based on nontraditional data such as user behaviour, clicks, HTTP sessions. In turn, we perform a subcategorization of linkbased category into five groups based on ideas and principles used: labels propagation, link pruning and reweighting, labels refinement, graph regularization, and featurebased. We also define the concept of web spam numerically and provide a brief survey on various spam forms. Finally, we summarize the observations and underlying principles applied for web spam detection. Keywords web spam detection, content spam, link spam, cloaking,
New metrics for reputation management in P2P networks
 In Proc. the 3rd International Workshop on Adversarial Information Retrieval on the Web
"... All intext references underlined in blue are linked to publications on ResearchGate, letting you access and read them immediately. ..."
Abstract

Cited by 8 (2 self)
 Add to MetaCart
(Show Context)
All intext references underlined in blue are linked to publications on ResearchGate, letting you access and read them immediately.
A deeper investigation of PageRank as a function of the damping factor
 In Web Information Retrieval and Linear Algebra Algorithms, number 07071 in Dagstuhl Seminar Proceedings
, 2007
"... PageRank is defined as the stationary state of a Markov chain. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor α that spreads uniformly part of the rank. The choice of α is eminently empirical, and in most cases the original suggestion α = 0.85 ..."
Abstract

Cited by 7 (1 self)
 Add to MetaCart
(Show Context)
PageRank is defined as the stationary state of a Markov chain. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor α that spreads uniformly part of the rank. The choice of α is eminently empirical, and in most cases the original suggestion α = 0.85 by Brin and Page is still used. In this paper, we give a mathematical analysis of PageRank when α changes. In particular, we show that, contrarily to popular belief, for realworld graphs values of α close to 1 do not give a more meaningful ranking. Then, we give closedform formulae for PageRank derivatives of any order, and by proving that the kth iteration of the Power Method gives exactly the PageRank value obtained using a Maclaurin polynomial of degree k, we show how to obtain an approximation of the derivatives. Finally, we view PageRank as a linear operator acting on the preference vector and show a tight connection between iterated computation and derivation. 1
Using polynomial chaos to compute the influence of multiple random surfers in the PageRank model
 Proceedings of the 5th Workshop on Algorithms and Models for the Web Graph (WAW2007), volume 4863 of Lecture Notes in Computer Science
, 2007
"... Abstract. The PageRank equation computes the importance of pages in a web graph relative to a single random surfer with a constant teleportation coefficient. To be globally relevant, the teleportation coefficient should account for the influence of all users. Therefore, we correct the PageRank form ..."
Abstract

Cited by 7 (5 self)
 Add to MetaCart
(Show Context)
Abstract. The PageRank equation computes the importance of pages in a web graph relative to a single random surfer with a constant teleportation coefficient. To be globally relevant, the teleportation coefficient should account for the influence of all users. Therefore, we correct the PageRank formulation by modeling the teleportation coefficient as a random variable distributed according to user behavior. With this correction, the PageRank values themselves become random. We present two methods to quantify the uncertainty in the random PageRank: a Monte Carlo sampling algorithm and an algorithm based the truncated polynomial chaos expansion of the random quantities. With each of these methods, we compute the expectation and standard deviation of the PageRanks. Our statistical analysis shows that the standard deviation of the PageRanks are uncorrelated with the PageRank vector. 1
Exemplar Queries: Give me an Example of What You Need
"... Search engines are continuously employing advanced techniques that aim to capture user intentions and provide results that go beyond the data that simply satisfy the query conditions. Examples include the personalized results, related searches, similarity search, popular and relaxed queries. In this ..."
Abstract

Cited by 6 (2 self)
 Add to MetaCart
(Show Context)
Search engines are continuously employing advanced techniques that aim to capture user intentions and provide results that go beyond the data that simply satisfy the query conditions. Examples include the personalized results, related searches, similarity search, popular and relaxed queries. In this work we introduce a novel query paradigm that considers a user query as an example of the data in which the user is interested. We call these queries exemplar queries and claim that they can play an important role in dealing with the information deluge. We provide a formal specification of the semantics of such queries and show that they are fundamentally different from notions like queries by example, approximate and related queries. We provide an implementation of these semantics for graphbased data and present an exact solution with a number of optimizations that improve performance without compromising the quality of the answers. We also provide an approximate solution that prunes the search space and achieves considerably better timeperformance with minimal or no impact on effectiveness. We experimentally evaluate the effectiveness and efficiency of these solutions with synthetic and real datasets, and illustrate the usefulness of exemplar queries in practice. 1.