Results 1 - 4 of 4
A Spamicity Approach to Web Spam Detection
Abstract - Cited by 3 (1 self)
Web spam, which refers to any deliberate action that gives selected web pages unjustifiably favorable relevance or importance, is one of the major obstacles to high-quality information retrieval on the web. Most existing web spam detection methods are supervised and require a large, representative training set of web pages. Moreover, they often assume some global information, such as a large web graph and snapshots of a large collection of web pages. However, in many situations such assumptions may not hold. In this paper, we study the problem of unsupervised web spam detection. We introduce the notion of spamicity to measure how likely a page is to be spam. Spamicity is a more flexible and user-controllable measure than traditional supervised classification methods. We propose efficient online link spam and term spam detection methods using spamicity. Our methods do not need training and are cost-effective. A real data set is used to evaluate the effectiveness and efficiency of our methods.
Spam-resilient web rankings via influence throttling
- In IPDPS, 2007
Abstract - Cited by 3 (1 self)
Web search is one of the most critical applications for managing the massive amount of distributed Web content. Due to the overwhelming reliance on Web search, there is a rise in efforts to manipulate (or spam) Web search engines. In this paper, we develop a spam-resilient ranking model that promotes a source-based view of the Web. One of the most salient features of our spam-resilient ranking algorithm is the concept of influence throttling. We show how to utilize influence throttling to counter Web spam that aims at manipulating link-based ranking systems, especially PageRank-like systems. Through formal analysis and experimental evaluation, we show the effectiveness and robustness of our spam-resilient ranking model in comparison with existing Web algorithms such as PageRank.
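The abstract names influence throttling without defining it. The following is a minimal sketch, assuming that throttling means guaranteeing each source a minimum self-edge weight in a row-stochastic source transition matrix and rescaling its outgoing weights so the row still sums to 1. The names `throttle`, `rank`, and `kappa`, the damping factor 0.85, and the toy matrix are illustrative assumptions, not the paper's implementation.

```python
def throttle(T, kappa):
    """Return a copy of row-stochastic matrix T in which row i keeps at
    least kappa[i] of its weight on the self-edge T[i][i] (assumed rule)."""
    n = len(T)
    out = []
    for i, row in enumerate(T):
        if row[i] >= kappa[i]:
            out.append(list(row))          # already throttled enough
            continue
        off_diagonal = sum(w for j, w in enumerate(row) if j != i)
        if off_diagonal == 0:
            new_row = [0.0] * n            # degenerate row: keep all weight
            new_row[i] = 1.0
        else:
            scale = (1.0 - kappa[i]) / off_diagonal
            new_row = [w * scale for w in row]
            new_row[i] = kappa[i]
        out.append(new_row)
    return out


def rank(T, alpha=0.85, iterations=100):
    """PageRank-style power iteration over the source matrix T."""
    n = len(T)
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = [alpha * sum(r[i] * T[i][j] for i in range(n)) + (1.0 - alpha) / n
             for j in range(n)]
    return r


# Toy example: source 0 sends all of its weight to source 1.
T = [[0.0, 1.0],
     [0.5, 0.5]]
throttled = throttle(T, kappa=[0.6, 0.2])
before = rank(T)
after = rank(throttled)
```

In this toy example, forcing source 0 to retain 60% of its weight reduces the score it can pass on to source 1, which is the intuition behind limiting the influence that any one source can exert on another.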
Link-based ranking of the web with source-centric collaboration
- In CollaborateCom, 2006
Abstract - Cited by 2 (1 self)
Web ranking is one of the most successful and widely used collaborative computing applications, in which Web pages collaborate in the form of varying degrees of relationships to assess their relative quality. Though many observe that links display strong source-centric locality, for example, in terms of administrative domains and hosts, most Web ranking analysis to date has focused on the flat page-level Web linkage structure. In this paper we develop a framework for link-based collaborative ranking of the Web by utilizing the strong Web link structure. We argue that this source-centric link analysis is promising since it captures the natural link-locality structure of the Web, can provide more appealing and efficient Web applications, and reflects many natural types of structured human collaborations. Concretely, we propose a generic framework for source-centric collaborative ranking of the Web. This paper makes two unique contributions. First, we provide a rigorous study of the set of critical parameters that can impact source-centric link analysis, such as source size, the presence of self-links, and different source-citation link weighting schemes (e.g., uniform, link count, source consensus). Second, we conduct a large-scale experimental study to understand how different parameter settings may impact the time complexity, stability, and spam-resilience of Web ranking. We find that careful tuning of these parameters is vital to ensure success over each objective and to balance the performance across all objectives.
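The three source-citation weighting schemes named in the abstract can be illustrated with a small sketch. It assumes that "uniform" gives every source-to-source edge equal weight, "link count" weights an edge by the number of page-level links behind it, and "source consensus" weights it by the number of distinct pages in the originating source that link to the target source. These readings, the function name `source_edge_weights`, and the choice to drop self-links are assumptions made for illustration only.

```python
from collections import defaultdict


def source_edge_weights(page_links, page_source, scheme="uniform"):
    """Aggregate page-level links into normalized source-level edge weights.

    page_links:  iterable of (from_page, to_page) pairs
    page_source: dict mapping each page to its source (e.g. a host name)
    scheme:      "uniform", "link_count", or "source_consensus"
    """
    link_count = defaultdict(int)     # (s, t) -> number of page links
    citing_pages = defaultdict(set)   # (s, t) -> distinct citing pages in s
    for from_page, to_page in page_links:
        s, t = page_source[from_page], page_source[to_page]
        if s == t:
            continue                  # self-links are ignored in this sketch
        link_count[(s, t)] += 1
        citing_pages[(s, t)].add(from_page)

    if scheme == "uniform":
        raw = {edge: 1.0 for edge in link_count}
    elif scheme == "link_count":
        raw = {edge: float(c) for edge, c in link_count.items()}
    elif scheme == "source_consensus":
        raw = {edge: float(len(p)) for edge, p in citing_pages.items()}
    else:
        raise ValueError("unknown scheme: " + scheme)

    # Normalize so the outgoing weights of each source sum to 1.
    totals = defaultdict(float)
    for (s, _), w in raw.items():
        totals[s] += w
    return {(s, t): w / totals[s] for (s, t), w in raw.items()}


# Toy data: source A has pages a1 and a2; B has b1 and b2; C has c1.
page_source = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
page_links = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "c1")]

uniform = source_edge_weights(page_links, page_source, "uniform")
by_count = source_edge_weights(page_links, page_source, "link_count")
by_consensus = source_edge_weights(page_links, page_source, "source_consensus")
```

On this toy data the edge from A to B receives weight 1/2 under the uniform scheme, 3/4 under link count, and 2/3 under source consensus, which shows how the choice of scheme changes how much a single page with many links can contribute.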
Predicting Web Spam with HTTP Session Information
Abstract - Cited by 2 (0 self)
Web spam is a widely recognized threat to the quality and security of the Web. Web spam pages pollute search engine indexes, burden Web crawlers and Web mining services, and expose users to dangerous Web-borne malware. To defend against Web spam, most previous research analyzes the contents of Web pages and the link structure of the Web graph. Unfortunately, these heavyweight approaches require full downloads of both legitimate and spam pages to be effective, making real-time deployment of these techniques infeasible for Web browsers, high-performance Web crawlers, and real-time Web applications. In this paper, we present a lightweight, predictive approach to Web spam classification that relies exclusively on HTTP session information (i.e., hosting IP addresses and HTTP session headers).
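To make the idea of classifying from session information alone more concrete, here is a rough sketch of how a hosting IP address and a set of response headers might be turned into features for a classifier. The abstract does not list the features actually used, so the function name `session_features`, the IPv4-only prefix features, and the header presence and value features are all hypothetical choices.

```python
def session_features(ip, headers):
    """Build a feature dict from an IPv4 address string and a header dict.

    No page body is needed: only data available from the HTTP session.
    """
    octets = ip.split(".")
    features = {
        # Coarse network locality (assumed useful; IPv4 only in this sketch).
        "ip_prefix_16": ".".join(octets[:2]),
        "ip_prefix_24": ".".join(octets[:3]),
    }
    for name, value in headers.items():
        key = name.strip().lower()       # header names are case-insensitive
        features["has_" + key] = 1       # presence of the header
        features["val_" + key] = value.strip()  # its raw value
    return features


example = session_features(
    "192.0.2.10",
    {"Server": "Apache", "Content-Type": "text/html"},
)
```

A feature dict like this could then be passed to any standard classifier; the point of the sketch is only that every feature is available before the page content is downloaded.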

