Results 1 - 10 of 62
Detecting Spam Web Pages through Content Analysis
- In Proceedings of the 15th International Conference on World Wide Web (WWW), 2006
"... In this paper, we continue our investigations of “web spam”: the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatica ..."
Abstract
-
Cited by 207 (4 self)
In this paper, we continue our investigations of “web spam”: the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).
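To make the "heuristics aggregated with a classifier" idea concrete, here is a minimal Python sketch: a few simple content statistics are computed per page and fed to a decision tree. The features, toy pages, and labels are hypothetical stand-ins, not the paper's actual heuristics or judged collection.

```python
# Illustrative only: simple content statistics per page, aggregated by a decision tree.
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def content_features(text: str, title: str) -> list:
    words = text.split()
    n = max(len(words), 1)
    counts = Counter(words)
    top_k = sum(c for _, c in counts.most_common(200))
    return [
        len(words),                      # number of words on the page
        len(title.split()),              # number of words in the title
        sum(len(w) for w in words) / n,  # average word length
        top_k / n,                       # mass concentrated in the 200 most frequent words
        len(counts) / n,                 # distinct-word ratio (low = highly repetitive)
    ]

# Toy labeled pages (1 = spam, 0 = non-spam); a real study would use thousands of judged pages.
pages = [
    ("cheap pills cheap pills free shipping free shipping buy now " * 20, "cheap pills free", 1),
    ("we study algorithms for detecting near-duplicate documents on the web", "duplicate detection", 0),
]
X = [content_features(text, title) for text, title, _ in pages]
y = [label for _, _, label in pages]

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict([content_features("free free buy now cheap cheap " * 30, "unbelievable offers")]))
```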
Opinion spam and analysis
- In Proceedings of the International Conference on Web Search and Web Data Mining (WSDM), 2008
"... Evaluative texts on the Web have become a valuable source of opinions on products, services, events, individuals, etc. Recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. However, existing research has been focused on classification and summarizat ..."
Abstract
-
Cited by 160 (19 self)
Evaluative texts on the Web have become a valuable source of opinions on products, services, events, individuals, etc. Recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. However, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. An important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. In this paper, we study this issue in the context of product reviews, which are opinion rich and are widely used by consumers and product manufacturers. In the past two years, several startup companies also appeared which aggregate opinions from product reviews. It is thus high time to study spam in reviews. To the best of our knowledge, there is still no published study on this topic, although Web spam and email spam have been investigated extensively. We will see that opinion spam is quite different from Web spam and email spam, and thus requires different detection techniques. Based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that opinion spam in reviews is widespread. This paper analyzes such spam activities and presents some novel techniques to detect them.
Identifying link farm spam pages
- In Proceedings of the 14th International WWW Conference, 2005
"... With the increasing importance of search in guiding today's web trac, more and more eort has been spent to cre-ate search engine spam. Since link analysis is one of the most important factors in current commercial search en-gines ' ranking systems, new kinds of spam aiming at links have ap ..."
Abstract
-
Cited by 105 (11 self)
With the increasing importance of search in guiding today's web traffic, more and more effort has been spent to create search engine spam. Since link analysis is one of the most important factors in current commercial search engines' ranking systems, new kinds of spam aiming at links have appeared. Building link farms is one technique that can deteriorate link-based ranking algorithms. In this paper, we present algorithms for detecting these link farms automatically by first generating a seed set based on the common link set between incoming and outgoing links of Web pages and then expanding it. Links between identified pages are re-weighted, providing a modified web graph to use in ranking page importance. Experimental results show that we can identify most link farm spam pages and the final ranking results are improved for almost all tested queries.
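A minimal sketch of the seed-generation step described above, on a toy link graph: pages whose in-link and out-link sets overlap heavily become seed candidates, and the seed is then expanded through pages strongly connected to it. The graph, thresholds, and expansion rule are illustrative, not the paper's exact algorithm.

```python
# Toy illustration of seed generation from the overlap of in-links and out-links.
from collections import defaultdict

out_links = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a"},          # a normal page that merely links into the farm
}

in_links = defaultdict(set)
for src, dsts in out_links.items():
    for dst in dsts:
        in_links[dst].add(src)

THRESHOLD = 3  # hypothetical cutoff on the size of the common link set

seed = {p for p in out_links if len(out_links[p] & in_links[p]) >= THRESHOLD}

# Expansion step (simplified): add pages that both link to and are linked from many seed pages.
expanded = set(seed)
for p in out_links:
    if p not in expanded and len(out_links[p] & seed) >= 2 and len(in_links[p] & seed) >= 2:
        expanded.add(p)

print("candidate link-farm pages:", sorted(expanded))
```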
Blocking Blog Spam with Language Model Disagreement
- In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005
"... We present an approach for detecting link spam common in blog comments by comparing the language models used in the blog post, the comment, and pages linked by the comments. In contrast to other link spam filtering approaches, our method requires no training, no hard-coded rule sets, and no knowledg ..."
Abstract
-
Cited by 102 (1 self)
We present an approach for detecting link spam common in blog comments by comparing the language models used in the blog post, the comment, and pages linked by the comments. In contrast to other link spam filtering approaches, our method requires no training, no hard-coded rule sets, and no knowledge of complete-web connectivity. Preliminary experiments with identification of typical blog spam show promising results.
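A rough illustration of the language-model-disagreement idea, assuming smoothed unigram models and a made-up divergence threshold; the paper's actual models and scoring may differ.

```python
# Compare a smoothed unigram model of the blog post with one of the page a comment
# links to; a large KL divergence suggests an off-topic (likely spam) link.
import math
from collections import Counter

def unigram_model(text: str, vocab: set, alpha: float = 0.5) -> dict:
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p: dict, q: dict) -> float:
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

post = "our review of the new camera covers lenses sensors and image quality"
linked_page = "cheap pills online pharmacy discount pills no prescription needed"

vocab = set(post.lower().split()) | set(linked_page.lower().split())
divergence = kl_divergence(unigram_model(post, vocab), unigram_model(linked_page, vocab))

SPAM_THRESHOLD = 1.0  # hypothetical; would be tuned on data
verdict = "likely spam link" if divergence > SPAM_THRESHOLD else "looks topical"
print(f"KL divergence = {divergence:.2f} -> {verdict}")
```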
SpamRank -- Fully Automatic Link Spam Detection
- In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005
"... Spammers intend to increase the PageRank of certain spam pages by creating a large number of links pointing to them. We propose a novel method based on the concept of personalized PageRank that detects pages with an undeserved high PageRank value without the need of any kind of white or blacklists ..."
Abstract
-
Cited by 96 (5 self)
Spammers intend to increase the PageRank of certain spam pages by creating a large number of links pointing to them. We propose a novel method based on the concept of personalized PageRank that detects pages with an undeserved high PageRank value without the need of any kind of whitelists or blacklists or other means of human intervention. We assume that spammed pages have a biased distribution of pages that contribute to the undeserved high PageRank value. We define SpamRank by penalizing pages that originate a suspicious PageRank share and personalizing PageRank on the penalties. Our method is tested on a 31 million page crawl of the .de domain with a manually classified 1000-page stratified random sample with bias towards large PageRank values.
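A heavily simplified, illustrative take on the SpamRank intuition, using networkx on a toy graph: low-PageRank pages whose links all point at one beneficiary are penalized, and PageRank personalized on those penalties surfaces the supported page. The penalty rule and thresholds are invented here and are far cruder than the paper's supporter-distribution analysis.

```python
import networkx as nx

G = nx.DiGraph([("home", "about"), ("about", "home")])              # a small legitimate site
G.add_edges_from((f"farm{i}", "spam-target") for i in range(50))    # a link farm

pr = nx.pagerank(G, alpha=0.85)

# Toy penalty rule: low-PageRank pages whose only role is a single outgoing link.
low_cutoff = 1.2 / len(G)
penalty = {n: 1.0 if pr[n] < low_cutoff and G.out_degree(n) == 1 else 0.0 for n in G}

# "SpamRank": PageRank personalized on the penalties; pages propped up by penalized
# supporters come out on top.
spamrank = nx.pagerank(G, alpha=0.85, personalization=penalty)
worst = max(spamrank, key=spamrank.get)
print(f"most suspicious page: {worst} (spamrank {spamrank[worst]:.3f})")
```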
Thwarting the nigritude ultramarine: learning to identify link spam
- In Proceedings of the 16th European Conference on Machine Learning (ECML), 2005
"... Abstract. The page rank of a commercial web site has an enormous economic impact because it directly influences the number of potential customers that find the site as a highly ranked search engine result. Link spamming – inflating the page rank of a target page by artificially cre-ating many referr ..."
Abstract
-
Cited by 52 (0 self)
The page rank of a commercial web site has an enormous economic impact because it directly influences the number of potential customers that find the site as a highly ranked search engine result. Link spamming – inflating the page rank of a target page by artificially creating many referring pages – has therefore become a common practice. In order to maintain the quality of their search results, search engine providers try to oppose efforts that decorrelate page rank and relevance and maintain blacklists of spamming pages, while spammers, at the same time, try to camouflage their spam pages. We formulate the problem of identifying link spam and discuss a methodology for generating training data. Experiments reveal the effectiveness of classes of intrinsic and relational attributes and shed light on the robustness of classifiers against obfuscation of attributes by an adversarial spammer. We identify open research problems related to web spam.
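To illustrate the split between intrinsic and relational attributes, a small hypothetical example: intrinsic features come from the page itself, relational ones from its link neighbourhood, and both feed a standard classifier. The graph, texts, features, and labels below are toy stand-ins, not the paper's attribute set.

```python
import networkx as nx
from sklearn.linear_model import LogisticRegression

G = nx.DiGraph([("farm1", "shop"), ("farm2", "shop"), ("farm3", "shop"),
                ("uni", "paper"), ("news", "paper"), ("paper", "uni")])

page_text = {"shop": "buy buy buy cheap cheap", "paper": "we prove a theorem about graphs",
             "farm1": "buy", "farm2": "buy", "farm3": "buy",
             "uni": "department of mathematics", "news": "daily report on local events"}

def features(page: str) -> list:
    words = page_text[page].split()
    intrinsic = [len(words), len(set(words)) / max(len(words), 1)]    # length, vocabulary ratio
    preds = list(G.predecessors(page))
    relational = [
        len(preds),                                                   # in-degree
        sum(G.out_degree(p) for p in preds) / max(len(preds), 1),     # avg out-degree of linkers
    ]
    return intrinsic + relational

train = [("shop", 1), ("paper", 0), ("uni", 0)]                       # 1 = link spam target
clf = LogisticRegression().fit([features(p) for p, _ in train], [y for _, y in train])
print(clf.predict([features("news")]))
```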
Detecting phrase-level duplication on the world wide web
- In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005
"... Two years ago, we conducted a study on the evolution of web pages over time. In the course of that study, we discovered a large number of machine-generated “spam ” web pages emanating from a handful of web servers in Germany. These spam web pages were dynamically assembled by stitching together gram ..."
Abstract
-
Cited by 52 (1 self)
Two years ago, we conducted a study on the evolution of web pages over time. In the course of that study, we discovered a large number of machine-generated “spam” web pages emanating from a handful of web servers in Germany. These spam web pages were dynamically assembled by stitching together grammatically well-formed German sentences drawn from a large collection of sentences. This discovery motivated us to develop techniques for finding other instances of such “slice and dice” generation of web pages, where pages are automatically generated by stitching together phrases drawn from a limited corpus. We applied these techniques to two data sets, a set of 151 million web pages collected in December 2002 and a set of 96 million web pages collected in June 2004. We found a number of other instances of large-scale phrase-level replication within the two data sets. This paper describes the algorithms we used to discover this type of replication, and highlights the results of our data mining.
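One simple way to look for such "slice and dice" pages, sketched below under the assumption of word-level shingles: pages whose phrase shingles mostly also occur on other pages get flagged. The shingle length and reporting are arbitrary; the paper's algorithms operate at web scale with hashing and sampling.

```python
# Flag pages that share an unusually large fraction of phrase shingles with other pages.
from collections import defaultdict

def shingles(text: str, k: int = 5) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

pages = {
    "p1": "the quick brown fox jumps over the lazy dog near the river bank",
    "p2": "a stitched page where the quick brown fox jumps over the lazy dog appears verbatim",
    "p3": "an unrelated page about compiler optimisation and register allocation",
}

# Inverted index: which pages contain each shingle.
index = defaultdict(set)
for pid, text in pages.items():
    for s in shingles(text):
        index[s].add(pid)

# Fraction of each page's shingles that also occur on some other page.
for pid, text in pages.items():
    sh = shingles(text)
    shared = sum(1 for s in sh if len(index[s]) > 1)
    print(pid, f"{shared / max(len(sh), 1):.2f} of shingles duplicated elsewhere")
```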
Web spam detection with Anti-Trust Rank
- In Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2006
"... Spam pages on the web use various techniques to artificially achieve high rankings in search engine results. Human ex-perts can do a good job of identifying spam pages and pages whose information is of dubious quality, but it is practically infeasible to use human effort for a large number of pages. ..."
Abstract
-
Cited by 42 (0 self)
Spam pages on the web use various techniques to artificially achieve high rankings in search engine results. Human experts can do a good job of identifying spam pages and pages whose information is of dubious quality, but it is practically infeasible to use human effort for a large number of pages. Similar to the approach in [1], we propose a method of selecting a seed set of pages to be evaluated by a human. We then use the link structure of the web and the manually labeled seed set to detect other spam pages. Our experiments on the WebGraph dataset [3] show that our approach is very effective at detecting spam pages from a small seed set and achieves higher precision of spam page detection than the TrustRank algorithm, in addition to detecting spam pages with higher PageRank values on average.
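A toy sketch of the propagate-distrust-backwards intuition, sometimes described as PageRank on the reversed graph personalized on known spam pages; the graph and seed here are invented and the paper's exact formulation may differ.

```python
# Distrust starts at human-labelled spam pages and flows backwards along hyperlinks.
import networkx as nx

G = nx.DiGraph([("good1", "good2"), ("good2", "good3"),
                ("promoter", "spam1"), ("promoter", "spam2"),
                ("spam1", "spam2"), ("spam2", "spam1"),
                ("good3", "spam1")])              # one stray out-link from a good page

seed_spam = {"spam1": 1.0, "spam2": 1.0}          # pages a human judged as spam

anti_trust = nx.pagerank(G.reverse(), alpha=0.85, personalization=seed_spam)
for page, score in sorted(anti_trust.items(), key=lambda kv: -kv[1]):
    print(f"{page:10s} anti-trust = {score:.3f}")
```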
Detecting Semantic Cloaking on the Web
- In Proceedings of the 15th International World Wide Web Conference, 2006
"... By supplying different versions of a web page to search engines and to browsers, a content provider attempts to cloak the real content from the view of the search engine. Semantic cloaking refers to differences in meaning between pages which have the effect of deceiving search engine ranking algorit ..."
Abstract
-
Cited by 30 (1 self)
By supplying different versions of a web page to search engines and to browsers, a content provider attempts to cloak the real content from the view of the search engine. Semantic cloaking refers to differences in meaning between pages which have the effect of deceiving search engine ranking algorithms. In this paper, we propose an automated two-step method to detect semantic cloaking pages based on different copies of the same page downloaded by a web crawler and a web browser. The first step is a filtering step, which generates a candidate list of semantic cloaking pages. In the second step, a classifier is used to detect semantic cloaking pages from the candidates generated by the filtering step. Experiments on manually labeled data sets show that we can generate a classifier with a precision of 93% and a recall of 85%. We apply our approach to links from the dmoz Open Directory Project and estimate that more than 50,000 of these pages employ semantic cloaking.
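A rough sketch of the two-step structure under simple assumptions: the filtering step keeps only pages whose crawler and browser copies differ by more than a few terms, and the second step computes difference features that a trained classifier would consume. The example pages, features, and cutoff are hypothetical.

```python
# Compare the copy of a page served to a crawler with the copy served to a browser.
def terms(text: str) -> set:
    return set(text.lower().split())

crawler_copy = "discount flights hotels casino poker pills discount flights cheap tickets"
browser_copy = "welcome to our travel portal book flights and hotels"

only_crawler = terms(crawler_copy) - terms(browser_copy)
only_browser = terms(browser_copy) - terms(crawler_copy)

# Step 1: cheap filter -- ignore pages whose two copies are nearly identical.
FILTER_CUTOFF = 3  # hypothetical minimum number of differing terms
if len(only_crawler) + len(only_browser) >= FILTER_CUTOFF:
    # Step 2: difference features that would feed a trained classifier.
    features = {
        "terms_only_in_crawler_copy": len(only_crawler),
        "terms_only_in_browser_copy": len(only_browser),
        "crawler_only_fraction": len(only_crawler) / max(len(terms(crawler_copy)), 1),
    }
    print("candidate for semantic cloaking:", features)
```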