Results 1 - 10 of 28
b-bit minwise hashing - In WWW, 671–680, 2010
"... This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance) with applications in information retrieval, data management, computational advertising, etc. By only storin ..."
Abstract - Cited by 26 (7 self)
This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance) with applications in information retrieval, data management, computational advertising, etc. By only storing b bits of each hashed value (e.g., b = 1 or 2), we gain substantial advantages in terms of storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any b. We demonstrate that, even in the least favorable scenario, using b = 1 may reduce the storage space at least by a factor of 21.3 (or 10.7) compared to b = 64 (or b = 32), if one is interested in resemblance ≥ 0.5.
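To make the mechanism concrete, here is a minimal Python sketch of b-bit minwise hashing: each random hash function contributes only the lowest b bits of its minimum value, and the resemblance is recovered from the match rate. The hash construction and the sparse-data simplification of the estimator (match probability ≈ 2^-b + (1 - 2^-b)·R) are illustrative assumptions, not the paper's exact formulas.

import random

def bbit_minhash(items, num_perm=128, b=1, seed=0):
    # Sketch: for each random hash function, keep only the lowest b bits
    # of the minimum hashed value over the set.
    prime = (1 << 61) - 1
    rng = random.Random(seed)
    mask = (1 << b) - 1
    sig = []
    for _ in range(num_perm):
        a, c = rng.randrange(1, prime), rng.randrange(prime)
        min_hash = min((a * hash(x) + c) % prime for x in items)
        sig.append(min_hash & mask)
    return sig

def estimate_resemblance(sig1, sig2, b=1):
    # Sparse-data approximation: P(bits match) ~ 2**-b + (1 - 2**-b) * R,
    # hence R_hat = (P_hat - 2**-b) / (1 - 2**-b).
    p_hat = sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
    c = 2.0 ** -b
    return (p_hat - c) / (1 - c)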
Hashing algorithms for large-scale learning - In NIPS, 2011
"... Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise has ..."
Abstract - Cited by 24 (9 self)
Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare b-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.
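One common way to plug b-bit signatures into a linear learner, consistent with the abstract's description (the exact encoding used in the paper is assumed here), is to expand each b-bit value into a one-hot block so that inner products count matching hash values; the resulting sparse binary matrix can then be handed to any linear SVM or logistic regression implementation.

import numpy as np

def expand_signature(sig, b=1):
    # Expand a length-k b-bit signature into a k * 2**b binary vector:
    # each b-bit value becomes a one-hot block, so the dot product of two
    # expanded signatures equals the number of matching hash values.
    k, width = len(sig), 1 << b
    features = np.zeros(k * width, dtype=np.float32)
    for i, v in enumerate(sig):
        features[i * width + v] = 1.0
    return features

# Usage sketch: X = np.vstack([expand_signature(s, b) for s in signatures])
# can be fed to a linear model such as
# sklearn.linear_model.LogisticRegression().fit(X, y).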
b-bit minwise hashing for estimating three-way similarities - In NIPS, 2010
"... Computing 1 two-way and multi-way set similarities is a fundamental problem. This study focuses on estimating 3-way resemblance (Jaccard similarity) using b-bit minwise hashing. While traditional minwise hashing methods store each hashed value using 64 bits, b-bit minwise hashing only stores the low ..."
Abstract - Cited by 17 (13 self)
Computing two-way and multi-way set similarities is a fundamental problem. This study focuses on estimating 3-way resemblance (Jaccard similarity) using b-bit minwise hashing. While traditional minwise hashing methods store each hashed value using 64 bits, b-bit minwise hashing only stores the lowest b bits (where b ≥ 2 for 3-way). The extension to 3-way similarity from the prior work on 2-way similarity is technically non-trivial. We develop the precise estimator, which is accurate but rather complicated, and we recommend a much simplified estimator suitable for sparse data. Our analysis shows that b-bit minwise hashing can normally achieve a 10- to 25-fold improvement in the storage space required for a given estimator accuracy of the 3-way resemblance.
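For intuition, the sketch below shows the full-precision (un-truncated) case, where the fraction of permutations on which all three minima agree directly estimates the 3-way resemblance |S1 ∩ S2 ∩ S3| / |S1 ∪ S2 ∪ S3|; the paper's contribution is the more involved correction needed once only the lowest b bits are stored, which is not reproduced here.

def estimate_threeway_resemblance(sig1, sig2, sig3):
    # Full-precision minwise hashing: the empirical rate at which all three
    # minima agree is an unbiased estimate of the 3-way resemblance.
    # With b-bit signatures this raw rate is biased and needs the paper's
    # correction terms (omitted in this sketch).
    matches = sum(a == b == c for a, b, c in zip(sig1, sig2, sig3))
    return matches / len(sig1)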
SBotMiner: Large scale search bot detection, 2010
"... In this paper, we study search bot traffic from search engine query logs at a large scale. Although bots that generate search traffic aggressively can be easily detected, a large number of distributed, low rate search bots are difficult to identify and are often associated with malicious attacks. We ..."
Abstract - Cited by 17 (4 self)
In this paper, we study search bot traffic from search engine query logs at a large scale. Although bots that generate search traffic aggressively can be easily detected, a large number of distributed, low-rate search bots are difficult to identify and are often associated with malicious attacks. We present SBotMiner, a system for automatically identifying stealthy, low-rate search bot traffic from query logs. Instead of detecting individual bots, our approach captures groups of distributed, coordinated search bots. Using sampled data from two different months, SBotMiner identifies over 123 million bot-related pageviews, accounting for 3.8% of total traffic. Our in-depth analysis shows that a large fraction of the identified bot traffic may be associated with various malicious activities such as phishing attacks or vulnerability exploits. This finding suggests that detecting search bot traffic holds great promise for detecting and stopping attacks early on.
Plagiarism Detection Using Stopword n-Grams - JASIST
"... In this paper, a novel method for detecting plagiarized passages in document collections is presented. In contrast to previous work in this field that uses content terms to represent documents, the proposed method is based on a small list of stopwords (i.e., very frequent words). We show that stopwo ..."
Abstract - Cited by 12 (2 self)
In this paper, a novel method for detecting plagiarized passages in document collections is presented. In contrast to previous work in this field that uses content terms to represent documents, the proposed method is based on a small list of stopwords (i.e., very frequent words). We show that stopword n-grams reveal important information for plagiarism detection since they are able to capture syntactic similarities between suspicious and original documents and they can be used to detect the exact plagiarized passage boundaries. Experimental results on a publicly available corpus demonstrate that the performance of the proposed approach is competitive when compared with the best reported results. More importantly, it achieves significantly better results when dealing with difficult plagiarism cases where the plagiarized passages are highly modified and most of the words or phrases have been replaced with synonyms.
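A minimal sketch of the representation the abstract describes: documents are reduced to n-grams over a small stopword list, and stopword n-grams shared by a suspicious and a source document signal potential reuse. The stopword list and n-gram length below are illustrative choices, not the paper's exact configuration.

# Illustrative stopword list; the paper uses its own list of very frequent words.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "was", "it",
             "for", "with", "he", "be", "on", "i", "that", "by", "at"}

def stopword_ngrams(text, n=8):
    # Keep only stopwords, in their original order, and slide a window of n.
    seq = [w for w in text.lower().split() if w in STOPWORDS]
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def shared_ngrams(suspicious, source, n=8):
    # Stopword n-grams common to both documents: candidates for plagiarized spans.
    return stopword_ngrams(suspicious, n) & stopword_ngrams(source, n)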
Efficient partial-duplicate detection based on sequence matching - In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '10, 2010
"... With the ever-increasing growth of the Internet, numer-ous copies of documents become serious problem for search engine, opinion mining and many other web applications. Since partial-duplicates only contain a small piece of text taken from other sources and most existing near-duplicate detection app ..."
Abstract - Cited by 8 (0 self)
With the ever-increasing growth of the Internet, numerous copies of documents have become a serious problem for search engines, opinion mining, and many other web applications. Since partial-duplicates contain only a small piece of text taken from other sources and most existing near-duplicate detection approaches work at the document level, partial-duplicates cannot be dealt with well. In this paper, we propose a novel algorithm for the partial-duplicate detection task. Besides computing the similarities between documents, the proposed algorithm can simultaneously locate the duplicated parts. The main idea is to divide the partial-duplicate detection task into two subtasks: sentence-level near-duplicate detection and sequence matching. For evaluation, we compare the proposed method with other approaches on both English and Chinese web collections. Experimental results suggest that our proposed method detects partial-duplicates on large web collections both effectively and efficiently.
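The two-subtask decomposition can be sketched as follows; the concrete fingerprinting and matching choices (MD5 over normalized sentences, difflib's SequenceMatcher) are stand-ins for the paper's own components, not its actual implementation.

import hashlib
from difflib import SequenceMatcher

def sentence_fingerprints(sentences):
    # Subtask 1 (sketch): map each sentence to a fingerprint so that
    # near-identical sentences collide. MD5 of normalized text is only an
    # illustrative stand-in for a real near-duplicate sentence detector.
    return [hashlib.md5(" ".join(s.lower().split()).encode()).hexdigest()
            for s in sentences]

def duplicated_spans(doc_a, doc_b, min_len=2):
    # Subtask 2 (sketch): sequence matching over the fingerprint streams
    # locates runs of matching sentences, i.e., the duplicated parts.
    fa, fb = sentence_fingerprints(doc_a), sentence_fingerprints(doc_b)
    matcher = SequenceMatcher(None, fa, fb, autojunk=False)
    return [(m.a, m.b, m.size) for m in matcher.get_matching_blocks()
            if m.size >= min_len]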
Candidate Document Retrieval for Web-Scale Text Reuse Detection - In 18th International Symposium on String Processing and Information Retrieval (SPIRE), 2011
"... Abstract Given a document d, the task of text reuse detection is to find those passages in d which in identical or paraphrased form also appear in other documents. To solve this problem at web-scale, keywords representing d’s topics have to be combined to web queries. The retrieved web documents can ..."
Abstract - Cited by 7 (5 self)
Given a document d, the task of text reuse detection is to find those passages in d which, in identical or paraphrased form, also appear in other documents. To solve this problem at web scale, keywords representing d's topics have to be combined into web queries. The retrieved web documents can then be delivered to a text reuse detection system for an in-depth analysis. We focus on the query formulation problem as the crucial first step in the detection process and present a new query formulation strategy that achieves convincing results: compared to a maximal termset query formulation strategy [10, 14], which is the most sensible non-heuristic baseline, we save on average 70% of the queries in realistic experiments. With respect to the candidate documents' quality, our heuristic retrieves documents that are, on average, more similar to the given document than the results of previously published query formulation strategies [4, 8].
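To show where a query formulation strategy sits in this pipeline, here is a generic keyword-query baseline (rank a document's terms by tf-idf and emit fixed-length queries). This is explicitly not the paper's heuristic, only an illustrative stand-in; df, num_docs, query_len, and max_queries are assumed parameters.

from collections import Counter
import math

def formulate_queries(doc_tokens, df, num_docs, query_len=4, max_queries=10):
    # Rank the document's terms by tf-idf (df maps term -> document frequency
    # in some background collection of num_docs documents), then group the
    # top keywords into fixed-length web queries.
    tf = Counter(doc_tokens)
    scored = sorted(tf, key=lambda t: tf[t] * math.log(num_docs / (1 + df.get(t, 0))),
                    reverse=True)
    keywords = scored[:query_len * max_queries]
    return [" ".join(keywords[i:i + query_len])
            for i in range(0, len(keywords), query_len)]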
Hypergeometric language models for republished article finding - In SIGIR '11, 2011
"... Republished article finding is the task of identifying instances of ar-ticles that have been published in one source and republished more or less verbatim in another source, which is often a social media source. We address this task as an ad hoc retrieval problem, using the source article as a query ..."
Abstract - Cited by 4 (0 self)
Republished article finding is the task of identifying instances of articles that have been published in one source and republished more or less verbatim in another source, which is often a social media source. We address this task as an ad hoc retrieval problem, using the source article as a query. Our approach is based on language modeling. We revisit the assumptions underlying the unigram language model, taking into account the fact that in our setup queries are as long as complete news articles. We argue that in this case the underlying generative assumption of sampling words from a document with replacement, i.e., the multinomial modeling of documents, produces less accurate query likelihood estimates. To make up for this discrepancy, we consider distributions that emerge from sampling without replacement: the central and non-central hypergeometric distributions. We present two retrieval models that build on top of these distributions: a log odds model and a Bayesian model where document parameters are estimated using the Dirichlet compound multinomial distribution. We analyse the behavior of our new models using a corpus of news articles and blog posts and find that for the task of republished article finding, where we deal with queries whose length approaches the length of the documents to be retrieved, models based on distributions associated with sampling without replacement outperform traditional models based on multinomial distributions.
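The without-replacement intuition can be made concrete with the central multivariate hypergeometric query likelihood sketched below; the paper's smoothing via the Dirichlet compound multinomial and its non-central and log-odds variants are omitted, so this is only the unsmoothed core idea.

from math import comb, log

def log_hypergeometric_likelihood(query_tf, doc_tf, doc_len):
    # Probability of drawing the query's term counts from the document
    # WITHOUT replacement:  prod_t C(K_t, k_t) / C(N, n),
    # where K_t are document term counts, N the document length,
    # k_t the query term counts, and n the query length.
    # No smoothing: an unseen or over-drawn term yields -inf.
    n = sum(query_tf.values())
    if n > doc_len:
        return float("-inf")
    logp = -log(comb(doc_len, n))
    for term, k in query_tf.items():
        K = doc_tf.get(term, 0)
        if k > K:
            return float("-inf")
        logp += log(comb(K, k))
    return logp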
Synthetic review spamming and defense - In KDD, 2013
"... ABSTRACT Online reviews have been popularly adopted in many applications. Since they can either promote or harm the reputation of a product or a service, buying and selling fake reviews becomes a profitable business and a big threat. In this paper, we introduce a very simple, but powerful review sp ..."
Abstract - Cited by 3 (0 self)
Online reviews have been popularly adopted in many applications. Since they can either promote or harm the reputation of a product or a service, buying and selling fake reviews has become a profitable business and a big threat. In this paper, we introduce a very simple but powerful review spamming technique that can easily defeat existing feature-based detection algorithms. It uses one truthful review as a template and replaces its sentences with those from other reviews in a repository. Fake reviews generated by this mechanism are extremely hard to detect: both state-of-the-art computational approaches and human readers reach an error rate of 35%-48%, just slightly better than a random guess. While it is challenging to detect such fake reviews, we have made solid progress in suppressing them. A novel defense method that leverages the difference in semantic flow between synthetic and truthful reviews is developed, which is able to reduce the detection error rate to approximately 22%, a significant improvement over the performance of existing approaches. Nevertheless, further decreasing the error rate remains a challenging research task.
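The template-based attack the abstract describes can be sketched as below; the sentence splitter and the word-overlap criterion for choosing replacement sentences are illustrative assumptions, not the paper's actual generator.

import re

def split_sentences(text):
    # Crude sentence splitter, sufficient for a sketch.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def most_similar(sentence, pool):
    # Pick the repository sentence with the highest word overlap; a crude
    # stand-in for whatever similarity the paper's generator uses.
    words = set(sentence.lower().split())
    return max(pool, key=lambda s: len(words & set(s.lower().split())))

def synthesize_review(template_review, repository):
    # Keep the truthful review's sentence structure but swap each sentence
    # for one drawn from a repository of other reviews.
    pool = [s for review in repository for s in split_sentences(review)]
    return " ".join(most_similar(s, pool) for s in split_sentences(template_review))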
Source Retrieval for Plagiarism Detection from Large Web Corpora: Recent Approaches
"... Abstract This paper overviews the five source retrieval approaches that have been submitted to the seventh international competition on plagiarism detection at PAN 2015. We compare the performances of these five approaches to the 14 methods submitted in the two previous years (eight from PAN 2013 an ..."
Abstract - Cited by 3 (1 self)
This paper overviews the five source retrieval approaches that were submitted to the seventh international competition on plagiarism detection at PAN 2015. We compare the performance of these five approaches to the 14 methods submitted in the two previous years (eight from PAN 2013 and six from PAN 2014). For the third year in a row, we invited software submissions instead of run submissions, so that cross-year evaluations are possible. This year's stand-alone source retrieval overview can thus, to some extent, also be used as a reference to the different ideas presented in the last three years; the text alignment subtask is covered in a separate overview.