Results 1 - 10 of 45
Efficient similarity joins for near duplicate detection
- In WWW, 2008
"... With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near duplicate records efficiently. In this paper, we focus on efficient algorithms to find pair of records such that their similarities are no less than a given ..."
Abstract
-
Cited by 103 (9 self)
- Add to MetaCart
(Show Context)
With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near-duplicate records efficiently. In this paper, we focus on efficient algorithms to find pairs of records such that their similarities are no less than a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques that exploit the token ordering information; they are integrated into the existing methods and drastically reduce the candidate sizes, and hence improve the efficiency. We have also studied the implementation of our proposed algorithm in stand-alone and RDBMS-based settings. Experimental results show that our proposed algorithms can outperform previous algorithms on several real datasets.
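For readers unfamiliar with the prefix filtering principle the abstract refers to, the following is a minimal sketch of a Jaccard-threshold self-join using prefix filtering. The global token ordering (rare tokens first), the threshold value, and the record format are illustrative assumptions; this is not the paper's actual algorithm or code.

```python
# Minimal prefix-filtering sketch for a Jaccard-threshold self-join.
from collections import defaultdict
from math import ceil

def jaccard(x, y):
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def prefix_filter_join(records, threshold):
    """Return record-id pairs whose Jaccard similarity is >= threshold."""
    # Global ordering: rare tokens first, so prefixes are selective.
    freq = defaultdict(int)
    for r in records:
        for tok in set(r):
            freq[tok] += 1
    ordered = [sorted(set(r), key=lambda t: (freq[t], t)) for r in records]

    index = defaultdict(list)   # token -> ids whose prefix contains it
    results = []
    for rid, toks in enumerate(ordered):
        # Any similar record must share at least one token in this prefix.
        prefix_len = len(toks) - ceil(threshold * len(toks)) + 1
        candidates = set()
        for tok in toks[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(rid)
        # Verify candidates with the exact similarity.
        for cid in candidates:
            if jaccard(ordered[cid], toks) >= threshold:
                results.append((cid, rid))
    return sorted(results)

if __name__ == "__main__":
    recs = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "y", "z", "w"]]
    print(prefix_filter_join(recs, 0.6))   # [(0, 1)]
```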
b-bit minwise hashing
- In WWW, 671–680, 2010
"... This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance) with applications in information retrieval, data management, computational advertising, etc. By only storin ..."
Abstract
-
Cited by 26 (7 self)
- Add to MetaCart
This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance) with applications in information retrieval, data management, computational advertising, etc. By only storing b bits of each hashed value (e.g., b = 1 or 2), we gain substantial advantages in terms of storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any b. We demonstrate that, even in the least favorable scenario, using b = 1 may reduce the storage space at least by a factor of 21.3 (or 10.7) compared to b = 64 (or b = 32), if one is interested in resemblance ≥ 0.5.
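As a rough illustration of the storage-versus-accuracy idea, the sketch below estimates resemblance from 1-bit minwise hashes, assuming the sets are tiny relative to the hash universe so the paper's correction constants are negligible and the estimator reduces to R ≈ 2P − 1 (P being the fraction of matching 1-bit hashes). The hash functions and sample size are illustrative, not the authors' implementation.

```python
# 1-bit minwise hashing sketch under a "sparse sets" simplification.
def one_bit_minhashes(s, seeds):
    """Lowest bit of the min-hash of set s under each seeded hash function."""
    sigs = []
    for seed in seeds:
        m = min(hash((seed, x)) & 0xFFFFFFFF for x in s)
        sigs.append(m & 1)
    return sigs

def estimate_resemblance(s1, s2, k=2000):
    seeds = list(range(k))
    b1 = one_bit_minhashes(s1, seeds)
    b2 = one_bit_minhashes(s2, seeds)
    p_hat = sum(x == y for x, y in zip(b1, b2)) / k
    return 2 * p_hat - 1          # b = 1 estimator, valid for sparse sets

if __name__ == "__main__":
    set_a = set(range(0, 120))
    set_b = set(range(40, 160))   # true resemblance = 80 / 160 = 0.5
    print(round(estimate_resemblance(set_a, set_b), 2))
```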
Highlighting Disputed Claims on the Web
- In Proceedings of the 19th International Conference on World Wide Web (WWW), 2010
"... We describe Dispute Finder, a browser extension that alerts a user when information they read online is disputed by a source that they might trust. Dispute Finder examines the text on the page that the user is browsing and highlights any phrases that resemble known disputed claims. If a user clicks ..."
Abstract
-
Cited by 25 (1 self)
- Add to MetaCart
(Show Context)
We describe Dispute Finder, a browser extension that alerts a user when information they read online is disputed by a source that they might trust. Dispute Finder examines the text on the page that the user is browsing and highlights any phrases that resemble known disputed claims. If a user clicks on a highlighted phrase, Dispute Finder shows them a list of articles that support other points of view. Dispute Finder builds a database of known disputed claims by crawling web sites that already maintain lists of disputed claims, and by allowing users to enter claims that they believe are disputed. Dispute Finder identifies snippets that make known disputed claims by running a simple textual entailment algorithm inside the browser extension, referring to a cached local copy of the claim database. In this paper, we explain the design of Dispute Finder and the trade-offs between the various design decisions that we explored.
Figure 1: Dispute Finder highlights text snippets that make disputed claims.
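To make the client-side matching step concrete, here is a toy sketch that flags page sentences resembling entries in a cached claim list. The word-overlap score is merely a stand-in for the paper's textual entailment step; the claim list, threshold, and sentence splitter are assumptions for illustration only.

```python
import re

# Hypothetical cached claim database; in Dispute Finder this is crawled
# and user-contributed, and a copy is cached in the browser extension.
CLAIM_DB = [
    "vaccines cause autism",
    "the earth is only 6000 years old",
]

def overlap(sentence, claim):
    """Fraction of the claim's words that also appear in the sentence."""
    ws = set(re.findall(r"\w+", sentence.lower()))
    wc = set(re.findall(r"\w+", claim.lower()))
    return len(ws & wc) / max(1, len(wc))

def highlight(page_text, threshold=0.6):
    """Return (sentence, matched claim) pairs worth highlighting."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", page_text):
        for claim in CLAIM_DB:
            if overlap(sentence, claim) >= threshold:
                flagged.append((sentence, claim))
    return flagged

print(highlight("Some say vaccines cause autism. Others disagree."))
```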
Debugadvisor: A recommender system for debugging
- In Proceedings of ESEC/FSE ’09, 2009
"... In large software development projects, when a programmer is assigned a bug to fix, she typically spends a lot of time searching (in an ad-hoc manner) for instances from the past where similar bugs have been debugged, analyzed and resolved. Systematic search tools that allow the programmer to expres ..."
Abstract
-
Cited by 23 (0 self)
- Add to MetaCart
(Show Context)
In large software development projects, when a programmer is assigned a bug to fix, she typically spends a lot of time searching (in an ad hoc manner) for instances from the past where similar bugs have been debugged, analyzed, and resolved. Systematic search tools that allow the programmer to express the context of the current bug and search through diverse data repositories associated with large projects can greatly improve the productivity of debugging. This paper presents the design, implementation, and experience from such a search tool called DebugAdvisor. The context of a bug includes all the information a programmer has about the bug, including natural language text, textual renderings of core dumps, debugger output, etc. Our key insight is to allow the programmer to collate this entire context as a query to search for related information. Thus, DebugAdvisor allows the programmer to search using a fat query, which could be kilobytes of structured and unstructured data describing the contextual information for the current bug. Information retrieval in the presence of fat queries and variegated data repositories, all of which contain a mix of structured and unstructured data, is a challenging problem. We present novel ideas to solve this problem. We have deployed DebugAdvisor to over 100 users inside Microsoft. In addition to standard metrics such as precision and recall, we present extensive qualitative and quantitative feedback from our users.
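A minimal sketch of the "fat query" idea follows: flatten all contextual fields (free text, core-dump text, debugger output) into one bag of tokens and rank past bug records by token overlap. The field names and the scoring function are illustrative assumptions, not DebugAdvisor's actual retrieval pipeline.

```python
import re
from collections import Counter

def tokens(text):
    """Crude tokenizer shared by queries and indexed bug records."""
    return Counter(re.findall(r"[A-Za-z_]\w+", text.lower()))

def fat_query(context_fields):
    """context_fields: dict of field name -> raw text (structured or not)."""
    bag = Counter()
    for text in context_fields.values():
        bag += tokens(text)
    return bag

def rank(query_bag, past_bugs):
    """Score each past bug by overlapping token counts with the fat query."""
    scored = []
    for bug_id, text in past_bugs.items():
        doc = tokens(text)
        score = sum(min(c, doc[t]) for t, c in query_bag.items())
        scored.append((score, bug_id))
    return sorted(scored, reverse=True)

# Hypothetical bug context and repository entries.
q = fat_query({
    "description": "crash in parser on null pointer",
    "stack_trace": "parser.c:212 parse_node null deref",
})
print(rank(q, {"BUG-1": "null pointer deref in parse_node",
               "BUG-2": "UI freeze when resizing window"}))
```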
Efficient exact edit similarity query processing with the asymmetric signature scheme
- In SIGMOD Conference, 2011
"... Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold τ. Most existing method answering edit similarity queries rely on a signature scheme to generate candidates given the query string. We observe that the numbe ..."
Abstract
-
Cited by 19 (4 self)
- Add to MetaCart
(Show Context)
Given a query string Q, an edit similarity search finds all strings in a database whose edit distance with Q is no more than a given threshold τ. Most existing methods answering edit similarity queries rely on a signature scheme to generate candidates given the query string. We observe that the number of signatures generated by existing methods is far greater than the lower bound, and this results in high query-time and index-space complexities. In this paper, we show that the minimum signature size lower bound is τ+1. We then propose asymmetric signature schemes that achieve this lower bound. We develop efficient query processing algorithms based on the new scheme. Several dynamic programming-based candidate pruning methods are also developed to further speed up the performance. We have conducted a comprehensive experimental study involving nine state-of-the-art algorithms. The experimental results clearly demonstrate the efficiency of our methods.
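The sketch below gives a pigeonhole-style illustration of why τ+1 signatures can suffice on one side of an edit-similarity query: a data string split into τ+1 disjoint segments must share at least one segment verbatim with any query within edit distance τ. This candidate-then-verify scheme is only illustrative; it is not the paper's asymmetric signature scheme itself.

```python
def segments(s, tau):
    """Split s into tau + 1 roughly equal, disjoint segments."""
    n, k = len(s), tau + 1
    cuts = [round(i * n / k) for i in range(k + 1)]
    return [s[cuts[i]:cuts[i + 1]] for i in range(k)]

def edit_distance(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def edit_search(query, database, tau):
    results = []
    for s in database:
        # Candidate test: some segment of s appears verbatim in the query,
        # since tau edits can "touch" at most tau of the tau + 1 segments.
        if any(seg and seg in query for seg in segments(s, tau)):
            if edit_distance(query, s) <= tau:   # exact verification
                results.append(s)
    return results

db = ["similarity", "simulation", "dissimilar"]
print(edit_search("similarty", db, tau=2))   # "similarity" is within 2 edits
```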
Adaptive Near-Duplicate Detection via Similarity Learning
- In Proceedings of SIGIR ’10, 2010
"... In this paper, we present a novel near-duplicate document detection method that can easily be tuned for a particular domain. Our method represents each document as a real-valued sparse k-gram vector, where the weights are learned to optimize for a specified similarity function, such as the cosine si ..."
Abstract
-
Cited by 12 (2 self)
- Add to MetaCart
(Show Context)
In this paper, we present a novel near-duplicate document detection method that can easily be tuned for a particular domain. Our method represents each document as a real-valued sparse k-gram vector, where the weights are learned to optimize for a specified similarity function, such as the cosine similarity or the Jaccard coefficient. Near-duplicate documents can be reliably detected through this improved similarity measure. In addition, these vectors can be mapped to a small number of hash values as document signatures through the locality sensitive hashing scheme for efficient similarity computation. We demonstrate our approach in two target domains: Web news articles and email messages. Our method is not only more accurate than commonly used methods such as Shingles and I-Match, but also shows consistent improvement across the domains, which is a desired property that existing methods lack.
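As a simplified illustration of hashing weighted k-gram vectors, the sketch below maps documents to 64-bit SimHash signatures, a locality sensitive hashing scheme for cosine similarity. In the paper the k-gram weights are learned for a target similarity function; here the weights are simply raw counts, and k, the signature length, and the hash function are illustrative choices.

```python
import hashlib
from collections import Counter

def kgrams(text, k=3):
    """Character k-gram counts as a (non-learned) sparse weight vector."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

def simhash(vec, n_bits=64):
    """Random-hyperplane style signature: sign of weighted bit votes."""
    totals = [0.0] * n_bits
    for gram, w in vec.items():
        h = int(hashlib.md5(gram.encode()).hexdigest(), 16)
        for bit in range(n_bits):
            totals[bit] += w if (h >> bit) & 1 else -w
    sig = 0
    for bit, t in enumerate(totals):
        sig |= (t >= 0) << bit
    return sig

def hamming(a, b):
    return bin(a ^ b).count("1")

d1 = kgrams("the quick brown fox jumps over the lazy dog")
d2 = kgrams("the quick brown fox jumped over a lazy dog")
d3 = kgrams("completely unrelated text about databases")
print(hamming(simhash(d1), simhash(d2)))   # small distance: near-duplicates
print(hamming(simhash(d1), simhash(d3)))   # larger distance: unrelated
```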
Plagiarism Detection Using Stopword n-Grams
- JASIST
"... In this paper, a novel method for detecting plagiarized passages in document collections is presented. In contrast to previous work in this field that uses content terms to represent documents, the proposed method is based on a small list of stopwords (i.e., very frequent words). We show that stopwo ..."
Abstract
-
Cited by 12 (2 self)
- Add to MetaCart
(Show Context)
In this paper, a novel method for detecting plagiarized passages in document collections is presented. In contrast to previous work in this field that uses content terms to represent documents, the proposed method is based on a small list of stopwords (i.e., very frequent words). We show that stopword n-grams reveal important information for plagiarism detection since they are able to capture syntactic similarities between suspicious and original documents, and they can be used to detect the exact plagiarized passage boundaries. Experimental results on a publicly available corpus demonstrate that the performance of the proposed approach is competitive when compared with the best reported results. More importantly, it achieves significantly better results when dealing with difficult plagiarism cases where the plagiarized passages are highly modified and most of the words or phrases have been replaced with synonyms. This is a preprint of an article published in the Journal of the American Society for Information Science and Technology (JASIST).
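The following sketch shows the basic representation: keep only words from a small stopword list, form n-grams over the reduced sequence, and compare profiles. The tiny stopword list, n = 3, and the Jaccard comparison are illustrative choices, not the paper's exact configuration.

```python
# Stopword n-gram profiles: synonym substitution leaves them largely intact.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "that", "it",
             "was", "for", "on", "with", "as", "by", "at", "this"}

def stopword_ngrams(text, n=3):
    """Set of n-grams over the stopword-only word sequence of the text."""
    seq = [w for w in text.lower().split() if w in STOPWORDS]
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

original   = "the impact of the method is that it was evaluated on a corpus"
plagiarism = "the influence of the approach is that it was tested on a dataset"
unrelated  = "we present results for a new benchmark in this paper"

p0, p1, p2 = (stopword_ngrams(t) for t in (original, plagiarism, unrelated))
print(jaccard(p0, p1), jaccard(p0, p2))   # high vs. low overlap
```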
Efficient partial-duplicate detection based on sequence matching
- In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’10, 2010
"... With the ever-increasing growth of the Internet, numer-ous copies of documents become serious problem for search engine, opinion mining and many other web applications. Since partial-duplicates only contain a small piece of text taken from other sources and most existing near-duplicate detection app ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
(Show Context)
With the ever-increasing growth of the Internet, numerous copies of documents have become a serious problem for search engines, opinion mining, and many other web applications. Since partial duplicates contain only a small piece of text taken from other sources, and most existing near-duplicate detection approaches operate at the document level, partial duplicates cannot be dealt with well. In this paper, we propose a novel algorithm for the partial-duplicate detection task. Besides computing the similarities between documents, our proposed algorithm can simultaneously locate the duplicated parts. The main idea is to divide the partial-duplicate detection task into two subtasks: sentence-level near-duplicate detection and sequence matching. For evaluation, we compare the proposed method with other approaches on both English and Chinese web collections. Experimental results indicate that our proposed method detects partial duplicates on large web collections both effectively and efficiently.
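A minimal sketch of the two-stage idea follows: detect sentence-level matches via normalized sentence hashes, then apply sequence matching to locate the duplicated span. Sentence splitting on punctuation, exact hash matching, and difflib.SequenceMatcher are illustrative stand-ins for the paper's components.

```python
import difflib
import hashlib
import re

def sentence_ids(doc):
    """Split into sentences and hash each normalized sentence."""
    sents = re.split(r"(?<=[.!?])\s+", doc.strip())
    ids = [hashlib.md5(" ".join(s.lower().split()).encode()).hexdigest()
           for s in sents]
    return ids, sents

def partial_duplicates(doc_a, doc_b):
    """Return aligned runs of matching sentences between the two documents."""
    ids_a, sents_a = sentence_ids(doc_a)
    ids_b, sents_b = sentence_ids(doc_b)
    matcher = difflib.SequenceMatcher(a=ids_a, b=ids_b, autojunk=False)
    spans = []
    for m in matcher.get_matching_blocks():
        if m.size > 0:
            spans.append((sents_a[m.a:m.a + m.size],
                          sents_b[m.b:m.b + m.size]))
    return spans

a = "Intro text. Copied sentence one. Copied sentence two. Own conclusion."
b = "Different opening. Copied sentence one. Copied sentence two. New ending."
for span_a, span_b in partial_duplicates(a, b):
    print(span_a)   # the located duplicated part
```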
Near-Duplicate Detection for Web-Forums
"... Unlike web search engines, current forum search engines lack the ability to identify threads with near-duplicate content and to group these threads in the search results. As a result, forum users are overloaded with duplicate search results. Furthermore, they often create additional similar postings ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
Unlike web search engines, current forum search engines lack the ability to identify threads with near-duplicate content and to group these threads in the search results. As a result, forum users are overloaded with duplicate search results. Furthermore, they often create additional similar postings without reviewing existing search results. To overcome this problem, we have developed a new near-duplicate detection algorithm for forum threads. The algorithm was implemented in a large case study using a real-world forum serving more than one million users. We compared our work with current algorithms, such as [3, 4], for detecting near-duplicates on machine-generated web pages. Our preliminary results show that we significantly outperform these algorithms and that we are able to group forum threads with a precision of 74%.
CoreEx: content extraction from online news articles
- In CIKM’2008: Proceedings of the 17th ACM Conference on Information and Knowledge Management
"... We developed and tested a heuristic technique for extracting the main article from news site Web pages. We construct the DOM tree of the page and score every node based on the amount of text and the number of links it contains. The method is site-independent and does not use any language-based featu ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
(Show Context)
We developed and tested a heuristic technique for extracting the main article from news site Web pages. We construct the DOM tree of the page and score every node based on the amount of text and the number of links it contains. The method is site-independent and does not use any language-based features. We tested our algorithm on a set of 1120 news article pages from 27 domains. This dataset was also used elsewhere to test the performance of another state-of-the-art baseline system. Our algorithm achieved over 97% precision and 98% recall, and an average processing speed of under 15 ms per page. This precision/recall performance is slightly below the baseline system, but our approach requires significantly less computational work.
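The sketch below scores DOM nodes by how much text versus link text they contain and picks the highest-scoring node as the main article, in the spirit of the described heuristic. The scoring formula, the minimum text length, and the use of html.parser are illustrative assumptions, not the paper's exact method.

```python
from html.parser import HTMLParser

class NodeScorer(HTMLParser):
    """Track, per open element: [tag, text_chars, link_chars, direct_text]."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.best = ("", 0.0, "")   # (tag, score, snippet)
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1
        self.stack.append([tag, 0, 0, ""])

    def handle_data(self, data):
        n = len(data.strip())
        for frame in self.stack:          # text counts toward all ancestors
            frame[1] += n
            if self.in_link:
                frame[2] += n
        if self.stack:
            self.stack[-1][3] += data

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        if not self.stack:
            return
        t, text, link, snippet = self.stack.pop()
        # More text and fewer link characters -> more article-like.
        score = (text - link) / (text + 1)
        if score > self.best[1] and text > 50:
            self.best = (t, score, snippet.strip()[:60])

html = ("<html><body><div><a href='/'>Home</a> <a href='/x'>News</a></div>"
        "<div>This is the main article body with plenty of running text "
        "and only prose, no navigation links at all.</div></body></html>")
p = NodeScorer()
p.feed(html)
print(p.best)   # the article <div> wins over the link-heavy navigation <div>
```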