Top-k Set Similarity Joins

by Chuan Xiao , Wei Wang , Xuemin Lin , Haichuan Shang
Citations:42 - 2 self
BibTeX

@MISC{Xiao_top-kset,
    author = {Chuan Xiao and Wei Wang and Xuemin Lin and Haichuan Shang},
    title = {Top-k Set Similarity Joins},
    year = {}
}


Abstract

Similarity join is a useful primitive operation underlying many applications, such as near-duplicate Web page detection, data integration, and pattern recognition. Traditional similarity joins require a user to specify a similarity threshold. In this paper, we study a variant of the similarity join, termed the top-k set similarity join. It returns the top-k pairs of records ranked by their similarities, thus eliminating the guesswork users must perform when the similarity threshold is unknown beforehand. An algorithm, topk-join, is proposed to answer top-k similarity joins efficiently. It is based on the prefix filtering principle and employs tight upper bounding of the similarity values of unseen pairs. Experimental results demonstrate the efficiency of the proposed algorithm on large-scale real datasets.
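
To make the problem statement concrete, the following is a minimal brute-force sketch of what a top-k set similarity join returns. It is not the topk-join algorithm from the paper: it compares every pair of records and uses neither prefix filtering nor upper bounds on unseen pairs. Jaccard similarity is assumed as the similarity function, and the names `jaccard` and `naive_topk_join` are hypothetical, chosen only for illustration.

```python
import heapq
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (assumed measure)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def naive_topk_join(records, k):
    """Return the k most similar record pairs as (similarity, i, j), best first.

    Brute-force baseline: every pair is scored, so the cost is quadratic
    in the number of records. No similarity threshold is needed.
    """
    if k <= 0:
        return []
    heap = []  # min-heap holding the best k pairs seen so far
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        sim = jaccard(a, b)
        if len(heap) < k:
            heapq.heappush(heap, (sim, i, j))
        elif sim > heap[0][0]:
            # The new pair beats the current k-th best pair, so replace it.
            heapq.heapreplace(heap, (sim, i, j))
    return sorted(heap, reverse=True)


records = [
    {"a", "b", "c"},
    {"a", "b", "c", "d"},
    {"x", "y"},
    {"a", "b"},
]
top2 = naive_topk_join(records, 2)
print(top2)
```

In this sketch the smallest similarity in the heap plays the role that a user-supplied threshold plays in a traditional similarity join, except that it rises on its own as better pairs are found. An efficient algorithm would aim to avoid scoring pairs that cannot exceed that value.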
