Distinct sampling for highly-accurate answers to distinct values queries and event reports

by Phillip B. Gibbons
Venue: In Proceedings of the 27th International Conference on Very Large Data Bases
Citations: 119 - 5 self

BibTeX

@INPROCEEDINGS{Gibbons_distinctsampling,
    author = {Phillip B. Gibbons},
    title = {Distinct sampling for highly-accurate answers to distinct values queries and event reports},
    booktitle = {In Proceedings of the 27th International Conference on Very Large Data Bases},
    year = {},
    pages = {541--550}
}

Abstract

Estimating the number of distinct values is a well-studied problem, due to its frequent occurrence in queries and its importance in selecting good query plans. Previous work has shown powerful negative results on the quality of distinct-values estimates based on sampling (or other techniques that examine only part of the input data). We present an approach, called distinct sampling, that collects a specially tailored sample over the distinct values in the input, in a single scan of the data. In contrast to the previous negative results, our small Distinct Samples are guaranteed to accurately estimate the number of distinct values. The samples can be incrementally maintained up-to-date in the presence of data insertions and deletions, with minimal time and memory overheads, so that the full scan may be performed only once. Moreover, a stored Distinct Sample can be used to accurately estimate the number of distinct values within any range specified by the query, or within any other subset of the data satisfying a query predicate. We present an extensive experimental study of distinct sampling. Using synthetic and real-world data sets, we show that distinct sampling gives distinct-values estimates to within 0%–10% relative error, whereas previous methods typically incur 50%–250% relative error. Next, we show how distinct sampling can provide fast, highly accurate approximate answers for “report” queries in high-volume, session-based event recording environments, such as IP networks, customer service call centers, etc. For a commercial call center environment, we show that a 1% Distinct Sample
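
The abstract does not spell out how a Distinct Sample is built, so the following is only a minimal sketch of one common way to realize a bounded sample over distinct values in a single scan, not the paper's exact algorithm. It assumes a hash-based level scheme: each value gets a level equal to the number of trailing zero bits of its hash, the sample keeps only values at or above a current level, and the level is raised whenever the sample outgrows its capacity. The names `die_level`, `DistinctSampleSketch`, and `capacity` are hypothetical, and support for deletions is omitted.

```python
import hashlib


def die_level(value, max_level=64):
    """Trailing zero bits of a 64-bit hash of value.

    Assumption for this sketch: the hash behaves like a random function,
    so a value reaches level l with probability about 2**-l.
    """
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")
    if h == 0:
        return max_level
    return (h & -h).bit_length() - 1


class DistinctSampleSketch:
    """Bounded sample over distinct values, built in one pass (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity  # maximum number of distinct values kept
        self.level = 0            # current sampling level
        self.sample = {}          # value -> its hash level

    def add(self, value):
        if value in self.sample:
            return  # duplicates never change the sample
        lvl = die_level(value)
        if lvl < self.level:
            return  # value is not selected at the current level
        self.sample[value] = lvl
        # Over capacity: raise the level and evict values below it.
        while len(self.sample) > self.capacity:
            self.level += 1
            self.sample = {v: l for v, l in self.sample.items()
                           if l >= self.level}

    def estimate(self, predicate=lambda v: True):
        """Estimated number of distinct values that satisfy predicate."""
        matching = sum(1 for v in self.sample if predicate(v))
        return matching * (2 ** self.level)
```

Because membership depends only on a value's hash, the sample is the same regardless of how often or in what order values appear. Passing a predicate to `estimate`, for example `lambda v: 100 <= v < 200`, scales the count of matching sampled values by the same factor, which mirrors the abstract's point that a stored sample can answer distinct-values queries over a range or other subset.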

Keyphrases

distinct value    value query    event report    distinct sampling    highly-accurate answer    distinct-values estimate    relative error    small distinct sample    ip network    memory overhead    input data    data insertion    customer service call center    single scan    query predicate    previous negative result    commercial call center environment    report query    tailored sample    stored distinct sample    previous work    powerful negative result    well-studied problem    real-world data set    session-based event    distinct sample    full scan    previous method    frequent occurrence    extensive experimental study    good query plan    highly-accurate approximate answer    minimal time
