  Comparing data streams using Hamming norms (how to zero in) (2002) [34 citations, 6 self]

by Graham Cormode, Mayur Datar, S. Muthukrishnan, Piotr Indyk
In VLDB
http://www.cs.ust.hk/vldb2002/VLDB2002-papers/S10P02.pdf

Abstract:

Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large scale diagnostic data streams. Such streams are rarely stored in traditional databases, and instead must be processed “on the fly” as they are produced. Similarly, sensor networks produce multiple data streams of observations from their sensors. There is growing focus on manipulating data streams, and hence, there is a need to identify basic operations of interest in managing data streams, and to support them efficiently. We propose computation of the Hamming norm as a basic operation of interest. The Hamming norm formalises ideas that are used throughout data processing. When applied to a single stream, the Hamming norm gives the number of distinct items that are present in that data stream, which is a statistic of great interest in databases. When applied to a pair of streams, the Hamming norm gives an important measure of (dis)similarity: the number of unequal item counts in the two streams. Hamming norms have many uses in comparing data streams. We present a novel approximation technique for estimating the Hamming norm for massive data streams; this relies on what we call the “l0 sketch” and we prove its accuracy. We test our approximation method on a large quantity of synthetic and real stream data, and show that the estimation is accurate to within a few percentage points.
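
To make the two definitions in the abstract concrete, the following is a minimal sketch of the exact quantities that the paper's method approximates. It is not the "l0 sketch" itself: it stores a full count per item, so its memory grows with the number of distinct items, which is what a streaming sketch is meant to avoid. The function names `hamming_norm` and `hamming_distance` and the sample streams are assumptions made for illustration and do not come from the paper.

```python
from collections import Counter
from typing import Hashable, Iterable


def hamming_norm(stream: Iterable[Hashable]) -> int:
    """Exact Hamming norm of one stream: the number of distinct items present.

    Hypothetical helper for illustration; keeps one counter per distinct item.
    """
    counts = Counter(stream)
    return sum(1 for count in counts.values() if count != 0)


def hamming_distance(stream_a: Iterable[Hashable],
                     stream_b: Iterable[Hashable]) -> int:
    """Exact Hamming norm of a pair of streams: the number of items whose
    counts differ between the two streams.

    Hypothetical helper for illustration; an item missing from a stream
    is treated as having count zero there.
    """
    counts_a = Counter(stream_a)
    counts_b = Counter(stream_b)
    all_items = counts_a.keys() | counts_b.keys()
    return sum(1 for item in all_items if counts_a[item] != counts_b[item])


if __name__ == "__main__":
    a = ["x", "y", "x", "z"]       # counts: x=2, y=1, z=1
    b = ["x", "x", "y", "y", "w"]  # counts: x=2, y=2, w=1
    print(hamming_norm(a))         # distinct items in a
    print(hamming_distance(a, b))  # items with unequal counts (y, z, w)
```

In this sample, `x` occurs twice in both streams and so does not contribute to the pairwise measure, while `y`, `z`, and `w` each have different counts in the two streams.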

Citations

338 The space complexity of approximating the frequency moments – Alon, Matias, et al. - 1996
180 Probabilistic counting algorithms for database applications – Flajolet, Martin - 1985
174 Deriving traffic demands for operational IP networks: Methodology and experience – Feldmann, Greenberg, et al. - 2001
172 Clustering Data Streams – Guha, Motwani, et al. - 2000
166 Fjording the stream: An architecture for queries over streaming sensor data – Madden, Franklin - 2002
158 Continuous queries over data streams – Babu, Widom
135 Surfing wavelets on streams: One-pass summaries for approximate aggregate queries – Gilbert, Kotidis, et al. - 2001
133 Mining time-changing data streams – Hulten, Spencer, et al. - 2001
122 Stable distributions, pseudorandom generators, embeddings and data stream computation – Indyk - 2000
96 NetScope: Traffic Engineering for IP Networks – Feldmann, Greenberg, et al. - 2000
96 Inferring internet denial of service activity – Moore, Voelker, et al. - 2001
93 Data-streams and histograms – Guha, Koudas, et al.
90 Sampling-based estimation of the number of distinct values of an attribute – Haas, Naughton, et al. - 1995
80 Finding interesting associations without support pruning – Cohen, Datar, et al.
76 Random sampling for histogram construction: How much is enough – Chaudhuri, Motwani, et al. - 1998
60 Extensions of Lipschitz mapping into Hilbert space – Johnson, Lindenstrauss - 1984
52 An approximate L1-difference algorithm for massive data – Feigenbaum, Kannan, et al. - 1999
51 Estimating simple functions on the union of data streams – Gibbons, Tirthapura - 2001
50 Probabilistic counting – Flajolet, Martin - 1983
47 Distinct sampling for highly-accurate answers to distinct values queries and event reports – Gibbons - 2001
39 Identifying Representative Trends in Massive Time Series Data Sets Using Sketches – Indyk, Koudas, et al.
36 Gigascope: high performance network monitoring with an sql interface – Cranor, Gao, et al. - 2002
33 Towards estimation error guarantees for distinct values – Charikar, Chaudhuri, et al. - 2000
31 Quicksand: Quick summary and analysis of network data – Gilbert, Kotidis, et al. - 2001
30 Mining database structure; or, how to build a data quality browser – Dasu, Johnson, et al. - 2002
22 Estimating the number of species: A review – Bunge, M - 1993
11 A data stream management system for network traffic management – Babu, Subramanian, et al. - 2001
7 Synopsis structures for massive data sets – Gibbons, Matias - 1999
4 Fast mining of tabular data via approximate distance computations – Cormode, Indyk, et al. - 2002
4 Falcon: Fault management via alarm warehousing and mining – Grossglauser, Koudas, et al. - 2001
4 National Oceanic and Atmospheric Administration – NOAA - 1993
1 More details at http://www.cisco.com/warp/public/732 – NetFlow
1 Stable distributions. Available from http://academic2.american.edu/~jpnolan/stable/chap1.ps – Nolan
1 Multidimensional dynamic histograms – Thaper, Guha, et al. - 2002
1 Atmospheric data repository. http://www.unidata.ucar.edu – Unidata