Results 11  20
of
324
Sketchbased Change Detection: Methods, Evaluation, and Applications
 IN INTERNET MEASUREMENT CONFERENCE
, 2003
"... Traffic anomalies such as failures and attacks are commonplace in today's network, and identifying them rapidly and accurately is critical for large network operators. The detection typically treats the traffic as a collection of flows that need to be examined for significant changes in traffic ..."
Abstract

Cited by 161 (17 self)
 Add to MetaCart
Traffic anomalies such as failures and attacks are commonplace in today's network, and identifying them rapidly and accurately is critical for large network operators. The detection typically treats the traffic as a collection of flows that need to be examined for significant changes in traffic pattern (e.g., volume, number of connections) . However, as link speeds and the number of flows increase, keeping perflow state is either too expensive or too slow. We propose building compact summaries of the traffic data using the notion of sketches. We have designed a variant of the sketch data structure, kary sketch, which uses a constant, small amount of memory, and has constant perrecord update and reconstruction cost. Its linearity property enables us to summarize traffic at various levels. We then implement a variety of time series forecast models (ARIMA, HoltWinters, etc.) on top of such summaries and detect significant changes by looking for flows with large forecast errors. We also present heuristics for automatically configuring the model parameters. Using a
Clustering data streams: Theory and practice
 IEEE TKDE
, 2003
"... Abstract—The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little ..."
Abstract

Cited by 154 (4 self)
 Add to MetaCart
Abstract—The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm’s performance on synthetic and real data streams. Index Terms—Clustering, data streams, approximation algorithms. 1
Efficient Similarity Search and Classification Via Rank Aggregation
 In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data
, 2003
"... We propose a novel approach to performing efficient similarity search and classification in high dimensional data. In this framework, the database elements are vectors in a Euclidean space. Given a query vector in the same space, the goal is to find elements of the database that are similar to the ..."
Abstract

Cited by 152 (3 self)
 Add to MetaCart
We propose a novel approach to performing efficient similarity search and classification in high dimensional data. In this framework, the database elements are vectors in a Euclidean space. Given a query vector in the same space, the goal is to find elements of the database that are similar to the query. In our approach, a small number of independent "voters" rank the database elements based on similarity to the query. These rankings are then combined by a highly efficient aggregation algorithm. Our methodology leads both to techniques for computing approximate nearest neighbors and to a conceptually rich alternative to nearest neighbors.
Reductions in Streaming Algorithms, with an Application to Counting Triangles in Graphs
"... We introduce reductions in the streaming model as a tool in the design of streaming algorithms. We develop ..."
Abstract

Cited by 151 (5 self)
 Add to MetaCart
We introduce reductions in the streaming model as a tool in the design of streaming algorithms. We develop
DataStreams and Histograms
, 2001
"... Histograms have been used widely to capture data distribution, to represent the data by a small number of step functions. Dynamic programming algorithms which provide optimal construction of these histograms exist, albeit running in quadratic time and linear space. In this paper we provide linear ti ..."
Abstract

Cited by 146 (9 self)
 Add to MetaCart
(Show Context)
Histograms have been used widely to capture data distribution, to represent the data by a small number of step functions. Dynamic programming algorithms which provide optimal construction of these histograms exist, albeit running in quadratic time and linear space. In this paper we provide linear time construction of 1 + epsilon approximation of optimal histograms, running in polylogarithmic space. Our results extend to the context of datastreams, and in fact generalize to give 1 + epsilon approximation of several problems in datastreams which require partitioning the index set into intervals. The only assumptions required are that the cost of an interval is monotonic under inclusion (larger interval has larger cost) and that the cost can be computed or approximated in small space. This exhibits a nice class of problems for which we can have near optimal datastream algorithms.
Near Duplicate Image Detection: minHash and tfidf Weighting
"... This paper proposes two novel image similarity measures for fast indexing via locality sensitive hashing. The similarity measures are applied and evaluated in the context of near duplicate image detection. The proposed method uses a visual vocabulary of vector quantized local feature descriptors (SI ..."
Abstract

Cited by 131 (1 self)
 Add to MetaCart
(Show Context)
This paper proposes two novel image similarity measures for fast indexing via locality sensitive hashing. The similarity measures are applied and evaluated in the context of near duplicate image detection. The proposed method uses a visual vocabulary of vector quantized local feature descriptors (SIFT) and for retrieval exploits enhanced minHash techniques. Standard minHash uses an approximate set intersection between document descriptors was used as a similarity measure. We propose an efficient way of exploiting more sophisticated similarity measures that have proven to be essential in image / particular object retrieval. The proposed similarity measures do not require extra computational effort compared to the original measure. We focus primarily on scalability to very large image and video databases, where fast query processing is necessary. The method requires only a small amount of data need be stored for each image. We demonstrate our method on the TrecVid 2006 data set which contains approximately 146K key frames, and also on challenging the University of Kentucky image retrieval database. 1
BMultiProbe LSH: Efficient indexing for highdimensional similarity search
 in Proc. 33rd Int. Conf. Very Large Data Bases
"... Similarity indices for highdimensional data are very desirable for building contentbased search systems for featurerich data such as audio, images, videos, and other sensor data. Recently, locality sensitive hashing (LSH) and its variations have been proposed as indexing techniques for approximate ..."
Abstract

Cited by 111 (3 self)
 Add to MetaCart
(Show Context)
Similarity indices for highdimensional data are very desirable for building contentbased search systems for featurerich data such as audio, images, videos, and other sensor data. Recently, locality sensitive hashing (LSH) and its variations have been proposed as indexing techniques for approximate similarity search. A significant drawback of these approaches is the requirement for a large number of hash tables in order to achieve good search quality. This paper proposes a new indexing scheme called multiprobe LSH that overcomes this drawback. Multiprobe LSH is built on the wellknown LSH technique, but it intelligently probes multiple buckets that are likely to contain query results in a hash table. Our method is inspired by and improves upon recent theoretical work on entropybased LSH designed to reduce the space requirement of the basic LSH method. We have implemented the multiprobe LSH method and evaluated the implementation with two different highdimensional datasets. Our evaluation shows that the multiprobe LSH method substantially improves upon previously proposed methods in both space and time efficiency. To achieve the same search quality, multiprobe LSH has a similar timeefficiency as the basic LSH method while reducing the number of hash tables by an order of magnitude. In comparison with the entropybased LSH method, to achieve the same search quality, multiprobe LSH uses less query time and 5 to 8 times fewer number of hash tables. 1.
On graph problems in a semistreaming model
 In 31st International Colloquium on Automata, Languages and Programming
, 2004
"... Abstract. We formalize a potentially rich new streaming model, the semistreaming model, that we believe is necessary for the fruitful study of efficient algorithms for solving problems on massive graphs whose edge sets cannot be stored in memory. In this model, the input graph, G = (V, E), is prese ..."
Abstract

Cited by 108 (16 self)
 Add to MetaCart
(Show Context)
Abstract. We formalize a potentially rich new streaming model, the semistreaming model, that we believe is necessary for the fruitful study of efficient algorithms for solving problems on massive graphs whose edge sets cannot be stored in memory. In this model, the input graph, G = (V, E), is presented as a stream of edges (in adversarial order), and the storage space of an algorithm is bounded by O(n · polylog n), where n = V . We are particularly interested in algorithms that use only one pass over the input, but, for problems where this is provably insufficient, we also look at algorithms using constant or, in some cases, logarithmically many passes. In the course of this general study, we give semistreaming constant approximation algorithms for the unweighted and weighted matching problems, along with a further algorithm improvement for the bipartite case. We also exhibit log n / log log n semistreaming approximations to the diameter and the problem of computing the distance between specified vertices in a weighted graph. These are complemented by Ω(log (1−ɛ) n) lower bounds. 1
Secure multiparty computation of approximations
, 2001
"... Approximation algorithms can sometimes provide efficient solutions when no efficient exact computation is known. In particular, approximations are often useful in a distributed setting where the inputs are held by different parties and may be extremely large. Furthermore, for some applications, the ..."
Abstract

Cited by 107 (26 self)
 Add to MetaCart
Approximation algorithms can sometimes provide efficient solutions when no efficient exact computation is known. In particular, approximations are often useful in a distributed setting where the inputs are held by different parties and may be extremely large. Furthermore, for some applications, the parties want to compute a function of their inputs securely, without revealing more information than necessary. In this work we study the question of simultaneously addressing the above efficiency and security concerns via what we call secure approximations. We start by extending standard definitions of secure (exact) computation to the setting of secure approximations. Our definitions guarantee that no additional information is revealed by the approximation beyond what follows from the output of the function being approximated. We then study the complexity of specific secure approximation problems. In particular, we obtain a sublinearcommunication protocol for securely approximating the Hamming distance and a polynomialtime protocol for securely approximating the permanent and related #Phard problems. 1
Characterizing Memory Requirements for Queries over Continuous Data
 In PODS
, 2002
"... This paper deals with continuous conjunctive queries with arithmetic comparisons and optional aggregation over multiple data streams. An algorithm is presented for determining whether or not any given query can be evaluated using a bounded amount of memory for all possible instances of the data stre ..."
Abstract

Cited by 99 (11 self)
 Add to MetaCart
(Show Context)
This paper deals with continuous conjunctive queries with arithmetic comparisons and optional aggregation over multiple data streams. An algorithm is presented for determining whether or not any given query can be evaluated using a bounded amount of memory for all possible instances of the data streams. For queries that can be evaluated using bounded memory, an execution strategy based on constantsized synopses of the data streams is proposed. For queries that cannot be evaluated using bounded memory, data stream scenarios are identified in which evaluating the queries requires memory linear in the size of the unbounded streams