Results 1–10 of 34
The Dynamic Bloom Filters
 In Proc. IEEE INFOCOM, 2006
Cited by 25 (3 self)
Abstract—A Bloom filter is an effective, space-efficient data structure for concisely representing a set and supporting approximate membership queries. Traditionally, the Bloom filter and its variants focus only on how to represent a static set and drive the false positive probability down to a sufficiently low level. By investigating mainstream applications based on the Bloom filter, we reveal that dynamic data sets are more common and important than static sets. However, existing variants of the Bloom filter cannot support dynamic data sets well. To address this issue, we propose dynamic Bloom filters to represent dynamic as well as static sets, and design the necessary item insertion, membership query, item deletion, and filter union algorithms. The dynamic Bloom filter keeps the false positive probability at a low level by expanding its capacity as the set cardinality increases. Through comprehensive mathematical analysis, we show that the dynamic Bloom filter uses less expected memory than the Bloom filter when representing dynamic sets with an upper bound on set cardinality, and that it is more stable than the Bloom filter, owing to infrequent reconstruction, when handling dynamic sets without an upper bound on set cardinality. Moreover, the analysis results hold in standalone as well as distributed applications. Index Terms—Bloom filters, dynamic Bloom filters, information representation.
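The expansion mechanism this abstract describes can be sketched as follows. This is a minimal illustration of the idea only (a list of standard Bloom filters, with a fresh sub-filter appended once the active one reaches a chosen capacity c); the SHA-256-derived hash positions and all parameters are assumptions for the sketch, not the paper's construction:

```python
import hashlib

class BloomFilter:
    """Standard Bloom filter: k hash positions over an m-bit array."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m
        self.count = 0  # number of items inserted so far

    def _positions(self, item):
        # Derive k positions from one SHA-256 digest (illustrative choice).
        h = hashlib.sha256(str(item).encode()).digest()
        return [int.from_bytes(h[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1
        self.count += 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

class DynamicBloomFilter:
    """Grows by appending a fresh sub-filter once the active one holds c items."""
    def __init__(self, m, k, c):
        self.m, self.k, self.c = m, k, c
        self.filters = [BloomFilter(m, k)]

    def add(self, item):
        if self.filters[-1].count >= self.c:
            # Active filter is at capacity: expand instead of letting the
            # false positive probability creep up.
            self.filters.append(BloomFilter(self.m, self.k))
        self.filters[-1].add(item)

    def __contains__(self, item):
        # A query probes every sub-filter; "absent in all" is definitive.
        return any(item in bf for bf in self.filters)
```

With c = 4, inserting ten items yields three sub-filters, and every inserted item still tests positive; only false positives, never false negatives, are possible.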
False Negative Problem of Counting Bloom Filter
Cited by 8 (2 self)
Abstract—A Bloom filter is an effective, space-efficient data structure for concisely representing a data set and supporting approximate membership queries. Traditionally, researchers believe that a Bloom filter may return a false positive but will never return a false negative under well-behaved operations. By investigating the mainstream variants, however, we observe that a Bloom filter does return false negatives in many scenarios. In this work, we show that the undetectable incorrect deletion of false positive items and the detectable incorrect deletion of multi-address items are two general causes of false negatives in a Bloom filter. We then measure the potential and exposed false negatives theoretically and practically. Inspired by the fact that the potential false negatives are usually not fully exposed, we propose a novel Bloom filter scheme which increases the ratio of bits set to a value larger than one without decreasing the ratio of bits set to zero. Mathematical analysis and comprehensive experiments show that this design can reduce the number of exposed false negatives as well as decrease the likelihood of false positives. To the best of our knowledge, this is the first work that deals systematically with both the false positive and false negative problems of the Bloom filter while supporting the standard item insertion, query, and deletion operations. Index Terms—Bloom filter, false negative, multi-choice counting Bloom filter.
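The first cause named above, the undetectable incorrect deletion of a false-positive item, is easy to reproduce with a plain counting Bloom filter. The sketch below is an assumed textbook counting filter (not the paper's multi-choice scheme): deleting an item that was never inserted, but happens to test positive, decrements counters shared with a genuinely inserted item, which then tests negative.

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter: per-cell counters so that deletion is possible."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, item):
        # Illustrative SHA-256-derived positions, not the paper's hashes.
        h = hashlib.sha256(str(item).encode()).digest()
        return [int.from_bytes(h[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.counters[p] += 1

    def remove(self, item):
        # The filter cannot tell whether `item` was really inserted:
        # deleting a false positive silently damages other items' counters.
        for p in self._positions(item):
            self.counters[p] -= 1

    def __contains__(self, item):
        return all(self.counters[p] > 0 for p in self._positions(item))
```

A tiny filter makes the failure reproducible: insert "x", search for any string that is a false positive, delete it, and "x" becomes a false negative, because every position of the false positive lies among the cells "x" set.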
Improved approximate detection of duplicates for data streams over sliding windows
 J. of Computer Science and Technology
Cited by 8 (0 self)
Abstract—Detecting duplicates in data streams is an important problem with a wide range of applications. In general, precisely detecting duplicates in an unbounded data stream is not feasible in most streaming scenarios; on the other hand, the elements in data streams are always time sensitive. This makes it particularly significant to approximately detect duplicates among the newly arrived elements of a data stream within a fixed time frame. In this paper, we present a novel data structure, the Decaying Bloom Filter (DBF), an extension of the Counting Bloom Filter that effectively removes stale elements as new elements continuously arrive over sliding windows. Based on the DBF, we present an efficient algorithm to approximately detect duplicates over sliding windows. Our algorithm may produce false positive errors but, unlike many previous results, no false negative errors. We analyze the time complexity and detection accuracy, and give a tight upper bound on the false positive rate. For a given space of G bits and sliding window size W, our algorithm has an amortized time complexity of O(G/W). Both analytical and experimental results on synthetic data demonstrate that our algorithm is superior to previous results in both execution time and detection accuracy.
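A minimal sketch of the decaying idea, under the assumption that each cell holds a remaining lifetime that is refreshed to the window size w on insertion and decays by one on every arrival. The paper amortizes the decay work to O(G/W) per element; this naive version scans all cells on each arrival, so it illustrates only the semantics, not the claimed complexity:

```python
import hashlib

class DecayingBloomFilter:
    """Sketch: cells hold a remaining lifetime; stale elements fade out."""
    def __init__(self, m, k, w):
        self.m, self.k, self.w = m, k, w   # w = sliding-window size
        self.cells = [0] * m

    def _positions(self, item):
        # Illustrative SHA-256-derived positions.
        h = hashlib.sha256(str(item).encode()).digest()
        return [int.from_bytes(h[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def arrive(self, item):
        """Return True if `item` looks like a duplicate of the last w arrivals."""
        dup = all(self.cells[p] > 0 for p in self._positions(item))
        # Decay every live cell by one (naive O(m) scan per arrival).
        for i in range(self.m):
            if self.cells[i] > 0:
                self.cells[i] -= 1
        # (Re)insert the item with its full lifetime w.
        for p in self._positions(item):
            self.cells[p] = self.w
        return dup
```

Because a cell is only ever refreshed upward or decayed by one per arrival, an element seen within the last w arrivals always tests positive (no false negatives), while hash collisions can still cause false positives, matching the guarantee stated in the abstract.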
Addressing click fraud in content delivery systems
 In Proceedings of the 26th IEEE INFOCOM International Conference on Computer Communications, 2007
A Locality-Aware Memory Hierarchy for Energy-Efficient GPU Architectures
Cited by 5 (1 self)
As GPUs' compute capabilities grow, their memory hierarchy increasingly becomes a bottleneck. Current GPU memory hierarchies use coarse-grained memory accesses to exploit spatial locality, maximize peak bandwidth, simplify control, and reduce cache metadata storage. These coarse-grained memory accesses, however, are a poor match for emerging GPU applications with irregular control flow and memory access patterns. Meanwhile, the massive multithreading of GPUs and the simplicity of their cache hierarchies make CPU-specific memory system enhancements ineffective at improving the performance of irregular GPU applications. We design and evaluate a locality-aware memory hierarchy for throughput processors such as GPUs. Our proposed design retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory. By adaptively adjusting the access granularity, memory bandwidth and energy are reduced for data with low spatial/temporal locality without wasting control overheads or prefetching potential for data with high spatial locality. As such, our locality-aware memory hierarchy improves GPU performance, energy efficiency, and memory throughput for a large range of applications.
Query by Document via a Decomposition-Based Two-Level Retrieval Approach
Cited by 4 (0 self)
Retrieving documents similar to a given document from a large-scale text corpus is a fundamental technique for many applications. However, most existing indexing techniques have difficulty addressing this problem due to special properties of a document query, e.g., high dimensionality, sparse representation, and semantic issues. To address this problem, we propose a two-level retrieval solution based on a document decomposition idea. A document is decomposed into a compact vector and a few document-specific keywords by a dimension reduction approach. The compact vector embodies the major semantics of the document, and the document-specific keywords complement the discriminative power lost in the dimension reduction process. We adopt locality sensitive hashing (LSH) to index the compact vectors, which guarantees quickly finding a set of related documents according to the vector of a query document. Then we re-rank the documents in this set by their document ...
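The two-level idea (an approximate bucket lookup over compact vectors, then an exact re-ranking of the candidates) can be illustrated with random-hyperplane LSH. All function names, the hyperplane family, and the cosine re-ranking below are assumptions for the sketch; the paper uses its own decomposition and keyword complement, which this toy omits:

```python
import random
from math import sqrt

def random_planes(n_bits, dim, seed=0):
    """One Gaussian hyperplane per signature bit (standard random-projection LSH)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def signature(vec, planes):
    # Sign pattern of the vector against each hyperplane; similar vectors
    # tend to share sign patterns, so they tend to share buckets.
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0)
                 for plane in planes)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def two_level_search(query_vec, index, planes, docs):
    """Level 1: fetch the LSH bucket; level 2: re-rank candidates exactly."""
    candidates = index.get(signature(query_vec, planes), [])
    return sorted(candidates, key=lambda d: cosine(docs[d], query_vec),
                  reverse=True)
```

Indexing hashes every compact document vector once; a query touches only its own bucket, so the expensive exact similarity is computed for a handful of candidates rather than the whole corpus.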
Finding duplicates in a data stream
 in Proc. 20th Annual Symposium on Discrete Algorithms (SODA), 2009
Cited by 3 (2 self)
Given a data stream of length n over an alphabet [m] where n > m, we consider the problem of finding a duplicate in a single pass. We give a randomized algorithm for this problem that uses O((log m)^3) space. This answers a question of Muthukrishnan [Mut05] and Tarui [Tar07], who asked whether this problem could be solved using sublinear space and one pass over the input. Our algorithm solves the more general problem of finding a positive-frequency element in a stream given by frequency updates where the sum of all frequencies is positive. Our main tool is an Isolation Lemma that reduces this problem to the task of detecting and identifying a dictatorial variable in a Boolean halfspace. We also present various relaxations of the condition n > m under which duplicates can be found efficiently.
Relevance-based Verification of VANET Safety Messages
 in Proceedings of IEEE International Conference on Communications (ICC), 2012
Cited by 2 (2 self)
Abstract—Authentication of vehicular safety messages poses a challenge in high-density road-traffic scenarios, as the verification time for gathered messages grows longer than the average inter-arrival time. This may expose a vehicular network entity to several different security attacks. Existing solutions have addressed the issue either by randomizing the verification candidates or by using aggregated signature verification schemes, both of which have shortcomings in terms of applicability to vehicular communications. We propose a novel solution to vehicular message authentication in dense traffic conditions by introducing a prioritized verification strategy. Based on the relevance of the physical parameters of neighboring vehicles, received safety messages are assigned different priority scores at the verifying entity. In heavy traffic, when resources are scarce, a verifier randomly authenticates selected received messages according to their priorities. Performance evaluation has shown that our approach is scalable, resource-efficient, and compatible with any underlying authentication scheme.
New estimation algorithms for streaming data: Count-min can do more (extended version). http://www.cs.ualberta.ca/~fandeng/paper/cmm_full.pdf
Cited by 2 (0 self)
Count-min is a general-purpose data stream summary technique that can be used to answer multiple types of approximate queries, such as multiplicity (a.k.a. point) queries and join and self-join size estimations, and it has some nice properties, such as a one-sided error guarantee, better space bounds, and more accurate estimates for highly skewed data, in comparison with the best known alternatives. However, based on our experiments with multiplicity queries and self-join size estimations on both synthetic and real data sets, we find that in practice the previous Count-min estimation algorithms only perform well when the data set is highly skewed; in other cases, they give much less accurate results than Fast-AGMS (a.k.a. Count-sketch), an improvement on the influential AMS sketching technique. In this paper, based on the existing Count-min data structure, we propose two new estimation algorithms for multiplicity queries and self-join size estimations, which significantly improve estimation accuracy compared with the previous Count-min estimation algorithms when the data set is less skewed, exactly where the previous algorithms perform poorly. Moreover, we show both in theory and in practice that the performance of our algorithms is very similar to that of Fast-AGMS regardless of the input data distribution. Thus, with both the new and the previous estimation algorithms, we argue that Count-min is more flexible and powerful than Fast-AGMS: Count-min performs almost the same as Fast-AGMS in terms of both estimation accuracy and time efficiency using our new estimation algorithms, while exhibiting the other nice properties mentioned above using the previous estimation algorithms.
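For reference, a minimal sketch of the underlying Count-min structure with the classic min estimator, i.e. one of the "previous" estimation algorithms the abstract refers to, not the paper's new ones. The 2-universal hash family is a textbook choice assumed for the sketch:

```python
import random

class CountMin:
    """Count-min sketch: d rows of w counters with pairwise-independent hashes."""
    def __init__(self, w, d, seed=1):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        # h(x) = ((a*x + b) mod p) mod w with prime p: a textbook 2-universal family.
        self.p = 2**31 - 1
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(d)]

    def _pos(self, row, x):
        a, b = self.hashes[row]
        return ((a * x + b) % self.p) % self.w

    def update(self, x, c=1):
        for r in range(self.d):
            self.table[r][self._pos(r, x)] += c

    def query(self, x):
        # The classic one-sided estimate: taking the minimum over rows can
        # overestimate (collisions add counts) but never underestimate.
        return min(self.table[r][self._pos(r, x)] for r in range(self.d))
```

The one-sided error guarantee mentioned in the abstract is visible directly: every row's cell holds the item's true count plus whatever collided into it, so the minimum over rows is always at least the true count.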
Capacity and Robustness Tradeoffs in Bloom Filters for Distributed Applications
Cited by 2 (0 self)
Abstract—The Bloom filter is a space-efficient data structure often employed in distributed applications to save bandwidth during data exchange. These savings, however, come at the cost of errors in the shared data, which are usually assumed to be low enough not to disrupt the application. We argue that this assumption does not hold in a more hostile environment, such as the Internet, where attackers can send a carefully crafted Bloom filter in order to break the application. In this paper, we propose the concatenated Bloom filter (CBF), a robust Bloom filter that prevents the attacker from interfering with the shared information, protecting the application data while still providing space efficiency. Instead of using a single large filter, the CBF concatenates small sub-filters to improve both the filter's robustness and its capacity. We propose three CBF variants and provide analytical results that show the efficacy of the CBF in different scenarios. We also evaluate the performance of our filter in an IP traceback application, and simulation results confirm the effectiveness of the proposed mechanism in the face of attackers. Index Terms—Bloom filters, distributed applications, security, IP traceback.