Results 1–10 of 56
Approximation Algorithms for Projective Clustering
 Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia
, 2000
Abstract

Cited by 302 (22 self)
We consider the following two instances of the projective clustering problem: Given a set S of n points in R^d and an integer k > 0, cover S by k hyperstrips (resp. hypercylinders) so that the maximum width of a hyperstrip (resp. the maximum diameter of a hypercylinder) is minimized. Let w* be the smallest value so that S can be covered by k hyperstrips (resp. hypercylinders), each of width (resp. diameter) at most w*. In the plane, the two problems are equivalent. It is NP-hard to compute k planar strips of width even at most Cw*, for any constant C > 0 [50]. This paper contains four main results related to projective clustering: (i) For d = 2, we present a randomized algorithm that computes O(k log k) strips of width at most 6w* that cover S. Its expected running time is O(nk^2 log^4 n) if k^2 log k ≤ n; it also works for larger values of k, but then the expected running time is O(n^(2/3) k^(8/3) log^4 n). We also propose another algorithm that computes a c...
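As a toy illustration of the planar objective above (not the paper's algorithm): the width of the narrowest strip covering a point set in a fixed direction is just the extent of the points' projections onto the strip's normal, so sampling directions gives a crude estimate of w* for k = 1.

```python
import math

def strip_width(points, theta):
    """Width of the narrowest strip with unit normal (cos theta, sin theta)
    that covers all points: the extent of their projections onto the normal."""
    nx, ny = math.cos(theta), math.sin(theta)
    proj = [nx * x + ny * y for x, y in points]
    return max(proj) - min(proj)

def approx_min_width(points, samples=360):
    """Crude estimate of w* for k = 1 by sampling strip orientations."""
    return min(strip_width(points, math.pi * i / samples) for i in range(samples))
```

For collinear points the estimate is (numerically) zero; for the unit square it is 1, attained by an axis-aligned strip.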
Efficient Search for Approximate Nearest Neighbor in High Dimensional Spaces
, 1998
Abstract

Cited by 215 (9 self)
We address the problem of designing data structures that allow efficient search for approximate nearest neighbors. More specifically, given a database consisting of a set of vectors in some high-dimensional Euclidean space, we want to construct a space-efficient data structure that would allow us to search, given a query vector, for the closest or nearly closest vector in the database. We also address this problem when distances are measured by the L_1 norm, and in the Hamming cube. Significantly improving and extending recent results of Kleinberg, we construct data structures whose size is polynomial in the size of the database, and search algorithms that run in time nearly linear or nearly quadratic in the dimension (depending on the case; the extra factors are polylogarithmic in the size of the database).
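The classical multi-table idea such data structures build on can be sketched for the Hamming cube with bit-sampling hashes; `build_tables`, `n_tables`, and `bits` below are illustrative names and parameters, not from the paper.

```python
import random

def build_tables(points, dim, n_tables=8, bits=6, seed=0):
    """One hash table per random coordinate subset (bit-sampling LSH)."""
    rng = random.Random(seed)
    tables = []
    for _ in range(n_tables):
        coords = rng.sample(range(dim), bits)
        table = {}
        for idx, p in enumerate(points):
            table.setdefault(tuple(p[c] for c in coords), []).append(idx)
        tables.append((coords, table))
    return tables

def query(tables, points, q):
    """Collect candidates from the query's bucket in every table, then
    return the candidate closest to q in Hamming distance."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    candidates = set()
    for coords, table in tables:
        candidates.update(table.get(tuple(q[c] for c in coords), []))
    return min(candidates, key=lambda i: hamming(points[i], q), default=None)
```

With several tables, a near neighbor of the query lands in the same bucket as the query in at least one of them with good probability.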
Multidimensional range queries in sensor networks
 In Proc. of the 1st International Conference on Embedded Networked Sensor Systems, SenSys
Abstract

Cited by 152 (15 self)
In many sensor networks, data or events are named by attributes. Many of these attributes have scalar values, so one natural way to query events of interest is to use a multidimensional range query. An example is: “List all events whose temperature lies between 50° and 60°, and whose light levels lie between 10 and 15.” Such queries are useful for correlating events occurring within the network. In this paper, we describe the design of a distributed index that scalably supports multidimensional range queries. Our distributed index for multidimensional data (or DIM) uses a novel geographic embedding of a classical index data structure, and is built upon the GPSR geographic routing algorithm. Our analysis reveals that, under reasonable assumptions about query distributions, DIMs scale quite well with network size (both insertion and query costs scale as O(√N)). In detailed simulations, we show that in practice, the insertion and query costs of other alternatives are sometimes an order of magnitude more than the costs of DIMs, even for moderately sized networks. Finally, experiments on a small-scale testbed validate the feasibility of DIMs.
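A minimal sketch of the kind of embedding DIM relies on: alternately halving each attribute's range (as in a k-d split) maps a multi-attribute event to a binary zone code, so events in the same sub-range share a code prefix. The helper below is hypothetical, not the paper's actual encoding.

```python
def zone_code(event, bounds, depth=8):
    """Map a multi-attribute event to a binary zone code by alternately
    halving each attribute's range, k-d-tree style (hypothetical sketch
    of a DIM-like embedding, not the paper's exact scheme)."""
    lo = [b[0] for b in bounds]
    hi = [b[1] for b in bounds]
    code = []
    for i in range(depth):
        a = i % len(event)            # alternate over attributes
        mid = (lo[a] + hi[a]) / 2
        if event[a] < mid:
            code.append(0)
            hi[a] = mid
        else:
            code.append(1)
            lo[a] = mid
    return tuple(code)
```

A range query then only needs to visit zones whose code prefixes intersect the queried hyper-rectangle.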
Order-Preserving Symmetric Encryption
Abstract

Cited by 92 (2 self)
We initiate the cryptographic study of order-preserving symmetric encryption (OPE), a primitive suggested in the database community by Agrawal et al. (SIGMOD ’04) for allowing efficient range queries on encrypted data. Interestingly, we first show that a straightforward relaxation of standard security notions for encryption, such as indistinguishability against chosen-plaintext attack (IND-CPA), is unachievable by a practical OPE scheme. Instead, we propose a security notion in the spirit of pseudorandom functions (PRFs) and related primitives, asking that an OPE scheme look “as random as possible” subject to the order-preserving constraint. We then design an efficient OPE scheme and prove its security under our notion based on pseudorandomness of an underlying block cipher. Our construction is based on a natural relation we uncover between a random order-preserving function and the hypergeometric probability distribution. In particular, it makes black-box use of an efficient sampling algorithm for the latter.
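A random order-preserving function can be illustrated (ignoring keys and the lazy hypergeometric sampling the actual scheme uses) by drawing a random strictly increasing table from the plaintext space into a larger ciphertext space:

```python
import random

def random_ope_table(m, n, seed=0):
    """Draw a random order-preserving injection from {0..m-1} into {0..n-1}.
    The real scheme never materializes the table: it samples entries lazily
    via the hypergeometric distribution under a secret key. This unkeyed
    version is purely illustrative."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n), m))

table = random_ope_table(16, 1024)

def encrypt(x):
    return table[x]
```

Order is preserved by construction, so a range query [a, b] on plaintexts becomes the range [encrypt(a), encrypt(b)] on ciphertexts.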
CloudAV: N-Version Antivirus in the Network Cloud
Abstract

Cited by 73 (6 self)
Antivirus software is one of the most widely used tools for detecting and stopping malicious and unwanted files. However, the long-term effectiveness of traditional host-based antivirus is questionable. Antivirus software fails to detect many modern threats and its increasing complexity has resulted in vulnerabilities that are being exploited by malware. This paper advocates a new model for malware detection on end hosts based on providing antivirus as an in-cloud network service. This model enables identification of malicious and unwanted software by multiple, heterogeneous detection engines in parallel, a technique we term ‘N-version protection’. This approach provides several important benefits including better detection of malicious software, enhanced forensics capabilities, retrospective detection, and improved deployability and management. To explore this idea we construct and deploy a production-quality in-cloud antivirus system called CloudAV. CloudAV includes a lightweight, cross-platform host agent and a network service with ten antivirus engines and two behavioral detection engines. We evaluate the performance, scalability, and efficacy of the system using data from a real-world deployment lasting more than six months and a database of 7220 malware samples covering a one-year period. Using this dataset we find that CloudAV provides 35% better detection coverage against recent threats compared to a single antivirus engine and a 98% detection rate across the full dataset. We show that the average length of time to detect new threats by an antivirus engine is 48 days and that retrospective detection can greatly minimize the impact of this delay. Finally, we relate two case studies demonstrating how the forensics capabilities of CloudAV were used by operators during the deployment.
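At its core, the N-version idea is fanning a sample out to several engines and aggregating their verdicts; the sketch below assumes a hypothetical engine API where each engine is simply a callable returning True for “malicious”.

```python
from concurrent.futures import ThreadPoolExecutor

def n_version_scan(sample, engines, threshold=1):
    """Run several detection engines in parallel and flag the sample when at
    least `threshold` engines report it malicious. Each engine is a callable
    returning True for 'malicious' (a hypothetical API, not CloudAV's)."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        verdicts = list(pool.map(lambda engine: engine(sample), engines))
    return sum(verdicts) >= threshold, verdicts
```

Raising `threshold` trades detection coverage for fewer false positives across the heterogeneous engines.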
Entropy based nearest neighbor search in high dimensions
 In Proc. 17th Ann. ACM-SIAM Symposium on Discrete Algorithms
, 2006
Abstract

Cited by 51 (5 self)
In this paper we study the problem of finding the approximate nearest neighbor of a query point in high-dimensional space, focusing on the Euclidean space. The earlier approaches use locality-preserving hash functions (that tend to map nearby points to the same value) to construct several hash tables to ensure that the query point hashes to the same bucket as its nearest neighbor in at least one table. Our approach is different – we use one (or a few) hash tables and hash several randomly chosen points in the neighborhood of the query point, showing that at least one of them will hash to the bucket containing its nearest neighbor. We show that the number of randomly chosen points in the neighborhood of the query point q required depends on the entropy of the hash value h(p) of a random point p at the same distance from q as its nearest neighbor, given q and the locality-preserving hash function h chosen randomly from the hash family. Precisely, we show that if the entropy I(h(p) | q, h) = M and g is a bound on the probability that two far-off points will hash to the same bucket, then we can find the approximate nearest neighbor in O(n^ρ) time and near-linear Õ(n) space, where ρ = M / log(1/g). Alternatively, we can build a data structure of size Õ(n^(1/(1−ρ))) to answer queries in Õ(d) time. By applying this analysis to the locality-preserving hash functions in [17, 21, 6] and adjusting the parameters, we show that the c-approximate nearest neighbor can be computed in time Õ(n^ρ) and near-linear space, where ρ ≈ 2.06/c as c becomes large.
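The contrast with the multi-table approach can be sketched on the Hamming cube: keep a single table and probe it with random points near the query. Names and parameters below are illustrative, not the paper's.

```python
import random

def build_table(points, coords):
    """A single bit-sampling hash table over the chosen coordinates."""
    table = {}
    for i, p in enumerate(points):
        table.setdefault(tuple(p[c] for c in coords), []).append(i)
    return table

def entropy_query(table, points, coords, q, probes=10, radius=1, seed=0):
    """Probe the one table with several random points near q, instead of
    keeping many tables; return the closest candidate found."""
    rng = random.Random(seed)
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    candidates = set(table.get(tuple(q[c] for c in coords), []))
    for _ in range(probes):
        r = list(q)
        for c in rng.sample(range(len(q)), radius):
            r[c] ^= 1                     # flip a bit: a random point near q
        candidates.update(table.get(tuple(r[c] for c in coords), []))
    return min(candidates, key=lambda i: hamming(points[i], q), default=None)
```

The space saving comes from replacing many tables with many cheap probes of one table.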
Reductions Among High Dimensional Proximity Problems
, 2000
Abstract

Cited by 36 (4 self)
We present improved running times for a wide range of approximate high-dimensional proximity problems. We obtain subquadratic running time for each of these problems. These improved running times are obtained by reduction to Nearest Neighbour queries. The problems we consider in this paper are Approximate Diameter, Approximate Furthest Neighbours, Approximate Discrete Center, Approximate Line Center, Approximate Metric Facility Location, Approximate Bottleneck Matching, and Approximate Minimum Weight Matching.

Problem    Ref     Approx.   Time                        Comments
Diameter   [10]    √3        O(dn)
           [12]    1+ε       O(dn log n + n²)
           [2]     1+ε       Õ(n^(2−O(ε²)) + dn)
           [18]    1+ε       Õ(n^(1+1/(1+ε/6)) + dn)
           here    1+ε       Õ(n^(1+1/(1+ε)) + dn)      Õ(n) (1+ε)-NNS queries
           here    √2        Õ(dn)                      see Section 3 for some e...
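For a flavor of the diameter problem: the classic observation that the distance from an arbitrary point to its furthest neighbour is at least half the true diameter already gives a 2-approximation in O(dn) time; the paper obtains sharper factors by reducing to approximate nearest-neighbour queries.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def approx_diameter(points):
    """2-approximation of the diameter: scan once from an arbitrary point;
    its distance to its furthest neighbour is at least half the diameter."""
    p = points[0]
    return max(dist(p, r) for r in points)
```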
Approximation Algorithms for k-Line Center
, 2002
Abstract

Cited by 32 (5 self)
Given a set P of n points in R^d and an integer k ≥ 1, let w* denote the minimum value so that P can be covered by k cylinders of width at most w*. We describe an algorithm that, given P and an ε > 0, computes k cylinders of width at most (1 + ε)w* that cover P. The running time of the algorithm is O(n log n), with the constant of proportionality depending on k, d, and ε. The running times of the fastest algorithms that compute w* exactly are of the order of n^O(dk). An approximation algorithm with near-linear dependence on n for k > 1 was only known for the planar 2-line center problem, i.e., the case k = 2, d = 2. We believe that the techniques used in showing this result are quite useful in themselves. We first show that there exists a small “certificate” Q ⊆ P, whose size does not depend on n, such that for any k cylinders that cover Q, an enlargement of these cylinders by a factor of (1 + ε) covers P. We only establish the existence of a small certificate and our proof does not give us an efficient way of constructing one. We then observe that a well-known scheme based on sampling and iterated reweighting gives us an efficient algorithm for solving the problem. Only the existence of a small certificate is used to establish the correctness of the algorithm. This technique is quite general and can be used in other contexts as well.
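The sampling and iterated-reweighting scheme the authors invoke can be sketched generically as a Clarkson-style loop; `solve_on_sample` and `violated` are caller-supplied stand-ins, and the toy test covers 1-D points by an enlarged interval (the analogue of the (1 + ε) enlargement).

```python
import random

def iterated_reweighting(points, solve_on_sample, violated, sample_size,
                         rounds=50, seed=0):
    """Clarkson-style sampling + iterated reweighting: solve on a small
    weighted sample, then double the weights of points the candidate
    solution fails to cover, so neglected points soon force their way in."""
    rng = random.Random(seed)
    weights = [1.0] * len(points)
    for _ in range(rounds):
        sample = rng.choices(range(len(points)), weights=weights, k=sample_size)
        sol = solve_on_sample([points[i] for i in sample])
        bad = [i for i in range(len(points)) if violated(sol, points[i])]
        if not bad:
            return sol
        for i in bad:
            weights[i] *= 2.0
    return None
```

Correctness needs only that every small sample admits a solution whose enlargement covers the sample's "certificate", mirroring the argument in the abstract.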
A Framework for Semantic Link Discovery over Relational Data
 In CIKM 2009
Abstract

Cited by 27 (6 self)
In this paper, we present a framework for online discovery of semantic links from relational data. Our framework is based on declarative specification of the linkage requirements by the user, which allows matching data items in many real-world scenarios. These requirements are translated to queries that can run over the relational data source, potentially using semantic knowledge to enhance the accuracy of link discovery. Our framework lets data publishers easily find and publish high-quality links to other data sources, and therefore could significantly enhance the value of the data in the next generation of the Web.
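A minimal sketch of translating a declarative linkage spec into a query over a relational source; the spec format, the tables, and the case-insensitive match (standing in for a real similarity predicate) are all hypothetical, not the paper's framework.

```python
import sqlite3

# Hypothetical linkage spec: link rows whose titles match case-insensitively.
spec = {"source": ("papers", "title"), "target": ("talks", "name")}

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE papers(id INTEGER, title TEXT);
    CREATE TABLE talks(id INTEGER, name TEXT);
    INSERT INTO papers VALUES (1, 'Order-Preserving Encryption');
    INSERT INTO talks  VALUES (7, 'order-preserving encryption');
""")

# Translate the declarative spec into a query over the relational source.
sql = (f"SELECT s.id, t.id FROM {spec['source'][0]} s, {spec['target'][0]} t "
       f"WHERE lower(s.{spec['source'][1]}) = lower(t.{spec['target'][1]})")
links = conn.execute(sql).fetchall()
```

Because the spec compiles to plain SQL, the link discovery runs wherever the data already lives.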