Results 1-10 of 17
Hashing algorithms for large-scale learning
In NIPS, 2011
Abstract

Cited by 24 (9 self)
Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare b-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.
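The integration with linear learners described above boils down to two steps: compute k b-bit minwise hashes per binary vector, then one-hot expand them into a k·2^b-dimensional binary feature vector that a linear SVM can consume. A minimal sketch under that reading, with permutations simulated by random linear hash functions (the function names and constants are illustrative, not the paper's code):

```python
import random

def bbit_minwise(s, k=4, b=2, dim=2**31 - 1, seed=0):
    """Return k b-bit minwise hash values for a set of feature indices s."""
    rng = random.Random(seed)
    out = []
    for _ in range(k):
        a, c = rng.randrange(1, dim), rng.randrange(dim)
        m = min((a * x + c) % dim for x in s)  # min under one random hash
        out.append(m & ((1 << b) - 1))         # keep only the lowest b bits
    return out

def expand(hashes, b=2):
    """One-hot expand the b-bit values so a linear model can consume them."""
    vec = [0] * (len(hashes) * (1 << b))
    for i, h in enumerate(hashes):
        vec[i * (1 << b) + h] = 1
    return vec
```

Similar sets collide in many of the k positions, so their expanded vectors overlap and the linear inner product approximates resemblance.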
One permutation hashing.
In NIPS, Lake Tahoe, NV, 2012
Abstract

Cited by 16 (9 self)
Minwise hashing is a standard procedure in the context of search, for efficiently estimating set similarities in massive binary data such as text. Recently, b-bit minwise hashing has been applied to large-scale learning and sublinear-time near-neighbor search. The major drawback of minwise hashing is the expensive preprocessing, as the method requires applying, e.g., k = 200 to 500 permutations on the data. This paper presents a simple solution called one permutation hashing. Conceptually, given a binary data matrix, we permute the columns once and divide the permuted columns evenly into k bins; and we store, for each data vector, the smallest nonzero location in each bin. The probability analysis illustrates that this one permutation scheme should perform similarly to the original (k-permutation) minwise hashing. Our experiments with training SVM and logistic regression confirm that one permutation hashing can achieve similar (or even better) accuracies compared to the k-permutation scheme. See more details in arXiv:1208.1259.
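The one permutation scheme described above can be sketched directly: permute the column indices once, split them into k equal bins, and record the smallest permuted location in each bin (an empty marker when a bin receives no nonzero). A minimal illustration; the names, and the simplifying assumption that k divides D, are mine:

```python
import random

def one_permutation_hash(nonzeros, D, k, seed=0):
    """Permute D columns once, split into k bins of equal width, and keep
    the smallest permuted location per bin (None marks an empty bin).
    Assumes D is divisible by k for simplicity."""
    rng = random.Random(seed)
    perm = list(range(D))
    rng.shuffle(perm)                      # the single permutation
    width = D // k
    bins = [None] * k
    for col in nonzeros:
        p = perm[col]                      # permuted location of this nonzero
        j = p // width                     # which bin it falls into
        offset = p % width                 # location within that bin
        if bins[j] is None or offset < bins[j]:
            bins[j] = offset
    return bins
```

One shuffle replaces the k independent permutations of classic minwise hashing, which is where the preprocessing savings come from.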
Densifying one permutation hashing via rotation for fast near neighbor search
2013
Abstract

Cited by 11 (6 self)
The query complexity of locality sensitive hashing (LSH) based similarity search is dominated by the number of hash evaluations, and this number grows with the data size (Indyk & Motwani, 1998). In industrial applications such as search, where the data are often high-dimensional and binary (e.g., text n-grams), minwise hashing is widely adopted, which requires applying a large number of permutations on the data. This is costly in computation and energy consumption. In this paper, we propose a hashing technique which generates all the necessary hash evaluations needed for similarity search using one single permutation. The heart of the proposed hash function is a “rotation” scheme which densifies the sparse sketches of one permutation hashing (Li et al., 2012) in an unbiased fashion, thereby maintaining the LSH property. This makes the obtained sketches suitable for hash table construction. The idea of rotation presented in this paper could be of independent interest for densifying other types of sparse sketches. Using our proposed hashing method, the query time of a (K, L)-parameterized LSH is reduced from the typical O(dKL) complexity to merely O(KL + dL), where d is the number of nonzeros of the data vector, K is the number of hashes in each hash table, and L is the number of hash tables. Our experimental evaluation on real data confirms that the proposed scheme significantly reduces the query processing time over minwise hashing without loss in retrieval accuracies.
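The “rotation” densification the abstract describes fills each empty bin of a one permutation sketch by borrowing from a neighboring non-empty bin. A minimal sketch of the idea (not the paper's exact scheme; the offset convention used here is illustrative):

```python
def densify(bins, width):
    """Fill each empty bin (None) with the value of the nearest non-empty
    bin to its circular right, shifted by a multiple of the bin width so
    borrowed values stay distinguishable from native ones. Requires at
    least one non-empty bin, which holds for any nonempty input vector."""
    assert any(v is not None for v in bins), "need at least one filled bin"
    k = len(bins)
    out = list(bins)
    for j in range(k):
        if out[j] is None:
            t = 1
            while bins[(j + t) % k] is None:   # distance to the donor bin
                t += 1
            out[j] = bins[(j + t) % k] + t * width
    return out
```

For instance, densify([None, 3, None, 1], width=5) returns [8, 3, 6, 1]: every bin now carries a hash value, so the sketch can index into hash tables like a classic k-permutation minhash.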
Efficient Document Clustering via Online Nonnegative Matrix Factorizations
Abstract

Cited by 10 (1 self)
In recent years, Nonnegative Matrix Factorization (NMF) has received considerable interest from the data mining and information retrieval fields. NMF has been successfully applied in document clustering, image representation, and other domains. This study proposes an online NMF (ONMF) algorithm to efficiently handle very large-scale and/or streaming datasets. Unlike conventional NMF solutions, which require the entire data matrix to reside in memory, our ONMF algorithm proceeds with one data point or one chunk of data points at a time. Experiments with one-pass and multi-pass ONMF on real datasets are presented.
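The chunk-at-a-time processing the abstract describes can be sketched with multiplicative updates driven by running sufficient statistics, so the full data matrix never has to be in memory at once. This is an illustrative reading of online NMF, not the paper's exact updates; all names are mine:

```python
import numpy as np

def onmf(chunks, r, n_inner=50, eps=1e-9, seed=0):
    """Factor X ~ W H online: each chunk of columns is fit once, then
    discarded; only W and the running statistics A, B are retained."""
    rng = np.random.default_rng(seed)
    d = chunks[0].shape[0]
    W = rng.random((d, r)) + eps
    A = np.zeros((r, r))                 # accumulates H @ H.T
    B = np.zeros((d, r))                 # accumulates X @ H.T
    for X in chunks:                     # one chunk of data points per step
        H = rng.random((r, X.shape[1])) + eps
        for _ in range(n_inner):         # multiplicative updates for H only
            H *= (W.T @ X) / (W.T @ W @ H + eps)
        A += H @ H.T
        B += X @ H.T
        W *= B / (W @ A + eps)           # refresh W from the running stats
        W = np.maximum(W, eps)           # keep W strictly nonnegative
    return W
```

The multiplicative form keeps both factors nonnegative automatically, and memory use is independent of the number of data points.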
Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). arXiv
2014
Beyond pairwise: Provably fast algorithms for approximate k-way similarity search
In NIPS, Lake Tahoe, NV, 2013
Abstract

Cited by 5 (5 self)
We go beyond the notion of pairwise similarity and look into search problems with k-way similarity functions. In this paper, we focus on problems related to 3-way Jaccard similarity: R3way = |S1 ∩ S2 ∩ S3| / |S1 ∪ S2 ∪ S3|, S1, S2, S3 ∈ C, where C is a size-n collection of sets (or binary vectors). We show that approximate R3way similarity search problems admit fast algorithms with provable guarantees, analogous to the pairwise case. Our analysis and speedup guarantees naturally extend to k-way resemblance. In the process, we extend the traditional framework of locality sensitive hashing (LSH) to handle higher-order similarities, which could be of independent theoretical interest. The applicability of R3way search is shown on the “Google Sets” application. In addition, we demonstrate the advantage of R3way resemblance over the pairwise case in improving retrieval quality.
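The 3-way resemblance above is easy to compute exactly for small sets, and under a single random permutation the probability that all three sets share the same minimum equals R3way, which is what makes hashing applicable. A small sanity-check sketch (not the paper's sublinear search algorithm):

```python
import random

def r3way(s1, s2, s3):
    """Exact |S1 ∩ S2 ∩ S3| / |S1 ∪ S2 ∪ S3| for sets of feature indices."""
    union = s1 | s2 | s3
    if not union:
        return 0.0
    return len(s1 & s2 & s3) / len(union)

def r3way_minhash(s1, s2, s3, k=1000, seed=0):
    """Estimate R3way as the fraction of k random permutations under which
    all three sets attain the same minimum (the 3-way collision rate)."""
    rng = random.Random(seed)
    universe = list(s1 | s2 | s3)
    hits = 0
    for _ in range(k):
        rng.shuffle(universe)
        rank = {x: i for i, x in enumerate(universe)}
        if (min(rank[x] for x in s1)
                == min(rank[x] for x in s2)
                == min(rank[x] for x in s3)):
            hits += 1
    return hits / k
```

The same collision-probability identity holds for k-way resemblance with k sets, which is the extension the abstract mentions.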
b-bit minwise hashing in practice
In Internetware, 2013
Abstract

Cited by 3 (1 self)
Minwise hashing is a standard technique in the context of search for approximating set similarities. The recent work [26, 32] demonstrated a potential use of b-bit minwise hashing [23, 24] for efficient search and learning on massive, high-dimensional, binary data (which are typical for many applications in Web search and text mining). In this paper, we focus on a number of critical issues which must be addressed before one can apply b-bit minwise hashing to the volumes of data often used in industrial applications. Minwise hashing requires an expensive preprocessing step that computes k (e.g., 500) minimal values after applying the corresponding permutations to each data vector. We developed a parallelization scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20-80, becoming substantially smaller than the data loading time. Reducing the preprocessing time is highly beneficial in practice, e.g., for duplicate Web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers. Another critical issue is that for very large data sets it becomes impossible to store a (fully) random permutation matrix, due to its space requirements. Our paper is the first study to demonstrate that b-bit minwise hashing implemented using simple hash functions, e.g., the 2-universal (2U) and 4-universal (4U) hash families, can produce learning results very similar to those obtained with fully random permutations. Experiments on datasets of up to 200 GB are presented.
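Replacing stored permutations with 2-universal hash functions, as the abstract discusses, means each "permutation" is just h(x) = ((a·x + b) mod p) mod D for random a, b and a fixed prime p, so only two integers per hash need storing. A minimal sketch (the modulus choice and helper names are illustrative):

```python
import random

def make_2u_hash(D, p=(1 << 61) - 1, seed=0):
    """One member of the 2-universal family h(x) = ((a*x + b) mod p) mod D;
    p is a fixed prime larger than the universe size."""
    rng = random.Random(seed)
    a, b = rng.randrange(1, p), rng.randrange(p)
    return lambda x: ((a * x + b) % p) % D

def minhash_2u(nonzeros, D, k, seed=0):
    """k minwise values using 2U hashes instead of stored permutations."""
    return [min(map(make_2u_hash(D, seed=seed + i), nonzeros))
            for i in range(k)]
```

Storage drops from k permutation tables of size D to 2k integers, which is the practical point the paper makes.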
Hashing for Similarity Search: A Survey
2014
Abstract

Cited by 2 (2 self)
Similarity search (nearest neighbor search) is the problem of finding, in a large database, the data items whose distances to a query item are smallest. Various methods have been developed to address this problem, and recently a lot of effort has been devoted to approximate search. In this paper, we present a survey of one of the main solutions, hashing, which has been widely studied since the pioneering work on locality sensitive hashing. We divide the hashing algorithms into two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution, and learning to hash, which learns hash functions according to the data distribution. We review them from various aspects, including hash function design, distance measure, and search scheme in the hash coding space.
Is Min-Wise Hashing Optimal for Summarizing Set Intersection?
Abstract

Cited by 2 (1 self)
Minwise hashing is an important method for estimating the size of the intersection of sets, based on a succinct summary (a “minhash”) independently computed for each set. One application is estimation of the number of data points that satisfy the conjunction of m ≥ 2 simple predicates, where a minhash is available for the set of points satisfying each predicate. This has applications in query optimization and for approximate computation of COUNT aggregates. In this paper we address the question: how many bits is it necessary to allocate to each summary in order to get an estimate with 1 ± ε relative error? The state-of-the-art technique for minimizing the encoding size, for any desired estimation error, is b-bit minwise hashing due to Li and König (Communications of the ACM, 2011). We give new lower and upper bounds:
• Using information complexity arguments, we show that b-bit minwise hashing is space optimal for m = 2 predicates, in the sense that the estimator’s variance is within a constant factor of the smallest possible among all summaries with the given space usage. But for conjunctions of m > 2 predicates we show that the performance of b-bit minwise hashing (and more generally any method based on “k-permutation” minhash) deteriorates as m grows.
• We describe a new summary that nearly matches our lower bound for m ≥ 2. It asymptotically outperforms all k-permutation schemes (by around a factor of Ω(m / log m)), as well as methods based on subsampling (by a factor of Ω(log n_max), where n_max is the maximum set size).
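For intuition on the m = 2 case discussed above, a common estimator (a sketch for intuition, not the paper's new summary) recovers resemblance R from b-bit collision rates and then the intersection size from the set sizes, using |A ∩ B| = R·(|A| + |B|) / (1 + R):

```python
def estimate_intersection(bits1, bits2, n1, n2, b):
    """Estimate |A ∩ B| from two lists of k b-bit minhash values plus the
    set sizes n1, n2. C = 2^-b approximates the accidental collision rate
    of the lowest b bits for large sets, and is subtracted out."""
    k = len(bits1)
    p_hat = sum(x == y for x, y in zip(bits1, bits2)) / k
    c = 2.0 ** -b
    r_hat = max((p_hat - c) / (1.0 - c), 0.0)  # debiased resemblance
    return r_hat / (1.0 + r_hat) * (n1 + n2)   # |A ∩ B| from R and sizes
```

The variance of p_hat (and hence of the estimate) shrinks as 1/k, which is the quantity the paper's lower bounds trade off against the b·k bits of storage.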