14 citations found.
Andrei Z. Broder. Filtering near-duplicate documents. In Proc. of FUN, Isola d'Elba, Italy, 1998.

This paper is cited in the following contexts:
Estimating Rarity and Similarity over Data Stream Windows - Datar, Muthukrishnan (2002)   (5 citations)  (Correct)

....between these two windowed data streams at time t is defined as $\sigma_t = \dots$ The problem is to estimate the similarity $\sigma_t$ at any time t. This is the classical notion of similarity between two sets. It has been useful in estimating transitive closures [11], web page duplicate detection [4], and data mining [12, 35], among other things. Rarity $\rho$ finds many data mining applications. For example, consider the data stream of IP addresses that access any online service like a search engine, online store like Amazon, online newspapers etc. The set of rare IP address (for the appropriate ....

....passes over it. One may be tempted to randomly sample each item, but we need to coordinate the sampling of an item when it applies multiple times, and this calls for sampling from the universe, rather than the individual items in the data stream. We use min wise hashing, which was introduced in [11, 4] to sample from the universe, and are able to derive an unbiased estimator for rarity. Min wise hashing, per se is expensive to compute, but suitable approximations can be generated [38, 41] and we show that these suffice to solve our problem. Now let us consider the additional complication in ....

A. Broder. Filtering Near-Duplicate Documents. In Proceedings of FUN, 1998.


Evaluating Strategies for Similarity Search on the Web - Haveliwala, Gionis, Klein.. (2002)   (20 citations)  (Correct)

....a priori assumption about what are the best features for document representation. Rather, we develop an evaluation methodology that allows us to select the best features from among a set of different candidates. Approaches algorithmically related to the ones presented in Section 6 have been used in [7, 4], although for the different problem of identifying mirror pages. 9. ACKNOWLEDGMENTS We would like to thank Professor Chris Manning, Professor Jeff Ullman, and Mayur Datar for their insights and invaluable feedback. 10. ....

A. Broder. Filtering Near-duplicate Documents. Proceedings of FUN, 1998.


Background Readings for Collection Synthesis - Bibliography (2002)   (Correct)

....also be based on link structure [46]. If the same page is linked to by two other pages, say, those two pages might be similar. Problems include the fact that newer pages might not yet have links to them, the tightly knit community effect [77], and user idiosyncrasies. Locality sensitive hashing [57, 64, 20] is a clever method that uses a hash function such that the probability of collision of similar pages will be much higher than the probability for dissimilar pages. Page resemblance techniques are useful for many purposes: detecting updated copies, mirrors, plagiarism, etc. It is probably not ....

A. Broder. Filtering near-duplicate documents. In Proceedings of Fun'98, 1998.


Scalable Techniques for Clustering the Web (Extended.. - Haveliwala, Gionis, Indyk (2000)   (Correct)

.... the number of documents, $df_i$ is the overall document frequency of word i, and $f^u_i$ is as before [16]. Finally we normalize the frequencies within each bag so all frequencies sum to a fixed number (in our implementation 100). [Footnote 4: A similar algorithm has been independently discovered by Broder [3].] 2.1 Content Based Bags The generation of content based bags is straightforward. We scan through the web repository, outputting normalized word occurrence frequencies for each document in turn. The following three heuristics are used to improve the quality of the word bags generated: All ....

A. Broder, "Filtering near-duplicate documents", FUN'98.


Min-Wise Independent Permutations - Broder, Charikar, Frieze.. (1998)   (51 citations)  Self-citation (Broder)   (Correct)

....Then we can readily estimate the resemblance of A and B by computing how many corresponding elements in $S_A$ and $S_B$ are common. For a set of documents, we avoid quadratic processing time, because a particular value for any coordinate is usually shared by only a few documents. For details see [7, 8, 11]. In practice, as in the case of hashing discussed above, we have to deal with the sad reality that it is impossible to choose $\pi$ uniformly at random in $S_n$. We are thus led to consider smaller families of permutations that still satisfy the min-wise independence condition given by equation (4) ....

A. Z. Broder. Filtering near-duplicate documents. In Proceedings of FUN 98, 1998. To appear.
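
The excerpt above estimates resemblance by counting matching coordinates of two sketches. A minimal Python sketch of that idea follows, under stated assumptions: the function names are hypothetical, elements are non-negative integers smaller than the prime modulus, sets are non-empty, and random linear maps modulo a prime stand in for truly random permutations (they are only approximately min-wise independent, so this is not the construction from the cited papers).

```python
import random

def make_hash_family(k, seed=0, prime=(1 << 61) - 1):
    """Draw k linear maps x -> (a*x + b) mod prime (assumed stand-in for random permutations)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(0, prime), prime) for _ in range(k)]

def sketch(elements, family):
    """Coordinate i of the sketch is the minimum image of the set under map i."""
    return [min((a * x + b) % p for x in elements) for (a, b, p) in family]

def estimate_resemblance(sketch_a, sketch_b):
    """Fraction of coordinates on which the two sketches agree."""
    matches = sum(1 for u, v in zip(sketch_a, sketch_b) if u == v)
    return matches / len(sketch_a)

family = make_hash_family(k=128, seed=42)
doc_a = set(range(0, 1000))
doc_b = set(range(500, 1500))
print(estimate_resemblance(sketch(doc_a, family), sketch(doc_b, family)))
```

Both sketches must be built with the same family; otherwise the coordinates are not comparable.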


Completeness and Robustness Properties of Min-Wise.. - Broder, Mitzenmacher (1999)   (1 citation)  Self-citation (Broder)   (Correct)

.... where the meaning is clear) if for any set $X \subseteq [n]$ and any $x \in X$, when $\pi$ is chosen at random in P we have $\left| \Pr(\min\{\pi(X)\} = x) - \frac{1}{|X|} \right| \le \frac{\varepsilon}{|X|}$. (5) For further details about the use of these ideas to estimate document similarity see [6, 1, 2]. An optimal (size-wise) construction for an MWI family was obtained by Takei, Itoh, and Shinozaki [13]. Explicit constructions of approximately MWI families were obtained by Indyk [8] and by Saks et al. [11]. For an application of these families to derandomization see [5]. We also note that ....

A. Z. Broder. Filtering near-duplicate documents. In Proceedings of FUN 98, 1998. To appear.


A Derandomization Using Min-Wise Independent Permutations - Broder, Charikar..   (4 citations)  Self-citation (Broder)   (Correct)

.... alternative notion of limited independence based on what we call min-wise independent permutations [4]. Our motivation was the connection to an approach for determining the resemblance of sets, which can be used for example to identify documents on the World Wide Web that are essentially the same [2, 3, 5]. In this paper we demonstrate that the notion of min-wise independence can also prove useful for derandomization. Specifically, we use a polynomial sized construction of approximate min-wise independent permutations due to Indyk to derandomize the parallel approximate set cover algorithm of ....

....be generalized to give us an appropriate family of polynomial size when k is a constant. We note in passing that for estimating the resemblance of documents as in [2] and [5] with a sketch of size k we need one sample from a k minima wise independent family, while for the method presented in [3], we need k separate samples from a min wise independent family. There is an interesting meta principle behind our derandomizations, which appears worth emphasizing here. Remark 6. Let E be an event that depends only on the order of the first k elements of a random permutation. Then any bound on ....

A. Z. Broder. Filtering near-duplicate documents. In Proceedings of FUN 98, 1998. To appear.
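
The second excerpt above mentions a sketch of size k built from a single sample, as opposed to k separate samples. One way to illustrate the single-sample variant is the following Python sketch, under stated assumptions: the names are hypothetical, elements are non-negative integers smaller than `PRIME`, and one linear map modulo a prime stands in for a permutation drawn from a k-minima-wise independent family (an approximation, not the construction from the cited papers).

```python
PRIME = (1 << 61) - 1  # assumed modulus; elements must lie in [0, PRIME)

def bottom_k_sketch(elements, k, a, b):
    """Keep the k smallest images of the set under the single map x -> (a*x + b) mod PRIME."""
    return sorted((a * x + b) % PRIME for x in elements)[:k]

def estimate_from_bottom_k(sketch_a, sketch_b, k):
    """Among the k smallest values of the combined sketches, count those present in both."""
    smallest_of_union = sorted(set(sketch_a) | set(sketch_b))[:k]
    common = set(sketch_a) & set(sketch_b)
    hits = sum(1 for v in smallest_of_union if v in common)
    return hits / len(smallest_of_union)

a, b = 1234567891011, 987654321  # arbitrary illustrative coefficients, a != 0
doc_a = set(range(0, 1000))
doc_b = set(range(500, 1500))
k = 100
print(estimate_from_bottom_k(bottom_k_sketch(doc_a, k, a, b),
                             bottom_k_sketch(doc_b, k, a, b), k))
```

Because the map is injective on the assumed range, a value appears in both sketches only when the underlying element belongs to both sets.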


A Comparison of Techniques to Find Mirrored Hosts on the WWW - Bharat, Broder, Dean, al. (1999)   (17 citations)  Self-citation (Broder)   (Correct)

....the resemblance of the corresponding documents can be computed in time linear in the size of the sketches. Furthermore, clustering a collection of m documents into sets of closely resembling documents can be done in time proportional to m log m rather than $m^2$. For further details see [7, 5, 6]. 4.2 Experimental Results 4.2.1 Precision vs Rank Figure 1 plots precision vs. rank up to 25000 for all our algorithms. The naive hosts algorithm, which uses the least information, performs the worst, with a terminal precision of 0.27 at rank 25000. The hconn1 and hconn2 algorithms, which ....

A. Z. Broder. Filtering near-duplicate documents. In Proceedings of FUN 98, 1998. To appear.


A Derandomization Using Min-Wise Independent.. - Broder, Charikar.. (1998)   (4 citations)  Self-citation (Broder)   (Correct)

.... alternative notion of limited independence based on what we call min-wise independent permutations [4]. Our motivation was the connection to an approach for determining the resemblance of sets, which can be used for example to identify documents on the World Wide Web that are essentially the same [2, 3, 5]. In this paper we demonstrate that the notion of min-wise independence can also prove useful for derandomization. Specifically, we use a polynomial sized construction of approximate min-wise independent permutations due to Indyk to derandomize the parallel approximate set cover algorithm of ....

....be generalized to give us an appropriate family of polynomial size when k is a constant. We note in passing that for estimating the resemblance of documents as in [2] and [5] with a sketch of size k we need one sample from a k minima wise independent family, while for the method presented in [3], we need k separate samples from a min wise independent family. There is an interesting meta principle behind our derandomizations, which appears worth emphasizing here. Remark 1 Let E be an event that depends only on the order of the first k elements of a random permutation. Then any bound on ....

A. Z. Broder. Filtering near-duplicate documents. In Proceedings of FUN 98, 1998. To appear.


A Sketch-based Sampling Algorithm on Sparse Data - Ping Li   (Correct)

No context found.

Andrei Z. Broder. Filtering near-duplicate documents. In Proc. of FUN, Isola d'Elba, Italy, 1998.


Evaluating Strategies for Similarity Search on the Web - Haveliwala, Gionis, Klein.. (2002)   (20 citations)  (Correct)

No context found.

A. Broder. Filtering Near-duplicate Documents. Proceedings of FUN, 1998.


Nearest Neighbors In High-Dimensional Spaces - Indyk (2004)   (1 citation)  (Correct)

No context found.

A. Broder. Filtering near-duplicate documents. Proc. FUN, 1998.


CiteSeer.IST - Copyright Penn State and NEC