Taher H. Haveliwala, Aristides Gionis, and Piotr Indyk. Scalable techniques for clustering the web. In Proc. of WebDB, pages 129--134, 2000.
....[2] Some experiments on small web page collections using minimum branching and several faster heuristics are given in [38], which show significant differences in compression performance between different approaches. For very large collections, general document clustering techniques such as [10, 30, 22] could be applied, or specialized heuristics such as [9, 13, 15] for the case of web documents. In particular, [13] demonstrates that there is significant benefit in choosing more than one reference page to compress a web page. One example of long reference chains arises when dealing with many ....
....require a priori knowledge of upper bounds on the distances between the two files. We now discuss protocols for estimating these distances. We note that we could apply the sampling techniques in [10, 30] to construct fingerprints of the files that could be efficiently transmitted (see also [22] for the problem of finding similar files in larger collections). While these techniques may work well in practice to decide how similar two files are, they are not designed with any of the common distance measures in mind, but are based on the idea of estimating the number of common substrings of a ....
T.H. Haveliwala, A. Gionis, and P. Indyk. Scalable techniques for clustering the web. In Proc. of the WebDB Workshop, Dallas, TX, May 2000.
....We present a general framework, called cluster-based delta compression, for efficiently computing near-optimal delta encoding schemes on large collections of files. The framework combines the branching approach with two recently proposed hash-based techniques for clustering files by similarity [3, 10, 14, 17]. Within this framework, we evaluate a number of different algorithms and heuristics in terms of compression and running time. Our results show that compression very close to that achieved by the optimal branching algorithm can be achieved in time that is within a small multiplicative factor of ....
.... with quadratic complexity called min-wise independent hashing proposed by Broder in [3] (see also Manber and Wu [17] for a similar technique) and a very recent nearly linear time technique called locality-sensitive hashing proposed by Indyk and Motwani in [14] and applied to web documents in [10].
2 Delta Compression Based on Optimum Branchings
Delta compressors such as vcdiff or zdelta provide an efficient way to encode the difference between two similar files. However, given a collection of files, we are faced with the problem of succinctly representing the entire collection through ....
[Article contains additional citation context not shown here]
T.H. Haveliwala, A. Gionis, and P. Indyk. Scalable techniques for clustering the web. In Proc. of the WebDB Workshop, Dallas, TX, May 2000.
....matched one of those in the sample. Viewing the pages as multisets of object lengths, we chose Jaccard's coefficient, Sim(X, Y) = |X ∩ Y| / |X ∪ Y|, as our metric [28], using the standard definitions of multiset intersection and union: minimum number of repetitions for intersection, maximum for union [16]. To evaluate the success of our hypothetical adversary, we defined the following categories for pages in our sample and target subsample: 1. Identifiable page: given a set T of target pages and a page t ∈ T (identified as t when fetched a second time), t is an identifiable page with respect ....
T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable Techniques for Clustering the Web. In WebDB (Informal Proceedings), pages 129-134, 2000. (http://dbpubs.stanford.edu:8090/pub/2000-23).
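The multiset form of Jaccard's coefficient quoted in the context above (minimum repetition count for intersection, maximum for union) can be sketched in Python as follows. This is a minimal illustration, not code from the citing paper; the function name `multiset_jaccard` and the convention that two empty multisets have similarity 1.0 are assumptions.

```python
from collections import Counter


def multiset_jaccard(x, y):
    """Jaccard's coefficient |X ∩ Y| / |X ∪ Y| for multisets.

    Intersection keeps the minimum repetition count of each element,
    union keeps the maximum, as in the quoted definition.
    """
    cx, cy = Counter(x), Counter(y)
    intersection = sum((cx & cy).values())  # Counter & takes per-element minimum
    union = sum((cx | cy).values())         # Counter | takes per-element maximum
    if union == 0:
        return 1.0  # assumption: two empty multisets count as identical
    return intersection / union


if __name__ == "__main__":
    # Hypothetical pages, each viewed as a multiset of object lengths in bytes.
    page_a = [1200, 1200, 340, 87]
    page_b = [1200, 340, 340, 5000]
    print(multiset_jaccard(page_a, page_b))
```

In the usage example, the intersection has 2 elements (one 1200, one 340) and the union has 6, so the similarity is 2/6.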
....We present a general framework, called cluster-based delta compression, for efficiently computing near-optimal delta encoding schemes on large collections of files. The framework combines the branching approach with two recently proposed hash-based techniques for clustering files by similarity [3, 10, 14, 17]. Within this framework, we evaluate a number of different algorithms and heuristics in terms of compression and running time. Our results show that compression very close to that achieved by the optimal branching algorithm can be achieved in time that is within a small multiplicative factor of ....
.... with quadratic complexity called min-wise independent hashing proposed by Broder in [3] (see also Manber and Wu [17] for a similar technique) and a very recent nearly linear time technique called locality-sensitive hashing proposed by Indyk and Motwani in [14] and applied to web documents in [10]. Finally, Chan and Woo [5] observe that in the case of web pages, similarities in the URL provide a powerful heuristic for identifying good reference files for delta compression. Thus, another web page from a close-by subdirectory on the same web server often shares a lot of content and ....
[Article contains additional citation context not shown here]
T.H. Haveliwala, A. Gionis, and P. Indyk. Scalable techniques for clustering the web. In Proc. of the WebDB Workshop, Dallas, TX, May 2000.
....also be based on link structure [46]. If the same page is linked to by two other pages, say, those two pages might be similar. Problems include the fact that newer pages might not yet have links to them, the tightly knit community effect [77], and user idiosyncrasies. Locality-sensitive hashing [57, 64, 20] is a clever method that uses a hash function such that the probability of collision of similar pages will be much higher than the probability for dissimilar pages. Page resemblance techniques are useful for many purposes: detecting updated copies, mirrors, plagiarism, etc. It is probably not ....
T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable techniques for clustering the Web. In WebDB'2000.
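The collision property described in the context above (similar pages collide more often than dissimilar ones) can be illustrated with min-wise hashing, one well-known hash family with this property for set similarity. The sketch below is an assumption-laden illustration, not the method of the cited or citing papers: the names `make_hashes`, `signature`, and `estimate_similarity`, the use of CRC32 to map tokens to integers, and the linear hash functions modulo a prime are all choices made here for brevity.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a large prime modulus for the linear hash functions


def make_hashes(k, seed=0):
    """Return k random (a, c) pairs defining h(v) = (a * v + c) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def signature(tokens, hashes):
    """Min-hash signature: for each hash function, the minimum value over the set.

    Assumes `tokens` is a non-empty iterable of strings; an empty input
    makes min() raise ValueError.
    """
    base = [zlib.crc32(t.encode("utf-8")) for t in set(tokens)]
    return [min((a * v + c) % PRIME for v in base) for a, c in hashes]


def estimate_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures collide.

    For each hash function the collision probability approximates the
    Jaccard similarity of the underlying sets, so the fraction estimates it.
    """
    matches = sum(1 for p, q in zip(sig_a, sig_b) if p == q)
    return matches / len(sig_a)


if __name__ == "__main__":
    hashes = make_hashes(200)
    # Hypothetical pages represented as sets of words.
    page_a = "scalable techniques for clustering the web".split()
    page_b = "scalable techniques for compressing the web".split()
    print(estimate_similarity(signature(page_a, hashes), signature(page_b, hashes)))
```

The two example pages share 5 of 7 distinct words, so the printed estimate should fall near 5/7, with the exact value depending on the random hash functions.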
.... by the same document might be similar to each other [19]. Content analysis looks at the word similarity between documents [46]. This is based on the premise that two documents related to the same subject will use the same words [39]. There are hybrid approaches which combine link and text analysis [13, 24]. The point of both link and content analysis is to keep related documents together. This leads naturally to the concept of a distance metric; in the past researchers have used citation relationships [21], term similarity [3, 39], and co-authorship [36]. For content analysis, we use term ....
T. H. Haveliwala, A. Gionis, and P. Indyk. Scalable techniques for clustering the Web. In WebDB'2000.
No context found.
T. Haveliwala, A. Gionis, and P. Indyk. Scalable techniques for clustering the web. WebDB Workshop, 2000.
No context found.
Taher H. Haveliwala, Aristides Gionis, and Piotr Indyk. Scalable techniques for clustering the web. In Proc. of WebDB, pages 129--134, 2000.