Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and Suresh Venkatasubramanian. The connectivity server: fast access to linkage information on the web. In Proceedings of the Seventh International World Wide Web Conference (WWW 7), 1998.
.... is visited by a random surfer on the Web [9] which have been successfully applied to various web mining tasks [13, 23, 7]. Prerequisite for this type of research is the availability of extended document indexing techniques that allow fast access to both outgoing and incoming hyperlinks of a page [6]. An excellent overview of recent research in the area, which is nowadays referred to as web mining, can be found in [11]. Not surprisingly, the potential of hyperlinks as additional information sources for text categorization tasks has also been looked at in recent research. Previous work on ....
Krishna Bharat, Andrei Broder, Monika R. Henzinger, Puneet Kumar, and Suresh Venkatasubramanian. The connectivity server: Fast access to linkage information on the Web. Computer Networks, 30(1--7):469--477, 1998. Proceedings of the 7th International World Wide Web Conference (WWW-7), Brisbane, Australia.
....The top-level graph serves the role of an index, allowing the relevant lower-level graphs to be quickly located. Our experiments indicate that S-Node representations significantly reduce query execution time (often by a factor of 10 or 15) when compared with other proposed Web graph representations [14, 12, 13]. The rest of this paper is organized as follows. In Section 2, we describe the overall structure of the S-Node representation. In Section 3, we describe how an S-Node representation is constructed and organized on disk. In Section 4 we present detailed experimental results that demonstrate the ....
....to the S-Node representation, we also implemented the following schemes for representing Web graphs. Connectivity Server Link3 scheme. We implemented a recently proposed Web graph compression scheme called Link3 [12, 13], developed as an extension to the Connectivity Server described in [14]. See Section 5 for an expanded description of this scheme. Huffman-encoded representation. This representation scheme assigns Huffman codes to each page based on in-degree. Specifically, pages with higher in-degree are assigned smaller codes since they occur more frequently in adjacency ....
Krishna Bharat et al. The connectivity server: Fast access to linkage information on the Web. In Proc. of the 7th Intl. World Wide Web Conf., April 1998.
....of the Internet Archive [6] uses a Bloom filter stored in memory; this results in a very compact representation, but also gives false positives, i.e. some pages are never downloaded since they collide with other pages in the Bloom filter. Lossless compression can reduce URL size to below 10 bytes [4, 24], though this is still too high for large crawls. In both cases main memory will eventually become a bottleneck, although partitioning the application will also partition the data structures over several machines. A more scalable solution uses a disk resident structure, as for example done in ....
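The trade-off mentioned in the context above (a compact in-memory Bloom filter that can report false positives, so some pages are wrongly treated as already seen) can be illustrated with a minimal sketch. This is not the Internet Archive crawler's implementation: the class name, the bit-array size, the number of hash functions, and the use of SHA-256 to derive bit positions are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter for a crawler's 'URL already seen' test.

    Hypothetical sketch: sizes and hashing scheme are assumptions,
    not taken from any crawler described in the cited papers.
    """

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url):
        # Derive num_hashes bit positions by salting the URL with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # True means "possibly seen" (false positives can occur);
        # False means "definitely not seen" (no false negatives).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(url)
        )


seen = BloomFilter()
seen.add("http://example.com/index.html")
already_seen = "http://example.com/index.html" in seen
```

A membership test never misses a URL that was added, but as the filter fills up, an unseen URL may collide on all of its bit positions and be skipped, which is the false-positive behavior the snippet describes.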
K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The connectivity server: Fast access to linkage information on the web. In 7th Int. World Wide Web Conference, May 1998.
....through term-based techniques and then performs an analysis of only the graph neighborhood of these pages. The success of such link-based ranking techniques has also motivated a large amount of research focusing on the basic structure of the web [12], efficient computation with massive web graphs [9, 37, 40], and other applications of link-based techniques such as finding related pages [10], classifying pages [14], crawling important pages [20] or pages on a particular topic [16], or web data mining [29, 30], to name just a few. Both Pagerank and HITS are based on an iterative process defined on the ....
....a large graph will not fit into the main memory of most machines if we use standard graph data structures, and there are two approaches to overcoming this problem. The first approach is to try to fit the graph structure into main memory by using compression techniques for the web graph proposed in [9, 37, 40]; this results in very fast running times but still requires substantial amounts of memory. The second approach, taken by Haveliwala [24], implements the Pagerank computation in an I/O-efficient manner through a sequence of scan operations on data files containing the graph structure and the ....
[Article contains additional citation context not shown here]
K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The connectivity server: Fast access to linkage information on the web. In 7th Int. World Wide Web Conference, May 1998.
....example: The architecture of the Mercator web crawler is reported by Heydon and Najork [12]. Brin and Page [5] document many design details of the early Google search engine prototype. Design possibilities and tradeoffs for a repository of web pages are covered by Hirai, et al. [13]. Bharat, et al. [4] describe their experiences in building a fast web page linkage connectivity server. Different architectures for distributed inverted indexing schemes are discussed by Melnik, et al. [24] and Ribeiro Neto, et al. [31]. In contrast, this paper primarily focuses on design and implementation details ....
Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and Suresh Venkatasubramanian. The Connectivity Server: fast access to linkage information on the Web. In Proceedings of 7th International World Wide Web Conference, 14--18 April 1998.
....generic web page storage, update, and retrieval. Haveliwala [12] described a disk-efficient non-distributed method to compute PageRank [19] for large web linkage graphs. Stata, et al. [22] showed a design of how to construct a filtered database of interesting words for all web pages. Bharat, et al. [5] described their experience building a server that constructs and queries a compact representation of forward and backward linkage among a set of web pages. The common property of these efforts is that each concentrates on one or few search engine data preparation tasks and develops an efficient ....
....one data modification cannot necessarily be reused by the processing of subsequent updates. One can somewhat improve the performance of the control-driven approach by using (virtual striped) disks with better characteristics [21], using more RAM or fitting more data into RAM with smart encodings [5, 22], or start moving away from the purely control-driven computation by explicitly exploiting data patterns and access locality that exists in a particular workload [13, 8], or abandon the control-driven approach in favor of developing very task-specific algorithms and data structures [12, 20, 6] ....
Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and Suresh Venkatasubramanian. The Connectivity Server: fast access to linkage information on the Web. In Proceedings of 7th International World Wide Web Conference, 14--18 April 1998.
....1999 by running a simple web crawler that collects web pages from given seed pages in the breadth-first order. From the archive, we built a connectivity database that can search outgoing and incoming links of a given page. Basic functions of the database were similar to the connectivity server [1] developed in DIGITAL, Systems Research Center. Our database indexed about 120 million hyperlinks between about 30 million pages (17 million pages of pages in the archive, and 13 million pages pointed to by pages in the archive). We implemented the whole system on Sun Enterprise Server 6500 with 8 ....
Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and Suresh Venkatasubramanian. The Connectivity Server: fast access to linkage information on the Web. In Proceedings of the 7th International World Wide Web Conference, 1998.
....several hundreds of millions vertices. Therefore, the efficient encoding of the graph becomes a crucial issue. The challenge is then to find a good balance between space and time requirements. Related works. Until now, the main work concerning graph encoding is the Connectivity Server presented in [2]. This server maintains the graph in memory and is able to compute the neighborhood of one or more vertices. In the first version of the server, the graph is stored as an array of adjacency lists, describing the successors and predecessors of each vertex. The URLs are compressed using a delta ....
....for both links and URLs. The space needed to store a link has been reduced from 8 to 1.7 bytes in average, and the space needed to store a URL has been reduced from 16 to 10 bytes in average. Notice however that a full description of the method is available only for the first version of the server [2], the newer (and more efficient) ones being only shortly described in [3, 12]. Experimental protocol. Our aim is to provide an efficient and simple solution to the problem of encoding large sets of URLs and large parts of the Web graph using only standard, free and widely available tools, namely sort, ....
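The two ideas named in the contexts above (adjacency lists holding the successors and predecessors of each vertex, and delta compression of URLs) can be sketched as follows. This is only an illustration under stated assumptions: the snippet is truncated before it says how the delta is computed, so the sketch assumes the common prefix-sharing form over lexicographically sorted URLs, and the function names `delta_encode`, `delta_decode`, and `build_adjacency` are hypothetical rather than taken from the Connectivity Server.

```python
def delta_encode(sorted_urls):
    """Store each URL as (length of prefix shared with the previous URL, suffix).

    Assumes the input is already sorted, so neighbors share long prefixes.
    """
    encoded = []
    prev = ""
    for url in sorted_urls:
        common = 0
        limit = min(len(prev), len(url))
        while common < limit and prev[common] == url[common]:
            common += 1
        encoded.append((common, url[common:]))
        prev = url
    return encoded


def delta_decode(encoded):
    """Rebuild the original URL list from (shared prefix length, suffix) pairs."""
    urls = []
    prev = ""
    for common, suffix in encoded:
        url = prev[:common] + suffix
        urls.append(url)
        prev = url
    return urls


def build_adjacency(edges, num_nodes):
    """Build successor and predecessor lists indexed by integer vertex id."""
    successors = [[] for _ in range(num_nodes)]
    predecessors = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        successors[src].append(dst)
        predecessors[dst].append(src)
    return successors, predecessors


urls = sorted([
    "http://a.com/",
    "http://a.com/x",
    "http://b.org/index.html",
])
encoded = delta_encode(urls)
# Vertex ids are positions in the sorted URL list (an assumption of this sketch).
successors, predecessors = build_adjacency([(0, 1), (0, 2), (2, 1)], len(urls))
```

Because vertex ids follow the sorted URL order, both the outgoing and the incoming links of a page can be looked up by index, while the URL table itself only stores the part of each URL that differs from its predecessor.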
Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and Suresh Venkatasubramanian, The Connectivity Server: fast access to linkage information on the Web, Computer Networks and ISDN Systems 30 (1998), no. 1-7, 469-477.
....for a wide variety of Web algorithms, including algorithms for ranking pages based on their connectivity [11, 3] and finding natural communities of pages on a shared topic [13]. Indeed, at least one major search engine has designed a tool called the connectivity server for storing the Web graph [2, 4]. Given this previous work, a natural question to ask is how well the Web graph and Web-like graphs can be compressed, in order to save on the memory required to store or transfer such graphs. Good compression requires using the structural properties of the Web graph, and hence an important first ....
....the graph structure itself. 2. Compressing the underlying graph for storage or transmission, maintaining a given ordering of the nodes. As an example of this setting, we might order the nodes according to the sorted order of the URLs (so that the URLs can be compressed by delta encoding, as in [2]). 3. Compressing the underlying graph for use in its compressed form. That is, we desire a compressed form of the graph that still allows for efficient computation on the compressed form. Our primary focus in the paper is the second setting, where we are given a node ordering and are concerned ....
[Article contains additional citation context not shown here]
K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The Connectivity Server: fast access to linkage information on the Web. In Proceedings of the 7th World Wide Web Conference, 1998.
....pair of adjacency list structures described above. Small graphs of hundreds or even thousands of nodes can be efficiently represented by any one of a variety of well-known data structures [1]. However, doing the same for a graph with several million nodes and edges is an engineering challenge. In [7], Bharat et al. describe the Connectivity Server, a system to scalably deliver linkage information for all pages retrieved and indexed by the Altavista search engine. Text index: Even though link-based techniques are used to enhance the quality and relevance of search results, text-based ....
Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and Suresh Venkatasubramanian. The connectivity server: Fast access to linkage information on the web. In Proceedings of the Seventh International World-Wide Web Conference, April 1998.
....7.8 Metaphors We Surf the Web By [9]. 7.9 Description of Collections and Encyclopaedias on the Web using XML [10]. 7.10 The Connectivity Server: fast access to linkage information on the Web [1]. 7.11 Introduction to WordNet: An On-line Lexical Database [11]. 7.12 The Maximum Clique Problem [2]. 8 Conclusions. 8.1 Indexing music as possible application of Evolving Association ....
....pool. The WWW seems to be actually the biggest, fastest growing and evolving pool of information. Furthermore, due to its decentralized, unorganized nature it is quite difficult (maybe even impossible) to provide a high quality keyword index according former improvements. Some recent publications [7, 4, 6, 5, 3, 1] specialize on information retrieval on the WWW without focusing on the use of keywords. The methods described in these papers use the hyperlinked structure of the Internet to retrieve and rank valuable information. The heuristic used is, that links are provided by a lot of different authors ....
[Article contains additional citation context not shown here]
Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and Suresh Venkatasubramanian. The connectivity server: fast access to linkage information on the web. Computer Networks and ISDN Systems, 1998.
....like the global distribution reported by Broder and others. We offer a heuristic explanation for this observation. How topic-biased are breadth-first crawls? Several production crawlers follow an approximate breadth-first strategy. A breadth-first crawler was used to build the Connectivity Server [5, 7] at Alta Vista. Najork and Weiner [27] report that a breadth-first crawl visits pages with high PageRank early, a valuable property for a search engine. A crawl of over 80 million pages at the NEC Research Institute broadly follows a breadth-first policy. However, can we be sure that a ....
....same order of magnitude as with the detailed topics, but the L1 inter-walk distance achieved is much lower. Figure 4: An estimate of the background distribution across the 12 top-level topics in our taxonomy. populates Alta Vista's Connectivity Servers follows largely a breadth-first strategy [5, 7]. Najork and Weiner [27] have demonstrated that a breadth-first crawler also tends to visit nodes with large PageRank early (because good authorities tend to be connected from many places by short paths). A crawler of substantial scale deployed in NEC Research uses breadth-first scheduling as ....
K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The Connectivity Server: Fast access to linkage information on the Web. In 7th World Wide Web Conference, Brisbane, Australia, 1998.
No context found.
K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The connectivity server: Fast access to linkage information on the Web. In Conference, pages 104--111, 1998.
....in the index and the other data structures. When results are displayed, these document ids need to be converted back to the URL. This is done using the URL database. The graph representation keeps for each document all the documents pointing to it and all the documents it points to. See [4] for a potential implementation. The document repository stores for each document id the original document. The related pages finder stores for each document id all document ids of pages that are related to the document. A related page is a page that addresses the same topic as the original page, ....
K. Bharat, A. Z. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The connectivity server: Fast access to linkage information on the Web. In Proceedings of the Seventh International World Wide Web Conference 1998, pages 469-477.
No context found.
Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and Suresh Venkatasubramanian. The connectivity server: fast access to linkage information on the web. In Proceedings of the Seventh International World Wide Web Conference (WWW 7), 1998.
No context found.
K. Bharat, A. Broder, M. R. Henzinger, P. Kumar, and S. Venkatasubramanian. The Connectivity server: Fast access to Linkage Information on the Web. In Proceedings of the 7th International World Wide Web Conference (WWW-7), pages 469--477, Brisbane, Australia, 1998.
No context found.
K. Bharat, A. Broder, M. R. Henzinger, P. Kumar, and S. Venkatasubramanian. The connectivity server: Fast access to linkage information on the Web. Computer Networks, 30(1--7):469--477, 1998. Proceedings of the 7th International World Wide Web Conference (WWW-7), Brisbane, Australia.
No context found.
Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and Suresh Venkatasubramanian. The Connectivity Server: Fast access to linkage information on the Web. In Proceedings of the Seventh International World--Wide Web Conference, pages 469--477, Brisbane, Australia, 1998.
No context found.
K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The connectivity server: fast access to linkage information on the web. In Proceedings of 7th International World Wide Web Conference, pages 14--18, 1998.
No context found.
K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The connectivity server: Fast access to linkage information on the web. In 7th Int. World Wide Web Conference, May 1998.
No context found.
K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The connectivity server: fast access to linkage information on the web. In Proceedings of 7th International World Wide Web Conference, pages 14--18, 1998.
No context found.
K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian, The Connectivity Server: fast access to linkage information on the Web, In Proc. of the Seventh International World Wide Web Conference, Brisbane, Australia, April 14-18, 1998.
No context found.
K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian, The Connectivity Server: fast access to linkage information on the Web, In Proc. of the Seventh International World Wide Web Conference, Brisbane, Australia, April 14-18, 1998.
No context found.
K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The Connectivity Server: fast access to linkage information on the Web. In Proc. 7th International WWW Conference, 1998.
No context found.
Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and Suresh Venkatasubramanian. The Connectivity Server: fast access to linkage information on the Web. Computer Networks and ISDN Systems, 30(1--7):469--477, 1998.