SHKAPENYUK, V. and SUEL, T. 2002. Design and implementation of a high-performance distributed Web crawler. In Proceedings of 18th International Conference on Data Engineering. San Jose, CA, 357-368.
....to gather data about the web. Even if no code is available, in several cases the basic design has been made public: this is the case, for instance, of Mercator [17] (the Altavista crawler), of the original Google crawler [6], and of some research crawlers developed by the academic community [22, 23, 21]. Nonetheless, little published work actually investigates the fundamental issues underlying the parallelization of the different tasks involved with the crawling process. In .... At the time, the name of the crawler was Trovatore, later changed into UbiCrawler when the authors learned about the ....
.... commercial crawlers are not public, there are some highly performant, scalable crawling systems whose structure has been described and discussed by the authors; among them, two distributed crawlers that might be compared to UbiCrawler are Mercator [17], used by AltaVista, the spider discussed in [21], and the spider discussed in [22]. Mercator is a high-performance web crawler whose components are loosely coupled; indeed, they can be distributed across several computing units. However, there is a central element, the frontier, which keeps track of all the URLs that have been crawled up to now ....
Vladislav Shkapenyuk and Torsten Suel. Design and implementation of a high-performance distributed web crawler. In IEEE International Conference on Data Engineering (ICDE), 2002.
No context found.
V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In Proc. of the Int. Conf. on Data Engineering, 2002.
No context found.
V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In Proc. of the Int. Conf. on Data Engineering, February 2002.
No context found.
V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In Proc. of the Int. Conf. on Data Engineering, 2002.
No context found.
V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In Proc. of the Int. Conf. on Data Engineering, February 2002.
No context found.
V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In Proc. of the Int. Conf. on Data Engineering, February 2002.
....section where the best combination of index structures is used. Crawling punt: As mentioned, we assume that in the web search application crawling would be performed by crawling clients that fetch and insert documents. The main reason is that from our own experiences with large scale crawling [44] we are not sure a P2P solution is appropriate. Large crawls generate many management issues due to queries or complaints from web site operators and network administrators. It is important to be able to reconfigure a crawler quickly to avoid certain web sites or subnetworks or to modify its ....
....beyond BFS are hard to implement in a P2P environment unless there is a centralized scheduler. We refer to [6, 14, 15] for work on highly distributed crawling. Thus, we would expect that a handful of powerful crawling clients would provide most documents, and we plan to use our Polybot crawler [44] to initially populate the system with data. It might be more appropriate to incorporate recrawling into the system, though. Thus, an inserted page could be labeled with an expiration date, after which it is automatically refreshed by the node holding the page. Alternatively, web sites could also ....
V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In Proc. of the Int. Conf. on Data Engineering, February 2002.
....used only one disk for the index structure. Software and Data Sets: Our experiments were run on a search engine prototype, named pingo, that is currently being developed in our research group. The document collection consisted of about million web pages crawled by the PolyBot web crawler [35] in October of 2002. Not all of the pages are distinct and the set contains a significant number of duplicates due to pages being repeatedly downloaded because of crawl interruptions. The crawl started at a hundred homepages of US Universities, and was performed in a breadth first manner. As ....
V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In Proc. of the Int. Conf. on Data Engineering, February 2002.
No context found.
SHKAPENYUK, V. and SUEL, T. 2002. Design and implementation of a high-performance distributed Web crawler. In Proceedings of 18th International Conference on Data Engineering. San Jose, CA, 357-368.
No context found.
V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In Proceedings of International Conference on Data Engineering (ICDE), 2002.
No context found.
T. Suel and V. Shkapenyuk. Design and implementation of a high-performance distributed web crawler. In Proceedings of the IEEE International Conference on Data Engineering, February 2002.
No context found.
T. Suel and V. Shkapenyuk. Design and implementation of a high-performance distributed web crawler. In Proceedings of the IEEE International Conference on Data Engineering, February 2002.
No context found.
V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In Proceedings of International Conference on Data Engineering (ICDE), 2002.
No context found.
V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In IEEE International Conference on Data Engineering (ICDE), Feb. 2002.
No context found.
V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In Proceedings of the 18th International Conference on Data Engineering, 2002.
No context found.
Vladislav Shkapenyuk and Torsten Suel. Design and implementation of a high-performance distributed web crawler. In ICDE, 2002.
No context found.
V. Shkapenyuk and T. Suel, "Design and implementation of a high-performance distributed web crawler," in ICDE, 2002. [Online]. Available: citeseer.nj.nec.com/shkapenyuk01design.html
No context found.
V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In ICDE, 2002.
No context found.
Vladislav Shkapenyuk and Torsten Suel. Design and implementation of a high-performance distributed web crawler. In Proceedings of the 18th International Conference on Data Engineering (ICDE), pages 357-368, San Jose, California, February 2002. IEEE CS Press.
No context found.
V. Shkapenyuk and T. Suel, "Design and implementation of a high-performance distributed web crawler," in ICDE, 2002. [Online]. Available: citeseer.nj.nec.com/shkapenyuk01design.html
No context found.
Vladislav Shkapenyuk and Torsten Suel. Design and implementation of a high-performance distributed web crawler. In IEEE International Conference on Data Engineering (ICDE), 2002.
No context found.
Vladislav Shkapenyuk and Torsten Suel. Design and implementation of a high-performance distributed web crawler. In Proceedings of International Conference on Data Engineering, 2002.
No context found.
Vladislav Shkapenyuk and Torsten Suel. Design and implementation of a high-performance distributed web crawler. In IEEE International Conference on Data Engineering (ICDE), 2002.
No context found.
Vladislav Shkapenyuk and Torsten Suel. Design and implementation of a high-performance distributed web crawler. In Proceedings of International Conference on Data Engineering, 2002.
No context found.
V. Shkapenyuk and T. Suel. Design and Implementation of a High-Performance Distributed Web Crawler. In Proceedings of the 18th International Conference on Data Engineering, San Jose, California, February 2002.