F Menczer, G Pant, and P Srinivasan. Evaluating topic-driven web crawlers. In Proceedings of the 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.
....by techniques in this model. This crawler also follows URLs that include keywords of focused topic. It sets a threshold for each branch of the web and stops following URLs of this branch if the relevance score of the area falls below the threshold. An evaluation of focused crawlers is reported in [60]. Initially a set of classifiers for 100 topics was built to be used in the evaluation of the crawled documents. The researchers in this approach believe that a good focused crawler should remain in the vicinity of the topic. They evaluated the crawlers based on ability to remain on topic during ....
F. Menczer, G. Pant, P. Srinivasan and M. Ruiz, Evaluating Topic-Driven web Crawlers, In Proceedings of the 24th Annual International ACM/SIGIR Conference, 2001.
....and used for Web experimentation. Chakrabarti et al. implemented a focused crawler using off-the-shelf database and storage managers [7]. Rennie and McCallum [25] designed a focused crawler that attempted to crawl pages only of a certain type by using feedback during the crawl. Menczer et al. [19, 20] have done work since 1999 on designing and evaluating focused crawlers. Mukherjee's WTMS [21] reports being able to build a topic-based collection with high precision. [6] is a good summarization of results in focused crawling as of the end of 1999. An interesting technique ....
MENCZER, F., PANT, G., AND SRINIVASAN, P. Evaluating topic-driven Web crawlers. In SIGIR '01, September 9-12 (New Orleans, La. USA, 2001).
....range between client and site level tools. Letizia [18], Powerscout, and WebWatcher [17] are such systems. Menczer and Belew proposed InfoSpiders [24], a collection of autonomous goal-driven crawlers without global control or state, in the style of genetic algorithms. A recent extensive study [25] comparing several topic-driven crawlers, including the best-first crawler and InfoSpiders, found the best-first approach to show the highest harvest rate (which our new system outperforms). In all the systems mentioned above, improving the chances of a successful leap of faith will clearly ....
....and keeps removing the highest priority node and visiting it, expanding its outlinks and checking them into the priority queue with the relevance score of v in turn. Despite its extreme simplicity, the best-first crawler has been found to have very high harvest rates in extensive evaluations [25]. Why do we need negative examples and negative classes at all? Instead of using class probabilities, we could maintain a priority queue on, say, the TFIDF cosine similarity between u and the centroid of the seed pages (acting as an estimate for the corresponding similarity between v and the ....
F. Menczer, G. Pant, M. Ruiz, and P. Srinivasan. Evaluating topic-driven Web crawlers. In SIGIR, New Orleans, Sept. 2001. ACM. Online at http://dollar.biz.uiowa.edu/~fil/Papers/sigir-01.pdf.
No context found.
F Menczer, G Pant, M Ruiz, and P Srinivasan. Evaluating topic-driven Web crawlers. In Donald H. Kraft, W. Bruce Croft, David J. Harper, and Justin Zobel, editors, Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 241--249, New York, NY, 2001. ACM Press.
No context found.
F. Menczer, G. Pant, M. Ruiz, and P. Srinivasan. Evaluating topic-driven Web crawlers. In Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.
No context found.
F. Menczer, G. Pant, M. Ruiz, and P. Srinivasan. Evaluating topic-driven Web crawlers. In Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.
No context found.
F. Menczer, G. Pant, M. Ruiz, and P. Srinivasan. Evaluating topic-driven Web crawlers. In Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.
No context found.
F. Menczer, G. Pant, M. Ruiz, and P. Srinivasan. Evaluating topic-driven Web crawlers. In Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.
No context found.
F. Menczer, G. Pant, M. Ruiz, and P. Srinivasan. Evaluating topic-driven Web crawlers. In Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.
No context found.
F Menczer, G Pant, M Ruiz, and P Srinivasan. Evaluating topic-driven Web crawlers. In Donald H. Kraft, W. Bruce Croft, David J. Harper, and Justin Zobel, editors, Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 241--249, New York, NY, 2001. ACM Press.
No context found.
F. Menczer, G. Pant, M. Ruiz, and P. Srinivasan. Evaluating topic-driven Web crawlers. In Proc. 24th Annual Intl. ACM SIGIR Conference on Research and Development in Information Retrieval, 2001.
....uses as well, for example many sites collect emails by advertising freebies of various sorts, and then sell the email lists to spammers as opt-in requests. One way for an attacker to automatically locate and collect forms to be used as launch pads is by employing a topic-driven crawler [6, 7]. Such a software searches the .... [Figure 1: A typical Web form that can be exploited by our attack (left) and the HTML code that can be used to detect, parse, and submit such a form (right)] ....
F. Menczer, G. Pant, M. Ruiz, and P. Srinivasan. Evaluating topic-driven Web crawlers. In D. H. Kraft, W. B. Croft, D. J. Harper, and J. Zobel, editors, Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 241--249, New York, NY, 2001. ACM Press.
....in the heuristics they use to score the unvisited URLs, with some algorithms adapting and tuning their parameters before or during the crawl. 3.1 Naive Best First Crawler. A naive best-first crawler was one of the crawlers detailed and evaluated by the authors in an extensive study of crawler evaluation [21]. This crawler represents a fetched Web page as a vector of words weighted by occurrence frequency. The crawler then computes the cosine similarity of the page to the query or description provided by the user, and scores the unvisited URLs on the Frontier ....
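A minimal Python sketch of the naive best-first strategy described in the excerpt above, offered as an illustration and not as the cited authors' implementation. It assumes a hypothetical fetch(url) callback that returns a page's text and outlinks, whitespace tokenization, and a toy in-memory WEB dictionary standing in for real pages. Each unvisited URL inherits the cosine similarity of the page that links to it.

```python
import heapq
import math
from collections import Counter


def tf_vector(text):
    """Represent text as a bag of words weighted by occurrence frequency."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_first_crawl(seeds, query, fetch, max_pages):
    """Visit up to max_pages URLs, always expanding the best-scored one.

    fetch is a hypothetical callback: fetch(url) -> (page_text, outlinks).
    """
    query_vec = tf_vector(query)
    # heapq is a min-heap, so priorities are negated scores.
    # Seeds get -inf so they are always visited first.
    frontier = [(float("-inf"), i, url) for i, url in enumerate(seeds)]
    heapq.heapify(frontier)
    counter = len(seeds)  # tie-breaker: earlier discoveries win ties
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        visited.append(url)
        # Outlinks are scored by the similarity of the page they appear on.
        score = cosine(tf_vector(text), query_vec)
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, counter, link))
                counter += 1
    return visited


# Toy stand-in for the Web (assumed data, for illustration only).
WEB = {
    "seed": ("web crawlers index pages", ["a", "b"]),
    "a": ("topic driven web crawlers", ["c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("crawlers crawlers", []),
    "d": ("gardening", []),
}

order = best_first_crawl(["seed"], "topic driven web crawlers",
                         lambda url: WEB[url], max_pages=4)
print(order)
```

In this toy run the link found on the on-topic page "a" is visited before the off-topic page "b", even though "b" was discovered earlier.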
....each agent kept its frontier limited to the links on the page that was last fetched by the agent. Due to this limited-memory approach the crawler was limited to following the links on the current page, and it was outperformed by the naive best-first crawler on a number of evaluation criteria [21]. Since then a number of improvements (inspired by naive best-first) to the original algorithm have been designed while retaining its capability to learn link estimates via neural nets and focus its search toward more promising areas by selective reproduction. In fact the redesigned version of the ....
F. Menczer, G. Pant, M. Ruiz, and P. Srinivasan. Evaluating topic-driven Web crawlers. In Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.
....off-the-shelf text mining, indexing and ranking tools. Topical crawlers, also called focused crawlers, have been studied extensively in the past [6, 8, 3, 12, 7, 1, 2]. In our previous evaluation studies of topical crawlers, we found a similarity-based Best First crawler to be quite effective [13, 14, 18]. However, the Best First crawler made no use of the inherent structure available in an HTML document. We also studied algorithms that attempt to identify the context of a link using a sliding window and a distance measure based on number of links separating a word from a given link. However, the ....
....its relevance. Chakrabarti et al. [3] use a classifier to find the rate of relevant page acquisition. Diligenti et al. [7] compute the average relevance of the crawled pages to measure the performance where the relevance is judged by a Naive Bayes model of the seed set. A study by Menczer et al. [13] on the evaluation of topical crawlers looks at a number of ways to compare different crawlers. A more general framework to evaluate crawlers has been outlined by Srinivasan et al. [18]. 9. CONCLUSIONS We investigated the problem of creating a small but effective collection of Web documents for ....
F. Menczer, G. Pant, M. Ruiz, and P. Srinivasan. Evaluating topic-driven Web crawlers. In Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.
....be a day, a week or a month based on the need for index freshness, the time available, and the maximum size of the collection (MAX IN INDEX). 3. IMPLEMENTATION Currently, the search engine crawler symbiosis is implemented using a search engine called Rosetta [9, 7, 8] and a Best First crawler [29, 30, 33, 38]. Both the search engine and the crawler were not built specifically for this application. They have been used before for independent searching [9] and crawling [30, 32] tasks. We want to demonstrate the use of the symbiotic model by picking an off-the-shelf search engine and a generic topical ....
....based on the score. Every time the crawler needs to fetch a page, it picks the best one in the queue. In our previous evaluation studies, we have found the Best First crawler to be a strong competitor among other algorithms for short crawls of a few thousand pages, on general crawling tasks [29, 30, 38]. A multi-threaded Java-based infrastructure to implement the algorithm is described in detail elsewhere [32]. The crawler can have a number of threads that share a single crawl frontier. Each thread picks the best URL to crawl from the frontier, fetches the corresponding page, scores the ....
F. Menczer, G. Pant, M. Ruiz, and P. Srinivasan. Evaluating topic-driven Web crawlers. In Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.
.... used by Google: pages containing the query's lexical features are ranked using query-independent link analysis [4]. Links are also used in conjunction with text to identify hub and authority pages for a certain subject [17], guide search agents crawling on behalf of users or topical search engines [25, 26, 5, 27, 32], and identify Web communities [13, 18, 10, 11]. The hidden assumption behind all of these retrieval, ranking and crawling algorithms that use link analysis to make semantic inferences is a correlation between the graph topology of the Web and the meaning of pages, or more precisely the conjecture ....
F. Menczer, G. Pant, M. Ruiz, and P. Srinivasan. Evaluating topic-driven Web crawlers. In D. H. Kraft, W. B. Croft, D. J. Harper, and J. Zobel, editors, Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 241--249, New York, NY, 2001. ACM Press.
....it is quite reasonable to explore crawlers in a context where the parameters of crawl time and crawl distance may be beyond the limits of human acceptance imposed by user based experimentation. Our analysis of the crawler literature [1, 2, 4, 6, 8, 11, 10, 17, 18, 29, 37] and our own experience [21, 24, 25, 26, 22, 31, 32, 27] indicate that in general, when embarking upon an experiment comparing crawling algorithms, several critical decisions are made. These impact not only the immediate outcome and value of the study but also the ability to make comparisons with future crawler evaluations. In this paper we offer a ....
....for evaluating crawlers, since we may examine their ability to retrieve pages that are on topic. Topics may be obtained from different sources as for instance asking users to specify them. One approach is to derive topics from a hierarchical index of concepts such as Yahoo or the Open Directory [11, 26, 32]. A key point to note is that all topics are not equal. Topics such as 2002 US Opens and trade embargo are much more specific than Sports and Business respectively. Moreover, a given topic may be defined in several different ways, as we describe below. Topic specification has a very ....
F Menczer, G Pant, M Ruiz, and P Srinivasan. Evaluating topic-driven Web crawlers. In Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.
....will continue to grow. Hence the solution offered by search engines, i.e. the capacity to answer any query from any user is recognized as being limited. It therefore comes as no surprise that the development of topic driven crawler algorithms has received significant attention in recent years [1, 6, 8, 16, 26, 28, 24]. Topic driven crawlers (also known as focused crawlers) respond to the particular information needs expressed by topical queries or interest profiles. These could be the needs of an individual user (query time or online crawlers) or those of a community with shared interests (topical search ....
....specific to Web crawlers is that the magnitude of retrieval results limits the availability of user based relevance judgments. In previous research we have started to explore several alternative approaches both for assessing the quality of Web pages as well as for summarizing crawler performance [28]. In a companion paper [35] we expand such a methodology by describing in detail a framework developed for the fair evaluation of topic driven crawlers. Performance analysis is based on both quality and on the use of space resources. We formalize a class of crawling tasks of increasing difficulty, ....
F Menczer, G Pant, M Ruiz, and P Srinivasan. Evaluating topic-driven Web crawlers. In Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.
.... are being applied to a wide range of complex problems from interface agents [14, 19] to recommender systems [3] and autonomous and comparative shopping agents [10, 17, 24]. Our interest is in the design of retrieval agents that seek out relevant Web pages in response to user generated topics [11, 22, 23]. Due to limited bandwidth, storage, and computational resources, and to the dynamic nature of the Web, search engines cannot index every Web page, and even the covered portion of the Web cannot be monitored continuously for changes. In fact a recent estimate of the visible Web is at around 7 ....
....reports at its Web site [12]. Therefore it is essential to develop effective agents able to conduct real time searches for users. This goal is reflected in our previous research in which we have explored a variety of Web crawling agents that operate using both lexical and link based criteria [23]. We have assessed their performance with topics derived from the Yahoo and Open Directory (DMOZ) hierarchies. We used several alternative measures and have also compared those that are dominantly exploratory in nature with those that are more exploitative of the available evidence [28]. In ....
F. Menczer, G. Pant, M. Ruiz, and P. Srinivasan. Evaluating topic-driven Web crawlers. In Proc. 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001.
No context found.
F Menczer, G Pant, and P Srinivasan. Evaluating topic-driven web crawlers. In Proceedings of the 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.
No context found.
F. Menczer, G. Pant and P. Srinivasan. Evaluating topic-driven web crawlers. In Proceedings of the 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001.
No context found.
MENCZER, F., PANT, G., SRINIVASAN, P. and RUIZ, M. E. 2001. Evaluating Topic-Driven Web Crawlers. In Proceedings of 24th Annual Int. ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR'01). New Orleans, Louisiana, USA, 241-249.
No context found.
Menczer, F., Pant, G., Srinivasan, P., Ruiz, M.E.: Evaluating Topic-Driven Web Crawlers. In Proc. 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2001) 241--249.
No context found.
Menczer, F., Pant, G., Ruiz, M., and Srinivasan, P., Evaluating topic-driven Web crawlers, in D. H. Kraft, W. B. Croft, D. J. Harper, and J. Zobel (eds.), Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 241--249, Association for Computing Machinery, New York (2001).
No context found.
Filippo Menczer, Gautam Pant, Padmini Srinivasan, Miguel E. Ruiz. Evaluating Topic-Driven Web Crawlers, SIGIR '01, September 9-12, 2001, New Orleans, Louisiana, USA