  WIC: A General-Purpose Algorithm for Monitoring Web Information Sources

by Sandeep Pandey, Kedar Dhamdhere, Christopher Olston
http://www-2.cs.cmu.edu/~olston/publications/wic.pdf

Abstract:

The Web is becoming a universal information dissemination medium, due to a number of factors including its support for content dynamicity. A growing number of Web information providers post near real-time updates in domains such as auctions, stock markets, bulletin boards, news, weather, roadway conditions, sports scores, etc. External parties often wish to capture this information for a wide variety of purposes ranging from online data mining to automated synthesis of information from multiple sources. There has been a great deal of work on the design of systems that can process streams of data from Web sources, but little attention has been paid to how to produce these data streams, given that Web pages generally require “pull-based” access. In this paper we introduce a new general-purpose algorithm for monitoring Web information sources, effectively converting pull-based sources into push-based ones. Our algorithm can be used in conjunction with continuous query systems that assume information is fed into the query engine in a push-based fashion. Ideally, a Web monitoring algorithm for this purpose should achieve two objectives: (1) timeliness and (2) completeness of information captured. However, we demonstrate both analytically and empirically using real-world data that these objectives are fundamentally at odds. When resources available for Web monitoring are limited, and the number of sources to monitor is large, it may be necessary to sacrifice some timeliness to achieve better completeness, or vice versa. To take this fact into account, our algorithm is highly parameterized and targets an application-specified balance between timeliness and completeness. In this paper we formalize the problem of optimizing for a flexible combination of timeliness and completeness, and prove that our parameterized algorithm is a
