by Laurent Mignet, Mihai Preda, Serge Abiteboul, Bernd Amann, Amélie Marian
ftp://ftp.inria.fr/INRIA/Projects/verso/VersoReport-188.ps.gz

Abstract:

We consider the acquisition and maintenance of XML data found on the web. More precisely, we study the problem of discovering XML data on the web, i.e., in a world still dominated by HTML, and keeping it up to date with the web as best as possible, under set resources. We present a distributed architecture that is designed to scale to the billions of pages of the web. In particular, the distributed management of metadata about HTML and XML pages turns out to be an interesting issue. The scheduling of the fetching of the page is guided by the importance of pages, their expected change rate, and subscriptions / publications of users. The importance of XML pages is defined in the standard manner based on the link structure of the web graph. It is computed by a matrix fixpoint computation. HTML pages are of interest for us only in that they lead to XML pages. Thus their importance is defined in a different manner and their computation also involves a fixpoint but on the transposed link matrix this time. The general scheduling problem is stated as an optimization problem that dispatches the resources to various tasks such as y
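To make the two fixpoint computations mentioned in the abstract concrete, here is a minimal sketch in Python. It is not the authors' algorithm: the abstract gives no exact definitions, so the sketch assumes a PageRank-style power iteration with a damping factor of 0.85, uniform redistribution from pages without out-links, and a plain dictionary as the link graph. The names `importance` and `transpose` are hypothetical. Running the same iteration on the reversed graph is one simple way to illustrate a fixpoint "on the transposed link matrix", where importance flows from a page back to the pages that lead to it.

```python
def importance(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration over a link graph (assumed PageRank-style definition).

    links maps each page to the list of pages it points to.
    Returns a dict page -> importance; the values sum to 1.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Every page starts each round with the same baseline share.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                # Split this page's importance evenly among its targets.
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Assumption: a page with no out-links spreads uniformly.
                share = damping * rank[p] / n
                for q in pages:
                    new[q] += share
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:  # fixpoint reached, up to the tolerance
            break
    return rank


def transpose(links):
    """Reverse every edge, i.e. transpose the link matrix."""
    reversed_links = {}
    for p, outs in links.items():
        reversed_links.setdefault(p, [])
        for q in outs:
            reversed_links.setdefault(q, []).append(p)
    return reversed_links


# Hypothetical three-page graph: a and b point to c, and c points to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
forward = importance(graph)              # importance along the links
backward = importance(transpose(graph))  # importance against the links
```

In the forward computation a page scores highly when important pages point to it. In the backward computation a page scores highly when it points toward important pages, which matches the abstract's remark that HTML pages matter only in that they lead to XML pages.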
