by Laurent Mignet, Mihai Preda, Serge Abiteboul, Bernd Amann, Amélie Marian
Download: ftp://ftp.inria.fr/INRIA/Projects/verso/VersoReport-188.ps.gz
Abstract:
We consider the acquisition and maintenance of XML data found on the web. More precisely, we study the problem of discovering XML data on the web, i.e., in a world still dominated by HTML, and keeping it up to date with the web as best as possible, under set resources. We present a distributed architecture that is designed to scale to the billions of pages of the web. In particular, the distributed management of metadata about HTML and XML pages turns out to be an interesting issue. The scheduling of the fetching of the pages is guided by the importance of pages, their expected change rate, and subscriptions / publications of users. The importance of XML pages is defined in the standard manner based on the link structure of the web graph. It is computed by a matrix fixpoint computation. HTML pages are of interest for us only in that they lead to XML pages. Thus their importance is defined in a different manner and their computation also involves a fixpoint but on the transposed link matrix this time. The general scheduling problem is stated as an optimization problem that dispatches the resources to various tasks such as y
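The fixpoint computation mentioned in the abstract can be illustrated with a minimal sketch. It assumes a PageRank-style power iteration with a damping factor and uniform handling of pages without outgoing links; the abstract confirms none of these choices, and the function names `importance` and `transpose` are hypothetical. Running the same iteration on the transposed link graph gives a rough analogue of scoring pages by what they lead to rather than by what points to them.

```python
def importance(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration on a link graph given as {page: [pages it links to]}.

    Assumption: PageRank-style update with damping; the paper's exact
    definition may differ.
    """
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Every page starts each round with the uniform "teleport" share.
        new = {v: (1.0 - damping) / n for v in nodes}
        for src in nodes:
            targets = links.get(src, [])
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Page without outgoing links: spread its weight evenly.
                share = damping * rank[src] / n
                for v in nodes:
                    new[v] += share
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:  # fixpoint reached up to the tolerance
            break
    return rank


def transpose(links):
    """Reverse every edge, i.e. use the transposed link matrix."""
    rev = {}
    for src, targets in links.items():
        rev.setdefault(src, [])
        for t in targets:
            rev.setdefault(t, []).append(src)
    return rev
```

For example, with two hypothetical HTML pages `h1` and `h2` that both link to an XML page `x`, `importance(links)` ranks `x` highest, while `importance(transpose(links))` ranks `h1` and `h2` above `x`, since they are the pages that lead to it.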
Citations
1632 | The anatomy of a large-scale hypertextual web search engine – Brin, Page - 1998
1524 | Authoritative sources in a hyperlinked environment – Kleinberg - 1999
405 | Data on the Web: from Relations to Semistructured Data and XML – Abiteboul, Buneman, et al. - 2000
326 | Focused crawling: a new approach to topic-specific Web resource discovery – Chakrabarti, Berg, et al. - 1999
221 | Research problems in data warehousing – Widom - 1995
200 | Extensible Markup Language – World Wide Web Consortium - 1997
130 | The Evolution of the Web and Implications for an Incremental Crawler – Cho, Garcia-Molina - 2000
106 | Efficient Computation of PageRank – Haveliwala - 1999
81 | Efficient storage of XML data – Kanne, Moerkotte
68 | Estimating Frequency of Change – Cho, García-Molina - 2000
51 | Graphes et algorithmes (Eyrolles) – Gondran, Minoux - 1979
49 | Change-Centric Management of Versions in an XML Warehouse – Marian, Abiteboul, et al. - 2001
44 | Improving memory-system performance of sparse matrix-vector multiplication – Toledo - 1997
41 | What can you do with a Web in Your Pocket – Brin, Motwani, et al. - 1998
24 | Querying XML Documents in Xyleme – Aguilera, Cluet, et al. - 2000
18 | HyperText Markup Language (HTML) – World Wide Web Consortium - 1997
3 | Query subscription in an XML webhouse – Nguyen, Abiteboul, et al. - 2000
2 | Data acquisition for an XML warehouse – Preda - 2000