Intelligent crawling of Web applications for Web archiving

Download Links

  • [perso.telecom-paristech.fr]
  • [www2012.wwwconference.org]
  • [www2012.org]
  • [hal-institut-mines-telecom.archives-ouvertes.fr]

by Muhammad Faheem, supervised by Pierre Senellart
Citations: 4 (2 self)

BibTeX

@MISC{Faheem_intelligentcrawling,
    author = {Muhammad Faheem and Pierre Senellart},
    title = {Intelligent crawling of Web applications for Web archiving},
    year = {}
}

Abstract

The steady growth of the World Wide Web raises challenges regarding the preservation of meaningful Web data. Tools currently used by Web archivists blindly crawl and store the Web pages they encounter, disregarding the kind of Web site being accessed (which leads to suboptimal crawling strategies) and whatever structured content the pages contain (which results in page-level archives whose content is hard to exploit). In this PhD work, we focus on the crawling and archiving of publicly accessible Web applications, especially those of the social Web. A Web application is any application that uses Web standards such as HTML and HTTP to publish information on the Web, accessible through Web browsers. Examples include Web forums, social networks, geolocation services, etc. We claim that the best strategy to crawl these applications is to make the Web crawler aware of the kind of application currently being processed, allowing it to refine the list of URLs to process and to annotate the archive with information about the structure of the crawled content. We add adaptive characteristics to an archiving Web crawler: it is able to identify when a Web page belongs to a given Web application and to apply the appropriate crawling and content extraction methodology.
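
The adaptive behaviour described in the abstract can be illustrated with a minimal, hypothetical sketch in Python: a dispatcher detects which kind of Web application a page belongs to and hands it to an application-specific strategy that refines the URLs to follow and extracts structured records. The detection markers (e.g. a showthread.php URL pattern), the ForumStrategy and ApplicationStrategy classes, and the crawl_page dispatcher are illustrative assumptions, not the crawler actually built in this work.

# Minimal sketch of application-aware crawling: detect the kind of Web
# application a page belongs to, then apply a type-specific strategy that
# refines the URL frontier and extracts structured content. All markers and
# strategies below are illustrative assumptions.

import re
from dataclasses import dataclass, field


@dataclass
class CrawlResult:
    application: str                                      # detected application type
    urls_to_follow: list = field(default_factory=list)    # refined frontier
    records: list = field(default_factory=list)           # structured content


class ApplicationStrategy:
    """Base class: one strategy per known kind of Web application."""
    name = "generic"

    def matches(self, url: str, html: str) -> bool:
        return True  # fallback: accept any page

    def crawl(self, url: str, html: str) -> CrawlResult:
        # Generic fallback: follow every link, keep no structured records.
        links = re.findall(r'href="([^"]+)"', html)
        return CrawlResult(self.name, urls_to_follow=links)


class ForumStrategy(ApplicationStrategy):
    """Hypothetical strategy for a Web forum: follow only thread pages and
    extract individual posts instead of archiving the page as an opaque blob."""
    name = "forum"

    def matches(self, url: str, html: str) -> bool:
        return "showthread" in url or 'class="post"' in html

    def crawl(self, url: str, html: str) -> CrawlResult:
        threads = re.findall(r'href="(showthread\.php\?t=\d+)"', html)
        posts = re.findall(r'<div class="post">(.*?)</div>', html, re.S)
        return CrawlResult(self.name, urls_to_follow=threads, records=posts)


STRATEGIES = [ForumStrategy(), ApplicationStrategy()]  # most specific first


def crawl_page(url: str, html: str) -> CrawlResult:
    """Pick the first strategy whose detection rule matches, then apply it."""
    for strategy in STRATEGIES:
        if strategy.matches(url, html):
            return strategy.crawl(url, html)


if __name__ == "__main__":
    page = '<a href="showthread.php?t=42">Thread</a><div class="post">Hello</div>'
    result = crawl_page("http://forum.example.org/index.php", page)
    print(result.application, result.urls_to_follow, result.records)

In this sketch, the forum page is recognised from its markup, only thread URLs are added to the frontier, and individual posts are kept as structured records, which is the kind of annotation a page-level archive lacks.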

Keyphrases

web application, intelligent crawling, meaningful web data, social network, web site, web archivist, phd work, web standard, geolocation service, web crawler, store web page, web page belongs, structured content, adaptive characteristic, web forum, appropriate crawling, social web, world wide web, web crawler aware, page-level archive, content extraction methodology, accessible web application, steady growth, web browser, crawled content, web page
