Results 1 - 10 of 22
Demonstrating Intelligent Crawling and Archiving of Web Applications
2013
"... We demonstrate here a new approach to Web archival crawling, based on an application-aware helper that drives crawls of Web applications according to their types (especially, according to their content management systems). By adapting the crawling strategy to the Web application type, one is able to ..."
Cited by 1 (1 self)
A Demonstration of an Intelligent Crawler for Web Applications (original title: Une démonstration d’un crawler intelligent pour les applications Web)
"... We demonstrate here a new approach to Web archival crawling, based on an application-aware helper that drives crawls of Web applications according to their types (especially, according to their content management systems). By adapting the crawling strategy to the Web application type, one is able to ..."
to crawl a given Web application (say, a given forum or blog) with fewer requests than traditional crawling techniques. Additionally, the application-aware helper is able to extract semantic content from the Web pages crawled, which results in a Web archive of richer value to an archive user. In our
Intelligent and adaptive crawling of Web applications for Web archiving
In ICWE, 2013
"... Web sites are dynamic in nature with content and structure changing over time. Many pages on the Web are produced by content management systems (CMSs) such as WordPress, vBulletin, or phpBB. Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store We ..."
Cited by 10 (4 self)
Web pages, disregarding the CMS the site is based on (leading to suboptimal crawling strategies) and whatever structured content is contained in Web pages (resulting in page-level archives whose content is hard to exploit). We present in this paper an application-aware helper (AAH) that fits
Intelligent crawling of Web applications for Web archiving
"... The steady growth of the World Wide Web raises challenges regarding the preservation of meaningful Web data. Tools used currently by Web archivists blindly crawl and store Web pages found while crawling, disregarding the kind of Web site currently accessed (which leads to suboptimal crawling strateg ..."
Cited by 4 (2 self)
application that uses Web standards such as HTML and HTTP to publish information on the Web, accessible by Web browsers. Examples include Web forums, social networks, geolocation services, etc. We claim that the best strategy to crawl these applications is to make the Web crawler aware of the kind
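The application-aware crawling idea above can be sketched as follows: guess the CMS behind a site, then pick URL templates suited to it. This is a minimal illustration, not the authors' actual helper; the detection rule (a `<meta name="generator">` match) and the URL templates are hypothetical placeholders.

```python
# Sketch of an application-aware crawl dispatcher. Assumption: the CMS is
# guessed from generator markers in the HTML; the real application-aware
# helper uses richer detection patterns. All names here are illustrative.
import re

CMS_PATTERNS = {
    "wordpress": re.compile(r'content="WordPress', re.I),
    "vbulletin": re.compile(r'content="vBulletin', re.I),
    "phpbb":     re.compile(r"phpBB", re.I),
}

def detect_cms(html: str) -> str:
    """Return a CMS label for a page, or 'generic' if nothing matches."""
    for cms, pattern in CMS_PATTERNS.items():
        if pattern.search(html):
            return cms
    return "generic"

def crawl_strategy(cms: str) -> list[str]:
    """Pick URL templates worth fetching for a given CMS (hypothetical)."""
    strategies = {
        # For a blog, monthly archives and feeds enumerate posts cheaply.
        "wordpress": ["/?m={yyyymm}", "/feed/"],
        # For forums, listing and thread pages cover the content without
        # wasting requests on profile or search pages.
        "vbulletin": ["/forumdisplay.php?f={id}", "/showthread.php?t={id}"],
        "phpbb":     ["/viewforum.php?f={id}", "/viewtopic.php?t={id}"],
    }
    return strategies.get(cms, ["/"])  # generic: fall back to blind crawling

page = '<html><head><meta name="generator" content="WordPress 3.5"></head></html>'
print(detect_cms(page))  # wordpress
```

Dispatching on the detected type is what lets such a crawler fetch a forum or blog with fewer requests than a blind breadth-first crawl.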
Intelligent and Adaptive Crawling of Web Applications for Web Archiving (draft)
2012
"... Web sites are dynamic in nature with their content and structure changing over time. Many pages on the Web are produced by content management systems (CMSs) such as WordPress, vBulletin, or phpBB. Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pa ..."
pages, disregarding the CMS the site is based on (which leads to suboptimal crawling strategies) and whatever structured content is contained in Web pages (which results in page-level archives whose content is hard to exploit). We present in this paper an application-aware helper (AAH) that fits
Workload-Aware Web Crawling and Server Workload Detection
In Network Research Workshop, 18th Asia-Pacific Advanced Network Meeting (APAN 2004)
2004
"... With the development of search engines, more and more web crawlers are used to gather web pages. The rising crawling traffic has brought the concern that crawlers may impact web sites. On the other hand, a more efficient crawling strategy is required for the coverage and freshness of the search engine index. ..."
Cited by 2 (0 self)
a server workload-aware crawling strategy is proposed. By measuring the web service time with a hybrid back-to-back packet pair, server workload is detected on the client side, so the crawler can adapt its crawling speed to the web server. The experimental results show the power of our workload detection approach
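The adaptation loop described in this abstract can be sketched roughly as below. This is an assumption-laden simplification: plain response-time samples stand in for the paper's back-to-back packet-pair measurement, and the thresholds and back-off factors are illustrative, not the authors' values.

```python
# Sketch of a server-workload-aware crawl delay. Assumption: response time
# approximates server workload (the paper measures it with a hybrid
# back-to-back packet pair instead); thresholds here are made up.
import time
import urllib.request

class AdaptiveCrawler:
    def __init__(self, base_delay: float = 1.0, slow_threshold: float = 2.0):
        self.delay = base_delay            # seconds to wait between requests
        self.slow_threshold = slow_threshold

    def update_delay(self, service_time: float) -> None:
        """Back off when the server looks loaded, recover when it is idle."""
        if service_time > self.slow_threshold:
            self.delay = min(self.delay * 2, 60.0)  # exponential back-off, capped
        else:
            self.delay = max(self.delay / 2, 0.5)   # speed up gradually

    def fetch(self, url: str) -> bytes:
        """Fetch a page, time the service, and adapt the politeness delay."""
        start = time.monotonic()
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        self.update_delay(time.monotonic() - start)
        time.sleep(self.delay)
        return body

crawler = AdaptiveCrawler()
crawler.update_delay(5.0)   # a slow response doubles the delay
print(crawler.delay)        # 2.0
```

The point of client-side detection is that the crawler needs no cooperation from the server: it throttles itself purely from what it can observe.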
Rank-Aware Crawling of Hidden Web sites
"... An ever-increasing amount of valuable information on the Web today is stored inside online databases and is accessible only after the users issue a query through a search interface. Such information is collectively called the "Hidden Web" and is mostly inaccessible by traditional search engine crawler ..."
Cited by 2 (1 self)
... if we can crawl a Hidden Web site in breadth, i.e. download just the top results for all potential queries, we can enable such applications without the need for allocating resources for fully crawling a potentially huge Hidden Web site. In this paper we present algorithms for crawling a Hidden Web site
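The breadth-crawling idea — cap the cost per query by taking only the top results — can be sketched as follows. The `search` callable and the toy index are hypothetical stand-ins for a site's actual query interface; the paper's ranking-aware query-selection algorithms are not reproduced here.

```python
# Sketch of breadth-first Hidden-Web crawling. Assumption: `search(q)` returns
# a ranked list of result URLs for query q; capping at the top k per query is
# the key idea, bounding the cost of covering a potentially huge site.
def crawl_in_breadth(search, queries, k=10):
    """Collect at most the top-k result URLs for every candidate query."""
    archived = []
    seen = set()
    for q in queries:
        for url in search(q)[:k]:   # top-k only: bounded cost per query
            if url not in seen:
                seen.add(url)
                archived.append(url)
    return archived

# Toy query interface: each query maps to a ranked result list.
fake_index = {
    "crawler": ["/p1", "/p2", "/p3"],
    "archive": ["/p2", "/p4"],
}
print(crawl_in_breadth(fake_index.get, ["crawler", "archive"], k=2))
# ['/p1', '/p2', '/p4']
```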
An agent-based focused crawling framework for topic- and genre-related web document discovery
In Proc. of the 24th IEEE International Conference on Tools with Artificial Intelligence (ICTAI)
2012
"... The discovery of web documents about certain topics is an important task for web-based applications including web document retrieval, opinion mining and knowledge extraction. In this paper, we propose an agent-based focused crawling framework able to retrieve topic- and genre-related web do ..."
Cited by 1 (1 self)
Towards Designing an Efficient Crawling Window to Analysis and Annotate Changes in Linked Data Sources
"... Today the popularity of data quality is increasing in linked data, and its changes are being annotated. Linked-data-consuming applications need to be aware of changes in a dataset. Changes such as updated, removed, or created links may occur over time, so it is necessary to detect them to update local ..."
Cited by 1 (0 self)
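The change detection this abstract calls for — spotting added, removed, and updated links between crawls — can be sketched as a set diff over triples. This is only an illustration of the diff step, under the assumption that triples are plain (subject, predicate, object) tuples; the paper's crawling-window design is not reproduced.

```python
# Sketch of change detection between two crawls of a linked-data source.
# Assumption: triples are (subject, predicate, object) tuples; an "update"
# is a subject+predicate pair present in both crawls with a changed object.
def diff_triples(old, new):
    """Return (added, removed, updated) triple sets between two snapshots."""
    old, new = set(old), set(new)
    removed = old - new
    added = new - old
    old_sp = {(s, p) for (s, p, _) in removed}
    updated = {t for t in added if (t[0], t[1]) in old_sp}
    return added - updated, removed, updated

old = [("ex:a", "ex:label", "Old"), ("ex:b", "ex:type", "Doc")]
new = [("ex:a", "ex:label", "New"), ("ex:c", "ex:type", "Doc")]
added, removed, updated = diff_triples(old, new)
print(sorted(updated))  # [('ex:a', 'ex:label', 'New')]
```

A consuming application would replay `removed` as deletions, `added` as insertions, and `updated` as in-place replacements against its local copy.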
Uncovering the relational web
Under review
2008
"... The World-Wide Web consists of a huge number of unstructured hypertext documents, but it also contains structured data in the form of HTML tables. Many of these tables contain both relational-style data and a small "schema" of labeled and typed columns, making each such table a small structured dat ..."
Cited by 28 (8 self)
to recover column label and type information. Our mix of hand-written detectors and statistical classifiers takes a raw Web crawl as input, and generates a collection of databases that is five orders of magnitude larger than any other collection we are aware of. Relation recovery achieves precision
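The core extraction step — treating the first table row as a small schema of column labels — can be sketched with a toy parser. This assumes a well-formed table whose first row is the header; it does not attempt the paper's mix of hand-written detectors and statistical classifiers for deciding which tables are relational.

```python
# Sketch of recovering a small relation from an HTML table. Assumption: the
# first <tr> holds the column labels; real Web tables need classifiers to
# separate relational tables from layout tables, which this toy skips.
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Accumulate table rows as lists of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.cell = ""

    def handle_endtag(self, tag):
        if tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag in ("td", "th") and self.cell is not None:
            self.row.append(self.cell.strip())
            self.cell = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell += data

def to_relation(html):
    """Return (schema, tuples): first row as labels, remaining rows as data."""
    p = TableExtractor()
    p.feed(html)
    return (p.rows[0], p.rows[1:]) if p.rows else ([], [])

html = ("<table><tr><th>City</th><th>Pop</th></tr>"
        "<tr><td>Paris</td><td>2.1M</td></tr></table>")
schema, tuples = to_relation(html)
print(schema)  # ['City', 'Pop']
```

Applied over a raw Web crawl, each recovered (schema, tuples) pair becomes one small database in a very large collection.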