Demonstrating Intelligent Crawling and Archiving of Web Applications (2013)
Citations: 1 (1 self)
Citations
179 | Path sharing and predicate evaluation for high-performance XML filtering.
- Diao, Altinel, et al.
- 2003
Citation Context ... application type and level will grow with the addition of knowledge about new Web applications. To optimize this detection, we maintain an index of these patterns, that uses a version of the YFilter [4] NFA-based filtering system for XPath expressions adapted to our purposes. 3. Once the system receives a crawling request, it first makes a lookup on the YFilter index to detect the Web application ty...
62 | The volume and evolution of web page templates.
- Gibson, Punera, et al.
- 2005
Citation Context ... can at best host collections with billions of URLs. A large part of the content on the Web comes from Web sites powered by content management systems (CMSs) incorporating content in a fixed template [6]. This includes in particular a number of Web 2.0 and social Web applications such as blogs, forums, wikis. As we argue in [5], content published on this range of Web applications does not only includ...
52 | Reinventing discovery: the new era of networked science.
- Nielsen
- 2011
Citation Context ...feedback [2]; Web forums have become a common way for political dissidents to discuss their agenda [10]; initiatives like the Polymath Project transform blogs into collaborative research whiteboards [11]; user-contributed wikis such as Wikipedia contain quality information to the level of traditional reference materials [7]. Despite the need for durable archiving of this precious content and limited ...
30 | You’ve Got Dissent! Chinese Dissident Use of the Internet and Beijing’s Counter-Strategies
- Chase, Mulvenon
- 2002
Citation Context ...by politicians more and more, both to advertise their political platforms and to listen to citizens’ feedback [2]; Web forums have become a common way for political dissidents to discuss their agenda [10]; initiatives like the Polymath Project transform blogs into collaborative research whiteboards [11]; user-contributed wikis such as Wikipedia contain quality information to the level of traditional ...
16 | Incremental crawling with Heritrix
- Sigurðsson
- 2005
Citation Context ...one mode for testing and demonstration purposes, the AAH has also been integrated, in the framework of the ARCOMEM project, in the crawl processing chain of both Internet Archive’s Heritrix crawler [13] (modified for our purposes) and Internet Memory’s proprietary crawler. 3 Demonstration Scenario We now describe a specific use case where the AAH helps building richer archives with less resource wa...
13 | The Blogs and the New Politics of Listening
- Coleman
Citation Context ...on that are newsworthy today or will be valuable to tomorrow’s historians. Blogs are used by politicians more and more, both to advertise their political platforms and to listen to citizens’ feedback [2]; Web forums have become a common way for political dissidents to discuss their agenda [10]; initiatives like the Polymath Project transform blogs into collaborative research whiteboards [11]; user-c...
13 | H2RDF: adaptive query processing on RDF data in the cloud.
- Papailiou, Konstantinou, et al.
- 2012
Citation Context ...does not only crawl the Web application in an intelligent manner but also extracts Web objects (e.g., timestamp, comments, author). Crawled Web pages and objects are stored in a large-scale RDF store [12] in the form of RDF triples. The user can run semantic queries on the triple store. For instance, she can look for the posts or comments posted by specific users by specifying their names. The interfa...
10 | We knew the web was big..., http://googleblog.blogspot
- Alpert, Hajaj
Citation Context ...the number of Web pages and the amount of user-generated content on the Web: Web users are now billions [9], Web search engine robots such as Google have discovered more than a trillion unique URLs [1], and one of the most popular content-management system for blogs, WordPress, is powering dozens of millions of Web sites [14]. On the other hand, only a small fraction of this Web content can be capt...
10 | Intelligent and adaptive crawling of web applications for web archiving
- Faheem, Senellart
- 2013
Citation Context ...ntent management systems (CMSs) incorporating content in a fixed template [6]. This includes in particular a number of Web 2.0 and social Web applications such as blogs, forums, wikis. As we argue in [5], content published on this range of Web applications does not only include the ramblings of common Web users but also pieces of information that are newsworthy today or will be valuable to tomorrow’s...
7 | ISO 28500:2009, Information and documentation – WARC file format
- ISO
Citation Context ...e). 8. In the process of adaptation, the system also automatically maintains the knowledge base with the newly discovered patterns and actions. 9. The crawled Web pages are stored in the form of WARC [8] files, the standard preservation format for Web archiving. 10. Structured content (individual Web objects with their semantic metadata) is extracted from each crawled page, as described in the knowle...
3 | The indexed Web. http://www.worldwidewebsize.com
- Kunder
- 2013
Citation Context ...cessing costs, or simply the need for selecting high-quality content. This is true for Web search engines: the number of pages indexed by Google in February 2013 is estimated to be around 40 billions [3], to be contrasted to the trillion URLs or so in the frontier. This is all the truer for Web archiving institutions such as Internet Archive and Internet Memory whose mission is to preserve Web co...
1 | Internet users to exceed 2 billion this year. http://www.reuters.com/article/2010/10/19/us-telecoms-internet-idUSTRE69I24720101019
- Lynn
- 2010
Citation Context ... 1 Introduction The advent of the Web 2.0 in the past decade has had significant impact on the number of Web pages and the amount of user-generated content on the Web: Web users are now billions [9], Web search engine robots such as Google have discovered more than a trillion unique URLs [1], and one of the most popular content-management system for blogs, WordPress, is powering dozens of mill...