Article: ARCOMEM Crawling Architecture (2014)
Citations: 1 (1 self)
Citations
4668 | The anatomy of a large-scale hypertextual Web search engine
- Brin, Page
- 1998
Citation Context ... the index of a Web search engine, or to archive them for future reference. 2.1. Web Crawling Descriptions of early versions of Google’s and Internet Archive’s large-scale crawler systems appeared in [9,10], respectively. However, one of the first detailed descriptions of a scalable Web crawler is that of Mercator by Heydon and Najork [11], who provide information on the various modules of the crawler a...
637 | Focused crawling: a new approach to topic-specific Web resource discovery
- Chakrabarti, Berg, et al.
- 1999
Citation Context ...spects of data collection from the Web [23], by selectively crawling pages that are relevant to a set of topics, defined as a set of keywords [24], by example documents mapped to a taxonomy of topics [25], or by ontologies [26,27]. Recent approaches also address the crawling of information for specific geographical locations [28,29]. The main challenges in focused crawling relate to the prioritization...
281 | The evolution of the web and implications for an incremental crawler.
- Cho, Garcia-Molina
- 2000
Citation Context ...er crawling billions of Web pages. As the Web evolves, and Web pages are created, modified, or deleted [17,18], effective crawling approaches are needed to handle these changes. Cho and Garcia Molina [19] describe an incremental crawler for optimizing the average freshness of crawled Web data. Olston and Pandey [20] describe re-crawling strategies to optimize freshness based on the longevity of inform...
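The incremental crawling idea referred to in this context (re-visit pages so that the average freshness of the collection stays high) can be illustrated with a toy scheduler that orders pages by their expected number of unseen changes since the last visit. This is only a sketch of the general idea, not the policy of [19]; the per-page change-rate estimates and the RecrawlScheduler interface are assumptions made for illustration.

```python
import heapq
import time

class RecrawlScheduler:
    """Toy freshness-oriented re-crawl queue: pages whose expected number of
    unseen changes is highest are re-fetched first. Illustration only."""

    def __init__(self):
        self._pages = {}  # url -> (estimated changes per day, last crawl time)

    def register(self, url, changes_per_day):
        self._pages[url] = (changes_per_day, time.time())

    def mark_crawled(self, url):
        rate, _ = self._pages[url]
        self._pages[url] = (rate, time.time())

    def next_batch(self, k):
        """Return the k pages with the most expected unseen changes."""
        now = time.time()
        heap = []
        for url, (rate, last) in self._pages.items():
            expected_changes = rate * (now - last) / 86400.0
            heapq.heappush(heap, (-expected_changes, url))
        return [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]

scheduler = RecrawlScheduler()
scheduler.register("http://example.org/news", changes_per_day=24.0)
scheduler.register("http://example.org/about", changes_per_day=0.01)
print(scheduler.next_batch(1))  # the frequently changing page is re-crawled first
```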
255 | Focused crawling using context graphs.
- Diligenti, Coetzee, et al.
- 2000
Citation Context ...4,26], hyperlink distance-based limits [30,31], or combinations of text and hyperlink analysis with Latent Semantic Indexing (LSI) [32]. Machine learning approaches, including naïve Bayes classifiers [25,33], Hidden Markov Models [34], reinforcement learning [35], genetic algorithms [36], and neural networks [37], have also been applied to prioritize the unvisited URLs. Focused crawlers and crawlers in g...
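As a rough illustration of the machine-learning flavour of URL prioritization mentioned in this context, the sketch below scores unvisited URLs with a naïve Bayes topic classifier applied to the anchor text and the linking page. It is not the method of [25] or [33]; the training snippets, labels, and the score_url helper are invented for the example, and scikit-learn is assumed to be available.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy topic model: texts and labels are placeholders (1 = on-topic, 0 = off-topic).
train_texts = ["financial crisis bank bailout euro",
               "football match goal league season",
               "stock market regulation debt bank",
               "tennis open final set champion"]
train_labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
classifier = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

def score_url(anchor_text, parent_page_text):
    """Priority of an unvisited URL: probability that the context it was
    discovered in (anchor text plus linking page) is on-topic."""
    features = vectorizer.transform([anchor_text + " " + parent_page_text])
    return classifier.predict_proba(features)[0][1]

# Pop URLs from the frontier in descending score order.
frontier = [("http://example.org/markets", "market news", "bank bailout and euro debt"),
            ("http://example.org/sports", "match report", "football league season")]
frontier.sort(key=lambda item: -score_url(item[1], item[2]))
```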
241 | A largescale study of the evolution of web pages.
- Fetterly, Manasse, et al.
- 2003
Citation Context ...as been seen previously. The use of DRUM allows IRLBot to maintain a high crawling rate, even after crawling billions of Web pages. As the Web evolves, and Web pages are created, modified, or deleted [17,18], effective crawling approaches are needed to handle these changes. Cho and Garcia Molina [19] describe an incremental crawler for optimizing the average freshness of crawled Web data. Olston and Pand...
220 | What’s new on the web?: the evolution of the web from a search engine perspective.
- Ntoulas, Cho, et al.
- 2004
Citation Context ...as been seen previously. The use of DRUM allows IRLBot to maintain a high crawling rate, even after crawling billions of Web pages. As the Web evolves, and Web pages are created, modified, or deleted [17,18], effective crawling approaches are needed to handle these changes. Cho and Garcia Molina [19] describe an incremental crawler for optimizing the average freshness of crawled Web data. Olston and Pand...
175 | Ubicrawler: A scalable fully distributed web crawler.
- Boldi, Codenotti, et al.
- 2002
Citation Context ...e crawler, developed at the Internet Archive. In Section 4.4 we will describe how we have adapted Heritrix in order to fit in ARCOMEM’s crawling architecture. Boldi et al. [15] describe UBICrawler, a distributed Web crawler, implemented in Java, which operates in a decentralized way and uses consistent hashing to partition the domains to crawl across the crawling servers. L...
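Consistent hashing, which this context describes UbiCrawler using to assign domains to crawling servers, can be sketched in a few lines: each server is mapped to many points on a hash ring, and a domain is handled by the server owning the first point clockwise from the domain's hash, so adding or removing a server only reassigns a small fraction of domains. The class below is a minimal illustration of that idea, not UbiCrawler's implementation; server names are placeholders.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring for assigning domains to crawl servers."""

    def __init__(self, servers, replicas=100):
        # Each server gets `replicas` virtual points on the ring.
        self._ring = sorted((self._hash(f"{s}#{i}"), s)
                            for s in servers for i in range(replicas))
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)

    def server_for(self, domain):
        """First ring point clockwise from the domain's hash; the same domain
        is therefore always crawled by the same server."""
        idx = bisect.bisect(self._points, self._hash(domain)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["crawler-1", "crawler-2", "crawler-3"])
assert ring.server_for("example.org") == ring.server_for("example.org")
```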
174 | Mercator: a scalable, extensible Web crawler
- Heydon, Najork
- 1999
Citation Context ...and Internet Archive’s large-scale crawler systems appeared in [9,10], respectively. However, one of the first detailed descriptions of a scalable Web crawler is that of Mercator by Heydon and Najork [11], who provide information on the various modules of the crawler and the design options. Najork and Heydon also describe a distributed crawler based on Mercator in [12]. Shkapenyuk and Suel [13] introd...
107 | Design and implementation of a high performance distributed web crawler.
- Shkapenyuk, Suel
- 2002
Citation Context ...Najork [11], who provide information on the various modules of the crawler and the design options. Najork and Heydon also describe a distributed crawler based on Mercator in [12]. Shkapenyuk and Suel [13] introduce a distributed and robust crawler, managing the failure of individual servers. Heritrix [14] is an archival-quality and modular open source crawler, developed at ...
104 | Evaluating topic-driven web crawlers.
- Menczer, Pant, et al.
- 2001
Citation Context ...effective way to balance the cost, coverage, and quality aspects of data collection from the Web [23], by selectively crawling pages that are relevant to a set of topics, defined as a set of keywords [24], by example documents mapped to a taxonomy of topics [25], or by ontologies [26,27]. Recent approaches also address the crawling of information for specific geographical locations [28,29]. The main c...
91 | The shark-search algorithm – an application: tailored Web site mapping.
- Hersovici, Jacovi, et al.
- 1998
Citation Context ...cal locations [28,29]. The main challenges in focused crawling relate to the prioritization of URLs not yet visited, which may be based on similarity measures [24,26], hyperlink distance-based limits [30,31], or combinations of text and hyperlink analysis with Latent Semantic Indexing (LSI) [32]. Machine learning approaches, including naïve Bayes classifiers [25,33], Hidden Markov Models [34], reinforcem...
73 | Siphoning hidden-web data through keyword-based interfaces.
- Barbosa, Freire
- 2010
Citation Context ...ML forms [38]. Such forms are easy to complete by human users. Automatic deep-Web crawlers, however, need to complete HTML forms and retrieve results from the underlying databases. Barbosa and Freire [39] develop mechanisms for generating simple keyword queries that cover the underlying database through unstructured simple search forms. Madhavan et al. [40] handle structured forms by automatically com...
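The keyword-based siphoning strategy described in this context can be sketched as a feedback loop: submit a seed keyword to the search form, mine the returned documents for frequent new terms, and issue those as further queries until a budget is exhausted. The function below is a toy version of that loop, not the algorithm of [39]; submit_query is a placeholder for whatever fetches result pages from a given form and is an assumption of this sketch.

```python
import collections
import re

def siphon(submit_query, seed_keyword, max_queries=50):
    """Toy keyword-based deep-web harvesting: keep issuing the most frequent
    not-yet-used terms found in previously retrieved result documents."""
    seen_docs, used, counts = set(), set(), collections.Counter()
    queue = [seed_keyword]
    while queue and len(used) < max_queries:
        keyword = queue.pop(0)
        used.add(keyword)
        for doc in submit_query(keyword):        # doc: text of one result page
            if doc in seen_docs:
                continue
            seen_docs.add(doc)
            counts.update(re.findall(r"[a-z]{4,}", doc.lower()))
        # Pick the most frequent term not yet issued as the next query.
        for word, _ in counts.most_common():
            if word not in used and word not in queue:
                queue.append(word)
                break
    return seen_docs
```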
69 | Crawling towards eternity: Building an archive of the world wide web.
- Burner
- 1997
Citation Context ... the index of a Web search engine, or to archive them for future reference. 2.1. Web Crawling Descriptions of early versions of Google’s and Internet Archive’s large-scale crawler systems appeared in [9,10], respectively. However, one of the first detailed descriptions of a scalable Web crawler is that of Mercator by Heydon and Najork [11], who provide information on the various modules of the crawler a...
65 | Information retrieval in the world-wide web: Making client-based searching feasible.
- De Bra, Post
- 1994
Citation Context ...cal locations [28,29]. The main challenges in focused crawling relate to the prioritization of URLs not yet visited, which may be based on similarity measures [24,26], hyperlink distance-based limits [30,31], or combinations of text and hyperlink analysis with Latent Semantic Indexing (LSI) [32]. Machine learning approaches, including naïve Bayes classifiers [25,33], Hidden Markov Models [34], reinforcem...
52 | High-performance Web crawling
- Najork, Heydon, et al.
- 2001
Citation Context ...of Mercator by Heydon and Najork [11], who provide information on the various modules of the crawler and the design options. Najork and Heydon also describe a distributed crawler based on Mercator in [12]. Shkapenyuk and Suel [13] introduce a distributed and robust crawler, managing the failure of individual servers. Heritrix [14] is an archival-quality and modular open source crawler, developed at ...
45 | Ontology-focused Crawling of Web Documents
- Ehrig
- 2003
Citation Context ...on from the Web [23], by selectively crawling pages that are relevant to a set of topics, defined as a set of keywords [24], by example documents mapped to a taxonomy of topics [25], or by ontologies [26,27]. Recent approaches also address the crawling of information for specific geographical locations [28,29]. The main challenges in focused crawling relate to the prioritization of URLs not yet visited, ...
44 | Introduction to Heritrix, an archival quality web crawler.
- Mohr, Stack, et al.
- 2004
Citation Context ...ork and Heydon also describe a distributed crawler based on Mercator in [12]. Shkapenyuk and Suel [13] introduce a distributed and robust crawler, managing the failure of individual servers. Heritrix [14] is an archival-quality and modular open source crawler, developed at the Internet Archive. In Section 4.4 we will describe how we have adapted Heritrix in order to fit in ...
41 | Recrawl scheduling based on information longevity
- Olston, Pandey
- 2008
Citation Context ...ffective crawling approaches are needed to handle these changes. Cho and Garcia Molina [19] describe an incremental crawler for optimizing the average freshness of crawled Web data. Olston and Pandey [20] describe re-crawling strategies to optimize freshness based on the longevity of information on Web pages. Pandey and Olston [21] also introduce a parameterized algorithm for monitoring Web resources ...
29 | A longitudinal study of web pages continued: a consideration of document persistence
- Koehler
- 2004
Citation Context ...acquisition 1. Introduction The World Wide Web is the largest information repository. However, this information is very volatile: the typical half-life of content referenced by URLs is of a few years [1]; this trend is even aggravated in social media, where social networking APIs sometimes only extend to a week’s worth of content [2]. Web archiving [3] deals with the collection, enrichment, curation,...
28 | WIC: A general-purpose algorithm for monitoring web information sources
- Pandey, Dhamdhere, et al.
- 2004
Citation Context ...ptimizing the average freshness of crawled Web data. Olston and Pandey [20] describe re-crawling strategies to optimize freshness based on the longevity of information on Web pages. Pandey and Olston [21] also introduce a parameterized algorithm for monitoring Web resources for updates and optimizing timeliness or completeness depending on application-specific requirements. 2.2. Focused and Deep-Web C...
23 | THESUS: Organizing Web document collections based on link semantics
- Halkidi, Nguyen, et al.
- 2003
Citation Context ...on from the Web [23], by selectively crawling pages that are relevant to a set of topics, defined as a set of keywords [24], by example documents mapped to a taxonomy of topics [25], or by ontologies [26,27]. Recent approaches also address the crawling of information for specific geographical locations [28,29]. The main challenges in focused crawling relate to the prioritization of URLs not yet visited, ...
23 | Combining text and link analysis for focused crawling
- Almpanidis, Kotropoulos
Citation Context ...of URLs not yet visited, which may be based on similarity measures [24,26], hyperlink distance-based limits [30,31], or combinations of text and hyperlink analysis with Latent Semantic Indexing (LSI) [32]. Machine learning approaches, including naïve Bayes classifiers [25,33], Hidden Markov Models [34], reinforcement learning [35], genetic algorithms [36], and neural networks [37], have also been appl...
20 | White paper: the deep web: surfacing hidden value.
- Bergman
- 2001
Citation Context ...ers and crawlers in general can harvest data from the publicly indexable Web by following hyperlinks between Web pages. However, there is a very large part of the Web that is hidden behind HTML forms [38]. Such forms are easy to complete by human users. Automatic deep-Web crawlers, however, need to complete HTML forms and retrieve results from the underlying databases. Barbosa and Freire [39] develop ...
19 | Data quality in web archiving.
- Spaniol, Denev, et al.
- 2009
Citation Context ...ing complete snapshots of a domain taken at regular intervals. A drawback of this approach is the lack of knowledge about changes of Web pages between crawls and the consistency of the collected data [44]. The latter approach results in higher quality collections restricted only to selected Web sites. Denev et al. [45] introduce a framework for assessing the quality of archives and tune the crawling s...
17 | Evolving strategies for focused web crawling
- Johnson, Tsioutsiouliklis, et al.
Citation Context ...k analysis with Latent Semantic Indexing (LSI) [32]. Machine learning approaches, including naïve Bayes classifiers [25,33], Hidden Markov Models [34], reinforcement learning [35], genetic algorithms [36], and neural networks [37], have also been applied to prioritize the unvisited URLs. Focused crawlers and crawlers in general can harvest data from the publicly indexable Web by following hyperlinks b...
17 | A survey on web archiving initiatives
- Gomes, Miranda, et al.
- 2011
Citation Context ...icted only to selected Web sites. Denev et al. [45] introduce a framework for assessing the quality of archives and tune the crawling strategies to optimize quality with given resources. Gomes et al. [46] provide a survey of Web archiving initiatives. Focused crawlers, as described above, can be used for creating focused Web archives, by relying on a selective content acquisition approach. The crawlin...
16 | Incremental crawling with Heritrix
- Sigurðsson
- 2005
Citation Context ...tive crawls require a lot of manual work for the crawl preparation, crawler control, and quality assurance. On the technical level, current-day archiving crawlers, such as Internet Archive’s Heritrix [4], crawl the Web in a conceptually simple manner (See Figure 1). They start from a seed list of URLs (typically provided by a Web archivist) to be stored in a queue. Web pages are then fetched from thi...
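The conceptually simple crawl loop described in this context (seed URLs stored in a queue, pages fetched, links extracted and enqueued if unseen) can be written down directly. The sketch below uses only the Python standard library and deliberately omits politeness delays, robots.txt handling, and WARC storage, so it illustrates the loop rather than Heritrix itself; the seed URLs and page limit are arbitrary.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=100):
    """Queue-based crawl loop: pop a URL, fetch it, extract links,
    append unseen absolute URLs back onto the queue."""
    queue, seen, archive = deque(seeds), set(seeds), {}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                      # unreachable page: skip it
        archive[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return archive
```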
15 | SHARC: Framework for quality-conscious web archiving
- Denev, Mazeika, et al.
Citation Context ...about changes of Web pages between crawls and the consistency of the collected data [44]. The latter approach results in higher quality collections restricted only to selected Web sites. Denev et al. [45] introduce a framework for assessing the quality of archives and tune the crawling strategies to optimize quality with given resources. Gomes et al. [46] provide a survey of Web archiving initiatives....
13 | Focused crawling for both topical relevance and quality of medical information
- Tang, Hawking, et al.
- 2005
Citation Context ...-specific requirements. 2.2. Focused and Deep-Web Crawling Focused or topical crawlers [22] provide an effective way to balance the cost, coverage, and quality aspects of data collection from the Web [23], by selectively crawling pages that are relevant to a set of topics, defined as a set of keywords [24], by example documents mapped to a taxonomy of topics [25], or by ontologies [26,27]. Recent appr...
11 | Exploiting the social and semantic Web for guided Web archiving. In TPDL,
- Risse, Dietze, et al.
- 2012
Citation Context ...tion and quality assurance. It is the aim of the ARCOMEM project [5] to support the selective crawling on the technical level by leveraging social media and semantics to build meaningful Web archives [6]. This requires, in particular, a change of paradigm in how content is collected technically via Web crawling, which is the topic of the present article. This traditional processing chain of a Web cra...
11 | NEER: An unsupervised method for named entity evolution recognition.
- Tahmasebi, Gossen, et al.
- 2012
Citation Context ...ons of Web objects that have been collected over time and can cover several crawls. Analysis implemented exemplary on this level within the ARCOMEM system is used to recognize Named Entity Evolutions [47] and to analyze the evolutions of associations between interesting terms and tweets (Twitter Dynamics) [48]. 3.5. Applications For the interaction with the crawler and exploration of the content a num...
10 | Adaptive geospatially focused crawling
- Ahlers, Boll
- 2009
Citation Context ...et of keywords [24], by example documents mapped to a taxonomy of topics [25], or by ontologies [26,27]. Recent approaches also address the crawling of information for specific geographical locations [28,29]. The main challenges in focused crawling relate to the prioritization of URLs not yet visited, which may be based on similarity measures [24,26], hyperlink distance-based limits [30,31], or combinati...
10 | Intelligent and adaptive crawling of web applications for web archiving
- Faheem, Senellart
- 2013
Citation Context ...ces in the process. Application-aware crawling also helps adding semantic information to the ARCOMEM database. More details about the functioning and independent evaluation of the AAH are provided in [49,50]. 4.2. Online Analysis Within the online analysis, several modules analyze crawled Web objects in order to guide the crawler. The purpose of this process is to provide scores for detected URLs. These ...
9 | Using HMM to learn user browsing patterns for focused web crawling.
- Liu, Janssen, et al.
- 2006
Citation Context ...d limits [30,31], or combinations of text and hyperlink analysis with Latent Semantic Indexing (LSI) [32]. Machine learning approaches, including naïve Bayes classifiers [25,33], Hidden Markov Models [34], reinforcement learning [35], genetic algorithms [36], and neural networks [37], have also been applied to prioritize the unvisited URLs. Focused crawlers and crawlers in general can harvest data fro...
8 | IRLbot: Scaling to 6 billion pages and beyond
- Lee, Leonard, et al.
- 2009
Citation Context ...UBICrawler, a distributed Web crawler, implemented in Java, which operates in a decentralized way and uses consistent hashing to partition the domains to crawl across the crawling servers. Lee et al. [16] describe the architecture and main data structures of IRLBot, a crawler which implements DRUM (Disk Repository with Update Management) for checking whether a URL has been seen previously. The use of ...
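The point of DRUM, as described in this context, is to answer "has this URL been seen before?" with batched, sequential disk access rather than random lookups. The class below only caricatures that idea by buffering URL hashes in memory and merging them against a sorted file in one pass; real DRUM uses bucketed disk buffers and is considerably more elaborate, and the file name and batch size here are arbitrary assumptions.

```python
import hashlib
import os

class SeenUrlStore:
    """Very simplified batched check-and-update of a seen-URL set."""

    def __init__(self, path="seen_urls.txt", batch_size=10000):
        self.path, self.batch_size, self.buffer = path, batch_size, set()
        if not os.path.exists(path):
            open(path, "w").close()

    @staticmethod
    def _key(url):
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def add(self, url):
        self.buffer.add(self._key(url))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Merge the in-memory batch with the on-disk keys in one pass
        and report which keys were genuinely new."""
        with open(self.path) as f:
            on_disk = {line.strip() for line in f if line.strip()}
        new_keys = sorted(self.buffer - on_disk)
        merged = sorted(on_disk | self.buffer)
        with open(self.path, "w") as f:
            f.write("\n".join(merged) + ("\n" if merged else ""))
        self.buffer.clear()
        return new_keys
```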
8 | Geographically focused collaborative crawling.
- Gao, HC, et al.
- 2006
Citation Context ...et of keywords [24], by example documents mapped to a taxonomy of topics [25], or by ontologies [26,27]. Recent approaches also address the crawling of information for specific geographical locations [28,29]. The main challenges in focused crawling relate to the prioritization of URLs not yet visited, which may be based on similarity measures [24,26], hyperlink distance-based limits [30,31], or combinati...
8 | UKWAC: Building the UK’s First Public Web Archive. D-Lib Magazine 12(1). Available online: http://www.dlib.org/dlib/january06/thompson/01thompson.html
- Bailey, Thompson
- 2006
Citation Context ...ze and dynamics, there have been several national initiatives for preserving the Web of a country, based on full crawls in Sweden [41] and on a selective collection of Web pages in the United Kingdom [42] and Australia [43]. The former approach aims at providing complete snapshots of a domain taken at regular intervals. A drawback of this approach is the lack of knowledge a...
6 | An ontology-based approach to learnable focused crawling. Information Sciences 178(23): 4512–4522
- Zheng, Kang, et al.
- 2008
Citation Context ...antic Indexing (LSI) [32]. Machine learning approaches, including naïve Bayes classifiers [25,33], Hidden Markov Models [34], reinforcement learning [35], genetic algorithms [36], and neural networks [37], have also been applied to prioritize the unvisited URLs. Focused crawlers and crawlers in general can harvest data from the publicly indexable Web by following hyperlinks between Web pages. However,...
5 | Archiving the Web: The PANDORA Archive at the National Library of Australia. Available online: http://www.nla.gov.au/openpublish/index.php/nlasp/article/view/1314/1600
- Cathro, Webb, et al.
- 2014
Citation Context ...ere have been several national initiatives for preserving the Web of a country, based on full crawls in Sweden [41] and on a selective collection of Web pages in the United Kingdom [42] and Australia [43]. The former approach aims at providing complete snapshots of a domain taken at regular intervals. A drawback of this approach is the lack of knowledge about changes of Web...
3 | Reinforcement Learning with Classifier Selection for Focused Crawling
- Partalas, Vlahavas
- 2008
Citation Context ...ions of text and hyperlink analysis with Latent Semantic Indexing (LSI) [32]. Machine learning approaches, including naïve Bayes classifiers [25,33], Hidden Markov Models [34], reinforcement learning [35], genetic algorithms [36], and neural networks [37], have also been applied to prioritize the unvisited URLs. Focused crawlers and crawlers in general can harvest data from the publicly indexable Web ...
2 | Scalable, generic, and adaptive systems for focused crawling
- Gouriten, Maniu, et al.
Citation Context ...rithm for monitoring Web resources for updates and optimizing timeliness or completeness depending on application-specific requirements. 2.2. Focused and Deep-Web Crawling Focused or topical crawlers [22] provide an effective way to balance the cost, coverage, and quality aspects of data collection from the Web [23], by selectively crawling pages that are relevant to a set of topics, defined as a set ...
2 | The Kulturarw3 project – The Royal Swedish Web Archiw3e – An example of “complete” collection of Web pages. Paper presented at ...
- Mannerheim, Arvidson, et al.
- 2000
Citation Context ...Since archiving the whole Web is a very challenging task due to its size and dynamics, there have been several national initiatives for preserving the Web of a country, based on full crawls in Sweden [41] and on a selective collection of Web pages in the United Kingdom [42] and Australia [43]. The former approach aims at providing complete snapshots of a domain taken at reg...
1 | Web Archiving; Springer-Verlag
- Masanès
- 2006
Citation Context ... of content referenced by URLs is of a few years [1]; this trend is even aggravated in social media, where social networking APIs sometimes only extend to a week’s worth of content [2]. Web archiving [3] deals with the collection, enrichment, curation, and preservation of today’s volatile Web content in an archive that remains accessible to tomorrow’s historians. Different strategies for Web archivin...
1 | An Architecture for Selective Web Harvesting: The Use Case of Heritrix
- Plachouras, Carpentier, et al.
- 2013
Citation Context ...s article is to present an overview of this crawling architecture, and of its performance (both in terms of efficiency and of quality of the archive obtained) on real-Web crawls. This article extends [7]. The remainder of this work is organized as follows. We first discuss in Section 2 the related work. Then we present in Section 3 a high-level view of the ARCOMEM architecture, before reviewing indiv...
1 | Assessing the Coverage of Data Collection Campaigns on Twitter: A Case Study
- Plachouras, Stavrakas, et al.
Citation Context ...emplary on this level within the ARCOMEM system is used to recognize Named Entity Evolutions [47] and to analyze the evolutions of associations between interesting terms and tweets (Twitter Dynamics) [48]. 3.5. Applications For the interaction with the crawler and exploration of the content a number of applications are used around the ARCOMEM core system. The crawler cockpit is used to create the craw...
1 | Demonstrating intelligent crawling and archiving of web applications
- Faheem, Senellart
- 2013
Citation Context ...ces in the process. Application-aware crawling also helps adding semantic information to the ARCOMEM database. More details about the functioning and independent evaluation of the AAH are provided in [49,50]. 4.2. Online Analysis Within the online analysis, several modules analyze crawled Web objects in order to guide the crawler. The purpose of this process is to provide scores for detected URLs. These ...