WEB ARCHIVE SERVICES FRAMEWORK FOR TIGHTER INTEGRATION BETWEEN THE PAST AND PRESENT WEB, 2014
Cited by 1 (0 self)
Web archives have contained the cultural history of the web for many years, but they still have a limited capability for access. Most of the web archiving research has focused on crawling and preservation activities, with little focus on the delivery methods. The current access methods are tightly coupled with web archive infrastructure, hard to replicate or integrate with other web archives, and do not cover all the users’ needs. In this dissertation, we focus on the access methods for archived web data to enable users, third-party developers, researchers, and others to gain knowledge from the web archives. We build ArcSys, a new service framework that extracts, preserves, and exposes APIs for the web archive corpus. The dissertation introduces a novel categorization technique to divide the archived corpus into four levels. For each level, we will propose suitable services and APIs that enable both users and third-party developers to build new interfaces. The first level is the content level that extracts the content from the archived web data. We develop ArcContent to expose the web archive content processed through various filters. The second level is the metadata level; we extract the metadata from the archived web data and make it available to users. We implement two services, ArcLink for temporal web graph and ArcThumb for optimizing the thumbnail creation in the web archives. The third level is the URI level that focuses on using the URI HTTP redirection
Profiling Web Archive Coverage for Top-Level Domain and Content Language
The Memento Aggregator currently polls every known public web archive when serving a request for an archived web page, even though some web archives focus on only specific domains and ignore the others. Similar to query routing in distributed search, we investigate the impact on aggregated Memento TimeMaps (lists of when and where a web page was archived) of sending queries only to the archives likely to hold the archived page. We profile fifteen public web archives using data from a variety of sources (the web, archives’ access logs, and full-text queries to archives) and use these profiles as resource descriptors. These profiles are used to match URI-lookup requests to the most probable web archives. We define RecallTM(n) as the percentage of a TimeMap that was returned using n web archives. We discover that sending queries to only the top three web archives (i.e., an 80% reduction in the number of queries) for any request reaches, on average, RecallTM = 0.96. If we exclude the Internet Archive from the list, we can reach RecallTM = 0.647 on average using only the remaining top three web archives.
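The RecallTM(n) measure and the quoted 80% query reduction can be illustrated with a minimal Python sketch. It assumes a TimeMap can be modeled as a set of memento identifiers per archive and that the archives have already been ranked by some profile; the function names, archive names, and sample data are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Set


def recall_tm(holdings: Dict[str, Set[str]], ranked: List[str], n: int) -> float:
    """Fraction of the full aggregated TimeMap recovered from the top-n archives.

    holdings maps an archive name to the set of memento identifiers it holds
    for one URI (an assumed representation). ranked lists archive names from
    most to least probable according to some profile.
    """
    full_timemap = set().union(*holdings.values())
    if not full_timemap:
        return 0.0
    partial_timemap: Set[str] = set()
    for archive in ranked[:n]:
        partial_timemap |= holdings.get(archive, set())
    return len(partial_timemap) / len(full_timemap)


def query_reduction(total_archives: int, n: int) -> float:
    """Fraction of queries saved by polling n archives instead of all of them."""
    return 1 - n / total_archives


# Hypothetical holdings for a single URI across three archives.
holdings = {
    "archive_a": {"m1", "m2", "m3"},
    "archive_b": {"m3", "m4"},
    "archive_c": {"m5"},
}
ranked = ["archive_a", "archive_b", "archive_c"]

print(recall_tm(holdings, ranked, 2))  # 4 of 5 mementos recovered
print(query_reduction(15, 3))          # top 3 of 15 archives
```

With fifteen profiled archives, polling only three gives 1 - 3/15 = 0.8, which matches the 80% reduction stated in the abstract. The recall value in the sketch depends entirely on the made-up holdings and says nothing about the paper's measured results.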