Results 1 - 10 of 174
Extended Static Checking for Java, 2002
"... Software development and maintenance are costly endeavors. The cost can be reduced if more software defects are detected earlier in the development cycle. This paper introduces the Extended Static Checker for Java (ESC/Java), an experimental compile-time program checker that finds common programming ..."
Abstract - Cited by 638 (24 self)
Software development and maintenance are costly endeavors. The cost can be reduced if more software defects are detected earlier in the development cycle. This paper introduces the Extended Static Checker for Java (ESC/Java), an experimental compile-time program checker that finds common programming errors. The checker is powered by verification-condition generation and automatic theorem-proving techniques. It provides programmers with a simple annotation language with which programmer design decisions can be expressed formally. ESC/Java examines the annotated software and warns of inconsistencies between the design decisions recorded in the annotations and the actual code, and also warns of potential runtime errors in the code. This paper gives an overview of the checker architecture and annotation language and describes our experience applying the checker to tens of thousands of lines of Java programs.
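
As a concrete illustration of the annotation style this abstract refers to, here is a minimal sketch of a toy class annotated with ESC/Java-style pragmas (requires, ensures, invariant, non_null). The class itself is hypothetical and not taken from the paper.

    // Toy example of ESC/Java-style annotations; the class is illustrative, not from the paper.
    public class Counter {
        /*@ non_null */ private int[] samples = new int[16];
        private int count;

        //@ invariant 0 <= count && count <= samples.length;

        //@ requires value >= 0;
        //@ ensures count == \old(count) + 1;
        public void add(int value) {
            if (count == samples.length) {                    // grow before the bound is reached
                int[] larger = new int[2 * samples.length];
                System.arraycopy(samples, 0, larger, 0, samples.length);
                samples = larger;
            }
            samples[count] = value;   // a checker like ESC/Java can verify this index stays in bounds
            count++;
        }
    }

The checker compares the design decisions recorded in the annotations (the precondition, postcondition, and invariant) against the method body and warns when they disagree or when a runtime error such as an out-of-bounds index is possible.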
Crawling the Hidden Web - In VLDB, 2001
"... Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. ..."
Abstract - Cited by 279 (2 self)
Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration.
Parallel crawlers - In Proceedings of the 11th International Conference on World Wide Web, 2002
"... In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and ident ..."
Abstract - Cited by 133 (3 self)
In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues related to parallel crawling. Based on this understanding, we then propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.
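
To make the partitioning question concrete, the following sketch assigns URLs to crawler processes by hashing the host name, so that each process owns a disjoint slice of the Web. This is one common scheme chosen here for illustration only; it is not necessarily one of the architectures compared in the paper.

    import java.net.URI;

    // Illustrative host-hash partitioning: each URL is owned by exactly one of N crawler
    // processes, so pages on the same host are fetched by the same process and only
    // cross-partition links need to be exchanged between processes.
    public class UrlPartitioner {
        private final int numCrawlers;

        public UrlPartitioner(int numCrawlers) {
            this.numCrawlers = numCrawlers;
        }

        // Index of the crawler process responsible for this URL.
        public int assign(String url) {
            String host = URI.create(url).getHost();
            if (host == null) host = url;                  // no host component: fall back to the raw string
            return Math.floorMod(host.hashCode(), numCrawlers);
        }

        public static void main(String[] args) {
            UrlPartitioner p = new UrlPartitioner(4);
            System.out.println(p.assign("http://example.com/a.html"));  // same host ...
            System.out.println(p.assign("http://example.com/b.html"));  // ... same crawler
            System.out.println(p.assign("http://other.example.org/"));
        }
    }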
Breadth-first search crawling yields high-quality pages - In Proc. 10th International World Wide Web Conference, 2001
"... This paper examines the average page quality over time of pages downloaded during a web crawl of 328 million unique pages. We use the connectivity-based metric PageRank to measure the quality of a page. We show that traversing the web graph in breadth-first search order is a good crawling strategy, ..."
Abstract - Cited by 126 (4 self)
This paper examines the average page quality over time of pages downloaded during a web crawl of 328 million unique pages. We use the connectivity-based metric PageRank to measure the quality of a page. We show that traversing the web graph in breadth-first search order is a good crawling strategy, as it tends to discover high-quality pages early on in the crawl.
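
The breadth-first ordering itself is simple to state in code: seed a FIFO frontier with the start pages and always fetch the oldest discovered URL next. The sketch below, with fetching and link extraction stubbed out behind an interface, only illustrates that ordering; it is not the crawler used in the paper.

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;

    // Breadth-first crawl ordering: a FIFO frontier means pages closest to the seeds are
    // fetched first, which is the ordering the paper finds tends to reach high-quality
    // (high-PageRank) pages early.  Fetching and parsing are stubbed out.
    public class BfsCrawler {
        interface Fetcher { List<String> fetchAndExtractLinks(String url); }

        public static void crawl(List<String> seeds, Fetcher fetcher, int maxPages) {
            Queue<String> frontier = new ArrayDeque<>(seeds);
            Set<String> seen = new HashSet<>(seeds);
            int fetched = 0;
            while (!frontier.isEmpty() && fetched < maxPages) {
                String url = frontier.poll();
                fetched++;
                for (String link : fetcher.fetchAndExtractLinks(url)) {
                    if (seen.add(link)) {      // add() is false if the URL was already discovered
                        frontier.add(link);    // FIFO order = breadth-first over the web graph
                    }
                }
            }
        }
    }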
On near-uniform URL sampling, 2000
"... We consider the problem of sampling URLs uniformly at random from the Web. A tool for sampling URLs uniformly can be used to estimate various properties of Web pages, such as the fraction of pages in various Internet domains or written in various languages. Moreover, uniform URL sampling can be used ..."
Abstract - Cited by 124 (6 self)
We consider the problem of sampling URLs uniformly at random from the Web. A tool for sampling URLs uniformly can be used to estimate various properties of Web pages, such as the fraction of pages in various Internet domains or written in various languages. Moreover, uniform URL sampling can be used to determine the sizes of various search engines relative to the entire Web. In this paper, we consider sampling approaches based on random walks of the Web graph. In particular, we suggest ways of improving sampling based on random walks to make the samples closer to uniform. We suggest a natural test bed based on random graphs for testing the effectiveness of our procedures. We then use our sampling approach to estimate the distribution of pages over various Internet domains and to estimate the coverage of various search engines.
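
For reference, a plain random walk over an in-memory link graph looks like the sketch below. Such a walk is biased toward high-degree pages; the paper's contribution is how to correct walks of this kind so that the resulting URL samples are close to uniform, which the sketch does not attempt.

    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    // Bare-bones random walk on a link graph held in memory as an adjacency map.
    // The page reached after many steps is a (non-uniform) sample; making such
    // samples near-uniform is exactly the problem studied in the paper.
    public class RandomWalkSampler {
        public static String sample(Map<String, List<String>> outLinks, String start,
                                    int steps, Random rng) {
            String current = start;
            for (int i = 0; i < steps; i++) {
                List<String> links = outLinks.get(current);
                if (links == null || links.isEmpty()) {
                    current = start;                           // dead end: restart the walk
                } else {
                    current = links.get(rng.nextInt(links.size()));
                }
            }
            return current;
        }
    }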
Effective Page Refresh Policies for Web Crawlers - ACM Transactions on Database Systems, 2003
"... In this paper we study how we can maintain local copies of remote data sources "fresh," when the source data is updated autonomously and independently. In particular, we study the problem of Web crawlers that maintain local copies of remote Web pages for Web search engines. In this context ..."
Abstract - Cited by 91 (3 self)
In this paper we study how we can maintain local copies of remote data sources "fresh," when the source data is updated autonomously and independently. In particular, we study the problem of Web crawlers that maintain local copies of remote Web pages for Web search engines. In this context, remote data sources (Web sites) do not notify the copies (Web crawlers) of new changes, so we need to periodically poll the sources to keep the copies up-to-date. Since polling the sources takes significant time and resources, it is very difficult to keep the copies completely up-to-date. This paper studies page refresh policies that keep the local copies as fresh as possible under these resource constraints.
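
As a baseline for the setting the abstract describes, the sketch below spends a fixed per-cycle download budget on the pages whose local copies have gone longest without being polled. This "stalest first" rule is only an illustrative baseline; which policies actually maximize freshness is the subject of the paper.

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Illustrative baseline refresh policy: with a fixed download budget per cycle,
    // re-fetch the pages that have waited longest since their last poll.
    public class StalestFirstScheduler {
        // lastRefresh: URL -> time of last poll (epoch millis); budget: downloads allowed this cycle.
        public static List<String> pickPagesToRefresh(Map<String, Long> lastRefresh, int budget) {
            return lastRefresh.entrySet().stream()
                    .sorted(Map.Entry.comparingByValue())      // oldest refresh time first
                    .limit(budget)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }
    }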
The Link Database: Fast Access to Graphs of the Web
"... ... graph where URLs are nodes and hyperlinks are directed edges. The Link Database provides fast access to the hyperlinks. To support a wide range of graph algorithms, we find it important to fit the Link Database into memory. In the first version of the Link Database, we achieved this fit by using ..."
Abstract - Cited by 48 (2 self)
The Web can be viewed as a graph where URLs are nodes and hyperlinks are directed edges. The Link Database provides fast access to the hyperlinks. To support a wide range of graph algorithms, we find it important to fit the Link Database into memory. In the first version of the Link Database, we achieved this fit by using machines with lots of memory (8GB), and storing each hyperlink in 32 bits. However, this approach was limited to roughly 100 million Web pages. This paper presents techniques to compress the links to accommodate larger graphs. Our techniques combine well-known compression methods with methods that depend on the properties of the web graph. The first compression technique takes advantage of the fact that most hyperlinks on most Web pages point to other pages on the same host as the page itself. The second technique takes advantage of the fact that many pages on the same host share hyperlinks, that is, they tend to point to a common set of pages. Together, these techniques reduce space requirements to under 6 bits per link. While (de)compression adds latency to the hyperlink access time, we can still compute the strongly connected components of a 6-billion-edge graph in under 20 minutes and run applications such as Kleinberg's HITS in real time. This paper describes our techniques for compressing the Link Database, and provides performance numbers for compression ratios and decompression speed.
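
The locality described in the abstract (most links point to pages on the same host, so destination IDs in a sorted adjacency list are close together) is what makes gap encoding effective. The sketch below shows plain varint-coded gaps; it is a standard technique in the same spirit, not the Link Database's actual format.

    import java.io.ByteArrayOutputStream;
    import java.util.Arrays;

    // Gap (delta) encoding of a sorted adjacency list: store the difference between
    // consecutive destination IDs as variable-length bytes.  Clustered destinations
    // (links to the same host) produce small gaps that fit in one or two bytes.
    public class GapEncoder {
        public static byte[] encode(int[] sortedDestIds) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int prev = 0;
            for (int id : sortedDestIds) {
                int gap = id - prev;              // small when destinations are clustered
                prev = id;
                while ((gap & ~0x7F) != 0) {      // varint: 7 payload bits per byte, high bit = "more"
                    out.write((gap & 0x7F) | 0x80);
                    gap >>>= 7;
                }
                out.write(gap);
            }
            return out.toByteArray();
        }

        public static void main(String[] args) {
            int[] links = {100003, 100004, 100010, 250000};   // hypothetical destination IDs
            System.out.println(Arrays.toString(links) + " -> " + encode(links).length + " bytes");
        }
    }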
Effective Change Detection using Sampling
"... For a large-scale data-intensive environment, such as the World-Wide Web or data warehousing, we often make local copies of remote data sources. Due to limited network and computational resources, however, it is often difficult to monitor the sources constantly to check for changes and to download c ..."
Abstract - Cited by 48 (7 self)
For a large-scale data-intensive environment, such as the World-Wide Web or data warehousing, we often make local copies of remote data sources. Due to limited network and computational resources, however, it is often difficult to monitor the sources constantly to check for changes and to download changed data items to the copies. In this scenario, our goal is to detect as many changes as we can using the fixed download resources that we have. In this paper we propose three sampling-based download policies that can identify more changed data items effectively. In our sampling-based approach, we first sample a small number of data items from each data source and download more data items from the sources with more changed samples. We analyze the effectiveness of the sampling-based policies and compare our proposed policies to existing ones, including the state-of-the-art frequency-based policy in [7, 9]. Our experiments on synthetic and real-world data show the relative merits of various policies and the great potential of our sampling-based policy. In certain cases, our sampling-based policy could download twice as many changed items as the best existing policy.
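
The policy described here (sample a few items per source, then spend the remaining budget on the sources whose samples changed most) can be sketched as below. Allocating the remaining downloads proportionally to the changed fraction is our own simplification for illustration, not necessarily the exact rule proposed in the paper.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a sampling-based download policy: after probing a few items per source,
    // split the remaining download budget across sources in proportion to the fraction
    // of their sampled items that had changed.
    public class SamplingDownloadPolicy {
        // changedSampleFraction: source -> fraction of sampled items found changed.
        // remainingBudget: downloads left after the sampling phase.
        public static Map<String, Integer> allocate(Map<String, Double> changedSampleFraction,
                                                    int remainingBudget) {
            double total = changedSampleFraction.values().stream()
                    .mapToDouble(Double::doubleValue).sum();
            Map<String, Integer> allocation = new HashMap<>();
            for (Map.Entry<String, Double> e : changedSampleFraction.entrySet()) {
                int share = total == 0 ? 0 : (int) Math.round(remainingBudget * e.getValue() / total);
                allocation.put(e.getKey(), share);
            }
            return allocation;
        }
    }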
MultiCrawler: A Pipelined Architecture for Crawling and Indexing Semantic Web Data - Proceedings of the 5th International Semantic Web Conference, 2006
"... Abstract. The goal of the work presented in this paper is to obtain large amounts of semistructured data from the web. Harvesting semistructured data is a prerequisite to enabling large-scale query answering over web sources. We contrast our approach to conventional web crawlers, and de-scribe and e ..."
Abstract - Cited by 41 (15 self)
The goal of the work presented in this paper is to obtain large amounts of semistructured data from the web. Harvesting semistructured data is a prerequisite to enabling large-scale query answering over web sources. We contrast our approach to conventional web crawlers, and describe and evaluate a five-step pipelined architecture to crawl and index data from both the traditional and the Semantic Web.
IRLbot: Scaling to 6 Billion Pages and Beyond
"... Abstract—This paper shares our experience in designing a web crawler that can download billions of pages using a singleserver implementation and models its performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, BFS crawl order, and fixed per-host ratelimi ..."
Abstract - Cited by 41 (2 self)
This paper shares our experience in designing a web crawler that can download billions of pages using a single-server implementation and models its performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, BFS crawl order, and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly-branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the web graph with 41 billion unique nodes.
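
Two of the operations the abstract singles out, checking URL uniqueness and enforcing per-host rate limits, are easy to write down for a small crawl but stop fitting in memory at billions of URLs, which is the scaling problem IRLbot addresses. The in-memory sketch below only shows the operations themselves; it does not reproduce IRLbot's disk-based data structures.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // In-memory sketch of URL-seen checking plus a fixed per-host delay between requests.
    // At IRLbot scale neither the HashSet nor the per-host HashMap fits in RAM; the paper's
    // techniques exist precisely to replace structures like these.
    public class PolitenessGate {
        private final Set<String> seenUrls = new HashSet<>();
        private final Map<String, Long> nextAllowedFetch = new HashMap<>();
        private final long perHostDelayMillis;

        public PolitenessGate(long perHostDelayMillis) {
            this.perHostDelayMillis = perHostDelayMillis;
        }

        // Returns true if the URL is new and its host is not currently rate-limited.
        // A caller should requeue the URL if it was rejected only because of rate-limiting.
        public synchronized boolean admit(String url, String host, long nowMillis) {
            if (seenUrls.contains(url)) return false;                 // duplicate URL
            long notBefore = nextAllowedFetch.getOrDefault(host, 0L);
            if (nowMillis < notBefore) return false;                  // host still rate-limited
            seenUrls.add(url);
            nextAllowedFetch.put(host, nowMillis + perHostDelayMillis);
            return true;
        }
    }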