Results 1 - 4 of 4
Optimizing Recursive Information Gathering Plans
, 1999
Abstract - Cited by 50 (10 self)
In this paper we describe two optimization techniques that are specially tailored for information gathering. The first is a greedy minimization algorithm that minimizes an information gathering plan by removing redundant and overlapping information sources without loss of completeness. We then discuss a set of...
Optimizing Recursive Information Gathering Plans in EMERAC
- Journal of Intelligent Information Systems
, 2004
Abstract - Cited by 5 (1 self)
In this paper we describe two optimization techniques that are specially tailored for information gathering. The first is a greedy minimization algorithm that minimizes an information gathering plan by removing redundant and overlapping information sources without loss of completeness. We then discuss a set of heuristics that guide the greedy minimization algorithm so as to remove costlier information sources first. In contrast to previous work, our approach can handle recursive query plans that arise commonly in the presence of constrained sources. Second, we present a method for ordering the access to sources to reduce the execution cost. This problem differs significantly from the traditional database query optimization problem, as sources on the Internet have a variety of access limitations and the execution cost in information gathering is affected both by network traffic and by the connection setup costs. Furthermore, because of the autonomous and decentralized nature of the Web, very few cost statistics about the sources may be available. In this paper, we propose a heuristic algorithm for ordering source calls that takes these constraints into account. Specifically, our algorithm takes both access costs and traffic costs into account, and is able to operate with very coarse statistics about sources (i.e., without depending on full source statistics). Finally, we discuss the implementation and empirical evaluation of these methods in Emerac, our prototype information gathering system.
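The greedy minimization idea in the abstract above can be illustrated with a toy set-based analogue. This is only a sketch under strong assumptions: each source is modeled as an explicit set of answer tuples and a scalar cost, whereas the paper works on (possibly recursive) query plans. The names `greedy_minimize`, `coverage`, and `cost` are hypothetical and do not come from the paper or from Emerac.

```python
from typing import Dict, Set


def greedy_minimize(coverage: Dict[str, Set[str]], cost: Dict[str, float]) -> Set[str]:
    """Drop redundant sources, costliest first, without losing any answers.

    coverage maps a (hypothetical) source name to the set of tuples it returns;
    cost maps the same names to an assumed access cost.
    """
    kept = set(coverage)
    # Everything the full set of sources can return; this must stay covered.
    target = set().union(*coverage.values()) if coverage else set()
    # Heuristic described in the abstract: try to remove costlier sources first.
    for name in sorted(coverage, key=lambda s: cost[s], reverse=True):
        rest = kept - {name}
        covered = set().union(*(coverage[s] for s in rest)) if rest else set()
        if covered == target:
            # The source is redundant given the others, so it can be dropped.
            kept = rest
    return kept


sources = {"A": {"t1", "t2"}, "B": {"t2", "t3"}, "C": {"t1", "t2", "t3"}}
costs = {"A": 1.0, "B": 2.0, "C": 10.0}
minimal = greedy_minimize(sources, costs)
```

With these made-up inputs the expensive source C is removed because A and B together return the same tuples, while neither A nor B can be dropped afterwards.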
Joint Optimization of Cost and Coverage of Information Gathering Plans
Abstract - Cited by 5 (1 self)
Existing approaches for optimizing queries in information integration use decoupled strategies--attempting to optimize coverage and cost in two separate phases. Since sources tend to have a variety of access limitations, this type of phased optimization of cost and coverage can unfortunately lead to expensive planning as well as highly inefficient plans. In this paper we present techniques for joint optimization of cost and coverage of the query plans. Our algorithms search in the space of parallel query plans that support multiple sources for each subgoal conjunct. The refinement of the partial plans takes into account the potential parallelism between source calls, and the binding compatibilities between the sources included in the plan. We start by introducing and motivating our query plan representation, and arguing that our way of searching in the space of parallel plans can improve both the plan generation and plan execution costs compared to existing approaches. We then briefly review how to compute the cost and coverage of a parallel plan. Next, we provide both a System-R style query optimization algorithm and a greedy local search algorithm for searching in the space of such query plans. Finally, we present an empirical evaluation that demonstrates the flexibility and efficiency afforded by our algorithms in handling cost-coverage tradeoffs, in comparison to the existing approaches.
SYSTEM R STYLE JOIN ORDER OPTIMIZATION FOR INTERNET INFORMATION GATHERING
, 2001
Abstract
Internet information gathering is the process of gathering data from sources that include those scattered over the Internet. Query optimization problems for Internet information gathering differ from those of traditional databases due to the lack of knowledge of the behavior of sources and a myriad of binding constraints that exist for many sources over the Web. Traditional System R style optimizers lose their efficacy when sources are spread across the Internet with high access costs compared to secondary storage media. Such optimizers cannot be used for Internet sources due to various binding restrictions and query capacities. This research proposes a System R style optimizer that takes into account the binding patterns and restrictions that most Internet sources have. It considers both left and right linear evaluations along with bushy joins. The proposed algorithm assumes full knowledge of statistics and generates a join order accordingly. However, in the absence of full statistics, it degrades gracefully and maintains its improvement over previous algorithms.
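As a rough illustration of System R style ordering under binding restrictions, the sketch below runs a dynamic program over subsets of sources and only extends a plan with a source whose required input attributes are already bound. It is not the thesis's algorithm: it covers left-linear orders only, and the cost model (cost per call times the number of partial results so far, with a fixed fanout per source) is an assumption made for the example. `Source`, `best_order`, and all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Source:
    name: str
    needs: FrozenSet[str]  # attributes that must be bound before the call
    gives: FrozenSet[str]  # attributes bound once the call returns
    call_cost: float       # assumed cost of one call
    fanout: float          # assumed average number of tuples per call


def best_order(sources: List[Source],
               bound: FrozenSet[str]) -> Optional[Tuple[float, List[str]]]:
    """Return (cost, order) of the cheapest feasible left-linear order, or None."""
    # best[subset] = (cost, cardinality, order, bound attributes)
    Entry = Tuple[float, float, List[str], FrozenSet[str]]
    best: Dict[FrozenSet[str], Entry] = {frozenset(): (0.0, 1.0, [], bound)}
    for size in range(len(sources)):
        # Extend every plan of the current size by one more source.
        for subset in [k for k in best if len(k) == size]:
            cost, card, order, attrs = best[subset]
            for s in sources:
                # Skip sources already used or whose inputs are not yet bound.
                if s.name in subset or not s.needs <= attrs:
                    continue
                key = subset | {s.name}
                new_cost = cost + card * s.call_cost  # one call per partial result
                if key not in best or new_cost < best[key][0]:
                    best[key] = (new_cost, card * s.fanout,
                                 order + [s.name], attrs | s.gives)
    full = frozenset(s.name for s in sources)
    if full not in best:
        return None  # no order satisfies every binding restriction
    return best[full][0], best[full][2]


a = Source("A", frozenset(), frozenset({"x"}), call_cost=5.0, fanout=10.0)
b = Source("B", frozenset({"x"}), frozenset({"y"}), call_cost=1.0, fanout=1.0)
c = Source("C", frozenset(), frozenset({"z"}), call_cost=2.0, fanout=2.0)
plan = best_order([a, b, c], frozenset())
```

Here B can only be called after A has bound `x`, so orders that start with B are never generated; among the feasible orders the dynamic program keeps the cheapest plan per subset, in the spirit of System R.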

