P. J. Haas and A. N. Swami. Sequential sampling procedures for query size estimation. In Proc. of SIGMOD Conf., pages 341--350, 1992.
....the cdf of the attribute X. Note that, with respect to attribute X, the set of tuples in relation R and the fdf and cdf of X are equivalent; i.e. they carry the same information and one can be derived from the other. The first category of the estimation techniques is tuple sampling [e.g. HS92, HS95, GM98, HHW97, HH99, LNS90]. The tuple sampling technique summarizes a relation R by taking uniform samples from the tuples in R. As shown in Figure 1(a), the summarized version of relation R is the sample set r. Intuitively, when a query is posed to the estimator, the estimator logically ....
Peter J. Haas and Arun N. Swami. Sequential sampling procedures for query size estimation. In Proceedings of 1992 ACM SIGMOD international conference on Management of data, pages 341--350, 1992.
....information about data and query distributions. The use of histograms is crucial for effective query optimization, and has received considerable research attention. Existing approaches can be classified into two categories depending on whether they take into account only the data distribution [HS92, IP95, GM98, APR99, WAA01], or also consider the query patterns [CR94, GLR00, BCG01, WAA02]. Although our framework can be used with any histogram, for the sake of simplicity and generality, we adopt the equi-length method (in fact more sophisticated histograms lead to even better performance). Specifically, the data ....
Haas, P., Swami, A. Sequential Sampling Procedures for Query Size Estimation. ACM SIGMOD, 1992.
....scan sampling algorithms may be more efficient due to reduced seek time of sequential vs. random disk reads. While such efficiencies may be insignificant for hashed files, they are potentially significant (e.g. a factor of 3-4) for B-tree files. In a subsequent paper, Haas & Swami [HS92a, HS92b] developed improved stopping rules for sequential sampling of selectivity estimation. Haas & Swami first observed that Lipton et al. were using a priori bounds for the mean and variance of the population in their stopping rule. Haas & Swami therefore suggested estimating the mean and variance for ....
Peter J. Haas and Arun N. Swami. Sequential Sampling Procedures for Query Size Estimation. In ACM SIGMOD International Conference on the Management of Data, pages 341--350, June 1992.
....Sequential scan sampling algorithms may be more efficient due to reduced seek time of sequential vs. random disk reads. While such efficiencies may be insignificant for hashed files, they are potentially significant (e.g. a factor of 3-4) for B-tree files. In a subsequent paper, Haas & Swami [HS92a, HS92b] developed improved stopping rules for sequential sampling of selectivity estimation. Haas & Swami first observed that Lipton et al. were using a priori bounds for the mean and variance of the population in their stopping rule. Haas & Swami therefore suggested estimating the mean and ....
Peter J. Haas and Arun N. Swami. Sequential Sampling Procedures for Query Size Estimation. Technical Report RJ 8558, IBM Almaden, January 1992.
....taken over some logical underlying domain. Statistical sampling and related techniques are frequently proposed for approximating selectivity and projectivity where the uniform distribution assumption is violated. Such approaches include Hou et al. [HOD91], Lipton et al. [LNS90], Haas and Swami [HS92] and Haas et al. [HNSS95]. Histogram techniques [PC84] are also used to improve selectivity estimates. As an alternative to sampling, Sun et al. propose using a regression model to approximate the underlying distribution of the data [SLRD93]. Initial results combining statistical sampling ....
Peter J. Haas and Arun N. Swami. Sequential sampling procedures for query size estimation. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 341--350, 1992.
....kernel density estimators to efficiently address the multi-dimensional range query selectivity problem. We used Scott's rule for setting the bandwidths. We presented an experimental study that shows performance improvements over traditional techniques for density estimation, including sampling (Haas & Swami, 1992), multi-dimensional histograms (Poosala & Ioannidis, 1997) and wavelets (Vitter et al., 1998). The main advantage of kernel density estimators is that the estimator can be computed very efficiently in one dataset pass, during which we both sample the dataset and approximate the standard deviation ....
Haas, P. J., & Swami, A. N. (1992). Sequential Sampling Procedures for Query Size Estimation. Proc. of the ACM SIGMOD Intern. Conf. on Management of Data.
....mathematical distribution or a polynomial. Although requiring very little overhead, these approaches are typically inaccurate because most often real data does not follow any mathematical function. On the other hand, those based on sampling primarily operate at run time [OR86, LNS90, HS92, HS95] and compute their estimates by collecting and possibly processing random samples of the data. Although producing highly accurate estimates, sampling is quite expensive and, therefore, its practicality in query optimization is questionable, especially since optimizers need query result size ....
P. Haas and A. Swami. Sequential sampling procedures for query size estimation. In Proc. of the 1992 ACM-SIGMOD Conference on the Management of Data, pages 341--350, San Diego, CA, June 1992.
.... estimators) and most of the existing techniques for estimating the selectivity of multidimensional range queries for real attributes (wavelet transform [35] multi dimensional histogram MHIST [27] one dimensional estimation techniques with the attribute independence assumption, and sampling [13]) We include the attribute independence assumption in our study as a baseline comparison. The experimental results show that we can efficiently build selectivity estimators for multi dimensional datasets with real attributes. Although the accuracy of all the techniques drops rapidly with the ....
P.J. Haas, A.N. Swami. Sequential Sampling Procedures for Query Size Estimation. Proc. of the 1992 ACM SIGMOD, pp. 341-350, June 1992.
....the resulting size of a query. In this paper, we are particularly interested in estimating the size of selection or range queries that are defined over a single attribute of a relational table. Random samples of tuples from the base relation of a database can be used for selectivity estimation [HS92, HS95, GM98, HHW97, HH99, LNS90]. The AQUA system [GPA 98, GM98, AGPR99] uses random samples of tuples for general purpose query result estimation. The idea is to create a down-sized copy of the original relation and run the queries against the down-sized copy, which is significantly smaller ....
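The sampling idea in the excerpt above can be illustrated with a minimal sketch. It is not the procedure of Haas and Swami or of the AQUA system; it only shows the basic estimate obtained by running a selection predicate against a uniform random sample. The function name `estimate_selectivity`, the toy relation, and the fixed sample size are assumptions made for illustration.

```python
import random


def estimate_selectivity(relation, predicate, sample_size, seed=0):
    """Estimate the fraction of tuples satisfying `predicate`.

    Hypothetical helper: draws a uniform sample without replacement
    and returns the fraction of sampled tuples that match.
    """
    if not 0 < sample_size <= len(relation):
        raise ValueError("sample_size must be in 1..len(relation)")
    rng = random.Random(seed)
    sample = rng.sample(relation, sample_size)
    matches = sum(1 for t in sample if predicate(t))
    return matches / sample_size


# Toy relation: one integer attribute with values 0..9999.
relation = list(range(10_000))
selectivity = estimate_selectivity(relation, lambda x: x < 2_500, sample_size=500)

# Scale the sampled fraction up to an estimated result size.
estimated_size = selectivity * len(relation)
```

A fixed sample size is the simplest choice; the sequential procedures cited in this listing instead decide when to stop sampling based on the observations collected so far.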
Peter J. Haas and Arun N. Swami. Sequential sampling procedures for query size estimation. In Proceedings of 1992 ACM SIGMOD international conference on Management of data, pages 341--350, 1992.
.... Other works on incremental maintenance of approximate synopses include [FM83, FM85, WVZT90, HNSS95, AMS96, GMP97b, GP97] Finally, there has been considerable work on sampling based estimation algorithms for use within a query optimizer (e.g. H OT88, H OT89, LN89, LN90, LNS90, H OD91, HS92, LS92, LNSS93, HNSS93, HNS94, LN95, HNSS95, GGMS96] None of this previous work uses the new techniques described in this paper. 9 Conclusions This paper describes the Aqua system, for fast, highly accurate approximate query answers. It is well known that join operators seriously degrade ....
P. J. Haas and A. N. Swami. Sequential sampling procedures for query size estimation. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 1--11, June 1992.
....operations with smaller overheads while reporting an approximate min in response to findmin and deletemin operations. These data structures have linear space footprints. The design of sampling based estimation algorithms is a popular area of research [H OT88, H OT89, LN89, LN90, LNS90, H OD91, HS92, LS92, LNSS93, HNSS93, HNS94, LN95, HNSS95, GGMS96] Results in [LNS90, H OD91, HS92, HNS94] and elsewhere demonstrate the practicality of estimation procedures based on sampling by showing that the time taken to compute the estimate is a small fraction of the time taken to compute the actual ....
....and deletemin operations. These data structures have linear space footprints. The design of sampling based estimation algorithms is a popular area of research [H OT88, H OT89, LN89, LN90, LNS90, H OD91, HS92, LS92, LNSS93, HNSS93, HNS94, LN95, HNSS95, GGMS96] Results in [LNS90, H OD91, HS92, HNS94] and elsewhere demonstrate the practicality of estimation procedures based on sampling by showing that the time taken to compute the estimate is a small fraction of the time taken to compute the actual query. Studies of the relative merits of various types of histograms in estimating ....
P. J. Haas and A. N. Swami. Sequential sampling procedures for query size estimation. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 1--11, June 1992.
....the density estimator for attributes with finite discrete domains. They include computing multi dimensional histograms [25] 1] 18] 2] using the wavelet transformation [33] 21] SVD [25] 2 or the discrete cosine transform [20] on the data, using kernel estimators [3] and sampling [23] 19] [11]. Density estimator techniques attempt to define a function that approximates the data distribution. Since we must be able to derive the approximate solution to a query quickly, the description of the function must be kept in memory. Further, we may have to answer queries on many datasets, so the ....
.... estimators) and most of the existing techniques for estimating the selectivity of multidimensional range queries for real attributes (wavelet transform [33] multi dimensional histogram MHIST [25] one dimensional estimation techniques with the attribute independence assumption, and sampling [11]) We include the attribute independence assumption in our study as a baseline comparison. The experimental results show that we can efficiently build selectivity estimators for multi dimensional datasets with real attributes. Although the accuracy of all the techniques drops rapidly with the ....
P.J. Haas, A.N. Swami. Sequential Sampling Procedures for Query Size Estimation. In Proc. of the 1992 ACM SIGMOD Intern. Conf. on Management of Data, June 1992.
....[GES85] In the database community, the problem has been studied in the field of query optimization and more specifically in the context of selectivity estimation for relational operators. Several techniques have been proposed [MCS88] including histograms [Koo80, SC84, Ioa93, IP95] sampling [OR86, LNS90, HS92], and parametric techniques. Histograms are the most commonly used form of statistics in practice (e.g. they are used in DB2, Oracle, and Microsoft SQL Server) because they incur almost no run time overhead and are effective even with a very small amount of storage space. Several types of ....
P. Haas and A. Swami. Sequential Sampling Procedures for Query Size Estimation. Proceedings of ACM SIGMOD, San Diego, CA, pages 341--
....system state (a recently changed resident volume). The time in this state is set to a number of transaction completions that provides statistical significance. We currently set it to 50 in all cases, but this length could also be dynamically determined for each class using sampling techniques [Haas 91]. If response time goals are being met at the end of 50 completions, then the class is moved to steady state, otherwise new target residencies are set, statistics are reset, and the class moves to transition up or transition down. • Steady State: A class enters steady state when its response ....
P. Haas, A. Swami, "Sequential Sampling Procedures for Query Size Estimation," Proc. ACM SIGMOD '92 Conf., San Diego, CA, June 1992.
....delivers sufficient accuracy for a few nested operators above the retrieval nodes. Sampling techniques for a variety of stored data structures are described in [OlRo89] [OlRo90] [Ant93] [OlRo93]. Algorithms and stop rules for sampling estimation of joins and selects are presented in [LiNS90] [HaSw92]. The more operators are involved in a subquery subject to direct sampling estimation, the larger certainty areas can be potentially uncovered. However, restrictions on sample sizes aiming to keep estimation cost lower than execution cost limit the number of nested operators to a few for any ....
P. Haas and A. Swami, "Sequential Sampling Procedures for Query Size Estimation," Proceedings of the ACM SIGMOD Conference, (June 1992).
....Good estimates for the cost of database operations are thus critical to the effective operation of query optimisers and ultimately of the database systems that rely on them. This paper proposes a novel sampling based method to improve such cost estimation for the join operation. Most previous work [7, 9, 6, 3, 4] on sampling based methods has focused on simple random sampling (SRS) whereby each unit (tuple) in the population (relation) of interest has an equal chance to be selected in the sample. Simple random sampling can be performed under two distinct regimes. The first is with replacement; that is, ....
....it is simple to implement. The second scheme does not allow replacement; any unit (tuple) already selected can not be selected again. This scheme which we call SRSWOR requires a more sophisticated data structure to do the sampling. The simple random sampling methods proposed in the literature [9, 6, 3] differ from one another primarily in their stopping conditions, i.e. when to stop sampling. Systematic sampling was first proposed by [12] in the context of multidatabase systems; this work made no assumptions about the sortedness of the underlying relations. In this paper, we suggest that a ....
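The two sampling regimes described above can be sketched as follows. This is only an in-memory illustration under the assumption that the whole population fits in a Python list; the function names are hypothetical, and the stopping conditions that distinguish the cited methods are not modeled.

```python
import random


def srs_with_replacement(population, n, rng):
    """Each draw is independent, so a unit may be selected more than once."""
    return [population[rng.randrange(len(population))] for _ in range(n)]


def srs_without_replacement(population, n, rng):
    """Partial Fisher-Yates shuffle on a copy: no unit is selected twice."""
    if n > len(population):
        raise ValueError("cannot draw more units than the population holds")
    pool = list(population)
    for i in range(n):
        j = rng.randrange(i, len(pool))  # pick among the not-yet-chosen units
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:n]


rng = random.Random(42)
tuples = list(range(100))
with_repl = srs_with_replacement(tuples, 20, rng)
without_repl = srs_without_replacement(tuples, 20, rng)
```

The extra bookkeeping in the second function (tracking which units remain) is a small-scale analogue of the additional data structure that the excerpt says sampling without replacement requires.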
P. J. Haas and A. N. Swami. Sequential Sampling Procedures for Query Size Estimation. In ACM SIGMOD Conference on the Management of Data, pages 341--350, 1992.
....assumption is not essential but will simplify our presentation. Also, all tuples are assumed to have the same size. In the presence of certain database characteristics and data skew, we only have to modify the formula for estimating the cardinalities of resulting relations from joins accordingly [20, 23] when applying our join sequence scheduling and processor allocation schemes. Results on the effect of data skew can be found in [27, 51] 3 Determining the Execution Sequence of Joins In this section, we shall propose and evaluate various join sequence heuristics. Specifically, we focus on ....
P. Haas and A. Swami. Sequential Sampling Procedures for Query Size Estimation. Proceedings of ACM SIGMOD, pages 341--350, June 1992.
....thus critical to the effective operation of query optimisers and ultimately of the database systems that rely on them. This paper proposes a novel method to improve such cost estimation. There has been a considerable amount of work on the issue of selectivity estimation over one and a half decades [22, 6, 7, 19, 13, 11, 17, 18, 16, 8, 23, 5]. This work can be classified into four categories [23, 5] namely parametric, histogram, curve fitting and sampling. Let us briefly describe each of them; the reader can find more details in the references given above. Parametric The parametric methods [22, 6, 7] are ones which depend upon ....
P. Haas and A. Swami. Sequential Sampling Procedures for Query Size Estimation. In ACM SIGMOD Conference on the Management of Data, pages 341--350, 1992.
....thus critical to the effective operation of query optimisers and ultimately of the database systems that rely on them. This paper proposes a novel method to improve such cost estimation. There has been a considerable amount of work on the issue of selectivity estimation over one and a half decades [19, 5, 6, 16, 11, 9, 14, 15, 13, 7, 20, 4]. This previous work can be classified into four categories [20, 4] namely non parametric, parametric, sampling and curve fitting. Let us briefly describe each of them; the reader can find more details in the references given above. The non parametric method is table or histogrambased [16, 15] ....
....The method will give accurate query size estimates if the actual data distribution follows the a priori assumption. In reality, data distributions in real databases may not fit well with the assumptions and, consequently, the quality of the size estimates could be unreliable. The sampling method [13, 7] has recently received considerable interest. The accuracy of this method depends upon the size of samples; the higher the sample size, the better the estimation. Given complex queries which consist of several selection and join operations, the method may require a nontrivial amount of time to do ....
P. Haas and A. Swami. Sequential Sampling Procedures for Query Size Estimation. In ACM SIGMOD Conference on the Management of Data, pages 341--350, 1992.
....in general. 2.3.1 Pruning Unlikely Candidates We would like to compare only the most likely candidate patterns with the entire set. The main question from an optimization point of view is which candidates to compare. Our strategy is as follows. We use simple random sampling without replacement [28, 38, 51, 64] to select sample sequences from the set. Consider a candidate pattern P . Let D (a, respectively) denote the number of sequences in the entire set D (the sample A, respectively) that contain P within the allowed number of distance. Let N be the database size and n the sample size; F = D=N and f = ....
P. J. Haas and A. N. Swami, "Sequential sampling procedures for query size estimation," in Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, San Diego, CA, pp. 341--350, June 1992.
....database which includes the sets of coefficients of the fitting functions and the overflow arrays. Yet, our approach is achieved at the cost of the accuracy of the query results. Recent work on query optimization essentially attempt to map the distribution of the actual data by statistical methods [7, 9, 12, 15, 18, 19, 20, 23, 25, 27]. We intend to map the actual data directly against regression functions. With histogram methods [18, 19, 23, 25] one needs to store the detailed statistics about the database. Parametric methods have a problem [7] if no known statistical model fits the actual distribution, any attempt to ....
....19, 23, 25] one needs to store the detailed statistics about the database. Parametric methods have a problem [7] if no known statistical model fits the actual distribution, any attempt to approximate the distribution will be in vain. Sampling methods are rather costly due to run time disk I O [9, 12, 15, 20]. Only curve fitting methods are similar to our approach [27] Our approach can provide accurate approximating functions to any kind of data set since the functions are derived from the characteristics of the actual data. Chen and Roussopoulos proposed a method of approximating the attribute value ....
P. J. Haas and A. N. Swami, Sequential Sampling Procedures for Query Size Estimation, In Proceeding of ACM-SIGMOD International Conference on Management of Data, San Diego, CA, 1992, 341-350.
....Good estimates for the cost of database operations are thus critical to the effective operation of query optimisers and ultimately of the database systems that rely on them. This paper proposes a novel sampling based method to improve such cost estimation for the join operation. Most previous work [7, 9, 6, 3, 4] on sampling based methods has focused on simple random sampling (SRS) whereby each unit (tuple) in the population (relation) of interest has an equal chance to be selected in the sample. Simple random sampling can be performed under two distinct regimes. The first is with replacement; that is, ....
....to its simpler implementation. The second scheme does not allow replacement; any unit (tuple) already selected can not be selected again. This scheme which we call SRSWOR requires a more sophisticated data structure to do the sampling. The simple random sampling methods proposed in the literature [9, 6, 3] differ from one another primarily in their stopping conditions, i.e. when to stop sampling. Systematic sampling was first proposed by [12] in the context of multidatabase systems, and with unsorted data. In this paper, we suggest that a new systematic sampling method (SYSSMP) that exploits the ....
P. J. Haas and A. N. Swami. Sequential Sampling Procedures for Query Size Estimation. In ACM SIGMOD Conference on the Management of Data, pages 341--350, 1992.
....the resulting sizes of queries. Sample tuples are taken from the relations, and queries are performed against these samples to collect the statistics. Sufficient samples must be examined before desired accuracy can be achieved. Variations of this method have been proposed in [HOT88, LN90, HS92]. Though the sampling method usually gives more accurate estimation than all other methods (supposing sufficient samples are taken), it is primarily used in answering statistical queries (such as COUNT( In the context of query optimization where selectivity estimation is much more frequent, ....
P. Haas and A. Swami. Sequential sampling procedures for query size estimation. In Proceedings of the ACM-SIGMOD Intl. Conf. on Management of Data, pages 341--350, San Diego, CA, 1992.
....accurate if the actual data distribution follows the a priori assumptions well, but data distributions in real databases may not fit well with the assumed distributions. These methods also have problems if the data distribution changes over time as a result of database updates. Sampling methods [11, 5] execute the queries to be optimised on small subsets (samples) of the real database, and use the results of these trials to determine cost estimates. These methods can give very accurate estimates for complex query plans, since they are effectively pre executing plans to determine the costs. ....
P. Haas and A. Swami. Sequential Sampling Procedures for Query Size Estimation. In ACM SIGMOD Conference on the Management of Data, pages 341--350, 1992.
....We have succeeded in decomposing the original signal into a lower resolution (two value) version and a pair of detail coefficients. By repeating this process recursively on the averages, we get the full decomposition. Resolution 4: averages [2, 2, 7, 9]. Resolution 2: averages [2, 8], detail coefficients [0, 2]. Resolution 1: averages [5], detail coefficients [6]. We define the wavelet transform (also called wavelet decomposition) of the original four value signal to be the single coefficient representing the overall average of the original signal, followed by the detail coefficients in the order of increasing resolution. Thus, for the one-dimensional ....
....signal to be the single coefficient representing the overall average of the original signal, followed by the detail coefficients in the order of increasing resolution. Thus, for the one-dimensional Haar basis, the wavelet transform of our original cumulative frequencies is given by Ŝ = [5, 6, 0, 2]. The individual entries are called the wavelet coefficients. The wavelet decomposition is very efficient computationally, requiring only O(N) time to compute for a signal of N frequencies. No information has been gained or lost by this process. The original signal has four values, and so does the ....
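The recursive averaging described in the excerpt can be sketched in a few lines. The detail-coefficient convention used here (second value minus first, unnormalized) is an assumption chosen because it reproduces the numbers [5, 6, 0, 2] quoted above; the cited paper may use a different sign or scaling, since minus signs can be lost in text extraction. The function name is hypothetical.

```python
def wavelet_transform(signal):
    """Haar-style decomposition by repeated pairwise averaging.

    Returns the overall average followed by the detail coefficients
    in order of increasing resolution. Assumes len(signal) is a power of 2.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    averages = list(signal)
    details_by_level = []  # finest resolution first
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # Assumed convention: detail = second - first (unnormalized).
        details_by_level.append([b - a for a, b in pairs])
        averages = [(a + b) / 2 for a, b in pairs]
    result = averages  # single overall average
    for details in reversed(details_by_level):  # coarsest details first
        result = result + details
    return result


coefficients = wavelet_transform([2, 2, 7, 9])
```

Each pass halves the number of averages, so the total work is proportional to N + N/2 + N/4 + ..., which is consistent with the O(N) cost stated in the excerpt.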
P. Haas and A. Swami. Sequential sampling procedures for query size estimation. In Proceedings of the 1992 ACM SIGMOD Conference, 1992.
....on accurate cost estimation of various query reorderings [BGI] Estimating predicate selectivity, or the fraction of rows in a database that satisfy a selection predicate, is key to determining the optimal join order. Previous work has concentrated on estimating selectivity for numeric fields [ASW, HaSa, IoP, LNS, SAC, WVT]. With the popularity of textual data being stored in databases, it has become important to estimate selectivity accurately for alphanumeric fields. A particularly problematic predicate used against alphanumeric fields is the SQL like predicate [Dat] Techniques used for estimating numeric ....
....consulted to estimate selectivity; the processing in the query optimization phase must be minimal. Further, the space available in the metadata descriptors for any one column of the database is limited. Models already exist in current day relational DBMS to estimate selectivity for numeric fields [ASW, HaSa, IoP, LNS, SAC, WVT]. Typically, in the runstats phase, a few numbers that capture the distribution of data are accumulated and stored in the metadata, as histograms, for example. The problem of estimating alphanumeric selectivity is a natural extension to the problem of estimating numeric selectivity: the like ....
P. Haas and A. Swami, "Sequential Sampling Procedures for Query Size Estimation," Proc. of the 1992 ACM SIGMOD Conference, 341--350.
....critical importance of good quality estimation. Several techniques have been proposed in the literature to estimate query result sizes, most of them contained in the extensive survey by Mannino, Chu, and Sager [16] and elsewhere [4]. Those based on sampling primarily operate at run time [7, 8, 15] and compute their estimates by collecting and possibly processing random samples of the data. Sampling is quite expensive and, therefore, its practicality is questionable, especially since it is performed mostly at run time and optimizers need query result size estimations frequently. ....
P. Haas and A. Swami. Sequential sampling procedures for query size estimation. In Proc. of the 1992 ACM SIGMOD Conf., pages 341--350, San Diego, CA, June 1992.
....is taken over some logical underlying domain. Statistical sampling and related techniques are frequently proposed for approximating selectivity and projectivity where the uniform distribution assumption is violated. Such approaches include Hou et al. [HOD91], Lipton et al. [LNS90], Haas and Swami [HS92] and Haas et al. [HNSS95]. Histogram techniques [PC84] are also used to improve selectivity estimates. Initial results combining statistical sampling techniques with our semantic estimation techniques are promising. As an alternative to sampling, Sun et al. use a regression model to approximate ....
Peter J. Haas and Arun N. Swami. Sequential sampling procedures for query size estimation. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 341--350, 1992.
....factor in evaluating the cost of each query execution plan is selectivity, the number of tuples in the result of a relational selection or join operator. Various methods based on maintaining attribute value distributions [Chr83, PSC84, MD88, SLRD93, Ioa93] and query sampling [HOT88, LN90, HS92] have been proposed to facilitate selectivity estimation. ADMS uses and adaptively maintains approximating functions for value distributions of attributes used in query predicates. We implemented both polynomials and splines and an adaptive curve-fitting module which feeds back accurate ....
P. Haas and A. Swami. Sequential Sampling Procedures for Query Size Estimation. In Procs. of the ACM SIGMOD Intl. Conf. on Management of Data, pages 341--350, San Diego, CA, 1992.
....numbers per bucket by taking advantage of the one dimensional order of the cells, as shown in Figure 1. 4.1.2. Random Sampling Several random sampling techniques, in which a large set of data is represented by a smaller random sample of the data, have been developed for database optimization [8, 9, 15, 14]. We can approximate the raw data cube by taking a random sample of a certain size from the nonzero cells of the raw data cube. When a range sum query is presented in the online phase, the query is evaluated against the sample, and the approximate answer is extrapolated in the obvious way: If the ....
P. Haas and A. Swami. Sequential sampling procedures for query size estimation. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, 1992.
....more efficient due to reduced seek time of sequential vs. random disk reads. While such efficiencies may be insignificant for hashed files, they are potentially significant (e.g. a factor of 3-4) for B-tree files. In a subsequent paper, Haas & Swami [HS92a, HS92b] developed improved stopping rules for sequential sampling of selectivity estimation. Haas & Swami first observed that Lipton et al. were using a priori bounds for the mean and variance of the population in their stopping rule. Haas & Swami therefore suggested estimating the mean and variance for ....
Peter J. Haas and Arun N. Swami. Sequential Sampling Procedures for Query Size Estimation. In ACM SIGMOD International Conference on the Management of Data, pages 341--350, June 1992.
....may be more efficient due to reduced seek time of sequential vs. random disk reads. While such efficiencies may be insignificant for hashed files, they are potentially significant (e.g. a factor of 3-4) for B-tree files. In a subsequent paper, Haas & Swami [HS92a, HS92b] developed improved stopping rules for sequential sampling of selectivity estimation. Haas & Swami first observed that Lipton et al. were using a priori bounds for the mean and variance of the population in their stopping rule. Haas & Swami therefore suggested estimating the mean and ....
Peter J. Haas and Arun N. Swami. Sequential Sampling Procedures for Query Size Estimation. Technical Report RJ 8558, IBM Almaden, January 1992.
....the entire spectrum of possible algorithms alluded to above. 1.2. Related Work The idea of sampling from base relations in order to quickly estimate the answer to a COUNT query goes back to the work of Hou et al. [HOT88, HOT89]. This topic also has been treated in [GGMS96, HNSS96, HNS94, HS92, HS95, HOD91, LN90, LNS90, LNSS93], usually under the assumption that there exists an index on one or more of the base relations. [Figure 2: The elements of R × S that have been seen after n sampling steps of a square ripple join (n = 1, 2, 3, 4).] Techniques that are applicable to other ....
P. J. Haas and A. N. Swami. Sequential sampling procedures for query size estimation. In Proc. 1992 ACM SIGMOD Intl. Conf. Management of Data, pages 1--11. ACM Press, 1992.
....lead to inaccurate cost estimates, which in turn can cause the optimizer to select an expensive query execution plan. In an effort to avoid these problems, a number of researchers have considered approaches in which selectivities and costs are estimated directly from a sample; see, for example, [GGMS96, HNSS96, HS92, HS95, HOD91, LNS90, LNSS93, NS90]. Several authors have outlined complete sampling based approaches to query optimization [Ant93a, SBM93, Wil91]. • Parallel processing of queries: Balancing the workload between processors is a critical objective of any parallel query processing algorithm. Typically, records are assigned to ....
....from (28) that a sample size of 1060 records is sufficient to ensure that, with probability at least 99%, μ̂_n(f) estimates μ(f) to within 10%. In practice, either two-phase or sequential procedures can be used to estimate σ²(f) and control the sample size automatically; see, for example, [HS92, HOD91]. Similarly, a priori bounds on the function f often are available in practice, so that (28) can be used to determine the required sample size. The above calculations also can be turned around to yield estimates of the precision of μ̂_n for a specified sample size n. For example, fix n (with n ....
P. J. Haas and A. N. Swami. Sequential sampling procedures for query size estimation. In Proc. 1992 ACM SIGMOD Intl. Conf. Management of Data, pages 1--11. ACM Press, 1992.
No context found.
P. J. Haas and A. N. Swami. Sequential sampling procedures for query size estimation. In Proc. of SIGMOD Conf., pages 341--350, 1992.
No context found.
P.J. Haas, A.N. Swami. Sequential Sampling Procedures for Query Size Estimation. In Proc. of the 1992.
No context found.
P. J. Haas and A. N. Swami. Sequential sampling procedures for query size estimation. In Proceedings of VLDB, pages 341-- 50, 1992.
No context found.
Haas, P. J. and Swami, A. N. (1992b). Sequential sampling procedures for query size estimation, ACM SIGMOD International Conference on the Management of Data, pp. 341--350.
No context found.
Haas, P. J. and Swami, A. N. (1992a). Sequential sampling procedures for query size estimation, Technical Report RJ 8558, IBM Almaden.