| F. Olken. Random Sampling from Databases. PhD thesis, Berkeley, 1997. |
....have a long history; see [BDF+97] for a recent survey. [GM99] presented a formal framework for evaluating such sublinear space synopsis data structures, and a survey of some of the results in this area. There has been a flurry of recent work in approximate query answering (e.g., [VL93, Olk93, BDF+97, HHW97, GM98, AGPR99, HH99, VW99, IP99, AGP00, GLR00, CCMN00, CGRS00, MVW00, CDN01, LM01, Gib01, GKS01]). The work in [HHW97, AGPR99, HH99, IP99, CGRS00] looked at the problem of providing approximate answers to queries seeking aggregates (e.g., count, sum, avg) of attribute values for ....
F. Olken. Random Sampling from Databases. PhD thesis, Computer Science, U.C. Berkeley, April 1993.
....[Figure 1. RGIS Structure.] into database systems (Olken's dissertation [22] is a good introduction to this), reducing cardinality of results [4], and incremental queries [29], current database systems do not support these features. RGIS builds its random sampling on top of unmodified ordinary database systems using query rewriting, schema extensions, indices, and ....
OLKEN, F. Random Sampling from Databases. PhD thesis, University of California, Berkeley, 1993.
....the supports computed in Step 2, obtain a reduced sample S0 from S by trimming away outlier transactions as described in Section 3.3 below. 4. Run a standard association rule algorithm against S0 with minimum support p and confidence c to obtain the final set of association rules. Olken [19] provides a review of techniques that can be used in Step 1 to obtain a random sample of transaction records. In general, the cost of obtaining a sample depends upon how the data is stored. In our implementation of fast trim, the transaction data is stored in a flat file and we use a sampling ....
F. Olken. Random Sampling from Databases. Ph.D. Dissertation, University of California, Berkeley, CA, 1993. Available as Tech. Report LBL-32883, Lawrence Berkeley Laboratories, Berkeley, CA.
....other receiving, then this is termed one-way communication complexity. In a one-way protocol, it is critical to specify which player is the transmitter and which the receiver. Only the receiver needs to be able to compute f. 1.4 Related previous work. Estimation of order statistics and outliers [ARS97, AS95, JC85, RML97, Olk93] has received much attention in the context of sorting [DNS91], selectivity estimation [PIHS96], query optimization [SALP79], and in providing online user feedback [Hel]. The survey by Yannakakis [Yan90] is a comprehensive account of graph theoretic methods in database theory. Classical work on ....
....lower and upper bounds when multiple passes can be performed over the data. They might also imply interesting new results about communication complexity. From a practical perspective, algorithms are needed for a wider class of problems than the selection problem that has been extensively studied [ARS97, AS95, JC85, RML97, Olk93]. 2. Can we design algorithms that minimize the number of passes performed over the data given the amount of memory available? This would be useful when, for instance, the number of active concurrent threads governs the memory available at runtime. 3. How can we arrange the data physically in a ....
F. Olken. Random Sampling from Databases. PhD thesis, University of California Berkeley, 1993.
.... While work has been done in the database community on returning partial results, particularly in extending SQL to enable query writers to explicitly limit the cardinality of a result [6], to get some answers quickly and perhaps more later [42], and to do statistical random sampling of databases [31], we are sure that by focusing on the specific needs of GIS systems and users, we can produce specific solutions that provide better performance and more flexibility with less overhead. It is important to note that we are not promising a solution applicable to database query processing in ....
OLKEN, F. Random Sampling from Databases. PhD thesis, University of California, Berkeley, 1993.
....the second approach requires only projection. Two other approaches directly supported in FRAQL are sampling and discretization. Sampling means to obtain a representative subset of data. Various kinds of sampling were developed recently. Currently, in FRAQL only random sampling as presented in [29, 22] is implemented. A sample of a relation or query result with the given size is derived with the help of the LIMIT SAMPLE clause: limit sample 30 percent; Obviously, the sampled data should be stored in a new relation for further efficient processing. Sampling reduces the number of tuples by ....
F. Olken. Random Sampling from Databases. PhD thesis, UC Berkeley, April 1993.
....[11] by using a randomized algorithm. This algorithm is applicable independent of the sort and source parameter choice. It is based on the iterative improvement technique, which has been proposed earlier in the context of search studies for query optimization [13, 5]. The authors of [10] and [9] describe different kinds of uniform random sampling techniques in a DBMS. They discuss several techniques for uniform random sampling on relations or on output of relational operators. One problem concerning sampling in relational DBMS is the placement of the sampling operation in the access ....
Frank Olken. Random Sampling from Databases. PhD thesis, UC Berkeley, 1993.
....hence we opt to employ existing techniques. To access data randomly from a relation, we can employ the heap scan for heap files [7], index scan when there is no correlation between the attributes being aggregated and the indexed attribute [7], and the pseudorandom sampling schemes for B trees [13]. For join queries, the ripple join algorithms [5] can be applied. 2. For the result (of the full query) to be useful, the estimate for the aggregate has to be meaningful. The proximity of the running aggregate to the actual value can be expressed in terms of a running confidence interval. The ....
....these work focused on non-nested queries. To facilitate online aggregation, it is important that records be accessed in random order and that the running aggregate be computed meaningfully for it to be useful. Sampling from base relations provides a means of randomly accessing records [8, 9]. In [13], Olken studied the methods to access records randomly from B trees and hash files, and how random samples can be obtained from relational operations and from select-project-join queries. When records are accessed in a random order, the running aggregate can be viewed as a statistical ....
F. Olken. Random Sampling from Databases. PhD thesis, University of California, Berkeley, 1993.
.... notation follows that of Antoshenkov [2]. An easy and intuitive way to construct a hierarchical histogram from a tree index is to augment every non-leaf node entry with a cardinality count (i.e. the total number of leaf records in the specified subtree). Such counts are commonly called ranks [38]. Inserting or deleting a record results in node modifications from leaf to root because any such update changes the cardinality of every subtree containing that record. This is generally considered to be impractical in a production DBMS (though bulk update, common in data warehouses, can reduce ....
....is discussed more thoroughly in Section 3 and Appendix B; again, unlike previous work, the methods described in this paper make the precision-cost tradeoff explicit. Sampling: Database sampling takes many forms and has many applications; for more information, we recommend the surveys by Olken [38] and Haas [7, Sec. 9]. The most specifically relevant index-assisted sampling literature has been summarized in Section 3. 7. Conclusions and future directions. In this paper, we have argued that indexing techniques form the basis for a general and practical approach to selectivity estimation in ....
F. Olken, Random Sampling from Databases, Ph.D. dissertation, Univ. of California, Berkeley, CA, 1993.
....then draw a sample quality curve (relationship between sample size and sample quality) using these (S_i, Q_i) points. The SOSS is estimated using the curve. To be efficient, we can calculate all samples' qualities at the same time in one sequential scan of D by using the idea of Binomial Sampling [10]. That is, upon reading in each instance or data record, a random number x uniformly distributed on [0.0, 1.0) is generated. If x < S_i/N, then corresponding statistics (by counting a categorical value or binning a numerical value) are gathered for the i-th sample. We describe the procedure ....
F. Olken. Random Sampling from Databases. PhD thesis, Department of Computer Science, University of California Berkeley, 1993.
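The one-pass Binomial Sampling step quoted in the excerpt above can be sketched as follows. This is only an illustration under stated assumptions: the acceptance test is taken to be x < S_i/N with N the number of records in D (the comparison is garbled in the extracted text), and the function name `binomial_samples` and its parameters are hypothetical, not taken from the citing paper or from Olken's thesis.

```python
import random


def binomial_samples(records, sample_sizes, seed=None):
    """Draw several samples in one sequential scan (illustrative sketch).

    Assumption: a record joins the i-th sample when x < S_i / N, where
    S_i is the target size of sample i and N is the number of records.
    The realized sample sizes are random and only equal S_i in expectation.
    """
    rng = random.Random(seed)
    n = len(records)
    samples = [[] for _ in sample_sizes]
    for record in records:
        x = rng.random()  # one draw per record, uniform on [0.0, 1.0)
        for i, target in enumerate(sample_sizes):
            if x < target / n:
                samples[i].append(record)
    return samples
```

Because a single x is shared across all target sizes, a record accepted for a smaller target is also accepted for every larger one, so the samples are nested. Calling `binomial_samples(list(range(1000)), [50, 100, 200], seed=1)` returns three lists whose lengths vary around 50, 100, and 200.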
....mining. 4.2.4 Approximate Query Evaluation. Improved performance in answering queries to databases can be achieved by giving approximate answers to them, computed using only a sample of the database. Random sampling is the only sampling mechanism that has been investigated in this context to date [17]. It does not seem likely that a different type of sampling, such as for example Consistent Sampling as defined in Section 2, would provide better results than random sampling under these circumstances. 4.2.5 Data Mining with Consistent Database Sampling. Section 4.2.3 briefly outlined the use of ....
.... in this quadrant (e.g. [15]) are appropriate for initial database requirements analysis where alternative designs must be explored (semantic information would be useful) and therefore several Prototype Databases may need to be generated (need for efficiency). Finally, methods in quadrant (4) (e.g. [17, 13]) would not generally be useful to support information systems development. This quadrant indicates the use of operational data and the inclusion of little semantic information in the Prototype Database being populated. Using operational data results in prototype databases very similar to the ....
F. Olken. Random Sampling from Databases. PhD thesis, University of California, April 1993.
....in a way that X be a good representative of R. This can be achieved by random selection of tuples from the relation R. There are alternative techniques described in the literature for random selections of tuples from a relation such as heap scan, index scan and an index sampling technique [Olk93, HS95]. There are many issues in obtaining a good random representative, especially when there are index structures on the relation. The details of sampling are beyond the scope of this paper. Scalability: Although we describe our probe queries for joins between two relations (i.e. 2-way join) ....
F. Olken. Random Sampling from Databases. Ph.D. thesis, University of California, Berkeley, 1993.
....investigate how it may be extended to handle other classes of queries. 9 Related Work. While statistical techniques based on samples, histograms, etc. have been applied in databases for a while now, they have been primarily used in selectivity estimation during query optimization [SAC+79, Olk93, PIHS96]. Approximate query answering using sampling has started receiving attention recently [Olk93, HHW97, GM98, AGPR99]. The closest work to ours is the Online Aggregation scheme proposed by Hellerstein et al. [HHW97]. In their approach, the original data is scanned in random order at query ....
....techniques based on samples, histograms, etc. have been applied in databases for a while now, they have been primarily used in selectivity estimation during query optimization [SAC+79, Olk93, PIHS96]. Approximate query answering using sampling has started receiving attention recently [Olk93, HHW97, GM98, AGPR99]. The closest work to ours is the Online Aggregation scheme proposed by Hellerstein et al. [HHW97]. In their approach, the original data is scanned in random order at query time to generate increasingly larger random samples of the data, thus incrementally refining the ....
F. Olken. Random Sampling from Databases. PhD thesis, Computer Science, U.C. Berkeley, April 1993.
....previous section, we described how we maintain an icicle with respect to a changing workload. We now describe the methodology of answering aggregate queries using the icicle. Due to the presence of duplicates and the selection bias (or non-uniformity) in an icicle, traditional estimators (e.g. [Olk93, AGPR99b]) do not apply directly. An example was discussed in Section 2.1 to illustrate this problem. We now derive icicle-based estimators for the average, count, and sum aggregate operators. We first discuss the intuition behind our aggregate estimators, and then formalize them in Lemma 4.1. ....
Frank Olken. Random Sampling from Databases. PhD thesis, University of California at Berkeley, 1993.
....of the data, measured essentially by the function F_2. However, they observe that fairly poor performance is obtained when using the standard statistical estimators, and remark that estimating F_0 via sampling is a hard and relatively unsolved problem. This is consistent with Olken's assertion [Olk93] that all known estimators give large errors on at least some data sets. In a recent paper, Chaudhuri et al. [CMN98] show that large error is unavoidable even for relatively large samples regardless of the estimator used. That is, there does not exist an estimator which can guarantee reasonable ....
F. Olken, Random sampling from databases, Ph.D. thesis, Computer Science, U.C. Berkeley, April 1993.
....of the data, measured essentially by the function F_2. However, they observe that fairly poor performance is obtained when using the standard statistical estimators, and remark that estimating F_0 via sampling is a hard and relatively unsolved problem. This is consistent with Olken's assertion [Olk93] that all known estimators give large errors on at least some data sets. In a recent paper, Chaudhuri et al. [CMN98] show that large error is unavoidable even for relatively large samples regardless of the estimator used. That is, there does not exist an estimator which can guarantee reasonable ....
F. Olken. Random Sampling from Databases. PhD thesis, Computer Science, U.C. Berkeley, April 1993.
....histograms in our work. In this paper, we investigate the above approach to approximate query answering and present two different ways of using histograms for this purpose. Though histograms have been widely used in databases, their usage has been mostly restricted to selectivity estimation [15, 9, 19, 12, 14]. The use of histograms for approximate query answering brings several novel issues to the fore, which form the main focus of this article. Our contributions are summarized below. Efficient Query Execution: We propose storing histograms as regular relations in a relational DBMS and appropriately ....
....Online aggregation [8], described earlier, constitutes another style of sampling-based approximate query answering wherein the answers are continuously refined till the exact answer is computed. Histograms have been studied extensively for application in selectivity estimation in query optimizers [12, 14, 15]. In our earlier work, we have identified several novel classes of histograms to build on one or more attributes [19, 18] and also proposed techniques for their efficient computation [11] and incremental maintenance [5]. We recently extended histograms for selectivity estimation in spatial ....
F. Olken. Random Sampling from Databases. PhD Dissertation, Computer Science, University of California at Berkeley, 1993.
.... between efficiency and accuracy can be achieved by analysing only a sample of the database [6]. Query Estimation: Approximate answers to aggregate queries (e.g. number of tuples satisfying a particular predicate) can be efficiently computed by answering these queries for a sample of the database [9]. Information System Development: Data-intensive applications development requires prototype databases to support several stages of the development process, e.g. validation, testing, users training. In particular, in the context of Legacy Information System Migration [3], a Sample Database has ....
.... in this quadrant (e.g. [8]) are appropriate for initial database requirements analysis where different alternative designs must be explored (which requires information) and therefore several Prototype Databases may need to be generated (need for efficiency). Finally, methods in quadrant (4) (e.g. [9, 6]) would not be useful to support information systems development. This quadrant indicates the use of operational data but the inclusion of little semantic information in the Prototype Database being populated. Methods within this quadrant are, however, the only ones used in applications of ....
F. Olken. Random Sampling from Databases. PhD thesis, University of California, April 1993.
....histograms in our work. In this paper, we investigate the above approach to approximate query answering and present two different ways of using histograms for this purpose. Though histograms have been widely used in databases, their usage has been mostly restricted to selectivity estimation [15, 9, 19, 12, 14]. The use of histograms for approximate query answering brings several novel issues to the fore, which form the main focus of this article. Our contributions are summarized below. Efficient Query Execution: We propose storing histograms as regular relations in a relational DBMS and ....
....Online aggregation [8], described earlier, constitutes another style of sampling-based approximate query answering wherein the answers are continuously refined till the exact answer is computed. Histograms have been studied extensively for application in selectivity estimation in query optimizers [12, 14, 15]. In our earlier work, we have identified several novel classes of histograms to build on one or more attributes [19, 18] and also proposed techniques for their efficient computation [11] and incremental maintenance [5]. We recently extended histograms for selectivity estimation in spatial ....
F. Olken. Random Sampling from Databases. PhD Dissertation, Computer Science, University of California at Berkeley, 1993.
No context found.
F. Olken. Random Sampling from Databases. PhD thesis, Berkeley, 1997.
No context found.
F. Olken. Random Sampling from Databases. In Ph.D. Dissertation, 1993.
No context found.
Frank Olken. Random Sampling from Databases. PhD thesis, University of California at Berkeley, 1993.
No context found.
OLKEN, F. Random Sampling from Databases. PhD thesis, University of California, Berkeley, 1993.
No context found.
F. Olken. Random Sampling from Databases. PhD thesis, Department of Computer Science, U.C., Berkeley, 1993.