Using loglinear models to compress datacube (2000)

by Xintao Wu
In Web-Age Information Management
http://www.ise.gmu.edu/~dbarbara/755/loglinear.ps

Abstract:

A data cube is a popular organization for summary data. A cube is simply a multidimensional structure that contains in each cell an aggregate value, i.e., the result of applying an aggregate function to an underlying relation. In practical situations, cubes can require a large amount of storage, so compressing them is of practical importance. In this paper, we propose an approximation technique that reduces the storage cost of the cube at the price of getting approximate answers to the queries posed against the cube. The idea is to characterize regions of the cube by using statistical models whose description takes less space than the data itself. The model parameters can then be used to estimate the cube cells with a certain level of accuracy. To increase the accuracy, some of the "outliers," i.e., the cells that incur the largest errors when estimated, are retained. The model parameters and the retained cells should, of course, occupy only a fraction of the space of the full cube, and the estimation procedure should be faster than computing the data from the underlying relations. We use loglinear models to model the cube regions. Experiments show that the errors introduced in typical queries are small even when the description is substantially smaller than the full cube. Since cubes are used to support data analysis, and analysts are rarely interested in the precise values of the aggregates (but rather in trends), providing approximate answers is, in most cases, a satisfactory compromise. Although other techniques have been used to compress datacubes, ours has the advantage of using parametric (loglinear) models, which also offer information about the underlying structure of the data they model. Moreover, these models are relatively easy to update dynamically as data is added to the warehouse.
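
To make the idea in the abstract concrete, here is a minimal sketch in Python. It is not the paper's algorithm. It assumes a two-dimensional region of non-negative counts with a nonzero total, and it uses the simplest loglinear model, the independence model log m_ij = u + a_i + b_j, whose maximum-likelihood estimate is (row total × column total) / grand total. The function names `compress_region` and `estimate_cell` and the parameter `n_outliers` are hypothetical, chosen only for illustration.

```python
import numpy as np


def compress_region(region, n_outliers):
    """Summarize a 2-D region of counts with an independence loglinear model.

    Assumption (not from the paper): the region is 2-D, non-negative, and has
    a nonzero total. The model is fully described by the row and column
    margins plus the grand total; the n_outliers cells with the largest
    absolute estimation error are retained exactly.
    """
    region = np.asarray(region, dtype=float)
    row = region.sum(axis=1)
    col = region.sum(axis=0)
    total = region.sum()

    # Fitted values under independence: m_ij = row_i * col_j / total.
    fitted = np.outer(row, col) / total

    # Rank cells by absolute error, largest first, and keep the top ones.
    errors = np.abs(region - fitted)
    worst = np.argsort(errors, axis=None)[::-1][:n_outliers]
    outliers = {}
    for flat_index in worst:
        i, j = np.unravel_index(flat_index, region.shape)
        outliers[(int(i), int(j))] = float(region[i, j])

    return row, col, total, outliers


def estimate_cell(model, i, j):
    """Return the retained value if (i, j) is an outlier, else the model estimate."""
    row, col, total, outliers = model
    if (i, j) in outliers:
        return outliers[(i, j)]
    return float(row[i] * col[j] / total)


if __name__ == "__main__":
    cube_region = [[10, 20, 30], [20, 40, 60], [30, 60, 200]]
    model = compress_region(cube_region, n_outliers=1)
    print(estimate_cell(model, 2, 2))  # retained outlier, answered exactly
    print(estimate_cell(model, 0, 0))  # approximate answer from the margins
```

For an r × c region, this sketch stores r + c + 1 parameters plus the retained cells instead of r · c values, which is where the space saving comes from. Richer loglinear models with interaction terms would trade more parameters for smaller errors.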