by Paul Komarek, Remi Munos, Kary Myers, Dan Pelleg
Download: http://www.ri.cmu.edu/pub_files/pub2/moore_andrew_1999_1/moore_andrew_1999_1.ps.gz
Abstract:
There are many massive databases in industry and science. There are also many ways in which decision makers, scientists, and the public need to interact with these data sources. Wide-ranging statistics and machine learning algorithms similarly need to query databases, sometimes millions of times for a single inference. With millions or billions of records (e.g. biotechnology databases, inventory management systems, astrophysics sky surveys, corporate sales information, science lab data repositories), this can be intractable using current algorithms. The Auton Lab (at Carnegie Mellon University) and Schenley Park Research Inc. (a startup company), both jointly run by Andrew Moore and Jeff Schneider, are concerned with the fundamental computer science of making very advanced data analysis techniques computationally feasible for massive datasets.
The computational challenge: How can huge data sources (gigabytes up to terabytes) be analyzed automatically? There is no off-the-shelf technology for this. There are devastating computational and statistical difficulties; manual analysis of such data sources is now passing from being simply tedious into a new, fundamentally impossible realm where the data sources are just too large for humans to assimilate. This
Citations
1651 | R-trees: A dynamic index structure for spatial searching – Guttman, 1984
502 | Data cube: A relational aggregation operator generalizing group-by, and sub-totals – Gray, Bosworth, et al., 1996
410 | An algorithm for finding best matches in logarithmic expected time – Friedman, Bentley, et al., 1977
377 | Implementing Data Cubes Efficiently – Harinarayan, Rajaraman, et al., 1996
347 | Fast Discovery of Association Rules – Agrawal, 1995
105 | Balancing histogram optimality and practicality for query result size estimation – Ioannidis, Poosala, 1995
99 | Locally weighted learning for control – Atkeson, Moore, et al., 1996
96 | Cached sufficient statistics for efficient machine learning with large datasets – Moore, Lee, 1998
90 | Sampling-based estimation of the number of distinct values of an attribute – Haas, Naughton, et al., 1995
82 | Multiple uses of frequent sets and condensed representations – Mannila, Toivonen, 1996
76 | Efficient Algorithms with Neural Network Behaviour – Omohundro, 1987
74 | Efficient locally weighted polynomial regression predictions – Moore, Schneider, et al., 1997
73 | Multidimensional divide and conquer – Bentley, 1980
56 | Multiresolution Instance-Based Learning – Deng, Moore, 1995
39 | Acquisition of Dynamic Control Knowledge for a Robotic Manipulator – Moore, 1990
15 | Very fast mixture-model-based clustering using multiresolution kd-trees – Moore, 1999
13 | SIPping from the data firehose – John, Lent, 1997
8 | AD-trees for fast counting and rule learning – Moore, 1998
3 | A sampling algorithm for estimating join size – Ganguly, Gibbons, et al., 1996