External memory algorithms and data structures: dealing with massive data

by J S Vitter
Venue: ACM Comput. Surv.
Results 1 - 10 of 350

Models and issues in data stream systems

by Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Jennifer Widom - IN PODS , 2002
"... In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work releva ..."
Abstract - Cited by 786 (19 self) - Add to MetaCart
In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.

Citation Context

...ts Since data streams are potentially unbounded in size, the amount of storage required to compute an exact answer to a data stream query may also grow without bound. While external memory algorithms [91] for handling data sets larger than main memory have been studied, such algorithms are not well suited to data stream applications since they do not support continuous queries and are typically too sl...
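
The trade-off described in this citation context, that an exact answer over an unbounded stream may need unbounded state, is usually met by keeping a small, fixed-size summary instead. As an illustration only (not taken from this paper), the sketch below maintains a uniform random sample of a stream in O(k) memory via reservoir sampling; the stream and the sample size k are hypothetical parameters.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown,
    possibly unbounded length, using only O(k) memory."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)        # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Any aggregate estimated from the reservoir uses bounded memory no matter how long the stream runs, which is the kind of bounded-state behavior a continuous query requires.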

Data Streams: Algorithms and Applications

by S. Muthukrishnan , 2005
"... In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerg ..."
Abstract - Cited by 533 (22 self) - Add to MetaCart
In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].

Citation Context

... uniquely challenge the TCS needs. We in the computer science community have traditionally focused on scaling in size: how to efficiently manipulate large disk-bound data via suitable data structures [213], how to scale to databases of petabytes [114], synthesize massive data sets [115], etc. However, far less attention has been given to benchmarking, studying performance of systems under rapid updates...
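
One concrete example of the small-space summaries covered by this survey literature is a sketch that estimates item frequencies in space sublinear in the number of distinct items. The snippet below is a minimal Count-Min-style sketch; the hash construction (Python's built-in hash with per-row salts) is a stand-in for the pairwise-independent hash families the formal analysis assumes, and the width/depth values are up to the caller.

```python
import random

class CountMinSketch:
    """Approximate frequency counts over a stream in O(width * depth) space."""
    def __init__(self, width, depth, seed=0):
        rnd = random.Random(seed)
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]
        # One salt per row; hash((salt, item)) stands in for a pairwise-independent hash.
        self.salts = [rnd.getrandbits(64) for _ in range(depth)]

    def _index(self, row, item):
        return hash((self.salts[row], item)) % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._index(r, item)] += count

    def estimate(self, item):
        # Never undercounts; each row overcounts by at most (total count / width)
        # in expectation, and taking the minimum over rows makes large errors unlikely.
        return min(self.table[r][self._index(r, item)] for r in range(self.depth))
```

The sketch processes each item in one pass with a constant number of updates, matching the "few passes, sublinear space" regime the abstract describes.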

Approximate distance oracles

by Mikkel Thorup, Uri Zwick , 2004
"... Let G = (V, E) be an undirected weighted graph with |V | = n and |E | = m. Let k ≥ 1 be an integer. We show that G = (V, E) can be preprocessed in O(kmn 1/k) expected time, constructing a data structure of size O(kn 1+1/k), such that any subsequent distance query can be answered, approximately, in ..."
Abstract - Cited by 273 (9 self) - Add to MetaCart
Let G = (V, E) be an undirected weighted graph with |V| = n and |E| = m. Let k ≥ 1 be an integer. We show that G = (V, E) can be preprocessed in O(kmn^{1/k}) expected time, constructing a data structure of size O(kn^{1+1/k}), such that any subsequent distance query can be answered, approximately, in O(k) time. The approximate distance returned is of stretch at most 2k − 1, i.e., the quotient obtained by dividing the estimated distance by the actual distance lies between 1 and 2k − 1. A 1963 girth conjecture of Erdős implies that Ω(n^{1+1/k}) space is needed in the worst case for any real stretch strictly smaller than 2k + 1. The space requirement of our algorithm is, therefore, essentially optimal. The most impressive feature of our data structure is its constant query time, hence the name “oracle”. Previously, data structures that used only O(n^{1+1/k}) space had a query time of Ω(n^{1/k}). Our algorithms are extremely simple and easy to implement efficiently. They also provide faster constructions of sparse spanners of weighted graphs, and improved tree covers and distance labelings of weighted or unweighted graphs.
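
The Thorup-Zwick construction itself (sampling a hierarchy of k vertex sets and building bunches and clusters) does not fit a short sketch, but the query pattern, precompute a structure once and then answer each distance query in time independent of the graph size, can be illustrated with a much simpler landmark scheme. The code below is not their algorithm: it gives only a triangle-inequality upper bound on the true distance, with no 2k − 1 stretch guarantee; the adjacency-list format, landmark count, and names are assumptions for the example, and the graph is taken to be undirected as in the abstract.

```python
import heapq, random

def dijkstra(adj, src):
    """Single-source shortest paths; adj[u] is a list of (neighbor, weight)."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

class LandmarkOracle:
    """Precompute distances from a few random landmarks; answer d(u, v) queries
    with the upper bound min over landmarks L of d(u, L) + d(L, v).
    A heuristic stand-in, not the Thorup-Zwick oracle."""
    def __init__(self, adj, num_landmarks=4, seed=0):
        rng = random.Random(seed)
        nodes = list(adj)
        self.tables = [dijkstra(adj, rng.choice(nodes)) for _ in range(num_landmarks)]

    def query(self, u, v):
        inf = float("inf")
        return min(t.get(u, inf) + t.get(v, inf) for t in self.tables)
```

Each query costs O(number of landmarks), mirroring the constant-time query interface of a true distance oracle, but without its space and stretch optimality.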

Wavelet-Based Histograms for Selectivity Estimation

by Yossi Matias, Jeffrey Scott Vitter, Min Wang
"... Query optimization is an integral part of relational database management systems. One important task in query optimization is selectivity estimation, that is, given a query P, we need to estimate the fraction of records in the database that satisfy P. Many commercial database systems maintain histog ..."
Abstract - Cited by 245 (16 self) - Add to MetaCart
Query optimization is an integral part of relational database management systems. One important task in query optimization is selectivity estimation, that is, given a query P, we need to estimate the fraction of records in the database that satisfy P. Many commercial database systems maintain histograms to approximate the frequency distribution of values in the attributes of relations. In this paper, we present a technique based upon a multiresolution wavelet decomposition for building histograms on the underlying data distributions, with applications to databases, statistics, and simulation. Histograms built on the cumulative data values give very good approximations with limited space usage. We give fast algorithms for constructing histograms and using
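
A minimal version of the core idea, decompose a frequency vector with the Haar wavelet, keep only the largest coefficients, and answer selectivity queries from the reconstruction, is sketched below. The paper's actual method builds histograms on the cumulative data values and adds further algorithmic machinery; the function names and the power-of-two length requirement here are simplifications for illustration.

```python
def haar_decompose(freqs):
    """Unnormalized Haar decomposition of a frequency vector whose length is a
    power of two: repeatedly replace adjacent pairs by their average and half-difference."""
    data, details = list(freqs), []
    while len(data) > 1:
        avgs, diffs = [], []
        for i in range(0, len(data), 2):
            avgs.append((data[i] + data[i + 1]) / 2.0)
            diffs.append((data[i] - data[i + 1]) / 2.0)
        details = diffs + details          # coarser detail coefficients end up first
        data = avgs
    return data + details                  # [overall average] + detail coefficients

def haar_reconstruct(coeffs):
    """Inverse of haar_decompose."""
    data, details = coeffs[:1], coeffs[1:]
    while details:
        d, details = details[:len(data)], details[len(data):]
        data = [v for a, diff in zip(data, d) for v in (a + diff, a - diff)]
    return data

def selectivity(freqs, lo, hi, keep):
    """Estimate the fraction of records with attribute value in [lo, hi) from a
    histogram that retains only the `keep` largest Haar coefficients."""
    coeffs = haar_decompose(freqs)
    top = set(sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)[:keep])
    approx = haar_reconstruct([c if i in top else 0.0 for i, c in enumerate(coeffs)])
    return sum(approx[lo:hi]) / sum(freqs)
```

Keeping only the few coefficients of largest magnitude is what gives the "good approximations with limited space usage" the abstract refers to.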

Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets

by Jeffrey Scott Vitter, Min Wang
"... Computing multidimensional aggregates in high dimensions is a performance bottleneck for many OLAP applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment. It is advantageous to have fast, a ..."
Abstract - Cited by 198 (3 self) - Add to MetaCart
Computing multidimensional aggregates in high dimensions is a performance bottleneck for many OLAP applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment. It is advantageous to have fast, approximate answers to OLAP aggregation queries. In this paper, we present a novel method that provides approximate answers to high-dimensional OLAP aggregation queries in massive sparse data sets in a time-efficient and space-efficient manner. We construct a compact data cube, which is an approximate and space-efficient representation of the underlying multidimensional array, based upon a multiresolution wavelet decomposition. In the on-line phase, each aggregation query can generally be answered using the compact data cube in one I/O or a small number of I/Os, depending upon the desired accuracy. We present two I/O-efficient algorithms to construct the compact data cube for the important case of sparse high-dimensional arrays, which often arise in practice. The traditional histogram methods are infeasible for the massive high-dimensional data sets in OLAP applications. Previously developed wavelet techniques are efficient only for dense data. Our on-line query processing algorithm is very fast and capable of refining answers as the user demands more accuracy. Experiments on real data show that our method provides significantly more accurate results for typical OLAP aggregation queries than other efficient approximation techniques such as random sampling.

Citation Context

...eded. However, it may no longer be desirable to do the transposition via the distribution approach of (5); instead we can do the transposition by sorting, which uses O((N_z/B) log_{M/B}(N_z/B)) I/Os. (See [Vit99] for a proof in the I/O model that transposition is equivalent to sorting.) If all the processed hyperplanes individually fit into internal memory, the resulting I/O bound for Algorithm I will be O i ...
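
The transposition cost quoted in this citation context is just the standard external-memory sorting bound with N_z nonzero elements, an M-item main memory, and B-item disk blocks. A small helper to evaluate its leading term, with illustrative parameter values that are not taken from the paper:

```python
from math import log

def io_sort_cost(n, m, b):
    """Leading term of the external-memory sorting bound,
    sort(N) ~ (N/B) * log_{M/B}(N/B) block transfers,
    for N items, an M-item main memory, and B-item disk blocks."""
    return (n / b) * log(n / b, m / b)

# Assumed example numbers: 10^9 items, 10^6-item memory, 10^3-item blocks
# -> about 2e6 block transfers, i.e. roughly two merge passes over 10^6 blocks.
print(io_sort_cost(1e9, 1e6, 1e3))
```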

GPUTeraSort: High Performance Graphics Coprocessor Sorting for Large Database Management

by Naga K. Govindaraju, Jim Gray, Ritesh Kumar, Dinesh Manocha - SIGMOD , 2006
"... We present a new algorithm, GPUTeraSort, to sort billio-nrecord wide-key databases using a graphics processing unit (GPU) Our algorithm uses the data and task parallelism on the GPU to perform memory-intensive and compute-intensive tasks while the CPU is used to perform I/O and resource management. ..."
Abstract - Cited by 153 (8 self) - Add to MetaCart
We present a new algorithm, GPUTeraSort, to sort billion-record wide-key databases using a graphics processing unit (GPU). Our algorithm uses the data and task parallelism on the GPU to perform memory-intensive and compute-intensive tasks while the CPU is used to perform I/O and resource management. We therefore exploit both the high-bandwidth GPU memory interface and the lower-bandwidth CPU main memory interface and achieve higher memory bandwidth than purely CPU-based algorithms. GPUTeraSort is a two-phase task pipeline: (1) read disk, build keys, sort using the GPU, generate runs, write disk, and (2) read, merge, write. It also pipelines disk transfers and achieves near-peak I/O performance. We have tested the performance of GPUTeraSort on billion-record files using the standard Sort benchmark. In practice, a 3 GHz Pentium IV PC with a $265 NVIDIA 7800 GT GPU is significantly faster than optimized CPU-based algorithms on much faster processors, sorting 60GB for a penny; the best reported PennySort price-performance. These results suggest that a GPU co-processor can significantly improve performance on large data processing tasks.

Citation Context

... a set of files; the second phase processes these files to produce a totally ordered permutation of the input data file. External memory sorting algorithms can be classified into two broad categories [42]: • Distribution-Based Sorting: The first phase partitions the input data file using (S-1) partition keys and generates S disjoint buckets such that the elements in one bucket precede the elements in ...
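
The merge-based category in the classification above is the same two-phase pattern GPUTeraSort's pipeline follows: generate sorted runs, then merge them. Below is a minimal CPU-only sketch; the line-per-record text format, temporary-file handling, and run_size parameter are assumptions for the example, whereas the real system sorts wide-key binary records and overlaps disk I/O with GPU work.

```python
import heapq, os, tempfile

def external_sort(lines, run_size):
    """Phase 1: cut the input into sorted runs that fit in memory and spill each
    run to a temporary file. Phase 2: stream a k-way merge over the runs."""
    run_paths, buf = [], []

    def flush():
        if not buf:
            return
        buf.sort()
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(buf)
        run_paths.append(path)
        buf.clear()

    for line in lines:
        buf.append(line if line.endswith("\n") else line + "\n")
        if len(buf) >= run_size:
            flush()
    flush()

    files = [open(p) for p in run_paths]
    try:
        yield from heapq.merge(*files)   # pulls one line at a time from each sorted run
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

A distribution-based sort swaps the two phases' roles: it partitions first into key-disjoint buckets and then sorts each bucket, so no final merge is needed.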

Representing and querying correlated tuples in probabilistic databases

by Prithviraj Sen, Amol Deshpande - In ICDE , 2007
"... Probabilistic databases have received considerable attention recently due to the need for storing uncertain data produced by many real world applications. The widespread use of probabilistic databases is hampered by two limitations: (1) current probabilistic databases make simplistic assumptions abo ..."
Abstract - Cited by 142 (11 self) - Add to MetaCart
Probabilistic databases have received considerable attention recently due to the need for storing uncertain data produced by many real world applications. The widespread use of probabilistic databases is hampered by two limitations: (1) current probabilistic databases make simplistic assumptions about the data (e.g., complete independence among tuples) that make it difficult to use them in applications that naturally produce correlated data, and (2) most probabilistic databases can only answer a restricted subset of the queries that can be expressed using traditional query languages. We address both these limitations by proposing a framework that can represent not only probabilistic tuples, but also correlations that may be present among them. Our proposed framework naturally lends itself to the possible world semantics thus preserving the precise query semantics extant in current probabilistic databases. We develop an efficient strategy for query evaluation over such probabilistic databases by casting the query processing problem as an inference problem in an appropriately constructed probabilistic graphical model. We present several optimizations specific to probabilistic databases that enable efficient query evaluation. We validate our approach by presenting an experimental evaluation that illustrates the effectiveness of our techniques at answering various queries using real and synthetic datasets.

Citation Context

...cted from the various partitions and each reference is connected to all other references within the same partition via edges. As part of our future work we aim to use external memory graph algorithms [36] for these tasks. When exact probabilistic inference turns out to be too expensive we have the flexibility of switching to approximate inference techniques depending on the user’s requirements. Just l...
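
The possible-worlds semantics the abstract refers to is easy to state in code for the simple tuple-independence case that the paper generalizes: every subset of the tuples is a world, its weight is the product of per-tuple inclusion probabilities, and a query's probability is the total weight of the worlds in which it holds. The brute-force enumeration below is exponential and purely illustrative; the tuple values and the example query are made up.

```python
from itertools import product

def query_probability(tuples, query):
    """tuples: list of (tuple_value, probability) under tuple independence.
    query: predicate over the set of present tuples.
    Returns the total probability mass of worlds where the query holds."""
    prob = 0.0
    for world in product([False, True], repeat=len(tuples)):
        weight, present = 1.0, []
        for (t, p), included in zip(tuples, world):
            weight *= p if included else (1.0 - p)
            if included:
                present.append(t)
        if query(present):
            prob += weight
    return prob

# Example: probability that at least one tuple with value 'a' is present.
tuples = [(("r1", "a"), 0.6), (("r2", "a"), 0.5), (("r3", "b"), 0.9)]
print(query_probability(tuples, lambda world: any(v == "a" for _, v in world)))
# 1 - 0.4 * 0.5 = 0.8
```

Representing correlations among tuples, as the paper proposes, amounts to replacing the independent per-tuple factors with a joint distribution encoded by a probabilistic graphical model, at which point query evaluation becomes an inference problem.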

Terrain simplification simplified: a general framework for view-dependent out-of-core visualization.

by P Lindstrom, V Pascucci - IEEE Transactions on Visualization and Computer Graphics, 2002
"... ..."
Abstract - Cited by 105 (2 self) - Add to MetaCart
Abstract not found

Visualization of Large Terrains Made Easy

by Peter Lindstrom, Valerio Pascucci - In Proceedings of IEEE Visualization 2001
"... We present an elegant and simple to implement framework for per-forming out-of-core visualization and view-dependent refinement of large terrain surfaces. Contrary to the recent trend of increas-ingly elaborate algorithms for large-scale terrain visualization, our algorithms and data structures have ..."
Abstract - Cited by 87 (5 self) - Add to MetaCart
We present an elegant and simple to implement framework for performing out-of-core visualization and view-dependent refinement of large terrain surfaces. Contrary to the recent trend of increasingly elaborate algorithms for large-scale terrain visualization, our algorithms and data structures have been designed with the primary goal of simplicity and efficiency of implementation. Our approach to managing large terrain data also departs from more conventional strategies based on data tiling. Rather than emphasizing how to segment and efficiently bring data in and out of memory, we focus on the manner in which the data is laid out to achieve good memory coherency for data accesses made in a top-down (coarse-to-fine) refinement of the terrain. We present and compare the results of using several different data indexing schemes, and propose a simple to compute index that yields substantial improvements in locality and speed over more commonly used data layouts. Our second contribution is a new and simple, yet easy to generalize method for view-dependent refinement. Similar to several published methods in this area, we use longest edge bisection in a top-down traversal of the mesh hierarchy to produce a continuous surface with subdivision connectivity. In tandem with the refinement, we perform view frustum culling and triangle stripping. These three components are done together in a single pass over the mesh. We show how this framework supports virtually any error metric, while still being highly memory and compute efficient.

Citation Context

...ries, including accuracy, mesh complexity, memory usage, refinement speed, generality, and, most importantly, ease of implementation. 2.2 Out-of-Core Paging and Data Layout External memory algorithms [26], also known as out-of-core algorithms, address issues related to the hierarchical nature of the memory structure of modern computers (fast cache, main memory, hard disk, etc.). Managing and making th...
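
A common example of the kind of coherent data layout discussed in this entry is a Z-order (Morton) curve: interleaving the bits of a sample's grid coordinates gives a file offset under which spatially nearby samples tend to stay close together on disk. This is only one of the "more commonly used data layouts" the paper compares against, not the index it proposes; the grid size and bit width below are arbitrary.

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of integer grid coordinates (x, y) into a Z-order index."""
    idx = 0
    for b in range(bits):
        idx |= ((x >> b) & 1) << (2 * b)        # x bits go to even positions
        idx |= ((y >> b) & 1) << (2 * b + 1)    # y bits go to odd positions
    return idx

# Terrain samples written to disk in this order keep spatial neighbors close,
# which improves locality for top-down (coarse-to-fine) refinement traversals.
grid = sorted(((x, y) for x in range(4) for y in range(4)),
              key=lambda p: morton_index(*p))
```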

On Two-Dimensional Indexability and Optimal Range Search Indexing (Extended Abstract)

by Lars Arge, Vasilis Samoladas, Jeffrey Scott Vitter , 1999
"... Lars Arge Vasilis Samoladas y Jeffrey Scott Vitter z Abstract In this paper we settle several longstanding open problems in theory of indexability and external orthogonal range searching. In the first part of the paper, we apply the theory of indexability to the problem of two-dimensional rang ..."
Abstract - Cited by 86 (25 self) - Add to MetaCart
In this paper we settle several longstanding open problems in the theory of indexability and external orthogonal range searching. In the first part of the paper, we apply the theory of indexability to the problem of two-dimensional range searching. We show that the special case of 3-sided querying can be solved with constant redundancy and access overhead. From this, we derive indexing schemes for general 4-sided range queries that exhibit an optimal tradeoff between redundancy and access overhead.
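
To make the "3-sided query" concrete, report every point with x1 ≤ x ≤ x2 and y ≤ y_max, here is a small in-memory priority search tree, the classical internal-memory structure for this query type. The paper's contribution concerns external-memory indexing schemes and redundancy/access-overhead tradeoffs, which this sketch does not attempt to reproduce; the class and function names are illustrative.

```python
class PSTNode:
    def __init__(self, point, split_x, left, right):
        self.point = point        # point with minimum y in this subtree
        self.split_x = split_x    # maximum x stored in the left subtree
        self.left = left
        self.right = right

def build(points):
    """points must be sorted by x; distinct x-coordinates keep the sketch simple."""
    if not points:
        return None
    i_min = min(range(len(points)), key=lambda i: points[i][1])
    rest = points[:i_min] + points[i_min + 1:]
    mid = len(rest) // 2
    split_x = rest[mid - 1][0] if mid > 0 else float("-inf")
    return PSTNode(points[i_min], split_x, build(rest[:mid]), build(rest[mid:]))

def query(node, x1, x2, y_max, out):
    """Append every stored point with x1 <= x <= x2 and y <= y_max to out."""
    if node is None or node.point[1] > y_max:
        return                    # heap property on y: the whole subtree fails y <= y_max
    x, y = node.point
    if x1 <= x <= x2:
        out.append(node.point)
    if x1 <= node.split_x:
        query(node.left, x1, x2, y_max, out)
    if x2 >= node.split_x:
        query(node.right, x1, x2, y_max, out)

# Usage:
pts = sorted([(1, 5), (2, 1), (4, 7), (6, 2), (9, 4)])
root = build(pts)
hits = []
query(root, 2, 8, 4, hits)        # x in [2, 8], y <= 4  ->  [(2, 1), (6, 2)]
```

An external-memory analogue must additionally pack points into disk blocks so that any query touches only O(output/B) blocks beyond a logarithmic search cost, which is exactly the redundancy and access-overhead question the paper studies.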