Designing Efficient Sorting Algorithms for Manycore GPUs, 2009
Cited by 34 (2 self)
We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, we carefully design our algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. We exploit the high-speed on-chip shared memory provided by NVIDIA’s GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well-suited for other manycore processors.
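The abstract names parallel scan as the key primitive behind the radix sort. The following is a minimal sequential sketch of that idea, not the authors' CUDA implementation: each pass partitions the keys by one bit, and an exclusive prefix sum supplies the scatter addresses. The function names (`exclusive_scan`, `split_by_bit`, `radix_sort`) and the one-bit-per-pass choice are assumptions made for illustration.

```python
def exclusive_scan(values):
    """Return the exclusive prefix sum of values (sequential stand-in for a parallel scan)."""
    out, running = [], 0
    for v in values:
        out.append(running)
        running += v
    return out


def split_by_bit(keys, bit):
    """Stably move keys whose given bit is 0 ahead of keys whose bit is 1."""
    zero_flags = [1 - ((k >> bit) & 1) for k in keys]
    zero_addr = exclusive_scan(zero_flags)  # zeros seen before each position
    total_zeros = sum(zero_flags)
    out = [0] * len(keys)
    for i, k in enumerate(keys):
        if zero_flags[i]:
            dest = zero_addr[i]
        else:
            # Ones seen before position i equal i minus the zeros seen before it.
            dest = total_zeros + (i - zero_addr[i])
        out[dest] = k
    return out


def radix_sort(keys, num_bits=32):
    """Least-significant-bit-first radix sort for non-negative integers below 2**num_bits."""
    for bit in range(num_bits):
        keys = split_by_bit(keys, bit)
    return keys
```

Because every destination address depends only on the scan result, the scatter loop has no dependencies between iterations, which is what makes this formulation attractive for data-parallel hardware.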
Rodinia: A Benchmark Suite for Heterogeneous Computing
Cited by 27 (4 self)
This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels which target multi-core CPU and GPU platforms. The choice of applications is inspired by Berkeley’s dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insights, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
MapReduce Optimization Using Regulated Dynamic Prioritization
Cited by 9 (0 self)
We present a system for allocating resources in shared data and compute clusters that improves MapReduce job scheduling in three ways. First, the system uses regulated and user-assigned priorities to offer different service levels to jobs and users over time. Second, the system dynamically adjusts resource allocations to fit the requirements of different job stages. Finally, the system automatically detects and eliminates bottlenecks within a job. We show experimentally using real applications that users can optimize not only job execution time but also the cost-benefit ratio or prioritization efficiency of a job using these three strategies. Our approach relies on a proportional share mechanism that continuously allocates virtual machine resources. Our experimental results show an 11-31% improvement in completion time and a 4-187% improvement in prioritization efficiency for different classes of MapReduce jobs. We further show that delay-intolerant users gain even more from our system.
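The abstract mentions a proportional share mechanism without giving its details. As a rough sketch of the general idea only, and not the system described in the paper, the snippet below divides a fixed capacity among jobs in proportion to their priorities. The function name `proportional_share` and the dictionary-based interface are hypothetical.

```python
def proportional_share(capacity, priorities):
    """Split capacity among jobs in proportion to their (non-negative) priorities.

    priorities maps a job name to its priority; this interface is an assumption.
    """
    total = sum(priorities.values())
    if total <= 0:
        # No job has a positive priority, so nothing is allocated.
        return {job: 0.0 for job in priorities}
    return {job: capacity * p / total for job, p in priorities.items()}
```

For example, `proportional_share(100, {"a": 3, "b": 1})` gives job `a` three quarters of the capacity and job `b` the remaining quarter. Re-running the function whenever priorities change is one simple way to approximate continuous reallocation.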
Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures
Cited by 6 (5 self)
Scaling computations on emerging massive-core supercomputers is a daunting task, which, coupled with significantly lagging system I/O capabilities, hurts applications' end-to-end performance. The I/O bottleneck often negates potential performance benefits of assigning additional compute cores to an application. In this paper, we address this issue via a novel functional partitioning (FP) runtime environment that allocates cores to specific application tasks (checkpointing, de-duplication, and scientific data format transformation) so that the deluge of cores can be brought to bear on the entire gamut of application activities. The focus is on utilizing the extra cores to support HPC application I/O activities and on leveraging solid-state disks in this context. For example, our evaluation shows that dedicating 1 core on an oct-core machine for checkpointing and its assist tasks using FP can improve overall execution time of a FLASH benchmark on 80 and 160 cores by 43.95% and 41.34%, respectively.
Parallel data mining on graphics processors, 2008
Cited by 6 (0 self)
We introduce GPUMiner, a novel parallel data mining system that utilizes new-generation graphics processing units (GPUs). Our system relies on the massively multi-threaded SIMD (Single Instruction, Multiple-Data) architecture provided by GPUs. As special-purpose co-processors, these processors are highly optimized for graphics rendering and rely on the CPU for data input/output as well as complex program control. Therefore, we design GPUMiner to consist of the following three components: (1) a CPU-based storage and buffer manager to handle I/O and data transfer between the CPU and the GPU, (2) a GPU-CPU co-processing parallel mining module, and (3) a GPU-based mining visualization module. We design the GPU-CPU co-processing scheme in mining depending on the complexity and inherent parallelism of individual mining algorithms. We provide the visualization module to help users observe and interact with the mining process online. We have implemented the k-means clustering and the Apriori frequent pattern mining algorithms in GPUMiner. Our preliminary results have shown significant speedups over state-of-the-art CPU implementations on a PC with a G80 GPU and a quad-core CPU. We will demonstrate the mining process through our visualization module. Code and documentation of GPUMiner are available at
The design and implementation of Ocelot’s dynamic binary translator from PTX to multi-core x86, 2009
Cited by 5 (0 self)
Ocelot is a dynamic compilation framework designed to map the explicitly parallel PTX execution model used by NVIDIA CUDA applications onto diverse many-core architectures. Ocelot includes a dynamic binary translator from PTX to many-core processors that leverages the LLVM code generator to target x86. The binary translator is able to execute CUDA applications without recompilation and Ocelot can in fact dynamically switch between execution on an NVIDIA GPU and a many-core CPU. It has been validated against over 100 applications taken from the CUDA SDK [1], the UIUC Parboil benchmarks [2], the Virginia Rodinia benchmarks [3], the GPU-VSIPL signal and image processing library [4], and several domain-specific applications. This paper presents a detailed description of the implementation of our binary translator, highlighting design decisions and trade-offs and showcasing their effect on application performance. We explore several code transformations that are applicable only when translating explicitly parallel applications and suggest additional optimization passes that may be useful to this class of applications. We expect this study to inform the design of compilation tools for explicitly parallel programming models (such as OpenCL) as well as future CPU and GPU architectures.
Optimizing MapReduce for Multicore Architectures
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Tech. Rep., 2010
Cited by 4 (1 self)
MapReduce is a programming model for data-parallel programs originally intended for data centers. MapReduce simplifies parallel programming, hiding synchronization and task management. These properties make it a promising programming model for future processors with many cores, and existing MapReduce libraries such as Phoenix have demonstrated that applications written with MapReduce perform competitively with those written with Pthreads [11]. This paper explores the design of the MapReduce data structures for grouping intermediate key/value pairs, which is often a performance bottleneck on multicore processors. The paper finds the best choice depends on workload characteristics, such as the number of keys used by the application, the degree of repetition of keys, etc. This paper also introduces a new MapReduce library, Metis, with a compromise data structure designed to perform well for most workloads. Experiments with the Phoenix benchmarks on a 16-core AMD-based server show that Metis' data structure performs better than simpler alternatives, including Phoenix.
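To make concrete what "grouping intermediate key/value pairs" means, here is a minimal single-threaded sketch using a plain hash table. It is not the Metis data structure, whose design the abstract does not spell out; the word-count workload and the names `map_words`, `group_by_key`, and `word_count` are assumptions chosen for illustration.

```python
from collections import defaultdict


def map_words(text):
    """Map phase: emit one (word, 1) pair per word in the text."""
    for word in text.split():
        yield word.lower(), 1


def group_by_key(pairs):
    """Grouping phase: collect all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def word_count(documents):
    """Run map, group, and reduce (sum) over a list of document strings."""
    pairs = (pair for doc in documents for pair in map_words(doc))
    groups = group_by_key(pairs)
    return {key: sum(values) for key, values in groups.items()}
```

With few distinct keys and heavy repetition, as in word count, nearly all of the work lands in `group_by_key`, which is consistent with the abstract's point that the grouping structure can dominate performance and that the best choice depends on the key distribution.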
Accelerating SQL Database Operations on a GPU with CUDA
Cited by 4 (1 self)
Prior work has shown dramatic acceleration for various database operations on GPUs, but only using primitives that are not part of conventional database languages such as SQL. This paper implements a subset of the SQLite command processor directly on the GPU. This dramatically reduces the effort required to achieve GPU acceleration by avoiding the need for database programmers to use new programming languages such as CUDA or modify their programs to use non-SQL libraries. This paper focuses on accelerating SELECT queries and describes the considerations in an efficient GPU implementation of the SQLite command processor. Results on an NVIDIA Tesla C1060 achieve speedups of 20-70X depending on the size of the result set.
Using graphics processors for high performance IR query processing
In WWW, 2009
Cited by 3 (0 self)
Research Interests
Web search technology: indexing, data compression, query processing and pruning, caching.
Distributed systems: algorithms under the Hadoop framework and performance issues.
GPU-based computation: GPU-based compression, GPU-based search, GPU-based algorithms.
Temporal web graph and ranking: web graph with temporal information, web graph compression, ranking using temporal web graph.
Machine learning related topics: document classification.

