Results 1 - 10 of 14
Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation
MICRO, 2012
"... Data warehousing applications represent an emerging application arena that requires the processing of relational queries and computations over massive amounts of data. Modern general purpose GPUs are high bandwidth architectures that potentially offer substantial improvements in throughput for these ..."
Cited by 12 (2 self)
Abstract: Data warehousing applications represent an emerging application arena that requires the processing of relational queries and computations over massive amounts of data. Modern general-purpose GPUs are high-bandwidth architectures that potentially offer substantial improvements in throughput for these applications. However, significant challenges arise due to the overheads of data movement through the memory hierarchy and between the GPU and host CPU. This paper proposes data movement optimizations to address these challenges. Inspired in part by loop fusion optimizations in the scientific computing community, we propose kernel fusion as a basis for data movement optimizations. Kernel fusion fuses the code bodies of two GPU kernels to i) reduce the data footprint, cutting down data movement throughout the GPU and CPU memory hierarchy, and ii) enlarge the compiler optimization scope. We classify producer-consumer dependences between compute kernels into three types: i) fine-grained thread-to-thread dependences, ii) medium-grained thread-block dependences, and iii) coarse-grained kernel dependences. Based on this classification, we propose a compiler framework, Kernel Weaver, that can automatically fuse relational algebra operators, thereby eliminating redundant data movement. Experiments on NVIDIA Fermi platforms demonstrate that kernel fusion achieves a 2.89x speedup in GPU computation and a 2.35x speedup in PCIe transfer time on average across the microbenchmarks tested. We present key insights, lessons learned, measurements from our compiler implementation, and opportunities for further improvements.
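To make the fusion idea concrete, here is a toy Python sketch of our own (not the Kernel Weaver compiler; CPU-side lists stand in for GPU buffers): fusing a producer kernel into its consumer removes the intermediate result that would otherwise travel through the memory hierarchy.

```python
# Toy illustration of the kernel-fusion idea: a producer kernel
# (select) feeding a consumer kernel (project) is fused so the
# intermediate result is never materialized in memory.

def select_then_project_unfused(rows):
    # Kernel 1 (producer): filter rows -- writes an intermediate buffer.
    selected = [r for r in rows if r["qty"] > 10]
    # Kernel 2 (consumer): project a column -- reads the buffer back.
    return [r["price"] * r["qty"] for r in selected]

def select_then_project_fused(rows):
    # Fused kernel: the producer value flows straight into the consumer,
    # so no intermediate buffer is allocated or transferred.
    return [r["price"] * r["qty"] for r in rows if r["qty"] > 10]

rows = [{"qty": q, "price": 2.0} for q in range(20)]
assert select_then_project_unfused(rows) == select_then_project_fused(rows)
```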
Collision-streams: Fast GPU-based collision detection for deformable models
In ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, 2011
"... into a funnel and pass through it under the pressure of a ball. This model has 47K vertices, 92K triangles, and a lot of self-collisions. Our novel GPU-based CCD algorithm takes 4:4ms and 10ms per frame to compute all the collisions on a NVIDIA GeForce GTX 480 and a NVIDIA GeForce GTX 285, respectiv ..."
Cited by 9 (1 self)
Figure caption (truncated): … into a funnel and pass through it under the pressure of a ball. This model has 47K vertices, 92K triangles, and many self-collisions. Our novel GPU-based CCD algorithm takes 4.4 ms and 10 ms per frame to compute all the collisions on an NVIDIA GeForce GTX 480 and an NVIDIA GeForce GTX 285, respectively.
Abstract: We present a fast GPU-based streaming algorithm to perform collision queries between deformable models. Our approach is based on hierarchical culling and reduces the computation to generating different streams. We present a novel stream registration method to compact the streams and efficiently compute the potentially colliding pairs of primitives. We also use a deferred front tracking method to lower the memory overhead. The overall algorithm has been implemented on different GPUs and we have evaluated its performance on non-rigid and deformable simulations. We highlight our speedups over prior CPU-based and GPU-based algorithms. In practice, our algorithm can perform inter-object and intra-object computations on models composed of hundreds of thousands of triangles in tens of milliseconds.
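The stream-compaction step underlying ideas like the stream registration above can be sketched generically; the 1-D interval overlap test and the data below are illustrative stand-ins of our own, not the paper's method.

```python
# A minimal sketch of prefix-sum-based stream compaction, a generic
# GPU building block for packing surviving candidate pairs densely.

def exclusive_scan(flags):
    out, total = [], 0
    for f in flags:
        out.append(total)
        total += f
    return out, total

def compact_overlapping_pairs(pairs, boxes):
    # Flag pairs whose 1-D intervals overlap (stand-in for an AABB test).
    flags = [int(boxes[a][0] <= boxes[b][1] and boxes[b][0] <= boxes[a][1])
             for a, b in pairs]
    offsets, count = exclusive_scan(flags)
    out = [None] * count
    # Each "thread" writes its surviving pair to a unique compacted slot.
    for i, (f, off) in enumerate(zip(flags, offsets)):
        if f:
            out[off] = pairs[i]
    return out

boxes = [(0, 2), (1, 3), (5, 6)]
pairs = [(0, 1), (0, 2), (1, 2)]
print(compact_overlapping_pairs(pairs, boxes))  # [(0, 1)]
```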
Barrier Invariants: A Shared State Abstraction for the Analysis of Data-Dependent GPU Kernels
OOPSLA ’13, 2013
"... ..."
(Show Context)
A Sound and Complete Abstraction for Reasoning about Parallel Prefix Sums
"... Prefix sums are key building blocks in the implementation of many concurrent software applications, and recently much work has gone into efficiently implementing prefix sums to run on massively par-allel graphics processing units (GPUs). Because they lie at the heart of many GPU-accelerated applicat ..."
Cited by 5 (3 self)
Abstract: Prefix sums are key building blocks in the implementation of many concurrent software applications, and recently much work has gone into efficiently implementing prefix sums to run on massively parallel graphics processing units (GPUs). Because they lie at the heart of many GPU-accelerated applications, the correctness of prefix sum implementations is of prime importance. We introduce a novel abstraction, the interval of summations, that allows scalable reasoning about implementations of prefix sums. We present this abstraction as a monoid, and prove a soundness and completeness result showing that a generic sequential prefix sum implementation is correct for an array of length n if and only if it computes the correct result for a specific test case when instantiated with the interval of summations monoid. This allows correctness to be established by running a single test where the in- …
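One plausible reading of this abstraction can be sketched in Python; the names, the TOP error element, and the adjacency rule below are our reconstruction for illustration, not the paper's formal definitions.

```python
# A minimal sketch of an interval-of-summations monoid: an element
# (i, j) stands for the sum x_i + ... + x_j; intervals compose only
# when adjacent, and one run of a generic prefix-sum routine over
# (0, 0), (1, 1), ... serves as the single certifying test.

ID, TOP = "id", "top"  # monoid identity and "invalid sum" element

def combine(a, b):
    if a == ID: return b
    if b == ID: return a
    if a == TOP or b == TOP: return TOP
    (i, j), (k, l) = a, b
    return (i, l) if k == j + 1 else TOP  # only adjacent intervals fuse

def sequential_prefix_sum(xs, op, identity):
    # Generic inclusive scan, parametric in the monoid (op, identity).
    acc, out = identity, []
    for x in xs:
        acc = op(acc, x)
        out.append(acc)
    return out

n = 8
result = sequential_prefix_sum([(i, i) for i in range(n)], combine, ID)
# Correct iff output k is exactly (0, k), i.e. the sum x_0 + ... + x_k.
assert result == [(0, k) for k in range(n)]
```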
High Resolution Sparse Voxel DAGs
"... Figure 1: The EPICCITADEL scene voxelized to a 128K 3 (131 072 3) resolution and stored as a Sparse Voxel DAG. Total voxel count is 19 billion, which requires 945MB of GPU memory. A sparse voxel octree would require 5.1GB without counting pointers. Primary shading is from triangle rasterization, whi ..."
Cited by 3 (2 self)
Figure 1 caption: The Epic Citadel scene voxelized to a 128K³ (131,072³) resolution and stored as a Sparse Voxel DAG. The total voxel count is 19 billion, which requires 945 MB of GPU memory. A sparse voxel octree would require 5.1 GB without counting pointers. Primary shading is from triangle rasterization, while ambient occlusion and shadows are ray traced in the sparse voxel DAG at 170 MRays/sec and 240 MRays/sec respectively, on an NVIDIA GTX 680.
Abstract: We show that a binary voxel grid can be represented orders of magnitude more efficiently than with a sparse voxel octree (SVO) by generalising the tree to a directed acyclic graph (DAG). While the SVO allows for efficient encoding of empty regions of space, the DAG additionally allows for efficient encoding of identical regions of space, as nodes are allowed to share pointers to identical subtrees. We present an efficient bottom-up algorithm that reduces an SVO to a minimal DAG, which can be applied even in cases where the complete SVO would not fit in memory. In all tested scenes, even the highly irregular ones, the number of nodes is reduced by one to three orders of magnitude. While the DAG requires more pointers per node, the memory cost for these is quickly amortized and the memory consumption of the DAG is considerably smaller, even when compared to an ideal SVO without pointers. Meanwhile, our sparse voxel DAG requires no decompression and can be traversed very efficiently. We demonstrate this by ray tracing hard and soft shadows, ambient occlusion, and primary rays in extremely high resolution DAGs at speeds that are on par with, or even faster than, state-of-the-art voxel and triangle GPU ray tracing.
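The bottom-up reduction can be sketched as subtree interning; the toy below uses two children per node instead of eight purely to stay small, and is our own illustration rather than the paper's algorithm.

```python
# A minimal sketch of the bottom-up SVO-to-DAG reduction idea: merge
# identical subtrees by interning each node's (already-reduced) child
# tuple, so repeated regions of space are stored exactly once.

def reduce_to_dag(node, table, nodes):
    # node: None (empty), True (full leaf), or a (left, right) tuple.
    if node is None or node is True:
        return node
    key = (reduce_to_dag(node[0], table, nodes),
           reduce_to_dag(node[1], table, nodes))
    if key not in table:          # first time we see this subtree shape:
        table[key] = len(nodes)   # assign it a unique node id
        nodes.append(key)
    return table[key]

# Two identical subtrees collapse to one shared DAG node.
subtree = (True, None)
root = (subtree, subtree)
table, nodes = {}, []
root_id = reduce_to_dag(root, table, nodes)
print(len(nodes))  # 2 unique nodes instead of 3 tree nodes
```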
Binned SAH Kd-Tree Construction on a GPU
2010
"... In ray tracing, kd-trees are often regarded as the best acceleration structures in the majority of cases. However, due to their large construction times they have been problematic for dynamic scenes. In this work, we try to overcome this obstacle by building the kd-tree in parallel on many cores of ..."
Cited by 3 (0 self)
Abstract: In ray tracing, kd-trees are often regarded as the best acceleration structures in the majority of cases. However, due to their long construction times they have been problematic for dynamic scenes. In this work, we try to overcome this obstacle by building the kd-tree in parallel on the many cores of a GPU. A new algorithm ensures close to optimal parallelism during every stage of the build process. The approach uses the SAH and samples the cost function at discrete (bin) locations. This approach constructs kd-trees faster than any other known GPU implementation, while maintaining quality competitive with serial high-quality CPU builders. Further tests have shown that our scalability with respect to the number of cores is better than that of other available GPU and CPU implementations.
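The binned SAH evaluation can be sketched in one dimension; the cost constants and the interval-length stand-in for surface area below are simplifications of our own, not the paper's exact cost model.

```python
# A minimal 1-D sketch of binned SAH splitting: bucket primitive
# centroids into bins, then evaluate the SAH cost only at bin
# boundaries instead of at every primitive boundary.

def binned_sah_split(prims, lo, hi, n_bins=8, c_trav=1.0, c_isect=1.5):
    # prims: primitive centroids in [lo, hi]; count how many fall per bin.
    counts = [0] * n_bins
    for c in prims:
        b = min(int((c - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[b] += 1
    best = (float("inf"), None)
    n_left = 0
    for i in range(1, n_bins):  # candidate planes at bin boundaries
        n_left += counts[i - 1]
        n_right = len(prims) - n_left
        split = lo + (hi - lo) * i / n_bins
        # SAH: traversal cost + hit probability * child primitive count,
        # with interval length standing in for surface area in 1-D.
        cost = c_trav + c_isect * ((split - lo) / (hi - lo) * n_left +
                                   (hi - split) / (hi - lo) * n_right)
        best = min(best, (cost, split))
    return best  # (estimated cost, split position)

print(binned_sah_split([0.1, 0.2, 0.3, 0.9], 0.0, 1.0))
```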
Optimization and Architecture Effects on GPU Computing Workload Performance
"... It is unquestionable that successive hardware generations have significantly improved CPU computing workload performance over the last several years. Moore's law and DRAM scaling have respectively increased single-chip peak instruction throughput by 3X and off-chip bandwidth by 2.2X from ..."
Cited by 1 (0 self)
Abstract: It is unquestionable that successive hardware generations have significantly improved GPU computing workload performance over the last several years. Moore's law and DRAM scaling have respectively increased single-chip peak instruction throughput by 3X and off-chip bandwidth by 2.2X from …
Efficient RDF Stream Reasoning with Graphics Processing Units (GPUs)
"... In this paper, we study the problem of stream reasoning and propose a reasoning approach over large amounts of RDF data, which uses graphics processing units (GPU) to improve the performance. First, we show how the problem of stream reasoning can be reduced to a temporal reasoning problem. Then, we ..."
Cited by 1 (0 self)
Abstract: In this paper, we study the problem of stream reasoning and propose a reasoning approach over large amounts of RDF data which uses graphics processing units (GPUs) to improve performance. First, we show how the problem of stream reasoning can be reduced to a temporal reasoning problem. Then, we describe a number of algorithms to perform stream reasoning with GPUs.
Load Balancing versus Occupancy Maximization on Graphics Processing Units: The Generalized Hough Transform as a Case Study
"... Programs developed under the Compute Unified Device Architecture obtain the highest performance rate, when the exploitation of hardware resources on a Graphics Processing Unit (GPU) is maximized. In order to achieve this purpose, load balancing among threads and a high value of processor occupancy, ..."
Abstract: Programs developed under the Compute Unified Device Architecture achieve the highest performance when the exploitation of hardware resources on a Graphics Processing Unit (GPU) is maximized. To achieve this, load balancing among threads and a high value of processor occupancy, i.e., the ratio of active threads, are indispensable. However, in certain applications, an optimally balanced implementation may limit the occupancy, due to a greater need for registers and shared memory. This is the case for the Fast Generalized Hough Transform (Fast GHT), an image-processing technique for localizing an object within an image. In this work, we present two parallelization alternatives for the Fast GHT, one that optimizes the load balancing and another that maximizes the occupancy. We have compared them using a large number of real images to test their strong and weak points, and we have drawn several conclusions about the conditions under which it is better to use one or the other. We have also tackled several parallelization problems related to sparse data distribution, divergent execution paths, and irregular memory access patterns in updating operations by proposing a set of generic techniques, including compacting, sorting, and memory storage replication. Finally, we have compared our Fast GHT with the classic GHT, both on a current GPU, obtaining a significant speed-up.
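The occupancy side of this trade-off is simple arithmetic over per-thread and per-block resource usage; the hardware limits in this sketch are illustrative round numbers, not those of a specific GPU or of the paper's experiments.

```python
# A minimal sketch of theoretical-occupancy arithmetic: a kernel that
# needs more registers or shared memory per block (e.g. a carefully
# load-balanced variant) caps how many blocks fit on a multiprocessor.

def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_threads=1536, max_regs=32768, max_smem=49152):
    blocks = min(max_threads // threads_per_block,
                 max_regs // (regs_per_thread * threads_per_block),
                 max_smem // max(smem_per_block, 1))
    active = blocks * threads_per_block
    return active / max_threads  # fraction of resident threads

# A register/shared-memory-hungry variant vs. a lighter one:
print(occupancy(256, 40, 16384))  # heavier kernel -> 0.5 occupancy
print(occupancy(256, 20, 4096))   # lighter kernel -> 1.0 occupancy
```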