A survey of general-purpose computation on graphics hardware
, 2007
Cited by 231 (11 self)
The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, have made graphics hardware a compelling platform for computationally demanding tasks in a wide variety of application domains. In this report, we describe, summarize, and analyze the latest research in mapping general-purpose computation to graphics hardware. We begin with the technical motivations that underlie general-purpose computation on graphics processors (GPGPU) and describe the hardware and software developments that have led to the recent interest in this field. We then aim the main body of this report at two separate audiences. First, we describe the techniques used in mapping general-purpose computation to graphics hardware. We believe these techniques will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques. Second, we survey and categorize the latest developments in general-purpose application development on graphics hardware.
CellSs: a Programming Model for the Cell BE Architecture
- ACM/IEEE CONFERENCE ON SUPERCOMPUTING
, 2006
Cited by 54 (5 self)
In this work we present Cell superscalar (CellSs), which addresses the automatic exploitation of the functional parallelism of a sequential program through the different processing elements of the Cell BE architecture. The focus is on the simplicity and flexibility of the programming model. Based on a simple annotation of the source code, a source-to-source compiler generates the necessary code and a runtime library exploits the existing parallelism by building a task dependency graph at runtime. The runtime takes care of the task scheduling and data handling between the different processors of this heterogeneous architecture. In addition, locality-aware task scheduling has been implemented to reduce the overhead of data transfers. The approach has been implemented and tested with a set of examples, and the results obtained so far are promising.
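The idea of building a task dependency graph at runtime from annotated inputs and outputs can be illustrated with a minimal Python sketch. This is not the CellSs API: the `TaskGraph` class, its `submit` and `ready_waves` methods, and the use of plain variable names to stand in for annotated parameters are all assumptions made for illustration. Each submitted task waits for the last writer of anything it reads, and for earlier readers and writers of anything it overwrites.

```python
from collections import defaultdict


class TaskGraph:
    """Hypothetical runtime that records dependencies as tasks are submitted."""

    def __init__(self):
        self.tasks = []                     # task names, indexed by task id
        self.deps = {}                      # task id -> ids it must wait for
        self._last_writer = {}              # variable -> id of its last writer
        self._readers = defaultdict(set)    # variable -> readers since last write

    def submit(self, name, inputs, outputs):
        tid = len(self.tasks)
        self.tasks.append(name)
        deps = set()
        for var in inputs:
            # Read-after-write: wait for whoever produced the value.
            if var in self._last_writer:
                deps.add(self._last_writer[var])
        for var in outputs:
            # Write-after-write and write-after-read: do not clobber data
            # that an earlier task still needs.
            if var in self._last_writer:
                deps.add(self._last_writer[var])
            deps |= self._readers[var]
        for var in inputs:
            self._readers[var].add(tid)
        for var in outputs:
            self._last_writer[var] = tid
            self._readers[var] = set()
        deps.discard(tid)
        self.deps[tid] = deps
        return tid

    def ready_waves(self):
        """Group tasks into waves whose members have no mutual dependencies."""
        done, waves = set(), []
        remaining = set(range(len(self.tasks)))
        while remaining:
            wave = sorted(t for t in remaining if self.deps[t] <= done)
            waves.append([self.tasks[t] for t in wave])
            done |= set(wave)
            remaining -= set(wave)
        return waves


if __name__ == "__main__":
    g = TaskGraph()
    g.submit("init_a", [], ["a"])
    g.submit("init_b", [], ["b"])
    g.submit("add", ["a", "b"], ["c"])
    g.submit("square_a", ["a"], ["d"])
    print(g.ready_waves())
```

Tasks in the same wave could be dispatched to different processing elements at once; a real runtime would schedule tasks as soon as their dependencies finish rather than in strict waves, and would also account for data locality.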
A memory model for scientific algorithms on graphics processors
- in Proc. of the ACM/IEEE Conference on Supercomputing (SC’06)
, 2006
Cited by 35 (3 self)
We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D block-based array representation to perform the underlying computations. We incorporate many characteristics of GPU architectures including smaller cache sizes, 2D block representations, and use the 3C’s model to analyze the cache misses. Moreover, we present techniques to improve the performance of nested loops on GPUs. In order to demonstrate the effectiveness of our model, we highlight its performance on three memory-intensive scientific applications – sorting, fast Fourier transform and dense matrix-multiplication. In practice, our cache-efficient algorithms for these applications are able to achieve memory throughput of 30–50 GB/s on an NVIDIA 7900 GTX GPU. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we are able to achieve 2–5× performance improvement.
Fast genetic programming on GPUs
- Proceedings of the 10th European Conference on Genetic Programming, volume 4445 of LNCS
, 2007
Cited by 16 (5 self)
As is typical in evolutionary algorithms, fitness evaluation in GP takes the majority of the computational effort. In this paper we demonstrate the use of the Graphics Processing Unit (GPU) to accelerate the evaluation of individuals. We show that for both binary and floating point based data types, it is possible to get speed increases of several hundred times over a typical CPU implementation. This allows for evaluation of many thousands of fitness cases, and hence should enable more ambitious solutions to be evolved using GP.
Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems
, 2008
Cited by 14 (8 self)
If multicore is a disruptive technology, try to imagine hybrid multicore systems enhanced with accelerators! This is happening today as accelerators, in particular Graphics Processing Units (GPUs), are steadily making their way into the high performance computing (HPC) world. We highlight the trends leading to the idea of hybrid manycore/GPU systems, and we present a set of techniques that can be used to efficiently program them. The presentation is in the context of Dense Linear Algebra (DLA), a major building block for many scientific computing applications. We motivate the need for new algorithms that would split the computation in a way that would fully exploit the power that each of the hybrid components offers. As the area of hybrid multicore/GPU computing is still in its infancy, we also argue for its importance in view of what future architectures may look like. We therefore envision the need for a DLA library similar to LAPACK but for hybrid manycore/GPU systems. We illustrate the main ideas with an LU-factorization algorithm where particular techniques are used to reduce the amount of pivoting, resulting in an algorithm achieving up to 388 GFlop/s for single and up to 99.4 GFlop/s for double precision factorization on a hybrid Intel Xeon
Multi-level graph layout on the GPU
- IEEE TRANS. VIS. COMPUT. GRAPH
, 2007
Cited by 13 (1 self)
This paper presents a new algorithm for force directed graph layout on the GPU. The algorithm, whose goal is to compute layouts accurately and quickly, has two contributions. The first contribution is proposing a general multi-level scheme, which is based on spectral partitioning. The second contribution is computing the layout on the GPU. Since the GPU requires a data parallel programming model, the challenge is devising a mapping of a naturally unstructured graph into a well-partitioned structured one. This is done by computing a balanced partitioning of a general graph. This algorithm provides a general multi-level scheme, which has the potential to be used not only for computation on the GPU, but also on emerging multi-core architectures. The algorithm manages to compute high quality layouts of large graphs in a fraction of the time required by existing algorithms of similar quality. An application for visualization of the topologies of ISP (Internet Service Provider) networks is presented. Index Terms—Graph layout, GPU, graph partitioning.
Concurrent number cruncher: a gpu implementation of a general sparse linear solver
- Int. J. Parallel Emerg. Distrib. Syst
Cited by 13 (0 self)
A wide class of numerical methods needs to solve a linear system, where the matrix pattern of non-zero coefficients can be arbitrary. These problems can greatly benefit from the highly multithreaded computational power and large memory bandwidth available on GPUs, especially since dedicated general-purpose APIs such as CTM (AMD-ATI) and CUDA (NVIDIA) have appeared. CUDA even provides a BLAS implementation, but only for dense matrices (CuBLAS). Other existing linear solvers for the GPU are also limited by their internal matrix representation. This paper describes how to combine recent GPU programming techniques and new GPU-dedicated APIs with high performance computing strategies (namely block compressed row storage, register blocking and vectorization) to implement a sparse general-purpose linear solver. Our implementation of the Jacobi-preconditioned Conjugate Gradient algorithm outperforms leading-edge CPU counterparts by up to a factor of 6.0x, making it attractive for applications that are content with single precision.
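For readers unfamiliar with the solver named in this abstract, the following is a minimal CPU-side Python sketch of a Jacobi-preconditioned Conjugate Gradient iteration over a sparse matrix. It makes several simplifying assumptions that differ from the paper: it uses plain compressed row storage rather than block compressed row storage, applies no register blocking or vectorization, and runs in double precision. The function names `csr_matvec` and `jacobi_pcg` are hypothetical, and the matrix is assumed to be symmetric positive definite with a non-zero diagonal.

```python
import math


def csr_matvec(vals, cols, rowptr, x):
    """Multiply a CSR matrix (vals, cols, rowptr) by the dense vector x."""
    return [
        sum(vals[k] * x[cols[k]] for k in range(rowptr[i], rowptr[i + 1]))
        for i in range(len(rowptr) - 1)
    ]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def jacobi_pcg(vals, cols, rowptr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite CSR matrix A."""
    n = len(b)
    # Jacobi preconditioner: the inverse of the diagonal of A.
    inv_diag = [0.0] * n
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            if cols[k] == i:
                inv_diag[i] = 1.0 / vals[k]

    x = [0.0] * n
    r = list(b)                                   # residual b - A x with x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]    # preconditioned residual
    p = list(z)                                   # search direction
    rz = dot(r, z)
    for _ in range(max_iter):
        if math.sqrt(dot(r, r)) <= tol:
            break
        ap = csr_matvec(vals, cols, rowptr, p)
        alpha = rz / dot(p, ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * ap[i] for i in range(n)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x


if __name__ == "__main__":
    # A = [[4, 1, 0], [1, 3, 0], [0, 0, 2]] stored in CSR form.
    vals = [4.0, 1.0, 1.0, 3.0, 2.0]
    cols = [0, 1, 0, 1, 2]
    rowptr = [0, 2, 4, 5]
    print(jacobi_pcg(vals, cols, rowptr, [1.0, 2.0, 2.0]))
```

The sparse matrix-vector product and the vector updates dominate the cost of each iteration, which is why storage layout and memory bandwidth matter so much for this kind of solver.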
Interactive depth of field using simulated diffusion
, 2006
Cited by 13 (1 self)
Figure 1: Top: Pinhole camera image from an upcoming feature film. Bottom: Sample results of our depth-of-field algorithm based on simulated diffusion. We generate these results from a single color and depth value per pixel, and the above images render at 23–25 frames per second. The method is designed to produce film-preview quality at interactive rates on a GPU. Fast preview should allow greater artistic control of depth-of-field effects.
Accurate computation of depth-of-field effects in computer graphics rendering is generally very time consuming, creating a problematic workflow for film authoring. The computation is particularly challenging because it depends on large-scale spatially-varying filtering that must accurately respect complex boundaries. A variety of real-time algorithms have been proposed for games, but the compromises required to achieve the necessary frame rates have made them unsuitable for film. Here we introduce an approximate depth-of-field computation that is good enough for film preview, yet can be computed interactively on a GPU. The computation creates depth-of-field blurs by simulating the heat equation for a nonuniform medium. Our alternating direction implicit solution gives rise to separable spatially varying recursive filters that can compute large-kernel convolutions in constant time per pixel while respecting the boundaries between in-focus and out-of-focus objects. Recursive filters have traditionally been viewed as problematic for GPUs, but using the well-established method of cyclic reduction of tridiagonal systems, we are able to vectorize the computation and achieve interactive frame rates.
Direction Implicit Methods, GPU, Tridiagonal Matrices, Cyclic Reduction.
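Cyclic reduction, the tridiagonal solver this abstract relies on, can be sketched in a few lines of Python. This is a sequential illustration under stated assumptions, not the paper's GPU implementation: the function name `cyclic_reduction` is hypothetical, the system is written as a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] with a[0] = 0 and c[n-1] = 0, and no pivoting is done, so the matrix is assumed to be diagonally dominant (as implicit heat-equation steps typically are).

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system by recursively eliminating odd unknowns."""
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]

    # Reduction: fold each odd-indexed equation into its even neighbours.
    # Every pass of this loop is independent of the others, which is what
    # makes the method suitable for data-parallel hardware.
    ra, rb, rc, rd = [], [], [], []
    for i in range(0, n, 2):
        na, nb, nc, nd = 0.0, b[i], 0.0, d[i]
        if i > 0:
            alpha = a[i] / b[i - 1]
            na = -alpha * a[i - 1]
            nb -= alpha * c[i - 1]
            nd -= alpha * d[i - 1]
        if i + 1 < n:
            gamma = c[i] / b[i + 1]
            nb -= gamma * a[i + 1]
            nc = -gamma * c[i + 1]
            nd -= gamma * d[i + 1]
        ra.append(na)
        rb.append(nb)
        rc.append(nc)
        rd.append(nd)

    # The even-indexed unknowns again form a tridiagonal system of half the size.
    x = [0.0] * n
    x[0::2] = cyclic_reduction(ra, rb, rc, rd)

    # Back-substitution: each odd unknown follows from its two even neighbours.
    for i in range(1, n, 2):
        right = x[i + 1] if i + 1 < n else 0.0
        x[i] = (d[i] - a[i] * x[i - 1] - c[i] * right) / b[i]
    return x


if __name__ == "__main__":
    n = 5
    a = [0.0] + [-1.0] * (n - 1)
    b = [4.0] * n
    c = [-1.0] * (n - 1) + [0.0]
    d = [2.0, 4.0, 6.0, 8.0, 16.0]
    print(cyclic_reduction(a, b, c, d))
```

The recursion depth is logarithmic in the number of unknowns, and within each level the eliminations do not depend on one another, in contrast to the strictly sequential sweep of the standard forward-elimination solver.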
VMM-independent graphics acceleration
- In Proceedings of VEE 2007
, 2007
Cited by 10 (3 self)
This paper describes VMGL, a cross-platform OpenGL virtualization solution that is both virtual machine monitor (VMM) and graphics processing unit (GPU) independent. VMGL allows applications executing within virtual machines (VMs) to leverage hardware rendering acceleration, thus solving a problem that has limited virtualization of a growing class of graphics-intensive applications. VMGL also provides applications running within VMs with suspend and resume capabilities across GPUs from different vendors. Our experimental results from a number of graphics-intensive applications show that VMGL provides excellent rendering performance, coming within 14% or better of native graphics hardware acceleration. Further, VMGL’s performance is two orders of magnitude better than that of software rendering, the commonly available alternative today for graphics-intensive applications running in virtualized environments. Our results confirm VMGL’s portability across VMware Workstation and Xen (on VT and non-VT hardware), and across Linux (with and without paravirtualization), FreeBSD, and Solaris. Finally, the resource demands of VMGL align well with the emerging trend of multi-core processors. Categories and Subject Descriptors I.3.4 [Computer Graphics]:
Cholesky decomposition and linear programming on a GPU
, 2006
Cited by 9 (0 self)
The rapid evolution of Graphics Processing Units (GPUs) in performance, architecture, and programmability provides computational potential beyond their primary purpose, graphics processing. In this work we present an efficient algorithm for solving symmetric and positive definite linear systems using triangular update on a GPU. Using the decomposition algorithm and other basic building blocks for linear algebra on the GPU, we demonstrate a GPU-powered linear program solver based on a Primal-Dual Interior-Point Method. Contributions: We present a new algorithm to decompose symmetric and positive definite dense matrices through a set of kernel calls with minimum copying operations to maximize performance. Using our algorithm and other BLAS kernels, we demonstrate how to build a GPU-powered primal-dual interior-point method with minimal feedback to the CPU. We use:
• Triangular domain updating to exploit the symmetric structure.
• Texture coordinate mapping and index swizzling for efficient texture fetching.
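As background for the decomposition this abstract refers to, here is a minimal Python sketch of a textbook column-by-column Cholesky factorization followed by the two triangular solves. It is a plain CPU reference under stated assumptions, not the paper's GPU kernel sequence: the function names `cholesky` and `solve_spd` are hypothetical, and matrices are dense lists of lists. Only entries on or below the diagonal are ever written, which is the symmetry that a triangular update exploits.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T == a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: what is left of a[j][j] after earlier columns.
        diag = a[j][j] - sum(low[j][k] ** 2 for k in range(j))
        if diag <= 0.0:
            raise ValueError("matrix is not positive definite")
        low[j][j] = math.sqrt(diag)
        # Update only the rows below the diagonal (the triangular domain).
        for i in range(j + 1, n):
            rest = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = rest / low[j][j]
    return low


def solve_spd(a, b):
    """Solve a x = b via Cholesky: forward solve L y = b, then back solve L^T x = y."""
    low = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


if __name__ == "__main__":
    a = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
    print(cholesky(a))
    print(solve_spd(a, [-20.0, -43.0, 192.0]))
```

In an interior-point method, a solve of this kind is repeated at every iteration on a freshly assembled symmetric positive definite matrix, so the factorization is the natural step to accelerate.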

