Results 1–10 of 26
A Quantitative Performance Analysis Model for GPU Architectures
In HPCA, 2011
Cited by 57 (2 self)
We develop a microbenchmark-based performance model for NVIDIA GeForce 200-series GPUs. Our model identifies GPU program bottlenecks and quantitatively analyzes performance, and thus allows programmers and architects to predict the benefits of potential program optimizations and architectural improvements. In particular, we use a microbenchmark-based approach to develop a throughput model for three major components of GPU execution time: the instruction pipeline, shared memory access, and global memory access. Because our model is based on the GPU’s native instruction set, we can predict performance with a 5–15% error. To demonstrate the usefulness of the model, we analyze three representative real-world and already highly optimized programs: dense matrix multiply, tridiagonal systems solver, and sparse matrix-vector multiply. The model provides us with a detailed quantitative analysis of performance, allowing us to understand the configuration of the fastest dense matrix multiply implementation and to optimize the tridiagonal solver and sparse matrix-vector multiply by 60% and 18% respectively. Furthermore, applying our model to these codes allows us to suggest architectural improvements in hardware resource allocation, bank-conflict avoidance, block scheduling, and memory transaction granularity.
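A toy illustration of the bottleneck-style throughput model this abstract describes (the function and parameter names below are ours, not the paper's): predicted kernel time is taken as the slowest of the three modeled components.

```python
def predict_kernel_time(n_inst, inst_throughput,
                        smem_trans, smem_throughput,
                        gmem_trans, gmem_throughput):
    """Toy bottleneck model: execution time is limited by the slowest
    of the three components (instruction pipeline, shared memory,
    global memory), each characterized by a measured throughput."""
    return max(n_inst / inst_throughput,      # instruction pipeline
               smem_trans / smem_throughput,  # shared memory accesses
               gmem_trans / gmem_throughput)  # global memory accesses
```

In this sketch a kernel with 300 global memory transactions at a throughput of 10 per cycle is memory bound regardless of how few instructions it executes, which is the kind of diagnosis the paper's model automates.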
Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid
Cited by 27 (1 self)
We have previously suggested mixed precision iterative solvers specifically tailored to the iterative solution of sparse linear equation systems as they typically arise in the finite element discretization of partial differential equations. These schemes have been evaluated for a number of hardware platforms, in particular single precision GPUs as accelerators to the general purpose CPU. This paper re-evaluates the situation with new mixed precision solvers that run entirely on the GPU: we demonstrate that mixed precision schemes constitute a significant performance gain over native double precision. Moreover, we present a new implementation of cyclic reduction for the parallel solution of tridiagonal systems and employ this scheme as a line relaxation smoother in our GPU-based multigrid solver. With an alternating direction implicit variant of this advanced smoother we can extend the applicability of the GPU multigrid solvers to very ill-conditioned systems arising from the discretization on anisotropic meshes that previously had to be solved on the CPU. The resulting mixed precision schemes are always faster than double precision alone, and outperform tuned CPU solvers consistently by almost an order of magnitude.
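As a reference point for the cyclic reduction scheme the abstract mentions, here is a minimal serial sketch (our own illustration, not the paper's GPU kernel; on a GPU all eliminations at one stride run in parallel, and the system size is assumed to be a power of two):

```python
import numpy as np

def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system by cyclic reduction.
    a: sub-diagonal (a[0] ignored), b: main diagonal,
    c: super-diagonal (c[-1] ignored), d: right-hand side.
    The system size must be a power of two."""
    a = a.astype(float).copy(); b = b.astype(float).copy()
    c = c.astype(float).copy(); d = d.astype(float).copy()
    n = len(b)
    a[0] = 0.0
    c[-1] = 0.0
    # Forward phase: at each stride, every second remaining equation
    # eliminates its two neighbours (these updates are independent).
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            lo, hi = i - stride, i + stride
            alpha = -a[i] / b[lo]
            beta = -c[i] / b[hi] if hi < n else 0.0
            b[i] += alpha * c[lo] + (beta * a[hi] if hi < n else 0.0)
            d[i] += alpha * d[lo] + (beta * d[hi] if hi < n else 0.0)
            a[i] = alpha * a[lo]
            c[i] = beta * c[hi] if hi < n else 0.0
        stride *= 2
    # Backward phase: substitute known unknowns back, halving the stride.
    x = np.zeros(n)
    x[n - 1] = d[n - 1] / b[n - 1]
    stride //= 2
    while stride >= 1:
        for i in range(stride - 1, n, 2 * stride):
            s = d[i]
            if i - stride >= 0:
                s -= a[i] * x[i - stride]
            if i + stride < n:
                s -= c[i] * x[i + stride]
            x[i] = s / b[i]
        stride //= 2
    return x
```

The log2(n) dependent steps (versus n for forward/back substitution) are what make this scheme attractive as a parallel line-relaxation smoother.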
An autotuned method for solving large tridiagonal systems on the GPU
In Proceedings of the 25th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2011
Cited by 13 (3 self)
We present a multi-stage method for solving large tridiagonal systems on the GPU. Previously, large tridiagonal systems could not be solved efficiently due to the limited size of on-chip shared memory. We tackle this problem by splitting the systems into smaller ones and then solving them on-chip. The multi-stage character of our method, together with various workloads and GPUs of different capabilities, necessitates an auto-tuning strategy to carefully select the switch points between computation stages. In particular, we show two ways to effectively prune the tuning space and thus avoid an impractical exhaustive search: (1) apply algorithmic knowledge to decouple tuning parameters, and (2) estimate search starting points based on GPU architecture parameters. We demonstrate that auto-tuning is a powerful tool that improves performance by up to 5x, saves 17% and 32% of execution time on average over static and dynamic tuning respectively, and enables our multi-stage solver to outperform the Intel MKL tridiagonal solver on many parallel tridiagonal systems by 6–11x. Keywords: GPU computing, auto-tuning algorithms, tridiagonal systems.
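Pruning strategy (1) can be sketched as coordinate-wise search: if the tuning parameters are (roughly) decoupled, each can be tuned independently against the measured cost instead of exhaustively searching their cross product. This is our own minimal illustration; the parameter names are hypothetical, not the paper's.

```python
def tune_decoupled(param_space, measure):
    """Coordinate-wise tuning under a decoupling assumption: tune one
    parameter at a time against the measured cost instead of searching
    the full cross product of the space."""
    # Start from the first listed choice of every parameter.
    best = {name: choices[0] for name, choices in param_space.items()}
    for name, choices in param_space.items():
        best[name] = min(choices,
                         key=lambda v: measure({**best, name: v}))
    return best
```

With d parameters of k choices each this evaluates d*k configurations instead of k**d, which is the kind of reduction that makes auto-tuning practical at kernel-launch time.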
A Memory Access Model for Highly-threaded Many-core Architectures
In ICPADS, 2012
Cited by 10 (1 self)
Many-core architectures are excellent at hiding memory-access latency through low-overhead context switching among a large number of threads. The speedup of algorithms carried out on these machines depends on how well the latency is hidden. If the number of threads were infinite, these machines should theoretically provide the performance predicted by a PRAM analysis of the programs. However, the number of allowable threads per processor is not infinite. In this paper, we introduce the Threaded Many-core Memory (TMM) model, which is meant to capture the important characteristics of these highly-threaded, many-core machines. Since we model some important machine parameters, we expect analysis under this model to give more fine-grained performance predictions than PRAM analysis. We analyze four algorithms for the classic all-pairs shortest paths problem under this model. We find that even when two algorithms have the same PRAM performance, our model predicts different performance for some settings of machine parameters. For example, for dense graphs, the Floyd-Warshall algorithm and Johnson’s algorithm have the same performance in the PRAM model. However, our model predicts different performance for large enough memory-access latency, validating the intuition that the Floyd-Warshall algorithm performs better on these machines.
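For concreteness, here is a minimal dense Floyd-Warshall sketch (ours, not the paper's): the outer k loop is the inherently sequential part, while the element-wise minimum over all n*n pairs inside it is what a PRAM (or a highly-threaded GPU) parallelizes.

```python
import numpy as np

def floyd_warshall(w):
    """All-pairs shortest paths on a dense weight matrix.
    w[i, j] is the edge weight, np.inf where no edge, 0 on the diagonal.
    Each outer iteration is one fully parallel n*n relaxation."""
    d = w.astype(float).copy()
    n = d.shape[0]
    for k in range(n):
        # Relax every pair (i, j) through intermediate vertex k:
        # d[i, j] = min(d[i, j], d[i, k] + d[k, j]), done via broadcasting.
        d = np.minimum(d, d[:, [k]] + d[[k], :])
    return d
```

The regular, dense memory access of this relaxation (versus the irregular priority-queue traffic of Johnson's algorithm) is exactly the property a memory-access model can reward.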
Performance Analysis of a Hybrid MPI/CUDA Implementation of the NAS-LU Benchmark
Cited by 7 (1 self)
We present the performance analysis of a port of the LU benchmark from the NAS Parallel Benchmark (NPB) suite to NVIDIA’s Compute Unified Device Architecture (CUDA), and report on the optimisation efforts employed to take advantage of this platform. Execution times are reported for several different GPUs, ranging from low-end consumer-grade products to high-end HPC-grade devices, including the Tesla C2050 built on NVIDIA’s Fermi processor. We also utilise recently developed performance models of LU to facilitate a comparison between future large-scale distributed clusters of GPU devices and existing clusters built on traditional CPU architectures, including a quad-socket, quad-core AMD Opteron cluster and an IBM BlueGene/P.
A Scalable, Numerically Stable, High-performance Tridiagonal Solver using GPUs
Cited by 4 (1 self)
In this paper, we present a scalable, numerically stable, high-performance tridiagonal solver. The solver is based on the SPIKE algorithm for partitioning a large matrix into small independent matrices, which can be solved in parallel. For each small matrix, our solver applies a general 1-by-1 or 2-by-2 diagonal pivoting algorithm, which is also known to be numerically stable. Our paper makes two major contributions. First, our solver is the first numerically stable tridiagonal solver for GPUs. It provides stable solutions of quality comparable to Intel MKL and MATLAB, at speed comparable to the GPU tridiagonal solvers in existing packages like CUSPARSE, and it is also scalable to multiple GPUs and CPUs. Second, we present and analyze two key optimization strategies for our solver: a high-throughput data layout transformation for memory efficiency, and a dynamic tiling approach for reducing the memory access footprint caused by branch divergence.
Simulating spiking neural networks on GPU
, 2012
Cited by 1 (0 self)
Modern graphics cards contain hundreds of cores that can be programmed for intensive calculations. They are beginning to be used for spiking neural network simulations. The goal is to make parallel simulation of spiking neural networks available to a large audience, without requiring a cluster. We review the ongoing efforts towards this goal, and we outline the main difficulties.
On aligning massive time-series data in Splash
In VLDB BigData Workshop, 2012
Cited by 1 (1 self)
Important emerging sources of big data are large-scale predictive simulation models used in e-science and, increasingly, in guiding policy and investment decisions around highly complex issues such as population health and safety. The Splash project provides a platform for combining existing heterogeneous simulation models and datasets across a broad range of disciplines to capture the behavior of complex systems of systems. Splash loosely couples models via data exchange, where each sub-model often produces or expects time series having huge numbers of time points and many data values per time point. If the time-series output of one “source” sub-model is used as input for another “target” sub-model and the time granularity of the source is coarser than that of the target, an interpolation operation is required. Cubic-spline interpolation is the most widely used method because of its smoothness properties. Scalable methods are needed for such data transformations, because the amount of data produced by a simulation program can be massive when simulating large, complex systems over long time periods, especially when the time dimension is modeled at high resolution. We demonstrate that we can efficiently perform cubic-spline interpolation over a massive time series in a MapReduce environment using novel algorithms based on adapting the distributed stochastic gradient descent (DSGD) method of Gemulla et al., originally developed for low-rank matrix factorization. Specifically, we adapt DSGD to calculate the coefficients that appear in the cubic-spline interpolation formula by solving a massive tridiagonal system of linear equations. Our techniques are potentially applicable to both spline interpolation and parallel solution of tridiagonal linear systems in other massively parallel data-integration and data-analysis applications.
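To make the connection between spline coefficients and tridiagonal systems concrete, here is a small sketch (our own illustration, not the paper's DSGD method): the second derivatives M_i of a natural cubic spline satisfy a tridiagonal system, solved here with a dense `np.linalg.solve` standing in for the massive parallel solve the abstract describes.

```python
import numpy as np

def natural_spline_second_derivs(x, y):
    """Build and solve the tridiagonal system for the second
    derivatives M_i of the natural cubic spline through (x_i, y_i).
    Interior knot i gives the equation
      h[i-1]*M[i-1] + 2*(h[i-1]+h[i])*M[i] + h[i]*M[i+1] = rhs[i]."""
    n = len(x) - 1                      # number of intervals
    h = np.diff(x)                      # interval widths
    A = np.zeros((n - 1, n - 1))
    rhs = np.zeros(n - 1)
    for i in range(1, n):               # one equation per interior knot
        if i > 1:
            A[i - 1, i - 2] = h[i - 1]              # sub-diagonal
        A[i - 1, i - 1] = 2.0 * (h[i - 1] + h[i])   # main diagonal
        if i < n - 1:
            A[i - 1, i] = h[i]                      # super-diagonal
        rhs[i - 1] = 6.0 * ((y[i + 1] - y[i]) / h[i]
                            - (y[i] - y[i - 1]) / h[i - 1])
    M = np.zeros(n + 1)                 # natural boundary: M_0 = M_n = 0
    M[1:n] = np.linalg.solve(A, rhs)
    return M
```

The matrix is diagonally dominant, so the system is well behaved; at the scale the paper targets, the solve itself becomes the distributed computation.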
Fast and Accurate Finite Element Multigrid Solvers for PDE Simulations on GPU Clusters
, 2010
unknown title
We present the design and evaluation of a scalable tridiagonal solver targeted at GPU architectures. We observed that two distinct steps are required to solve a large tridiagonal system in parallel: 1) breaking a problem down into multiple sub-problems, each of which is independent of the others, and 2) solving the sub-problems using an efficient algorithm. We propose a hybrid method of tiled parallel cyclic reduction (tiled PCR) and a thread-level parallel Thomas algorithm (pThomas). The transition from tiled PCR to pThomas is determined by the input system size and hardware capability in order to achieve optimal performance. The proposed method is scalable, as it can cope with various input system sizes by properly adjusting the algorithm transition point. Our method on an NVIDIA GTX 480 shows up to 8.3x and 49x speedups over multithreaded and sequential MKL implementations on a 3.33 GHz Intel i7 975 in double precision, respectively.
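For reference, here is a minimal serial sketch of the Thomas algorithm that pThomas-style schemes run per thread on each independent sub-system (the sketch and its array conventions are ours, not the paper's implementation):

```python
import numpy as np

def thomas(a, b, c, d):
    """Serial Thomas algorithm: tridiagonal Gaussian elimination
    without pivoting, O(n) per system. a: sub-diagonal (a[0] ignored),
    b: main diagonal, c: super-diagonal (c[-1] ignored), d: rhs."""
    n = len(b)
    cp = np.zeros(n)
    dp = np.zeros(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.zeros(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Thomas is inherently sequential within one system, which is why a PCR stage is used first to split a large system into many small independent ones that threads can solve concurrently.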