Results 11 - 20 of 149
Parallel Graph Component Labelling with GPUs and CUDA, 2010
Abstract - Cited by 22 (3 self)
Graph component labelling, which is a subset of the general graph colouring problem, is a computationally expensive operation that is of importance in many applications and simulations. A number of data-parallel algorithmic variations to the component labelling problem are possible and we explore their use with general purpose graphical processing units (GPGPUs) and with the CUDA GPU programming language. We discuss implementation issues and performance results on GPUs using CUDA. We present results for regular mesh graphs as well as arbitrary structured and topical graphs such as small-world and scale-free structures. We show how different algorithmic variations can be used to best effect depending upon the cluster structure of the graph being labelled, and consider how features of the GPU architectures and host CPUs can be combined to best effect into a cluster component labelling algorithm for use in high performance simulations.
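A minimal sequential sketch of one such data-parallel variant, iterative minimum-label propagation, may help make the idea concrete. On a GPU, each edge (or vertex) would be handled by one thread per sweep; this is an editor's illustration, not code from the paper.

```python
def label_components(num_vertices, edges):
    """Iterative min-label propagation: vertices repeatedly adopt the
    smallest label seen across each incident edge until a fixed point.
    Each sweep over the edge list is the data-parallel step."""
    labels = list(range(num_vertices))   # start with label = vertex id
    changed = True
    while changed:
        changed = False
        for u, v in edges:               # one data-parallel sweep
            lo = min(labels[u], labels[v])
            if labels[u] != lo or labels[v] != lo:
                labels[u] = labels[v] = lo
                changed = True
    return labels
```

Vertices in the same component converge to the same (minimum) label, e.g. `label_components(5, [(0, 1), (1, 2), (3, 4)])` yields `[0, 0, 0, 3, 3]`.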
A GPU Implementation of Inclusion-based Points-to Analysis
Abstract - Cited by 21 (5 self)
Graphics Processing Units (GPUs) have emerged as powerful accelerators for many regular algorithms that operate on dense arrays and matrices. In contrast, we know relatively little about using GPUs to accelerate highly irregular algorithms that operate on pointer-based data structures such as graphs. For the most part, research has focused on GPU implementations of graph analysis algorithms that do not modify the structure of the graph, such as algorithms for breadth-first search and strongly-connected components. In this paper, we describe a high-performance GPU implementation of an important graph algorithm used in compilers such as gcc and LLVM: Andersen-style inclusion-based points-to analysis. This algorithm is challenging to parallelize effectively on GPUs because it makes extensive modifications to the structure of the underlying graph and performs relatively little computation. In spite of this, our program, when executed on a 14 Streaming Multiprocessor GPU, achieves an average speedup of 7x compared to a sequential CPU implementation and outperforms a parallel implementation of the same algorithm running on 16 CPU cores. Our implementation provides general insights into how to produce high-performance GPU implementations of graph algorithms, and it highlights key differences between optimizing parallel programs for multicore CPUs and for GPUs.
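The core of Andersen-style analysis is propagating points-to sets along copy edges of a constraint graph to a fixed point. A minimal worklist sketch (load/store constraints, which add edges dynamically and are what make the GPU version hard, are omitted; names are illustrative):

```python
def andersen(copy_edges, init_pts):
    """Propagate points-to sets: pts(dst) must include pts(src) for every
    copy edge src -> dst, iterated with a worklist until stable."""
    pts = {v: set(s) for v, s in init_pts.items()}
    succ = {}
    for src, dst in copy_edges:
        succ.setdefault(src, []).append(dst)
        pts.setdefault(src, set())
        pts.setdefault(dst, set())
    work = list(pts)
    while work:
        n = work.pop()
        for m in succ.get(n, []):
            if not pts[n] <= pts[m]:     # subset test: anything new to add?
                pts[m] |= pts[n]
                work.append(m)           # m changed, revisit its successors
    return pts
```

For example, with `a -> b -> c` and `a` pointing to `x`, the set `{x}` flows to both `b` and `c`.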
Exploiting Graphical Processing Units for Data-Parallel Scientific Applications
- Massey University, 2008
Abstract - Cited by 16 (14 self)
Graphical Processing Units (GPUs) have recently attracted attention for certain scientific simulation problems such as particle simulations. This is partially driven by low commodity pricing of GPUs but also by recent toolkit and library developments that make them more accessible to scientific programmers. We report on two further application paradigms – regular mesh field equations with unusual boundary conditions and graph analysis algorithms – that can also make use of GPU architectures to greatly accelerate certain simulations. We discuss the relevance of all these application paradigms and how they relate to simulation engines and embedded game components. GPUs were aimed primarily at the accelerated graphics market, but since this is often closely coupled to advanced game products, it is interesting to speculate about the future of fully integrated accelerator hardware for both visualisation and simulation combined. As well as reporting speedup performance on selected simulation paradigms, we discuss suitable data parallel algorithms and present some specific code examples for making good use of GPU features like large numbers of threads and localised texture memory. We review how these ideas have evolved from past data parallel systems such as vector and array processors and speculate on how future CPUs will incorporate further support for these ideas.
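The "regular mesh field equations" paradigm maps naturally to one-thread-per-cell GPU kernels. A toy sequential sketch of one Jacobi relaxation sweep on a 1-D mesh, using a periodic wrap-around as a stand-in for the unusual boundary conditions the abstract mentions (an editor's illustration, not the paper's code):

```python
def jacobi_step(field, width):
    """One data-parallel Jacobi sweep on a 1-D periodic mesh: each cell is
    replaced by the average of its two neighbours. On a GPU, each cell
    would map to one thread reading the old field and writing a new one."""
    return [0.5 * (field[(i - 1) % width] + field[(i + 1) % width])
            for i in range(width)]
```

Note the sweep reads only the old field and writes a fresh list, which is exactly the double-buffering discipline a GPU kernel would need.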
P.J.: Singular Value Decomposition on GPU using CUDA
- In: IPDPS ’09: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, 2009
Abstract - Cited by 15 (0 self)
Linear algebra algorithms are fundamental to many computing applications. Modern GPUs are suited for many general purpose processing tasks and have emerged as inexpensive high performance co-processors due to their tremendous computing power. In this paper, we present the implementation of singular value decomposition (SVD) of a dense matrix on GPU using the CUDA programming model. SVD is implemented using the twin steps of bidiagonalization followed by diagonalization. It has not been implemented on the GPU before. Bidiagonalization is implemented using a series of Householder transformations which map well to BLAS operations. Diagonalization is performed by applying the implicitly shifted QR algorithm. Our complete SVD implementation significantly outperforms the MATLAB and Intel® Math Kernel Library (MKL) LAPACK implementations on the CPU. We show a speedup of up to 60 over the MATLAB implementation and up to 8 over the Intel MKL implementation on an Intel Dual Core 2.66GHz PC with an NVIDIA GTX 280 for large matrices. We also give results for very large matrices on an NVIDIA Tesla S1070.
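The Householder transformations underlying the bidiagonalization step can be sketched compactly. Each reflection zeroes all but the first entry of a vector; applied alternately to columns and rows, such reflections reduce a dense matrix to bidiagonal form. A small pure-Python illustration (not the paper's implementation):

```python
import math

def householder(x):
    """Return v and beta such that (I - beta * v v^T) x = (±||x||, 0, ..., 0)."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    v = list(x)
    v[0] += math.copysign(norm, x[0])    # sign choice avoids cancellation
    vnorm2 = sum(vi * vi for vi in v)
    beta = 2.0 / vnorm2 if vnorm2 else 0.0
    return v, beta

def reflect(v, beta, x):
    """Apply the reflector (I - beta * v v^T) to a vector x."""
    s = beta * sum(vi * xi for vi, xi in zip(v, x))
    return [xi - s * vi for vi, xi in zip(v, x)]
```

Reflecting the vector `[3, 4]` with its own reflector produces `[-5, 0]`: the norm is preserved and the trailing entry is annihilated.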
K. Srinathan, "A performance prediction model for the CUDA GPGPU platform"
- the 16th IEEE International Conference on High Performance Computing (HiPC), 2009
Abstract - Cited by 14 (2 self)
The significant growth in computational power of modern Graphics Processing Units (GPUs), coupled with the advent of general purpose programming environments like NVIDIA's CUDA, has seen GPUs emerging as a very popular parallel computing platform. However, despite their popularity, there is no performance model of any GPGPU programming environment. The absence of such a model makes it difficult to definitively assess the suitability of the GPU for solving a particular problem and is a significant impediment to the mainstream adoption of GPUs as a massively parallel (super)computing platform. In this paper we present a performance prediction model for the CUDA GPGPU platform. This model encompasses the various facets of the GPU architecture like scheduling, memory hierarchy and pipelining, among others. We also perform experiments that demonstrate the effects of various memory access strategies. The proposed model can be used to analyze pseudo code for a CUDA kernel to obtain a performance estimate, in a way that is similar to performing asymptotic analysis. We illustrate the usage of our model and its accuracy with three case studies: matrix multiplication, list ranking, and histogram generation.
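To see the flavour of such analytic models, consider a deliberately simplified estimator: kernel time is the maximum of the compute-bound and memory-bound cost per scheduling round, times the number of rounds needed to cover all threads. This is an editor's toy, not the paper's model, and every constant below is illustrative:

```python
def predict_kernel_cycles(n_threads, compute_cycles_per_thread,
                          mem_accesses_per_thread, mem_latency=400,
                          threads_per_sm=1024, n_sms=16):
    """Toy analytic kernel-time estimate. Rounds = how many batches of
    threads the device must schedule; each round costs whichever of
    compute or memory latency dominates (assumes perfect overlap)."""
    rounds = -(-n_threads // (threads_per_sm * n_sms))   # ceiling division
    compute = compute_cycles_per_thread
    memory = mem_accesses_per_thread * mem_latency
    return rounds * max(compute, memory)
```

With 32768 threads, 100 compute cycles, and 2 global loads per thread, the memory term (800 cycles) dominates each of the two rounds, giving 1600 cycles.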
Medusa: Simplified Graph Processing on GPUs, 2013
Abstract - Cited by 13 (4 self)
Graphs are common data structures for many applications, and efficient graph processing is a must for application performance. Recently, the graphics processing unit (GPU) has been adopted to accelerate various graph processing algorithms such as BFS and shortest paths. However, it is difficult to write correct and efficient GPU programs and even more difficult for graph processing due to the irregularities of graph structures. To simplify graph processing on GPUs, we propose a programming framework called Medusa which enables developers to leverage the capabilities of GPUs by writing sequential C/C++ code. Medusa offers a small set of user-defined APIs, and embraces a runtime system to automatically execute those APIs in parallel on the GPU. We develop a series of graph-centric optimizations based on the architecture features of GPUs for efficiency. Additionally, Medusa is extended to execute on multiple GPUs within a machine. Our experiments show that (1) Medusa greatly simplifies implementation of GPGPU programs for graph processing, with many fewer lines of source code written by developers; (2) The optimization techniques significantly improve the performance of the runtime system, making its performance comparable with or better than manually tuned GPU graph operations.
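The idea of a framework that runs user-defined sequential code over every vertex in parallel can be sketched as follows. The API names here are hypothetical, not Medusa's actual interface; the point is the division of labour between user code and runtime:

```python
def run_vertex_program(vertices, edges, update, max_iters=10):
    """Hypothetical Medusa-like runtime loop: the user supplies a
    per-vertex `update(value, neighbour_values)` function; the runtime
    applies it to every vertex (in parallel on a GPU, sequentially here)
    until the values stabilise or max_iters is reached."""
    values = dict(vertices)
    neighbours = {v: [] for v in values}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    for _ in range(max_iters):
        new = {v: update(values[v], [values[n] for n in neighbours[v]])
               for v in values}
        if new == values:            # fixed point reached
            break
        values = new
    return values
```

A connected-components pass, for instance, is just the one-line user function `lambda v, ns: min([v] + ns)`.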
To GPU Synchronize or Not GPU Synchronize?
Abstract - Cited by 12 (0 self)
Abstract — The graphics processing unit (GPU) has evolved from being a fixed-function processor with programmable stages into a programmable processor with many fixed-function components that deliver massive parallelism. By modifying the GPU’s stream processor to support “general-purpose computation” on the GPU (GPGPU), applications that perform massive vector operations can realize many orders-of-magnitude improvement in performance over a traditional processor, i.e., CPU. However, the breadth of general-purpose computation that can be efficiently supported on a GPU has largely been limited to highly data-parallel or task-parallel applications due to the lack of explicit support for communication between streaming multiprocessors (SMs) on the GPU. Such communication can occur via the global memory of a GPU, but it then requires a barrier synchronization across the SMs of the GPU in order to complete the communication between SMs. Although our previous work demonstrated that implementing barrier synchronization on the GPU itself can significantly improve performance and deliver correct results in critical bioinformatics applications, guaranteeing the correctness of inter-SM communication is only possible if a memory consistency model is assumed. To address this problem, NVIDIA recently introduced the __threadfence() function in CUDA 2.2, a function that can guarantee the correctness of GPU-based inter-SM communication. However, this function currently introduces so much overhead that when using it in (direct) GPU synchronization, GPU synchronization actually performs worse than indirect synchronization via the CPU, thus raising the question of whether “to GPU synchronize or not GPU synchronize?”
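The GPU-side barrier scheme being discussed is, at its core, an atomic arrival counter that every block increments, followed by a spin until all blocks have arrived. A CPU emulation using threads (an editor's sketch; on a real GPU the increment would be `atomicAdd()` and the flag read would need `__threadfence()` to make prior global-memory writes visible across SMs):

```python
import threading

class AtomicCounterBarrier:
    """CPU emulation of an atomic-counter GPU barrier: each 'block'
    atomically bumps a shared counter on arrival, then waits until the
    counter reaches the block count."""
    def __init__(self, n_blocks):
        self.n = n_blocks
        self.count = 0
        self.lock = threading.Lock()
        self.done = threading.Event()

    def arrive_and_wait(self):
        with self.lock:                  # stands in for atomicAdd
            self.count += 1
            if self.count == self.n:
                self.done.set()
        self.done.wait()                 # stands in for the spin loop
```

No thread passes `arrive_and_wait()` until all of them have arrived, which is exactly the property inter-SM communication through global memory requires.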
Low latency complex event processing on parallel hardware
- J. Parallel Distrib. Comput., 2012
Abstract - Cited by 11 (8 self)
CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures
- In 39th International Symposium on Computer Architecture (ISCA-39), 2012
Abstract - Cited by 9 (3 self)
Wide SIMD-based GPUs have evolved into a promising platform for running general purpose workloads. Current programmable GPUs allow even code with irregular control to execute well on their SIMD pipelines. To do this, each SIMD lane is considered to execute a logical thread, where hardware ensures that control flow is accurate by automatically applying masked execution. The masked execution, however, often degrades performance because the issue slots of masked lanes are wasted. This degradation can be mitigated by dynamically compacting multiple unmasked threads into a single SIMD unit. This paper proposes a fundamentally new approach to branch compaction that avoids the unnecessary synchronization required by previous techniques and that only stalls threads that are likely to benefit from compaction. Our technique is based on the compaction-adequacy predictor (CAPRI). CAPRI dynamically identifies the compaction-effectiveness of a branch and only stalls threads that are predicted to benefit from compaction. We utilize a simple single-level branch-predictor inspired structure and show that this simple configuration attains a prediction accuracy of 99.8% and 86.6% for non-divergent and divergent workloads, respectively. Our performance evaluation demonstrates that CAPRI consistently outperforms both the baseline design that never attempts compaction and prior work that stalls upon all divergent branches.
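A single-level predictor of the kind the abstract alludes to can be sketched as a table of saturating counters indexed by branch address: stall for compaction only when the counter predicts the branch is compaction-adequate, then train on the observed outcome. The 2-bit counters and table size below are an editor's simplification, not CAPRI's exact design:

```python
class CompactionPredictor:
    """Sketch of a CAPRI-style compaction-adequacy predictor: a table of
    2-bit saturating counters indexed by branch address. predict() says
    whether to stall warps for compaction; train() updates the counter
    with whether compaction actually reduced the SIMD units needed."""
    def __init__(self, size=256):
        self.table = [1] * size          # initialise weakly 'not adequate'

    def predict(self, pc):
        return self.table[pc % len(self.table)] >= 2

    def train(self, pc, was_adequate):
        i = pc % len(self.table)
        if was_adequate:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

A branch that repeatedly fails to benefit from compaction quickly trains its counter down, so its threads are no longer stalled.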
Anytime Algorithms for GPU Architectures
Abstract - Cited by 9 (1 self)
Abstract—Most algorithms are run-to-completion and provide one answer upon completion and no answer if interrupted before completion. On the other hand, anytime algorithms have a monotonically increasing utility with the length of execution time. Our investigation focuses on the development of time-bounded anytime algorithms on Graphics Processing Units (GPUs) to trade off the quality of output with execution time. Given a time-varying workload, the algorithm continually measures its progress and the remaining contract time to decide its execution pathway and select the system resources required to maximize the quality of the result. To exploit the quality-time tradeoff, the focus is on the construction, instrumentation, on-line measurement and decision making of algorithms capable of efficiently managing GPU resources. We demonstrate this with a parallel A* routing algorithm on a CUDA-enabled GPU.
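The contract-time control loop described above, measure progress, check remaining time, decide whether to refine further, has a simple generic shape. A hedged sketch (function and parameter names are illustrative, not from the paper):

```python
import time

def anytime_refine(initial, refine, contract_seconds, step_cost_estimate=0.0):
    """Generic contract-anytime loop: keep improving the current answer
    while the remaining contract time still covers the estimated cost of
    one more refinement step. `refine` returns an improved answer, or
    None when no further improvement is possible."""
    deadline = time.monotonic() + contract_seconds
    best = initial
    while time.monotonic() + step_cost_estimate < deadline:
        improved = refine(best)
        if improved is None:             # converged before the deadline
            break
        best = improved
    return best
```

Interrupting at the deadline always leaves `best` holding the most refined answer produced so far, which is the defining anytime property.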