Analyzing CUDA Workloads Using a Detailed GPU Simulator
In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009
"... Modern Graphic Processing Units (GPUs) provide suffi-ciently flexible programming models that understanding their performance can provide insight in designing tomorrow’s manycore processors, whether those are GPUs or other-wise. The combination of multiple, multithreaded, SIMD cores makes studying t ..."
Abstract
-
Cited by 168 (8 self)
- Add to MetaCart
(Show Context)
Abstract: Modern Graphics Processing Units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight into designing tomorrow's manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded SIMD cores makes studying these GPUs useful in understanding trade-offs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA's CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices, including the choice of interconnect topology, use of caches, design of the memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that for the applications we study, performance is more sensitive to interconnect bisection bandwidth than to latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.
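The memory request coalescing this paper studies is easiest to picture in code. Below is a minimal CUDA sketch of our own, not from the paper: in copy_coalesced a warp's threads read adjacent words, which coalescing hardware can merge into a few wide transactions, while copy_strided scatters each warp's reads across memory. Kernel names and the stride parameter are illustrative.

    // Illustrative only; not the paper's benchmark code.
    #include <cuda_runtime.h>

    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];               // adjacent threads read adjacent words
    }

    __global__ void copy_strided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[((size_t)i * stride) % n];  // adjacent threads read words stride apart
    }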
vCUDA: GPU Accelerated High Performance Computing in Virtual Machines
IEEE Transactions on Computers
"... This paper describes vCUDA, a GPGPU (General Purpose Graphics Processing Unit) computing solution for virtual machines. vCUDA allows applications executing within virtual machines (VMs) to leverage hardware acceleration, which can be beneficial to the performance of a class of high performance compu ..."
Abstract
-
Cited by 30 (2 self)
- Add to MetaCart
(Show Context)
Abstract: This paper describes vCUDA, a GPGPU (general-purpose graphics processing unit) computing solution for virtual machines. vCUDA allows applications executing within virtual machines (VMs) to leverage hardware acceleration, which can be beneficial to the performance of a class of high-performance computing (HPC) applications. The key idea in our design is API call interception and redirection: with these, applications in VMs can access the graphics hardware device and achieve high-performance computing transparently. We carry out a detailed analysis of the performance and overhead of our framework. Our evaluation shows that GPU acceleration for HPC applications in VMs is feasible and competitive with applications running in a native, non-virtualized environment. Furthermore, our evaluation identifies the main cause of overhead in our current framework, and we give some suggestions for future improvement.
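As a hedged illustration of what API interception can look like (our sketch, not vCUDA's code): a shared object defines cudaMalloc with the runtime's signature and is loaded into the guest application with LD_PRELOAD, so the dynamic linker resolves calls to the shim first. forward_to_host is a hypothetical stand-in for vCUDA's marshaling layer, which in the real system redirects the call to the host.

    /* Build: gcc -shared -fPIC -o libshim.so shim.c
       Run a guest CUDA app with: LD_PRELOAD=./libshim.so ./app */
    #include <stdio.h>
    #include <stddef.h>

    typedef int cudaError_t;   /* same width as the runtime's enum */

    /* Hypothetical stand-in for vCUDA's redirection layer. */
    static cudaError_t forward_to_host(const char *api, void *arg0, size_t arg1) {
        fprintf(stderr, "forwarding %s(%p, %zu) to host\n", api, arg0, arg1);
        return 0;  /* cudaSuccess */
    }

    /* Same name and signature as the CUDA runtime's cudaMalloc, so the
       application's calls land here instead of in the real runtime. */
    cudaError_t cudaMalloc(void **devPtr, size_t size) {
        return forward_to_host("cudaMalloc", (void *)devPtr, size);
    }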
Characterizing and Evaluating a Key-value Store Application on Heterogeneous CPU-GPU Systems
In Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), 2012
"... Abstract—The recent use of graphics processing units (GPUs) in several top supercomputers demonstrate their ability to con-sistently deliver positive results in high-performance computing (HPC). GPU support for significant amounts of parallelism would seem to make them strong candidates for non-HPC ..."
Abstract
-
Cited by 23 (3 self)
- Add to MetaCart
(Show Context)
Abstract: The recent use of graphics processing units (GPUs) in several top supercomputers demonstrates their ability to consistently deliver positive results in high-performance computing (HPC). GPU support for significant amounts of parallelism would seem to make them strong candidates for non-HPC applications as well. Server workloads are inherently parallel; however, at first glance they may not seem suitable to run on GPUs due to their irregular control flow and memory access patterns. In this work, we evaluate the performance of a widely used key-value store middleware application, Memcached, on recent integrated and discrete CPU+GPU heterogeneous hardware and characterize the resulting performance. To gain greater insight, we also evaluate Memcached's performance on a GPU simulator. This work explores the challenges in porting Memcached to OpenCL and provides a detailed analysis of Memcached's behavior on a GPU to better explain the performance results observed on physical hardware. On the integrated CPU+GPU systems, we observe up to a 7.5X performance increase compared to the CPU when executing the key-value look-up handler on the GPU.
Index Terms: GPGPU, SIMD, OpenCL, key-value store, server
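For a sense of the look-up handler's shape, here is a CUDA sketch of our own (the paper's port is in OpenCL, and its table layout surely differs): one thread services one GET request against a linear-probed hash table, and the probe loop is exactly the divergent control flow the abstract mentions.

    #include <cuda_runtime.h>
    #include <stdint.h>

    struct Entry { uint32_t key; uint32_t value; };  // fixed-size keys; key 0 reserved as "empty"

    __device__ uint32_t hash32(uint32_t k) {         // simple mixer standing in for Memcached's hash
        k ^= k >> 16; k *= 0x45d9f3bu; k ^= k >> 16;
        return k;
    }

    __global__ void batch_get(const Entry *table, uint32_t capacity,
                              const uint32_t *keys, uint32_t *out, int nreq) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nreq) return;
        uint32_t k = keys[i], h = hash32(k);
        for (uint32_t p = 0; p < capacity; ++p) {    // linear probing; divergent across a warp
            Entry e = table[(h + p) % capacity];
            if (e.key == k) { out[i] = e.value; return; }
            if (e.key == 0) break;                   // empty slot reached: key absent
        }
        out[i] = 0xFFFFFFFFu;                        // miss sentinel
    }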
Shredder: GPU-Accelerated Incremental Storage and Computation
"... Redundancy elimination using data deduplication and incremental data processing has emerged as an important technique to minimize storage and computation requirements in data center computing. In this paper, we present the design, implementation and evaluation of Shredder, a high performance content ..."
Abstract
-
Cited by 9 (2 self)
- Add to MetaCart
(Show Context)
Abstract: Redundancy elimination using data deduplication and incremental data processing has emerged as an important technique to minimize storage and computation requirements in data center computing. In this paper, we present the design, implementation, and evaluation of Shredder, a high-performance content-based chunking framework for supporting incremental storage and computation systems. Shredder exploits the massively parallel processing power of GPUs to overcome the CPU bottlenecks of content-based chunking in a cost-effective manner. Unlike previous uses of GPUs, which have focused on applications where computation costs are dominant, Shredder is designed to operate in both compute- and data-intensive environments. To allow this, Shredder provides several novel optimizations aimed at reducing the cost of transferring data between the host (CPU) and the GPU, fully utilizing the multicore architecture at the host, and reducing GPU memory access latencies. With our optimizations, Shredder achieves a speedup of over 5X for chunking bandwidth compared to our optimized parallel implementation without a GPU on the same host system. Furthermore, we present two real-world applications of Shredder: an extension to HDFS, which serves as a basis for incremental MapReduce computations, and an incremental cloud backup system. In both contexts, Shredder detects redundancies in the input data across successive runs, leading to significant savings in storage, computation, and end-to-end completion times.
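The kernel at the heart of a content-based chunker can be sketched briefly (our sketch, not Shredder's; its real fingerprinting and data-staging pipeline are far more involved): each thread hashes the window starting at its own byte offset and flags a chunk boundary when the fingerprint's low bits hit a fixed pattern, so boundaries depend on content rather than position and survive insertions elsewhere in the stream.

    #include <cuda_runtime.h>
    #include <stdint.h>

    #define WINDOW 16        // bytes hashed at each position
    #define MASK   0x1FFFu   // boundary when low 13 bits match: ~8 KiB average chunks

    __global__ void mark_boundaries(const uint8_t *data, size_t n, uint8_t *boundary) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i + WINDOW > n) return;
        uint32_t h = 0;
        for (int j = 0; j < WINDOW; ++j)
            h = h * 31u + data[i + j];       // naive window hash; a Rabin fingerprint would roll
        boundary[i] = ((h & MASK) == MASK);  // content-defined: same bytes give same boundaries
    }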
Operating Systems Challenges for GPU Resource Management
In Proceedings of the International Workshop on Operating Systems Platforms for Embedded Real-Time Applications, 2011
"... The graphics processing unit (GPU) is becoming a very powerful platform to accelerate graphics and data-parallel compute-intensive applications. It significantly outperforms traditional multi-core processors in performance and energy efficiency. Its application domains also range widely from embedde ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
(Show Context)
Abstract: The graphics processing unit (GPU) is becoming a very powerful platform for accelerating graphics and data-parallel compute-intensive applications. It significantly outperforms traditional multi-core processors in performance and energy efficiency, and its application domains range widely from embedded systems to high-performance computing systems. However, operating-system support is not yet adequate: models, designs, and implementations of GPU resource management for multi-tasking environments are lacking. This paper identifies a GPU resource management model to provide a basis for operating-systems research using GPU technology. In particular, we present design concepts for GPU resource management. A list of operating-systems challenges is also provided to highlight future directions of this research domain, including specific ideas for GPU scheduling in real-time systems. Our preliminary evaluation demonstrates that the performance of open-source software is competitive with that of proprietary software, and hence operating-systems research can start investigating GPU resource management.
Visualizing Complex Dynamics in Many-Core Accelerator Architectures
In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
"... Abstract—While many-core accelerator architectures, such as today’s Graphics Processing Units (GPUs), offer orders of magnitude more raw computing power than contemporary CPUs, their massive parallelism often produces complex dynamic behaviors even with the simplest applications. Using a fixed set o ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
(Show Context)
Abstract: While many-core accelerator architectures, such as today's Graphics Processing Units (GPUs), offer orders of magnitude more raw computing power than contemporary CPUs, their massive parallelism often produces complex dynamic behaviors even with the simplest applications. Using a fixed set of hardware or simulator performance counters to quantify behavior over a large interval of time, such as an entire application run or program phase, may not capture this behavior. Software and hardware designers may consequently miss opportunities to optimize for better performance. Similarly, significant effort may be expended to find metrics that explain anomalous behavior in architecture design studies. Moreover, the increasing complexity of applications developed for today's GPUs has created additional difficulties for software developers attempting to identify the bottlenecks of an application for optimization. This paper presents a novel GPU performance visualization tool, AerialVision, to address these two problems. It interfaces with the GPGPU-Sim simulator to capture and visualize the dynamic behavior of a GPU architecture throughout an application run. Similar to existing performance analysis tools for CPUs, it can annotate individual lines of source code with performance statistics to simplify the bottleneck identification process. To provide further insight, AerialVision introduces a novel methodology for relating pathological dynamic architectural behaviors that cause performance loss to the part of the source code responsible for them. By rapidly providing insight into complex dynamic behavior, AerialVision enables research on improving many-core accelerator architectures and will help ensure applications written for these architectures reach their full performance potential.
Supporting low-latency CPS using GPUs and direct I/O schemes
In 18th RTCSA, 2012
"... Abstract—Graphics processing units (GPUs) are increasingly being used for general purpose parallel computing. They provide significant performance gains over multi-core CPU systems, and are an easily accessible alternative to super-computers. The architecture of general purpose GPU systems (GPGPU), ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
Abstract: Graphics processing units (GPUs) are increasingly being used for general-purpose parallel computing. They provide significant performance gains over multi-core CPU systems and are an easily accessible alternative to supercomputers. The architecture of general-purpose GPU (GPGPU) systems, however, poses challenges in efficiently transferring data between the host and device(s). Although commodity many-core devices such as NVIDIA GPUs provide more than one way to move data around, it is unclear which method is most effective for a given application. This presents difficulty in supporting latency-sensitive cyber-physical systems (CPS). In this work we present a new approach to data transfer in a heterogeneous computing system that allows direct communication between GPUs and other I/O devices. In addition to adding this functionality, our system also improves communication between the GPU and the host. We analyze the current vendor-provided data communication mechanisms and identify which methods work best for particular tasks with respect to throughput and total time to completion. Our method allows a new class of real-time cyber-physical applications to be implemented on a GPGPU system. The results of the experiments presented here show that GPU tasks can be completed in 34 percent less time than with current methods. Furthermore, effective data throughput is at least as good as that of the current best performers. This work is part of the concurrent development of Gdev [6], an open-source project to provide Linux operating-system support for many-core device resource management.
Keywords: GPGPU; real-time systems; GPU communication
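One of the vendor-provided mechanisms such an analysis weighs is easy to reproduce in miniature. The following CUDA sketch (ours, not the paper's harness) times a host-to-device copy from an ordinary pageable buffer against one from a page-locked buffer allocated with cudaHostAlloc, which the GPU's DMA engine can read directly; on most systems the pinned copy is noticeably faster.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Time one host-to-device copy of the given buffer with CUDA events.
    static float time_h2d(void *host, void *dev, size_t bytes) {
        cudaEvent_t t0, t1; float ms;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventRecord(t0);
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(t1); cudaEventSynchronize(t1);
        cudaEventElapsedTime(&ms, t0, t1);
        cudaEventDestroy(t0); cudaEventDestroy(t1);
        return ms;
    }

    int main(void) {
        size_t bytes = 64u << 20;                             // 64 MiB test buffer
        void *pageable = malloc(bytes), *pinned = NULL, *dev = NULL;
        cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);  // page-locked, DMA-able
        cudaMalloc(&dev, bytes);
        printf("pageable: %.2f ms\n", time_h2d(pageable, dev, bytes));
        printf("pinned:   %.2f ms\n", time_h2d(pinned, dev, bytes));
        cudaFree(dev); cudaFreeHost(pinned); free(pageable);
        return 0;
    }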
Implementation of multiple-precision modular multiplication on GPU
2009
"... Multiple-precision modular multiplications are the key components in security applications, like public-key cryptography for encrypting and signing digital data. But unfortunately they are computationally expensive for contemporary CPUs. By exploiting the computing power of the many-core GPUs, we im ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Abstract: Multiple-precision modular multiplications are key components of security applications such as public-key cryptography for encrypting and signing digital data, but unfortunately they are computationally expensive for contemporary CPUs. By exploiting the computing power of many-core GPUs, we implemented a multiple-precision integer library with CUDA. In this paper, we investigate the implementation of two approaches to multiple-precision modular multiplication on the GPU. We analyze the generated instructions of multiple-precision modular multiplication on the GPU in detail to find the performance issues, and then propose using inline assembly to improve the implementation of this function. Our experimental results show that the performance of multiple-precision modular multiplication improves by 20%.
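To make the inline-assembly point concrete, here is a small CUDA sketch of our own, not the paper's library: a four-limb addition written with PTX add.cc/addc.cc so the hardware carry flag chains across limbs, avoiding the compare-and-fix-up sequences a compiler may otherwise emit for multi-word carries. A full multiple-precision multiplication chains mad.lo.cc/madc.hi.cc in the same style.

    #include <cuda_runtime.h>
    #include <stdint.h>

    // 128-bit addition, r = a + b, with carries propagated in hardware.
    __device__ void add128(const uint32_t a[4], const uint32_t b[4], uint32_t r[4]) {
        asm("add.cc.u32  %0, %4, %8;\n\t"    // lowest limb sets the carry flag
            "addc.cc.u32 %1, %5, %9;\n\t"    // middle limbs consume and set it
            "addc.cc.u32 %2, %6, %10;\n\t"
            "addc.u32    %3, %7, %11;"       // top limb consumes the final carry
            : "=r"(r[0]), "=r"(r[1]), "=r"(r[2]), "=r"(r[3])
            : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
              "r"(b[0]), "r"(b[1]), "r"(b[2]), "r"(b[3]));
    }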
On the Use of GPUs in Realizing Cost-Effective Distributed RAID
"... Abstract—The exponential growth in user and application data entails new means for providing fault tolerance and protection against data loss. High Performance Computing (HPC) storage systems, which are at the forefront of handling the data deluge, typically employ hardware RAID at the backend. Howe ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Abstract: The exponential growth in user and application data entails new means for providing fault tolerance and protection against data loss. High-performance computing (HPC) storage systems, which are at the forefront of handling the data deluge, typically employ hardware RAID at the backend. However, such solutions are costly, do not ensure end-to-end data integrity, and can become a bottleneck during data reconstruction. In this paper, we design an innovative, flexible, fault-tolerant, and high-performance RAID-6 solution for a parallel file system (PFS). Our system utilizes low-cost, strategically placed GPUs, on both the client and server sides, to accelerate parity computation. In contrast to hardware-based approaches, we provide full control over the size, length, and location of a RAID array on a per-file basis, end-to-end data integrity checking, and parallelization of RAID array reconstruction. We have deployed our system in conjunction with the widely used Lustre PFS, and show that our approach is feasible and imposes acceptable overhead.
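The parity computation being offloaded can be sketched in a few lines of CUDA (our sketch, not the authors' kernel, and their system around it does much more): RAID-6 keeps two syndromes per stripe, P as a plain XOR across the data disks and Q as a Reed-Solomon sum over GF(2^8), here accumulated in Horner order with the conventional 0x1D reduction polynomial.

    #include <cuda_runtime.h>
    #include <stdint.h>

    // Multiply a GF(2^8) element by the generator 2 (reduction poly 0x1D).
    __device__ uint8_t gf2_mul2(uint8_t x) {
        return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1D : 0));
    }

    // One thread per byte offset in the stripe; data holds ndisks blocks
    // of `stripe` bytes laid out back to back.
    __global__ void raid6_pq(const uint8_t *data, int ndisks, size_t stripe,
                             uint8_t *P, uint8_t *Q) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i >= stripe) return;
        uint8_t p = 0, q = 0;
        for (int d = ndisks - 1; d >= 0; --d) {  // Horner order: q = 2*q ^ D_d
            uint8_t v = data[(size_t)d * stripe + i];
            p ^= v;                              // P syndrome: plain XOR
            q = gf2_mul2(q) ^ v;                 // Q syndrome: sum of 2^d * D_d
        }
        P[i] = p; Q[i] = q;
    }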