Results 1 - 10
of
13
Parallel Graph Processing on Graphics Processors Made Easy
"... This paper demonstrates Medusa, a programming framework for parallel graph processing on graphics processors (GPUs). Medusa enables developers to leverage the massive parallelism and other hardware features of GPUs by writing sequential C/C++ code for a small set of APIs. This simplifies the impleme ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
This paper demonstrates Medusa, a programming framework for parallel graph processing on graphics processors (GPUs). Medusa enables developers to leverage the massive parallelism and other hardware features of GPUs by writing sequential C/C++ code for a small set of APIs. This simplifies the implementation of parallel graph processing on the GPU. The runtime system of Medusa automatically executes the user-defined APIs in parallel on the GPU, with a series of graph-centric optimizations based on the architecture features of GPUs. We will demonstrate the steps of developing GPU-based graph processing algorithms with Medusa, and the superior performance of Medusa with both real-world and synthetic datasets. 1.
GPU-accelerated Database Systems: Survey and Open Challenges?
"... Abstract. The vast amount of processing power and memory band-width provided by modern graphics cards make them an interesting platform for data-intensive applications. Unsurprisingly, the database research community identified GPUs as effective co-processors for data processing several years ago. I ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Abstract. The vast amount of processing power and memory band-width provided by modern graphics cards make them an interesting platform for data-intensive applications. Unsurprisingly, the database research community identified GPUs as effective co-processors for data processing several years ago. In the past years, there were many ap-proaches to make use of GPUs at different levels of a database system. In this paper, we explore the design space of GPU-accelerated database management systems. Based on this survey, we present key properties, important trade-offs and typical challenges of GPU-aware database ar-chitectures, and identify major open challenges. Additionally, we sur-vey existing GPU-accelerated DBMSs and classify their architectural properties. Then, we summarize typical optimizations implemented in GPU-accelerated DBMSs. Finally, we propose a reference architecture, indicating how GPU acceleration can be integrated in existing DBMSs. 1
Out-of-core GPU Memory Management for MapReduce-based Large-scale Graph Processing
"... GPUs can accelerate edge scan performance of graph processing applications; however, the capacity of device memory on GPUs limits the size of graph to process. Efficient techniques to handle GPU memory overflows, including overflow detection and performance analysis in large-scale systems, are not ..."
Abstract
- Add to MetaCart
GPUs can accelerate edge scan performance of graph processing applications; however, the capacity of device memory on GPUs limits the size of graph to process. Efficient techniques to handle GPU memory overflows, including overflow detection and performance analysis in large-scale systems, are not well investigated. To address the problem, we propose a MapReduce-based out-of-core GPU memory management technique for processing large-scale graph applications on heterogeneous GPU-based supercomputers. Our proposed technique automatically handles memory overflows from GPUs by dynamically dividing graph data into multiple chunks and overlaps CPU-GPU data transfer and computation on GPUs as much as possible. Our experimental results on TSUBAME2.5 using 1024 nodes (12288 CPU cores, 3072 GPUs) exhibit that our GPU-based implementation performs 2.10x faster than running on CPU when graph data size does not fit on GPUs. We also study the performance characteristics of our proposed out-of-core GPU memory management technique, including application’s performance and power efficiency of scale-up and scale-out approaches in terms of the number of GPUs.
GRapid: a Compilation and Runtime Framework for Rapid Prototyping of Graph Applications on Many-core Processors
"... Many applications use graphs to represent and analyze data, but the effective deployment of graph algorithms on many-core processors is still a challenge. Many-core devices offer higher peak performance than multi-core devices; however, many-core programming is still a specialized skill. While comp ..."
Abstract
- Add to MetaCart
Many applications use graphs to represent and analyze data, but the effective deployment of graph algorithms on many-core processors is still a challenge. Many-core devices offer higher peak performance than multi-core devices; however, many-core programming is still a specialized skill. While compilation and runtime frameworks for parallelizing graph applications on multi-core CPUs exist, there is still a need for comparable frameworks for many-core devices. We propose GRapid: a compilation and runtime framework that generates efficient parallel implementations of generic graph applications for multi-core CPUs, NVIDIA GPUs and Intel Xeon Phi. Applications are expressed using a platform-agnostic programming API. Our source-to-source compiler performs platform-specific code transformations and optimizations, handles data transfers between the host and the coprocessor, and generates several functionally-equivalent code variants for the target platform. Our runtime library provides an efficient dynamic memory management scheme for applications where the graph topology is modified during processing. We used the proposed framework to rapidly generate efficient implementations of four graph applications on different target processors. Such rapid prototyping and re-targeting capability has obvious programmability advantages, but it can also be used to quickly identify target processors that better match the application’s computational needs.
High-performance Graph Analytics on Manycore Processors
"... Abstract-The divergence in the computer architecture landscape has resulted in different architectures being considered mainstream at the same time. For application and algorithm developers, a dilemma arises when one must focus on using underlying architectural features to extract the best performa ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract-The divergence in the computer architecture landscape has resulted in different architectures being considered mainstream at the same time. For application and algorithm developers, a dilemma arises when one must focus on using underlying architectural features to extract the best performance on each of these architectures, while writing portable code at the same time. We focus on this problem with graph analytics as our target application domain. In this paper, we present an abstraction-based methodology for performanceportable graph algorithm design on manycore architectures. We demonstrate our approach by systematically optimizing algorithms for the problems of breadth-first search, color propagation, and strongly connected components. We use Kokkos, a manycore library and programming model, for prototyping our algorithms. Our portable implementation of the strongly connected components algorithm on the NVIDIA Tesla K40M is up to 3.25× faster than a state-of-the-art parallel CPU implementation on a dual-socket Sandy Bridge compute node.
Fast Subgraph Matching on Large Graphs using Graphics Processors
"... Abstract. Subgraph matching is the task of finding all matches of a query graph in a large data graph, which is known as an NP-complete problem. Many algorithms are proposed to solve this problem using CPUs. In recent years, Graphics Processing Units (GPUs) have been adopted to accelerate fundamenta ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract. Subgraph matching is the task of finding all matches of a query graph in a large data graph, which is known as an NP-complete problem. Many algorithms are proposed to solve this problem using CPUs. In recent years, Graphics Processing Units (GPUs) have been adopted to accelerate fundamental graph operations such as breadth-first search and shortest path, owing to their parallelism and high data throughput. The existing subgraph matching algorithms, however, face challenges in mapping backtracking problems to the GPU architectures. Moreover, the previous GPU-based graph algorithms are not designed to handle intermediate and final outputs. In this paper, we present a simple and GPU-friendly method for subgraph matching, called GpSM, which is designed for massively parallel architectures. We show that GpSM out-performs the state-of-the-art algorithms and efficiently answers subgraph queries on large graphs. 1
Scalable SIMD-Efficient Graph Processing on GPUs
"... Abstract—The vast computing power of GPUs makes them an attractive platform for accelerating large scale data parallel computations such as popular graph processing applications. However, the inherent irregularity and large sizes of real-world power law graphs makes effective use of GPUs a major cha ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract—The vast computing power of GPUs makes them an attractive platform for accelerating large scale data parallel computations such as popular graph processing applications. However, the inherent irregularity and large sizes of real-world power law graphs makes effective use of GPUs a major challenge. In this paper we develop techniques that greatly enhance the performance and scalability of vertex-centric graph processing on GPUs. First, we present Warp Segmentation, a novel method that greatly enhances GPU device utilization by dynamically assigning appropriate number of SIMD threads to process a vertex with irregular-sized neighbors while employing compact CSR representation to maximize the graph size that can be kept inside the GPU global memory. Prior works can either maximize graph sizes (VWC [11] uses the CSR representation) or device utilization (e.g., CuSha [13] uses the CW representation; however, CW is roughly 2.5x the size of CSR). Second, we further scale graph processing to make use of multiple GPUs while proposing Vertex Refinement to address the challenge of judiciously using the limited bandwidth available for transferring data between GPUs via the PCIe bus. Vertex refinement employs parallel binary prefix sum to dynamically collect only the updated boundary vertices inside GPUs ’ outbox buffers for dramatically reducing inter-GPU data transfer volume. Whereas existing multi-GPU techniques (Medusa [31], TOTEM [7]) perform high degree of wasteful vertex transfers. On a single GPU, our framework delivers average speedups of 1.29x to 2.80x over VWC. When scaled to multiple GPUs, our framework achieves up to 2.71x performance improvement compared to inter-GPU vertex communication schemes used by other multi-
X. SHI ET AL. 1 Frog: Asynchronous Graph Processing on GPU with Hybrid Coloring Model
"... Abstract—GPUs have been increasingly used to accelerate graph processing for complicated computational problems regarding graph theory. Many parallel graph algorithms adopt the asynchronous computing model to accelerate the iterative convergence. Unfortunately, the consistent asynchronous computing ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract—GPUs have been increasingly used to accelerate graph processing for complicated computational problems regarding graph theory. Many parallel graph algorithms adopt the asynchronous computing model to accelerate the iterative convergence. Unfortunately, the consistent asynchronous computing requires locking or atomic operations, leading to significant penalties/overheads when implemented on GPUs. To this end, coloring algorithm is adopted to separate the vertices with potential updating conflicts, guaranteeing the consistency/correctness of the parallel processing. Common coloring algorithms, however, may suffer from low parallelism because of a large number of colors generally required for processing a large-scale graph with billions of vertices. We propose a light-weight asynchronous processing framework called Frog with a hybrid coloring model. The fundamental idea is based on Pareto principle (or 80-20 rule) about coloring algorithms as we observed through masses of real graph coloring cases. We find that majority of vertices (about 80%) are colored with only a few colors, such that they can be read and updated in a very high degree of parallelism without violating the sequential consistency. Accordingly, our solution will separate the processing of the vertices based on the distribution of colors. In this work, we mainly answer the four questions: (1) how to partition the vertices in a sparse graph with maximized parallelism, (2) how to process large-scale graphs which are out of GPU memory, (3) how to reduce the overhead of data transfers on PCIe while processing each partition, and (4) how to customize the data structure that is particularly suitable for GPU-based graph processing. Experiments based on real-world data show that our asynchronous GPU graph processing engine
A Fast, Energy-Efficient Abstraction for Simultaneous Breadth-First Searches
"... Abstract—Optimized GPU kernels are sufficiently complicated to write that they often are specialized to input data, target architectures, or applications. This paper presents a multi-search abstraction for computing multiple breadth-first searches in par-allel and demonstrates a high-performance, ge ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract—Optimized GPU kernels are sufficiently complicated to write that they often are specialized to input data, target architectures, or applications. This paper presents a multi-search abstraction for computing multiple breadth-first searches in par-allel and demonstrates a high-performance, general implementa-tion. Our abstraction removes the burden of orchestrating graph traversal from the user while providing high performance and low energy usage, an often overlooked component of algorithm design. Energy consumption has become a first-class hardware design constraint for both massive and embedded computing platforms. Our abstraction can be applied to such problems as the all-pairs shortest-path problem, community detection, reachability querying, and others. To map graph traversal efficiently to the GPU, our hybrid implementation chooses between processing active vertices with a single thread or an entire warp based on vertex outdegree. For a set of twelve varied graphs, the implementation of our abstraction saves 42 % time and 62% energy on average compared to representative implementations of specific applications from existing literature. I.
1Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling
"... Abstract—Graphics processors, or GPUs, have recently been widely used as accelerators in shared environments such as clusters and clouds. In such shared environments, many kernels are submitted to GPUs from different users, and throughput is an important metric for performance and total ownership co ..."
Abstract
- Add to MetaCart
Abstract—Graphics processors, or GPUs, have recently been widely used as accelerators in shared environments such as clusters and clouds. In such shared environments, many kernels are submitted to GPUs from different users, and throughput is an important metric for performance and total ownership cost. Despite recently improved runtime support for concurrent GPU kernel executions, the GPU can be severely underutilized, resulting in suboptimal throughput. In this paper, we propose Kernelet, a runtime system to improve the throughput of concurrent kernel executions on the GPU. Kernelet embraces transparent memory management and PCI-e data transfer techniques, and dynamic slicing and scheduling techniques for kernel executions. With slicing, Kernelet divides a GPU kernel into multiple sub-kernels (namely slices). Each slice has tunable occupancy to allow co-scheduling with other slices for high GPU utilization. We develop a novel Markov chain-based performance model to guide the scheduling decision. Our experimental results demonstrate up to 31 % and 23 % performance improvement on NVIDIA Tesla C2050 and GTX680 GPUs, respectively.