Results 1 
6 of
6
CRONO: A Benchmark Suite for Multithreaded Graph Algorithms Executing on Futuristic Multicores
 in IEEE International Symposium on Workload Characterization, (IISWC
, 2015
"... Abstract—Algorithms operating on a graph setting are known to be highly irregular and unstructured. This leads to workload imbalance and data locality challenge when these algorithms are parallelized and executed on the evolving multicore processors. Previous parallel benchmark suites for shared me ..."
Abstract

Cited by 1 (0 self)
 Add to MetaCart
(Show Context)
Abstract—Algorithms operating on a graph setting are known to be highly irregular and unstructured. This leads to workload imbalance and data locality challenge when these algorithms are parallelized and executed on the evolving multicore processors. Previous parallel benchmark suites for shared memory multicores have focused on various workload domains, such as scientific, graphics, vision, financial and media processing. However, these suites lack graph applications that must be evaluated in the context of architectural design space exploration for futuristic multicores. This paper presents CRONO, a benchmark suite composed of multithreaded graph algorithms for shared memory multicore processors. We analyze and characterize these benchmarks using a multicore simulator, as well as a real multicore machine setup. CRONO uses both synthetic and real world graphs. Our characterization shows that graph benchmarks are diverse and challenging in the context of scaling efficiency. They exhibit low locality due to unstructured memory access patterns, and incur finegrain communication between threads. Energy overheads also occur due to nondeterministic memory and synchronization patterns on network connections. Our characterization reveals that these challenges remain in stateoftheart graph algorithms, and in this context CRONO can be used to identify, analyze and develop novel architectural methods to mitigate their efficiency bottlenecks in futuristic multicore processors. I.
Parallel Methods for Verifying the Consistency of WeaklyOrdered Architectures
 in Proceedings of the 24th ACM/IEEE International Conference on Parallel Architectures and Compilation Techniques (PACT
, 2015
"... Abstract—Contemporary microprocessors use relaxed memory consistency models to allow for aggressive optimizations in hardware. This enhancement in performance comes at the cost of design complexity and verification effort. In particular, verifying an execution of a program against its system’s memo ..."
Abstract

Cited by 1 (1 self)
 Add to MetaCart
(Show Context)
Abstract—Contemporary microprocessors use relaxed memory consistency models to allow for aggressive optimizations in hardware. This enhancement in performance comes at the cost of design complexity and verification effort. In particular, verifying an execution of a program against its system’s memory consistency model is an NPcomplete problem. Several graphbased approximations to this problem based on carefully constructed randomized test programs have been proposed in the literature; however, such approaches are sequential and execute slowly on large graphs of interest. Unfortunately, the ability to execute larger tests is tremendously important, since such tests enable one to expose bugs more quickly. Successfully executing more tests per unit time is also desirable, since it allows for one to check for a greater variety of errors in the memory subsystem by utilizing a more diverse set of tests. This paper improves upon existing work by introducing an algorithm that not only reduces the time complexity of the verification process, but also facilitates the development of parallel algorithms for solving these problems. We first show performance improvements from a sequential approach and gain further performance from parallel implementations in OpenMP and CUDA. For large tests of interest, our GPU implementation achieves an average application speedup of 26.36x over existing techniques in use at NVIDIA.
A Fast, EnergyEfficient Abstraction for Simultaneous BreadthFirst Searches
"... Abstract—Optimized GPU kernels are sufficiently complicated to write that they often are specialized to input data, target architectures, or applications. This paper presents a multisearch abstraction for computing multiple breadthfirst searches in parallel and demonstrates a highperformance, ge ..."
Abstract
 Add to MetaCart
(Show Context)
Abstract—Optimized GPU kernels are sufficiently complicated to write that they often are specialized to input data, target architectures, or applications. This paper presents a multisearch abstraction for computing multiple breadthfirst searches in parallel and demonstrates a highperformance, general implementation. Our abstraction removes the burden of orchestrating graph traversal from the user while providing high performance and low energy usage, an often overlooked component of algorithm design. Energy consumption has become a firstclass hardware design constraint for both massive and embedded computing platforms. Our abstraction can be applied to such problems as the allpairs shortestpath problem, community detection, reachability querying, and others. To map graph traversal efficiently to the GPU, our hybrid implementation chooses between processing active vertices with a single thread or an entire warp based on vertex outdegree. For a set of twelve varied graphs, the implementation of our abstraction saves 42 % time and 62% energy on average compared to representative implementations of specific applications from existing literature. I.
Scalable SIMDEfficient Graph Processing on GPUs
"... Abstract—The vast computing power of GPUs makes them an attractive platform for accelerating large scale data parallel computations such as popular graph processing applications. However, the inherent irregularity and large sizes of realworld power law graphs makes effective use of GPUs a major cha ..."
Abstract
 Add to MetaCart
(Show Context)
Abstract—The vast computing power of GPUs makes them an attractive platform for accelerating large scale data parallel computations such as popular graph processing applications. However, the inherent irregularity and large sizes of realworld power law graphs makes effective use of GPUs a major challenge. In this paper we develop techniques that greatly enhance the performance and scalability of vertexcentric graph processing on GPUs. First, we present Warp Segmentation, a novel method that greatly enhances GPU device utilization by dynamically assigning appropriate number of SIMD threads to process a vertex with irregularsized neighbors while employing compact CSR representation to maximize the graph size that can be kept inside the GPU global memory. Prior works can either maximize graph sizes (VWC [11] uses the CSR representation) or device utilization (e.g., CuSha [13] uses the CW representation; however, CW is roughly 2.5x the size of CSR). Second, we further scale graph processing to make use of multiple GPUs while proposing Vertex Refinement to address the challenge of judiciously using the limited bandwidth available for transferring data between GPUs via the PCIe bus. Vertex refinement employs parallel binary prefix sum to dynamically collect only the updated boundary vertices inside GPUs ’ outbox buffers for dramatically reducing interGPU data transfer volume. Whereas existing multiGPU techniques (Medusa [31], TOTEM [7]) perform high degree of wasteful vertex transfers. On a single GPU, our framework delivers average speedups of 1.29x to 2.80x over VWC. When scaled to multiple GPUs, our framework achieves up to 2.71x performance improvement compared to interGPU vertex communication schemes used by other multi
Fast Execution of Simultaneous BreadthFirst Searches on Sparse Graphs
"... Abstract—The construction of efficient parallel graph algorithms is important for quickly solving problems in areas such as urban planning, social network analysis, and hardware verification. Existing GPU implementations of graph algorithms tend to be monolithic and thus contributions from the lite ..."
Abstract
 Add to MetaCart
(Show Context)
Abstract—The construction of efficient parallel graph algorithms is important for quickly solving problems in areas such as urban planning, social network analysis, and hardware verification. Existing GPU implementations of graph algorithms tend to be monolithic and thus contributions from the literature are typically rebuilt rather than reused. Recent work has focused on traversalbased abstractions that efficiently execute a single breadthfirst search or enact algorithms in the “think like a vertex ” paradigm. However, graph analytics such as the allpairs shortest paths problem, diameter computations, betweenness centrality, and reachability querying require the execution of many such graph traversals. Typically, these traversals are independent of one another and can thus be executed in parallel. This paper presents multisearch, a simple abstraction that is designed for graph algorithms requiring many breadthfirst searches that can be executed simultaneously. Although algorithms have implicitly leveraged this abstraction in the past, we provide an explicit, reusable implementation that efficiently maps this abstraction to the GPU, performing more than twice as fast as previous approaches on large graphs of varying diameter. This approach allows us to scale our APSP implementation to graphs with millions of vertices using a single GPU whereas prior approaches were either constrained to much smaller graph instances or required large supercomputers to process graphs of similar size. To show the flexibility of our abstraction, we use it to express betweenness centrality and achieve more than a 5.82x average speedup over parallel CPU implementations from existing frameworks and a 2.24x average speedup over a manual, highly optimized GPU implementation of the algorithm. I.