Results 1 - 7 of 7
AM++: A generalized active message framework
2010
Abstract
Cited by 14 (4 self)
Active messages have proven to be an effective approach for certain communication problems in high performance computing. Many MPI implementations, as well as runtimes for Partitioned Global Address Space languages, use active messages in their low-level transport layers. However, most active message frameworks have low-level programming interfaces that require significant programming effort to use directly in applications and that also prevent optimization opportunities. In this paper we present AM++, a new user-level library for active messages based on generic programming techniques. Our library allows message handlers to be run in an explicit loop that can be optimized and vectorized by the compiler and that can also be executed in parallel on multicore architectures. Runtime optimizations, such as message combining and filtering, are also provided by the library, removing the need to implement that functionality at the application level. Evaluation of AM++ with distributed-memory graph algorithms shows the usability benefits provided by these library features, as well as their performance advantages.
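The message-combining optimization described in this abstract can be illustrated with a small sketch. Python is used here for brevity, and the names `combine` and `run_handlers` are hypothetical, not AM++'s actual C++ interface: the idea is only that messages bound for the same target are coalesced by the runtime before the explicit handler loop runs.

```python
from collections import defaultdict

def combine(messages):
    """Coalesce messages aimed at the same target before delivery,
    in the spirit of AM++'s message-combining optimization."""
    combined = defaultdict(int)
    for target, value in messages:
        combined[target] += value        # e.g. sum contributions per vertex
    return sorted(combined.items())

def run_handlers(messages, handler):
    """Deliver messages in an explicit loop; in AM++ it is a loop like
    this that the compiler can optimize, vectorize, and run on multiple
    cores, rather than a per-message callback."""
    for target, value in combine(messages):
        handler(target, value)

results = {}
run_handlers([(0, 1), (1, 2), (0, 3)],
             lambda target, value: results.update({target: value}))
```

Because delivery is a plain loop over a coalesced buffer rather than an opaque callback per message, the combining and filtering logic lives in the runtime instead of in each application, which is the usability point the abstract makes.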
A space-efficient parallel algorithm for computing betweenness centrality in distributed memory
In Proc. Int’l Conf. on High Performance Computing (HiPC), 2010
Abstract
Cited by 12 (0 self)
Abstract—Betweenness centrality is a measure based on shortest paths that attempts to quantify the relative importance of nodes in a network. As computation of betweenness centrality becomes increasingly important in areas such as social network analysis, networks of interest are becoming too large to fit in the memory of a single processing unit, making parallel execution a necessity. Parallelization over the vertex set of the standard algorithm, with a final reduction of the centrality for each vertex, is straightforward but requires Ω(V^2) storage. In this paper we present a new parallelizable algorithm with low spatial complexity that is based on the best known sequential algorithm. Our algorithm requires O(V + E) storage and enables efficient parallel execution. Our algorithm is especially well suited to distributed-memory processing because it can be implemented using coarse-grained parallelism. The presented time bounds for parallel execution of our algorithm on CRCW PRAM and on distributed-memory systems both show good asymptotic performance. Experimental results with a distributed-memory computer show the practical applicability of our algorithm.
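The "best known sequential algorithm" referenced here is Brandes' algorithm, whose per-source O(V + E) state (path counts, predecessors, dependency accumulation) is what the paper's parallelization preserves. A minimal sequential sketch for unweighted graphs, illustrative only and not the paper's distributed implementation:

```python
from collections import deque

def brandes_betweenness(adj):
    """Brandes' sequential betweenness centrality for an unweighted
    graph given as an adjacency list; O(V + E) storage per source."""
    n = len(adj)
    bc = [0.0] * n
    for s in range(n):
        sigma = [0] * n; sigma[s] = 1        # shortest-path counts
        dist = [-1] * n; dist[s] = 0
        preds = [[] for _ in range(n)]       # shortest-path predecessors
        order = []
        q = deque([s])
        while q:                             # BFS from the source
            v = q.popleft(); order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1; q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]; preds[w].append(v)
        delta = [0.0] * n
        for w in reversed(order):            # dependency accumulation
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc
```

The naive parallelization the abstract criticizes replicates a V-sized centrality array per worker; the per-source loop above is the unit the paper distributes instead.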
DisNet: A framework for distributed graph computation
In Proc. of the Int. Conf. on Advances in Social Networks Analysis and Mining (ASONAM), 2011
Abstract
Cited by 9 (0 self)
Abstract—With the rise of network science as an exciting interdisciplinary research topic, efficient graph algorithms are in high demand. Problematically, many such algorithms measuring important properties of networks have asymptotic lower bounds that are quadratic, cubic, or higher in the number of vertices. For analysis of social networks, transportation networks, communication networks, and a host of others, computation is intractable. In these networks, computation in serial fashion requires years or even decades. Fortunately, these same computational problems are often naturally parallel. We present here the design and implementation of a master-worker framework for easily computing such results in these circumstances. The user needs only to supply two small fragments of code describing the fundamental kernel of the computation. The framework automatically divides and distributes the workload and manages completion using an …
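The master-worker pattern the framework is built on can be sketched as below. The `kernel` and `reduce_fn` arguments play the role of the two user-supplied code fragments the abstract mentions, but the names and the thread-based (rather than distributed) setup are illustrative assumptions, not DisNet's actual API:

```python
from queue import Queue
from threading import Thread

def master_worker(items, kernel, reduce_fn, n_workers=4):
    """Minimal master-worker skeleton: the master fills a task queue,
    workers apply the user's kernel, the master reduces the results."""
    tasks, results = Queue(), Queue()
    for item in items:
        tasks.put(item)

    def worker():
        while True:
            item = tasks.get()
            if item is None:             # poison pill: no more work
                break
            results.put(kernel(item))    # user-supplied kernel fragment

    threads = [Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for _ in threads:
        tasks.put(None)                  # one pill per worker
    for t in threads:
        t.join()

    out = []
    while not results.empty():
        out.append(results.get())
    return reduce_fn(out)                # user-supplied reduction fragment

total = master_worker(range(10), lambda x: x * x, sum)
```

This division of labor is why per-vertex computations such as centrality measures fit the framework: each kernel invocation is independent, so the master can hand out vertices and merge partial results in any order.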
Employing transactional memory and helper threads to speed up Dijkstra’s algorithm
In ICPP, 2009
Abstract
Cited by 4 (1 self)
Abstract—In this paper we work on the parallelization of the inherently serial Dijkstra’s algorithm on modern multicore platforms. Dijkstra’s algorithm is a greedy algorithm that computes Single Source Shortest Paths for graphs with non-negative edges and is based on the iterative extraction of nodes from a priority queue. This property limits the explicit parallelism of the algorithm, and any attempt to utilize the remaining parallelism results in significant slowdowns due to synchronization overheads. To deal with these problems, we employ the concept of Helper Threads (HT) to extract parallelism in a non-traditional fashion and Transactional Memory (TM) to efficiently orchestrate the concurrent threads’ accesses to shared data structures. Results demonstrate that the proposed implementation is able to achieve performance speedups (reaching up to 1.84 for 14 threads), indicating that the two paradigms can be efficiently combined.
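The serial bottleneck the authors describe is visible in the baseline algorithm itself: each iteration extracts exactly one minimum-distance node from the priority queue before any of its edge relaxations can proceed. A standard sequential sketch, shown here only as the baseline the TM/HT scheme attacks, not the paper's implementation:

```python
import heapq

def dijkstra(adj, src):
    """Baseline serial Dijkstra with a binary heap. The one-at-a-time
    extract-min below is the sequentializing step: relaxations of the
    extracted node's edges cannot safely start before it completes."""
    dist = {src: 0}
    pq = [(0, src)]                      # (distance, node) min-heap
    while pq:
        d, u = heapq.heappop(pq)         # serial extraction point
        if d > dist.get(u, float('inf')):
            continue                     # stale queue entry; skip it
        for v, w in adj.get(u, []):      # relax outgoing edges of u
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist
```

Helper threads, in the paper's scheme, speculatively relax edges of nodes deeper in the queue while the main thread performs this extraction, with TM catching the cases where the speculation conflicts.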
Early Experiences on Accelerating Dijkstra’s Algorithm Using Transactional Memory
Abstract
Cited by 3 (1 self)
In this paper we use Dijkstra’s algorithm as a challenging, hard-to-parallelize paradigm to test the efficacy of several parallelization techniques on a multicore architecture. We consider the application of Transactional Memory (TM) as a means of concurrent access to shared data and compare its performance with straightforward parallel versions of the algorithm based on traditional synchronization primitives. To increase the granularity of parallelism and avoid excessive synchronization, we combine TM with Helper Threading (HT). Our simulation results demonstrate that the straightforward parallelization of Dijkstra’s algorithm with traditional locks and barriers has, as expected, disappointing performance. On the other hand, TM by itself is able to provide some performance improvement in several cases, while the version based on TM and HT exhibits a significant performance improvement that can reach a speedup of up to 1.46.
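Transactional memory's appeal here is that threads relax distances optimistically, aborting and retrying on conflict instead of holding locks pessimistically across whole queue operations. The sketch below mimics that read-validate-commit pattern with an ordinary lock guarding only the commit step; it is an illustrative analogy, not real hardware or software TM and not the paper's code:

```python
import threading

class TxDistance:
    """Optimistic 'relax' in the spirit of a TM transaction: read the
    current distance without locking, compute, then commit only if the
    value is unchanged, retrying on conflict."""
    def __init__(self, n):
        self.dist = [float('inf')] * n
        self.lock = threading.Lock()     # stands in for TM's commit check

    def relax(self, v, candidate):
        while True:
            old = self.dist[v]           # transactional read
            if candidate >= old:
                return False             # nothing to improve; abort cheaply
            with self.lock:              # commit: validate, then write
                if self.dist[v] == old:
                    self.dist[v] = candidate
                    return True
            # another thread committed first; retry against the new value
```

Under low contention most relaxations commit on the first try, which is why the abstract reports TM outperforming the coarse lock-and-barrier versions.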
Graph algorithms in a guaranteed-deterministic language
In Workshop on Determinism and Correctness in Parallel Programming (WoDet’14), 2014
Abstract
Cited by 1 (1 self)
Deterministic implementations of graph algorithms have recently been shown to be reasonably performant. In this paper we explore a follow-on question: can deterministic graph algorithms be expressed in guaranteed-deterministic parallel languages, which are necessarily restrictive in what concurrency idioms they employ? To find out, we implement several graph algorithms using the LVish library for Haskell (a deterministic language), which reveals its strengths as well as limitations. We surmount these limitations by (1) implementing a functional version of the deterministic reservations mechanism, and (2) adding a new mechanism to LVish called BulkRetry. We present results from an early-stage prototype.
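The "deterministic reservations" mechanism mentioned in the abstract runs a greedy algorithm in rounds: every pending item reserves the shared state it touches by priority, and only items that win all their reservations commit, so the parallel result matches the sequential greedy order. A Python sketch for greedy maximal independent set; the paper's version is a functional Haskell/LVish implementation, so this conveys only the idea:

```python
def greedy_mis(adj):
    """Round-based deterministic-reservations sketch of greedy maximal
    independent set: each round, live vertices reserve themselves and
    their live neighbors, lowest index winning; a vertex joins the MIS
    only if it won every one of its reservations."""
    alive = set(range(len(adj)))
    mis = []
    while alive:
        neigh = {v: [v] + [u for u in adj[v] if u in alive] for v in alive}
        slots = {}
        for v in sorted(alive):              # reserve phase: min index wins
            for s in neigh[v]:
                if s not in slots or v < slots[s]:
                    slots[s] = v
        winners = [v for v in sorted(alive)  # commit phase
                   if all(slots[s] == v for s in neigh[v])]
        for v in winners:                    # winner kills its neighborhood
            mis.append(v)
            alive.discard(v)
            for u in adj[v]:
                alive.discard(u)
    return mis
```

Every round the globally smallest live vertex wins all its reservations, so progress is guaranteed, and winners in one round are mutually non-adjacent because adjacent vertices contend for the same slots.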
Performance Analysis of Single Source Shortest Path Algorithm over Multiple GPUs in a Network of Workstations using OpenCL and MPI
Abstract
Graphics Processing Units (GPUs) are being heavily used in various graphics and non-graphics applications. Many practical problems in computing can be represented as graphs to arrive at a particular solution, and such graphs can contain very large numbers of vertices and edges, up to millions of each. In this paper, we present a performance analysis of Dijkstra’s single-source shortest path algorithm over multiple GPU devices in a single machine as well as over a network of workstations using OpenCL and MPI. Experimental results show that parallel execution of Dijkstra’s algorithm performs better when run over multi-GPU devices in a single workstation than over multi-GPU devices spread across a network of workstations. For our experiments, we used workstations with an Intel Xeon 6-core processor supporting hyper-threading and a total of 24 threads, with NVIDIA Quadro FX 3800 GPU devices; the two GPU devices are connected by an SLI bridge. Overall, on average we achieved a performance improvement of up to 10-15x.