Hardware/software vectorization for closeness centrality on multi/manycore architectures
In 28th International Parallel and Distributed Processing Symposium Workshops, Workshop on Multithreaded Architectures and Applications (MTAAP), 2014
Abstract—Centrality metrics have been shown to be highly correlated with the importance and loads of the nodes in a network. Given the scale of today’s social networks, it is essential to use efficient algorithms and high-performance computing techniques for their fast computation. In this work, we exploit hardware and software vectorization in combination with fine-grain parallelization to compute closeness centrality values. The proposed vectorization approach enables concurrent breadth-first search operations and significantly increases performance. We compare different vectorization schemes and experimentally evaluate our contributions against existing parallel CPU-based solutions on cutting-edge hardware. Our implementations are up to 11 times faster than the state-of-the-art implementation for a graph with 234 million edges. The proposed techniques also show how vectorization can be efficiently utilized to execute other graph kernels that require multiple traversals over a large-scale network on cutting-edge architectures. Keywords—Centrality, closeness centrality, vectorization, breadth-first search, Intel Xeon Phi.
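The concurrent-BFS idea behind this kind of vectorization can be illustrated with a bit-parallel sketch: each bit of a machine word tracks one BFS source, so a single OR over an edge advances up to 32 traversals at once. This is a minimal plain-C illustration under assumed names and a toy CSR layout, not the paper's actual SIMD implementation:

```c
#include <stdint.h>

#define MAXN 64               /* toy bound on the vertex count */

/* CSR adjacency: neighbours of u are adj[xadj[u] .. xadj[u+1]-1] */
typedef struct { int n; const int *xadj; const int *adj; } Graph;

/* Run up to 32 BFSs concurrently: bit b of vis[v] records whether the
 * BFS rooted at src[b] has reached vertex v.  far[b] accumulates the
 * sum of shortest-path distances from src[b]; the reciprocal of that
 * sum is the closeness centrality of src[b]. */
void cc_batch(const Graph *g, const int *src, int nsrc, long *far) {
    uint32_t vis[MAXN], cur[MAXN], nxt[MAXN];
    for (int v = 0; v < g->n; v++) vis[v] = cur[v] = 0;
    for (int b = 0; b < nsrc; b++) {
        vis[src[b]] |= 1u << b;
        cur[src[b]] |= 1u << b;
        far[b] = 0;
    }
    for (int level = 1, active = 1; active; level++) {
        active = 0;
        for (int v = 0; v < g->n; v++) nxt[v] = 0;
        /* one step of all BFSs: OR the frontier masks into the neighbours */
        for (int u = 0; u < g->n; u++) {
            if (!cur[u]) continue;
            for (int e = g->xadj[u]; e < g->xadj[u + 1]; e++)
                nxt[g->adj[e]] |= cur[u];
        }
        for (int v = 0; v < g->n; v++) {
            cur[v] = nxt[v] & ~vis[v];      /* sources newly reaching v */
            if (!cur[v]) continue;
            active = 1;
            vis[v] |= cur[v];
            for (int b = 0; b < nsrc; b++)
                if (cur[v] & (1u << b)) far[b] += level;
        }
    }
}
```

A hardware-vectorized version would replace the per-vertex `uint32_t` masks with full SIMD registers, but the traversal structure is the same.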
SPARSE MATRIX MULTIPLICATION ON AN ASSOCIATIVE PROCESSOR
Abstract—Sparse matrix multiplication is an important component of linear algebra computations. Implementing sparse matrix multiplication on an associative processor (AP) enables a high level of parallelism, where a row of one matrix is multiplied in parallel with the entire second matrix, and where the AP execution time of a vector dot product does not depend on the vector size. Four sparse matrix multiplication algorithms are explored in this paper, combining AP and CPU processing to various degrees. They are evaluated by simulation on a large set of sparse matrices. The computational complexity of sparse matrix multiplication on the AP is shown to be O(M), where M is the number of nonzero elements. The AP is found to be especially efficient in binary sparse matrix multiplication, and it outperforms conventional solutions in power efficiency.
Partnership for Advanced Computing in Europe: Porting FEASTFLOW to the Intel Xeon Phi: Lessons Learned
In this paper we report our experiences in porting the FEASTFLOW software infrastructure to the Intel Xeon Phi coprocessor. Our efforts involved both the evaluation of programming models, including OpenCL, POSIX threads, and OpenMP, and typical optimization strategies such as parallelization and vectorization. Since the straightforward port of the existing OpenCL version of the code encountered performance problems that require further analysis, we focused our efforts on the implementation and optimization of two core building-block kernels for FEASTFLOW: an axpy vector operation and a sparse matrix-vector multiplication (spmv).
Our experimental results on these building blocks indicate that the Xeon Phi can serve as a promising accelerator for our software infrastructure.
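For reference, the two building-block kernels named above have well-known generic forms; the following is a portable C sketch with OpenMP hints (a textbook axpy and CSR-based spmv, not the FEASTFLOW code itself):

```c
#include <stddef.h>

/* axpy: y <- a*x + y; the simd hint asks the compiler to vectorize */
void axpy(size_t n, double a, const double *x, double *y) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* spmv: y <- A*x, with A in compressed sparse row (CSR) storage */
void spmv_csr(size_t nrows, const size_t *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y) {
    #pragma omp parallel for
    for (size_t r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (size_t j = row_ptr[r]; j < row_ptr[r + 1]; j++)
            sum += val[j] * x[col_idx[j]];
        y[r] = sum;
    }
}
```

axpy is bandwidth-bound and vectorizes trivially; spmv is the harder case because the gather `x[col_idx[j]]` is irregular, which is exactly where accelerator ports tend to need optimization work.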
Evaluating the capabilities of the Xeon Phi
Delivering Parallel Programmability to the Masses via the Intel MIC Ecosystem: A Case Study
Abstract—Moore’s Law effectively doubles the compute power of a microprocessor every 24 months. Over the past decade, however, this doubling in performance has been due to the doubling of the number of cores in a microprocessor rather than clock speed increases. Perhaps nowhere is this more evident than with the Intel Xeon Phi coprocessor. This manycore architecture exhibits not only massive inter-core parallelism but also intra-core parallelism via a wider SIMD width. However, for data-intensive applications, the bandwidth constraint of MIC hinders the full utilization of computational resources, especially when massive parallelism is required to process big data sets. Furthermore, the process of optimizing the performance on such platforms is complex and requires architectural expertise. To evaluate the efficacy of the Intel MIC ecosystem for “big data” applications, we use the Floyd-Warshall algorithm as a representative case study for graph applications. Our study offers evidence that traditional compiler optimizations can deliver parallel programmability to the masses on the Intel Xeon Phi platform. That is, developers can straightforwardly create manycore codes in the Intel Xeon Phi ecosystem that deliver significant speedup. The optimizations include reordering data-access patterns, adjusting loop structures, vectorizing branches, and using OpenMP directives. We start from the default serial algorithm and apply the above optimizations one by one. Overall, we achieve a 281.7-fold speedup over the default serial version. When compared with the default OpenMP Floyd-Warshall parallel implementation, we still achieve a 6.4-fold speedup. We also observe that the identically optimized code on MIC can outperform its CPU counterpart by up to 3.2-fold.
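The flavor of those optimizations (hoisting the k-row access out of the inner loop, keeping the innermost loop unit-stride and branch-light so it vectorizes, and parallelizing the middle loop with OpenMP) can be seen in a minimal Floyd-Warshall sketch; this is a generic illustration of the technique, not the paper's tuned code:

```c
/* dist is an n*n row-major matrix; dist[i*n+j] holds the current best
 * i->j distance (use a large sentinel for "no edge"; dist[i*n+i] = 0). */
void floyd_warshall(int n, double *dist) {
    for (int k = 0; k < n; k++) {
        #pragma omp parallel for        /* rows are independent for fixed k */
        for (int i = 0; i < n; i++) {
            double dik = dist[i * n + k];   /* hoisted: reused for every j */
            #pragma omp simd
            for (int j = 0; j < n; j++) {   /* unit-stride, vectorizable */
                double via = dik + dist[k * n + j];
                if (via < dist[i * n + j])
                    dist[i * n + j] = via;
            }
        }
    }
}
```

The conditional update compiles to a vector min/select rather than a real branch, which is what makes the inner loop SIMD-friendly.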
Incremental Closeness Centrality in Distributed Memory
, 2015
Networks are commonly used to model traffic patterns, social interactions, or web pages. The vertices in a network do not possess the same characteristics: some vertices are naturally more connected, and some vertices can be more important. Closeness centrality (CC) is a global metric that quantifies how important a given vertex is in the network. When the network is dynamic and keeps changing, the relative importance of the vertices also changes. The complexity of the best known algorithm for computing CC scores makes it impractical to recompute them from scratch after each modification. In this paper, we propose Streamer, a distributed-memory framework for incrementally maintaining the closeness centrality scores of a network upon changes. It leverages pipelined, replicated parallelism and SpMM-based BFSs, and it takes NUMA effects into account. It makes maintaining the closeness centrality values of real-life networks with millions of interactions significantly faster and obtains almost linear speedups on a 64-node cluster with 8 threads per node.
Regularizing Graph Centrality Computations
, 2014
Centrality metrics such as betweenness and closeness have been used to identify important nodes in a network. However, it takes days to months on a high-end workstation to compute the centrality of today’s networks; the main reasons are the size and the irregular structure of these networks. While today’s computing units excel at processing dense and regular data, their performance is questionable when the data is sparse. In this work, we show how centrality computations can be regularized to reach higher performance. For betweenness centrality, we deviate from the traditional fine-grain approach by allowing a GPU to execute multiple BFSs at the same time. Furthermore, we exploit hardware and software vectorization to compute closeness centrality values on CPUs, GPUs, and the Intel Xeon Phi. Experiments show that, only by reengineering the algorithms and without using additional hardware, the proposed techniques can speed up centrality computations significantly: an improvement by a factor of 5.9 on CPU architectures, 70.4 on GPU architectures, and 21.0 on the Intel Xeon Phi.
A UNIFIED SPARSE MATRIX DATA FORMAT FOR EFFICIENT GENERAL SPARSE MATRIXVECTOR MULTIPLY ON MODERN PROCESSORS WITH WIDE SIMD UNITS
Abstract. Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. However, the optimal sparse matrix data storage format is highly hardware-specific, which could become an obstacle when using heterogeneous systems. Also, it is as yet unclear how the wide single instruction multiple data (SIMD) units in current multi- and manycore processors should be used most efficiently if there is no structure in the sparsity pattern of the matrix. We suggest SELL-C-σ, a variant of Sliced ELLPACK, as a SIMD-friendly data format which combines long-standing ideas from General Purpose Graphics Processing Units (GPGPUs) and vector computer programming. We discuss the advantages of SELL-C-σ compared to established formats like Compressed Row Storage (CRS) and ELLPACK and show its suitability on a variety of hardware platforms (Intel Sandy Bridge, Intel Xeon Phi, and Nvidia Tesla K20) for a wide range of test matrices from different application areas. Using appropriate performance models we develop deep insight into the data transfer properties of the SELL-C-σ spMVM kernel. SELL-C-σ comes with two tuning parameters whose performance impact across the range of test matrices is studied and for which reasonable choices are proposed. This leads to a hardware-independent (“catch-all”) sparse matrix format, which achieves very high efficiency for all test matrices across all hardware platforms.
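The storage idea can be illustrated with a small plain-C sketch of a Sliced-ELLPACK-style (SELL-C) kernel. The σ row-sorting step is omitted here for brevity, the chunk height plays the role of the SIMD width, and all names and the layout details are illustrative rather than the authors' reference code:

```c
enum { CHUNK = 4 };   /* chunk height C, chosen to match the SIMD width */

/* Rows are grouped into chunks of CHUNK rows; each chunk is padded to
 * the length of its longest row and laid out column-major, so CHUNK
 * matrix entries are contiguous and form one natural SIMD unit. */
typedef struct {
    int nrows, nchunks;
    const int    *cs;   /* cs[c]: offset of chunk c in val/col */
    const int    *cl;   /* cl[c]: padded row length of chunk c */
    const int    *col;  /* column indices (0 on padded slots)  */
    const double *val;  /* values (0.0 on padded slots)        */
} SellC;

void spmv_sellc(const SellC *A, const double *x, double *y) {
    for (int c = 0; c < A->nchunks; c++) {
        double tmp[CHUNK] = {0.0};
        for (int j = 0; j < A->cl[c]; j++) {
            int off = A->cs[c] + j * CHUNK;
            #pragma omp simd
            for (int r = 0; r < CHUNK; r++)   /* one SIMD-wide column */
                tmp[r] += A->val[off + r] * x[A->col[off + r]];
        }
        for (int r = 0; r < CHUNK && c * CHUNK + r < A->nrows; r++)
            y[c * CHUNK + r] = tmp[r];
    }
}
```

Sorting rows by length within windows of σ rows (the step omitted above) reduces the zero-padding overhead when row lengths inside a chunk differ widely.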
COMPRESSED MULTIROW STORAGE FORMAT FOR SPARSE MATRICES ON GRAPHICS PROCESSING UNITS
Heterogeneous computing architecture for fast detection of SNP-SNP interactions