Tile QR Factorization with Parallel Panel Processing for Multicore Architectures
, 2009
Cited by 16 (9 self)
To exploit the potential of multicore architectures, recent dense linear algebra libraries have used tile algorithms, which consist in scheduling a Directed Acyclic Graph (DAG) of tasks of fine granularity where nodes represent tasks, either panel factorization or update of a block-column, and edges represent dependencies among them. Although past approaches already achieve high performance on moderate and large square matrices, their way of processing a panel in sequence leads to limited performance when factorizing tall and skinny matrices or small square matrices. We present a new fully asynchronous method for computing a QR factorization on shared-memory multicore architectures that overcomes this bottleneck. Our contribution is to adapt an existing algorithm that performs a panel factorization in parallel (named Communication-Avoiding QR and initially designed for distributed-memory machines) to the context of tile algorithms using asynchronous computations. An experimental study shows significant improvement (up to almost 10 times faster) compared to state-of-the-art approaches. We aim to eventually incorporate this work into the Parallel Linear Algebra for Scalable Multicore Architectures (PLASMA) library.
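As a rough illustration of why processing a panel in sequence limits tall and skinny matrices, the sketch below compares two elimination orders for a panel of p tiles: a flat tree, where one tile eliminates the others one after another, and a binary tree of the kind used by parallel panel schemes such as Communication-Avoiding QR. The function names and the representation of a schedule as rounds of (survivor, eliminated) tile pairs are assumptions made for this example; they are not taken from the paper or from PLASMA, and no numerical factorization is performed.

```python
def flat_tree(p):
    """Sequential panel: tile 0 eliminates tiles 1..p-1, one per round."""
    return [[(0, i)] for i in range(1, p)]


def binary_tree(p):
    """Parallel panel: surviving tiles are paired off in every round.

    Each pair (a, b) means the triangular factor held in tile a absorbs
    the one held in tile b; pairs within a round are independent.
    """
    rounds = []
    alive = list(range(p))
    while len(alive) > 1:
        # Pair neighbours; with an odd count the last tile waits a round.
        pairs = [(alive[k], alive[k + 1]) for k in range(0, len(alive) - 1, 2)]
        rounds.append(pairs)
        alive = alive[::2]  # survivors of this round
    return rounds


if __name__ == "__main__":
    for p in (4, 8, 64):
        print(p, "tiles:", len(flat_tree(p)), "sequential rounds vs",
              len(binary_tree(p)), "parallel rounds")
```

Both schedules perform p - 1 eliminations in total; the difference is the length of the dependency chain, which grows linearly with the number of tiles in the flat tree and logarithmically in the binary tree.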
Towards an efficient tile matrix inversion of symmetric positive definite matrices on multicore architectures
 In Proceedings of the 9th international conference on High
, 2011
Enhancing Parallelism of Tile QR Factorization for Multicore Architectures
Cited by 7 (0 self)
To exploit the potential of multicore architectures, recent dense linear algebra libraries have used tile algorithms, which consist of scheduling a Directed Acyclic Graph (DAG) of fine granularity tasks where nodes represent tasks, either panel factorization or update of a block-column, and edges represent dependencies among them. Although past approaches already achieve high performance on moderate and large square matrices, their way of processing a panel in sequence leads to limited performance when factorizing tall and skinny matrices or small square matrices. We present a new, fully asynchronous method for computing a QR factorization on shared-memory multicore architectures that overcomes this bottleneck. Our contribution is to adapt an existing algorithm that performs a panel factorization in parallel (named Communication-Avoiding QR and initially designed for distributed-memory machines) to the context of tile algorithms using asynchronous computations. An experimental study shows significant improvement (up to almost 10 times faster) compared to state-of-the-art approaches. We aim to eventually incorporate this work into the Parallel Linear Algebra for Scalable Multicore Architectures (PLASMA) library. I. Introduction and Motivations: QR factorization is one of the major one-sided factorizations in dense linear algebra.
Virtual Systolic Array for QR Decomposition
Cited by 4 (1 self)
Systolic arrays offer a very attractive, data-centric execution model as an alternative to the von Neumann architecture. Hardware implementations of systolic arrays turned out not to be viable solutions in the past. This article shows how the systolic design principles can be applied to a software solution to deliver an algorithm with unprecedented strong scaling capabilities. A systolic array for the QR decomposition is developed, and a virtualization layer is used for mapping the algorithm to a large distributed-memory system. Strong scaling properties are discovered, superior to existing solutions. Keywords: systolic array; QR decomposition; multicore; message passing; dataflow programming; roofline model.
LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System – LAPACK Working Note 266
Cited by 3 (1 self)
LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This article presents an implementation of the algorithm for a hybrid, shared-memory system with standard CPU cores and GPU accelerators. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.
Tall and Skinny QR Matrix Factorization Using Tile Algorithms on Multicore Architectures
Cited by 1 (1 self)
To exploit the potential of multicore architectures, recent dense linear algebra libraries have used tile algorithms, which consist in scheduling a Directed Acyclic Graph (DAG) of tasks of fine granularity where nodes represent tasks, either panel factorization or update of a block-column, and edges represent dependencies among them. Although past approaches already achieve high performance on moderate and large square matrices, their way of processing a panel in sequence leads to limited performance when factorizing tall and skinny matrices or small square matrices. We present a fully asynchronous method for computing a QR factorization on shared-memory multicore architectures that overcomes this bottleneck. Our contribution is to adapt an existing algorithm that performs a panel factorization in parallel (named Communication-Avoiding QR and initially designed for distributed-memory machines) to the context of tile algorithms using asynchronous computations. An experimental study shows significant improvement (up to almost 10 times faster) compared to state-of-the-art approaches. We aim to eventually incorporate this work into the Parallel Linear Algebra for Scalable Multicore Architectures (PLASMA) library.
Lightweight Superscalar Task Execution in Distributed Memory
Arguably, we have yet to find a solution to the burden of multicore distributed programming facing domain scientists. This burden has been exacerbated by the increasing size of multicores, increasing the effect of any excess synchronization. To deal with these difficulties, numerical algorithms are reengineered as sequences of interdependent tile-based tasks which can be executed by a dynamic runtime environment. We present a new runtime environment for distributed architectures which uses superscalar scheduling concepts. Tasks are inserted serially, and the runtime determines the dependencies dynamically and manages data movement transparently. QUARKD (QUeuing and Runtime for Kernels on Distributed Memory) is shown to scale to O(1000) cores for linear algebra algorithms and have competitive performance. The primary message of this research is that scalable and competitive performance can be achieved by a distributed-memory execution system using superscalar scheduling ideas where serial code is the input and parallel execution correctness is guaranteed.
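The idea that tasks are inserted serially while the runtime derives the dependencies can be sketched in a few lines. The example below is a toy model, not QUARKD's API: the function name, the (name, reads, writes) task tuples, and the tile labels are all hypothetical. It tracks the last writer and the current readers of each tile and emits an edge for every read-after-write, write-after-read, and write-after-write hazard. Data movement between nodes is not modelled.

```python
def infer_dependencies(tasks):
    """Return sorted (earlier, later) task-index pairs that must be ordered.

    tasks: list of (name, reads, writes) in serial insertion order, where
    reads and writes are lists of tile labels (hypothetical format).
    """
    last_writer = {}  # tile -> index of the most recent task that wrote it
    readers = {}      # tile -> tasks that read it since that write
    edges = set()
    for tid, (_name, reads, writes) in enumerate(tasks):
        for tile in reads:
            if tile in last_writer:            # read-after-write
                edges.add((last_writer[tile], tid))
        for tile in writes:
            if tile in last_writer:            # write-after-write
                edges.add((last_writer[tile], tid))
            for r in readers.get(tile, []):    # write-after-read
                edges.add((r, tid))
        for tile in reads:
            readers.setdefault(tile, []).append(tid)
        for tile in writes:
            last_writer[tile] = tid
            readers[tile] = []
    return sorted(edges)


# Serial task stream for a tile QR factorization of a 2 x 2 tile matrix,
# using the customary kernel names as plain labels.
tile_qr = [
    ("GEQRT", ["A00"], ["A00"]),
    ("UNMQR", ["A00", "A01"], ["A01"]),
    ("TSQRT", ["A00", "A10"], ["A00", "A10"]),
    ("TSMQR", ["A10", "A01", "A11"], ["A01", "A11"]),
    ("GEQRT", ["A11"], ["A11"]),
]

if __name__ == "__main__":
    for src, dst in infer_dependencies(tile_qr):
        print(f"{tile_qr[src][0]}#{src} -> {tile_qr[dst][0]}#{dst}")
```

Any execution order that respects the returned edges gives the same result as running the tasks in insertion order, which is the sense in which serial code can be the input to a parallel execution.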
TILED ALGORITHMS FOR MATRIX COMPUTATIONS ON MULTICORE ARCHITECTURES
, 2012
COMPUTING AND SOFTWARE
defence committee, and university, make no claim as to the fitness for any purpose, and accept no direct or indirect liability for the use of algorithms, findings, or recommendations in this thesis. The multicore revolution in chip design has fundamentally altered the demands placed on developers. Thread-level parallelism is critical to optimizing software performance on multicore chips. However, thread-level parallelism presents challenges with respect to optimization, safety and program representation. Program models and compiler technologies must act as a bridge from applications to efficient hardware usage. Coconut (COde CONstructing User Tool) is an ongoing project at McMaster to develop a platform for experimenting with novel ideas in reliable and high performance code generation, currently targeting the Cell/B.E. The Coconut Multicore Framework uses a virtual machine abstraction layer to model
H-LU Factorization on Many-Core Systems
, 2014
A version of the H-LU factorization is introduced, based on the individual computational tasks occurring during the block-wise H-LU factorization. The dependencies between these tasks form a directed acyclic graph, which is used for efficient scheduling on parallel systems. The algorithm is especially suited for many-core processors and shows a much improved parallel scaling behavior compared to previous H-LU factorization algorithms.