Results 1–10 of 83
A class of parallel tiled linear algebra algorithms for multicore architectures
"... Abstract. As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine grain parallelism becomes a ..."
Abstract

Cited by 171 (60 self)
Abstract. As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features of these new processors. Fine-grain parallelism becomes a major requirement and introduces the need for loose synchronization in the parallel execution of an operation. This paper presents algorithms for the Cholesky, LU and QR factorizations where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks which completely hides the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithms, where parallelism can only be exploited at the level of the BLAS operations, and with vendor implementations.
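The decomposition described above can be illustrated with a short sketch (a hypothetical illustration of the task structure, not the paper's code): the loop nest below generates, for a T × T grid of tiles, the small POTRF/TRSM/SYRK/GEMM tasks of a right-looking tiled Cholesky factorization, together with the tiles each task reads and writes. The numerical kernels are omitted; only the graph of tasks is produced.

```python
def tiled_cholesky_tasks(T):
    """Yield (task, writes, reads) triples for a T x T tile grid.

    Kernel names follow the usual BLAS/LAPACK conventions.  A task that
    updates a tile in place (POTRF, SYRK, GEMM) implicitly also reads
    the tile it writes; that is left out of `reads` for brevity.
    """
    tasks = []
    for k in range(T):
        tasks.append((("POTRF", k), {(k, k)}, set()))          # factor diagonal tile
        for m in range(k + 1, T):
            tasks.append((("TRSM", k, m), {(m, k)}, {(k, k)})) # solve below diagonal
        for m in range(k + 1, T):
            tasks.append((("SYRK", k, m), {(m, m)}, {(m, k)})) # update trailing diagonal
            for n in range(k + 1, m):
                tasks.append((("GEMM", k, m, n),               # update trailing off-diagonal
                              {(m, n)}, {(m, k), (n, k)}))
    return tasks

tasks = tiled_cholesky_tasks(3)
print(len(tasks))  # 10 small tasks for a 3x3 tile grid
```

A scheduler can then run any task whose read and written tiles have no pending writes, which is what allows the out-of-order execution the abstract describes.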
DAGuE: A generic distributed DAG engine for high performance computing
, 2010
"... The frenetic development of the current architectures places a strain on the current stateoftheart programming environments. Harnessing the full potential of such architectures has been a tremendous task for the whole scientific computing community. We present DAGuE a generic framework for archit ..."
Abstract

Cited by 70 (24 self)
The frenetic development of current architectures places a strain on the current state-of-the-art programming environments. Harnessing the full potential of such architectures has been a tremendous task for the whole scientific computing community. We present DAGuE, a generic framework for architecture-aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures. The applications we consider can be represented as a Directed Acyclic Graph of tasks with labeled edges designating data dependencies. DAGs are represented in a compact, problem-size-independent format that can be queried on demand to discover data dependencies, in a totally distributed fashion. DAGuE assigns computation threads to the cores, overlaps communications and computations, and uses a dynamic, fully distributed scheduler based on cache awareness, data locality and task priority. We demonstrate the efficiency of our approach, using several micro-benchmarks to analyze the performance of different components of the framework, and a linear algebra factorization as a use case.
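The "compact, problem-size-independent format" can be sketched as follows (an illustration using assumed tiled-Cholesky kernel names, not DAGuE's actual representation): rather than storing the graph, the successors of a task are recomputed on demand from its loop indices, so the representation costs the same whether T is 3 or 3000.

```python
def successors(task, T):
    """Successors of `task` in a T x T tiled Cholesky DAG, computed
    on demand from the task's indices -- no graph is ever stored."""
    if task[0] == "POTRF":          # L(k,k) feeds every TRSM below it
        k = task[1]
        return [("TRSM", k, m) for m in range(k + 1, T)]
    if task[0] == "TRSM":           # L(m,k) feeds the trailing update
        k, m = task[1], task[2]
        return ([("SYRK", k, m)]
                + [("GEMM", k, m, n) for n in range(k + 1, m)]
                + [("GEMM", k, n, m) for n in range(m + 1, T)])
    return []                       # update tasks feed step k+1 (elided here)

print(successors(("POTRF", 0), 4))
# [('TRSM', 0, 1), ('TRSM', 0, 2), ('TRSM', 0, 3)]
```

In a distributed setting, each node can evaluate such a query locally for the tasks it owns, which is what makes the dependence discovery "totally distributed".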
Programming matrix algorithms-by-blocks for thread-level parallelism
 ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE
"... With the emergence of threadlevel parallelism as the primary means for continued performance improvement, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution ..."
Abstract

Cited by 46 (18 self)
With the emergence of thread-level parallelism as the primary means for continued performance improvement, the programmability issue has re-emerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution due to constraints imposed by early design decisions. We propose a philosophy of abstraction and separation of concerns that provides a promising solution in this problem domain. The first abstraction, FLASH, allows algorithms to express computation with matrices consisting of contiguous blocks, facilitating algorithms-by-blocks. Operand descriptions are registered for a particular operation a priori by the library implementor. A runtime system, SuperMatrix, uses this information to identify data dependencies between suboperations, allowing them to be scheduled to threads out of order and executed in parallel. But not all classical algorithms in linear algebra lend themselves to conversion to algorithms-by-blocks. We show how our recently proposed LU factorization with incremental pivoting and a closely related algorithm-by-blocks for the QR factorization, both originally designed for out-of-core computation, overcome this difficulty. Anecdotal evidence regarding the development of routines with a core functionality demonstrates how the methodology supports high productivity, while experimental results suggest
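The dependence analysis a runtime like SuperMatrix performs can be sketched in a few lines (the function and task names here are invented for illustration): each submitted task declares which blocks it reads and writes, and ordering edges follow from the classical read-after-write, write-after-write, and write-after-read hazards between those sets.

```python
def build_deps(tasks):
    """tasks: list of (name, reads, writes) in program order.
    Returns the set of edges (i, j) meaning task i must run before j."""
    edges = set()
    for j, (_, rj, wj) in enumerate(tasks):
        for i in range(j):
            _, ri, wi = tasks[i]
            # RAW: i writes a block j reads; WAW: both write it;
            # WAR: i reads a block j overwrites.
            if (wi & rj) or (wi & wj) or (ri & wj):
                edges.add((i, j))
    return edges

seq = [("chol_A00", set(),   {"A00"}),
       ("trsm_A10", {"A00"}, {"A10"}),
       ("syrk_A11", {"A10"}, {"A11"})]
print(sorted(build_deps(seq)))  # [(0, 1), (1, 2)]
```

Any pair of tasks with no path between them in this graph may execute in parallel, which is how the serial block algorithm is parallelized without changing its code.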
QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators
"... Abstract—One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid acceleratorsbased node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we pre ..."
Abstract

Cited by 27 (13 self)
Abstract—One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid accelerator-based node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we present the design of a highly efficient QR factorization for such a node. Our method proceeds in three steps. The first step consists of expressing the QR factorization as a sequence of tasks of well-chosen granularity, each aimed at execution on a CPU core or a GPU. We show that we can efficiently adapt high-level algorithms from the literature that were initially designed for homogeneous multicore architectures. The second step consists of designing the kernels that implement each individual task. We use CPU kernels from previous work and present new kernels for GPUs that complement kernels already
Multi-Threading and One-Sided Communication in Parallel LU Factorization
"... Dense LU factorization has a high ratio of computation to communication and, as evidenced by the High Performance Linpack (HPL) benchmark, this property makes it scale well on most parallel machines. Nevertheless, the standard algorithm for this problem has nontrivial dependence patterns which limi ..."
Abstract

Cited by 23 (0 self)
Dense LU factorization has a high ratio of computation to communication and, as evidenced by the High Performance Linpack (HPL) benchmark, this property makes it scale well on most parallel machines. Nevertheless, the standard algorithm for this problem has non-trivial dependence patterns which limit parallelism, and local computations require large matrices in order to achieve good single-processor performance. We present an alternative programming model for this type of problem, which combines UPC's global address space with lightweight multithreading. We introduce the concept of memory-constrained lookahead, where the amount of concurrency managed by each processor is controlled by the amount of memory available. We implement novel techniques for steering the computation to optimize for high performance and demonstrate the scalability and portability of UPC with Teraflop-level performance on some machines, comparing favourably to other state-of-the-art MPI codes.
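Memory-constrained lookahead can be illustrated with a toy simulation (names and mechanics are illustrative, not the paper's UPC implementation): the scheduler runs ahead and starts future panel factorizations, but only while the buffered panels fit in a fixed memory budget; each trailing update then retires the oldest buffered panel.

```python
def schedule_panels(n_panels, budget):
    """Simulate lookahead bounded by memory; return the peak number of
    panels held in memory at any one time (never exceeds `budget`)."""
    pending = list(range(n_panels))   # panels not yet factored
    in_flight, peak, completed = 0, 0, 0
    while completed < n_panels:
        # run ahead: factor future panels while the buffer has room
        while pending and in_flight < budget:
            pending.pop(0)
            in_flight += 1
        peak = max(peak, in_flight)
        # the trailing update consumes (retires) the oldest panel
        in_flight -= 1
        completed += 1
    return peak

print(schedule_panels(8, budget=3))  # 3: concurrency capped by memory
```

With budget=1 this degenerates to the fully synchronous algorithm, and a larger budget exposes more concurrency at the cost of more buffered panel storage, which is exactly the trade-off the abstract describes.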
Tile QR Factorization with Parallel Panel Processing for Multicore Architectures
, 2009
"... To exploit the potential of multicore architectures, recent dense linear algebra libraries have used tile algorithms, which consist in scheduling a Directed Acyclic Graph (DAG) of tasks of fine granularity where nodes represent tasks, either panel factorization or update of a blockcolumn, and edges ..."
Abstract

Cited by 16 (9 self)
To exploit the potential of multicore architectures, recent dense linear algebra libraries have used tile algorithms, which consist of scheduling a Directed Acyclic Graph (DAG) of fine-granularity tasks, where nodes represent tasks, either panel factorization or update of a block column, and edges represent dependencies among them. Although past approaches already achieve high performance on moderate and large square matrices, their way of processing a panel in sequence leads to limited performance when factorizing tall and skinny matrices or small square matrices. We present a new fully asynchronous method for computing a QR factorization on shared-memory multicore architectures that overcomes this bottleneck. Our contribution is to adapt an existing algorithm that performs a panel factorization in parallel (named Communication-Avoiding QR and initially designed for distributed-memory machines) to the context of tile algorithms using asynchronous computations. An experimental study shows significant improvement (up to almost 10 times faster) compared to state-of-the-art approaches. We aim to eventually incorporate this work into the Parallel Linear Algebra for Scalable Multicore Architectures (PLASMA) library.
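The parallel-panel idea borrowed from Communication-Avoiding QR can be sketched as a reduction tree (a structural illustration only; the actual triangular-factor merge kernels are elided): the panel's block rows are factored independently, then the partial triangular factors are merged pairwise, so the panel's critical path grows logarithmically in the number of block rows rather than linearly.

```python
def reduction_rounds(block_rows):
    """Pairwise-merge schedule for a panel split into `block_rows`
    independent domains: a list of rounds, each a list of (dst, src)
    merges that can run concurrently within the round."""
    rounds, live = [], list(range(block_rows))
    while len(live) > 1:
        merges = [(live[i], live[i + 1]) for i in range(0, len(live) - 1, 2)]
        rounds.append(merges)
        # survivors: each merge destination, plus an unpaired trailing domain
        live = [dst for dst, _ in merges] + (live[-1:] if len(live) % 2 else [])
    return rounds

print(len(reduction_rounds(8)))  # 3 rounds, i.e. ceil(log2(8))
```

For a tall and skinny matrix the panel dominates the work, so replacing a sequential panel (block_rows - 1 dependent steps) with ceil(log2(block_rows)) rounds is what yields the reported speedups.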
Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA
Accepted at the 12th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC-11)
, 2011
"... Abstract—We present a method for developing dense linear algebra algorithms that seamlessly scales to thousands of cores. It can be done with our project called DPLASMA (Distributed PLASMA) that uses a novel generic distributed Direct Acyclic Graph Engine (DAGuE). The engine has been designed for hi ..."
Abstract

Cited by 14 (10 self)
Abstract—We present a method for developing dense linear algebra algorithms that seamlessly scales to thousands of cores, using our project called DPLASMA (Distributed PLASMA), which builds on a novel generic distributed Directed Acyclic Graph Engine (DAGuE). The engine has been designed for high performance computing and thus enables scaling of tile algorithms, originating in PLASMA, on large distributed-memory systems. The underlying DAGuE framework has many appealing features when considering distributed-memory platforms with heterogeneous multicore nodes: a DAG representation that is independent of the problem size, automatic extraction of the communication from the dependencies, overlapping of communication and computation, task prioritization, and architecture-aware scheduling and management of tasks. The originality of this engine lies in its capacity to translate a sequential code with nested loops into a concise and synthetic format which can then be interpreted and executed in a distributed environment. We present three common dense linear algebra algorithms from PLASMA (Parallel Linear Algebra for Scalable Multicore Architectures), namely Cholesky, LU, and QR factorizations, to investigate their data-driven expression and execution in a distributed system. We demonstrate through experimental results on the Cray XT5 Kraken system that our DAG-based approach has the potential to achieve a sizable fraction of peak performance, which is characteristic of state-of-the-art distributed numerical software on current and emerging architectures. Keywords: numerical linear systems, scalable parallel algorithms, scheduling and task partitioning
Parallelizing Dense and Banded Linear Algebra Libraries using SMPSs
, 2008
"... The promise of future manycore processors, with hundreds of threads running concurrently, has lead the developers of linear algebra libraries to rethink their design in order to extract more parallelism, further exploit data locality, attain a better load balance, and pay careful attention to the c ..."
Abstract

Cited by 14 (2 self)
The promise of future many-core processors, with hundreds of threads running concurrently, has led the developers of linear algebra libraries to rethink their design in order to extract more parallelism, further exploit data locality, attain a better load balance, and pay careful attention to the critical path of computation. In this paper we describe how existing serial libraries like (C)LAPACK and FLAME can be easily parallelized using the SMPSs tools, consisting of a few OpenMP-like pragmas and a runtime system. In the LAPACK case, this usually requires the development of blocked algorithms for simple BLAS-level operations, which expose concurrency at a finer grain. For better performance, our experimental results indicate that column-major order, as employed by this library, needs to be abandoned in favor of a block data layout. This will require a deeper rewrite of LAPACK or, alternatively, a dynamic conversion of the storage pattern at runtime. The parallelization of FLAME routines using SMPSs is quite simple, as this library includes blocked algorithms (or algorithms-by-blocks in the FLAME argot) for most operations, and storage-by-blocks (or block data layout) is already in place.
Reducing the Amount of Pivoting in Symmetric Indefinite Systems
"... Abstract. This paper illustrates how the communication due to pivoting in the solution of symmetric indefinite linear systems can be reduced by considering innovative approaches that are different from pivoting strategies implemented in current linear algebra libraries. First a tiled algorithm where ..."
Abstract

Cited by 13 (7 self)
Abstract. This paper illustrates how the communication due to pivoting in the solution of symmetric indefinite linear systems can be reduced by considering innovative approaches that differ from the pivoting strategies implemented in current linear algebra libraries. First, a tiled algorithm where pivoting is performed within a tile is described, and then an alternative to pivoting is proposed. The latter considers a symmetric randomization of the original matrix using so-called recursive butterfly matrices. In numerical experiments, the accuracy of tile-wise pivoting and of the randomization approach is compared with the accuracy of the Bunch-Kaufman algorithm.
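A depth-1 butterfly, the building block of the recursive butterflies mentioned above, can be sketched in pure Python (a simplified illustration: the literature on random butterfly transformations typically uses random diagonal entries of the form exp(r/10), whereas here they are ±1 for clarity, which makes the butterfly exactly orthogonal):

```python
import random

def butterfly(n, rng):
    """n x n butterfly B = 1/sqrt(2) * [[R0, R1], [R0, -R1]] with R0, R1
    random diagonal matrices (entries +/-1 here); n must be even."""
    h, s = n // 2, 2 ** -0.5
    r0 = [rng.choice((-1.0, 1.0)) for _ in range(h)]
    r1 = [rng.choice((-1.0, 1.0)) for _ in range(h)]
    B = [[0.0] * n for _ in range(n)]
    for i in range(h):
        B[i][i],     B[i][h + i]     = s * r0[i],  s * r1[i]
        B[h + i][i], B[h + i][h + i] = s * r0[i], -s * r1[i]
    return B

def matmul(A, B):
    """Plain triple-loop matrix product over nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

B = butterfly(4, random.Random(0))
Bt = [list(r) for r in zip(*B)]          # transpose of B
P = matmul(Bt, B)                        # B^T B
err = max(abs(P[i][j] - (i == j)) for i in range(4) for j in range(4))
print(err < 1e-12)  # True: with +/-1 diagonals the butterfly is orthogonal
```

Randomizing a symmetric matrix as U^T A U with recursive butterflies U of this form spreads the entries so that, with high probability, the factorization can proceed without the pivot searches (and their communication) that Bunch-Kaufman requires.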