Results 1-10 of 170
The Landscape of Parallel Computing Research: A View from Berkeley
Technical Report, UC Berkeley, 2006
A class of parallel tiled linear algebra algorithms for multicore architectures
Cited by 169 (58 self)
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine-grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the Cholesky, LU and QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithms where parallelism can only be exploited at the level of the BLAS operations and vendor implementations.
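To make the idea of small tasks on square blocks concrete, here is a minimal sketch for the Cholesky case only. It is an illustration, not the paper's code: the task names follow the standard LAPACK/BLAS kernels (POTRF, TRSM, SYRK, GEMM), while the functions `cholesky_tasks`, `build_dag` and `wavefronts` are hypothetical helpers. The sketch enumerates the tasks of a right-looking tiled Cholesky on a t x t tile grid, derives dependencies from which tile each task reads and updates, and groups tasks into levels that a scheduler could run concurrently.

```python
from collections import defaultdict


def cholesky_tasks(t):
    """Yield (name, tiles_read, tile_updated) for a tiled Cholesky on a t x t grid."""
    for k in range(t):
        yield (f"POTRF({k})", [], (k, k))              # factor the diagonal tile
        for i in range(k + 1, t):
            yield (f"TRSM({i},{k})", [(k, k)], (i, k))  # solve against the diagonal tile
        for i in range(k + 1, t):
            yield (f"SYRK({i},{k})", [(i, k)], (i, i))  # update a trailing diagonal tile
            for j in range(k + 1, i):
                # update a trailing off-diagonal tile
                yield (f"GEMM({i},{j},{k})", [(i, k), (j, k)], (i, j))


def build_dag(t):
    """Map each task to the set of tasks it must wait for.

    A task waits for the last writer of every tile it reads and of the tile it
    updates in place. Write-after-read hazards are not tracked: in this
    algorithm a tile is never modified after it has been read as an input.
    """
    last_writer = {}
    deps = {}
    for name, reads, out in cholesky_tasks(t):
        deps[name] = {last_writer[tile] for tile in reads + [out] if tile in last_writer}
        last_writer[out] = name
    return deps


def wavefronts(deps):
    """Group tasks by the earliest step at which all their dependencies are done."""
    level = {}
    for name, needed in deps.items():  # insertion order is already a topological order
        level[name] = 1 + max((level[d] for d in needed), default=-1)
    fronts = defaultdict(list)
    for name, lv in level.items():
        fronts[lv].append(name)
    return [fronts[lv] for lv in sorted(fronts)]


if __name__ == "__main__":
    for step, tasks in enumerate(wavefronts(build_dag(3))):
        print(step, tasks)
```

A dynamic scheduler would not need the explicit levels; it would simply start any task whose dependency set has completed. The levels are shown only to make visible which tasks are independent of each other.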
Parallel tiled QR factorization for multicore architectures
2007
Cited by 81 (41 self)
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine-grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization where parallelism can only be exploited at the level of the BLAS operations.
DAGuE: A generic distributed DAG engine for high performance computing
2010
Cited by 67 (21 self)
The frenetic development of the current architectures places a strain on the current state-of-the-art programming environments. Harnessing the full potential of such architectures has been a tremendous task for the whole scientific computing community. We present DAGuE, a generic framework for architecture-aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures. Applications we consider can be represented as a Directed Acyclic Graph of tasks with labeled edges designating data dependencies. DAGs are represented in a compact, problem-size-independent format that can be queried on demand to discover data dependencies, in a totally distributed fashion. DAGuE assigns computation threads to the cores, overlaps communications and computations, and uses a dynamic, fully distributed scheduler based on cache awareness, data locality and task priority. We demonstrate the efficiency of our approach, using several micro-benchmarks to analyze the performance of different components of the framework, and a Linear Algebra factorization as a use case.
Performance Contracts: Predicting and Monitoring Grid Application Behavior
2001
A proposal for a heterogeneous cluster ScaLAPACK (dense linear solvers)
2001
Cited by 59 (24 self)
In this paper, we study the implementation of dense linear algebra kernels, such as matrix multiplication or linear system solvers, on heterogeneous networks of workstations. The uniform block-cyclic data distribution scheme commonly used for homogeneous collections of processors limits the performance of these linear algebra kernels on heterogeneous grids to the speed of the slowest processor. We present and study more sophisticated data allocation strategies that balance the load on heterogeneous platforms with respect to the performance of the processors. When targeting unidimensional grids, the load-balancing problem can be solved rather easily. When targeting two-dimensional grids, which are the key to scalability and efficiency for numerical kernels, the problem turns out to be surprisingly difficult. We formally state the 2D load-balancing problem and prove its NP-completeness. Next, we introduce a data allocation heuristic, which turns out to be very satisfactory: its practical usefulness is demonstrated by MPI experiments conducted with a heterogeneous network of workstations.
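As a rough illustration of why the unidimensional case is tractable, the sketch below uses a simple greedy allocation under an assumed cost model: each processor has a fixed cycle time per column, columns are independent and equal in cost, and communication is ignored. The function names `allocate` and `makespan` and the example cycle times are hypothetical, and this is not necessarily the algorithm used in the paper. Each column goes to the processor that would finish soonest if it received one more column.

```python
import heapq


def allocate(n_units, cycle_times):
    """Greedily assign n_units equal work units to processors with given cycle times."""
    counts = [0] * len(cycle_times)
    # Heap entries: (finish time if this processor receives one more unit, processor index)
    heap = [(t, i) for i, t in enumerate(cycle_times)]
    heapq.heapify(heap)
    for _ in range(n_units):
        _, i = heapq.heappop(heap)
        counts[i] += 1
        heapq.heappush(heap, ((counts[i] + 1) * cycle_times[i], i))
    return counts


def makespan(counts, cycle_times):
    """Time at which the most heavily loaded processor finishes."""
    return max(c * t for c, t in zip(counts, cycle_times))


if __name__ == "__main__":
    times = [1, 2, 4]              # assumed cycle times: processor 0 is the fastest
    balanced = allocate(7, times)
    print(balanced, makespan(balanced, times))
    uniform = [3, 2, 2]            # an even split that ignores processor speed
    print(uniform, makespan(uniform, times))
```

With these assumed cycle times the greedy allocation gives the fastest processor four columns and finishes at time 4, while the even split is held back to time 8 by the slowest processor, which is the effect the abstract attributes to a uniform distribution.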
Matrix Multiplication on Heterogeneous Platforms
2001
Cited by 53 (15 self)
In this paper, we address the issue of implementing matrix multiplication on heterogeneous platforms. We target two different classes of heterogeneous computing resources: heterogeneous networks of workstations and collections of heterogeneous clusters. Intuitively, the problem is to load balance the work with different-speed resources while minimizing the communication volume. We formally state this problem in a geometric framework and prove its NP-completeness. Next, we introduce a (polynomial) column-based heuristic, which turns out to be very satisfactory: we derive a theoretical performance guarantee for the heuristic and we assess its practical usefulness through MPI experiments.
Efficient Runtime Support for Irregular Block-Structured Applications
1998
Cited by 50 (18 self)
Parallel implementations of scientific applications often rely on elaborate dynamic data structures with complicated communication patterns. We describe a set of intuitive geometric programming abstractions that simplify coordination of irregular block-structured scientific calculations without sacrificing performance. We have implemented these abstractions in KeLP, a C++ runtime library. KeLP's abstractions enable the programmer to express complicated communication patterns for dynamic applications, and to tune communication activity with a high-level, abstract interface. We show that KeLP's flexible communication model effectively manages elaborate data motion patterns arising in structured adaptive mesh refinement, and achieves performance comparable to hand-coded message-passing on several structured numerical kernels. To appear in J. Parallel and Distributed Computing.
Sparse Gaussian Elimination on High Performance Computers
1996
Cited by 40 (7 self)
This dissertation presents new techniques for solving large sparse unsymmetric linear systems on high performance computers, using Gaussian elimination with partial pivoting. The efficiencies of the new algorithms are demonstrated for matrices from various fields and for a variety of high performance machines. In the first part we discuss optimizations of a sequential algorithm to exploit the memory hierarchies that exist in most RISC-based superscalar computers. We begin with the left-looking supernode-column algorithm by Eisenstat, Gilbert and Liu, which includes Eisenstat and Liu's symmetric structural reduction for fast symbolic factorization. Our key contribution is to develop both numeric and symbolic schemes to perform supernode-panel updates to achieve better data reuse in cache and floating-point register...