Results 11 – 20 of 53
Matrix Product on Heterogeneous Master-Worker Platforms
Abstract

Cited by 7 (6 self)
This paper is focused on designing efficient parallel matrix-product algorithms for heterogeneous master-worker platforms. While matrix product is well understood for homogeneous 2D arrays of processors (e.g., Cannon's algorithm and the ScaLAPACK outer-product algorithm), there are three key hypotheses that render our work original and innovative. Centralized data: we assume that all matrix files originate from, and must be returned to, the master. The master distributes data and computations to the workers (whereas in ScaLAPACK, input and output matrices are supposed to be equally distributed among the participating resources beforehand). Typically, our approach is useful in the context of speeding up MATLAB or SCILAB clients running on a server (which acts as the master and initial repository of files). Heterogeneous star-shaped platforms: we target fully heterogeneous platforms, where computational resources have different computing powers. Also, the workers are connected to the master by links of different capacities. This framework is realistic when deploying the application from the server, which is responsible for enrolling authorized resources. Limited memory: as we investigate the parallelization of large problems, we cannot assume that full matrix column blocks can be stored in the worker memories and reused for subsequent updates (as in ScaLAPACK). We have devised efficient algorithms for resource selection (deciding which workers to enroll) and communication ordering (both for input and result messages), and we report a set of numerical experiments on a platform at our site. The experiments show that our matrix-product algorithm has smaller execution times than existing ones, while also using fewer resources.
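The centralized-data setting this abstract describes (all files start at and return to the master, which ships out both data and work) can be sketched in a few lines. The function `master_worker_product` and the speed-proportional row split are illustrative assumptions, not the paper's actual algorithm:

```python
def master_worker_product(A, B, worker_speeds):
    """Toy centralized matrix product: the master splits A into row blocks
    sized proportionally to worker speed, 'ships' each block to a worker,
    and stacks the partial results it collects. A real implementation
    overlaps these transfers with computation; this only shows the data flow."""
    n, total = len(A), sum(worker_speeds)
    rows = [n * s // total for s in worker_speeds]
    rows[-1] += n - sum(rows)            # remainder goes to the last worker
    C, start = [], 0
    for r in rows:
        block = A[start:start + r]       # master -> worker
        # The worker computes its slice of A x B and returns it.
        C += [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
              for row in block]          # worker -> master
        start += r
    return C

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
I2 = [[1, 0], [0, 1]]
print(master_worker_product(A, I2, [1, 3]))  # multiplying by I2 gives back A
```

Here the worker with speed 3 receives three of the four row blocks; the resource-selection and communication-ordering questions the paper studies arise exactly because this simple split ignores link capacities and worker memory.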
A General-Purpose Model for Heterogeneous Computation
, 2000
Abstract

Cited by 6 (2 self)
Heterogeneous computing environments are becoming an increasingly popular platform for executing parallel applications. Such environments consist of a diverse set of machines and offer considerably more computational power at a lower cost than a parallel computer. Efficient heterogeneous parallel applications must account for the differences inherent in such an environment. For example, faster machines should possess more data items than their slower counterparts and communication should be minimized over slow network links. Current parallel applications are not designed with such heterogeneity in mind. Thus, a new approach is necessary for designing efficient heterogeneous parallel programs. We propose
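The abstract's observation that faster machines should hold more data items than slower ones suggests a speed-proportional partitioner. The helper name `partition_proportional` and the largest-remainder rounding are my assumptions for illustration, not the paper's model:

```python
def partition_proportional(n_items, speeds):
    """Distribute n_items across processors in proportion to their speeds.

    Largest-remainder rounding guarantees the counts sum exactly to n_items.
    """
    total = sum(speeds)
    # Ideal (fractional) share for each processor.
    ideal = [n_items * s / total for s in speeds]
    counts = [int(x) for x in ideal]
    # Hand the leftover items to the largest fractional remainders.
    order = sorted(range(len(speeds)),
                   key=lambda i: ideal[i] - counts[i], reverse=True)
    for i in order[:n_items - sum(counts)]:
        counts[i] += 1
    return counts

# A machine twice as fast receives roughly twice the data.
print(partition_proportional(10, [1.0, 2.0, 2.0]))  # → [2, 4, 4]
```

Minimizing communication over slow links, the abstract's second concern, is a separate placement problem that a pure data-count split like this does not address.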
Distributed Data Partitioning for Heterogeneous Processors Based on Partial Estimation of Their Functional Performance Models
Abstract

Cited by 6 (4 self)
The paper presents a new data partitioning algorithm for parallel computing on heterogeneous processors. Like traditional functional partitioning algorithms, it assumes that the speed of the processors is characterized by speed functions rather than speed constants. Unlike the traditional algorithms, it does not assume the speed functions to be given. Instead, it uses a computational kernel to estimate the speed functions of the processors for different problem sizes during its execution. This makes the algorithm distributed, as its execution involves all the heterogeneous processors. The algorithm does not construct the complete speed function for each processor but rather builds and uses partial estimates sufficient for optimal data distribution with a given accuracy. The low execution cost of this algorithm makes it ideal for employment in self-adaptable applications. Experiments with a parallel matrix multiplication application employing this algorithm were performed on a local heterogeneous computational cluster. The results show that the algorithm converges very fast and that its execution time is several orders of magnitude less than the total execution time of the application.
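The partial speed-function estimates described above can be illustrated with piecewise-linear interpolation between a few benchmarked points. The function `estimated_speed` and the sample numbers are hypothetical, a sketch of the idea rather than the paper's estimator:

```python
import bisect

def estimated_speed(samples, n):
    """Piecewise-linear estimate of a processor's speed s(n) from a few
    measured (problem_size, speed) samples, sorted by problem size.

    Stands in for a partial functional performance model: only a handful
    of points are actually benchmarked, the rest is interpolated.
    """
    sizes = [x for x, _ in samples]
    i = bisect.bisect_left(sizes, n)
    if i == 0:
        return samples[0][1]        # below the first sample: clamp
    if i == len(samples):
        return samples[-1][1]       # above the last sample: clamp
    (x0, y0), (x1, y1) = samples[i - 1], samples[i]
    return y0 + (y1 - y0) * (n - x0) / (x1 - x0)

# Two measurements already give a usable first estimate.
model = [(100, 8.0), (1000, 5.0)]   # speed drops as the problem outgrows cache
print(estimated_speed(model, 550))  # → 6.5
```

A distributed partitioner in this spirit would refine each processor's sample list only near the problem sizes it is actually assigned, which is what keeps the estimation cost low.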
Mapping and Load-Balancing Iterative Computations
 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 2004
Abstract

Cited by 5 (0 self)
This paper is devoted to mapping iterative algorithms onto heterogeneous clusters. The application data is partitioned over the processors, which are arranged along a virtual ring. At each iteration, independent calculations are carried out in parallel, and some communications take place between consecutive processors in the ring. The question is to determine how to slice the application data into chunks, and to assign these chunks to the processors, so that the total execution time is minimized. One major difficulty is to embed a processor ring into a network that typically is not fully connected, so that some communication links have to be shared by several processor pairs. We establish a complexity result that assesses the difficulty of this problem, and we design a practical heuristic that provides efficient mapping, routing, link-sharing, and data distribution schemes.
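The objective set up in this abstract, choosing chunk sizes so the slowest processor in the ring finishes as early as possible, can be written as a tiny cost model. `iteration_time` and its parameters are illustrative assumptions, not the paper's exact formulation:

```python
def iteration_time(alphas, speeds, comm_cost):
    """Estimated time of one iteration on a virtual processor ring.

    alphas[i]    -- fraction of the data assigned to processor i
    speeds[i]    -- relative compute speed of processor i
    comm_cost[i] -- time to exchange boundary data with both ring neighbours
    One iteration ends when the slowest processor finishes.
    """
    return max(a / s + c for a, s, c in zip(alphas, speeds, comm_cost))

# Fractions proportional to speed beat a uniform split on this toy platform.
speeds = [1.0, 2.0, 4.0]
comm = [0.1, 0.1, 0.1]
uniform = [1 / 3] * 3
proportional = [s / sum(speeds) for s in speeds]
print(iteration_time(uniform, speeds, comm))       # slowest node dominates
print(iteration_time(proportional, speeds, comm))  # smaller makespan
```

With uniform communication costs the speed-proportional split equalizes the compute terms; the hard part the paper addresses is that embedding the ring into a sparse network makes `comm_cost` depend on the mapping itself.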
Scientific Programming for Heterogeneous Systems—Bridging the Gap between Algorithms and Applications
 Proceedings of the 5th International Symposium on Parallel Computing in Electrical Engineering (PARELEC 2006)
, 2006
Abstract

Cited by 5 (4 self)
High-performance computing in heterogeneous environments is a dynamically developing area. A number of highly efficient heterogeneous parallel algorithms have been designed over the last decade. At the same time, scientific software based on these algorithms is very much under par. The paper analyses the main issues encountered by scientific programmers during the implementation of heterogeneous parallel algorithms in a portable form. It explains how programming systems can address these issues in order to maximally facilitate the implementation of parallel algorithms for heterogeneous platforms, and outlines two existing programming systems for high-performance heterogeneous computing, mpC and HeteroMPI.
Static Load-Balancing Techniques for Iterative Computations on Heterogeneous Clusters
, 2003
Abstract

Cited by 3 (1 self)
This paper is devoted to static load balancing techniques for mapping iterative algorithms onto heterogeneous clusters. The application data is partitioned over the processors. At each iteration, independent calculations are carried out in parallel, and some communications take place. The question
Experimental Study of Six Different Implementations of Parallel Matrix Multiplication on Heterogeneous Computational Clusters of Multicore Processors
 Proceedings of Parallel, Distributed and Network-Based Processing (PDP)
, 2010
Abstract

Cited by 3 (0 self)
Two strategies of distributing computations can be used to implement parallel solvers for dense linear algebra problems on Heterogeneous Computational Clusters of Multicore Processors (HCoMs). These strategies are called the Heterogeneous Process Distribution Strategy (HPS) and the Heterogeneous Data Distribution Strategy (HDS). They are not novel and have been researched thoroughly. However, the advent of multicores necessitates enhancements to them. In this paper, we present these enhancements. Our study is based on experiments using six applications to perform Parallel Matrix-Matrix Multiplication (PMM) on an HCoM employing the two distribution strategies. Keywords: Heterogeneous ScaLAPACK; HeteroMPI; multicore clusters; matrix-matrix multiplication; heterogeneous clusters.
Adaptive approaches for efficient parallel algorithms on cluster-based systems, in "International
 no. 2, InderScience Pub
Abstract

Cited by 3 (0 self)
A few years ago, there was a huge development of new parallel and distributed systems. For many reasons, such as the inherent heterogeneity, the diversity, and the continuous evolution of such computational supports, it is very hard to solve a target problem efficiently using a single algorithm, or to write portable programs that perform well on any architecture. Toward this goal, we propose a generic framework combining communication models and adaptive approaches to deal with the performance modeling problem associated with the design of efficient parallel algorithms on grid computing environments, and we apply this methodology to collective communication operations. Experiments performed on a grid platform show that the framework delivers significant performance while determining the best model-algorithm combination for the given problem and architecture parameters.
HeteroMPI+ScaLAPACK: Towards a ScaLAPACK (Dense Linear Solvers) on Heterogeneous Networks of Computers
 Proceedings of the 13th IEEE International Conference on High Performance Computing (HiPC 2006), Bangalore, India, LNCS Volume 4297
, 2006
Abstract

Cited by 3 (2 self)
The paper presents a tool that ports ScaLAPACK programs designed to run on massively parallel processors to Heterogeneous Networks of Computers. The tool converts ScaLAPACK programs to HeteroMPI programs. The resulting HeteroMPI programs do not aim to extract the maximum performance from a Heterogeneous Network of Computers, but provide an easy and simple way to execute ScaLAPACK programs on such networks with good performance improvements. We demonstrate the efficiency of the resulting HeteroMPI programs by performing experiments with a matrix multiplication application on a local network of heterogeneous computers.
Revisiting matrix product on master-worker platforms
, 2006
Abstract

Cited by 2 (2 self)
This paper is aimed at designing efficient parallel matrix-product algorithms for heterogeneous master-worker platforms. While matrix product is well understood for homogeneous 2D arrays of processors (e.g., Cannon's algorithm and the ScaLAPACK outer-product algorithm), there are three key hypotheses that render our work original and innovative. Centralized data: we assume that all matrix files originate from, and must be returned to, the master. The master distributes both data and computations to the workers (whereas in ScaLAPACK, input and output matrices are initially distributed among participating resources). Typically, our approach is useful in the context of speeding up MATLAB or SCILAB clients running on a server (which acts as the master and initial repository of files). Heterogeneous star-shaped platforms: we target fully heterogeneous platforms, where computational resources have different computing powers. Also, the workers are connected to the master by links of different capacities. This framework is realistic when deploying the application from the server, which is responsible for enrolling authorized resources. Limited memory: because we investigate the parallelization of large problems, we cannot assume that full matrix panels can be stored in the worker memories and reused for subsequent updates (as in ScaLAPACK). The amount of memory available in each worker is expressed as a given number m_i of buffers, where a buffer can store a square block of matrix elements. The size q of these square blocks is chosen so as to harness the power of Level 3 BLAS routines: q = 80 or 100 on most platforms. We have devised efficient algorithms for resource selection (deciding which workers to enroll) and communication ordering (both for input and result messages), and we report a set of numerical experiments on various platforms at the École Normale Supérieure de Lyon and the University of Tennessee.
However, we point out that in this first version of the report, experiments are limited to homogeneous platforms.
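The buffer accounting in the last abstract (m_i buffers, each holding one q × q block) is easy to make concrete. This sketch assumes double-precision elements and is not taken from the paper:

```python
def blocks_that_fit(mem_bytes, q, elem_size=8):
    """Number of q-by-q block buffers that fit in mem_bytes of worker memory.

    elem_size = 8 assumes double-precision matrix elements; one buffer
    holds q * q elements.
    """
    return mem_bytes // (q * q * elem_size)

# With q = 100 (a block size that lets Level 3 BLAS routines run near peak),
# a worker with 1 GiB of memory provides on the order of 13,000 buffers.
print(blocks_that_fit(1 << 30, 100))  # → 13421
```

The scheduling problem the abstract studies is then to decide, within each worker's m_i buffers, how many blocks of each input matrix to keep resident between updates.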