Results 1  10
of
136
Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures
 Journal of Parallel and Distributed Computing
, 1993
"... This paper describes a number of optimizations that can be used to support the efficient execution of irregular problems on distributed memory parallel machines. These primitives (1) coordinate interprocessor data movement, (2) manage the storage of, and access to, copies of offprocessor data, (3) ..."
Abstract

Cited by 146 (17 self)
 Add to MetaCart
(Show Context)
This paper describes a number of optimizations that can be used to support the efficient execution of irregular problems on distributed memory parallel machines. These primitives (1) coordinate interprocessor data movement, (2) manage the storage of, and access to, copies of offprocessor data, (3) minimize interprocessor communication requirements and (4) support a shared name space. We present a detailed performance and scalability analysis of the communication primitives. This performance and scalability analysis is carried out using a workload generator, kernels from real applications and a large unstructured adaptive application (the molecular dynamics code CHARMM). 1 Introduction Over the past few years we have developed a methodology to produce efficient distributed memory code for sparse and unstructured problems in which array accesses are made through a level of indirection. In such problems the dependency structure is determined by variable values known only at runtime. In...
Scalable Load Balancing Techniques for Parallel Computers
, 1994
"... In this paper we analyze the scalability of a number of load balancing algorithms which can be applied to problems that have the following characteristics : the work done by a processor can be partitioned into independent work pieces; the work pieces are of highly variable sizes; and it is not po ..."
Abstract

Cited by 120 (16 self)
 Add to MetaCart
In this paper we analyze the scalability of a number of load balancing algorithms which can be applied to problems that have the following characteristics : the work done by a processor can be partitioned into independent work pieces; the work pieces are of highly variable sizes; and it is not possible (or very difficult) to estimate the size of total work at a given processor. Such problems require a load balancing scheme that distributes the work dynamically among different processors. Our goal here is to determine the most scalable load balancing schemes for different architectures such as hypercube, mesh and network of workstations. For each of these architectures, we establish lower bounds on the scalability of any possible load balancing scheme. We present the scalability analysis of a number of load balancing schemes that have not been analyzed before. This gives us valuable insights into their relative performance for different problem and architectural characteristi...
A Massively Parallel Adaptive Finite Element Method with Dynamic Load Balancing
 Appl. Numer. Math
, 1993
"... We construct massively parallel adaptive finite element methods for the solution of hyperbolic conservation laws. Spatial discretization is performed by a discontinuous Galerkin finite element method using a basis of piecewise Legendre polynomials. Temporal discretization utilizes a RungeKutta meth ..."
Abstract

Cited by 109 (14 self)
 Add to MetaCart
We construct massively parallel adaptive finite element methods for the solution of hyperbolic conservation laws. Spatial discretization is performed by a discontinuous Galerkin finite element method using a basis of piecewise Legendre polynomials. Temporal discretization utilizes a RungeKutta method. Dissipative fluxes and projection limiting prevent oscillations near solution discontinuities. The resulting method is of high order and may be parallelized efficiently on MIMD computers. We demonstrate parallel efficiency through computations on a 1024processor nCUBE/2 hypercube. We present results using adaptiverefinement to reduce the computational cost of the method, and tiling, a dynamic, elementbased data migration system that maintains global load balance of the adaptive method by overlapping neighborhoods of processors that each perform local balancing. 1. Introduction We are studying massively parallel adaptive finite element methods for solving systems ofdimensional hyper...
Job Characteristics of a Production Parallel Scientific Workload on the NASA Ames iPSC/860
, 1995
"... . Statistics of a parallel workload on a 128node iPSC/860 located at NASA Ames are presented. It is shown that while the number of sequential jobs dominates the number of parallel jobs, most of the resources (measured in nodeseconds) were consumed by parallel jobs. Moreover, most of the sequen ..."
Abstract

Cited by 104 (28 self)
 Add to MetaCart
(Show Context)
. Statistics of a parallel workload on a 128node iPSC/860 located at NASA Ames are presented. It is shown that while the number of sequential jobs dominates the number of parallel jobs, most of the resources (measured in nodeseconds) were consumed by parallel jobs. Moreover, most of the sequential jobs were for system administration. The average runtime of jobs grew with the number of nodes used, so the total resource requirements of large parallel jobs were larger by more than the number of nodes they used. The job submission rate during peak day activity was somewhat lower than one every two minutes, and the average job size was small. At night, submission rate was low but job sizes and system utilization were high, mainly due to NQS. Submission rate and utilization over the weekend were lower than on weekdays. The overall utilization was 50%, after accounting for downtime. About 2/3 of the applications were executed repeatedly, some for a significant number of times....
Analyzing Scalability of Parallel Algorithms and Architectures
 Journal of Parallel and Distributed Computing
, 1994
"... The scalability of a parallel algorithm on a parallel architecture is a measure of its capacity to effectively utilize an increasing number of processors. Scalability analysis may be used to select the best algorithmarchitecture combination for a problem under different constraints on the growth of ..."
Abstract

Cited by 98 (19 self)
 Add to MetaCart
(Show Context)
The scalability of a parallel algorithm on a parallel architecture is a measure of its capacity to effectively utilize an increasing number of processors. Scalability analysis may be used to select the best algorithmarchitecture combination for a problem under different constraints on the growth of the problem size and the number of processors. It may be used to predict the performance of a parallel algorithm and a parallel architecture for a large number of processors from the known performance on fewer processors. For a fixed problem size, it may be used to determine the optimal number of processors to be used and the maximum possible speedup that can be obtained. The objective of this paper is to critically assess the state of the art in the theory of scalability analysis, and motivate further research on the development of new and more comprehensive analytical tools to study the scalability of parallel algorithms and architectures. We survey a number of techniques and formalisms t...
A microprocessorbased hypercube supercomputer
 Department of Electrical Engineering, University of Michigan, Ann Arbor, MI
, 1962
"... The architecture and applications of the class of highly parallel distributedmemory multiprocessors based on the hypercube interconnection structure are surveyed. The history of hypercube computers from their conceptual origins in the 1960s to the recent introduction of commercial machines is brief ..."
Abstract

Cited by 92 (6 self)
 Add to MetaCart
The architecture and applications of the class of highly parallel distributedmemory multiprocessors based on the hypercube interconnection structure are surveyed. The history of hypercube computers from their conceptual origins in the 1960s to the recent introduction of commercial machines is briefly reviewed. The properties of hypercube graphs relevant to their use in supercomputers are examined, including connectivity, routing, and embedding. The hardware and software characteristics of current hypercubes are discussed, emphasizing the unique aspects of their operating systems and programming languages. A sample C program is presented to illustrate the singlecode multipledata programming style typical of distributedmemory machines in general, and hypercubes in particular. Two contrasting hypercube applications are presented and analyzed: image processing and branchandbound optimization. The paper concludes with a discussion of current trends. I.
Special Purpose Parallel Computing
 Lectures on Parallel Computation
, 1993
"... A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing ..."
Abstract

Cited by 81 (6 self)
 Add to MetaCart
A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing [365] demonstrated that, in principle, a single general purpose sequential machine could be designed which would be capable of efficiently performing any computation which could be performed by a special purpose sequential machine. The importance of this universality result for subsequent practical developments in computing cannot be overstated. It showed that, for a given computational problem, the additional efficiency advantages which could be gained by designing a special purpose sequential machine for that problem would not be great. Around 1944, von Neumann produced a proposal [66, 389] for a general purpose storedprogram sequential computer which captured the fundamental principles of...
A Parallel Graph Coloring Heuristic
 SIAM J. SCI. COMPUT
, 1992
"... The problem of computing good graph colorings arises in many diverse applications, such as in the estimation of sparse Jacobians and in the development of efficient, parallel iterative methods for solving sparse linear systems. In this paper we present an asynchronous graph coloring heuristic well s ..."
Abstract

Cited by 75 (7 self)
 Add to MetaCart
(Show Context)
The problem of computing good graph colorings arises in many diverse applications, such as in the estimation of sparse Jacobians and in the development of efficient, parallel iterative methods for solving sparse linear systems. In this paper we present an asynchronous graph coloring heuristic well suited to distributed memory parallel computers. We present experimental results obtained on an Intel iPSC/860 which demonstrate that, for graphs arising from finite element applications, the heuristic exhibits scalable performance and generates colorings usually within three or four colors of the bestknown linear time sequential heuristics. For bounded degree graphs, we show that the expected running time of the heuristic under the PRAM computation model is bounded by EO(log(n)= log log(n)). This bound is an improvement over the previously known best upper bound for the expected running time of a random heuristic for the graph coloring problem.
The TorusWrap Mapping For Dense Matrix Calculations On Massively Parallel Computers
 SIAM J. SCI. STAT. COMPUT
, 1994
"... Dense linear systems of equations are quite common in science and engineering, arising in boundary element methods, least squares problems and other settings. Massively parallel computers will be necessary to solve the large systems required by scientists and engineers, and scalable parallel algori ..."
Abstract

Cited by 71 (5 self)
 Add to MetaCart
Dense linear systems of equations are quite common in science and engineering, arising in boundary element methods, least squares problems and other settings. Massively parallel computers will be necessary to solve the large systems required by scientists and engineers, and scalable parallel algorithms for the linear algebra applications must be devised for these machines. A critical step in these algorithms is the mapping of matrix elements to processors. In this paper, we study the use of the toruswrap mapping in general dense matrix algorithms, from both theoretical and practical viewpoints. We prove that, under reasonable assumptions, this assignment scheme leads to dense matrix algorithms that achieve (to within a constant factor) the lower bound on interprocessor communication. We also show that the toruswrap mapping allows algorithms to exhibit less idle time, better load balancing and less memory overhead than the more common row and column mappings. Finally, we discuss ...