| G. Fox, et al., Solving Problems on Concurrent Processors: Volume 1, Prentice Hall, Englewood Cliffs, NJ, 1988. |
....bonded interactions per atom. In step (3) the local force vectors are summed across all processors in such a way that each processor ends up with the total force on each of its N/P atoms. This is the sub-vector f_z. This force summation is a parallel communication operation known as a fold [12]. Various algorithms have been developed for performing the operation efficiently on different parallel machines and architectures [3, 12, 26]. The key point is that each processor must essentially receive N/P values from every other processor to sum the total forces on its atoms. The total volume ....
....ends up with the total force on each of its N/P atoms. This is the sub-vector f_z. This force summation is a parallel communication operation known as a fold [12]. Various algorithms have been developed for performing the operation efficiently on different parallel machines and architectures [3, 12, 26]. The key point is that each processor must essentially receive N/P values from every other processor to sum the total forces on its atoms. The total volume of communication (per processor) is thus P × N/P, and the fold operation thus scales as N. In step (4) the summed forces are used to ....
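The fold described in these excerpts can be illustrated with a small serial simulation. This is only a sketch under stated assumptions: the names `fold` and `fold_receive_volume` are hypothetical, the P processors are modeled as a list of Python lists, and a contiguous block of N/P entries per processor is assumed (the excerpts do not say how entries are assigned).

```python
def fold(local_vectors):
    """Simulate a fold over P 'processors'.

    local_vectors[p] is processor p's full-length (N) local vector.
    Returns out, where out[p] holds the summed values for the N/P
    entries that processor p owns (assumed: contiguous block p).
    """
    P = len(local_vectors)
    N = len(local_vectors[0])
    assert N % P == 0, "sketch assumes N is a multiple of P"
    block = N // P
    out = []
    for p in range(P):
        lo, hi = p * block, (p + 1) * block
        # Processor p sums entries lo..hi-1 across every processor's copy.
        out.append([sum(vec[i] for vec in local_vectors) for i in range(lo, hi)])
    return out


def fold_receive_volume(N, P):
    """Values received per processor if it gets N/P values from each
    of the other P - 1 processors (roughly N for large P)."""
    return (P - 1) * (N // P)
```

With 4 processors and N = 8, each processor ends up with 2 summed entries and receives 6 values, which is close to N, in line with the scaling stated in the excerpt.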
[Article contains additional citation context not shown here]
G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker. Solving Problems on Concurrent Processors: Volume 1. Prentice Hall, Englewood Cliffs, NJ, 1988.
....copy of f. This step also scales as N/P, since there are a small number of bonded interactions per atom. In step (3) the force vector copies are summed across all P processors in such a way that each processor ends up with the total force on only its N/P atoms. This is called a fold operation [4, 16, 25] and scales optimally as N, the volume of data in the force vector f. We note that the fold operation is less costly than a global sum operation where each processor ends up with the total force on all N atoms, as is done in the RD algorithms discussed in [12, 21, 23, 32]. A global sum operation ....
....is 8 times more expensive than a fold. The N/P forces resulting from the fold are used to update atom positions and velocities in step (4). Finally, in step (5) the new atom positions in x_z are shared among all P processors in preparation for the next timestep. This is called an expand operation [4, 16, 25] and is essentially the inverse of the fold operation. Now each processor starts with a small N/P piece of the position vector and ends up with a copy of the entire N-length vector. The cost of this communication step also scales as N. The chief advantage of the RD algorithm we have outlined is ....
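The expand operation in this excerpt can be sketched in the same serial style. The function name `expand` is hypothetical, and the sketch assumes processor p owns the p-th contiguous piece of the vector; it shows only the end state, not any particular message schedule.

```python
def expand(pieces):
    """Simulate an expand over P 'processors'.

    pieces[p] is the N/P-length piece owned by processor p.
    Returns a list of P full-length vectors, one copy per processor.
    """
    # Concatenate the pieces in processor order (assumed ownership order).
    full = [x for piece in pieces for x in piece]
    # Every processor ends up with its own copy of the whole vector.
    return [list(full) for _ in pieces]
```

Each processor starts with N/P values and finishes with N, so it must receive N - N/P values, which is the order-N cost the excerpt states.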
G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker. Solving Problems on Concurrent Processors: Volume 1. Prentice Hall, Englewood Cliffs, NJ, 1988.
....where particles move through grids have changing communication and load balance requirements which can be detected and adapted for most efficiently in a MIMD code. 3 MP Computational Tools Most computational problems in science and engineering exhibit a significant degree of parallelism [3, 8]. If the problems are regular in that identical operations can be performed in parallel on each computational element (e.g. grid cell or particle) and static so that the topology of interactions between neighboring elements does not change as the simulation progresses, then both SIMD and MIMD ....
G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker. Solving Problems on Concurrent Processors: Volume 1. Prentice Hall, Englewood Cliffs, NJ, 1988.
....successfully used for a variety of problems to obtain fast convergence. It has been observed that in spite of the difficulty in parallelizing these methods, they are often faster than competing preconditioners. Issues related to parallel implementation of the multigrid algorithms are discussed in [4, 5, 2, 3]. In general, however, the effectiveness of these preconditioners reduces considerably when the system matrices are indefinite. We have devised a new algorithm to obtain a preconditioned system from the original n × n linear system, which can then be solved using an iterative method. A ....
G.C. Fox, M. Johnson, G. Lyzenga, S.W. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors: Volume 1. Prentice-Hall, Englewood Cliffs, NJ, 1988.
....successfully used for a variety of problems to obtain fast convergence. It has been observed that in spite of the difficulty in parallelizing these methods, they are often faster than competing preconditioners. Issues related to parallel implementation of the multigrid algorithms are discussed in [17, 18, 12, 13]. In general, however, the effectiveness of these preconditioners reduces considerably when the system matrices are indefinite. Efficient parallel algorithms are indispensable for the solution of the large scale linear systems arising from the generalized Stokes problem. Parallel algorithms for ....
G. C. Fox, M. Johnson, G. Lyzenga, S. W. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors: Volume 1. Prentice-Hall, Englewood Cliffs, NJ, 1988.
....3 have been isolated into two modules. Mentat has been ported to six platforms, including networks of Sun 3s, Sun 4s, Silicon Graphics Irises, the Intel iPSC/2, the Intel iPSC/860, and the TMC CM-5. The high level and applications performance aspects of Mentat have been presented elsewhere [14][15][16]. In [14] we present an overview, the Mentat philosophy, the Mentat approach to parallel computing, and performance results. In [16] the performance of a range of applications with a range of speedups are explored; the results are promising, speedups are not only good they are competitive ....
....the application. We have extensive experience with Mentat performance on applications from areas as diverse as electrical engineering, physics, biochemistry, and computer science, on platforms as diverse as networks of workstations and the Intel iPSC/860 (gamma). The results are detailed elsewhere [15][16]. In several cases handcoded parallel implementations of the application exist. These provide us with a metric to measure the penalty of using Mentat and MDF. The results are very encouraging. Performance is good, and competitive with the hand coded implementations. Further, the use of the ....
G. Fox et al., Solving Problems on Concurrent Processors Volume I, Prentice Hall, Englewood Cliffs, NJ, 1988.
....simulations [4, 5]. The current state of the art is such that simulating ten to hundred thousand atom systems for picoseconds takes hours of CPU time on machines such as the Cray Y-MP. The fact that MD computations are inherently parallel has been extensively discussed in the literature [11, 22]. There has been considerable effort in the last few years by researchers to exploit this parallelism on various machines. The majority of the work that has included implementations of proposed algorithms has been for single instruction multiple data (SIMD) parallel machines such as the CM-2 ....
....the corresponding x 2 piece of the position vector. In addition, it must know the entire position vector x (shown spanning the columns) to compute the matrix elements in F 2 . algorithms have been developed for performing this operation efficiently on different parallel machines and architectures [7, 22, 54]. We use an idea outlined in Fox, et al. [22] that is simple, portable, and works well on a variety of machines. We describe it briefly because it is the chief communication component of both the AD algorithms of this section and the force decomposition algorithms presented in the next section. ....
[Article contains additional citation context not shown here]
G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker. Solving Problems on Concurrent Processors: Volume 1. Prentice Hall, Englewood Cliffs, NJ, 1988.
....w n^2/p time, the total parallel execution time for all the movements of the sub-blocks of both the matrices is given by the following equation: $T_p = \frac{n^3}{p} + 2 t_s \sqrt{p} + 2 t_w \frac{n^2}{\sqrt{p}}$ (3.12). Fox's Algorithm: This algorithm is due to Fox et al. and is described in detail in [39] and [38]. The input matrices are initially distributed among the processors in the same manner as in the simple algorithm in Section 3.2.1. The algorithm works in $\sqrt{p}$ iterations, where p is the number of processors being used. The data communication in the algorithm involves successive broadcast of the ....
G. C. Fox, M. Johnson, G. Lyzenga, S. W. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors: Volume 1. Prentice-Hall, Englewood Cliffs, NJ, 1988.
....parallel algorithm. Clearly, for a given W, the parallel algorithm cannot use more than C(W) processors. C(W) depends only on the parallel algorithm, and is independent of the architecture. For example, for multiplying two N × N matrices using Fox's parallel matrix multiplication algorithm [37], W = N^3 and C(W) = N^2 = W^{2/3}. It is easily seen that if the processor-time product [5] is Θ(W) (i.e., the algorithm is cost optimal) then C(W) = O(W). Maximum Number of Processors Usable, p_max: The number of processors that yield maximum speedup S_max for a given W. ....
G. C. Fox, M. Johnson, G. Lyzenga, S. W. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors: Volume 1. Prentice-Hall, Englewood Cliffs, NJ, 1988.
....3: Communication statistics for LDLC
Comm. type     Mesg. count   Mesg. volume
Factor                1721         102992
Source                 890          75276
Local aggr.            146           9158
Global aggr.          3083          77815
Total                 5840         265241
7 Related work. Various previous works have addressed the use of index based mapping strategies for dense matrix factorization [14, 9, 1]. Several researchers have also proposed nested mapping strategies for sparse matrix factorization [12, 10, 16, 7]. The combination of index based mapping and nested mapping strategies has been studied independently by other researchers in specific contexts. In [13] a mapping strategy for a ....
G. Fox, et al., Solving problems on concurrent processors: Volume 1, Prentice Hall, Englewood Cliffs, NJ, 1988.
....effective. While these techniques show great promise for improving the run time of very large scale problems, they do little to help with small to modest sized systems in the range 50K atoms, which is the focus of this paper. Many implementations of parallel molecular dynamics have been developed [4, 6, 8, 9, 14, 19, 20], but few groups have addressed issues related to the use of massively parallel machines with 100K to 1M processors for small to modest size systems. In this paper we address two main issues: a good decomposition method that can take advantage of a massively parallel system and the communication ....
Fox, G. C., Johnson, M. A., Lyzenga, G. A., Otto, S. W., Salmon, J. K. and D. W. Walker, Solving Problems on Concurrent Processors: Volume 1, Prentice Hall, Englewood Cliffs, New Jersey, 1988.
....Clearly, for a given W, the parallel algorithm cannot use more than Γ(W) processors. Γ(W) depends only on the parallel algorithm, and is independent of the architecture. For example, for multiplying two N × N matrices using Fox's parallel matrix multiplication algorithm [12], W = N^3 and Γ(W) = N^2 = W^{2/3}. It is easily seen that if the processor-time product [1] is Θ(W) (i.e., the algorithm is cost optimal) then Γ(W) ≤ Θ(W). 3 Scalability Metrics for Parallel Systems. It is a well known fact that given a parallel architecture and a ....
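The degree-of-concurrency figure quoted in this excerpt follows from a one-line substitution. The worked step below is a sketch, assuming the problem size W is the N^3 operation count given in the excerpt and that at most one processor is used per element of the N × N result:

```latex
W = N^3 \;\Longrightarrow\; N = W^{1/3},
\qquad
N^2 = \left(W^{1/3}\right)^{2} = W^{2/3}.
```

For instance, N = 100 gives W = 10^6 and a processor limit of N^2 = 10^4 = (10^6)^{2/3}.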
....of concurrency inherent in the algorithm. In the last decade, there has been a growing realization that for a variety of parallel systems, given any number of processors p, speedup arbitrarily close to p can be obtained by simply executing the parallel algorithm on big enough problem instances [47, 31, 36, 45, 22, 12, 38, 41, 40, 58]. Kumar and Rao [31] developed a scalability metric relating the problem size to the number of processors necessary for an increase in speedup in proportion to the number of processors. This metric is known as the isoefficiency function. If a parallel system is used to solve a problem instance of ....
G. C. Fox, M. Johnson, G. Lyzenga, S. W. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors: Volume 1. Prentice-Hall, Englewood Cliffs, NJ, 1988.
.... replace ORB [22]. Singh developed a partitioning technique called costzone for parallel FMA for shared memory systems [21, 20]. Grama, Kumar, and Sameh [7, 8] derived both Modular Scattered Decomposition (MSD) and Morton Ordering Based Scattered Decomposition (MOBSD) from scattered decomposition [5]. It has been shown that existing partitioning techniques work fine for the Barnes-Hut algorithm or the FMA on shared memory systems. However, the irregular partition pattern resulting from the above partitioning techniques substantially increases communication overhead in translations of ....
G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker, Solving Problems on Concurrent Processors: Volume I, Prentice Hall, 1988.
....the number of processors eventually exceeds the degree of concurrency inherent in the algorithm. For a variety of parallel systems, given any number of processors p, speedup arbitrarily close to p can be obtained by simply executing the parallel algorithm on big enough problem instances (e.g. [19, 6, 23, 30, 12, 4, 25, 28, 27, 40]). The ease with which a parallel algorithm can achieve speedups proportional to p on a parallel architecture can serve as a measure of the scalability of the parallel system. The isoefficiency function [21, 22] is one such metric of scalability which is a measure of an algorithm's capability ....
G. C. Fox, M. Johnson, G. Lyzenga, S. W. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors: Volume 1. Prentice-Hall, Englewood Cliffs, NJ, 1988.
....multicomputers, there has also been a significant amount of work on the more restricted class of SIMD machines [15, 23, 29, 116, 166, 304, 352, 399, 410]. Locality based algorithms have been developed for all kinds of interconnection network structures. For an introduction to this area, see e.g. [34, 123, 185, 224]. The following is a short list of some of the more common network structures for which such algorithms have been produced: arrays [224, 292, 335, 360], trees and Sneptrees [165, 242, 264], pyramids [5, 284, 351], the mesh of trees structure [178, 224], fat trees [232, 233], hypercubes [70, 80, ....
G C Fox, M A Johnson, G A Lyzenga, S W Otto, J K Salmon, and D W Walker. Solving Problems on Concurrent Processors: Volume 1. General Techniques and Regular Problems. Prentice Hall, 1988.
....Thus, for some problems, it is more important to execute many time steps on a modest size problem than few time steps on a large size problem. We analyze the use of current and future MPPs for these modest sized problems. Many implementations of parallel molecular dynamics have been developed [2, 3, 5, 6, 9, 11, 14, 15], but very little work has addressed issues related to the use of machines with 50,000 processors for modest sized problems. In this paper we focus on a fine grained decomposition of molecular dynamics applications that can be parallelized beyond the number of atoms in the system. In particular, ....
Fox, G. C., M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker, Solving Problems on Concurrent Processors: Volume 1, Prentice Hall, Englewood Cliffs, New Jersey, 1988.
....the number of processors eventually exceeds the degree of concurrency inherent in the algorithm. For a variety of parallel systems, given any number of processors p, speedup arbitrarily close to p can be obtained by simply executing the parallel algorithm on big enough problem instances (e.g. [21, 12, 29, 34, 16, 10, 31, 33, 32, 40]). The ease with which a parallel algorithm can achieve speedups proportional to p on a parallel architecture can serve as a measure of the scalability of the parallel system. The isoefficiency function [24, 26] is one such metric of scalability which is a measure of an algorithm's capability to ....
....t_w n^2/p time, the total parallel execution time for all the movements of the sub-blocks of both the matrices is given by the following equation: $T_p = \frac{n^3}{p} + 2 t_s \sqrt{p} + 2 t_w \frac{n^2}{\sqrt{p}}$ (3). 4.3 Fox's Algorithm. This algorithm is due to Fox et al. and is described in detail in [11] and [10]. The input matrices are initially distributed among the processors in the same manner as in the algorithm in Section 4.1. The algorithm works in $\sqrt{p}$ iterations, where p is the number of processors being used. The data communication in the algorithm involves successive broadcast of the ....
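The excerpt is cut off before it finishes describing the communication, so the following is a sketch of the commonly described broadcast-multiply-roll form of Fox's algorithm rather than a transcription of the cited text. It simulates a q × q processor grid (q = √p) serially; all function names are hypothetical, and n is assumed to be a multiple of q.

```python
def matmul(X, Y):
    """Plain dense product of two list-of-lists matrices."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def matadd(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split_blocks(M, q):
    """Cut an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    b = len(M) // q
    return [[[row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]
             for j in range(q)] for i in range(q)]


def join_blocks(B):
    """Inverse of split_blocks."""
    q = len(B)
    return [[x for j in range(q) for x in B[i][j][r]]
            for i in range(q) for r in range(len(B[i][0]))]


def fox_multiply(A, B, q):
    """Simulate Fox's algorithm on a q x q grid of 'processors'."""
    n = len(A)
    assert n % q == 0, "sketch assumes n is a multiple of q"
    b = n // q
    Ab, Bb = split_blocks(A, q), split_blocks(B, q)
    Cb = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]
    for step in range(q):  # q = sqrt(p) iterations
        for i in range(q):
            k = (i + step) % q  # block A[i][k] is broadcast along grid row i
            for j in range(q):
                # Bb[i][j] currently holds the block that started at B[k][j].
                Cb[i][j] = matadd(Cb[i][j], matmul(Ab[i][k], Bb[i][j]))
        Bb = Bb[1:] + Bb[:1]  # roll the B blocks one grid row upward
    return join_blocks(Cb)
```

Calling `fox_multiply(A, B, 2)` on two 4 × 4 matrices models p = 4 processors and takes two broadcast-multiply-roll iterations.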
G. C. Fox, M. Johnson, G. Lyzenga, S. W. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors: Volume 1. Prentice-Hall, Englewood Cliffs, NJ, 1988.
....petaflop performance. This class of machines is likely to be memory limited because of cost. Molecular dynamics is a good application to examine because it has modest memory requirements. Many implementations of parallel molecular dynamics have been developed for the first two classes of MPPs [3, 4, 6, 7, 11, 16, 17], but few groups have addressed issues related to the use of the third class, particularly for small to modest sized problems. In this paper we focus on a fine grained decomposition of the molecular dynamics algorithm that parallelizes beyond the number of atoms in the systems. Traditional ....
Fox, G. C., M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker, Solving Problems on Concurrent Processors: Volume 1, Prentice Hall, Englewood Cliffs, New Jersey, 1988.
....sparse matrix vector multiplication. For i = 1, ...: y_i = A x_i; x_{i+1} = y_i; EndFor. 3.1.2 Communication Primitives. BBA requires three kinds of communication primitives. The first primitive adds vectors present in different processors in each row and is called a fold operation [5, 6]. As seen from Figure 3, processor P_{αβ} owns a block of the matrix A_{αβ} and vector x_β. If z_{αβ} = A_{αβ} x_β, then y_α = z_{α1} + z_{α2} + ··· + z_{α,npx}. The fold operation is used for adding vectors z_{αβ} of length ....
G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker, Solving Problems on Concurrent Processors: Volume 1, Prentice-Hall, Englewood Cliffs, NJ, 1988.
....This formula ignores network conflicts during the redistribution of y, which will occur on a mesh, but can be avoided on a hypercube, at least for a square logical mesh. It should be noted that this is essentially a sparse version of the dense matrix vector multiplication algorithm in Fox, et al. [5] for hypercubes, with the additional observation that redistribution is needed to complete the data movement for the iteration. This method was independently discovered for hypercubes by Hendrickson, Leland and Plimpton [6] who observed the need to avoid network conflicts for the transpose in ....
G. Fox, et al., Solving Problems on Concurrent Processors: Volume 1, Prentice Hall, Englewood Cliffs, NJ, 1988.
....of the sparsity pattern of the matrix. The algorithm we describe was developed in connection with research on efficient methods of organizing parallel many body calculations [8]. We subsequently learned our matrix vector multiply or matvec algorithm is very similar to an algorithm described in [6]. We have, nevertheless, chosen to present our algorithm here for two reasons. First, we improve upon the algorithm in [6] in several ways. Specifically, we discuss how to overlap communication and computation and thereby reduce the overall run time. We also show how to map the sub blocks of the ....
....of organizing parallel many body calculations [8]. We subsequently learned our matrix vector multiply or matvec algorithm is very similar to an algorithm described in [6]. We have, nevertheless, chosen to present our algorithm here for two reasons. First, we improve upon the algorithm in [6] in several ways. Specifically, we discuss how to overlap communication and computation and thereby reduce the overall run time. We also show how to map the sub blocks of the matrix to processors in a novel way which reduces the cost of the communication on parallel machines with hypercube ....
[Article contains additional citation context not shown here]
G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker, Solving problems on concurrent processors: Volume 1, Prentice Hall, Englewood Cliffs, NJ, 1988.
....increased [17, 21]. An alternative method for assigning matrix elements to processors is the torus wrap mapping. Variants of this assignment scheme have been independently discovered by several researchers, and consequently given a number of different names including cyclic [23], scattered [15], grid [36], and subcube grid [8], as well as torus wrap [30]. The mapping was first described by O'Leary and Stewart in a data flow context [29, 30], and the synergy between the torus wrap mapping and the hypercube topology was observed by Fox [14, 15]. Variants of the torus wrap mapping ....
.... names including cyclic [23], scattered [15], grid [36], and subcube grid [8], as well as torus wrap [30]. The mapping was first described by O'Leary and Stewart in a data flow context [29, 30], and the synergy between the torus wrap mapping and the hypercube topology was observed by Fox [14, 15]. Variants of the torus wrap mapping have been used in high performance LU factorization codes on a number of different machines [3, 4, 6, 9, 27, 35, 36]. Assuming each matrix element is stored on only a single processor, Ashcraft built on work by Saad to show that for LU factorization, the ....
[Article contains additional citation context not shown here]
G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker, Solving problems on concurrent processors: Volume 1, Prentice Hall, Englewood Cliffs, NJ, 1988.
....summation, processor P_{αβ} needs to know the summed results for only the N/P particles it is responsible for updating, namely those in x_α^β. These N/P values are the sum of the corresponding elements across all the processors in row block α. This can be accomplished with a fold operation [7], as outlined in Fig. 3. The communication pattern of the fold operation is precisely the reverse of that in the expand operation. At each y := x_α^β; For k = log_2(√P) − 1, ..., 0: P' := P_{αβ} with k-th bit of β flipped; Send y to processor P'; Receive z from processor P ....
G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker. Solving problems on concurrent processors: Volume 1. Prentice Hall, Englewood Cliffs, NJ, 1988.
....of the sparsity pattern of the matrix. The algorithm we describe was developed in connection with research on efficient methods of organizing parallel many body calculations [7]. We subsequently learned our matrix vector multiply or matvec algorithm is very similar to an algorithm described in [5]. We have, nevertheless, chosen to present our algorithm here for two reasons. First, we improve upon the algorithm in [5] in several ways. Specifically, we discuss how to overlap communication and computation and thereby reduce the overall run time. We also show how to map the sub blocks of the ....
....of organizing parallel many body calculations [7]. We subsequently learned our matrix vector multiply or matvec algorithm is very similar to an algorithm described in [5]. We have, nevertheless, chosen to present our algorithm here for two reasons. First, we improve upon the algorithm in [5] in several ways. Specifically, we discuss how to overlap communication and computation and thereby reduce the overall run time. We also show how to map the sub blocks of the matrix to processors in a novel way which reduces the cost of the communication on parallel machines with hypercube ....
[Article contains additional citation context not shown here]
G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker, Solving problems on concurrent processors: Volume 1, Prentice Hall, Englewood Cliffs, NJ, 1988.
....for processor q. In steps (1)-(3) each processor learns how many other processors want to send it data. In step (1) each of the P processors initializes a P-length vector with zeroes and stores a 1 in each location corresponding to a processor it needs to send data to. The fold operation [2] in step (2) communicates this vector in an optimal way; processor q ends up with the sum across all processors of only location q, which is the total number of messages it will receive. In step (4) each processor sends a short message to the processors it has data for, indicating how much data ....
....RCB decomposition can be represented as a set of P − 1 cuts, one of which is stored by each processor as the RCB decomposition is carried out. In step (3) we communicate this cut information so that every processor has a copy of the entire set of cuts. This is done via an expand operation [2]. Before contact detection is performed, each processor must know (1) Send contact data to old RCB decomposition (2) Perform parallel RCB to rebalance (3) Share RCB cut info with all processors (4) For all my surfaces If surface extends beyond my RCB box Determine what other processors need it ....
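Steps (1) and (2) of this excerpt can be sketched as follows. The function name `count_incoming` and the `send_lists` input format are hypothetical, and the fold is simulated serially as a column sum rather than with real message passing.

```python
def count_incoming(send_lists, P):
    """send_lists[p] lists the processors that p needs to send data to.

    Returns counts, where counts[q] is the number of messages
    processor q will receive.
    """
    # Step (1): each processor builds a P-length 0/1 flag vector.
    flags = [[0] * P for _ in range(P)]
    for p, dests in enumerate(send_lists):
        for q in dests:
            flags[p][q] = 1
    # Step (2): fold. Processor q ends up with the sum over all
    # processors of location q only.
    return [sum(flags[p][q] for p in range(P)) for q in range(P)]
```

For example, if processors 0, 1 and 3 all need to send to processor 2, then `counts[2]` is 3, so processor 2 knows to post three receives.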
G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker, Solving Problems on Concurrent Processors: Volume 1, Prentice Hall, Englewood Cliffs, NJ, 1988.
....and exploited in typical placement strategies; if two tasks communicate repeatedly with each other they are placed in close proximity. In systems which support dynamic placement, changing communication and processing requirements can result in task migration from node to node at run time [Fox88]. Static placement however tries to map a group of tasks to processor nodes based on information about an application's communication requirements specified at compile time. This information may be a single collapsed graph expressing all of the communication which must take place during the ....
Geoffrey C. Fox. Solving Problems On Concurrent Processors: Volume I. Prentice Hall, 1988.
....q. In steps (1)-(3) each processor learns how many other processors want to send it data. In step (1) each of the P processors initializes a local copy of a P-length vector with zeroes and stores a 1 in each location corresponding to a processor it needs to send data to. The fold operation [4] in step (2) communicates this vector in an optimal way; processor q ends up with the sum across all processors of only location q, which is the total number of messages it will receive. In step (4) each processor sends a short message to the processors it has data for, indicating how much data ....
....of P − 1 cuts, where P is the number of processors. One of these cuts is stored by each processor as the RCB operations are carried out. In step (3) we communicate this cut information so that every processor has a copy of the entire set of cuts. This is done optimally via an expand operation [4]. Before the SPH calculations can be performed, each processor must know about nearby particles that overlap its sub domain and are thus potential neighbors of its own particles. This information is acquired in steps (4) and (5). First in step (4) each processor checks which of its SPH particles ....
G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker, Solving Problems on Concurrent Processors: Volume 1, Prentice Hall, Englewood Cliffs, NJ, 1988.
....code) the file system, and the database systems available. These differences must be masked. 2.1 Approach. We consider a metasystem to be a combination of two different types of systems, parallel processing systems (PPS) and heterogeneous distributed computing systems (HDCS) [1][2][4][5][6][11][13][22][23][25][26][30][31][21]. Borrowing from the object oriented lexicon, any solution to the metasystems problem will inherit attributes and behaviors from both areas (Figure 2). While it is important that we inherit many features from both PPS and HDCS, some features of these systems are at ....
G. Fox et al., Solving Problems on Concurrent Processors Volume I, Prentice Hall, Englewood Cliffs, NJ, 1988.
....summation, processor P_{αβ} needs to know the summed results for only the N/P particles it is responsible for updating, namely those in x_α^β. These N/P values are the sum of the corresponding elements across all the processors in row block α. This can be accomplished with a fold operation [8], as outlined in Fig. 3. y := f_{αβ}; For k = 0, ..., log_2(√p) − 1: y_1 := top half of y vector; y_2 := bottom half of y vector; P' := P_{αβ} with k-th bit of β flipped; If bit k of β is 0 Then Send y_2 to processor P'; Receive z from processor P'; y := y_1 + z ....
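The pseudocode in this excerpt is truncated after the first branch, so the sketch below fills in the other branch by symmetry (send the top half, keep the bottom half); that completion is an assumption, as are the function name `fold_recursive_halving` and the serial simulation of the exchanges. It also assumes a power-of-two number of processors and a vector length that is a multiple of it.

```python
def fold_recursive_halving(vectors):
    """Simulate a hypercube fold by recursive halving.

    vectors[p] is processor p's local vector. Returns y, where y[p] is
    the fully summed slice that processor p holds at the end.
    """
    P = len(vectors)
    assert P > 0 and P & (P - 1) == 0, "sketch assumes P is a power of two"
    assert len(vectors[0]) % P == 0, "sketch assumes P divides the length"
    y = [list(v) for v in vectors]
    k = 0
    while (1 << k) < P:
        new_y = []
        for p in range(P):
            partner = p ^ (1 << k)  # processor with k-th bit flipped
            half = len(y[p]) // 2
            if (p >> k) & 1 == 0:
                # Bit k is 0: send bottom half, keep top half,
                # receive the partner's top half.
                keep, recv = y[p][:half], y[partner][:half]
            else:
                # Assumed symmetric branch: send top half, keep bottom half.
                keep, recv = y[p][half:], y[partner][half:]
            new_y.append([a + b for a, b in zip(keep, recv)])
        y = new_y
        k += 1
    return y
```

Because bit 0 selects the first halving, the slice a processor ends up owning in this sketch is indexed by the bit-reversal of its processor number; with 4 processors, processor 1 holds block 2 and processor 2 holds block 1. Each processor sends N/2 + N/4 + ... values in total, which stays below N.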
G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker. Solving problems on concurrent processors: Volume 1. Prentice Hall, Englewood Cliffs, NJ, 1988.
....which is the matrix distribution defined by $\phi_0(i) = \phi_1(i) = i \bmod \sqrt{p}$, for $0 \le i < n$, (6) where $q_0 = q_1 = \sqrt{p}$. This distribution is optimal for linear algebra computations such as dense LU decomposition [6, 17]. It is also known under other names such as scattered square decomposition [11], cyclic storage [19] and torus wrap mapping [17]. The term grid distribution should not be confused with the term grid used in the context of f A : n Theta n; distr(A) OE 0 ; OE 1 ) v : n; distr(v) distr(diag(A) g f I st = fi : 0 i n OE 0 (i) s (9j : 0 j n OE 1 (j) t ....
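The grid (torus wrap) distribution in this excerpt maps row and column indices to processor coordinates by taking them modulo √p. A minimal sketch, with the hypothetical names `grid_owner` and `owned_elements` and with q standing for √p:

```python
def grid_owner(i, j, q):
    """Processor coordinates (s, t) owning matrix element (i, j)
    on a q x q processor grid, using phi(i) = i mod q for both axes."""
    return (i % q, j % q)


def owned_elements(s, t, n, q):
    """All (i, j) of an n x n matrix stored on processor (s, t)."""
    return [(i, j) for i in range(n) for j in range(n)
            if i % q == s and j % q == t]
```

On a 2 × 2 grid, each processor of a 4 × 4 matrix holds 4 elements spread evenly over the matrix, which is what makes the mapping attractive for LU-style computations where the active region shrinks.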
G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker, Solving Problems on Concurrent Processors: Volume I, General Techniques and Regular Problems. Englewood Cliffs, NJ: Prentice-Hall, 1988.
....algorithm. Clearly, for a given W, the parallel algorithm cannot use more than C(W) processors. C(W) depends only on the parallel algorithm, and is independent of the architecture. For example, for multiplying two N × N matrices using Fox's parallel matrix multiplication algorithm [9], W = N^3 and C(W) = N^2 = W^{2/3}. It is easily seen that if the processor-time product [1] is Θ(W) (i.e., the algorithm is cost optimal) then C(W) ≤ Θ(W). Maximum Number of Processors Usable, p_max: The number of processors that yield maximum speedup S_max for a given W ....
G. C. Fox, M. Johnson, G. Lyzenga, S. W. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors: Volume 1. Prentice-Hall, Englewood Cliffs, NJ, 1988.
No context found.
G. Fox, et al., Solving Problems on Concurrent Processors: Volume 1, Prentice Hall, Englewood Cliffs, NJ, 1988.
No context found.
G. Fox et al, Solving Problems on Concurrent Processors Volume I, Prentice Hall, Englewood Cliffs, NJ, 1988.
No context found.
G. Fox, et al., Solving Problems on Concurrent Processors: Volume 1, Prentice Hall, Englewood Cliffs, NJ, 1988.
No context found.
G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors: Volume 1. Prentice-Hall, 1988.
No context found.
G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors: Volume 1. Prentice-Hall, 1988.
No context found.
G. Fox, et al., Solving Problems on Concurrent Processors: Volume 1, Prentice Hall, Englewood Cliffs, NJ, 1988.