S. Lennart Johnsson and Kapil K. Mathur. Distributed level 1 and level 2 BLAS. Technical report, Thinking Machines Corp., 1992. In preparation.


This paper is cited in the following contexts:
High Performance, Scalable Scientific Software Libraries - Johnsson, Mathur (1994)   (1 citation)  Self-citation (Johnsson Mathur)

....DCHH86, DCDH88]. Efficient implementations of this set of routines are architecture dependent, and for most architectures are written in assembly code. Most scientific codes achieve high performance when built on top of this set of routines. On distributed memory architectures a distributed BLAS [JM92, MJ94, Joh93b], DBLAS, is required in addition to a local BLAS, LBLAS, in each node [JO92]. Moreover, a set of communication routines is required for data motion between nodes. But not all algorithms parallelize well, and there is an algorithmic architectural dependence. Thus, architectural ....

S. Lennart Johnsson and Kapil K. Mathur. Distributed level 1 and level 2 BLAS. Technical report, Thinking Machines Corp., 1992. In preparation.


Scientific Software Libraries for Scalable Architectures - Johnsson, Mathur   Self-citation (Johnsson Mathur)

....per node over a range of CM-5 system sizes. 64-bit precision. are architecture dependent, and for most architectures are written in assembly code. Most scientific codes achieve high performance when built on top of this set of routines. On distributed memory architectures a distributed BLAS [10, 21, 15], DBLAS, is required in addition to a local BLAS, LBLAS, in each node [16]. Moreover, a set of communication routines is required for data motion between nodes. But not all algorithms parallelize well, and there is an algorithmic architectural dependence. Thus, architectural independence of ....

S. Lennart Johnsson and Kapil K. Mathur. Distributed level 1 and level 2 BLAS. Technical report, Thinking Machines Corp., 1992. In preparation.


CMSSL: A Scalable Scientific Software Library - Johnsson (1993)   Self-citation (Johnsson)

....respect to performance is traditionally captured in the BLAS (Basic Linear Algebra Subprograms) [2, 3, 10]. Efficient implementations of this set of routines are architecture dependent, and for most architectures written in assembly code. On distributed memory architectures, a distributed BLAS [6, 8, 12] (DBLAS) is required in addition to a local BLAS (LBLAS) in each node [9]. Moreover, a set of communication routines are required for data motion between nodes. But, not all (high level) al .... (Footnote 1: ENSA is an Euler and Navier-Stokes finite element code [4] developed at the Division of Applied ....)

S. Lennart Johnsson and Kapil K. Mathur. Distributed level 1 and level 2 BLAS. Technical report, Thinking Machines Corp., 1992. In preparation.


Local Basic Linear Algebra Subroutines (LBLAS) for.. - Johnsson, Ortiz (1992)   Self-citation (Johnsson)

....When the data motion issues for mixing arrays of different rank are satisfactorily solved for the DBLAS, then the CMSSL LBLAS will be extended to the exact same functionality as the corresponding routine in the conventional BLAS. For a discussion of the data motion issues in the DBLAS see [4, 5, 6, 11]. In the following, for the convenience of the reader, we discuss the LBLAS using the traditional BLAS names whenever the distinction is either irrelevant or clear from the context. Whenever there is a need to stress that the discussion refers to the LBLAS, we prefix the BLAS names with CMSSL, ....

....i.e. concurrent reductions on disjoint segments of the index space, result in arrays that are allocated across some subset of nodes. For a detailed description of the memory management system on the CM-200, see [16, 17]. For a description of BLAS operating on distributed data structures, see [5, 6, 11]. Here we focus on the BLAS in each node and do not further address the issues for distributed data structures and communication. Each floating point processor has 64-bit wide data paths, but the path to local memory is only 32 bits wide. Figure 2 illustrates the local node architecture. Each ....

S. Lennart Johnsson and Kapil K. Mathur. Distributed level 1 and level 2 BLAS. Technical report, Thinking Machines Corp., 1992. In preparation.


Local Basic Linear Algebra Subroutines (LBLAS) for.. - Johnsson, Ortiz (1992)   Self-citation (Johnsson)

....When the data motion issues for mixing arrays of different rank are satisfactorily solved for the DBLAS, then the CMSSL LBLAS will be extended to the exact same functionality as the corresponding routine in the conventional BLAS. For a discussion of the data motion issues in the DBLAS see [4, 5, 6, 11]. In the following, for the convenience of the reader, we discuss the LBLAS using the traditional BLAS names whenever the distinction is either irrelevant or clear from the context. Whenever there is a need to stress that the discussion refers to the LBLAS, we prefix the BLAS names with CMSSL, ....

....i.e. concurrent reductions on disjoint segments of the index space, result in arrays that are allocated across some subset of nodes. For a detailed description of the memory management system on the CM-200, see [16, 17]. For a description of BLAS operating on distributed data structures, see [5, 6, 11]. Here we focus on the BLAS in each node and do not further address the issues for distributed data structures and communication. Each floating point processor has 64-bit wide data paths, but the path to local memory is only 32 bits wide. Figure 2 illustrates the local node architecture. Each ....

S. Lennart Johnsson and Kapil K. Mathur. Distributed BLAS. Technical report, Thinking Machines Corp., 1992. In preparation.


All-to-all Broadcast and Applications on the Connection Machine - Brunet, Johnsson (1991)   Self-citation (Johnsson)

....were to send its data to every other module before any computations start, then for P processors the memory requirements would grow by a factor of P. This growth in memory is unacceptable for N-body computations, but acceptable in some matrix operations, such as matrix-vector multiplication [9]. The implementation of the all-to-all broadcast function is consistent with the programming languages for the Connection Machine systems, in which memory management is largely hidden. However, as on many architectures, programming the Connection Machine systems in an architecturally independent ....

S. Lennart Johnsson and Kapil K. Mathur. Distributed BLAS. Technical report, Thinking Machines Corp., 1992. In preparation.


Local Basic Linear Algebra Subroutines (LBLAS) for the CM-5/5E - Kramer, Johnsson, Hu (1994)   Self-citation (Johnsson)

....their distribution across the memory units, often referred to as layout, the overall availability of memory on the machine, and estimates of the performance for the various alternatives. The global software layer for the BLAS on the CM-5/5E is referred to as the Distributed BLAS (DBLAS) [12]. The ScaLAPACK library [11] was introduced as a library for performing DBLAS computations based on specific data layouts and generic level 3 BLAS routines. The DBLAS controls how data is moved between processing nodes and which operations are to be performed on each node at each step of ....

S. Lennart Johnsson and Kapil K. Mathur. Distributed BLAS. Technical report, Thinking Machines Corp., 1992. In preparation.


Language and Compiler Issues in Scalable High Performance.. - Johnsson (1992)   (3 citations)  Self-citation (Johnsson)

.... the Fast Fourier Transform (FFT) belong to the second [39, 69]. For computations that involve more than one object, such as matrix multiplication, the shapes, sizes, and distributions of the different objects have a significant impact on what algorithms should be chosen for optimal performance [27, 29, 43, 44, 53]. Partitioning, replication, and nodal array shape may vary widely. In the CMSSL, the decision of which algorithm to use for a given function is made at run time. Some of the issues in choosing an algorithm, depending upon data distribution and object shapes and sizes, are indirectly addressed in ....

....distributed data. We again focus on the data motion, and demonstrate the consequences of various approaches, and of canonical data layouts. Effective conservation and management of data motion is not possible with a canonical layout of data arrays. In the CMSSL, three levels of a Distributed BLAS [43, 44], DBLAS, are used for the implementation of matrix operations. 8.2.1 Using level 1 DOT DBLAS for matrix-vector multiplication. Using a DOT routine to perform matrix-vector multiplication implies that the product is evaluated as a sequence of operations of the form: y(I) ← A(I, :) x(:). We first ....
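
As a rough illustration of the DOT-based evaluation this excerpt describes, the sketch below computes a matrix-vector product as one inner product per row of A. It is a serial, single-node illustration only: the function names `dot` and `matvec_by_dot` are hypothetical, the row-wise form y(i) = sum over j of A(i, j) x(j) is assumed as the standard reading of a DOT-based product, and no CMSSL interface or data layout is implied.

```python
def dot(u, v):
    """Level-1 style inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def matvec_by_dot(A, x):
    """Evaluate y = A x as a sequence of dot products, one per row of A."""
    return [dot(row, x) for row in A]


# Small example: a 2 x 3 matrix times a vector of length 3.
A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
x = [1.0, 0.0, -1.0]
y = matvec_by_dot(A, x)
```

On distributed data each of these dot products would additionally need a reduction across the nodes that hold pieces of the row, which is the data-motion cost the excerpt is concerned with.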

[Article contains additional citation context not shown here]

S. Lennart Johnsson and Kapil K. Mathur. Distributed level 1 and level 2 BLAS. Technical report, Thinking Machines Corp., 1992. In preparation.


Language and Compiler Issues in Scalable High Performance.. - Johnsson (1992)   (3 citations)  Self-citation (Johnsson)

.... the Fast Fourier Transform (FFT) belong to the second [39, 69]. For computations that involve more than one object, such as matrix multiplication, the shapes, sizes, and distributions of the different objects have a significant impact on what algorithms should be chosen for optimal performance [27, 29, 43, 44, 53]. Partitioning, replication, and nodal array shape may vary widely. In the CMSSL, the decision of which algorithm to use for a given function is made at run time. Some of the issues in choosing an algorithm, depending upon data distribution and object shapes and sizes, are indirectly addressed in ....

....distributed data. We again focus on the data motion, and demonstrate the consequences of various approaches, and of canonical data layouts. Effective conservation and management of data motion is not possible with a canonical layout of data arrays. In the CMSSL, three levels of a Distributed BLAS [43, 44], DBLAS, are used for the implementation of matrix operations. 8.2.1 Using level 1 DOT DBLAS for matrix-vector multiplication. Using a DOT routine to perform matrix-vector multiplication implies that the product is evaluated as a sequence of operations of the form: y(I) ← A(I, :) x(:). We first ....

[Article contains additional citation context not shown here]

S. Lennart Johnsson and Kapil K. Mathur. Distributed BLAS. Technical report, Thinking Machines Corp., 1992. In preparation.


Multiplication of Matrices of Arbitrary Shape on a Data.. - Mathur, Johnsson (1992)   (9 citations)  Self-citation (Johnsson Mathur)

....or a two-dimensional array of arbitrary shape. The algorithms presented here achieve perfect arithmetic load balance. For operand matrices assigned to a subset of processors, load balancing is an important issue. Algorithms that address load balancing issues in greater detail are discussed in [17]. The algorithms presented here are data parallel adaptations of the standard matrix multiplication algorithm requiring 2PQR arithmetic operations for the multiplication of a P × Q matrix by a Q × R matrix. The index space for these operations is depicted in Figure 1. R = 1 corresponds ....

....be carried out and the result stored as required. In this presentation, BLAS operating on distributed data structures are referred to as Distributed BLAS (DBLAS). The data motion issues for some level 2 DBLAS are discussed briefly below. For a more extensive discussion the reader is referred to [17]. 3.1 Matrix-vector and vector-matrix multiplication. The evaluation of the matrix-vector product y ← Ax requires the operations: 1. Aligning the vector x with the column axis of A. 2. Spreading the vector x along the row axis of A, such that there is one copy of the appropriate segment of x ....
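
The align and spread steps listed in this excerpt can be sketched with a small serial simulation of a two-dimensional node grid. This is an illustration under stated assumptions, not the CMSSL implementation: the names `bounds` and `distributed_matvec`, the contiguous block layout, and the grid shape are hypothetical, and because the excerpt is cut off after step 2, the local multiply and the reduction across each grid row are assumed as the natural remaining steps.

```python
def bounds(n, parts):
    """Boundaries of `parts` contiguous, nearly equal segments of range(n)."""
    return [n * k // parts for k in range(parts + 1)]


def distributed_matvec(A, x, grid_rows, grid_cols):
    """Simulate y = A x on a grid_rows x grid_cols grid of nodes.

    Node (r, c) is assumed to hold row block r and column block c of A.
    """
    rb = bounds(len(A), grid_rows)
    cb = bounds(len(x), grid_cols)
    # Step 1 (align): segment c of x matches the columns held by grid column c.
    x_seg = [x[cb[c]:cb[c + 1]] for c in range(grid_cols)]
    # Step 2 (spread): every grid row receives its own copy of each segment.
    x_local = [[list(seg) for seg in x_seg] for _ in range(grid_rows)]
    y = []
    for r in range(grid_rows):
        partial = []
        for c in range(grid_cols):
            # Assumed step: node (r, c) multiplies its block by its segment.
            block = [row[cb[c]:cb[c + 1]] for row in A[rb[r]:rb[r + 1]]]
            seg = x_local[r][c]
            partial.append([sum(a * b for a, b in zip(row, seg))
                            for row in block])
        # Assumed step: sum the partial results across the nodes of grid row r.
        y.extend(sum(vals) for vals in zip(*partial))
    return y


A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x = [1, 1, 1, 1]
y = distributed_matvec(A, x, grid_rows=2, grid_cols=2)
```

The spread in step 2 is where memory is replicated: each grid row keeps a full copy of x, so the copies grow with the number of grid rows while the blocks of A stay in place.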

[Article contains additional citation context not shown here]

S. Lennart Johnsson and Kapil K. Mathur. Distributed BLAS. Technical report, Thinking Machines Corp., 1992. In preparation.


Multiplication of Matrices of Arbitrary Shape on a Data.. - Kapil Mathur (1994)   (9 citations)  Self-citation (Johnsson Mathur)

....array, or a two-dimensional array of arbitrary shape. The algorithms presented here achieve perfect arithmetic load balance. For operand matrices assigned to a subset of processors, load balancing is an important issue. Algorithms that address load balancing issues are discussed in [15]. The algorithms presented here are data parallel adaptations of the standard matrix multiplication algorithm requiring 2PQR arithmetic operations for the multiplication of a P × Q matrix by a Q × R matrix. The index space for these operations is depicted in Figure 1. R = 1 corresponds ....

....can be carried out and the result stored as required. In this presentation BLAS operating on distributed data structures are referred to as Distributed BLAS (DBLAS). The data motion issues for some level 2 DBLAS are discussed briefly below. For a more extensive discussion the reader is referred to [15]. 3.1 Matrix-vector and vector-matrix multiplication. The evaluation of the matrix-vector product y ← Ax requires the operations: 1. Align the vector x with the column axis of A. 2. Spread the vector x along the row axis of A, such that there is one copy of the appropriate segment of x for ....

[Article contains additional citation context not shown here]

S. Lennart Johnsson and Kapil K. Mathur. Distributed BLAS. Technical report, Thinking Machines Corp., November 1991.

