19 citations found.
S. L. Johnsson, M. Jacquemin, and R. L. Krawitz. Communications Efficient MultiProcessor FFT. Journal of Computational Physics, 102:381--397, 1992.

This paper is cited in the following contexts:
Scalable Data Parallel Algorithms for Texture.. - Bader.. (1993)   (5 citations)

....of Some Basic Operations A two dimensional Fast Fourier Transform (FFT) is a commonly used technique in digital image processing, and several algorithms in this paper make use of it. The FFT is well suited for parallel applications because it is efficient and inherently parallel ([20], [1], [22], [23], [42]). With an image size of n elements, O(n log n) operations are needed for an FFT. On a parallel machine with p processors, O((n/p) log n) computational steps are required. The communications needed for an FFT are determined by the FFT algorithm implemented on a particular parallel machine. The ....

S. L. Johnsson, M. Jacquemin, and R. L. Krawitz. Communications Efficient MultiProcessor FFT. Journal of Computational Physics, 102:381--397, 1992.


Performance of the CM-5, ENEE 646 Class Report - Martin, Bader (1994)   (3 citations)

.... the network very efficient for regular communications patterns commonly used in massively parallel processing, since highly parallel code utilizes permutations when performing data parallel grid shifts ([1]) or for common mathematical operations such as the Cooley-Tukey Fast Fourier Transform ([4], [5]). The fat tree interconnection network is a packet switching network. Each processing node is responsible for splitting messages into data packets, and injecting and receiving the packets from the data network. The format of each data packet will be discussed in the following section. 2 ....

S. Lennart Johnsson, Michel Jacquemin, and Robert L. Krawitz. Communications Efficient Multi-Processor FFT. Journal of Computational Physics, 102:381--397, 1992.


Communication Pipelining In Hypercubes - de Cerio, González.. (1995)

....to several neighbors, instead of one long message to a single neighbor. Of course, to apply this technique every node must be able to send messages in parallel along multiple links. In some way, communication pipelining can be seen as a generalization of the direct pipelining technique proposed in [JoJK91, JoKr92] as it is outlined in section 4.1. However, there are significant differences. First, our technique can be applied to a more general class of algorithms. Second, communication pipelining can adapt the degree of pipelining to each particular case whereas direct pipelining has a fixed degree. ....

.... 4.1 Related work The FFT has been extensively studied in the literature due to its utility for many scientific areas. One of the most widely used algorithms is known as Cooley-Tukey [CoTu65]. Such an algorithm can be executed in a hypercube multicomputer following the strategies proposed in [JoJK91] and [ToSw91], which are known as bi section and i cycles respectively. However, such strategies make use of only one of the links of each node at any time. S.L. Johnsson and R.L. Krawitz in [JoJK91, JoKr92] introduced for the first time the idea of using all the links at the same time. The ....

[Article contains additional citation context not shown here]

S.L. Johnsson, M. Jacquemin and R.L. Krawitz, Communication Efficient Multiprocessor FFT, Thinking Machine Corp. Technical Report TR-220, Oct. 1991


Communication Pipelining In Hypercubes - de Cerio, González.. (1996)

....to several neighbors, instead of one long message to a single neighbor. Of course, to apply this technique every node must be able to send messages in parallel along multiple links. In some way, communication pipelining can be seen as a generalization of the direct pipelining technique proposed in [JJK91,JK92] as it is outlined in section 4.1. However, there are significant differences. First, our technique can be applied to a more general class of algorithms. Second, communication pipelining can adapt the degree of pipelining to each particular case whereas direct pipelining has a fixed degree. ....

....4.1. Related work The FFT has been extensively studied in the literature due to its utility for many scientific areas. One of the most widely used algorithms is known as Cooley-Tukey [CT65]. Such an algorithm can be executed in a hypercube multicomputer following the strategies proposed in [JJK91] and [TS91], which are known as bi section and i cycles respectively. However, such strategies make use of only one of the links of each node at any time. S.L. Johnsson and R.L. Krawitz in [JJK91,JK92] proposed to use all the links of the hypercube at the same time to compute the FFT. The ....

[Article contains additional citation context not shown here]

S. L. Johnsson, M. Jacquemin, and R. L. Krawitz. Communication Efficient Multiprocessor FFT. Technical Report TR-220, Thinking Machine Corp., October 1991.


Efficient Parallel FFTs for Different Computational Models - Nadia Shalaby

....used in most areas of applied sciences and engineering [8]. As more of these applications migrate to the platform of parallel computing, the exigency of a highly efficient and portable parallel FFT kernel becomes evident. Considerable valuable research was conducted on this topic, such as [5, 6, 7, 9, 15, 16]. However, studies tended to be geared towards specific architectures, or did not fully analyze the system and problem parameters at hand, or gave underspecified algorithms, and overall did not present a general approach for the parallelization of other orthogonal transforms. The above perspective ....

....N th root of unity, and jk N are the corresponding twiddle factors. However, the term parallel FFT is overloaded in its many variations, each of which may affect the form of the output, and the computational, storage or communication complexities, while still conforming to the DFT definition [5, 6, 15]. For our purposes, it is sufficient to demonstrate our methodology for any such variation. Thus, arbitrarily and for the sake of clarity, we select the forward, unscaled, ordered, radix 2, one dimensional FFT with precomputed twiddle factors. Likewise, we arbitrarily choose the decimation in ....

[Article contains additional citation context not shown here]

S. L. Johnsson, M. Jacquemin, and R. L. Krawitz, Communication efficient multi--processor FFT, Journal of Computational Physics, 102 (1992), pp. 381--397.


A Technique for Overlapping Computation and Communication.. - Gupta, Huang, al. (1998)   (2 citations)

....is achieved. Proper orchestrating of the communication and computation is the key to our overlapping scheme. This we achieve by introducing no extra synchronization overhead. In the context of FFT, many researchers have developed techniques for overlapping computation and communication. In [18], Johnsson et al. present fine grained pipelined FFT algorithms for hypercube connected SIMD machines. FFT algorithms for both vector multiprocessors with shared memory and hypercube are presented by Swarztrauber in [25]. In [1], Agarwal et al. present a 3D FFT algorithm which overlaps computation ....

S. L. Johnsson, M. Jacquemin, and R. L. Krawitz. Communication efficient multi-processor FFT. J. Comp. Physics, 102(2):381--397, 1992.


Multiprocessor Out-of-Core FFTs with Distributed Memory and.. - Cormen, al. (1997)   (4 citations)

....and it desires FFTs with up to 64 gigapoints. Although the literature contains some related work, the approach in this paper is unique. There have been a few papers on out-of-core FFTs on uniprocessors [Bai90, Bre69, CN96]. There are also some papers on in-core FFTs on multiprocessors [Cal96, JJK92, Swa87, Zhu90]; each of these papers assumes some interconnection network topology. The only previous out-of-core implementation for a multiprocessor of which we are aware is by Sweet and Wilson [SW95]. They use a CM-5 with a Scalable Disk Array [TMC92], which appears to the programmer as one ....

S. Lennart Johnsson, Michel Jacquemin, and Robert L. Krawitz. Communication efficient multi-processor FFT. Journal of Computational Physics, 102:381--397, 1992.


Scalable Data Parallel Algorithms for Texture Synthesis and.. - Bader (1993)   (5 citations)

....of Some Basic Operations A two dimensional Fast Fourier Transform (FFT) is a commonly used technique in digital image processing, and several algorithms in this paper make use of it. The FFT is well suited for parallel applications because it is efficient and inherently parallel ([20], [1], [22], [23], [42]). With an image size of n elements, O(n log n) operations are needed for an FFT. On a parallel machine with p processors, O((n/p) log n) computational steps are required. The communications needed for an FFT are determined by the FFT algorithm implemented on a particular ....

S. L. Johnsson, M. Jacquemin, and R. L. Krawitz. Communications Efficient MultiProcessor FFT. Journal of Computational Physics, 102:381--397, 1992.
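
The operation counts quoted in this context can be checked with a small sketch. The code below is only an illustration, not the algorithm of the cited paper: it assumes a plain recursive radix-2 decimation-in-time FFT, and the names `fft_count` and `butterflies_per_processor` are hypothetical. It counts butterflies, of which there are (n/2) log2 n in total, and divides them evenly over p processors while ignoring all communication cost.

```python
import cmath


def fft_count(x):
    """Recursive radix-2 decimation-in-time FFT (illustrative sketch).

    Returns (transform, number_of_butterflies); len(x) must be a power of two.
    """
    n = len(x)
    if n == 1:
        return list(x), 0
    even, c_even = fft_count(x[0::2])
    odd, c_odd = fft_count(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # One butterfly: combine even[k] with the twiddled odd[k].
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out, c_even + c_odd + n // 2


def butterflies_per_processor(n, p):
    """Butterflies per processor if the (n/2)*log2(n) total splits evenly.

    Assumes n is a power of two and p divides the total; communication is ignored.
    """
    total = (n // 2) * (n.bit_length() - 1)
    return total // p
```

For n = 16 the recursion performs 32 butterflies, and an even split over p = 4 processors leaves 8 per processor, in line with the O((n/p) log n) bound.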


A Vector Space Framework for Parallel Stable Permutations - Shalaby, Johnsson (1995)   Self-citation (Johnsson)

....space, the algebraic representation of permutations is particularly useful in algorithmic design. Algebraic representation of communication operations (broadcast, reduction, multicast and permutations) for the design of optimal algorithms in the link bound model for binary cubes was employed in [13, 14, 15, 19, 20, 21, 24, 25, 30, 37, 38, 39, 42]. Algebraic frameworks were also used in a more general setting, as required in mapping and compiling parallel programs, in [3, 4, 8, 9]. By adopting the mathematical concept of an algebro-geometric permutation, and representing our ....

.... m jq; l jq xch;q (i) 0 i=r Gamma1 = fi log Q S Gamma1 j=0 Pi m jq; l jq xch;Q : Similarly, fi 0 j=log Q S Gamma1 Pi m jq; l jq xch;Q (a) fi log Q S Gamma1 j=0 Pi m jq; l jq xch;Q (a) Pi m;l xch;S (a) We can now also easily prove the following well known result [24, 25, 38]. Theorem 3.9 For q; l; m 2 I such that Q = 2 q , 0 l v Gamma q and v m r Gamma q, loc( Pi m;l xch;Q ; n) V=Q. Proof: Let a 2 A r , then, a = a r Gamma1 : am q Gamma1 : am z q bits : a v j a v Gamma1 : a l q Gamma1 : a l z q bits : a 0 ....

S. Lennart Johnsson, Michel Jacquemin, and Robert L. Krawitz. Communication efficient multi-processor FFT. Journal of Computational Physics, 102(2):381--397, October 1992.


High Performance, Scalable Scientific Software Libraries - Johnsson, Mathur (1994)   (1 citation)  Self-citation (Johnsson)

....is naturally load balanced with respect to arithmetic. For the FFT, the butterfly computations proceed from the most significant bit in the index space to the least significant bit. Thus, with the most significant bits allocated to local memory, no communication is required for the leading bits [JJK92, JHJR87, Swa87, TS91]. 1.6.1.6 Sparse matrices regular grids The purpose of sparse matrix techniques is to take advantage of the zero nonzero structure of a matrix to reduce both storage and arithmetic needs. Address calculation is often a significant portion of the time required in sparse ....

.... time for the one to all and all to one personalized communication operations, require a time of 2Q P Nr log2 N c [JH89b]. The time required for the FFT computation itself is proportional to PQ NrN c in the node limited model, and to PQ 2NrN c in the channel limited model [JHJR87, JJK92, JK92, Swa87, TS91]. Thus, the data redistribution time in Alternative 1 far exceeds the communication time required for the FFT computation. ....

S. Lennart Johnsson, Michel Jacquemin, and Robert L. Krawitz. Communication efficient multi-processor FFT. Journal of Computational Physics, 102(2):381--397, October 1992.


Scientific Software Libraries for Scalable Architectures - Johnsson, Mathur   Self-citation (Johnsson)

....The FFT computations are uniform across the index space and the load balance is independent of whether cyclic or consecutive allocation is used. However, the cyclic data allocation yields lower communication needs than the consecutive allocation by up to a factor of two for unordered transforms [13, 14]. The reason is that the computations of the FFT always proceed from the high to the low order bit in the index space. With the consecutive allocation the high order bits are associated with processor addresses and must be mapped to local memory addresses before local butterfly computations can be ....

S. Lennart Johnsson, Michel Jacquemin, and Robert L. Krawitz. Communication efficient multi-processor FFT. Journal of Computational Physics, 102(2):381--397, October 1992.
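
The effect of the two allocations described in this context can be sketched as follows. This is an illustration under stated assumptions, not code from the cited work: N = 2**r elements are spread over P = 2**q processors, "consecutive" is taken to mean that the high-order q index bits select the processor, "cyclic" that the low-order q bits do, and the function names are hypothetical. The sketch only marks which butterfly stages need no communication; it does not reproduce the factor-of-two figure.

```python
def split_index(i, r, q, layout):
    """Split a global index i (r bits) into (processor, local_address).

    'consecutive': processor = high-order q bits (block layout, assumed).
    'cyclic':      processor = low-order q bits (round-robin layout, assumed).
    """
    if layout == "consecutive":
        return i >> (r - q), i & ((1 << (r - q)) - 1)
    if layout == "cyclic":
        return i & ((1 << q) - 1), i >> q
    raise ValueError(layout)


def stage_is_local(bit, r, q, layout):
    """A butterfly on index bit `bit` pairs i with i ^ (1 << bit).

    The stage is local when both partners always sit on the same processor.
    """
    mask = 1 << bit
    return all(
        split_index(i, r, q, layout)[0] == split_index(i ^ mask, r, q, layout)[0]
        for i in range(1 << r)
    )


def local_stages(r, q, layout):
    """Stages in FFT order (high-order bit first); True where no communication is needed."""
    return [stage_is_local(b, r, q, layout) for b in range(r - 1, -1, -1)]
```

With r = 4 and q = 2, `local_stages(4, 2, "cyclic")` marks the first two stages as local, while `local_stages(4, 2, "consecutive")` marks the last two, which matches the observation that the consecutive allocation must remap the high-order bits before local butterflies can start.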


Data Motion and High Performance Computing - S. Lennart Johnsson (1994)   (1 citation)  Self-citation (Johnsson)

....for the loading of 16M bytes of data in 64 bit precision. After the computation this many bytes must be stored as well. A radix M algorithm yields a byte flop requirement of 32/(5 log2 M). Precomputed twiddle factors increase the memory bandwidth requirements by less than a factor of two [23]. The idea of using a radix equal to the size of the available memory was first proposed by Gentleman [9]. Sorting exhibits a behavior similar to the FFT. The potential benefits of exploiting locality of reference for three typical computations are quantified in Table 2 for relatively large size ....

....reduction, or all to one reduction. It is well known that the FFT requires communication in the form of a butterfly network. However, in the case of performing FFT on distributed memory processors, the communication may actually be better performed as all to all personalized communication [23]. Other data reference patterns associated with the FFT are bit reversal, for ordered FFT, and vector reversal, for real to complex FFT. Many divide and conquer methods in higher dimensions require communication in the form of pyramid networks in one or several dimensions. Finally, we ....

[Article contains additional citation context not shown here]

S. Lennart Johnsson, Michel Jacquemin, and Robert L. Krawitz. Communication efficient multi-processor FFT. Journal of Computational Physics, 102(2):381--397, October 1992.


HPFBench: A High Performance Fortran Benchmark Suite - Hu, Jin, Johnsson..   (1 citation)  Self-citation (Johnsson)

....in the HPFBench benchmark, the bit reversal constitutes an AAPC whenever the size of the local data set of the axis subject to bit reversal is at least as large as the number of processing nodes along the axis subject to bit reversal. A detailed analysis of the parallel FFT can be found in [37, 47]. The FFT is one of the most widely used algorithms in science, engineering design, and in signal processing. By being a very efficient algorithm, the operation count per data point is relatively low, O(log n) and communication is global and extensive. Hence, FFTs tend to expose weaknesses in ....

S. Lennart Johnsson, Michel Jacquemin, and Robert L. Krawitz. Communication efficient multi-processor FFT. Journal of Computational Physics, 102(2):381--397, October 1992.
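
The condition under which bit reversal becomes all-to-all personalized communication (AAPC) can be checked with a short sketch. It assumes a one-dimensional axis of 2**r elements in a consecutive (block) layout over 2**q processing nodes, which is an assumption made here for illustration and not something the quoted context specifies; `bit_reverse`, `destinations` and `is_all_to_all` are hypothetical names.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (i & 1)
        i >>= 1
    return out


def destinations(src_proc, r, q):
    """Nodes that src_proc sends data to under bit reversal.

    Assumes a consecutive (block) layout: node = high-order q bits of the index.
    """
    local_bits = r - q
    dests = set()
    for local in range(1 << local_bits):
        i = (src_proc << local_bits) | local
        dests.add(bit_reverse(i, r) >> local_bits)
    return dests


def is_all_to_all(r, q):
    """True when every node sends to every node (itself included)."""
    p = 1 << q
    return all(len(destinations(s, r, q)) == p for s in range(p))
```

With 4 nodes (q = 2), `is_all_to_all(4, 2)` holds because each node owns 4 elements, at least as many as there are nodes, whereas `is_all_to_all(3, 2)` fails because each node owns only 2 elements.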


Hierarchical Load Balancing for Parallel Fast Legendre.. - Shalaby, Johnsson (1997)   (6 citations)  Self-citation (Johnsson)

....[Fig. 2. Load bal. FFT: N P 2 , N = 16, P = 4] The first effective parallelization of the FFT computation in a load balanced manner was proposed in [9, 22]. Subsequently, this approach was generalized in [16, 19] where an algorithmic specification is derived with a proof of optimality for a variety of parallel computational models. The load balanced algorithm compels all the butterfly operations to be local for N=P stages, employing a ....

S. L. Johnsson, M. Jacquemin, and R. L. Krawitz, Communication efficient multi--processor FFT, Journal of Computational Physics, 102 (1992), pp. 381--397.


Language and Compiler Issues in Scalable High Performance.. - Johnsson (1992)   (3 citations)  Self-citation (Johnsson)

....must pass through a channel. The one to all and all to one personalized communication operations require a time of 2Q P Nr log 2 Nc [31]. The time required for the FFT computation itself is proportional to PQ NrNc in the node limited model, and to PQ 2NrNc in the channel limited model [38, 40, 41, 64, 69]. Thus, the data redistribution time in Alternative 1 far exceeds the communication time required for the FFT computation. The communication times are summarized in Table 3, and the arithmetic speedups in Table 4. Note that with a single instance library routine and canonical layouts, Alternative ....

....used by the CMSSL FFT are:
  - Butterfly network emulation.
  - Binary code to binary-reflected Gray code conversion.
  - Binary-reflected Gray code to binary code conversion.
  - Bit reversal.
The communication complexities are summarized in Table 3. For performance measurements see [40, 41]. ....

[Article contains additional citation context not shown here]

S. Lennart Johnsson, Michel Jacquemin, and Robert L. Krawitz. Communication efficient multi-processor FFT. Journal of Computational Physics, 102(2):381--397, October 1992.
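
Two of the conversions named in this context, binary code to binary-reflected Gray code and back, can be sketched on plain integers. This is the textbook formulation and says nothing about how CMSSL implements them as communication operations; the function names are hypothetical.

```python
def binary_to_gray(n):
    """Binary code to binary-reflected Gray code."""
    return n ^ (n >> 1)


def gray_to_binary(g):
    """Binary-reflected Gray code back to binary code.

    XORs together g, g >> 1, g >> 2, ... until the shifted value is zero.
    """
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n
```

Consecutive integers map to Gray codes that differ in exactly one bit, which is the property that makes the encoding useful for embedding grids in binary cubes where neighbouring nodes differ in one address bit.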


DPF: A Data Parallel Fortran Benchmark Suite - Hu, Johnsson, Kehagias, Shalaby (1995)   Self-citation (Johnsson)

....layout in the DPF benchmark, the bit reversal constitutes an AAPC whenever the size of the local data set of the axis subject to bit reversal is at least as large as the number of processing nodes along the axis subject to bit reversal. A detailed analysis of the parallel FFT can be found in [41, 59].

Table 13: FLOP count and memory usage for FFT

FFT Dimension   FLOP Count     Memory Count (32 bit precision)   Memory Count (64 bit precision)
1-D FFT         5N log N       60N                               100N
2-D FFT         10N^2 log N    76N^2                             116N^2
3-D FFT         15N^3 log N    92N^3                             136N^3

The FFT is one of the most widely used algorithms in science, ....

S. Lennart Johnsson, Michel Jacquemin, and Robert L. Krawitz. Communication efficient multi-processor FFT. Journal of Computational Physics, 102(2):381--397, October 1992.


Load-balance in parallel FACR - Johnsson, Pitsianis   Self-citation (Johnsson)

....though communication intensive. For the complex to complex FFT with a data set of N points distributed evenly across P processors the minimum communication per processing node is approximately ( N P Gamma 1) log 2 N log 2 N P for N P data elements per node. For details see [12, 13] and for a proof of optimality see [8] The parallel communication complexity and proof of optimality are different aspects of an early observation made by Gentleman and Sande [4] Real to complex FFT are more communication intensive than the standard FFT, and cosine and sine transforms are ....

S. Lennart Johnsson, Michel Jacquemin, and Robert L. Krawitz. Communication efficient multiprocessor FFT. Journal of Computational Physics, 102(2):381--397, October 1992.


Massively Parallel Computing: Data distribution and communication - Johnsson (1993)   (2 citations)  Self-citation (Johnsson)

.... with the chosen method of aggregation, thus creating an equal load balance for consecutive and cyclic allocation [27, 43]. But, for computations where the order of traversal of the index space is fixed, such as the FFT, a cyclic allocation may reduce the communication needs by a factor of two [37, 38, 58, 64]. For a number of important computations on regular arrays, the canonical layout indeed minimizes the number of off node references for a given number of data elements per node. However, when references along the different axes are not uniform, other nodal array shapes may result in a reduced ....

S. Lennart Johnsson, Michel Jacquemin, and Robert L. Krawitz. Communication efficient multi-processor FFT. Journal of Computational Physics, 102(2):381--397, October 1992.


Massively Parallel Computing: Mathematics and communications .. - Johnsson, Mathur (1993)   Self-citation (Johnsson)

....an equal load balance for consecutive and cyclic allocation [38, 59]. But, for computations where the order of traversal of the index space is fixed, such as the FFT, a cyclic allocation may reduce the communication needs. For the FFT, the reduction in communication may amount to a factor of two [49, 50, 75, 83]. For a number of important computations on regular arrays, the canonical layout indeed minimizes the number of off node references for a given number of data elements per node. However, when references along the different axes are not uniform, other nodal array shapes may result in a reduced ....

....must pass through a channel. The one to all and all to one personalized communication operations require a time of 2Q P Nr log 2 Nc [45]. The time required for the FFT computation itself is proportional to PQ NrNc in the node limited model, and to PQ 2NrNc in the channel limited model [49, 50, 51, 75, 83]. Thus, the data redistribution time in Alternative 1 far exceeds the communication time required for the FFT computation. The communication times are summarized in Table 4, and the arithmetic speedups in Table 5. Note that with a single instance library routine and canonical layouts, Alternative ....

S. Lennart Johnsson, Michel Jacquemin, and Robert L. Krawitz. Communication efficient multi-processor FFT. Journal of Computational Physics, 102(2):381--397, October 1992.

CiteSeer.IST - Copyright Penn State and NEC