D.H. Bailey. A high-performance fast Fourier transform algorithm for the Cray2. Journal of Supercomputing, 1:43--60, 1987.
.... coupled parallel machines, due to its symmetries and scalability properties [26], and is a basic block of the counting and bitonic sorting networks [5]. For these reasons the implementation of FFT on parallel computers has raised great interest in the scientific community over the last decade [24, 13, 8, 6, 7, 38, 37, 23, 35]. In this paper we analyze through simulation the characteristics of the communication patterns generated by some parallel FFT algorithms on two popular families of interconnection networks, the k-ary n-cubes [17] and the k-ary n-trees [34, 32], using wormhole and cut-through routing ....
David H. Bailey. A High-Performance Fast Fourier Transform Algorithm for the Cray-2. Journal of Supercomputing, 1:43-60, 1987.
.... due to its symmetries and scalability properties [Hwa93], and is a basic block of the counting and bitonic sorting networks [AHS93]. For these reasons the implementation of FFT on parallel computers has raised great interest in the scientific community over the last decade [GK93, Chu87, Bai90, Bai87, Bai88, Swa87, Swa84, Get92, RWW91]. In this paper we analyze through simulation the characteristics of the communication patterns generated by some parallel FFT algorithms on two popular families of interconnection networks, the k-ary n-cubes [Dal90] and the k-ary n-trees [PV95a], using ....
David H. Bailey. A High-Performance Fast Fourier Transform Algorithm for the Cray-2. Journal of Supercomputing, 1:43--60, 1987.
....the Cyber 205 in the late 70s and early 80s. These machines were vector machines and required that code be vectorized to achieve high performance. Penalties for poorly vectorized code could certainly be an order of magnitude. This problem led to a flurry of activity in re-examining the FFT [1, 2, 8, 9, 10, 15, 27, 38, 39, 40, 41]. 4 Portability The search for optimal implementations of the FFT led to the development of many variants of the original implementation. These variants were difficult to program, so that it became very desirable to take, say, a good vector version and port it to another vector machine rather ....
D. H. Bailey. A high-performance fast Fourier transform algorithm for the Cray-2. J. Supercomputing, 1:43--60, 1987.
....as in [1]. This allows data to be split into two words containing three digits each upon entry to the FFT multiply routine. The FFT routine used in this program is currently the fastest software available to perform a one-dimensional FFT on the Cray-2. Details of this FFT algorithm may be found in [2]. Multiprecision multiplication is performed using this FFT as follows. Let $x = (x_0, x_1, \cdots, x_{n-1})$ and $y = (y_0, y_1, \cdots, y_{n-1})$ denote the radix-$b$ representations of two multiprecision numbers. Extend x and y to length N = 2n by appending ....
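The excerpt above breaks off before the remaining steps, so the following is only a minimal sketch of the general FFT-based multiplication technique, not the cited Cray-2 routine. It assumes that the padding is done with zeros, that n is rounded up to a power of two so a simple radix-2 transform applies, and that b = 1000 is a reasonable illustrative radix. The names `fft`, `to_digits`, and `fft_multiply` are hypothetical.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def to_digits(value, base):
    """Radix-`base` digits of a non-negative integer, least significant first."""
    digits = []
    while value:
        value, d = divmod(value, base)
        digits.append(d)
    return digits or [0]


def fft_multiply(x, y, base=1000):
    """Multiply two non-negative integers via a length-2n complex FFT."""
    xd, yd = to_digits(x, base), to_digits(y, base)
    n = 1
    while n < max(len(xd), len(yd)):  # assumption: round n up to a power of two
        n *= 2
    size = 2 * n  # N = 2n, so the cyclic convolution does not wrap around
    fx = fft(xd + [0] * (size - len(xd)))
    fy = fft(yd + [0] * (size - len(yd)))
    prod = fft([p * q for p, q in zip(fx, fy)], invert=True)
    # Scale the unnormalized inverse transform and round to the nearest integer.
    coeffs = [int(round(c.real / size)) for c in prod]
    # Python integers absorb the carries when the digit polynomial is evaluated.
    return sum(c * base**i for i, c in enumerate(coeffs))
```

Rounding the inverse transform is only safe while floating-point error stays below 0.5, which limits how large the radix and the operand length can be in double precision.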
Bailey, D. H., "A High Performance Fast Fourier Transform Algorithm for the Cray2", Journal of Supercomputing, to appear March 1987.
....for such data sets involve heavy use of power-of-two memory strides. Fortunately, it is possible to devise alternative FFT algorithms that do not rely on power-of-two strides. Indeed, some FFT algorithms can be performed using exclusively unit-stride data access in inner computational loops [3], [5], [6], [9], [10]. Even for systems with external or hierarchical memory systems, these unit-stride algorithms are a definite improvement over conventional algorithms, since unit strides improve the locality of accesses to and from external memory. However, many FFT algorithms, both traditional and ....
Bailey, D. H., "A High-Performance Fast Fourier Transform Algorithm for the Cray2", Journal of Supercomputing, vol. 1 (1987), p. 43-60.
....the discrete Fourier transform may of course be economically computed using some variation of the fast Fourier transform (FFT) algorithm. It is most convenient to employ the radix-two fast Fourier transform since there is a wealth of literature on how to efficiently implement this algorithm (see [1], [8] and [16]). Thus it will be assumed from this point that $N = 2^m$ for some integer m. One useful trick can be employed to further reduce the computational requirement for complex transforms. Note that the input data vectors x and y and the result vector z are purely real. This fact can ....
....or [11]. Knuth [13] and Borodin [3] also provide excellent information on using these tools for computation. 7. Computational Results The author has implemented all three of the above techniques for multiprecision multiplication on the Cray-2. By employing special high-performance techniques [1], the complex transform can be made to run the fastest, about four times faster than the two-prime transform method. However, the memory requirement of the two-prime scheme is significantly less than either the three-prime or the complex scheme, and since the two-prime scheme permits very ....
Bailey, D. H., "A High-Performance Fast Fourier Transform Algorithm for the Cray2", to appear in Journal of Supercomputing, 1987.
....has reported success in implementing this technique on the CDC 205, for instance. Unfortunately, the Pease algorithm requires a bit-reversal permutation to be performed on the output data, and most vector computers are not efficient at performing this permutation. In a previous paper by the author [1], a technique was presented that avoids the power-of-two memory strides inherent in the Stockham FFT. The first trick is to perform the Stockham iterations incrementing k in the inner loop instead of j. In this way, the data arrays X and Y may be accessed with unit stride (provided that complex ....
....the Stockham iterations incrementing k in the inner loop instead of j. In this way, the data arrays X and Y may be accessed with unit stride (provided that complex data is stored with real and imaginary parts separated instead of interleaved, as is the usual custom). The second idea mentioned in [1] is to store the powers of $\alpha_t$ that are needed for a single iteration as contiguous data in a separate section of the array U. In this way, the fetch of roots of unity from U always has stride one. The memory storage required for storing all roots of unity in this manner is only twice the usual ....
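For orientation, here is a minimal sketch of a generic radix-2 Stockham autosort FFT, which ping-pongs between two arrays and needs no bit-reversal pass. It is not the Cray-2 code described in [1]: the loop variables `p` and `q` and the function name `stockham_fft` are illustrative assumptions, and the roots of unity are recomputed on the fly rather than stored contiguously in a table such as U. The inner loop over `q` reads and writes consecutive elements, which is the unit-stride property the excerpt is concerned with.

```python
import cmath


def stockham_fft(data):
    """Forward DFT of a power-of-two-length sequence, in natural order."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    x = [complex(v) for v in data]
    y = [0j] * n
    span = n    # length of each sub-transform at this stage
    stride = 1  # number of interleaved sub-transforms
    while span > 1:
        half = span // 2
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / span)
            # Inner loop: consecutive q means unit-stride reads and writes.
            for q in range(stride):
                a = x[q + stride * p]
                b = x[q + stride * (p + half)]
                y[q + stride * (2 * p)] = a + b
                y[q + stride * (2 * p + 1)] = (a - b) * w
        x, y = y, x  # swap roles of the two work arrays
        span = half
        stride *= 2
    return x
```

In this loop ordering the inner loop is short in the early stages (`stride` starts at 1), so a vector implementation would typically reorder the loops in those stages to keep vectors long.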
[Article contains additional citation context not shown here]
Bailey, D.H., "A High-Performance Fast Fourier Transform Algorithm for the Cray2", Journal of Supercomputing, to appear.
No context found.
D.H. Bailey. A high-performance fast Fourier transform algorithm for the Cray2. Journal of Supercomputing, 1:43--60, 1987.