16 citations found.
P. N. Swarztrauber, FFT Algorithms for Vector Computers, Parallel Comput. 1 (1984), pp. 45--63.

This paper is cited in the following contexts:
A Parallel 3-D FFT Algorithm on Clusters of Vector SMPs - Takahashi (2000)

....On a single vector SMP node we use FFT algorithms of the Stockham autosort algorithm [13] for radix 2, 4 and 8. The SR8000's microprocessors have a three-operand multiply-add instruction, which computes a = ±a ± bc, where a, b and c are floating point registers. Goedecker [14] reduced the number of instructions necessary for radix 2, 3, 4 and 5 FFT kernels by maximizing the use of ....

P. N. Swarztrauber, "FFT Algorithms for Vector Computers," Parallel Computing, vol. 1, pp. 45--63, 1984.
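
For readers unfamiliar with the Stockham autosort FFT named in the excerpt above, the following is a minimal radix-2 sketch. It is an illustration only, not the code from Takahashi's or Swarztrauber's paper; the function name stockham_fft and all variable names are assumptions. The point it shows is that each pass writes its results into a second buffer in an order that leaves the final output in natural order, so no separate bit-reversal pass is needed.

```python
import cmath


def stockham_fft(x):
    """Forward DFT of a power-of-two-length sequence (radix-2 Stockham sketch)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    a = [complex(v) for v in x]   # current pass input
    b = [0j] * n                  # current pass output (ping-pong buffer)
    l = n // 2                    # butterflies per sub-transform in this pass
    m = 1                         # number of interleaved sub-transforms
    while l >= 1:
        for j in range(l):
            w = cmath.exp(-2j * cmath.pi * j / (2 * l))  # twiddle factor
            for k in range(m):
                c0 = a[k + j * m]
                c1 = a[k + j * m + l * m]
                b[k + 2 * j * m] = c0 + c1
                b[k + 2 * j * m + m] = w * (c0 - c1)
        a, b = b, a               # the output of this pass feeds the next
        l //= 2
        m *= 2
    return a
```

The sketch trades memory for ordering: it needs a second array of length n, which is the extra storage that self-sorting FFTs are generally known to require.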


Bit Reversal On Uniprocessors - Karp (1996)   (3 citations)

....these algorithms require a so-called bit reversal reordering of the data. If the bit reversal is not done properly, it can take a substantial fraction of the total time to do the FFT. In fact, it is common wisdom that bit reversal reordering is too slow to be used on a machine with a hierarchical memory [37]. Figure 1 illustrates the problem. It shows the time it takes to do the bit reversal of an array of the indicated length in machine cycles per element. Two methods are shown, a simple scatter operation [23] and one of the first published methods [7]. Also shown is the time it takes to do one FFT ....

....transferred between memory and backing storage. His code uses the perfect shuffle on successively smaller segments on each pass. The loops are arranged to keep a block of data in main memory as long as possible. Not surprisingly, this approach works well on hierarchical memory machines. Swarztrauber [37] presented two algorithms that are similar to Singleton's but are coded to look like matrix transposes. Routine ctsort is an inverse perfect shuffle: successive elements are stored in separate halves of the output array. Routine ptsort implements a perfect shuffle similar to Singleton's, but the ....

[Article contains additional citation context not shown here]

P. N. Swarztrauber, FFT Algorithms for Vector Computers, Parallel Computing, 1 (1984), pp. 45--63.
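
The inverse perfect shuffle described in the Karp excerpt above (successive elements stored in separate halves of the output array) can be sketched as follows. This is a hedged illustration, not the ctsort routine itself; the names inverse_perfect_shuffle, bit_reverse_by_unshuffle and bit_reverse_index are assumptions. The second function shows one way such a shuffle relates to bit reversal: applying it to successively smaller segments yields the bit-reversed ordering.

```python
def inverse_perfect_shuffle(seg):
    """Even-indexed elements go to the first half, odd-indexed to the second."""
    return seg[0::2] + seg[1::2]


def bit_reverse_by_unshuffle(data):
    """Bit-reversal reordering built from unshuffles on shrinking segments."""
    n = len(data)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(data)
    size = n
    while size > 2:               # a 2-element unshuffle is the identity
        for start in range(0, n, size):
            out[start:start + size] = inverse_perfect_shuffle(out[start:start + size])
        size //= 2
    return out


def bit_reverse_index(i, bits):
    """Reverse the lowest `bits` bits of i (reference definition)."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r
```

For an 8-element input, the first pass produces the order 0, 2, 4, 6, 1, 3, 5, 7, and the second pass, applied to each half, produces 0, 4, 2, 6, 1, 5, 3, 7.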


Fast Bit-Reversals On Uniprocessors And Shared-Memory.. - Zhang, Zhang (2001)

....is highly sensitive to how caches and memory hierarchies are used in the implementations. In other words, a fast bit reversal implementation must be cache effective. Several papers have well addressed the significance and effects of considering memory hierarchy to bit reversals (e.g. [2], [11] and [15]). Besides the important usage for FFT, different versions of bit reversal implementations can also be used as benchmark programs to evaluate the memory hierarchy of various computer systems. With the rapid development of RISC and VLSI technology, the speed of processors has increased dramatically ....

P. N. Swarztrauber, FFT algorithms for vector computers, Parallel Comput., 1 (1984), pp. 45--63.


Latency and Bandwidth Requirements of Massively Parallel.. - Petrini (1999)

....coupled parallel machines, due to its symmetries and scalability properties [26] and is a basic block of the counting and bitonic sorting networks [5]. For these reasons the implementation of FFT on parallel computers has raised a great interest in the scientific community over the last decade [24, 13, 8, 6, 7, 38, 37, 23, 35]. In this paper we analyze through simulation the characteristics of the communication patterns generated by some parallel FFT algorithms on two popular families of interconnection networks, the k-ary n-cubes [17] and the k-ary n-trees [34], [32] using wormhole and cut-through routing ....

P. N. Swarztrauber. FFT Algorithms for Vector Computers. Parallel Computing, 1:45-63, 1984.


Latency and Bandwidth Requirements of Massively Parallel.. - Petrini, Vanneschi (1999)

....and scalability properties [Hwa93] and is a basic block of the counting and bitonic sorting networks [AHS93]. For these reasons the implementation of FFT on parallel computers has raised a great interest in the scientific community over the last decade [GK93, Chu87, Bai90, Bai87, Bai88, Swa87, Swa84, Get92, RWW91]. In this paper we analyze through simulation the characteristics of the communication patterns generated by some parallel FFT algorithms on two popular families of interconnection networks, the k-ary n-cubes [Dal90] and the k-ary n-trees [PV95a] using wormhole and cut-through ....

P. N. Swarztrauber. FFT Algorithms for Vector Computers. Parallel Computing, 1:45--63, 1984.


Efficient Overlapped FFT Algorithms for Hypercube-Connected.. - Aykanat, Dervis

....limited by the rate at which the FFT algorithm can be executed. The high performance requirement for real time implementation of these algorithms led to the design of special purpose hardwares. An extensive research has been conducted to implement efficient FFT algorithms on vector processors [6, 11], and general purpose parallel architectures with shared memory [1, 4, 10] and distributed memory [12, 13, 15]. The purpose of this paper is to investigate the efficient parallelization of one dimensional FFT algorithm on medium to coarse grain, distributed memory, message passing architectures ....

P.N. Swarztrauber, "FFT algorithms for vector computers", in Parallel Computing 1, pp. 45-63, 1984.


Applications Of FFT - Emiris, Pan (1999)

....for reducing the complexity of computing the DFT and IDFT for specific smaller K, even though these algorithms do not decrease the asymptotic complexity estimates [Winograd, 1980, Van Loan, 1992, Bini and Bozzo, 1993]. There exist public domain codes implementing FFT freely accessible via netlib [Swarztrauber, 1984, Bailey, 1993, Bailey, 1993b, Frigo and Johnson] and certain libraries of arbitrary precision integer arithmetic [Bailey, 1993b, Biehl et al., 1995, GNU, 1996] use FFT. Some comments are in order on conditioning and numerical stability. It is fortunate that the DFT is a well conditioned ....

Swarztrauber, P. 1984. FFT algorithms for vector computers. Parallel Computing, 1:45-63. Implementation at http://www.psc.edu/general/software/packages/fftpack/fftpack.html or ftp://netlib.att.com/netlib.


Performance Evaluation of FFT Routines - Machine.. - Auer, Benedik.. (1999)

....1987 Version 2.0: vsint, vsinti, vcost, vcosti, vsinqf, vsinqb, vsinqi, vcosqf, vcosqb, vcosqi added by Boisvert May 1990 Version 2.1: documentation revised Language: Fortran 77. Target Machines: all computers with a Fortran 77 compiler. Precision: single. Description: Vfftpack (Swarztrauber [22]) is a vectorized version of the scalar package Fftpack (Version 3) by P. N. Swarztrauber (see Section 2.4 on page 11). The algorithm used is a mixed radix Stockham autosort FFT algorithm with radix 2, 3, 4, 5 kernels as well as a general kernel for odd factors; ....

P. N. Swarztrauber, FFT Algorithms for Vector Computers, Parallel Comput. 1 (1984), pp. 45--63.


Challenges of Computing the Fast Fourier Transform - Johnson, Johnson (1997)   (5 citations)

....the Cyber 205 in the late 70's and early 80's. These machines were vector machines and required that code be vectorized to achieve high performance. Penalties for poorly vectorized code could certainly be an order of magnitude. This problem lead to a flurry of activity in re-examining the FFT [1, 2, 8, 9, 10, 15, 27, 38, 39, 40, 41]. 4 Portability The search for optimal implementations of the FFT lead to the development of many variants of the original implementation. These variants were difficult to program, so that it became very desirable to take, say a good vector version, and port it to another vector machine rather ....

P. N. Swarztrauber. FFT algorithms for vector computers. Parallel Comput., 1:45--63, 1984.


FFTs in External or Hierarchical Memory - Bailey (1989)   (52 citations)

....data sets involve heavy use of power of two memory strides. Fortunately, it is possible to devise alternative FFT algorithms that do not rely on power of two strides. Indeed, some FFT algorithms can be performed using exclusively unit stride data access in inner computational loops [3], [5], [6], [9], [10]. Even for systems with external or hierarchical memory systems, these unit stride algorithms are a definite improvement over conventional algorithms, since unit strides improve the locality of accesses to and from external memory. However, many FFT algorithms, both traditional and modern, ....

Swarztrauber, P. N., "FFT Algorithms for Vector Computers", Parallel Computing, 1 (1984), p. 45 - 63.


The Computation of pi to 29,360,000 Decimal Digits Using.. - Bailey (1987)

....transform may of course be economically computed using some variation of the fast Fourier transform (FFT) algorithm. It is most convenient to employ the radix two fast Fourier transform since there is a wealth of literature on how to efficiently implement this algorithm (see [1], [8] and [16]). Thus it will be assumed from this point that N = 2^m for some integer m. One useful trick can be employed to further reduce the computational requirement for complex transforms. Note that the input data vectors x and y and the result vector z are purely real. This fact can be exploited by ....

Swarztrauber, P. N., "FFT Algorithms for Vector Computers", Parallel Computing, 1 (1984), pp. 45-64.


A High-Performance FFT Algorithm for Vector Supercomputers - Bailey (1988)   (6 citations)

....of the algorithm on the relatively slow Cray-2 main memory, and a significant improvement in performance is obtained. The resulting Fortran program is as much as 30% faster than Cray's assembly coded library routine on the Cray-2. A New Technique for Performing Power of Two FFTs In [4] and [5] Swarztrauber surveys the best algorithms currently available for performing FFTs on parallel and vector computers. Included in [5] are two variations of the Stockham FFT, each different than the algorithm listed above. These variant FFTs each have the property that all inner loop calculations can ....

....Fortran program is as much as 30% faster than Cray's assembly coded library routine on the Cray-2. A New Technique for Performing Power of Two FFTs In [4] and [5] Swarztrauber surveys the best algorithms currently available for performing FFTs on parallel and vector computers. Included in [5] are two variations of the Stockham FFT, each different than the algorithm listed above. These variant FFTs each have the property that all inner loop calculations can be performed with stride one accesses of the main data arrays. It is necessary, however, to switch from one variant to the other ....

Swarztrauber, P.N., "FFT Algorithms for Vector Computers", Parallel Computing, 1 (1984), pp. 45-63.


Memory Hierarchy Considerations for Fast Transpose and.. - Gatlin, Carter (1999)   (4 citations)

....of some FFT algorithms (e.g. Pease's algorithm) is the bit reversal reordering. Other self-ordering FFT algorithms have been developed expressly to avoid the costly bit reversal computation. They have a disadvantage of requiring more memory, and still may not avoid memory hierarchy problems [19]. As is well known, naive implementations of these reorderings are slow, and loop transformations (tiling or blocking) can greatly improve performance. Here, we explore in depth a harder issue that often arises: when the array dimensions are large powers of two, there can be extreme cache and TLB ....

P. Swarztrauber. FFT algorithms for vector computers. Parallel Computing, 1:45--63, 1984.


The Fractional Fourier Transform and Applications - Bailey, Swarztrauber (1995)   (4 citations)  Self-citation (Swarztrauber)

....k (z). It should be emphasized that this equality only holds for $0 \le k < m$. The remaining $2p - m$ results of the final inverse DFT are discarded. These three DFTs can of course be efficiently computed using $2p$-point FFTs (for discussions of computing FFTs, see [1], [4], [5], [7], [9], [11], [16] and [17]). To compute a different $m$-long segment $G_{k+s}(x, \alpha)$, $0 \le k < m$, it is necessary to slightly modify the above convolution procedure. In this case $z$ is as follows:

$z_j = e^{\pi i (j+s)^2 \alpha}, \quad 0 \le j < m$ (20)
$z_j = 0, \quad m \le j < 2p - m$ (21)
$z_j = e^{\pi i (j+s-2p)^2 \alpha}, \quad 2p - m \le j$ ....

Swarztrauber, P. N., "FFT Algorithms for Vector Computers", Parallel Computing, 1 (1984), p. 45 - 63.


Performance Evaluation of FFT Routines: Machine.. - Auer, Benedik.. (1999)

No context found.

P. N. Swarztrauber, FFT Algorithms for Vector Computers, Parallel Comput. 1 (1984), pp. 45--63.


A High-Performance Fast Fourier Transform Algorithm for the Cray-2 - Bailey (1986)   (4 citations)

No context found.

P. N. Swarztrauber, "FFT Algorithms for Vector Computers", Parallel Computing, 1 (1984), pp. 45-63.

CiteSeer.IST - Copyright Penn State and NEC