by Zhao Zhang, Xiaodong Zhang
Tsallis, Physica A
Download: http://www.cs.wm.edu/hpcs/WWW/HTML/publications/./papers/TR-01-2.ps.Z
Abstract:
In this paper, we examine different methods using techniques of blocking, buffering, and padding for efficient implementations of bit-reversals. We evaluate the merits and limits of each technique and its application and architecture-dependent conditions for developing cache-optimal methods. Besides testing the methods on different uniprocessors, we conducted both simulation and measurements on two commercial symmetric multiprocessors (SMP) to provide architectural insights into the methods and their implementations. We present two contributions in this paper: (1) Our integrated blocking methods, which match cache associativity and translation-lookaside buffer (TLB) cache size and which fully use the available registers, are cache-optimal and fast. (2) We show that our padding methods outperform other software-oriented methods, and we believe they are the fastest in terms of minimizing both CPU and memory access cycles. Since the padding methods are almost independent of hardware, they could be widely used on many uniprocessor workstations and multiprocessors.
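For readers unfamiliar with the operation the abstract refers to, the following is a minimal sketch of a naive, out-of-place bit-reversal reordering. It is only the unoptimized baseline, not the blocking, buffering, or padding methods of the paper, and the function names `bit_reverse` and `bit_reversal_copy` are hypothetical names chosen for illustration. Consecutive source indices map to destination indices that are far apart (index 1 goes to n/2), which is the access pattern that cache-aware variants try to tame.

```python
def bit_reverse(i: int, bits: int) -> int:
    """Return i with its low `bits` bits written in reverse order."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)  # shift the lowest bit of i into r
        i >>= 1
    return r


def bit_reversal_copy(src: list) -> list:
    """Naive out-of-place bit-reversal reordering (illustrative baseline only).

    The length of `src` must be a power of two.
    """
    n = len(src)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    dst = [None] * n
    for i in range(n):
        # Reads are sequential; writes jump across the whole destination array.
        dst[bit_reverse(i, bits)] = src[i]
    return dst


if __name__ == "__main__":
    print(bit_reversal_copy(list(range(8))))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Applying the reordering twice returns the original sequence, which gives a quick sanity check for any optimized variant.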
Citations
3148 | Computer Architecture: A Quantitative Approach – Hennessy, Patterson – 1996
680 | Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and – Jouppi – 1990
291 | Compiler transformations for high-performance computing – Bacon, Graham, et al. – 1994
289 | An algorithm for the machine calculation of complex Fourier series – Cooley, Tukey – 1965
276 | LMbench: Portable tools for performance analysis – McVoy, Staelin – 1996
139 | Using the SimOS Machine Simulator to Study Complex Computer Systems – Rosenblum, Bugnion, et al. – 1997
108 | FFT's in external or hierarchical memory – Bailey – 1990
104 | Data transformations for eliminating conflict misses – Rivera, Tseng – 1998
93 | Avoiding conflict misses dynamically in large direct-mapped caches – Bershad, Lee, et al. – 1994
33 | Algorithms for matrix transposition for boolean n-cube configured ensemble architectures – Johnsson, Ho – 1988
22 | Improving memory performance of sorting algorithms – Xiao, Zhang, et al. – 2000
21 | FFT algorithms for vector computers – Swarztrauber – 1984
14 | Two fast and high-associativity cache schemes – Zhang, Zhang, et al. – 1997
11 | Cacheminer: A runtime approach to exploit cache locality on SMP – Yan, Zhang, et al.
7 | Bit reversal on uniprocessors – Karp – 1996
6 | An improved digit-reversal permutation algorithm for the fast Fourier and Hartley transforms – Evans – 1987
6 | Memory hierarchy considerations for fast transpose and bit-reversals – Gatlin, Carter – 1999
6 | P1003.4a: Threads Extension for Portable Operating Systems – IEEE – 1994
4 | Virtual-address caches – Cekleov, Dubois, et al. – 1990
2 | Compiler Transformations for High-Performance Computing – Sharp – 1994
1 | Optimal matrix transpose and bit reversal on hypercube: All-to-all personalized communication – Edelman – 1991
1 | Using the SimOS machine simulator to study complex computer systems – Herrod – 1997
1 | Improving memory performance of sorting algorithms – Kubricht
1 | Two fast and high-associativity cache schemes – Yan – 1997