  Fast bit-reversals on uniprocessors and shared-memory multiprocessors (1995) [2 citations — 0 self]

by Zhao Zhang, Xiaodong Zhang
Tsallis, Physica A
http://www.cs.wm.edu/hpcs/WWW/HTML/publications/./papers/TR-01-2.ps.Z

Abstract:

In this paper, we examine different methods using techniques of blocking, buffering, and padding for efficient implementations of bit-reversals. We evaluate the merits and limits of each technique and its application- and architecture-dependent conditions for developing cache-optimal methods. Besides testing the methods on different uniprocessors, we conducted both simulation and measurements on two commercial symmetric multiprocessors (SMP) to provide architectural insights into the methods and their implementations. We present two contributions in this paper: (1) Our integrated blocking methods, which match cache associativity and translation-lookaside buffer (TLB) cache size and which fully use the available registers, are cache-optimal and fast. (2) We show that our padding methods outperform other software-oriented methods, and we believe they are the fastest in terms of minimizing both CPU and memory access cycles. Since the padding methods are almost independent of hardware, they could be widely used on many uniprocessor workstations and multiprocessors.
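For readers unfamiliar with the operation, the following is a minimal sketch of a bit-reversal permutation and of one generic way to apply blocking to it. It is an illustration under stated assumptions, not the authors' implementation: the function names, the tile parameter `q`, and the tiling scheme (splitting each index into high, middle, and low bit fields) are all hypothetical choices made here. Python cannot show the cache effects the paper measures; the sketch only shows the access pattern that a blocked method would follow.

```python
def reverse_bits(x, nbits):
    """Return x with its lowest nbits bits in reversed order."""
    r = 0
    for _ in range(nbits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def _log2_exact(n):
    """Return log2(n), raising ValueError unless n is a power of two."""
    if n <= 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    return n.bit_length() - 1


def bit_reversal_naive(src):
    """Baseline: dst[reverse(i)] = src[i], one element at a time."""
    nbits = _log2_exact(len(src))
    dst = [None] * len(src)
    for i in range(len(src)):
        dst[reverse_bits(i, nbits)] = src[i]
    return dst


def bit_reversal_blocked(src, q=2):
    """Blocked variant (hypothetical sketch, not the paper's method).

    Each index is split as (a | m | b): a = high q bits, b = low q bits,
    m = the middle bits. Its reversal is (rev(b) | rev(m) | rev(a)), so for
    a fixed m the 2^q x 2^q tile of (a, b) pairs is permuted as a unit,
    much like one tile of a blocked matrix transpose.
    """
    nbits = _log2_exact(len(src))
    if 2 * q > nbits:
        # Array too small to tile with this q; fall back to the baseline.
        return bit_reversal_naive(src)
    mid_bits = nbits - 2 * q
    tile = 1 << q
    rev_q = [reverse_bits(k, q) for k in range(tile)]  # small lookup table
    dst = [None] * len(src)
    for m in range(1 << mid_bits):
        m_rev = reverse_bits(m, mid_bits)
        for a in range(tile):
            for b in range(tile):
                i = (a << (nbits - q)) | (m << q) | b
                j = (rev_q[b] << (nbits - q)) | (m_rev << q) | rev_q[a]
                dst[j] = src[i]
    return dst
```

In a compiled language, the tile size would typically be chosen so that one tile of source and destination lines fits in the cache (and, per the abstract, in the TLB) at the same time; the value `q=2` here is arbitrary.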
