Results 1 - 10 of 726
The Landscape of Parallel Computing Research: A View from Berkeley
- TECHNICAL REPORT, UC BERKELEY
, 2006
"... ..."
SPIRAL: Code Generation for DSP Transforms
- PROCEEDINGS OF THE IEEE SPECIAL ISSUE ON PROGRAM GENERATION, OPTIMIZATION, AND ADAPTATION
"... Fast changing, increasingly complex, and diverse computing platforms pose central problems in scientific computing: How to achieve, with reasonable effort, portable optimal performance? We present SPIRAL that considers this problem for the performance-critical domain of linear digital signal proces ..."
Abstract
-
Cited by 222 (41 self)
- Add to MetaCart
Fast changing, increasingly complex, and diverse computing platforms pose central problems in scientific computing: How to achieve, with reasonable effort, portable optimal performance? We present SPIRAL, which considers this problem for the performance-critical domain of linear digital signal processing (DSP) transforms. For a specified transform, SPIRAL automatically generates high performance code that is tuned to the given platform. SPIRAL formulates the tuning as an optimization problem, and exploits the domain-specific mathematical structure of transform algorithms to implement a feedback-driven optimizer. Similar to a human expert, for a specified transform, SPIRAL “intelligently” generates and explores algorithmic and implementation choices to find the best match to the computer’s microarchitecture. The “intelligence” is provided by search and learning techniques that exploit the structure of the algorithm and implementation space to guide the exploration and optimization. SPIRAL generates high performance code for a broad set of DSP transforms including the discrete Fourier transform, other trigonometric transforms, filter transforms, and discrete wavelet transforms. Experimental results show that the code generated by SPIRAL competes with, and sometimes outperforms, the best available human-tuned transform library code.
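SPIRAL's generator and search machinery are far more elaborate than anything that fits here; the toy sketch below (not SPIRAL's API) only illustrates the feedback-driven loop the abstract describes: enumerate candidate implementations of one transform, time each on the actual machine, keep the fastest. The candidate functions and the `autotune` helper are invented for illustration.

```python
# Toy illustration of feedback-driven empirical tuning (not SPIRAL's generator
# or search): enumerate candidate implementations of one transform, time each
# on the target machine, and keep the fastest.
import timeit
import numpy as np

def dft_direct(x):
    # O(n^2) matrix-vector DFT.
    n = len(x)
    k = np.arange(n)
    f = np.exp(-2j * np.pi * np.outer(k, k) / n)
    return f @ x

def fft_radix2(x):
    # Textbook recursive radix-2 FFT (n must be a power of two).
    n = len(x)
    if n == 1:
        return x
    even, odd = fft_radix2(x[0::2]), fft_radix2(x[1::2])
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

def autotune(candidates, x, repeats=5):
    # Feedback loop: run each candidate on the actual machine and measure it.
    timings = {name: min(timeit.repeat(lambda: fn(x), number=10, repeat=repeats))
               for name, fn in candidates.items()}
    best = min(timings, key=timings.get)
    return best, timings

if __name__ == "__main__":
    x = np.random.randn(1024) + 1j * np.random.randn(1024)
    candidates = {"direct": dft_direct, "radix2": fft_radix2, "library": np.fft.fft}
    best, timings = autotune(candidates, x)
    print(best, timings)
```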
Nonequispaced hyperbolic cross fast Fourier transform
"... A straightforward discretisation of problems in d spatial dimensions often leads to an exponential growth in the number of degrees of freedom. Thus, even efficient algorithms like the fast Fourier transform (FFT) have high computational costs. Hyperbolic cross approximations allow for a severe decre ..."
Abstract
-
Cited by 113 (3 self)
- Add to MetaCart
(Show Context)
A straightforward discretisation of problems in d spatial dimensions often leads to an exponential growth in the number of degrees of freedom. Thus, even efficient algorithms like the fast Fourier transform (FFT) have high computational costs. Hyperbolic cross approximations allow for a severe decrease in the number of used Fourier coefficients to represent functions with bounded mixed derivatives. We propose a nonequispaced hyperbolic cross fast Fourier transform based on one hyperbolic cross FFT and a dedicated interpolation by splines on sparse grids. Analogously to the nonequispaced FFT for trigonometric polynomials with Fourier coefficients supported on the full grid, this allows for the efficient evaluation of trigonometric polynomials with Fourier coefficients supported on the hyperbolic cross at arbitrary spatial sampling nodes. Key words and phrases: trigonometric approximation, hyperbolic cross, sparse grid, fast Fourier transform, nonequispaced FFT
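As a point of reference for what is being accelerated, the sketch below builds a 2-D hyperbolic cross index set and evaluates a trigonometric polynomial with coefficients supported on it directly at arbitrary nodes; the paper's contribution is a fast replacement for exactly this direct sum. The index-set convention max(1,|k1|)·max(1,|k2|) ≤ N is one common choice and is assumed here.

```python
# Direct (slow) evaluation of a trigonometric polynomial whose Fourier
# coefficients live on a 2-D hyperbolic cross, at arbitrary nodes in [0,1)^2.
# The paper replaces this O(|I_N| * M) sum with a fast transform; the index-set
# convention max(1,|k1|) * max(1,|k2|) <= N used here is an assumption.
import numpy as np

def hyperbolic_cross_2d(N):
    ks = np.arange(-N, N + 1)
    return [(k1, k2) for k1 in ks for k2 in ks
            if max(1, abs(k1)) * max(1, abs(k2)) <= N]

def eval_trig_poly(coeffs, index_set, nodes):
    # f(x) = sum_{k in I_N} c_k * exp(2*pi*i * <k, x>), evaluated at each node.
    k = np.array(index_set)                    # shape (|I_N|, 2)
    c = np.asarray(coeffs)                     # shape (|I_N|,)
    phase = np.exp(2j * np.pi * nodes @ k.T)   # shape (M, |I_N|)
    return phase @ c

if __name__ == "__main__":
    N = 16
    idx = hyperbolic_cross_2d(N)
    c = np.random.randn(len(idx)) + 1j * np.random.randn(len(idx))
    nodes = np.random.rand(50, 2)              # arbitrary (nonequispaced) sampling nodes
    f = eval_trig_poly(c, idx, nodes)
    print(len(idx), f.shape)                   # |I_N| grows far slower than the full grid
```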
PetaBricks: a language and compiler for algorithmic choice
, 2009
"... It is often impossible to obtain a one-size-fits-all solution for high performance algorithms when considering different choices for data distributions, parallelism, transformations, and blocking. The best solution to these choices is often tightly coupled to different architectures, problem sizes, ..."
Abstract
-
Cited by 81 (10 self)
- Add to MetaCart
It is often impossible to obtain a one-size-fits-all solution for high-performance algorithms when considering different choices for data distributions, parallelism, transformations, and blocking. The best solution to these choices is often tightly coupled to different architectures, problem sizes, data, and available system resources. In some cases, completely different algorithms may provide the best performance. Current compiler and programming language techniques are able to change some of these parameters, but today there is no simple way for the programmer to express or the compiler to choose different algorithms to handle different parts of the data. Existing solutions normally can handle only coarse-grained, library-level selections or hand-coded cutoffs between base cases and recursive cases. We present PetaBricks, a new implicitly parallel language and compiler for algorithmic choice.
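PetaBricks makes algorithmic choice a language-level construct and lets its autotuner pick among algorithms and cutoffs; the plain-Python sketch below (not PetaBricks syntax) only mimics the hand-coded version the abstract contrasts against: one routine, two algorithms, one tunable cutoff selected empirically.

```python
# Hand-rolled illustration of "algorithmic choice": one routine, two algorithms,
# and a tunable cutoff selected by measurement. PetaBricks expresses this choice
# in the language and tunes it automatically; this is not its syntax.
import random
import timeit

def insertion_sort(a):
    a = list(a)
    for i in range(1, len(a)):
        x, j = a[i], i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a

def hybrid_sort(a, cutoff):
    # Algorithmic choice: below the cutoff use insertion sort, else merge sort.
    if len(a) <= cutoff:
        return insertion_sort(a)
    mid = len(a) // 2
    left, right = hybrid_sort(a[:mid], cutoff), hybrid_sort(a[mid:], cutoff)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def tune_cutoff(candidates=(8, 16, 32, 64, 128), n=20000):
    data = [random.random() for _ in range(n)]
    times = {c: timeit.timeit(lambda: hybrid_sort(data, c), number=3) for c in candidates}
    return min(times, key=times.get)

if __name__ == "__main__":
    print("best cutoff:", tune_cutoff())
```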
Scalable algorithms for molecular dynamics simulations on commodity clusters
- In SC ’06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing
, 2006
"... Although molecular dynamics (MD) simulations of biomolecular systems often run for days to months, many events of great scientific interest and pharmaceutical relevance occur on long time scales that remain beyond reach. We present several new algorithms and implementation techniques that significan ..."
Abstract
-
Cited by 68 (5 self)
- Add to MetaCart
(Show Context)
Although molecular dynamics (MD) simulations of biomolecular systems often run for days to months, many events of great scientific interest and pharmaceutical relevance occur on long time scales that remain beyond reach. We present several new algorithms and implementation techniques that significantly accelerate parallel MD simulations compared with current state-of-the-art codes. These include a novel parallel decomposition method and message-passing techniques that reduce communication requirements, as well as novel communication primitives that further reduce communication time. We have also developed numerical techniques that maintain high accuracy while using single precision computation in order to exploit processor-level vector instructions. These methods are embodied in a newly developed MD code called Desmond that achieves unprecedented simulation throughput and parallel scalability on commodity clusters. Our results suggest that Desmond’s parallel performance substantially surpasses that of any previously described code. For example, on a standard benchmark, Desmond’s performance on a conventional Opteron cluster with 2K processors slightly exceeded the reported performance of IBM’s Blue Gene/L machine with 32K processors running its Blue Matter MD code.
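Desmond's decomposition and communication primitives are specific to that code; the sketch below only shows the generic idea underlying spatial decompositions for short-range MD: bin particles into cells no smaller than the interaction cutoff, so the owner of a cell needs only its own cells plus a one-cell halo from neighbours. All parameters here are invented for the demo.

```python
# Generic cell-list spatial decomposition for short-range MD (an illustration of
# the idea only; Desmond's decomposition and communication scheme differ).
# Particles are binned into cells of side >= cutoff, so interactions for a cell
# involve only that cell and its immediate neighbours (a one-cell "halo").
import numpy as np
from collections import defaultdict

def build_cells(positions, box, cutoff):
    ncell = max(1, int(np.floor(box / cutoff)))   # cells per dimension
    side = box / ncell
    cells = defaultdict(list)
    for i, p in enumerate(positions):
        key = tuple(np.floor(p / side).astype(int) % ncell)  # periodic box
        cells[key].append(i)
    return cells, ncell

def halo_partners(cell, ncell):
    # Cells whose particles the owner of `cell` must receive (27-cell stencil in 3D).
    cx, cy, cz = cell
    return {((cx + dx) % ncell, (cy + dy) % ncell, (cz + dz) % ncell)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1)}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.random((1000, 3)) * 10.0            # particles in a 10x10x10 box
    cells, ncell = build_cells(pos, box=10.0, cutoff=2.5)
    some_cell = next(iter(cells))
    print(ncell, len(cells[some_cell]), len(halo_partners(some_cell, ncell)))
```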
Rapidly selecting good compiler optimizations using performance counters
- In Proceedings of the 5th Annual International Symposium on Code Generation and Optimization (CGO)
, 2007
"... Applying the right compiler optimizations to a particular program can have a significant impact on program performance. Due to the non-linear interaction of compiler optimizations, however, determining the best setting is nontrivial. There have been several proposed techniques that search the space ..."
Abstract
-
Cited by 66 (24 self)
- Add to MetaCart
(Show Context)
Applying the right compiler optimizations to a particular program can have a significant impact on program performance. Due to the non-linear interaction of compiler optimizations, however, determining the best setting is nontrivial. There have been several proposed techniques that search the space of compiler options to find good solutions; however such approaches can be expensive. This paper proposes a different approach using performance counters as a means of determining good compiler optimization settings. This is achieved by learning a model off-line which can then be used to determine good settings for any new program. We show that such an approach outperforms the state-of-the-art and is two orders of magnitude faster on average. Furthermore, we show that our performance counter based approach outperforms techniques based on static code features. Finally, we show that such improvements are stable across varying input data sets. Using our technique we achieve a 10% improvement over the highest optimization setting of the commercial PathScale EKOPath 2.3.1 optimizing compiler on the SPEC benchmark suite on a recent AMD Athlon 64 3700+ platform in just three evaluations.
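The paper learns its model offline from performance counters; as a rough stand-in, the sketch below does 1-nearest-neighbour on normalised counter vectors of previously tuned programs and reuses the flags that were best for the closest neighbour. The counter values, training data, and flag choices are fabricated for illustration and are not from the paper.

```python
# Toy stand-in for counter-based optimisation selection (not the paper's model):
# characterise a program by a vector of hardware performance counters, find the
# most similar previously-tuned program, and reuse its best-found flag setting.
# All counter values and flag associations below are fabricated for illustration.
import numpy as np

# Offline "training" data: counter vectors (e.g. cache miss rate, branch miss
# rate, IPC) for programs whose best flags were found by prior search.
train_counters = np.array([
    [0.12, 0.02, 1.8],
    [0.31, 0.01, 0.9],
    [0.05, 0.08, 2.4],
])
train_best_flags = ["-O3 -funroll-loops", "-O2 -fprefetch-loop-arrays", "-O3 -ftree-vectorize"]

def predict_flags(counters, train_x, train_y):
    # Normalise each counter dimension, then pick the nearest training program.
    mu, sigma = train_x.mean(axis=0), train_x.std(axis=0) + 1e-12
    z_train = (train_x - mu) / sigma
    z_new = (np.asarray(counters) - mu) / sigma
    nearest = np.argmin(np.linalg.norm(z_train - z_new, axis=1))
    return train_y[nearest]

if __name__ == "__main__":
    new_program_counters = [0.10, 0.03, 2.0]   # measured with a couple of profiling runs
    print(predict_flags(new_program_counters, train_counters, train_best_flags))
```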
Is Search Really Necessary to Generate High-Performance BLAS?
, 2005
"... Abstract — A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and loop unrolling factors. Traditional compilers use simple analytical models to compute these values. In contrast, library generators like ATLAS use global search over the space of p ..."
Abstract
-
Cited by 64 (13 self)
- Add to MetaCart
A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and loop unrolling factors. Traditional compilers use simple analytical models to compute these values. In contrast, library generators like ATLAS use global search over the space of parameter values by generating programs with many different combinations of parameter values, and running them on the actual hardware to determine which values give the best performance. It is widely believed that traditional model-driven optimization cannot compete with search-based empirical optimization because tractable analytical models cannot capture all the complexities of modern high-performance architectures, but few quantitative comparisons have been done to date. To make such a comparison, we replaced the global search engine in ATLAS with a model-driven optimization engine, and measured the relative performance of the code produced by the two systems on a variety of architectures. Since both systems use the same code generator, any differences in the performance of the code produced by the two systems can come only from differences in optimization parameter values. Our experiments show that model-driven optimization can be surprisingly effective, and can generate code with performance comparable to that of code generated by ATLAS using global search. Index Terms — program optimization, empirical optimization, model-driven optimization, compilers, library generators, BLAS, high-performance computing
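To make the search-versus-model contrast concrete, the sketch below does the empirical half in miniature: time a blocked matrix multiply for a handful of candidate tile sizes on the machine at hand and keep the best; a model-driven optimiser would instead compute the tile size from cache parameters. ATLAS's generator and search are far more elaborate; this only shows the shape of the approach.

```python
# Miniature empirical search over one optimisation parameter (the tile size of a
# blocked matrix multiply), in the spirit of search-based library generators.
import timeit
import numpy as np

def blocked_matmul(a, b, tile):
    # Blocked (tiled) matrix multiply; slices handle a trailing partial tile.
    n = a.shape[0]
    c = np.zeros_like(a)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                c[i:i+tile, j:j+tile] += a[i:i+tile, k:k+tile] @ b[k:k+tile, j:j+tile]
    return c

def search_tile(n=256, candidates=(16, 32, 64, 128)):
    a, b = np.random.rand(n, n), np.random.rand(n, n)
    times = {t: min(timeit.repeat(lambda: blocked_matmul(a, b, t), number=1, repeat=3))
             for t in candidates}
    return min(times, key=times.get), times

if __name__ == "__main__":
    best, times = search_tile()
    print("best tile:", best, times)
```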
Multicore bundle adjustment
- In IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, 2011
"... We present the design and implementation of new inexact Newton type Bundle Adjustment algorithms that exploit hardware parallelism for efficiently solving large scale 3D scene reconstruction problems. We explore the use of multicore CPU as well as multicore GPUs for this purpose. We show that overco ..."
Abstract
-
Cited by 61 (4 self)
- Add to MetaCart
(Show Context)
We present the design and implementation of new inexact Newton type Bundle Adjustment algorithms that exploit hardware parallelism for efficiently solving large scale 3D scene reconstruction problems. We explore the use of multicore CPUs as well as multicore GPUs for this purpose. We show that overcoming the severe memory and bandwidth limitations of current generation GPUs not only leads to more space efficient algorithms, but also to surprising savings in runtime. Our CPU-based system is up to ten times and our GPU-based system is up to thirty times faster than the current state-of-the-art methods [1], while maintaining comparable convergence behavior. The code and additional results are available at
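The paper's solvers are tailored to bundle adjustment's block structure and to GPU memory limits; the sketch below only shows the generic inexact (truncated) Newton step it builds on for nonlinear least squares: solve the Gauss-Newton normal equations approximately with a few conjugate-gradient iterations instead of factoring them. The problem sizes and data are invented for the demo.

```python
# Generic inexact Gauss-Newton step for nonlinear least squares: solve
# (J^T J) dx = -J^T r only approximately with a few CG iterations, rather than
# forming and factoring the full system. This is the flavour of "inexact Newton"
# the paper builds on, not its bundle-adjustment-specific solver.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def inexact_gauss_newton_step(J, r, max_cg_iters=20):
    n = J.shape[1]
    # Matrix-free application of J^T J, so the normal matrix is never formed.
    jtj = LinearOperator((n, n), matvec=lambda v: J.T @ (J @ v))
    rhs = -J.T @ r
    dx, _ = cg(jtj, rhs, maxiter=max_cg_iters)   # truncated CG = inexact step
    return dx

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    J = rng.standard_normal((200, 50))     # Jacobian of the residual vector
    r = rng.standard_normal(200)           # current residuals
    dx = inexact_gauss_newton_step(J, r)
    print(dx.shape, np.linalg.norm(J.T @ (J @ dx) + J.T @ r))
```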
Random sampling for analog-to-information conversion of wideband signals
- Proc. IEEE Dallas Circuits and Systems Workshop (DCAS)
, 2006
"... Abstract — We develop a framework for analog-to-information conversion that enables sub-Nyquist acquisition and processing of wideband signals that are sparse in a local Fourier representation. The first component of the framework is a random sampling system that can be implemented in practical hard ..."
Abstract
-
Cited by 60 (14 self)
- Add to MetaCart
(Show Context)
We develop a framework for analog-to-information conversion that enables sub-Nyquist acquisition and processing of wideband signals that are sparse in a local Fourier representation. The first component of the framework is a random sampling system that can be implemented in practical hardware. The second is an efficient information recovery algorithm to compute the spectrogram of the signal, which we dub the sparsogram. A simulated acquisition of a frequency hopping signal operates at 33× sub-Nyquist average sampling rate with little degradation in signal quality.
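The sampling hardware and sparsogram recovery in the paper are far more involved; the short sketch below only illustrates the premise: a signal that is sparse in frequency can have its dominant tones located from far fewer, randomly timed samples than uniform Nyquist sampling would require, here by evaluating a nonuniform DFT on the random samples. The tone frequencies and rates are invented for the demo.

```python
# Illustration of the premise only (not the paper's sampling hardware or its
# sparsogram recovery algorithm): a frequency-sparse signal is sampled at
# randomly chosen times at a sub-Nyquist *average* rate, and its dominant tones
# still stand out in a nonuniform DFT evaluated on those samples.
import numpy as np

rng = np.random.default_rng(2)
duration, band = 1.0, 5000.0            # 1 s window, frequencies up to 5 kHz
tones = [440.0, 1250.0, 3100.0]         # sparse Fourier content (invented)

# Random sampling times at ~10x below the Nyquist average rate (2*band samples/s).
n_samples = int(0.1 * 2 * band * duration)
t = np.sort(rng.random(n_samples)) * duration
x = sum(np.cos(2 * np.pi * f * t) for f in tones)

# Nonuniform DFT on a frequency grid: correlate the samples with each candidate tone.
freqs = np.arange(0.0, band, 10.0)
spectrum = np.abs(np.exp(-2j * np.pi * np.outer(freqs, t)) @ x) / n_samples

top = freqs[np.argsort(spectrum)[-3:]]
print(sorted(top))                      # peaks land near the three true tone frequencies
```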