Results 11 - 20
of
158
Optimization of Sparse Matrix-vector Multiplication on Emerging Multicore Platforms
- In Proc. SC2007: High performance computing, networking, and storage conference
, 2007
"... We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore spec ..."
Abstract
-
Cited by 54 (15 self)
- Add to MetaCart
We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore specific optimization methodologies for important scientific computations. In this work, we examine sparse matrix-vector multiply (SpMV) – one of the most heavily used kernels in scientific computing – across a broad spectrum of multicore designs. Our experimental platform includes the homogeneous AMD dual-core and Intel quad-core designs, the heterogeneous STI Cell, as well as the first scientific study of the highly multithreaded Sun Niagara2. We present several optimization strategies especially effective for the multicore environment, and demonstrate significant performance improvements compared to existing state-of-the-art serial and parallel SpMV implementations. Additionally, we present key insights into the architectural tradeoffs of leading multicore design strategies, in the context of demanding memory-bound numerical algorithms. 1.
The Impact of Technology Scaling on Lifetime Reliability
- In DSN (2004), IEEE, IEEE Computer Society
, 2004
"... The relentless scaling of CMOS technology has provided a steady increase in processor performance for the past three decades. However, increased power densities (hence temperatures) and other scaling effects have an adverse impact on long-term processor lifetime reliability. This paper represents a ..."
Abstract
-
Cited by 52 (0 self)
- Add to MetaCart
The relentless scaling of CMOS technology has provided a steady increase in processor performance for the past three decades. However, increased power densities (hence temperatures) and other scaling effects have an adverse impact on long-term processor lifetime reliability. This paper represents a first attempt at quantifying the impact of scaling on lifetime reliability due to intrinsic hard errors, taking workload characteristics into consideration.
Techniques for multicore thermal management: Classification and new exploration
- In ISCA 2006
"... Power density continues to increase exponentially with each new technology generation, posing a major challenge for thermal management in modern processors. Much past work has examined microarchitectural policies for reducing total chip power, but these techniques alone are insufficient if not aimed ..."
Abstract
-
Cited by 48 (3 self)
- Add to MetaCart
Power density continues to increase exponentially with each new technology generation, posing a major challenge for thermal management in modern processors. Much past work has examined microarchitectural policies for reducing total chip power, but these techniques alone are insufficient if not aimed at mitigating individual hotspots. The industry’s current trend has been toward multicore architectures, which provide additional opportunities for dynamic thermal management. This paper explores various thermal management techniques that exploit the distributed nature of multicore processors. We classify these techniques in terms of core throttling policy, whether that policy is applied locally to a core or to the processor as a whole, and process migration policies. We use Turandot and a HotSpot-based thermal simulator to simulate a variety of workloads under thermal duress on a 4-core PowerPC TM processor. Using benchmarks from the SPEC 2000 suite we characterize workloads in terms of instruction throughput as well as their effective duty cycles. Among a variety of options we find that distributed controltheoretic DVFS alone improves throughput by 2.5X under our test conditions. Our final design involves a PI-based core thermal controller and an outer control loop to decide process migrations. This policy avoids all thermal emergencies and yields an average of 2.6X speedup over the baseline across all workloads. 1.
Algorithmic problems in power management
- SIGACT News
, 2005
"... We survey recent research that has appeared in the theoretical computer science literature on algorithmic ..."
Abstract
-
Cited by 46 (3 self)
- Add to MetaCart
We survey recent research that has appeared in the theoretical computer science literature on algorithmic
Reducing Power with Dynamic Critical Path Information
, 2001
"... Recent research has shown that dynamic information regarding instruction criticality can be used to increase microprocessor performance. Critical path information can also be used in processors to achieve a better balance of power and performance. This paper uses the output of a dynamic critical pat ..."
Abstract
-
Cited by 42 (2 self)
- Add to MetaCart
Recent research has shown that dynamic information regarding instruction criticality can be used to increase microprocessor performance. Critical path information can also be used in processors to achieve a better balance of power and performance. This paper uses the output of a dynamic critical path predictor to decrease the power consumption of key portions of the processor without incurring a corresponding decrease in performance. The optimizations include effective use of functional units with different power and latency characteristics and decreased issue logic power. 1.
Overhead-conscious voltage selection for dynamic and leakage energy reduction of time-constrained systems
, 2004
"... Dynamic voltage scaling and adaptive body biasing have been shown to reduce dynamic and leakage power consumption effectively. In this paper, we optimally solve the combined supply voltage and body bias selection problem for multi-processor systems with imposed time constraints, explicitly taking in ..."
Abstract
-
Cited by 34 (6 self)
- Add to MetaCart
Dynamic voltage scaling and adaptive body biasing have been shown to reduce dynamic and leakage power consumption effectively. In this paper, we optimally solve the combined supply voltage and body bias selection problem for multi-processor systems with imposed time constraints, explicitly taking into account the transition overheads implied by changing voltage levels. Both energy and time overheads are considered. We investigate the continuous voltage scaling as well as its discrete counterpart, and we prove NP-hardness in the discrete case. Furthermore, the continuous voltage scaling problem is formulated and solved using nonlinear programming with polynomial time complexity, while for the discrete problem we use mixed integer linear programming. Extensive experiments, conducted on several benchmarks and a real-life example, are used to validate the approaches. 1
Exploiting Choice in Resizable Cache Design to Optimize Deep-Submicron Processor Energy-Delay
- In Proceedings of the Eighth IEEE Symposium on High-Performance Computer Architecture
, 2002
"... Cache memories account for a significant fraction of a chip’s overall energy dissipation. Recent research advocates using “resizable ” caches to exploit cache requirement variability in applications to reduce cache size and eliminate energy dissipation in the cache’s unused sections with minimal imp ..."
Abstract
-
Cited by 34 (2 self)
- Add to MetaCart
Cache memories account for a significant fraction of a chip’s overall energy dissipation. Recent research advocates using “resizable ” caches to exploit cache requirement variability in applications to reduce cache size and eliminate energy dissipation in the cache’s unused sections with minimal impact on performance. Current proposals for resizable caches fundamentally vary in two design aspects: (1) cache organization, where one organization, referred to as selective-ways, varies the cache’s set-associativity, while the other, referred to as selectivesets, varies the number of cache sets, and (2) resizing strategy, where one proposal statically sets the cache size prior to an application’s execution, while the other allows for dynamic resizing both within and across applications. In this paper, we compare and contrast, for the first time, the proposed design choices for resizable caches, and evaluate the effectiveness of cache resizings in reducing the overall energy-delay in deep-submicron processors. In addition, we propose a hybrid selective-sets-andways cache organization that always offers equal or better resizing granularity than both of previously proposed organizations. We also investigate the energy savings from resizing d-cache and i-cache together to characterize the interaction between d-cache and i-cache resizings. 1
Interconnect-power Dissipation in a Microprocessor
- in Proceedings of the International Workshop on System-Level Interconnect Prediction
, 2004
"... Interconnect power is dynamic power dissipation due to switching of interconnection capacitances. This paper describes the characterization of interconnect power in a state-of-the-art high-performance microprocessor designed for power efficiency. The analysis showed that interconnect power is over 5 ..."
Abstract
-
Cited by 32 (4 self)
- Add to MetaCart
Interconnect power is dynamic power dissipation due to switching of interconnection capacitances. This paper describes the characterization of interconnect power in a state-of-the-art high-performance microprocessor designed for power efficiency. The analysis showed that interconnect power is over 50 % of the dynamic power. Over 90 % of the interconnect power is consumed by only 10 % of the interconnections. Relations of interconnect power to wire length distribution and hierarchy level of nets were examined. In light of the results, a router’s algorithms were modified, to use larger wire spacing and minimal length routing for the high power consuming interconnects. The power-aware router algorithm was tested on synthesized blocks, demonstrating average saving of 14 % in the dynamic power consumption without timing degradation or area increase. The results demonstrate the obtainable benefits of tuning physical design algorithms to save power.
Static Energy Reduction Techniques for Microprocessor Caches
- IN 2001 INTERNATIONAL CONFERENCE ON COMPUTER DESIGN
, 2001
"... Microprocessor performance has been improved by increasing the capacity of on-chip caches. However, the performance gain comes at the price of static energy consumption due to subthreshold leakage current in cache memory arrays. This paper compares three techniques for reducing static energy consump ..."
Abstract
-
Cited by 31 (4 self)
- Add to MetaCart
Microprocessor performance has been improved by increasing the capacity of on-chip caches. However, the performance gain comes at the price of static energy consumption due to subthreshold leakage current in cache memory arrays. This paper compares three techniques for reducing static energy consumption in on-chip level-1 and level-2 caches. One technique employs low-leakage transistors in the memory cell. Another technique, power supply switching, can be used to turn off memory cells and discard their contents. A third alternative is dynamic threshold modulation, which places memory cells in a standby state that preserves cell contents. In our experiments, we explore the energy and performance trade-offs of these techniques. We also investigate the sensitivity of microprocessor performance and energy consumption to additional cache latency caused by leakage-reduction techniques.
Methodology For Early-Stage, Microarchitecture-Level Power-Performance Analysis Of Microprocessors
"... The PowerTimer toolset has been developed for use in early-stage, microarchitecture-level power-performance analysis of microprocessors. ..."
Abstract
-
Cited by 31 (16 self)
- Add to MetaCart
The PowerTimer toolset has been developed for use in early-stage, microarchitecture-level power-performance analysis of microprocessors.

