Results 1-10 of 41
Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs
In Proceedings of the 40th Intl. Symp. on Microarchitecture, 2007
Cited by 60 (1 self)
DRAMs require periodic refresh to preserve the data stored in them. The refresh interval for DRAMs depends on the vendor and the design technology they use. For each refresh of a DRAM row, the stored information in each cell is read out and then written back, because each DRAM bit read is self-destructive. The refresh process is unavoidable for maintaining data correctness but comes at the expense of power and bandwidth overhead. The future trend to integrate layers of 3D die-stacked DRAMs on top of a processor further exacerbates the situation, as accesses to these DRAMs will be more frequent and hiding refresh cycles in the available slack becomes increasingly difficult. Moreover, because of the associated temperature increase, the refresh interval of 3D die-stacked DRAMs will become shorter than that of conventional ones. This paper proposes an innovative scheme to reduce the energy consumed in DRAMs. By employing a time-out counter for each memory row of a DRAM module, all the unnecessary periodic refresh operations can be eliminated. The basic concept behind our scheme is that a DRAM row that was recently read or written to by the processor (or other devices that share the same DRAM) does not need to be refreshed again by the periodic refresh operation, thereby eliminating excessive refreshes and the energy they dissipate. Based on this concept, we propose a low-cost technique in the memory controller for DRAM power reduction. The simulation results show that our technique can eliminate up to 86% of all refresh operations, and 59.3% on average, for a 2GB DRAM. This in turn results in a 52.6% energy saving for refresh operations. The overall energy saving in the DRAM is up to 25.7%, with an average of 12.13%, for SPLASH-2, SPECint2000, and Biobench benchmark programs simulated on a 2GB DRAM. For a 64MB 3D DRAM, the energy saving is up to 21%, and 9.37% on average, when the refresh rate is 64 ms. For a faster 32 ms refresh rate, the maximum and average savings are 12% and 6.8%, respectively.
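The per-row time-out counter idea in this abstract can be illustrated with a minimal sketch. It is not the paper's controller design: the class name, the `counter_max` parameter, and the single global `tick` are assumptions made for illustration, and the abstract does not give the counter width or how counter updates are scheduled.

```python
class SmartRefreshSketch:
    """Toy model: one countdown counter per DRAM row (illustrative only)."""

    def __init__(self, num_rows, counter_max):
        # counter_max ticks are assumed to span one full retention period.
        self.counter_max = counter_max
        self.counters = [counter_max] * num_rows
        self.refreshes_issued = 0

    def access(self, row):
        # A read or write restores the row's charge, so its time-out restarts.
        self.counters[row] = self.counter_max

    def tick(self):
        # Advance time by one step; refresh only rows whose counter expired.
        refreshed = []
        for row in range(len(self.counters)):
            self.counters[row] -= 1
            if self.counters[row] == 0:
                refreshed.append(row)
                self.counters[row] = self.counter_max
                self.refreshes_issued += 1
        return refreshed


def run_demo():
    # Row 0 is accessed every 3 ticks; row 1 is never accessed.
    dram = SmartRefreshSketch(num_rows=2, counter_max=4)
    for t in range(8):
        if t % 3 == 0:
            dram.access(0)
        dram.tick()
    return dram.refreshes_issued
```

In this toy run, a plain periodic policy would refresh both rows twice over the 8 ticks (4 refreshes), while the counter-based policy only refreshes the idle row, because the frequently accessed row never times out.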
Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior
Cited by 48 (11 self)
In a modern chip-multiprocessor system, memory is a shared resource among multiple concurrently executing threads. The memory scheduling algorithm should resolve memory contention by arbitrating memory access in such a way that competing threads progress at a relatively fast and even pace, resulting in high system throughput and fairness. Previously proposed memory scheduling algorithms are predominantly optimized for only one of these objectives: no scheduling algorithm provides the best system throughput and best fairness at the same time. This paper presents a new memory scheduling algorithm that addresses system throughput and fairness separately with the goal of achieving the best of both. The main idea is to divide threads into two separate clusters and employ different
BigHouse: A simulation infrastructure for data center systems
Cited by 15 (2 self)
Recently, there has been an explosive growth in Internet services, greatly increasing the importance of data center systems. Applications served from “the cloud” are driving data center growth and quickly overtaking traditional workstations. Although there are many tools for evaluating components of desktop and server architectures in detail, scalable modeling tools are noticeably missing. We describe BigHouse, a simulation infrastructure for data center systems. Instead of simulating servers using detailed microarchitectural models, BigHouse raises the level of abstraction. Using a combination of queuing theory and stochastic modeling, BigHouse can simulate server systems in minutes rather than hours. BigHouse leverages statistical simulation techniques to limit simulation turnaround time to the minimum runtime needed for a desired accuracy. In this paper, we introduce BigHouse, describe its design, and present case studies for how it has already been applied to build and validate models of data center workloads and systems. Furthermore, we describe statistical techniques incorporated into BigHouse to accelerate and parallelize its simulations, and demonstrate its scalability to model large cluster systems while maintaining reasonable simulation time.
Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory
Cited by 14 (2 self)
High density memory is becoming more important as many execution streams are consolidated onto single chip many-core processors. DRAM is ubiquitous as a main memory technology, but while DRAM’s per-chip density and frequency continue to scale, the time required to refresh its dynamic cells has grown at an alarming rate. This paper shows how currently-employed methods to schedule refresh operations are ineffective in mitigating the significant performance degradation caused by longer refresh times. Current approaches are deficient: they do not effectively exploit the flexibility of DRAMs to postpone refresh operations. This work proposes dynamically reconfigurable predictive mechanisms that exploit the full dynamic range allowed in the JEDEC DDRx SDRAM specifications. The proposed mechanisms are shown to mitigate much of the penalties seen with dense DRAM devices. We refer to the overall scheme as Elastic Refresh, in that the refresh policy is stretched to fit the currently executing workload, such that the maximum benefit of the DRAM flexibility is realized. We extend the GEMS on SIMICS tool-set to include Elastic Refresh. Simulations show the proposed solution provides a ∼10% average performance improvement over existing techniques across the entire SPEC CPU suite, and up to a 41% improvement for certain workloads.
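The refresh-postponement flexibility this abstract refers to can be sketched with a very simple policy: defer refreshes while the memory is busy, catch up when it is idle, and force one through once a postponement limit is reached. This is not the paper's predictive mechanism. The function name, the per-interval busy flags, and the default limit of 8 postponed refreshes are assumptions made for illustration.

```python
def schedule_refreshes(busy_per_interval, max_postponed=8):
    """Return how many refresh commands are issued in each refresh interval.

    busy_per_interval: iterable of booleans, True when demand requests
    occupy the memory during that interval (hypothetical input format).
    max_postponed: assumed upper bound on refreshes that may be deferred.
    """
    owed = 0          # refreshes that have fallen due but not been issued
    issued = []
    for busy in busy_per_interval:
        owed += 1     # one refresh falls due every interval
        if not busy:
            # Idle interval: issue everything that is owed.
            issued.append(owed)
            owed = 0
        elif owed > max_postponed:
            # Postponement budget exhausted: force one refresh through.
            issued.append(1)
            owed -= 1
        else:
            # Busy and still within budget: postpone.
            issued.append(0)
    return issued
```

With `max_postponed=2` and three busy intervals followed by one idle interval, the sketch defers twice, forces a single refresh in the third interval, and issues the remaining three in the idle interval, so no refresh is lost overall.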
Multi-Execution: Multicore Caching for Data-Similar Executions
Cited by 14 (3 self)
While microprocessor designers turn to multicore architectures to sustain performance expectations, the dramatic increase in parallelism of such architectures will put substantial demands on off-chip bandwidth and make the memory wall more significant than ever. This paper demonstrates that one profitable application of multicore processors is the execution of many similar instantiations of the same program. We identify that this model of execution is used in several practical scenarios and term it “multi-execution.” Often, each such instance utilizes very similar data. In conventional cache hierarchies, each instance would cache its own data independently. We propose the Mergeable cache architecture that detects data similarities and merges cache blocks, resulting in substantial savings in cache storage requirements. This leads to reductions in off-chip memory accesses and overall power usage, and increases in application performance. We present cycle-accurate simulation results of 8 benchmarks (6 from SPEC2000) to demonstrate that our technique provides a scalable solution and leads to significant speedups due to reductions in main memory accesses. For 8 cores running 8 similar executions of the same application and sharing an exclusive 4-MB, 8-way L2 cache, the Mergeable cache shows a speedup in execution by 2.5× on average (ranging from 0.93× to 6.92×), while posing an overhead of only 4.28% on cache area and 5.21% on power when it is used.
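The block-merging idea can be illustrated with a small software model in which blocks holding identical contents share one stored copy. This is only a sketch: the abstract does not say how the hardware detects similarity, so matching on exact block contents, the class name, and the `(instance_id, address)` tag format are all assumptions made for illustration.

```python
class MergingStoreSketch:
    """Toy model of a cache that stores identical blocks only once."""

    def __init__(self):
        self.tag_to_data = {}   # (instance_id, address) -> block contents
        self.refcount = {}      # block contents -> number of tags sharing it

    def _release(self, tag):
        # Drop this tag's reference to whatever block it currently maps to.
        old = self.tag_to_data.pop(tag, None)
        if old is not None:
            self.refcount[old] -= 1
            if self.refcount[old] == 0:
                del self.refcount[old]

    def fill(self, instance_id, address, data):
        # Install or overwrite a block; identical contents are merged.
        tag = (instance_id, address)
        self._release(tag)
        self.tag_to_data[tag] = data
        self.refcount[data] = self.refcount.get(data, 0) + 1

    def read(self, instance_id, address):
        return self.tag_to_data[(instance_id, address)]

    def stored_blocks(self):
        # Number of distinct block copies actually held.
        return len(self.refcount)
```

If 8 instances fill the same address with identical bytes, the model holds a single copy. When one instance later writes different data, only that instance's mapping splits off, and the other instances still read the original contents.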
Architectural Support for the Stream Execution Model on General-Purpose Processors
Cited by 13 (0 self)
There has recently been much interest in stream processing, both in industry (e.g., Cell, NVIDIA G80, ATI R580) and academia (e.g., Stanford Merrimac, MIT RAW), with stream programs becoming increasingly popular for both media and more general-purpose computing. Although a special style of programming called stream programming is needed to target these stream architectures, huge performance benefits can be achieved. In this paper, we minimally add architectural features to commodity general-purpose processors (e.g., Intel/AMD) to efficiently support the stream execution model. We design the extensions to reuse existing components of the general-purpose processor hardware as much as possible by investigating low-cost modifications to the CPU caches, hardware prefetcher, and the execution core. With a less than 1% increase in die area along with judicious use of a software runtime system, we show that we can efficiently support stream programming on traditional processor cores. We evaluate our techniques by running scientific applications on a cycle-level simulation system. The results show that our system executes stream programs as efficiently as possible, limited only by the ALU performance and the memory bandwidth needed to feed the ALUs.
Coordinating processor and main memory for efficient server power control
In Proceedings of the International Conference on Supercomputing (ICS), 2011
Cited by 8 (1 self)
With the number of high-density servers in data centers rapidly increasing, power control with performance optimization has become a key challenge to gain a high return on investment, by safely accommodating the maximized number of servers allowed by the limited power supply and cooling facilities in a data center. Various power control solutions have been recently proposed for high-density servers and different components in a server to avoid system failures due to power overload or overheating. Existing solutions, unfortunately, either rely only on the processor for server power control, with the assumption that it is the only major power consumer, or limit power only for a single component, such as main memory. As a result, the synergy between the processor and main memory is impaired by uncoordinated power adaptations, resulting in degraded overall system performance. In this paper, we propose a novel power control solution that can precisely limit the peak power consumption of a server below a desired budget. Our solution adapts the power states of both the processor and memory in a coordinated manner, based on their power demands, to achieve optimized system performance. Our solution also features a control algorithm that is designed rigorously based on advanced feedback control theory for guaranteed control accuracy and system stability. Compared with two state-of-the-art server power control solutions, experimental results show that our solution achieves up to 23% better average performance than one baseline for CPU-intensive benchmarks and doubles the performance of the other baseline when the power budget is tight.
Rebound: Scalable Checkpointing for Coherent Shared Memory
SIGARCH Comput. Archit. News, 2011
Cited by 6 (1 self)
As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors. To address this problem, this paper introduces Rebound, the first hardware-based scheme for coordinated local checkpointing in multiprocessors with directory-based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at checkpoints, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at barrier synchronization. Finally, Rebound introduces distributed algorithms for checkpointing and rollback sets of processors. Simulations of parallel programs with up to 64 threads show that Rebound is scalable and has very low overhead. For 64 processors, its average performance overhead is only 2%, compared to 15% for global checkpointing.
Hardware/software co-design for energy-efficient seismic modeling
In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, 2011
Cited by 6 (1 self)
Reverse Time Migration (RTM) has become the standard for high-quality imaging in the seismic industry. RTM relies on PDE solutions using stencils that are 8th order or larger, which require large-scale HPC clusters to meet the computational demands. However, the rising power consumption of conventional cluster technology has prompted investigation of architectural alternatives that offer higher computational efficiency. In this work, we compare the performance and energy efficiency of three architectural alternatives: the Intel Nehalem X5530 multicore processor, the NVIDIA Tesla C2050 GPU, and a general-purpose manycore chip design optimized for high-order wave equations called “Green Wave.” We have developed an FPGA-accelerated architectural simulation platform to accurately model the power and performance of the Green Wave design. Results show that across highly-tuned high-order RTM stencils, the Green Wave implementation can offer up to 8× and 3.5× energy efficiency improvement per node respectively, compared with the Nehalem and GPU platforms. These results point to the enormous potential energy advantages of our hardware/software co-design methodology.
Timing effects of DDR memory systems in hard real-time multicore architectures: Issues and solutions
Cited by 6 (2 self)
Multicore processors are an effective solution to cope with the performance requirements of real-time embedded systems due to their good performance-per-watt ratio and high performance capabilities. Unfortunately, their use in integrated architectures such as IMA or AUTOSAR is limited by the fact that multicores do not guarantee a time composable behavior for the applications: the WCET of a task depends on inter-task interferences introduced by other tasks running simultaneously. This paper focuses on the off-chip memory system: the hardware shared resource with the highest impact on the WCET and hence the main impediment for the use of multicores in integrated architectures. We present an analytical model that computes the worst-case delay, the Upper Bound Delay (UBD), that a memory request can suffer due to memory interferences generated by other co-running tasks. By considering the UBD in the WCET analysis, the resulting WCET estimation is independent of the other tasks, hence ensuring the time composability property and enabling the use of multicores in integrated architectures. We propose a memory controller for hard real-time multicores compliant with the analytical model that