Results 1 - 10 of 38
A new memory monitoring scheme for memory-aware scheduling and partitioning
2002
Cited by 149 (2 self)
We propose a low-overhead, on-line memory monitoring scheme utilizing a set of novel hardware counters. The counters indicate the marginal gain in cache hits as the size of the cache is increased, which gives the cache miss-rate as a function of cache size. Using the counters, we describe a scheme that enables an accurate estimate of the isolated miss-rates of each process as a function of cache size under the standard LRU replacement policy. This information can be used to schedule jobs or to partition the cache to minimize the overall miss-rate. The data collected by the monitors can also be used by an analytical model of cache and memory behavior to produce a more accurate overall miss-rate for the collection of processes sharing a cache in both time and space. This overall miss-rate can be used to …
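The marginal-gain counters the abstract describes correspond to an LRU stack-distance histogram. Below is a minimal software sketch of that idea (ours, not the paper's hardware counters): hist[d] counts hits at stack depth d, i.e. the extra hits gained by growing a fully associative LRU cache from d to d+1 lines, so the miss rate at every cache size falls out of a single pass over the trace. The trace values are hypothetical line addresses.

```python
def stack_distance_histogram(trace):
    """Return (hist, total) where hist[d] = hits at LRU stack depth d."""
    stack = []                          # most-recently-used line at index 0
    hist = {}
    for line in trace:
        if line in stack:
            d = stack.index(line)       # LRU stack distance of this hit
            hist[d] = hist.get(d, 0) + 1
            stack.pop(d)
        stack.insert(0, line)           # line becomes most recently used
    return hist, len(trace)

def miss_rate(hist, total, cache_lines):
    """Miss rate of a fully associative LRU cache with `cache_lines` lines."""
    hits = sum(n for d, n in hist.items() if d < cache_lines)
    return 1.0 - hits / total

trace = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]  # hypothetical line addresses
hist, total = stack_distance_histogram(trace)
for size in (1, 2, 4):
    print(size, miss_rate(hist, total, size))   # miss rate vs cache size
```

One histogram per process, gathered this way, is exactly what lets a scheduler or partitioner predict each process's isolated miss-rate at any candidate cache allocation.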
The Vector-Thread Architecture
In 31st International Symposium on Computer Architecture, 2004
Cited by 52 (7 self)
The vector-thread (VT) architectural paradigm unifies the vector and multithreaded compute models. The VT abstraction provides the programmer with a control processor and a vector of virtual processors (VPs). The control processor can use vector-fetch commands to broadcast instructions to all the VPs, or each VP can use thread-fetches to direct its own control flow. A seamless intermixing of the vector and threaded control mechanisms allows a VT architecture to flexibly and compactly encode application parallelism and locality, and a VT machine exploits these to improve performance and efficiency. We present SCALE, an instantiation of the VT architecture designed for low-power and high-performance embedded systems. We evaluate the SCALE prototype design using detailed simulation of a broad range of embedded applications and show that its performance is competitive with larger and more complex processors.
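As a rough illustration of the two control mechanisms, here is a toy model (ours, not SCALE's ISA) in which a control processor vector-fetches one instruction block to every VP, while any VP may thread-fetch a follow-on block to pursue its own control flow; the block names and operations are invented.

```python
class VirtualProcessor:
    def __init__(self, vp_id):
        self.id = vp_id
        self.acc = 0
        self.next_block = None          # set by a thread-fetch

    def run(self, block, blocks):
        block(self)                     # execute the vector-fetched block
        while self.next_block:          # follow this VP's own thread-fetches
            blk, self.next_block = blocks[self.next_block], None
            blk(self)

def vector_fetch(vps, block, blocks):
    """The control processor broadcasts one block to all VPs."""
    for vp in vps:
        vp.run(block, blocks)

def add_id(vp):                         # hypothetical vector-fetched block
    vp.acc += vp.id
    if vp.id % 2:                       # odd VPs diverge via a thread-fetch
        vp.next_block = "fixup"

blocks = {"fixup": lambda vp: setattr(vp, "acc", vp.acc * 10)}

vps = [VirtualProcessor(i) for i in range(4)]
vector_fetch(vps, add_id, blocks)
print([vp.acc for vp in vps])           # [0, 10, 2, 30]
```

The point of the model is the mixing: the common path is encoded once and broadcast, while only the VPs that diverge pay for independent control flow.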
Direct Addressed Caches for Reduced Power Consumption
In Proceedings of the 34th Annual International Symposium on Microarchitecture, 2001
Cited by 32 (2 self)
A direct addressed cache is a hardware-software design for an energy-efficient microprocessor data cache. Direct addressing allows software to access cache data without a hardware cache tag check. These tag-unchecked loads and stores save the energy of a tag check when the compiler can guarantee an access will be to the same line as an earlier access. We have added support for tag-unchecked loads and stores to C and Java compilers. For Mediabench C programs, the compiler eliminates 16-76% of data cache tag accesses, with half of the benchmarks avoiding over 40% of the data tag checks. For SPECjvm98 Java programs, the compiler eliminates 18-63% of data cache tag checks. These tag check reductions translate into data cache energy savings of 9-40%, and overall processor and cache energy savings of 2-8%.
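A behavioral sketch of the mechanism (ours, not the paper's hardware or compiler): the first access through a hypothetical "way register" pays a tag check and latches which line it resolved to; later accesses that the compiler proved hit the same line reuse the register and skip the check. The 32-byte line size and the register name are assumptions.

```python
LINE_BITS = 5                           # 32-byte cache lines (assumed)

class DirectAddressedCache:
    def __init__(self):
        self.tag_checks = 0
        self.way_regs = {}              # way register name -> latched line

    def load(self, addr, via_reg=None):
        line = addr >> LINE_BITS
        if via_reg is not None and self.way_regs.get(via_reg) == line:
            return                      # tag-unchecked load: skip the check
        self.tag_checks += 1            # regular load: pay for a tag check
        if via_reg is not None:
            self.way_regs[via_reg] = line   # latch the line for later reuse

cache = DirectAddressedCache()
base = 0x1000
for off in range(0, 32, 4):             # compiler proved: same 32-byte line
    cache.load(base + off, via_reg="r1")
print(cache.tag_checks)                 # 1 tag check instead of 8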
Increasing and Detecting Memory Address Congruence
In International Conference on Parallel Architectures and Compilation Techniques, 2002
Cited by 28 (3 self)
A static memory reference exhibits a unique property when its dynamic memory addresses are congruent with respect to some non-trivial modulus. Extraction of this congruence information at compile-time enables new classes of program optimization. In this paper, we present methods for forcing congruence among the dynamic addresses of a memory reference. We also introduce a compiler algorithm for detecting this property. Our transformations do not require interprocedural analysis and introduce almost no overhead. As a result, they can be incorporated into real compilation systems.
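The paper's detection is a static compiler analysis; as a dynamic stand-in, the sketch below (ours) recovers a congruence from observed addresses: all addresses of a reference are congruent modulo M exactly when M divides every pairwise difference, so the largest such modulus is the gcd of the differences from the first address. The sample addresses are hypothetical.

```python
from math import gcd
from functools import reduce

def congruence(addrs):
    """Return (modulus, offset) with addr % modulus == offset for all addrs."""
    m = reduce(gcd, (a - addrs[0] for a in addrs[1:]), 0)
    return m, (addrs[0] % m if m else addrs[0])

# e.g. a reference striding through an array of 16-byte elements
addrs = [0x8000 + 16 * i for i in range(8)]
print(congruence(addrs))    # (16, 0): every address is congruent to 0 mod 16
```

Knowing that a reference is, say, always 0 mod 16 is what licenses optimizations such as alignment-aware vectorization or bank disambiguation without interprocedural analysis.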
An Adaptive Serial-Parallel CAM Architecture for Low-Power Cache Blocks
2002
Cited by 20 (2 self)
There is an ongoing debate about which consumes less energy: a RAM-tagged associative cache with an intelligent order of accessing its tags and ways (e.g. way prediction), or a CAM-tagged high-associativity cache. If a CAM search can consume less than twice the energy of reading a tag RAM, it would probably be the preferred option for low-power applications.
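A back-of-envelope version of that comparison, with purely hypothetical numbers: if the first probe of a well-ordered RAM-tagged cache hits with probability p and a mispredict falls back to reading all n tag ways, the expected tag energy sits between one and two tag-RAM reads for realistic p, which is where the "less than twice" threshold comes from.

```python
def ram_tagged_energy(e_tag_read, p_first_probe_hit, n_ways):
    # expected tag reads: 1 on a correct first probe, else all n ways
    return e_tag_read * (p_first_probe_hit + (1 - p_first_probe_hit) * n_ways)

e_tag = 1.0                             # normalize one tag-RAM read to 1 unit
e_cam_search = 1.8                      # hypothetical CAM search cost (< 2x)
for p in (0.85, 0.95):                  # assumed first-probe hit rates
    print(p, ram_tagged_energy(e_tag, p, 4), e_cam_search)
# p=0.85 -> 1.45 units; p=0.95 -> 1.15 units: the winner depends on p
```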
A way-halting cache for low-energy high-performance systems
ACM Transactions on Architecture and Code Optimization (TACO)
Cited by 20 (3 self)
Caches contribute to much of a microprocessor system's power and energy consumption. We have developed a new cache architecture, called a way-halting cache, that reduces energy while imposing no performance overhead. Our way-halting cache is a four-way set-associative cache that stores the four lowest-order bits of all ways' tags in a fully associative memory, which we call the halt tag array. The lookup in the halt tag array is done in parallel with, and is no slower than, the set-index decoding. The halt tag array pre-determines which tags cannot match due to their low-order four bits mismatching. Further accesses to ways with known mismatching tags are then halted, thus saving power. Our halt tag array has the additional feature of using static logic only, rather than the dynamic logic used in highly associative caches. We provide data from experiments on 17 benchmarks drawn from MediaBench and SPEC 2000, based on our layouts in 0.18 micron CMOS technology. On average, 55% savings of memory-access related energy were obtained over a conventional four-way set-associative cache. We show that the energy savings are greater than those of previous methods, and nearly twice those of highly associative caches, while imposing no performance overhead and only 2% cache area overhead.
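A behavioral sketch of way halting (ours, not the paper's circuit): each way keeps the low four bits of its tags in a halt tag array; a lookup compares those bits for all ways (overlapping the set-index decode in hardware) and performs the full tag compare only for the ways that were not halted. Sizes and tag values are arbitrary.

```python
WAYS, SETS, HALT_BITS = 4, 256, 4
LOW_MASK = (1 << HALT_BITS) - 1

class WayHaltingCache:
    def __init__(self):
        self.tags = [[None] * SETS for _ in range(WAYS)]   # full tags
        self.halt = [[None] * SETS for _ in range(WAYS)]   # low-4-bit tags

    def fill(self, way, index, tag):
        self.tags[way][index] = tag
        self.halt[way][index] = tag & LOW_MASK

    def lookup(self, index, tag):
        low = tag & LOW_MASK
        # in hardware this comparison overlaps the set-index decode
        active = [w for w in range(WAYS) if self.halt[w][index] == low]
        full_compares = len(active)     # energy proxy: ways not halted
        hit = any(self.tags[w][index] == tag for w in active)
        return hit, full_compares

c = WayHaltingCache()
c.fill(0, 7, 0xAB1)
c.fill(1, 7, 0xCD2)
print(c.lookup(7, 0xAB1))   # (True, 1): three of four ways halted
print(c.lookup(7, 0xEF3))   # (False, 0): all ways halted, miss known early
```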
Comparing FPGA vs. Custom CMOS and the Impact on Processor Microarchitecture
In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA '11, 2011
Cited by 13 (2 self)
As soft processors are increasingly used in diverse applications, there is a need to evolve their microarchitectures in a way that suits the FPGA implementation substrate. This paper compares the delay and area of a comprehensive set of processor building block circuits when implemented on custom CMOS and FPGA substrates. We then use the results of these comparisons to infer how the microarchitecture of soft processors on FPGAs should differ from that of hard processors on custom CMOS. We find that the ratios of the area required by an FPGA to that of custom CMOS for different building blocks vary significantly more than the speed ratios. As area is often a key design constraint in FPGA circuits, area ratios have the most impact on microarchitecture choices. Complete processor cores have area ratios of 17-27× and delay ratios of 18-26×. Building blocks that have dedicated hardware support on FPGAs, such as SRAMs, adders, and multipliers, are particularly area-efficient (2-7× area ratio), while multiplexers and CAMs are particularly area-inefficient (>100× area ratio), leading to cheaper ALUs, larger caches of low associativity, and more expensive bypass networks than on similar hard processors. We also find that the low delay ratio for pipeline latches (12-19×) suggests soft processors should have pipeline depths 20% greater than hard processors of similar complexity.
Fine-Grain CAM-Tag Cache Resizing Using Miss Tags
"... A new dynamic cache resizing scheme for low-power CAMtag caches is introduced. A control algorithm that is only activated on cache misses uses a duplicate set of tags, the miss tags, to minimize active cache size while sustaining close to the same hit rate as a full size cache. The cache partitionin ..."
Abstract
-
Cited by 12 (1 self)
- Add to MetaCart
(Show Context)
A new dynamic cache resizing scheme for low-power CAM-tag caches is introduced. A control algorithm that is only activated on cache misses uses a duplicate set of tags, the miss tags, to minimize active cache size while sustaining close to the same hit rate as a full-size cache. The cache partitioning mechanism saves both switching and leakage energy in unused partitions with little impact on cycle time. Simulation results show that the scheme saves 28-56% of data cache energy and 34-49% of instruction cache energy with minimal performance impact.
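A control-loop sketch of the scheme (ours; the thresholds, interval policy, and the replacement stand-in are assumptions): the live cache runs with only `active` partitions enabled, while a duplicate full-size tag store, the miss tags, is consulted off the hit path on misses. A miss that the full-size tags would have hit is charged to downsizing; an interval with many such misses grows the cache, and one with none shrinks it.

```python
class MissTagResizer:
    def __init__(self, full_size, grow_threshold=8):
        self.full = set()               # miss tags: full-size tag contents
        self.full_size = full_size
        self.active = full_size         # currently enabled partitions
        self.resize_misses = 0          # misses caused only by downsizing
        self.grow_threshold = grow_threshold

    def on_miss(self, tag):
        # consulted only on a miss in the downsized cache (off the hit path)
        if tag in self.full:
            self.resize_misses += 1     # the full-size cache would have hit
        self.full.add(tag)
        if len(self.full) > self.full_size:
            self.full.pop()             # stand-in for full-size LRU eviction

    def end_interval(self):
        if self.resize_misses > self.grow_threshold:
            self.active = min(self.active * 2, self.full_size)
        elif self.resize_misses == 0:
            self.active = max(self.active // 2, 1)
        self.resize_misses = 0

r = MissTagResizer(full_size=8, grow_threshold=2)
r.active = 2                            # start downsized
for tag in [1, 2, 3, 1, 2, 3, 1, 2, 3]:
    r.on_miss(tag)                      # drive with a stream of misses
r.end_interval()
print(r.active)                         # 4: the full-size tags kept hitting
```

Because the miss tags are only read on misses, the extra structure adds energy and latency off the common hit path, which is what keeps the cycle-time impact small.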
A CAM With Mixed Serial-Parallel Comparison for Use in Low Energy Caches
2004
Cited by 10 (0 self)
A novel, low-energy content addressable memory (CAM) structure is presented which achieves an approximately four-fold improvement in energy per access, compared to a standard parallel CAM, when used as tag storage for caches. It exploits the address patterns commonly found in application programs, where testing the four least significant bits of the tag is sufficient to determine over 90% of the tag mismatches; the proposed CAM checks those bits first and evaluates the remainder of the tag only if they match. Although the energy savings come at the cost of a 25% increase in search time, the proposed CAM organization also supports a parallel operating mode without a speed loss but with reduced energy savings.
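A behavioral model of the mixed comparison (ours, not the circuit): every entry compares its four LSBs in parallel first, and only the survivors evaluate the remaining tag bits. Counting bit-comparisons as a crude energy proxy shows how halting over 90% of entries after four bits produces the large saving; the entry contents are hypothetical.

```python
LSB = 4                                 # serially checked low bits

def cam_search(entries, key, width=32):
    low_mask = (1 << LSB) - 1
    # stage 1: all entries compare their four LSBs in parallel
    survivors = [e for e in entries if (e & low_mask) == (key & low_mask)]
    energy = len(entries) * LSB         # bit-comparisons, stage 1
    # stage 2: only survivors evaluate the remaining tag bits
    energy += len(survivors) * (width - LSB)
    match = any(e == key for e in survivors)
    return match, energy

entries = [0x1230, 0x4561, 0x7892, 0xABC3]   # hypothetical tag contents
print(cam_search(entries, 0x7892))  # (True, 44) vs 128 for a fully parallel CAM
```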
ZOOM: A Performance-Energy Cache Simulator
2002
Cited by 4 (0 self)
Introduction: Caches play a crucial role in narrowing the ever-widening processor-memory performance gap. A cache designer has to balance often-conflicting demands for power, speed, and area. Reaching an optimum compromise among these three design dimensions requires a firm grasp of the effects of micro-architectural changes. However, exploring the large design space of candidate cache architectures using conventional circuit simulation tools would be extremely tedious and time-consuming. Hence, simplified analytical models are very valuable. ZOOM is a fast, …