Cacti 3.0: An integrated cache timing, power, and area model. (2001)

by P. Shivakumar, N. P. Jouppi

Results 1 - 10 of 274

Temperature-aware microarchitecture

by Kevin Skadron, Mircea R. Stan, Wei Huang, Sivakumar Velusamy, Karthik Sankaranarayanan, David Tarjan - In Proceedings of the 30th Annual International Symposium on Computer Architecture, 2003
"... With power density and hence cooling costs rising exponentially, processor packaging can no longer be designed for the worst case, and there is an urgent need for runtime processor-level techniques that can regulate operating temperature when the package’s capacity is exceeded. Evaluating such techn ..."
Abstract - Cited by 478 (52 self)
With power density and hence cooling costs rising exponentially, processor packaging can no longer be designed for the worst case, and there is an urgent need for runtime processor-level techniques that can regulate operating temperature when the package’s capacity is exceeded. Evaluating such techniques, however, requires a thermal model that is practical for architectural studies. This paper describes HotSpot, an accurate yet fast model based on an equivalent circuit of thermal resistances and capacitances that correspond to microarchitecture blocks and essential aspects of the thermal package. Validation was performed using finite-element simulation. The paper also introduces several effective methods for dynamic thermal management (DTM): “temperature-tracking” frequency scaling, localized toggling, and migrating computation to spare hardware units. Modeling temperature at the microarchitecture level also shows that power metrics are poor predictors of temperature, and that sensor imprecision has a substantial impact on the performance of DTM.
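HotSpot's central idea, as summarized above, is an equivalent thermal RC network. A minimal single-node sketch (not HotSpot's actual code; all parameter values are illustrative assumptions) shows the duality the abstract relies on: power behaves like a current source and temperature like a node voltage.

```python
# Minimal sketch of a lumped thermal RC node, in the spirit of the
# HotSpot abstract (not its implementation). Values are illustrative.
T_AMB = 45.0     # ambient temperature (deg C)
R_TH  = 0.8      # thermal resistance to ambient (K/W)
C_TH  = 0.03     # thermal capacitance (J/K)
DT    = 1e-4     # integration time step (s)

def step_temperature(temp, power, dt=DT):
    """One forward-Euler step of: C * dT/dt = P - (T - T_amb) / R."""
    return temp + dt * (power - (temp - T_AMB) / R_TH) / C_TH

temp = T_AMB
for _ in range(int(0.5 / DT)):               # simulate 0.5 s at 20 W
    temp = step_temperature(temp, 20.0)
print(f"steady-state estimate: {temp:.1f} C")  # ~T_amb + P*R = 61.0 C
```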

Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction

by Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ranganathan, Dean M. Tullsen, 2003
"... This paper proposes and evaluates single-ISA heterogeneous multi-core architectures as a mechanism to reduce processor power dissipation. Our design incorporates heterogeneous cores representing different points in the power/performance design space; during an application 's execution, system s ..."
Abstract - Cited by 349 (22 self)
This paper proposes and evaluates single-ISA heterogeneous multi-core architectures as a mechanism to reduce processor power dissipation. Our design incorporates heterogeneous cores representing different points in the power/performance design space; during an application's execution, system software dynamically chooses the most appropriate core to meet specific performance and power requirements.
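The core-selection policy the abstract alludes to can be sketched minimally. The paper studies Alpha-family cores; the table and the pick-the-cheapest-adequate-core objective below are illustrative assumptions, not the paper's exact heuristics.

```python
# Sketch of single-ISA heterogeneous core selection: pick the
# lowest-power core that still meets a performance target for the
# current phase. The core table values are illustrative assumptions.
CORES = {                 # name -> (relative performance, watts)
    "EV4-like": (1.0, 2.0),
    "EV5-like": (1.5, 5.0),
    "EV6-like": (2.4, 17.0),
}

def choose_core(perf_target):
    ok = [(w, name) for name, (perf, w) in CORES.items() if perf >= perf_target]
    # fall back to the fastest core if no core meets the target
    return min(ok)[1] if ok else max(CORES, key=lambda n: CORES[n][0])

print(choose_core(1.2))   # EV5-like: cheapest core meeting the target
```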

Citation Context

...ers and large enough load/store queues to ensure no conflicts for these structures. The various miss penalties and L2 cache access latencies for the simulated cores were determined using CACTI. CACTI [37] provides an integrated model of cache access time, cycle time, area, aspect ratio, and power. To calculate the penalties, we used CACTI to get access times and then added one cycle each for L1-miss d...
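The arithmetic described in this context is straightforward to sketch: convert a CACTI-reported access time in nanoseconds to whole core cycles, then add the fixed cycles the authors mention. The 4.5 ns latency and 2 GHz clock below are illustrative assumptions, not values from the paper.

```python
import math

# Round a CACTI-reported access time up to whole core cycles, then
# add one cycle for L1-miss detection (as the context describes).
def access_cycles(access_time_ns, core_ghz):
    return math.ceil(access_time_ns * core_ghz)

l2_penalty = access_cycles(4.5, 2.0) + 1   # 9 cycles access + 1 detect
print(l2_penalty)                          # 10
```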

An adaptive, nonuniform cache structure for wire-delay dominated on-chip caches

by Changkyu Kim, Doug Burger, Stephen W. Keckler - In International Conference on Architectural Support for Programming Languages and Operating Systems, 2002
"... Growing wire delays will force substantive changes in the designs of large caches. Traditional cache architectures assume that each level in the cache hierarchy has a single, uniform access time. Increases in on-chip communication delays will make the hit time of large on-chip caches a function of a ..."
Abstract - Cited by 314 (39 self)
Growing wire delays will force substantive changes in the designs of large caches. Traditional cache architectures assume that each level in the cache hierarchy has a single, uniform access time. Increases in on-chip communication delays will make the hit time of large on-chip caches a function of a line's physical location within the cache. Consequently, cache access times will become a continuum of latencies rather than a single discrete latency. This nonuniformity can be exploited to provide faster access to cache lines in the portions of the cache that reside closer to the processor. In this paper, we evaluate a series of cache designs that provide fast hits to multi-megabyte cache memories. We first propose physical designs for these Non-Uniform Cache Architectures (NUCAs). We extend these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache. We show that, for multi-megabyte level-two caches, an adaptive, dynamic NUCA design achieves 1.5 times the IPC of a Uniform Cache Architecture of any size, outperforms the best static NUCA scheme by 11%, outperforms the best three-level hierarchy by 13% while using less silicon area, and comes within 13% of an ideal minimal hit latency solution.
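The continuum-of-latencies idea can be sketched as a latency function over bank position. The grid size and cycle costs below are illustrative assumptions, not the paper's physical design.

```python
# Sketch of NUCA-style nonuniform hit latency on an assumed 4x4 bank
# grid: latency grows with the Manhattan distance from the cache
# controller (at bank (0, 0)) to the bank holding the line.
BASE_CYCLES = 3        # bank access itself
CYCLES_PER_HOP = 2     # wire/switch delay per hop

def hit_latency(bank_row, bank_col):
    hops = bank_row + bank_col
    return BASE_CYCLES + CYCLES_PER_HOP * hops

print([hit_latency(r, c) for r in range(4) for c in range(4)])
# latencies span 3..15 cycles: a continuum, not one discrete hit time
```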

Citation Context

...cally choosing the optimal sub-bank count, size, and orientation. To estimate the cache bank delay, we used Cacti 3.0, which accounts for capacity, sub-bank organization, area, and process technology [27]. Figure 2 contains an example of a Cacti-style bank, shown in the circular expanded section of one bank. The cache is modeled assuming a central pre-decoder, which drives signals to the local decoder...

Secure Program Execution via Dynamic Information Flow Tracking

by G. Edward Suh, Jaewook Lee, Srinivas Devadas, 2004
"... Dynamic information flow tracking is a hardware mechanism to protect programs against malicious attacks by identifying spurious information flows and restricting the usage of spurious information. Every security attack to take control of a program needs to transfer the program’s control to malevolen ..."
Abstract - Cited by 271 (3 self)
Dynamic information flow tracking is a hardware mechanism to protect programs against malicious attacks by identifying spurious information flows and restricting the usage of spurious information. Every security attack that takes control of a program needs to transfer the program’s control to malevolent code. In our approach, the operating system identifies a set of input channels as spurious, and the processor tracks all information flows from those inputs. A broad range of attacks are effectively defeated by disallowing the spurious data to be used as instructions or jump target addresses. We describe two different security policies that track differing sets of dependencies. Implementing the first policy only incurs, on average, a memory overhead of 0.26% and a performance degradation of 0.02%. This policy does not require any modification of executables. The stronger policy incurs, on average, a memory overhead of 4.5% and a performance degradation of 0.8%, and requires binary annotation.
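The tracking rule in the abstract can be sketched as a software analogy of the hardware mechanism. The register names, three-operand form, and trap policy below are illustrative assumptions, not the paper's ISA.

```python
# Software sketch of dynamic information flow (taint) tracking, in the
# spirit of the abstract: data from spurious inputs carries a taint
# bit, taint propagates through computation, and tainted values may
# not be used as jump targets.
taint = {}  # register name -> bool (True = derived from spurious input)

def load_from_input(dst):
    taint[dst] = True                 # OS marked this channel spurious

def alu_op(dst, src1, src2):
    taint[dst] = taint.get(src1, False) or taint.get(src2, False)

def jump_indirect(target_reg):
    if taint.get(target_reg, False):  # tainted jump target: trap
        raise RuntimeError("security exception: spurious jump target")

load_from_input("r1")
alu_op("r2", "r1", "r3")   # taint propagates from r1 into r2
try:
    jump_indirect("r2")
except RuntimeError as e:
    print(e)               # security exception: spurious jump target
```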

Citation Context

...ce for tag caches, we increased the cache size of the baseline case by the amount used by the tag caches in each configuration. The access latency for a larger cache is estimated using the CACTI tool [22]. For some benchmarks, increased cache latency results in worse performance, even though the cache is larger. In our experiments we report the worst-case performance degradation by choosing the baseli...

System level analysis of fast, per-core DVFS using on-chip switching regulators

by Wonyoung Kim, Meeta S. Gupta, Gu-yeon Wei, David Brooks - In International Symposium on High-Performance Computer Architecture, 2008
"... Portable, embedded systems place ever-increasing demands on high-performance, low-power microprocessor design. Dynamic voltage and frequency scaling (DVFS) is a well-known technique to reduce energy in digital systems, but the effectiveness of DVFS is hampered by slow voltage transitions that occur ..."
Abstract - Cited by 147 (8 self)
Portable, embedded systems place ever-increasing demands on high-performance, low-power microprocessor design. Dynamic voltage and frequency scaling (DVFS) is a well-known technique to reduce energy in digital systems, but the effectiveness of DVFS is hampered by slow voltage transitions that occur on the order of tens of microseconds. In addition, the recent trend towards chip multiprocessors (CMP) executing multi-threaded workloads with heterogeneous behavior motivates the need for per-core DVFS control mechanisms. Voltage regulators that are integrated onto the same chip as the microprocessor core provide the benefit of both nanosecond-scale voltage switching and per-core voltage control. We show that these characteristics provide significant energy-saving opportunities compared to traditional off-chip regulators. However, the implementation of on-chip regulators presents many challenges including regulator efficiency and output voltage transient characteristics, which are significantly impacted by the system-level application of the regulator. In this paper, we describe and model these costs, and perform a comprehensive analysis of a CMP system with on-chip integrated regulators. We conclude that on-chip regulators can significantly improve DVFS effectiveness and lead to overall system energy savings in a CMP, but architects must carefully account for overheads and costs when designing next-generation DVFS systems and algorithms.
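The first-order energy argument behind DVFS follows from the standard dynamic-power relation, where power is proportional to C·V²·f. A sketch with illustrative operating points (not the paper's measurements):

```python
# First-order DVFS sketch: dynamic power scales as C_eff * V^2 * f, so
# lowering voltage and frequency together cuts energy per unit of work.
def dynamic_power(c_eff, volt, freq_hz):
    return c_eff * volt**2 * freq_hz

CYCLES = 3.0e9                            # fixed work: 3 billion cycles
p_hi = dynamic_power(1e-9, 1.2, 3.0e9)    # nominal: 1.2 V at 3 GHz
p_lo = dynamic_power(1e-9, 0.9, 2.0e9)    # scaled:  0.9 V at 2 GHz

e_hi = p_hi * CYCLES / 3.0e9              # energy = power * time
e_lo = p_lo * CYCLES / 2.0e9
print(f"energy ratio: {e_lo / e_hi:.2f}") # ~0.56: ~44% dynamic energy saved
```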

Citation Context

...ework We employ an architectural power-performance simulator that generates realistic current traces. We use SESC [25], a multi-core simulator, integrated with power-models based on Wattch [7], Cacti [29], and Orion [32]. A simple in-order processor model represents configurations similar to embedded processors like Xscale [10]. The per-core current load is 400mA when fully active and 120mA when idle....

Deterministic Memory-Efficient String Matching Algorithms for Intrusion Detection

by Nathan Tuck, Timothy Sherwood, Brad Calder, George Varghese - In IEEE Infocom, Hong Kong, 2004
"... Intrusion Detection Systems (IDSs) have become widely recognized as powerful tools for identifying, deterring and deflecting malicious attacks over the network. Essential to almost every intrusion detection system is the ability to search through packets and identify content that matches known attac ..."
Abstract - Cited by 143 (4 self)
Intrusion Detection Systems (IDSs) have become widely recognized as powerful tools for identifying, deterring and deflecting malicious attacks over the network. Essential to almost every intrusion detection system is the ability to search through packets and identify content that matches known attacks. Space and time efficient string matching algorithms are therefore important for identifying these packets at line rate.

Citation Context

...ations of Aho-Corasick is the default case, one or two nodes traversed per character of input. We then explore tradeoffs in SRAM memory widths, sizes and numbers of ports using a version of CACTI 3.0 [17] modified to correlate closely with the results generated by 130 nm memory generators. We use the methodology of [16] to find a Pareto optimal design for a pipelined wide-word unified memory subsystem...
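For reference, the Aho-Corasick automaton this context mentions can be sketched compactly in software; the paper stores such tables in SRAM, which is what the modified CACTI is sizing here. This is the generic textbook construction, not the authors' code:

```python
from collections import deque

def build_aho_corasick(patterns):
    """Build goto/fail/output tables for multi-pattern matching."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:                      # phase 1: build the trie
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(p)
    queue = deque(goto[0].values())         # phase 2: failure links (BFS)
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]
    return goto, fail, out

def search(text, tables):
    goto, fail, out = tables
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]             # follow failure links
        state = goto[state].get(ch, 0)
        for p in out[state]:
            hits.append((i - len(p) + 1, p))
    return hits

tables = build_aho_corasick(["he", "she", "his", "hers"])
print(sorted(search("ushers", tables)))  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```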

Exploring interconnections in multi-core architectures

by Rakesh Kumar, Victor Zyuban, Dean M. Tullsen, 2005
"... This paper examines the area, power, performance, and design issues for the on-chip interconnects on a chip multiprocessor, attempting to present a comprehensive view of a class of interconnect architectures. It shows that the design choices for the interconnect have significant effect on the rest o ..."
Abstract - Cited by 128 (6 self)
This paper examines the area, power, performance, and design issues for the on-chip interconnects on a chip multiprocessor, attempting to present a comprehensive view of a class of interconnect architectures. It shows that the design choices for the interconnect have a significant effect on the rest of the chip, potentially consuming a significant fraction of the real estate and power budget. This research shows that designs that treat the interconnect as an entity that can be independently architected and optimized will not arrive at the best multicore design. Several examples are presented showing the need for careful co-design. For instance, increasing interconnect bandwidth requires area that then constrains the number of cores or cache sizes, and does not necessarily increase performance. Also, shared level-2 caches become significantly less attractive when the overhead of the resulting crossbar is accounted for. A hierarchical bus structure is examined which negates some of the performance costs of the assumed baseline architecture.
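The crossbar-overhead point can be illustrated with first-order arithmetic: crossbar area grows roughly quadratically with port count, so adding cores shrinks the cache budget nonlinearly. All constants below are illustrative assumptions, not the paper's data.

```python
# First-order sketch of the co-design tension in the abstract: a
# crossbar's area grows ~quadratically with its port count, so adding
# cores eats the area left for caches faster than linearly.
DIE_MM2 = 200.0              # total die budget (assumed)
CORE_MM2 = 12.0              # area per core (assumed)
MM2_PER_XBAR_PORT2 = 0.08    # crossbar area per ports^2 (assumed)

def cache_area(n_cores):
    xbar = MM2_PER_XBAR_PORT2 * n_cores**2
    return DIE_MM2 - n_cores * CORE_MM2 - xbar

for n in (4, 8, 16):
    print(n, round(cache_area(n), 1))
# 4 -> 150.7 mm^2 left for cache; 16 -> -12.5 (design is infeasible)
```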

Citation Context

...ed for as well. The tools and our interconnection models have been validated against a real, implemented design. The cache access times are calculated using assumptions similar to those made in CACTI [28]. Memory latency is set to 500 cycles. The average CPI of the modeled core over all the workloads that we use, assuming perfect L2, is measured to be 2.65. Core frequency is assumed to be 5GHz for the...

The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays

by M. S. Hrishikesh, Keith I. Farkas, Doug Burger, Stephen W. Keckler, Premkishore Shivakumar - In Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002
"... Microprocessor clock frequency has improved by nearly 40% annually over the past decade. This improvement has been provided, in equal measure, by smaller technologies and deeper pipelines. From our study of the SPEC 2000 bench-marks, we find that for a high-performance architecture imple-mented in l ..."
Abstract - Cited by 120 (14 self)
Microprocessor clock frequency has improved by nearly 40% annually over the past decade. This improvement has been provided, in equal measure, by smaller technologies and deeper pipelines. From our study of the SPEC 2000 benchmarks, we find that for a high-performance architecture implemented in 100nm technology, the optimal clock period is approximately 8 fan-out-of-four (FO4) inverter delays for integer benchmarks, comprised of 6 FO4 of useful work and an overhead of about 2 FO4. The optimal clock period for floating-point benchmarks is 6 FO4. We find these optimal points to be insensitive to latch and clock skew overheads. Our study indicates that further pipelining can at best improve performance of integer programs by a factor of 2 over current designs. At these high clock frequencies it will be difficult to design the instruction issue window to operate in a single cycle. Consequently, we propose and evaluate a high-frequency design called a segmented instruction window.
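A toy version of the optimization in the abstract: the clock period is useful logic per stage plus a fixed latch/skew overhead, while deeper pipelines lose more cycles per hazard. The constants below are tuned only to land near the abstract's 8 FO4 figure and are otherwise illustrative assumptions, not the paper's simulation data.

```python
# Toy model of the pipeline-depth optimum. Total useful logic on the
# critical loop is fixed; each extra stage adds a fixed latch/skew
# overhead but makes each hazard cost more cycles to refill.
TOTAL_LOGIC_FO4 = 180.0   # useful logic, in FO4 inverter delays
OVERHEAD_FO4 = 2.0        # latch + clock-skew overhead per stage
HAZARD_RATE = 0.2         # pipeline-disrupting events per instruction
FLUSH_FRACTION = 0.5      # fraction of the pipe refilled per event

def fo4_per_instruction(stages):
    period = TOTAL_LOGIC_FO4 / stages + OVERHEAD_FO4
    cpi = 1.0 + HAZARD_RATE * FLUSH_FRACTION * stages
    return period * cpi

best = min(range(4, 80), key=fo4_per_instruction)
print(best, TOTAL_LOGIC_FO4 / best + OVERHEAD_FO4)
# -> 30 stages at an 8.0 FO4 clock period with these assumed constants
```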

Citation Context

... FP benchmarks separately. All experiments skip the first 500 million instructions of each benchmark and simulate the next 500 million instructions. 3.2 Microarchitectural Structures We use Cacti 3.0 [12] to model on-chip microarchitectural structures and to estimate their access times. Cacti is an analytical tool originally developed by Jouppi and Wilton [7]. All major microarchitectural structures--...

Optimizing replication, communication, and capacity allocation in CMPs

by Zeshan Chishti, Michael D. Powell, T. N. Vijaykumar - In International Symposium on Computer Architecture, 2005
"... Chip multiprocessors (CMPs) substantially increase capacity pressure on the on-chip memory hierarchy while requiring fast access. Neither private nor shared caches can provide both large capacity and fast access in CMPs. We observe that compared to symmetric multiprocessors (SMPs), CMPs change the l ..."
Abstract - Cited by 107 (0 self)
Chip multiprocessors (CMPs) substantially increase capacity pressure on the on-chip memory hierarchy while requiring fast access. Neither private nor shared caches can provide both large capacity and fast access in CMPs. We observe that compared to symmetric multiprocessors (SMPs), CMPs change the latency-capacity tradeoff in two significant ways. We propose three novel ideas to exploit the changes: (1) Though placing copies close to requestors allows fast access for read-only sharing, the copies also reduce the already-limited on-chip capacity in CMPs. We propose controlled replication to reduce capacity pressure by not making extra copies in some cases, and obtaining the data from an existing on-chip copy. This option is not suitable for SMPs because obtaining data from another processor is expensive and capacity is not limited to on-chip storage. (2) Unlike SMPs, CMPs allow fast on-chip communication between processors for read-write sharing. Instead of incurring slow access to read-write shared data through coherence misses as do SMPs, we propose in-situ communication to provide fast access without making copies or incurring coherence misses. (3) Accessing neighbors’ caches is not as expensive in CMPs as it is in SMPs. We propose capacity stealing in which private data that exceeds a core’s capacity is placed in a neighboring cache with less capacity demand. To incorporate our ideas, we use a hybrid of private, per-processor tag arrays and a shared data array. Because the shared data array is slow, we employ non-uniform access and distance associativity from previous proposals to hold frequently-accessed data in regions close to the requestor. We extend the previously-proposed Non-uniform access with Replacement And Placement usIng Distance associativity (NuRAPID) to CMPs, and call our cache CMP-NuRAPID. Our results show that for a 4-core CMP with 8 MB cache, CMP-NuRAPID improves performance by 13% over a shared cache and 8% over private caches for three commercial multithreaded workloads.
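Controlled replication, the first of the three ideas above, can be sketched behaviorally. The threshold and structures below are illustrative assumptions; the real mechanism lives in the coherence protocol.

```python
# Sketch of controlled replication from the abstract: the first remote
# hit is served from the existing on-chip copy without replicating;
# a local copy is made only once reuse justifies spending capacity.
REPLICATE_AFTER = 2          # remote hits before making a local copy

class ControlledReplicationCache:
    def __init__(self):
        self.local = {}          # block -> data held in this tile
        self.remote_hits = {}    # block -> remote-hit count

    def read(self, block, on_chip_copy):
        if block in self.local:
            return self.local[block], "local hit"
        hits = self.remote_hits.get(block, 0) + 1
        self.remote_hits[block] = hits
        if hits >= REPLICATE_AFTER:
            self.local[block] = on_chip_copy   # now worth replicating
            return on_chip_copy, "remote hit, replicated"
        return on_chip_copy, "remote hit, no copy (saves capacity)"

c = ControlledReplicationCache()
print(c.read("A", 42)[1])   # remote hit, no copy (saves capacity)
print(c.read("A", 42)[1])   # remote hit, replicated
print(c.read("A", 42)[1])   # local hit
```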

Citation Context

... [6]. We do not evaluate CMP-DNUCA from [6], because [6] shows realistic CMP-DNUCA to perform worse than CMPSNUCA. We model both the bandwidth and latency of on-chip caches carefully. We modify Cacti [22] version 3.2 to derive the access times and wire delays for our conventional caches and for each d-group in CMP-NuRAPID. Because Cacti is not generally used for monolithic large caches (e.g., greater ...

Distance associativity for high-performance energy-efficient non-uniform cache architectures

by Zeshan Chishti, Michael D. Powell, T. N. Vijaykumar - In IEEE/ACM International Symposium on Microarchitecture, 2003
"... Wire delays continue to grow as the dominant component oflatency for large caches.A recent work proposed an adaptive,non-uniform cache architecture (NUCA) to manage large, on-chipcaches.By exploiting the variation in access time acrosswidely-spaced subarrays, NUCA allows fast access to closesubarray ..."
Abstract - Cited by 98 (1 self)
Wire delays continue to grow as the dominant component of latency for large caches. A recent work proposed an adaptive, non-uniform cache architecture (NUCA) to manage large, on-chip caches. By exploiting the variation in access time across widely-spaced subarrays, NUCA allows fast access to close subarrays while retaining slow access to far subarrays. While the idea of NUCA is attractive, NUCA does not employ design choices commonly used in large caches, such as sequential tag-data access for low power. Moreover, NUCA couples data placement with tag placement, foregoing the flexibility of data placement and replacement that is possible in a non-uniform access cache. Consequently, NUCA can place only a few blocks within a given cache set in the fastest subarrays, and must employ a high-bandwidth switched network to swap blocks within the cache for high performance. In this paper, we propose the Non-uniform access with Replacement And Placement usIng Distance associativity cache, or NuRAPID, which leverages sequential tag-data access to decouple data placement from tag placement. Distance associativity, the placement of data at a certain distance (and latency), is separated from set associativity, the placement of tags within a set. This decoupling enables NuRAPID to place flexibly the vast majority of frequently-accessed data in the fastest subarrays, with fewer swaps than NUCA. Distance associativity fundamentally changes the trade-offs made by NUCA's best-performing design, resulting in higher performance and substantially lower cache energy. A one-ported, non-banked NuRAPID cache improves performance by 3% on average and up to 15% compared to a multi-banked NUCA with an infinite-bandwidth switched network, while reducing L2 cache energy by 77%.
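The decoupling NuRAPID leverages can be sketched behaviorally: tags are set-indexed as usual, but each tag entry carries a forward pointer into one of several distance groups (d-groups) holding the data. The latencies and structure names below are illustrative assumptions, not the paper's hardware.

```python
# Behavioral sketch of NuRAPID-style distance associativity: sequential
# tag-data access lets the tag lookup return a pointer, so data can
# live in any d-group regardless of which set its tag occupies.
DGROUP_LATENCY = [4, 8, 12, 18]   # cycles, nearest to farthest (assumed)

tags = {}                                  # (set_index, tag) -> (d_group, frame)
data = [dict() for _ in DGROUP_LATENCY]    # one frame table per d-group

def install(set_index, tag, value, d_group, frame):
    tags[(set_index, tag)] = (d_group, frame)
    data[d_group][frame] = value

def lookup(set_index, tag):
    ptr = tags.get((set_index, tag))
    if ptr is None:
        return None, None                  # miss
    d_group, frame = ptr                   # tag hit first, then data access
    return data[d_group][frame], DGROUP_LATENCY[d_group]

install(set_index=5, tag=0xABC, value="blk", d_group=0, frame=17)
print(lookup(5, 0xABC))   # ('blk', 4): hot data placed in the nearest d-group
```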