Results 1 - 10 of 21
Vantage: Scalable and Efficient Fine-Grain Cache Partitioning
- In Proc. of the International Symposium on Computer Architecture (ISCA), San Jose, CA, 2011
Cited by 35 (11 self)
Shared caches are pervasive in chip multiprocessors (CMPs). In particular, CMPs almost always feature a large, fully shared last-level cache (LLC) to mitigate the high latency, high energy, and limited bandwidth of main memory. A shared LLC has several advantages over multiple, private LLCs: it increases cache utilization, accelerates intercore communication (which happens through the cache instead of main memory), and reduces the cost of coherence (because only non-fully-shared caches must be kept coherent). Unfortunately, these advantages come at a significant cost. When multiple applications share the CMP, they …
The ZCache: Decoupling Ways and Associativity
- In Proc. of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2010
Cited by 23 (3 self)
The ever-increasing importance of main memory latency and bandwidth is pushing CMPs towards caches with higher capacity and associativity. Associativity is typically improved by increasing the number of ways. This reduces conflict misses, but increases hit latency and energy, imposing a stringent trade-off on cache design. We present the zcache, a cache design that allows much higher associativity than the number of physical ways (e.g. a 64-associative cache with 4 ways). The zcache draws on previous research on skew-associative caches and cuckoo hashing. Hits, the common case, require a single lookup, incurring the latency and energy costs of a cache with a very low number of ways. On a miss, additional tag lookups happen off the critical path, yielding an arbitrarily large number of replacement candidates for the incoming block. Unlike conventional designs, the zcache provides associativity by increasing the number of replacement candidates, but not the number of cache ways. To understand the implications of this approach, we develop a general analysis framework that allows us to compare associativity across different cache designs (e.g. a set-associative cache and a zcache) by representing associativity as a probability distribution. We use this framework to show that for zcaches, associativity depends only on the number of replacement candidates, and is independent of other factors (such as the number of cache ways or the workload). We also show that, for the same number of replacement candidates, the associativity of a zcache is superior to that of a set-associative cache for most workloads. Finally, we perform detailed simulations of multithreaded and multiprogrammed workloads on a large-scale CMP with a zcache as the last-level cache. We show that zcaches provide higher performance and better energy efficiency than conventional caches without incurring the overheads of designs with a large number of ways.
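The candidate-expansion idea in this abstract can be sketched in a few lines. The sketch below is a toy model under stated assumptions — the per-way hash mixing, the parameters, and the breadth-first candidate walk are illustrative, not the hardware design — but it shows how an array with only 4 ways can expose more than 4 replacement candidates once the alternative locations of resident blocks are followed:

```python
import random

class ZCacheSketch:
    """Toy zcache model: associativity comes from replacement candidates,
    not from the number of ways. Parameters and hashing are illustrative."""

    def __init__(self, ways=4, sets=16, levels=2, seed=0):
        rng = random.Random(seed)
        self.ways, self.sets, self.levels = ways, sets, levels
        # one independent hash function per way (H3-style in real hardware)
        self.salts = [rng.randrange(1 << 30) for _ in range(ways)]
        self.array = [[None] * sets for _ in range(ways)]

    def _idx(self, way, addr):
        # stand-in for the per-way hardware hash function
        return ((addr * 2654435761) ^ self.salts[way]) % self.sets

    def lookup(self, addr):
        # the common-case hit: a single lookup, one location per way
        return any(self.array[w][self._idx(w, addr)] == addr
                   for w in range(self.ways))

    def candidates(self, addr):
        # off-the-critical-path walk on a miss: each resident candidate
        # block could be relocated to its positions in the other ways,
        # exposing those positions as further replacement candidates
        frontier = [(w, self._idx(w, addr)) for w in range(self.ways)]
        seen = set(frontier)
        for _ in range(self.levels - 1):
            nxt = []
            for w, s in frontier:
                blk = self.array[w][s]
                if blk is None:
                    continue
                for w2 in range(self.ways):
                    pos = (w2, self._idx(w2, blk))
                    if w2 != w and pos not in seen:
                        seen.add(pos)
                        nxt.append(pos)
            frontier = nxt
        return seen
```

Note how the hit path touches exactly one location per way; only misses pay for the deeper walk, which matches the latency/energy argument the abstract makes.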
Flexible architectural support for fine-grain scheduling
- In Proc. of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2010
Cited by 21 (2 self)
To make efficient use of CMPs with tens to hundreds of cores, it is often necessary to exploit fine-grain parallelism. However, managing tasks of a few thousand instructions is particularly challenging, as the runtime must ensure load balance without compromising locality, while introducing only small overheads. Software-only schedulers can implement various scheduling algorithms that match the characteristics of different applications and programming models, but suffer significant overheads as they synchronize and communicate task information over the deep cache hierarchy of a large-scale CMP. To reduce these costs, hardware-only schedulers like Carbon, which implement task queuing and scheduling in hardware, have been proposed. However, a hardware-only solution fixes the scheduling algorithm and leaves no room for other uses of the custom hardware. This paper presents a combined hardware-software approach to build fine-grain schedulers that retain the flexibility of software schedulers while being as fast and scalable as hardware ones. We propose asynchronous direct messages (ADM), a simple architectural extension that provides direct exchange of asynchronous, short messages between threads in the CMP without going through the memory hierarchy. ADM is sufficient to implement a family of novel, software-mostly schedulers that rely on low-overhead messaging to efficiently coordinate scheduling and transfer task information. These schedulers match and often exceed the performance and scalability of Carbon when using the same scheduling algorithm. When the ADM runtime tailors its scheduling algorithm to application characteristics, it outperforms Carbon by up to 70%.
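The message-driven scheduling idea can be sketched as a small simulation. This is a toy model in the spirit of the abstract, not the paper's design: the message kinds (`STEAL`/`TASK`), the keep-one-task rule, and the neighbor victim policy are illustrative assumptions. Idle threads request work with short, explicit messages instead of probing shared task queues through the cache hierarchy:

```python
from collections import deque

def schedule(tasks_per_thread):
    """Toy ADM-style scheduler: per-thread local queues plus explicit
    message buffers standing in for asynchronous direct messages."""
    n = len(tasks_per_thread)
    local = [deque(ts) for ts in tasks_per_thread]  # per-thread task queues
    inbox = [deque() for _ in range(n)]             # models the ADM buffers
    done = [[] for _ in range(n)]
    pending = sum(map(len, local))
    while pending:
        for t in range(n):
            while inbox[t]:                          # drain messages first
                kind, arg = inbox[t].popleft()
                if kind == "STEAL" and len(local[t]) > 1:
                    # give away one task, but keep at least one to run
                    inbox[arg].append(("TASK", local[t].pop()))
                elif kind == "TASK":
                    local[t].append(arg)
            if local[t]:
                done[t].append(local[t].popleft())   # "run" one task
                pending -= 1
            else:
                inbox[(t + 1) % n].append(("STEAL", t))  # ask a neighbor
    return done
```

Running `schedule([[1, 2, 3, 4, 5, 6], [], []])` spreads the six tasks across threads purely through the message channel, the property the abstract attributes to its software-mostly schedulers.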
Demand-driven software race detection using hardware performance counters
- In ISCA, 2011
Cited by 14 (2 self)
Dynamic data race detectors are an important mechanism for creating robust parallel programs. Software race detectors instrument the program under test, observe each memory access, and watch for inter-thread data sharing that could lead to concurrency errors. While this method of bug hunting can find races that are normally difficult to observe, it also suffers from high runtime overheads. It is not uncommon for commercial race detectors to experience 300× slowdowns, limiting their usage. This paper presents a hardware-assisted demand-driven race detector. We are able to observe cache events that are indicative of data sharing between threads by taking advantage of hardware available on modern commercial microprocessors. We use these events to build a race detector that is only enabled when it is likely that inter-thread data sharing is occurring. When little sharing takes place, this demand-driven analysis is much faster than contemporary continuous-analysis tools without a large loss of detection accuracy. We modified the race detector in Intel® Inspector XE to utilize our hardware-based sharing indicator and were able to achieve performance increases of 3× and 10× in two parallel benchmark suites and 51× for one particular program.
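The demand-driven gating described above can be sketched as follows. This is a toy model under stated assumptions: the sharing indicator (a per-line last-writer table standing in for the cache events the hardware observes) and the `WINDOW` re-arming policy are illustrative, not the paper's mechanism. The point it shows is that the expensive checker stays off during single-threaded phases:

```python
class DemandDrivenDetector:
    """Toy model: the slow race checker is enabled only on demand,
    when a hardware-visible sharing event occurs."""

    WINDOW = 4  # accesses to keep the slow path enabled after a sharing event

    def __init__(self):
        self.last_writer = {}  # cache line -> thread that last wrote it
        self.budget = 0        # remaining accesses with the checker enabled
        self.checked = 0       # accesses that reached the expensive checker

    def access(self, tid, line, is_write):
        # models the counter event: this thread touches a line that was
        # last written by a different thread (inter-thread sharing)
        if self.last_writer.get(line, tid) != tid:
            self.budget = self.WINDOW
        if self.budget:
            self.checked += 1  # stand-in for the full happens-before check
            self.budget -= 1
        if is_write:
            self.last_writer[line] = tid
```

As long as one thread owns its data, `checked` stays at zero; the first remote access re-arms the analysis, which is the source of the speedups the abstract reports.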
Borel-Wadge degrees
- Fund. Math
Cited by 8 (1 self)
Re-distributed by Stanford University under license with the author. This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
RADISH: Always-On Sound and Complete Race Detection in Software and Hardware
Cited by 6 (0 self)
Data-race freedom is a valuable safety property for multithreaded programs that helps with catching bugs, simplifying memory consistency model semantics, and verifying and enforcing both atomicity and determinism. Unfortunately, existing software-only dynamic race detectors are precise but slow; proposals with hardware support offer higher performance but are imprecise. Both precision and performance are necessary to achieve the many advantages always-on dynamic race detection could provide. To resolve this trade-off, we propose RADISH, a hybrid hardware-software dynamic race detector that is always-on and fully precise. In RADISH, hardware caches a principled subset of the metadata necessary for race detection; this subset allows the vast majority of race checks to occur completely in hardware. A flexible software layer handles persistence of race detection metadata on cache evictions and occasional queries to this expanded set of metadata. We show that RADISH is correct by proving equivalence to a conventional happens-before race detector. Our design has modest hardware complexity: caches are completely unmodified and we piggy-back on existing coherence messages but do not otherwise modify the protocol. Furthermore, RADISH can leverage type-safe languages to reduce overheads substantially. Our evaluation of a simulated 8-core RADISH processor using PARSEC benchmarks shows runtime overheads from negligible to 2×, outperforming the leading software-only race detector by 2×-37×.
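The hardware/software metadata split can be sketched like this. The names (`HybridMetadata`, `check_write`), the tiny LRU capacity, and the dictionary shapes are illustrative assumptions, not RADISH's actual structures; the happens-before comparison itself is the standard vector-clock check. A small fast table models in-cache metadata, and evictions spill to an unbounded software-managed map:

```python
from collections import OrderedDict

class HybridMetadata:
    """Toy model: per-location read/write clocks live in a small 'hardware'
    cache; on eviction they persist to a software-managed table."""

    def __init__(self, capacity=2):
        self.hw = OrderedDict()   # small and fast: models in-cache metadata
        self.sw = {}              # unbounded: software-persisted metadata
        self.capacity = capacity

    def get(self, addr):
        if addr in self.hw:
            self.hw.move_to_end(addr)      # LRU touch: check stays in "hw"
            return self.hw[addr], "hw"
        meta = self.sw.get(addr, {"w": {}, "r": {}})
        self._install(addr, meta)          # occasional software query
        return meta, "sw"

    def _install(self, addr, meta):
        self.hw[addr] = meta
        if len(self.hw) > self.capacity:
            victim, vmeta = self.hw.popitem(last=False)
            self.sw[victim] = vmeta        # evict metadata to software

def check_write(md, addr, tid, clock):
    """Standard happens-before write check against the cached metadata.
    `clock` is the writer's vector clock (thread id -> timestamp)."""
    meta, _ = md.get(addr)
    race = any(t != tid and c > clock.get(t, 0)
               for prev in (meta["w"], meta["r"])
               for t, c in prev.items())
    meta["w"][tid] = clock[tid]
    return race
```

Hot locations resolve entirely in the fast table, which is the abstract's claim that "the vast majority of race checks occur completely in hardware"; only evicted metadata needs the software layer.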
Virtues and Obstacles of Hardware-Assisted Multi-Processor Execution Replay
- In HotPAR, 2010
Cited by 4 (3 self)
The trend towards multi-core processors has spurred a renewed interest in developing parallel programs. Parallel applications exhibit non-deterministic behavior across runs, primarily due to the changing interleaving of shared-memory accesses from run to run. The non-deterministic nature of these programs leads to many problems, which will be described in this paper. In order to get a handle on the non-determinism, techniques to record and replay these programs have been proposed. These allow capturing the non-determinism and repeating the execution of a program, enabling better understanding and eliminating problems due to non-determinism. In this paper, we argue that an efficient record and replay (R&R) system must be assisted by hardware support. However, adding support for R&R in hardware requires strong justification, in terms of added value to the processor. Up until now, debugging has been the primary motivation used by previous work. We argue that stronger usage models are required to justify the cost of adding hardware support for R&R. Hence the first contribution of this paper is a set of additional usage models (the virtues of R&R) and a discussion of how they will add value to future microprocessors. We think the current lack of focus on wider and more applicable usage models is an obstacle to adoption. The other obstacle to the implementation of hardware-assisted R&R is that some complexities of such an endeavor have not been addressed and studied properly. Current research approaches have shortcomings that prevent them from becoming an implementable feature. In this paper, we also identify those shortcomings and indicate research directions to bridge the gap between the research and a product implementation.
Accelerating Data Race Detection with Minimal Hardware Support
Cited by 2 (1 self)
We propose a high performance hybrid hardware/software solution to race detection that uses minimal hardware support. This hardware extension consists of a single extra instruction, StateChk, that simply returns the coherence state of a cache block without requiring any complex traps to handlers. To leverage this support, we propose a new algorithm for race detection. This detection algorithm uses StateChk to eliminate many expensive operations. We also propose a new execution schedule manipulation heuristic to achieve high coverage rapidly. This approach is capable of detecting virtually all data races detected by a traditional happens-before data race detection approach, but at significantly lower space and performance overhead.
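How a coherence-state query can prune race checks is worth making concrete. `StateChk` is the instruction the abstract proposes; the tiny MESI model around it and the specific filtering rule below are illustrative assumptions of ours, not the paper's algorithm. The key observation: if this core's copy is Modified or Exclusive, no other thread can have touched the line since the last local access, so the expensive check can be skipped:

```python
class CoherenceModel:
    """Tiny MESI sketch. StateChk returns the local coherence state;
    the surrounding model is illustrative."""

    def __init__(self):
        self.state = {}  # (core, line) -> 'M' | 'E' | 'S' | 'I'

    def statechk(self, core, line):
        return self.state.get((core, line), "I")

    def write(self, core, line):
        for key in list(self.state):
            if key[1] == line and key[0] != core:
                self.state[key] = "I"          # invalidate remote copies
        self.state[(core, line)] = "M"

    def read(self, core, line):
        sharers = [k for k in self.state
                   if k[1] == line and k[0] != core and self.state[k] != "I"]
        for k in sharers:
            self.state[k] = "S"                # downgrade remote copies
        self.state[(core, line)] = "S" if sharers else "E"

def needs_check(model, core, line):
    # Modified/Exclusive: no intervening remote access was possible,
    # so the expensive race-detection work can be elided for this access.
    return model.statechk(core, line) not in ("M", "E")
```

A thread looping over private data sees `M`/`E` and skips every check; only lines that have actually been shared fall through to the slow path.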
Time-Ordered Event Traces: A New Debugging Primitive for Concurrency Bugs
Cited by 2 (0 self)
Non-determinism makes concurrent bugs extremely difficult to reproduce and to debug. In this work, we propose a new debugging primitive to facilitate the debugging process by exposing this non-deterministic behavior to the programmer. The key idea is to generate a time-ordered trace of events such as function calls/returns and memory accesses across different threads. The architectural support for this primitive is lightweight, including a high-precision, frequency-invariant time-stamp counter and an event trace buffer in each processor core. The proposed primitive continuously records and timestamps the last N function calls/returns per core by default, and can also be configured to watch specific memory addresses or code regions through a flexible software interface. To examine the effectiveness of the proposed primitive, we studied a variety of concurrent bugs in large commercial software, and our results show that exposing the time-ordered information, function calls/returns in particular, to the programmer is highly beneficial for diagnosing the root causes of these bugs.
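The primitive's structure — per-core ring buffers stamped by one shared counter, merged into a single time-ordered trace — can be sketched directly. The class and method names are illustrative assumptions; the per-core `deque(maxlen=N)` models the bounded event trace buffer, and a shared monotonic counter models the frequency-invariant timestamp:

```python
from collections import deque
from itertools import count

class EventTraceBuffer:
    """Toy model of the proposed primitive: per-core ring buffers of the
    last N events, stamped with one shared frequency-invariant counter."""

    def __init__(self, ncores, depth=16):
        self.clock = count()  # models the high-precision timestamp counter
        self.bufs = [deque(maxlen=depth) for _ in range(ncores)]

    def record(self, core, event):
        # hardware appends; the ring buffer keeps only the newest `depth`
        self.bufs[core].append((next(self.clock), core, event))

    def merged_trace(self):
        # the debugger merges per-core buffers into one time-ordered trace
        return sorted(e for buf in self.bufs for e in buf)
```

Because every core draws timestamps from the same counter, the merged trace recovers the actual interleaving of calls/returns across threads, which is exactly the non-determinism the abstract wants to expose.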
Leveraging the Short-Term Memory of Hardware to Diagnose Production-Run Software Failures
Cited by 2 (0 self)
Failures caused by software bugs are widespread in production runs, causing severe losses for end users. Unfortunately, diagnosing production-run failures is challenging. Existing work cannot satisfy privacy, run-time overhead, diagnosis capability, and diagnosis latency requirements all at once. This paper designs a low-overhead, low-latency, privacy-preserving production-run failure diagnosis system based on two observations. First, a short-term memory of program execution is often sufficient for failure diagnosis, as many bugs have short propagation distances. Second, maintaining a short-term memory of execution is much cheaper than maintaining a record of the whole execution. Following these observations, we first identify an existing hardware unit, the Last Branch Record (LBR), that records the last few taken branches to help diagnose sequential bugs. We then propose a simple hardware extension, the Last Cache-coherence Record (LCR), to record the last few cache accesses with specified coherence states and hence help diagnose concurrency bugs. Finally, we design LBRA and LCRA to automatically locate failure root causes using LBR and LCR. Our evaluation uses 31 real-world sequential and concurrency bug failures from 18 representative open-source software projects. The results show that with just 16 record entries, LBR and LCR enable our system to automatically locate the root causes for 27 out of 31 failures, with less than 3% run-time overhead. As our system does not rely on sampling, it also provides good diagnosis latency.
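The "short-term memory" idea can be illustrated with a small sketch. The LBR depth of 16 matches the abstract's evaluation; everything else here is a hypothetical stand-in — in particular, the naive failing-vs-passing scoring below is our assumption for illustration, not the paper's LBRA algorithm:

```python
from collections import Counter, deque

LBR_DEPTH = 16  # the abstract's evaluation uses 16 record entries

def lbr_record(branch_stream):
    """Model the Last Branch Record: only the newest 16 taken branches survive."""
    buf = deque(maxlen=LBR_DEPTH)
    for b in branch_stream:
        buf.append(b)
    return list(buf)

def rank_root_causes(failing_lbrs, passing_lbrs):
    # Naive scoring (an illustrative assumption): branches present in the
    # failing runs' short-term memory but rare in passing runs' are ranked
    # as likelier root-cause candidates.
    fail = Counter(b for lbr in failing_lbrs for b in set(lbr))
    ok = Counter(b for lbr in passing_lbrs for b in set(lbr))
    return sorted(fail, key=lambda b: ok[b] - fail[b])
```

The ring buffer captures exactly the short propagation window the observations rely on: branches taken long before the failure fall out of the record and never burden the diagnosis.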