Microarchitecture optimizations for exploiting memory-level parallelism, in ISCA, 2004

by Y. Chou

Results 1 - 10 of 97 citing documents

Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors

by Onur Mutlu, et al.
"... DRAM memory is a major resource shared among cores in a chip multiprocessor (CMP) system. Memory requests from different threads can interfere with each other. Existing memory access scheduling techniques try to optimize the overall data throughput obtained from the DRAM and thus do not take into ac ..."
Abstract - Cited by 139 (50 self)
DRAM memory is a major resource shared among cores in a chip multiprocessor (CMP) system. Memory requests from different threads can interfere with each other. Existing memory access scheduling techniques try to optimize the overall data throughput obtained from the DRAM and thus do not take into account inter-thread interference. Therefore, different threads running together on the same chip can experience extremely different memory system performance: one thread can experience a severe slowdown or starvation while another is unfairly prioritized by the memory scheduler. This paper proposes a new memory access scheduler, called the Stall-Time Fair Memory scheduler (STFM), that provides quality of service to different threads sharing the DRAM memory system. The goal of the proposed scheduler is to “equalize” the DRAM-related slowdown experienced by each thread due to interference from other threads, without hurting overall system performance. As such, STFM takes into account inherent memory characteristics of each thread and does not unfairly penalize threads that use the DRAM system without interfering with other threads. We show that STFM significantly reduces the unfairness in the DRAM system while also improving system throughput (i.e., weighted speedup of threads) on a wide variety of workloads and systems. For example, averaged over 32 different workloads running on an 8-core CMP, the ratio between the highest DRAM-related slowdown and the lowest DRAM-related slowdown reduces from 5.26X to 1.4X, while the average system throughput improves by 7.6%. We qualitatively and quantitatively compare STFM to one new and three previously proposed memory access scheduling algorithms, including network fair queueing. Our results show that STFM provides the best fairness, system throughput, and scalability.
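The scheduling rule the abstract describes can be illustrated with a short sketch. The Python fragment below is a simplified illustration of stall-time fair scheduling, not the paper's implementation: it assumes per-thread estimates of shared and alone stall times are available, computes slowdowns, and falls back to a baseline policy when unfairness stays within a threshold; the names and the threshold value are illustrative.

```python
# Simplified sketch of stall-time fair scheduling (illustrative, not the
# authors' implementation). Each thread has an estimated memory-related
# stall time when running shared (st_shared) and an estimated stall time
# it would have had running alone (st_alone).

def pick_thread(threads, alpha=1.1):
    """Return the id of the thread whose request should be scheduled next.

    threads: dict mapping thread_id -> (st_shared, st_alone, has_ready_request)
    alpha:   unfairness threshold (hypothetical value).
    """
    ready = {tid: v for tid, v in threads.items() if v[2]}
    if not ready:
        return None

    # DRAM-related slowdown of each thread that has a ready request.
    slowdown = {tid: st_shared / max(st_alone, 1)
                for tid, (st_shared, st_alone, _) in ready.items()}

    unfairness = max(slowdown.values()) / max(min(slowdown.values()), 1e-9)
    if unfairness > alpha:
        # Prioritize the most-slowed-down thread to equalize slowdowns.
        return max(slowdown, key=slowdown.get)
    # Otherwise defer to the baseline throughput-oriented policy
    # (row-hit-first / oldest-first), not modeled here.
    return min(ready)  # placeholder for the baseline choice

# Example: thread 1 is heavily slowed down relative to thread 0.
print(pick_thread({0: (100, 90, True), 1: (400, 100, True)}))  # -> 1
```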

Citation Context

...o be serviced and therefore experience an increased stall-time. However, increasing TInterference of these threads by the service latency of R is too simplistic as it ignores memory-level parallelism [8, 2] of threads. This is best illustrated with an example. Assume two requests R1 and R2 are simultaneously being serviced in two different banks. Assume further that another thread C′ has ready requests...

Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems

by Onur Mutlu, et al., 2008
"... In a chip-multiprocessor (CMP) system, the DRAM system is shared among cores. In a shared DRAM system, requests from a thread can not only delay requests from other threads by causing bank/bus/row-buffer conflicts but they can also destroy other threads’ DRAM-bank-level parallelism. Requests whose l ..."
Abstract - Cited by 136 (48 self)
In a chip-multiprocessor (CMP) system, the DRAM system is shared among cores. In a shared DRAM system, requests from a thread can not only delay requests from other threads by causing bank/bus/row-buffer conflicts but they can also destroy other threads’ DRAM-bank-level parallelism. Requests whose latencies would otherwise have been overlapped could effectively become serialized. As a result both fairness and system throughput degrade, and some threads can starve for long time periods. This paper proposes a fundamentally new approach to designing a shared DRAM controller that provides quality of service to threads, while also improving system throughput. Our parallelism-aware batch scheduler (PAR-BS) design is based on two key ideas. First, PAR-BS processes DRAM requests in batches to provide fairness and to avoid starvation of requests. Second, to optimize system throughput, PAR-BS employs a parallelism-aware DRAM scheduling policy that aims to process requests from a thread in parallel in the DRAM banks, thereby reducing the memory-related stall-time experienced by the thread. PAR-BS seamlessly incorporates support for system-level thread priorities and can provide different service levels, including purely opportunistic service, to threads with different priorities. We evaluate the design trade-offs involved in PAR-BS and compare it to four previously proposed DRAM scheduler designs on 4-, 8-, and 16-core systems. Our evaluations show that, averaged over 100 4-core workloads, PAR-BS improves fairness by 1.11X and system throughput by 8.3% compared to the best previous scheduling technique, Stall-Time Fair Memory (STFM) scheduling. Based on simple request prioritization rules, PAR-BS is also simpler to implement than STFM.
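The batching-plus-ranking idea described in the abstract can be sketched briefly. The ranking heuristic below (lighter-loaded threads first, so a thread's batched requests complete together across banks) is a plausible shortest-job-first-style choice for illustration, not necessarily the exact PAR-BS rule; the data layout is assumed.

```python
# Minimal sketch of parallelism-aware batch scheduling: requests are grouped
# into batches (batched requests are serviced before newer ones, which bounds
# starvation), and within a batch threads are ranked so that each thread's
# requests tend to be serviced in parallel across banks.

from collections import defaultdict

def form_batch(queued):           # queued: list of (thread_id, bank)
    return list(queued)           # mark all currently queued requests

def rank_threads(batch):
    per_bank = defaultdict(lambda: defaultdict(int))
    total = defaultdict(int)
    for tid, bank in batch:
        per_bank[tid][bank] += 1
        total[tid] += 1
    # Threads with a lighter load (smallest maximum per-bank load, then
    # smallest total) are ranked higher, so they finish their batch quickly.
    return sorted(total, key=lambda t: (max(per_bank[t].values()), total[t]))

batch = form_batch([(0, 'B0'), (0, 'B1'), (1, 'B0'), (1, 'B0'), (1, 'B2')])
print(rank_threads(batch))  # -> [0, 1]: thread 0 ranked ahead of thread 1
```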

Citation Context

... of requests being serviced in the DRAM banks when there is at least one request being serviced in the DRAM banks. This definition follows the memory-level parallelism (MLP) definition of Chou et al. [2]. We characterize a thread based on the average stall time per DRAM request (AST/req) metric, which is computed by dividing the number of cycles in which the thread cannot commit instructions because ...
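The MLP definition quoted in this snippet (the average number of requests in service over the cycles in which at least one request is in service) is straightforward to compute from a per-cycle trace. The sketch below is illustrative; the trace format is assumed.

```python
# Computing MLP as defined in the citation context: the average number of
# outstanding DRAM requests over the cycles in which at least one request
# is outstanding. Input is a per-cycle count of in-flight requests.

def mlp(outstanding_per_cycle):
    busy = [n for n in outstanding_per_cycle if n > 0]
    return sum(busy) / len(busy) if busy else 0.0

# 4 busy cycles with 2, 2, 1, 3 requests outstanding, plus 2 idle cycles.
print(mlp([2, 2, 0, 1, 3, 0]))  # -> 2.0
```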

A case for MLP-aware cache replacement

by Moinuddin K. Qureshi, Daniel N. Lynch, Onur Mutlu, Yale N. Patt - In ISCA, 2006
"... Performance loss due to long-latency memory accesses can be reduced by servicing multiple memory accesses concurrently. The notion of generating and servicing long-latency cache misses in parallel is called Memory Level Parallelism (MLP). MLP is not uniform across cache misses – some misses occur in ..."
Abstract - Cited by 78 (14 self)
Performance loss due to long-latency memory accesses can be reduced by servicing multiple memory accesses concurrently. The notion of generating and servicing long-latency cache misses in parallel is called Memory Level Parallelism (MLP). MLP is not uniform across cache misses – some misses occur in isolation while some occur in parallel with other misses. Isolated misses are more costly on performance than parallel misses. However, traditional cache replacement is not aware of the MLP-dependent cost differential between different misses. Cache replacement, if made MLP-aware, can improve performance by reducing the number of performance-critical isolated misses. This paper makes two key contributions. First, it proposes a framework for MLP-aware cache replacement by using a run-time technique to compute the MLP-based cost for each cache miss. It then describes a simple cache replacement mechanism that takes both MLP-based cost and recency into account. Second, it proposes a novel, low-hardware overhead mechanism called Sampling Based Adaptive Replacement (SBAR), to dynamically choose between an MLP-aware and a traditional replacement policy, depending on which one is more effective at reducing the number of memory-related stalls. Evaluations with the SPEC CPU2000 benchmarks show that MLP-aware cache replacement can improve performance by as much as 23%.
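The MLP-based cost idea from the abstract can be sketched as follows. The way service cycles are split among concurrent misses and the recency/cost weighting below are assumptions made for illustration, not the paper's exact mechanism.

```python
# Sketch of MLP-aware cost accounting: a miss serviced in parallel with
# others is cheaper than an isolated miss, so each cycle of miss service is
# split among the concurrently outstanding misses, and victim selection
# biases against evicting blocks whose misses were costly (isolated).

def mlp_costs(events):
    """events: per-cycle list of the set of block addresses currently missing."""
    cost = {}
    for outstanding in events:
        for blk in outstanding:
            cost[blk] = cost.get(blk, 0.0) + 1.0 / len(outstanding)
    return cost

def pick_victim(candidates, lam=1.0):
    """candidates: list of (block, recency_rank, mlp_cost); a higher recency
    rank means older. Evict the block with the lowest combined value."""
    return min(candidates, key=lambda c: -c[1] + lam * c[2])[0]

# Block A missed in isolation (cost 3.0); B and C overlapped (cost 1.5 each).
print(mlp_costs([{'A'}, {'A'}, {'A'}, {'B', 'C'}, {'B', 'C'}, {'B', 'C'}]))
# An old block with a cheap (parallel) miss is the preferred victim.
print(pick_victim([('A', 3, 3.0), ('B', 2, 1.5), ('D', 5, 0.5)]))  # -> 'D'
```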

Citation Context

...ility to increase MLP is limited by the instruction window size. Several proposals [15], [1], [4], [25] have looked at the problem of scaling the instruction window for out-of-order processors. Chou et al. [3] analyzed the effectiveness of different microarchitectural techniques such as out-of-order execution, value prediction, and runahead execution on increasing MLP. They concluded that microarchitecture...

Chip Multithreading: Opportunities and Challenges

by Lawrence Spracklen, Santosh G. Abraham - In Proceedings of the 11th International Conference on High-Performance Computer Architecture (HPCA-11), 2005
"... Chip Multi-Threaded (CMT) processors provide support for many simultaneous hardware threads of execution in various ways, including Simultaneous Multithreading (SMT) and Chip Multiprocessing (CMP). CMT processors are especially suited to server workloads, which generally have high levels of Thread-L ..."
Abstract - Cited by 57 (1 self)
Chip Multi-Threaded (CMT) processors provide support for many simultaneous hardware threads of execution in various ways, including Simultaneous Multithreading (SMT) and Chip Multiprocessing (CMP). CMT processors are especially suited to server workloads, which generally have high levels of Thread-Level Parallelism (TLP). In this paper, we describe the evolution of CMT chips in industry and highlight the pervasiveness of CMT designs in upcoming general-purpose processors. The CMT design space accommodates a range of designs between the extremes represented by the SMT and CMP designs, and a variety of attractive design options are currently unexplored. Though there has been extensive research on utilizing multiple hardware threads to speed up single-threaded applications via speculative parallelization, there are many challenges in designing CMT processors, even when sufficient TLP is present. This paper describes some of these challenges, including hot sets, hot banks, speculative prefetching strategies, request prioritization and off-chip bandwidth reduction.

Citation Context

...ocessor chip needs to sustain 7800/64 or over 100 parallel requests. A single strand on an aggressive out-of-order processor core generates less than two parallel requests on typical server workloads [4]: therefore, a large number of strands are required to sustain a high utilization of the memory ports. Finally, power considerations also favor CMT processors. Given the almost cubic dependence betwee...
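The figures quoted in this snippet imply the following back-of-the-envelope arithmetic; the per-strand bound is the one stated in the text, and the conclusion (on the order of sixty-plus strands) is only an illustration of the quoted numbers.

```python
# Back-of-the-envelope check of the figures in the citation context:
# sustaining 7800/64 (~122) parallel memory requests with fewer than two
# parallel requests per strand implies on the order of 60+ hardware strands.
required = 7800 / 64          # parallel requests the chip must sustain
per_strand = 2                # upper bound per strand quoted in the text
print(required, required / per_strand)  # ~121.9 requests, ~61 strands
```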

Exploiting platform heterogeneity for power efficient data centers

by Ripal Nathuji - In Proceedings of the IEEE International Conference on Autonomic Computing (ICAC), 2007
"... It has recently become clear that power management is of critical importance in modern enterprise computing environments. The traditional drive for higher performance has influenced trends towards consolidation and higher densities, artifacts enabled by virtualization and new small form factor serve ..."
Abstract - Cited by 57 (4 self)
It has recently become clear that power management is of critical importance in modern enterprise computing environments. The traditional drive for higher performance has influenced trends towards consolidation and higher densities, artifacts enabled by virtualization and new small form factor server blades. The resulting effect has been increased power and cooling requirements in data centers which elevate ownership costs and put more pressure on rack and enclosure densities. To address these issues, in this paper, we enable power-efficient management of enterprise workloads by exploiting a fundamental characteristic of data centers: “platform heterogeneity”. This heterogeneity stems from the architectural and management-capability variations of the underlying platforms. We define an intelligent workload allocation method that leverages heterogeneity characteristics and efficiently maps workloads to the best fitting platforms, significantly improving the power efficiency of the whole data center. We perform this allocation by employing a novel analytical prediction layer that accurately predicts workload power/performance across different platform architectures and power management capabilities. This prediction infrastructure relies upon platform and workload descriptors that we define as part of our work. Our allocation scheme achieves on average 20% improvements in power efficiency for representative heterogeneous data center configurations, highlighting the significant potential of heterogeneity-aware management.

Citation Context

.... CPU cycles represent the execution with a perfect last-level cache (LLC), while memory cycles capture the finite cache effects. This model is similar to the “overlap model” described by Chou et al. [5]. With the BF model, the CPI of a workload can be represented as in Equation 1. Here CPICORE represents the CPI with a perfect LLC. This term is independent from the underlying memory subsystem. CPIME...
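The additive decomposition this snippet refers to can be written out explicitly. The sketch below assumes the usual form of such a model (a core term measured with a perfect last-level cache plus a memory term built from miss rate, memory latency, and an overlap/MLP factor); the exact terms of the paper's Equation 1 are not visible in the snippet.

```python
# Illustrative additive CPI model of the kind the snippet describes:
# CPI = CPI_core + CPI_mem, where CPI_core is measured with a perfect LLC
# and CPI_mem captures finite-cache effects. The memory term below (misses
# per instruction times memory latency, discounted by an overlap/MLP
# factor) is an assumed form, not the paper's Equation 1.

def cpi(cpi_core, misses_per_instr, mem_latency_cycles, mlp=1.0):
    cpi_mem = misses_per_instr * mem_latency_cycles / mlp
    return cpi_core + cpi_mem

# Example: CPI_core = 0.8, 5 LLC misses per 1000 instructions,
# 200-cycle memory latency, misses overlapped 2-wide on average.
print(cpi(0.8, 0.005, 200, mlp=2.0))  # -> 1.3
```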

Dual-core execution: building a highly scalable single-thread instruction window

by Huiyang Zhou, 2005
"... Current integration trends embrace the prosperity of single-chip multi-core processors. Although multi-core processors deliver significantly improved system throughput, single-thread performance is not addressed. In this paper, we propose a new execution paradigm that utilizes multi-cores on a singl ..."
Abstract - Cited by 54 (3 self)
Current integration trends embrace the prosperity of single-chip multi-core processors. Although multi-core processors deliver significantly improved system throughput, single-thread performance is not addressed. In this paper, we propose a new execution paradigm that utilizes multi-cores on a single chip collaboratively to achieve high performance for single-thread memory-intensive workloads while maintaining the flexibility to support multithreaded applications. The proposed execution paradigm, dual-core execution, consists of two superscalar cores (a front and back processor) coupled with a queue. The front processor fetches and preprocesses instruction streams and retires processed instructions into the queue for the back processor to consume. The front processor executes instructions as usual except for cache-missing loads, which produce an invalid value instead of blocking the pipeline. As a result, the front processor runs far ahead to warm up the data caches and fix branch mispredictions for the back processor. In-flight instructions are distributed in the front processor, the queue, and the back processor, forming a very large instruction window for single-thread out-of-order execution. The proposed architecture incurs only minor hardware changes and does not require any large centralized structures such as large register files, issue queues, load/store queues, or reorder buffers. Experimental results show remarkable latency hiding capabilities of the proposed architecture, even outperforming more complex single-thread processors with much larger instruction windows than the front or back processor.
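The key mechanism described in the abstract, a front processor that replaces the result of a cache-missing load with an invalid value and keeps running, can be sketched as a tiny dataflow simulation. The instruction encoding and the INV sentinel below are made up for illustration.

```python
# Illustrative sketch of the front processor's behavior: a load that misses
# the cache produces an invalid (INV) value instead of blocking, and INV
# propagates to dependent instructions, so the front core runs ahead to warm
# caches and resolve branches for the back core.

INV = object()  # sentinel for "invalid value"

def front_execute(program, cache_hits):
    """program: list of (dest, op, sources); a 'load' source names an address."""
    regs = {}
    for dest, op, srcs in program:
        if op == 'load':
            addr = srcs[0]
            regs[dest] = f'mem[{addr}]' if addr in cache_hits else INV
        else:  # simple ALU op: result is INV if any source is INV
            vals = [regs.get(s, INV) for s in srcs]
            regs[dest] = INV if any(v is INV for v in vals) else f'{op}({",".join(vals)})'
    return regs

prog = [('r1', 'load', ['A']),          # hits: real value
        ('r2', 'load', ['B']),          # misses: INV, does not stall
        ('r3', 'add',  ['r1', 'r2'])]   # depends on INV -> INV
print(front_execute(prog, cache_hits={'A'}))
```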

Citation Context

...ty. Our results confirm such observations from a different perspective (see Section 5.3). It has been proposed to use value prediction to further improve the effectiveness of run-ahead execution [7], [8], [19], [42]. Similarly, DCE can also benefit from such optimizations and achieve higher performance. 2.2. DCE and leader/follower architectures Running a program on two processors, one leading and th...

Mechanisms for Store-wait-free Multiprocessors

by Thomas F. Wenisch, Anastasia Ailamaki, Babak Falsafi, Andreas Moshovos - In Proceedings of the 34th Annual International Symposium on Computer Architecture, 2007
"... Store misses cause significant delays in shared-memory multiprocessors because of limited store buffering and ordering constraints required for proper synchronization. Today, programmers must choose from a spectrum of memory consistency models that reduce store stalls at the cost of increased progra ..."
Abstract - Cited by 43 (7 self)
Store misses cause significant delays in shared-memory multiprocessors because of limited store buffering and ordering constraints required for proper synchronization. Today, programmers must choose from a spectrum of memory consistency models that reduce store stalls at the cost of increased programming complexity. Prior research suggests that the performance gap among consistency models can be closed through speculation—enforcing order only when dynamically necessary. Unfortunately, past designs either provide insufficient buffering, replace all stores with read-modify-write operations, and/or recover from ordering violations via impractical fine-grained rollback mechanisms. We propose two mechanisms that, together, enable store-wait-free implementations of any memory consistency model. To eliminate buffer-capacity-related stalls, we propose the scalable store buffer, which places private/speculative values directly into the L1 cache, thereby eliminating the non-scalable associative search of conventional store buffers. To eliminate ordering-related stalls, we propose atomic sequence ordering, which enforces ordering constraints over coarse-grain access sequences while relaxing order among individual accesses. Using cycle-accurate full-system simulation of scientific and commercial applications, we demonstrate that these mechanisms allow the simplified programming of strict ordering while outperforming conventional implementations on average by 32% (sequential consistency), 22% (SPARC total store order) and 9% (SPARC relaxed memory order).

Citation Context

...e, the capacity of these structures constrains maximum speculation depth. To assess ASO’s speculative-data capacity requirements, we analyze our workloads using the epoch execution model described in [6, 7]. This approach models the execution of out-of-order processors with long off-chip access latencies as a series of execution epochs. Each epoch consists of a computation period followed by a long stal...
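The epoch model this snippet describes can be approximated from a timestamped miss trace: misses issued while an epoch's first miss is still outstanding are treated as overlapped and served by the same stall. The grouping rule and latency value below are illustrative assumptions.

```python
# Rough sketch of the epoch execution model referenced in the snippet:
# execution is viewed as computation periods separated by long stalls, and
# off-chip misses issued while an epoch's first miss is still outstanding
# are considered overlapped with it (one stall serves the whole group).

def epochs(miss_issue_cycles, mem_latency=200):
    groups, current, epoch_start = [], [], None
    for t in sorted(miss_issue_cycles):
        if epoch_start is None or t >= epoch_start + mem_latency:
            if current:
                groups.append(current)
            current, epoch_start = [], t
        current.append(t)
    if current:
        groups.append(current)
    return groups

# Three misses close together form one epoch; a later miss starts a new one.
print(epochs([100, 120, 150, 900]))  # -> [[100, 120, 150], [900]]
```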

Spatial memory streaming

by Stephen Somogyi, Thomas F. Wenisch, Anastassia Ailamaki, Babak Falsafi, Andreas Moshovos - In ISCA, 2006
"... Prior research indicates that there is much spatial variation in applications ' memory access patterns. Modern memory systems, however, use small fixed-size cache blocks and as such cannot exploit the variation. Increasing the block size would not only prohibitively increase pin and interconnec ..."
Abstract - Cited by 40 (11 self)
Prior research indicates that there is much spatial variation in applications' memory access patterns. Modern memory systems, however, use small fixed-size cache blocks and as such cannot exploit the variation. Increasing the block size would not only prohibitively increase pin and interconnect bandwidth demands, but also increase the likelihood of false sharing in shared-memory multiprocessors. In this paper, we show that memory accesses in commercial workloads often exhibit repetitive layouts that span large memory regions (e.g., several kB), and these accesses recur in patterns that are predictable through code-based correlation. We propose Spatial Memory Streaming, a practical on-chip hardware technique that identifies code-correlated spatial access patterns and streams predicted blocks to the primary cache ahead of demand misses. Using cycle-accurate full-system multiprocessor simulation of commercial and scientific applications, we demonstrate that Spatial Memory Streaming can on average predict 58% of L1 and 65% of off-chip misses, for a mean performance improvement of 37% and at best 307%.
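The abstract's code-correlated spatial patterns can be sketched as a table keyed by the trigger access (PC plus block offset within a region) that records which blocks of the region were touched, and replays that pattern on a later occurrence of the same trigger. Region size, indexing, and training below are simplified assumptions, not the exact design.

```python
# Simplified sketch of code-correlated spatial pattern prediction in the
# spirit of the abstract: the first access to a spatial region ("trigger")
# is tagged with (PC, block offset in region); the set of region blocks
# touched afterwards is recorded, and a later occurrence of the same
# trigger predicts (streams) those blocks.

REGION_BLOCKS = 16          # blocks per spatial region (assumed)
BLOCK = 64                  # bytes per cache block

patterns = {}               # (pc, trigger_offset) -> set of block offsets
active = {}                 # region base -> (trigger key, offsets seen so far)

def access(pc, addr):
    region = addr // (REGION_BLOCKS * BLOCK) * REGION_BLOCKS * BLOCK
    offset = (addr - region) // BLOCK
    if region not in active:                     # trigger access
        key = (pc, offset)
        active[region] = (key, set())
        predicted = patterns.get(key, set())
        return [region + o * BLOCK for o in sorted(predicted - {offset})]
    active[region][1].add(offset)                # training
    return []

def evict_region(region):                        # region leaves the tracker
    key, offsets = active.pop(region)
    patterns[key] = offsets

# Train on one region, then the same trigger on a new region streams blocks.
for a in (0x0, 0x40, 0x80):
    access(pc=0x400, addr=a)
evict_region(0x0)
print(access(pc=0x400, addr=0x1000))   # predicts 0x1040 and 0x1080
```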

Citation Context

...scientific, and DSS (except Qry1) workloads. In OLTP workloads, many of the misses that SMS predicts coincide with misses that the out-of-order core is able to overlap. Even though overall MLP is low [6], misses that the core can issue in parallel also tend to be spatially correlated (e.g., accesses to multiple fields in a structure). Therefore, the impact of correctly predicting these misses is redu...

Temporal Streaming of Shared Memory

by Thomas F. Wenisch, Stephen Somogyi, Nikolaos Hardavellas, Jangwoo Kim, Anastassia Ailamaki, Babak Falsafi - In Proceedings of the 32nd Annual International Symposium on Computer Architecture, 2005
"... Coherent read misses in shared-memory multiprocessors account for a substantial fraction of execution time in many important scientific and commercial workloads. We propose Temporal Streaming, to eliminate coherent read misses by streaming data to a processor in advance of the corresponding memory a ..."
Abstract - Cited by 38 (12 self)
Coherent read misses in shared-memory multiprocessors account for a substantial fraction of execution time in many important scientific and commercial workloads. We propose Temporal Streaming to eliminate coherent read misses by streaming data to a processor in advance of the corresponding memory accesses. Temporal streaming dynamically identifies address sequences to be streamed by exploiting two common phenomena in shared-memory access patterns: (1) temporal address correlation—groups of shared addresses tend to be accessed together and in the same order, and (2) temporal stream locality—recently-accessed address streams are likely to recur. We present a practical design for temporal streaming. We evaluate our design using a combination of trace-driven and cycle-accurate full-system simulation of a cache-coherent distributed shared-memory system. We show that temporal streaming can eliminate 98% of coherent read misses in scientific applications, and between 43% and 60% in database and web server workloads. Our design yields speedups of 1.07 to 3.29 in scientific applications, and 1.06 to 1.21 in commercial workloads.
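Temporal address correlation as described in the abstract can be sketched with a recorded sequence of miss addresses: on a miss to a previously seen address, the addresses that followed its last occurrence are streamed. The single global log and fixed stream length below are simplifying assumptions.

```python
# Simplified sketch of temporal streaming: miss addresses are appended to a
# recorded sequence; when an address that has been seen before misses again,
# the addresses that followed its last occurrence are streamed (prefetched),
# exploiting temporal address correlation and stream locality.

history = []                 # recorded sequence of miss addresses
last_index = {}              # address -> index of its most recent occurrence

def miss(addr, stream_len=4):
    predicted = []
    if addr in last_index:
        start = last_index[addr] + 1
        predicted = history[start:start + stream_len]
    last_index[addr] = len(history)
    history.append(addr)
    return predicted

for a in ['A', 'B', 'C', 'D', 'X']:
    miss(a)
print(miss('A'))   # -> ['B', 'C', 'D', 'X']: the stream that followed 'A'
```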

Citation Context

...can retrieve all blocks within a stream in parallel, thereby eliminating consumptions despite short stream lengths. To verify our hypothesis, we measure the consumption memory level parallelism (MLP) [4]—the average number of coherent read misses outstanding when at least one is outstanding—in our baseline timing model, and report the results in Table 3. Our results show that, in general, the commerc...

Techniques for Efficient Processing in Runahead Execution Engines

by Onur Mutlu, Hyesoon Kim, Yale N. Patt - In Proc. 32nd Intl. Symp. on Computer Architecture, 2005
"... Runahead execution is a technique that improves proces-sor performance by pre-executing the running application instead of stalling the processor when a long-latency cache miss occurs. Previous research has shown that this tech-nique significantly improves processor performance. How-ever, the effici ..."
Abstract - Cited by 30 (7 self)
Runahead execution is a technique that improves processor performance by pre-executing the running application instead of stalling the processor when a long-latency cache miss occurs. Previous research has shown that this technique significantly improves processor performance. However, the efficiency of runahead execution, which directly affects the dynamic energy consumed by a runahead processor, has not been explored. A runahead processor executes significantly more instructions than a traditional out-of-order processor, sometimes without providing any performance benefit, which makes it inefficient. In this paper, we describe the causes of inefficiency in runahead execution and propose techniques to make a runahead processor more efficient, thereby reducing its energy consumption and possibly increasing its performance. Our analyses and results provide two major insights: (1) the efficiency of runahead execution can be greatly improved with simple techniques that reduce the number of short, overlapping, and useless runahead periods, which we identify as the three major causes of inefficiency, (2) simple optimizations targeting the increase of useful prefetches generated in runahead mode can increase both the performance and efficiency of a runahead processor. The techniques we propose reduce the increase in the number of instructions executed due to runahead execution from 26.5% to 6.2%, on average, without significantly affecting the performance improvement provided by runahead execution.
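The abstract's three causes of inefficiency (short, overlapping, and useless runahead periods) suggest simple filters on when to enter runahead mode. The thresholds and bookkeeping below are illustrative assumptions, not the paper's exact mechanisms.

```python
# Illustrative entry filter for runahead execution, in the spirit of the
# abstract's causes of inefficiency: skip runahead periods that would be
# short (the blocking miss is about to return), overlapping (execution would
# not get past where a previous runahead period already reached), or
# predicted useless.

def should_enter_runahead(miss_cycles_outstanding, mem_latency,
                          blocking_seq_no, furthest_runahead_seq_no,
                          predicted_useful=True, min_remaining=60):
    remaining = mem_latency - miss_cycles_outstanding
    if remaining < min_remaining:                    # would be a short period
        return False
    if blocking_seq_no < furthest_runahead_seq_no:   # fully overlapping period
        return False
    return predicted_useful                          # uselessness prediction

# Miss has been outstanding 180 of ~200 cycles: not worth entering runahead.
print(should_enter_runahead(180, 200, 1000, 900))   # -> False
print(should_enter_runahead(20, 200, 1000, 900))    # -> True
```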

Citation Context

...xecution core. Previous research has shown that “runahead execution” is a technique that significantly increases the ability of a high-performance processor to tolerate the long main memory latencies [5, 13, 2]. Runahead execution improves the performance of a processor by speculatively pre-executing the application program while a long-latency data cache miss is being serviced, instead of stalling the proc...
