An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture
- In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems
, 2000
Cited by 57 (3 self)
This paper presents the first analysis of operating system execution on a simultaneous multithreaded (SMT) processor. While SMT has been studied extensively over the past 6 years, previous research has focused entirely on user-mode execution. However, many of the applications most amenable to multithreading technologies spend a significant fraction of their time in kernel code. A full understanding of the behavior of such workloads therefore requires execution and measurement of the operating system, as well as the application itself. To carry out this study, we (1) modified the Digital Unix 4.0d operating system to run on an SMT CPU, and (2) integrated our SMT Alpha instruction set simulator into the SimOS simulator to provide an execution environment. For an OS-intensive workload, we ran the multithreaded Apache Web server on an 8-context SMT. We compared Apache's user- and kernel-mode behavior to a standard multiprogrammed SPECInt workload, and compared the SMT processor to an out-...
Thread-Sensitive Scheduling for SMT Processors
, 2000
Cited by 55 (2 self)
A simultaneous-multithreaded (SMT) processor executes multiple instructions from multiple threads every cycle. As a result, threads on SMT processors -- unlike those on traditional shared-memory machines -- simultaneously share all low-level hardware resources in a single CPU. Because of this fine-grained resource sharing, SMT threads have the ability to interfere or conflict with each other, as well as to share these resources to mutual benefit. This paper examines thread-sensitive scheduling for SMT processors. When more threads exist than hardware execution contexts, the operating system is responsible for selecting which threads to execute at any instant, inherently deciding which threads will compete for resources. Thread-sensitive scheduling uses thread-behavior feedback to choose the best set of threads to execute together, in order to maximize processor throughput. We introduce several thread-sensitive scheduling schemes and compare them to traditional oblivious schemes, such as round-robin. Our measurements show how these scheduling algorithms impact performance and the utilization of low-level hardware resources. We also demonstrate how thread-sensitive scheduling algorithms can be tuned to trade off performance and fairness. For the workloads we measured, we show that an IPC-based thread-sensitive scheduling algorithm can achieve speedups over oblivious schemes of 7% to 15%, with minimal hardware costs.
Quantifying Loop Nest Locality Using SPEC'95 and the Perfect Benchmarks
, 1999
Cited by 39 (6 self)
This paper analyzes and quantifies the locality characteristics of numerical loop nests in order to suggest future directions for architecture and software cache optimizations. Since most programs spend the majority of their time in nests, the vast majority of cache optimization techniques target loop nests. In contrast, the locality characteristics that drive these optimizations are usually collected across the entire application rather than the nest level. Researchers have studied numerical codes for so long that a number of commonly held assertions have emerged on their locality characteristics. In light of these assertions, we use the SPEC'95 and Perfect Benchmarks to take a new look at measuring locality on numerical codes based on references, loop nests, and program locality properties. Our results show that several popular assertions are at best overstatements. For example, although most reuse is within a loop nest, in line with popular assertions, most misses are inter-nest cap...
Online Ensemble Learning: An Empirical Study
- In Proceedings of the Seventeenth International Conference on Machine Learning
, 2000
Cited by 32 (1 self)
We study resource-limited online learning, motivated by the problem of conditional-branch outcome prediction in computer architecture. In particular, we consider (parallel) time and space-efficient ensemble learners for online settings, empirically demonstrating benefits similar to those shown previously for offline ensembles.
Improving Cache Performance Via Active Management
, 1999
Cited by 14 (2 self)
This dissertation analyzes a way to improve cache performance via active management of a target cache space. As microprocessor speeds continue to grow faster than memory subsystem speeds, minimizing the average data access time grows in importance. As current data caches are often poorly and inefficiently managed, a good management technique can improve the average data access time. Cache management involves two main processes: block allocation decisions and block replacement decisions. Active block allocation can be performed most efficiently in multilateral caches (several distinct data stores with disjoint contents placed in parallel within L1), where blocks exhibiting particular characteristics can be placed in the appropriate store. To aid in our evaluation of different active block management schemes, we have developed a multilateral cache simulator, mlcach...
An evaluation of speculative instruction execution on simultaneous multithreaded processors
- Systems (TOCS), Volume 21, Issue 3 (August 2003), pages 314-340, 2003
, 2002
Cited by 12 (3 self)
Modern superscalar processors rely heavily on speculative execution for performance. For example, our measurements show that on a 6-issue superscalar, 93% of committed instructions for SPECINT95 are speculative. Without speculation, processor resources on such machines would be largely idle. In contrast to superscalars, simultaneous multithreaded (SMT) processors achieve high resource utilization by issuing instructions from multiple threads every cycle. An SMT processor thus has two means of hiding latency: speculation and multithreaded execution. However, these two techniques may conflict; on an SMT processor, wrong-path speculative instructions from one thread may compete with and displace useful instructions from another thread. For this reason, it is important to understand the trade-offs between these two latency-hiding techniques, and to ask whether multithreaded processors should speculate differently than conventional superscalars. This paper evaluates the behavior of instruction speculation on SMT processors using both multiprogrammed (SPECINT and SPECFP) and multithreaded (the Apache Web server) workloads. We measure and analyze the impact of speculation and demonstrate how speculation on an 8-context SMT differs from superscalar speculation. We also examine the effect of speculation-aware fetch and branch prediction policies in the processor. Our results quantify the extent to which (1) speculation
An Analysis of Software Interface Issues for SMT Processors
, 2002
Cited by 5 (0 self)
Simultaneous Multithreading (SMT) has gradually progressed from a research concept to commercial processor technology. This thesis explores three software interface issues on SMT that are important to its real-world applicability. These issues are: operating system performance on SMT, the impact of spinning on SMT, and register file limitations to scaling SMT. We investigate these issues with a new, detailed simulation infrastructure capable of modeling all operating system activity. First, we
The Split Spatial/Non-Spatial Cache: A Performance and Complexity Evaluation
Cited by 5 (0 self)
A simple new method of detecting useful spatial locality is proposed in this paper. The new method is tested by incorporating it into a new split cache design. Complexity estimation and performance evaluation of the new split cache design is done in order to compare it to the conventional cache architecture and the split temporal/spatial cache design.
Introduction
In recent years, the speed gap between dynamic memories and microprocessors has been steadily increasing. For this reason, a lot of effort is invested into finding ways to reduce or hide memory latency. One of the oldest and most powerful ways of reducing the memory latency is through use of cache memories. Caches exploit the locality of data access. A small (but fast) memory is able to satisfy most memory access requests issued by the processor, so in most cases there is no need to wait for slow (but large) main memory to respond. Conventional cache designs non-selectively cache all data. If the memory request is not satis...
Empirical Study of Opportunities for Bit-Level Specialization in Word-Based Programs
, 2000
FRAMEWORKS FOR PRECISE
, 2001
as a dissertation for the degree of Doctor of Philosophy. Monica S. Lam (Principal Adviser)