Results 1 - 10
of
15
Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers
, 1990
"... Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper prese ..."
Abstract
-
Cited by 747 (4 self)
- Add to MetaCart
Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. St...
The Effect of Context Switches on Cache Performance
- Jeffrey C. Mogul and Anita
, 1990
"... research relevant to the design and application of high performance scientific computers. We test our ideas by designing, building, and using real systems. The systems we build are research prototypes; they are not intended to become products. There is a second research laboratory located in Palo Al ..."
Abstract
-
Cited by 156 (1 self)
- Add to MetaCart
research relevant to the design and application of high performance scientific computers. We test our ideas by designing, building, and using real systems. The systems we build are research prototypes; they are not intended to become products. There is a second research laboratory located in Palo Alto, the Systems Research Center (SRC). Other Digital research groups are located in Paris (PRL) and in Cambridge,
Systems for Late Code Modification
- WRL Research Report 91/5
, 1991
"... Modifying code after the compiler has generated it can be useful for both optimization and instrumentation. This paper compares the code modification systems of Mahler and pixie, and describes two new systems we have built that are hybrids of the two. This paper covers material presented at the CODE ..."
Abstract
-
Cited by 88 (5 self)
- Add to MetaCart
Modifying code after the compiler has generated it can be useful for both optimization and instrumentation. This paper compares the code modification systems of Mahler and pixie, and describes two new systems we have built that are hybrids of the two. This paper covers material presented at the CODE '91 International Workshop on Code Generation, Schloss Dagstuhl, Germany, May 20-24, 1991. i 1. Introduction Late code modification is the process of modifying the output of a compiler after the compiler has generated it. The reasons one might want to do this fall into two categories, optimization and instrumentation. Some forms of optimization must be performed on assembly-level or machinelevel code. The oldest is peephole optimization [11], which acts to tidy up code that a compiler has generated; it has since been generalized to include transformations on more machine-independent code [2,3]. Reordering of code to avoid pipeline stalls [4,7,18] is most often done after the code is gene...
A Comparison of Trace-Sampling Techniques for Multi-Megabyte Caches
- IEEE Transactions on Computers
, 1994
"... This paper compares the trace-sampling techniques of set sampling and time sampling. Using the multi-billion-reference traces of Borg et al., we apply both techniques to multi-megabyte caches, where sampling is most valuable. We evaluate whether either technique meets a 10% sampling goal: a method m ..."
Abstract
-
Cited by 74 (2 self)
- Add to MetaCart
This paper compares the trace-sampling techniques of set sampling and time sampling. Using the multi-billion-reference traces of Borg et al., we apply both techniques to multi-megabyte caches, where sampling is most valuable. We evaluate whether either technique meets a 10% sampling goal: a method meets this goal if, at least 90% of the time, it estimates the trace's true misses per instruction with 10% relative error using 10% of the trace. Results for these traces and caches show that set sampling meets the 10% sampling goal, while time sampling does not. We also find that cold-start bias in time samples is most effectively reduced by the technique of Wood et al. Nevertheless, overcoming cold-start bias requires tens of millions of consecutive references. Index Terms - Cache memory, cache performance, cold start, computer architecture, memory systems, performance evaluation, sampling techniques, trace-driven simulation.
SCHEME->C: a Portable Scheme-to-C Compiler
- WRL Research Report 89/1
, 1989
"... One way to make compilers portable is to have them compile to an intermediate language which is implemented on multiple architectures. By using C as the intermediate language and compiling the LISP dialect Scheme to it, it might be possible to achieve the following benefits. First, since C is the li ..."
Abstract
-
Cited by 61 (3 self)
- Add to MetaCart
One way to make compilers portable is to have them compile to an intermediate language which is implemented on multiple architectures. By using C as the intermediate language and compiling the LISP dialect Scheme to it, it might be possible to achieve the following benefits. First, since C is the lingua franca of workstations the resulting system should be very portable. Second, it should allow Scheme programs to interact with those written in other languages. Finally, it should simplify the compiler as it need not repeat the optimization capability available in the C compiler. However, there might be some unacceptable costs associated with this. First, there might not be a clean translation from Scheme to C, so that the implementation is not quite Scheme. Second, the two-stage translation might result in inefficient code. Finally, the generated code might be so stylized that it is neither portable nor compatible with other programming languages. To investigate these issues, such a compiler and run-time system were constructed at Digital Equipment Corporation's Western Reaearch Laboratory. Experience with the system shows, that there is a translation of Scheme to C, that has good performance, and is portable.
Experience with a Software-Defined Machine Architecture
- Unreachable Procedures in Object-oriented WRL Research Report 91/10
, 1991
"... We built a system in which the compiler back end and the linker work together to present an abstract machine at a considerably higher level than the actual machine. The intermediate language translated by the back end is the target language of all high-level compilers and is also the only assembl ..."
Abstract
-
Cited by 53 (7 self)
- Add to MetaCart
We built a system in which the compiler back end and the linker work together to present an abstract machine at a considerably higher level than the actual machine. The intermediate language translated by the back end is the target language of all high-level compilers and is also the only assembly language generally available. This lets us do intermodule register allocation, which would be harder if some of the code in the program had come from a traditional assembler, out of sight of the optimizer. We do intermodule register allocation and pipeline instruction scheduling at link time, using information gathered by the compiler back end. The mechanism for analyzing and modifying the program at link time was also useful in a wide array of instrumentation tools. i 1. Introduction When our lab built its experimental RISC workstation, the Titan, we defined a high-level assembly language as the official interface to the machine. This high-level assembly language, called Mahler,...
Architectural And Organizational Tradeoffs in the Design of the MultiTitan CPU
, 1989
"... This paper describes the architectural and organizational tradeoffs made during the design of the MultiTitan, and provides data supporting the decisions made. These decisions covered the entire space of processor design, from the instruction set and virtual memory architecture through the pipeline a ..."
Abstract
-
Cited by 32 (2 self)
- Add to MetaCart
This paper describes the architectural and organizational tradeoffs made during the design of the MultiTitan, and provides data supporting the decisions made. These decisions covered the entire space of processor design, from the instruction set and virtual memory architecture through the pipeline and organization of the machine. In particular, some of the tradeoffs involved the use of an on-chip instruction cache with off-chip TLB and floating-point unit, the use of direct-mapped instead of associative caches, the use of a 64-bit vs. 32-bit data bus, and the implementation of hardware pipeline interlocks. This is a preprint of a paper that will be presented at the 16th Annual International Symposium on Computer Architecture, IEEE and ACM, Jerusalem, Israel, May 28-June 1, 1989. An early draft of this paper appeared as WRL Technical Note TN-8. Copyright 1989 ACM i 1. Introduction The MultiTitan is a high-performance general-purpose 32-bit microprocessor developed at Digital Eq...
The Distribution of Instruction-Level and Machine Parallelism and Its Effect on Performance
, 1989
"... This paper examines non-uniformities in the distribution of instructionlevel and machine parallelism. Non-uniformities in instruction-level parallelism include variations between benchmarks, variations within benchmarks, and variations by instruction class. Non-uniformities in machine-level parallel ..."
Abstract
-
Cited by 32 (0 self)
- Add to MetaCart
This paper examines non-uniformities in the distribution of instructionlevel and machine parallelism. Non-uniformities in instruction-level parallelism include variations between benchmarks, variations within benchmarks, and variations by instruction class. Non-uniformities in machine-level parallelism include variations in latency between different operations, and variations in parallel execution capability depending on the instruction opcode. The results presented in this paper were obtained with a parameterizable code reorganization and simulation system. The discussion of machine parallelism is based on the concepts of superscalar and superpipelined machines. Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latencies of their functional units. These two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. The average ...
Page placement algorithms for large real-indexed caches
- ACM Transactions on Computer Systems
, 1992
"... When a computer system supports both paged virtual memory and large real-indexed caches, cache performance depends in part on the main memory page placement. To date, most operating systems place pages by selecting an arbitrary page frame from a pool of page frames that have been made available by t ..."
Abstract
-
Cited by 21 (0 self)
- Add to MetaCart
When a computer system supports both paged virtual memory and large real-indexed caches, cache performance depends in part on the main memory page placement. To date, most operating systems place pages by selecting an arbitrary page frame from a pool of page frames that have been made available by the page replacement algorithm. We give a simple model that shows that this naive (arbitrary) page placement leads to up to 30 % unnecessary cache conflicts. We develop several page placement algorithms, called careful-mappmg algonthnzs, that try to select a page frame (from the pool of available page frames) that is likely to reduce cache contention. Using trace-driven simulation, we find that careful mappmg results in 1O–2OTO fewer (dynamic) cache misses than naive mapping (for a direct-mapped real-indexed multimegabyte cache). Thus, our results suggest that careful mapping by the operating system can get about half the cache miss reduction that a cache size (or associativity) doubling can.
The Impact of Software Structure and Policy on CPU and Memory System Performance
, 1994
"... Operating systems, when compared to application programs, have received disappointingly little benefit from the performance improvements of the most recent generation of microprocessors. This thesis used complete traces of software activity from a RISCbased uniprocessor to expose the dynamic behavio ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
Operating systems, when compared to application programs, have received disappointingly little benefit from the performance improvements of the most recent generation of microprocessors. This thesis used complete traces of software activity from a RISCbased uniprocessor to expose the dynamic behavior of operating system execution and explore the sources of poor performance. Traces from both Mach 3.0 and Ultrix implementations of UNIX permitted a study of performance differences between microkernel and monolithic implementations of the same operating system interface. The comparison showed that both system structure and policy implemented in the system have a significant impact on performance. Measurements of X11 workloads showed that memory system behavior for these large workloads differs significantly from the kinds of workloads traditionally used for performance analysis. Structural and behavioral similarities between large X11 workloads and the operating system are reflected in the...

