Simultaneous Subordinate Microthreading (SSMT)
, 1999
"... Current work in Simultaneous Multithreading provides little benefit to programs that aren't partitioned into threads. We propose Simultaneous Subordinate Microthreading (SSMT) to correct this by spawning subordinate threads that perform optimizations on behalf of the single primary thread. Thes ..."
Abstract
-
Cited by 111 (6 self)
- Add to MetaCart
(Show Context)
Current work in Simultaneous Multithreading provides little benefit to programs that aren't partitioned into threads. We propose Simultaneous Subordinate Microthreading (SSMT) to correct this by spawning subordinate threads that perform optimizations on behalf of the single primary thread. These threads, written in microcode, are issued and executed concurrently with the primary thread. They directly manipulate the microarchitecture to improve the primary thread's branch prediction accuracy, cache hit rate, and prefetch effectiveness. All contribute to the performance of the primary thread. This paper introduces SSMT and discusses its potential to increase performance. We illustrate its usefulness with an SSMT machine that executes subordinate microthreads to improve the branch prediction of the primary thread. We show simulation results for the SPECint95 benchmarks.
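As a rough illustration of the idea in this abstract (not the microcoded machine the authors describe), the following Python sketch models a primary thread whose branch predictor is a plain 2-bit counter, plus a periodically spawned "subordinate" routine that pre-computes upcoming branch outcomes and deposits them where the fetch model can consult them. The data, trigger point, and window size are invented for the example.

```python
# Toy model of subordinate microthreading: a helper routine resolves upcoming
# data-dependent branches ahead of time and overrides the primary thread's
# 2-bit-counter prediction for those branches. Purely illustrative; the real
# SSMT microthreads run as microcode and manipulate hardware structures.
import random

random.seed(1)
data = [random.randrange(100) for _ in range(10_000)]
cond = lambda x: x % 7 < 3            # data-dependent branch with no simple pattern

def run(with_microthread):
    counter = 1                       # 2-bit saturating counter (>= 2 predicts taken)
    overrides = {}                    # branch index -> outcome precomputed by the helper
    correct = 0
    for i, x in enumerate(data):
        if with_microthread and i % 256 == 0:
            # "Spawn" the subordinate thread: pre-execute the branch condition
            # for the next 64 iterations while the primary thread keeps running.
            for j in range(i, min(i + 64, len(data))):
                overrides[j] = cond(data[j])
        taken = cond(x)
        predicted = overrides.get(i, counter >= 2)
        correct += (predicted == taken)
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(data)

print(f"2-bit counter alone: {run(False):.1%}")
print(f"with helper thread : {run(True):.1%}")
```

The branches covered by the helper are predicted perfectly, which is the effect SSMT aims for; costs such as microthread issue bandwidth and contention with the primary thread are omitted here.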
Optimizing Compiler for the Cell Processor
- In PACT
, 2005
"... Developed for multimedia and game applications, as well as other numerically intensive workloads, the CELL processor provides support both for highly parallel codes, which have high computation and memory requirements, and for scalar codes, which require fast response time and a full-featured progra ..."
Abstract
-
Cited by 53 (1 self)
- Add to MetaCart
Developed for multimedia and game applications, as well as other numerically intensive workloads, the CELL processor provides support both for highly parallel codes, which have high computation and memory requirements, and for scalar codes, which require fast response time and a full-featured programming environment. This first-generation CELL processor implements on a single chip a Power Architecture processor with two levels of cache, and eight attached streaming processors with their own local memories and globally coherent DMA engines. In addition to processor-level parallelism, each processing element has a Single Instruction Multiple Data (SIMD) unit that can process from 2 double-precision floating-point values up to 16 single-byte values per instruction. This paper describes, in the context of a research prototype, several compiler techniques that aim at automatically generating high-quality code over the wide range of heterogeneous parallelism available on the CELL processor. Techniques include compiler-supported branch prediction, compiler-assisted instruction fetch, generation of scalar codes on SIMD units, automatic generation of SIMD codes, and data and code partitioning across the multiple processor elements in the system. Results indicate that significant speedup can be achieved with a high level of support from the compiler.
Procedure Placement Using Temporal-Ordering Information
- ACM Transactions on Programming Languages and Systems
, 1997
"... ..."
(Show Context)
Buffering Database Operations for Enhanced Instruction Cache Performance
- In Proc. SIGMOD
, 2004
"... As more and more query processing work can be done in main memory, memory access is becoming a signicant cost component of database operations. Recent database re-search has shown that most of the memory stalls are due to second-level cache data misses and rst-level instruction cache misses. While a ..."
Abstract
-
Cited by 35 (2 self)
- Add to MetaCart
(Show Context)
As more and more query processing work can be done in main memory, memory access is becoming a significant cost component of database operations. Recent database research has shown that most of the memory stalls are due to second-level cache data misses and first-level instruction cache misses. While a lot of research has focused on reducing the data cache misses, relatively little research has been done on improving the instruction cache performance of database systems. We first answer the question "Why does a database system incur so many instruction cache misses?" We demonstrate that current demand-pull pipelined query execution engines suffer from significant instruction cache thrashing between different operators. We propose techniques to buffer database operations during query execution to avoid instruction cache thrashing. We implement a new light-weight "buffer" operator and study various factors which may affect the cache performance. We also introduce a plan refinement algorithm that considers the query plan and decides whether it is beneficial to add additional "buffer" operators and where to put them. The benefit is mainly from better instruction locality and better hardware branch prediction. Our techniques can be easily integrated into current database systems without significant changes. Our experiments in a memory-resident PostgreSQL database system show that buffering techniques can reduce the number of instruction cache misses by up to 80% and improve query performance by up to 15%.
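To make the buffering idea concrete, here is a minimal Python sketch of a demand-pull pipeline with an explicit "buffer" operator that drains a batch of tuples from its child before passing any of them up. The operator names and batch size are our own, not the paper's PostgreSQL implementation.

```python
# Toy demand-pull query pipeline with an explicit "buffer" operator.
def scan(rows):                       # leaf operator: produces one tuple at a time
    for r in rows:
        yield r

def select(child, pred):              # filter operator in the demand-pull (iterator) style
    for r in child:
        if pred(r):
            yield r

def buffer(child, batch_size=128):
    """Drain batch_size tuples from the child before handing any of them up.

    Running the child's code many times back to back keeps its instruction
    footprint hot in the I-cache instead of ping-ponging with the parent.
    """
    batch = []
    for r in child:
        batch.append(r)
        if len(batch) == batch_size:
            yield from batch
            batch.clear()
    yield from batch                  # flush the final partial batch

if __name__ == "__main__":
    plan = buffer(select(scan(range(1000)), lambda r: r % 3 == 0), batch_size=64)
    print(sum(1 for _ in plan))       # 334 qualifying tuples reach the top of the plan
```

In Python the buffering obviously buys nothing; the point is only to show where such an operator sits in the plan and how it changes the call pattern between operators.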
Call Graph Prefetching for Database Applications
- ACM Transactions on Computer Systems
, 2000
"... With the continuing technological trend of ever cheaper and larger memory, most data sets in database servers will soon be able to reside in main memory. In this configuration, the performance bottleneck is likely to be the gap between the processing speed of the CPU and the memory access latency. P ..."
Abstract
-
Cited by 21 (2 self)
- Add to MetaCart
(Show Context)
With the continuing technological trend of ever cheaper and larger memory, most data sets in database servers will soon be able to reside in main memory. In this configuration, the performance bottleneck is likely to be the gap between the processing speed of the CPU and the memory access latency. Previous work has shown that database applications have large instruction and data footprints and hence do not use processor caches effectively. In this paper, we propose Call Graph Prefetching (CGP), a hardware technique that analyzes the call graph of a database system and prefetches instructions from the function that is deemed likely to be called next. CGP capitalizes on the highly predictable function call sequences that are typical of database systems. We evaluate the performance of CGP on sets of Wisconsin and TPC-H queries, as well as on CPU-2000 benchmarks. For most CPU-2000 applications the number of I-cache misses was very small even without any prefetching, obviating the need for C...
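A simplified software analogue of the call-graph prefetching mechanism sketched in this abstract: a table remembers which callee usually follows entry to each function and issues a prefetch for that callee's code on the next entry. The class, trace, and function names below are invented for illustration.

```python
# Sketch of call-graph-directed instruction prefetching (illustrative only).
from collections import Counter, defaultdict

class CallGraphPrefetcher:
    def __init__(self):
        self.next_callee = defaultdict(Counter)   # caller -> observed callee frequencies
        self.prefetched = []                      # stand-in for issued I-cache prefetches

    def on_enter(self, func):
        # Predict: prefetch the code of the callee most often seen after entering func.
        if self.next_callee[func]:
            likely, _ = self.next_callee[func].most_common(1)[0]
            self.prefetched.append(likely)

    def on_call(self, caller, callee):
        # Train: record that callee was invoked from caller.
        self.next_callee[caller][callee] += 1

# A tiny call trace in the spirit of a database executor's inner loop.
trace = [("ExecScan", "heap_getnext")] * 5 + [("ExecScan", "ExecQual")]
p = CallGraphPrefetcher()
for caller, callee in trace:
    p.on_enter(caller)
    p.on_call(caller, callee)
print(p.prefetched)   # after the first call, 'heap_getnext' is prefetched on every entry
```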
Ispike: A Post-Link Optimizer for the Intel Itanium Architecture
- In IEEE/ACM International Symposium on Code Generation and Optimization
, 2004
"... Ispike is a post-link optimizer developed for the Intel R Itanium Processor Family (IPF) processors. The IPF architecture poses both opportunities and challenges to post-link optimizations. IPF offers a rich set of performance counters to collect detailed profile information at a low cost, which is ..."
Abstract
-
Cited by 20 (2 self)
- Add to MetaCart
(Show Context)
Ispike is a post-link optimizer developed for the Intel Itanium Processor Family (IPF) processors. The IPF architecture poses both opportunities and challenges to post-link optimizations. IPF offers a rich set of performance counters to collect detailed profile information at a low cost, which is essential to making post-link optimization practical. At the same time, the predication and bundling features on IPF make post-link code transformation more challenging than on other architectures. In Ispike, we have implemented optimizations like code layout, instruction prefetching, data layout, and data prefetching that exploit the IPF advantages, and strategies that cope with the IPF-specific challenges. Using SPEC CINT2000 as benchmarks, we show that Ispike improves performance by as much as 40% on the Itanium 2 processor, with average improvements of 8.5% and 9.9% over executables generated by the Intel Electron compiler and by the Gcc compiler, respectively. We also demonstrate that statistical profiles collected via IPF performance counters and complete profiles collected via instrumentation produce equal performance benefit, but the profiling overhead is significantly lower for performance counters.
Branch History Guided Instruction Prefetching
- In Proceedings of the Seventh International Conference on High Performance Computer Architecture (HPCA)
, 2001
"... Instruction cache misses stall the fetch stage of the processor pipeline and hence affect instruction supply to the processor. Instruction prefetching has been proposed as a mechanism to reduce instruction cache (I-cache) misses. However, a prefetch is effective only if accurate and initiated suffic ..."
Abstract
-
Cited by 14 (1 self)
- Add to MetaCart
(Show Context)
Instruction cache misses stall the fetch stage of the processor pipeline and hence affect instruction supply to the processor. Instruction prefetching has been proposed as a mechanism to reduce instruction cache (I-cache) misses. However, a prefetch is effective only if it is accurate and initiated sufficiently early to cover the miss penalty. This paper presents a new hardware-based instruction prefetching mechanism, Branch History Guided Prefetching (BHGP), to improve the timeliness of instruction prefetches. BHGP correlates the execution of a branch instruction with I-cache misses and uses branch instructions to trigger prefetches of instructions that occur (N − 1) branches later in the program execution, for a given N > 1. Evaluations on commercial applications, Windows NT applications, and some CPU2000 applications show an average reduction of 66% in miss rate over all applications. BHGP improved the IPC by 12 to 14% for the CPU2000 applications studied; on average 80% of the BHGP prefetc...
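A minimal sketch of the branch-triggered prefetching described above, assuming the "(N − 1) branches later" reading: an I-cache miss is credited to the branch executed N − 1 branches earlier, and the next execution of that branch triggers the prefetch. The table organization and trace below are invented.

```python
# Branch-history-guided prefetching, heavily simplified.
from collections import defaultdict, deque

class BHGP:
    def __init__(self, n=2):
        self.recent_branches = deque(maxlen=n - 1)      # last N-1 branch PCs seen
        self.table = defaultdict(set)                   # trigger branch PC -> miss addresses
        self.prefetches = []

    def on_branch(self, pc):
        self.prefetches.extend(sorted(self.table[pc]))  # replay prefetches learned for pc
        self.recent_branches.append(pc)

    def on_icache_miss(self, addr):
        if len(self.recent_branches) == self.recent_branches.maxlen:
            trigger = self.recent_branches[0]           # branch executed N-1 branches ago
            self.table[trigger].add(addr)               # learn: this branch precedes the miss

b = BHGP(n=2)
for _ in range(2):                                      # two passes over the same control flow
    b.on_branch(0x400100)
    b.on_branch(0x400180)
    b.on_icache_miss(0x401000)                          # miss that follows the second branch
print([hex(a) for a in b.prefetches])                   # second pass prefetches 0x401000 early
```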
Retargetable Static Timing Analysis for Embedded Software
, 2001
"... This paper presents a novel approach for retargetable static software timing analysis. Specifically, we target the problem of determining bounds on the execution time of a program on modern processors, and solve this problem in a retargetable software development environment. Another contribution of ..."
Abstract
-
Cited by 12 (0 self)
- Add to MetaCart
This paper presents a novel approach for retargetable static software timing analysis. Specifically, we target the problem of determining bounds on the execution time of a program on modern processors, and solve this problem in a retargetable software development environment. Another contribution of this paper is the modeling of important features in contemporary architectures, such as branch prediction, predication, and instruction prefetching, which have a great impact on system performance and have rarely been handled thus far. These ideas allow us to build a timing analysis tool that is efficient, accurate, modular, and retargetable. We present preliminary results for sample embedded programs to demonstrate the applicability of the proposed approach.
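As a toy example of what "bounds on the execution time" means here, the sketch below takes the worst case over all paths of a small acyclic control-flow graph, charging each basic block a cycle cost and each two-way branch a misprediction penalty. The graph, costs, and penalty are invented; the paper's analysis is far more detailed.

```python
# Very simplified static timing bound: worst case over all paths of an acyclic CFG.
BLOCK_CYCLES = {"A": 10, "B": 7, "C": 12, "D": 5}
SUCCESSORS   = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
MISPREDICT_PENALTY = 15   # charged whenever a block ends in a two-way branch

def wcet(block):
    """Worst-case cycle count from block to the end of the CFG (acyclic)."""
    cost = BLOCK_CYCLES[block]
    succs = SUCCESSORS[block]
    if not succs:
        return cost
    if len(succs) > 1:
        cost += MISPREDICT_PENALTY           # pessimistic: assume the branch mispredicts
    return cost + max(wcet(s) for s in succs)

print(wcet("A"))   # 10 + 15 + max(7, 12) + 5 = 42 cycles
```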
Cache Prefetching
- Tech Report, UW-CSE
, 2002
"... Cache prefetching is a memory latency hiding technique that attempts to bring data to the caches before the occurrence of a miss. A central aspect of all cache prefetching techniques is their ability to detect and predict particular memory reference patterns. In this paper we will introduce and comp ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
(Show Context)
Cache prefetching is a memory latency hiding technique that attempts to bring data to the caches before the occurrence of a miss. A central aspect of all cache prefetching techniques is their ability to detect and predict particular memory reference patterns. In this paper we will introduce and compare how this is done for each of the specific memory reference patterns that have been identified. Because most applications contain many different memory reference patterns, we will also discuss how prefetching techniques can be combined into a mechanism to deal with a larger number of memory reference patterns. Finally, we will discuss how applicable the currently used prefetching techniques are for a multimedia processing system.
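One of the reference-pattern detectors such a survey covers is the classic stride prefetcher. Below is a minimal sketch with a reference prediction table indexed by the load's PC and a simple two-hit confidence rule of our own choosing.

```python
# Sketch of a stride prefetcher driven by a per-PC reference prediction table.
class StridePrefetcher:
    def __init__(self):
        self.table = {}        # pc -> (last_addr, stride, confidence)
        self.prefetches = []   # stand-in for issued prefetch addresses

    def access(self, pc, addr):
        last, stride, conf = self.table.get(pc, (None, 0, 0))
        if last is not None:
            new_stride = addr - last
            conf = min(conf + 1, 3) if new_stride == stride else 0
            stride = new_stride
        self.table[pc] = (addr, stride, conf)
        if conf >= 2 and stride != 0:            # stride confirmed twice: prefetch ahead
            self.prefetches.append(addr + stride)

pf = StridePrefetcher()
for a in range(0x1000, 0x1000 + 8 * 64, 64):     # one load sweeping an array, stride 64
    pf.access(pc=0x400500, addr=a)
print([hex(x) for x in pf.prefetches])           # prefetch requests issued 64 bytes ahead
```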
Temporal Instruction Fetch Streaming
"... Abstract—L1 instruction-cache misses pose a critical performance bottleneck in commercial server workloads. Cache access latency constraints preclude L1 instruction caches large enough to capture the application, library, and OS instruction working sets of these workloads. To cope with capacity cons ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
(Show Context)
L1 instruction-cache misses pose a critical performance bottleneck in commercial server workloads. Cache access latency constraints preclude L1 instruction caches large enough to capture the application, library, and OS instruction working sets of these workloads. To cope with capacity constraints, researchers have proposed instruction prefetchers that use branch predictors to explore future control flow. However, such prefetchers suffer from several fundamental flaws: their lookahead is limited by branch prediction bandwidth, their accuracy suffers from geometrically-compounding branch misprediction probability, and they are ignorant of the cache contents, frequently predicting blocks already present in L1. Hence, L1 instruction misses remain a bottleneck. We propose Temporal Instruction Fetch Streaming (TIFS), a mechanism for prefetching temporally-correlated instruction streams from lower-level caches. Rather than explore a program's control flow graph, TIFS predicts future instruction-cache misses directly, through recording and replaying recurring L1 instruction miss sequences. In this paper, we first present an information-theoretic offline trace analysis of instruction-miss repetition to show that 94% of L1 instruction misses occur in long, recurring sequences. Then, we describe a practical mechanism to record these recurring sequences in the L2 cache and leverage them for instruction-cache prefetching. Our TIFS design requires less than 5% storage overhead over the baseline L2 cache and improves performance by 11% on average and 24% at best in a suite of commercial server workloads.
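A much-simplified software sketch of the record-and-replay idea: log the instruction-miss address stream, and when a miss address recurs, stream out the addresses that followed it last time as prefetches. The structure below ignores the paper's L2-resident storage and all sizing details.

```python
# Temporal miss-stream record-and-replay, heavily simplified.
class TemporalStreamPrefetcher:
    def __init__(self, stream_len=4):
        self.history = []          # global log of past instruction-miss addresses
        self.index = {}            # miss address -> position of its last occurrence
        self.stream_len = stream_len
        self.prefetches = []

    def on_miss(self, addr):
        pos = self.index.get(addr)
        if pos is not None:
            # Replay: prefetch the blocks that followed this miss previously.
            self.prefetches.extend(self.history[pos + 1 : pos + 1 + self.stream_len])
        self.index[addr] = len(self.history)
        self.history.append(addr)

pf = TemporalStreamPrefetcher()
loop = [0x10, 0x50, 0x90, 0xd0, 0x110]       # a recurring instruction-miss sequence
for _ in range(2):                            # the second traversal hits the recorded stream
    for a in loop:
        pf.on_miss(a)
print([hex(x) for x in pf.prefetches])        # once 0x10 recurs, 0x50..0x110 are streamed out
```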