J. Torrellas, C. Xia, and R. Daigle, "Optimizing Instruction Cache Performance for Operating System Intensive Workloads," Proceedings of the First International Symposium on High-Performance Computer Architecture, pp. 360--369, January 1995.
....to increase fetch performance both in BTB architectures and trace cache architectures. Code layout optimizations were initially proposed to improve the performance of the instruction memory hierarchy (instruction cache, instruction TLB) by reducing the code footprint and minimizing conflict misses [15, 24, 37, 13, 11], and to align branches to benefit the underlying fetch architecture and branch predictor [4]. Our previous work presents a detailed analysis [25] of the effects of these optimizations, concluding that the improvements on the instruction cache performance are due to an increase in the sequential ....
J. Torrellas, C. Xia, and R. Daigle. Optimizing instruction cache performance for operating system intensive workloads. Proceedings of the 1st Intl. Conference on High Performance Computer Architecture, pages 360--369, Jan. 1995.
....with the goal of more efficiently using the instruction cache. Several compile time code placement techniques have been proposed that use heuristics and profile information to reduce the number of conflict misses in the primary (first-level or L1) instruction cache by reordering the program code [3, 18, 26, 27, 36]. Most of this work uses cache parameters such as cache size and line size as well as procedure sizes to accurately model the cache mapping of the code. The code placement algorithms typically use some kind of profile information to find a cache mapping that reduces cache conflict misses. These ....
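As a rough illustration of the cache-mapping model mentioned in the excerpt above, the following sketch computes which lines of a direct-mapped cache a code region occupies and whether two regions overlap. The function names, the direct-mapped assumption, and the default cache parameters (8 KB cache, 32-byte lines) are assumptions made for illustration; they are not taken from any of the cited papers.

```python
def cache_lines(start, size, cache_size=8192, line_size=32):
    """Return the set of cache line indices a code region occupies.

    Assumes a direct-mapped cache; start and size are in bytes.
    """
    num_lines = cache_size // line_size
    first = start // line_size
    last = (start + size - 1) // line_size
    # A region at least as long as the cache touches every line.
    if last - first + 1 >= num_lines:
        return set(range(num_lines))
    return {block % num_lines for block in range(first, last + 1)}


def conflicting_lines(proc_a, proc_b, **cache):
    """Cache lines shared by two (start, size) procedures."""
    return cache_lines(*proc_a, **cache) & cache_lines(*proc_b, **cache)


# Two procedures exactly one cache size apart collide on line 0.
print(conflicting_lines((0, 64), (8192, 32)))
# Adjacent procedures do not collide.
print(conflicting_lines((0, 64), (64, 64)))
```

A placement algorithm of the kind described would, under these assumptions, use such a model to score candidate layouts by the overlap between frequently executed regions.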
....shows two programs that result in the same WCG but have substantially different temporal behavior. McFarling [26] uses profile data that incorporates loop counts and probabilities for conditionals, but still retains the limitations mentioned above. Basic block transitions, used by Torrellas et al. [36], share these limitations. Our technique is based on a profiling scheme that captures important information about the temporal behavior of the program. By temporal behavior, we mean precisely the difference between the two programs in Figure 2. They differ in the way in which execution of the ....
[Article contains additional citation context not shown here]
J. Torrellas, C. Xia, and R. Daigle, "Optimizing instruction cache performance for operating system intensive workloads," Proc. HPCA-1: 1st Intl. Symposium on High-Performance Computer Architecture, p.360, January 1995.
....such instrumentation can capture all memory references, it perturbs workload execution [16]. Other studies employed bus monitors [26], which have the drawback of capturing only memory activity reaching the bus. To overcome this, some have used a combination of instrumentation and bus monitors [78, 88, 79, 14]. As an example of more recent studies, Torrellas, Gupta, and Hennessy [78] measured L2 cache misses on an SMP of MIPS R3000 processors; they report sharing and invalidation misses and distinguish between user and kernel conflict misses. Maynard, Donnelly, and Olszewski [48] looked at a ....
TORRELLAS, J., XIA, C., AND DAIGLE, R. Optimizing instruction cache performance for operating system intensive workloads. In Proceedings of the International Symposium on High-Performance Computer Architecture (January 1995).
....procedures are split in two parts: the hot section which contains the frequently executed code, and the cold part which contains mostly unused code. After splitting, the procedures are mapped in memory so that two procedures which call each other will be mapped close in memory. Torrellas et al. [28] designed a basic block reordering algorithm for operating system code. Using a basic block chaining algorithm similar to that of Hwu and Chang, they build traces spanning several functions, and then keep a section of the cache address space reserved for the most frequently referenced basic ....
J. Torrellas, C. Xia, and R. Daigle. Optimizing instruction cache performance for operating system intensive workloads. Proceedings of the 1st Intl. Conference on High Performance Computer Architecture, pages 360--369, Jan. 1995.
....and spatial locality for all levels of memory. Many software techniques have been developed for improving instruction cache performance. Techniques such as basic block reordering [37, 66], function grouping [26, 31, 37, 66], reordering based on control structure [60], and reordering of system code [83] have all been shown to significantly improve instruction cache performance. The increasing latency of second level caches means that expensive cache usage patterns, such as ping ponging between code laid out on the same cache line, can have dramatic effects on program performance. Java ....
J. Torrellas, C. Xia, and R. Daigle. Optimizing instruction cache performance for operating system intensive workloads. In Proceedings of the First International Symposium on High-Performance Computer Architecture, pages 360--369, January 1995.
.... in a program so that they do not conflict with each other, we can reduce the number of cache misses by almost an order of magnitude [17, 7, 6]. By aligning basic blocks so that they execute sequentially, we can further increase spatial locality, increasing both cache performance and fetch bandwidth [8, 17, 25, 19]. The third factor has motivated the search of more accurate branch predictors. The performance loss due to branch instructions was first approached with static branch predictors, which always predict the same outcome for a given branch. This prediction was obtained either using very simple ....
....in two main groups: code layout optimization techniques, and branch prediction techniques. Code layout optimizations usually target a better utilization of the instruction cache, and use profile data or heuristics to lay out the routines in a program [17, 7, 6] and the basic blocks in a routine [8, 17, 25, 19] to minimize the number of conflict misses. Reducing the number of conflict misses in the instruction cache, code reordering increases fetch performance, and overall processor performance. The use of both routine placement and basic block reordering can also increase the effective fetch bandwidth ....
J. Torrellas, C. Xia, and R. Daigle. Optimizing instruction cache performance for operating system intensive workloads. Proceedings of the 1st Intl. Conference on High Performance Computer Architecture, pages 360--369, Jan. 1995.
....while such instrumentation can capture all memory references, it perturbs workload execution [7]. Other studies employed bus monitors [12], which have the drawback of capturing only memory activity reaching the bus. To overcome this, some have used a combination of instrumentation and bus monitors [5, 39, 46, 40]. As an example of more recent studies, Torrellas, Gupta, and Hennessy [39] measured L2 cache misses on an SMP of MIPS R3000 processors; they report sharing and invalidation misses and distinguish between user and kernel conflict misses. Maynard, Donnelly, and Olszewski [25] looked at a ....
J. Torrellas, C. Xia, and R. Daigle. Optimizing instruction cache performance for operating system intensive workloads. In 5th International Conference on Architectural Support for Programming Languages and Operating Systems, January 1995.
....Procedures are then split into two regions, the primary that includes all the frequently accessed chains of basic blocks and the fluff that includes the infrequently executed blocks. The linker forces all fluff procedures to the end of the code area in the modified executable. Torrellas et al. [72, 73] propose an algorithm for repositioning operating system code. They identify the reference and miss patterns and characterize the spatial, temporal and loop locality of operating system code. They propose an interprocedural basic block repositioning algorithm, where spatial locality is exploited ....
....as well as code size, improving performance when the application's working set did not fit in the instruction cache. In [85], the authors compare profile guided code reordering with a hardware trace cache (HTC) [87]. They consider an interprocedural basic block reordering algorithm, described in [72, 73], which they name a Software Trace Cache (STC). Their goal is twofold: 1) provide a cache conscious algorithm and 2) maximize code sequentiality. Their findings show that, for applications with few loops and deterministic execution sequences that span a large set of basic blocks (e.g. databases, ....
J. Torrellas, C. Xia, and R. Daigle. Optimizing Instruction Cache Performance for Operating System Intensive Workloads. In Proceedings of the International Conference on High Performance Computer Architecture, pages 360--369, January 1995.
....block globally. Branch alignment is a form of basic block positioning technique that attempts to minimize the effects of branch mispredictions and misfetches [69, 70, 71]. Most other related work on basic block reordering has targeted improving fetch unit effectiveness and memory access time [53, 54, 66, 72, 67]. The main idea behind all these strategies is to rearrange code units so that conflicts between them at different levels of the memory hierarchy (1st and 2nd level caches, main memory) are reduced. In addition, the new ordering of code should improve spatial locality and cache utilization. We ....
....Procedures are then split into two regions, the primary that includes all the frequently accessed chains of basic blocks and the fluff that includes the infrequently executed blocks. The linker forces all fluff procedures to the end of the code area in the modified executable. Torrellas et al. [72, 73] propose an algorithm for repositioning operating system code. They identify the reference and miss patterns and characterize the spatial, temporal and loop locality of operating system code. They propose an interprocedural basic block repositioning algorithm, where spatial locality is exploited ....
[Article contains additional citation context not shown here]
J. Torrellas, C. Xia, and R. Daigle. Optimizing the Instruction Cache Performance of an Operating System. IEEE Transactions on Computers, 47(12):1363--1381, December 1998.
....In Section 3 we describe our code repositioning algorithm, while in Section 4 we report simulation results. We conclude the paper in Section 5. 2 Related work There has been a considerable amount of work done on profile guided code positioning for improved instruction cache performance [1, 2, 3, 4, 5, 6]. We will limit our discussion here to work that is directly related to this paper. Hashemi et al. in [2] use a Call Graph (CG) to guide procedure placement. Their placement algorithm utilizes information about the cache organization to color procedures so that they do not overlap in the cache ....
J. Torrellas, C. Xia, and R. Daigle. Optimizing Instruction Cache Performance for Operating System Intensive Workloads. In Proceedings of the International Conference on High Performance Computer Architecture, pages
....on cache organization we concentrate on the layout of a program on the memory space. Bershad et al. suggested remapping cache addresses dynamically to avoid conflict misses in large direct mapped caches [3]. An alternative approach is to perform code repositioning at compile or link time [8, 10, 11, 15, 18]. The idea is to place frequently used sections of a program next to each other in the address space, thereby reducing the chances of cache conflicts while increasing spatial locality within the program. Code reordering algorithms for improved memory performance can span several different levels ....
....cache lines used by each procedure in the mapping. This allows our algorithm to effectively eliminate first generation cache conflicts, even when the popular subgraph size is larger than the instruction cache, by using the color mapping and the unavailable set of colors. Torrellas, Xia and Daigle [18] (TXD) also described an algorithm for code layout for operating system intensive workloads. Their work takes into consideration the size of the cache and the popularity of code. Their algorithm partitions the operating system code into executed and non executed parts at the basic block level. It ....
J. Torrellas, C. Xia, and R. Daigle. Optimizing instruction cache performance for operating system intensive workloads. In Proceedings of the First International Symposium on High-Performance Computer Architecture, pages 360--369, January 1995.
....two or more procedures whose addresses map to overlapping sets of cache lines. Several compile time code placement techniques have been developed that use heuristics and profile information to reduce the number of conflict misses in the instruction cache by a reordering of the program code blocks [5,6,7,8,11]. Though these techniques successfully remove a sizeable number of the conflict misses when compared to the default code layout produced during the typical compilation process, it is possible to do even better if we gather improved profile information and consider the specifics of the hardware ....
....cache size and its modulo property when evaluating potential layouts, but his cost calculation is obviously different from ours. Finally, his algorithm is unique in its ability to determine which portions of the text segment should be excluded from the instruction cache. Torrellas, Xia, and Daigle [11] propose a code placement technique for kernel intensive applications. Their algorithm considers the cache address mapping when performing code placement. They define an array of logical caches, equal in size and address alignment to the hardware cache. Code placed within a single logical cache is ....
J. Torrellas, C. Xia, and R. Daigle. "Optimizing Instruction Cache Performance for Operating System Intensive Workloads," Proc. First Intl. Symp. on High-Performance Computer Architecture, pp. 360--369, January 1995.
....to rearrange the order by which code modules are laid out in memory in order to make the most efficient use of the available memory and cache address space. Several algorithms have been proposed, most of which work at compile time, and use either control flow analysis [12, 13] and/or profile data [18, 15, 7]. Recently, several systems have been proposed that attempt to reorder code at run time [16, 3]. Besides the issue of when to reorder, there remains the issue of what granule size to use when reordering. Procedure reordering [7, 6, 11], interprocedural basic block reordering [18, 9] and combined ....
....data [18, 15, 7]. Recently, several systems have been proposed that attempt to reorder code at run time [16, 3]. Besides the issue of when to reorder, there remains the issue of what granule size to use when reordering. Procedure reordering [7, 6, 11], interprocedural basic block reordering [18, 9], and combined basic block and procedure reordering [15, 10, 4] are the most popular approaches. In this paper we present an approach that provides a single pass simulation of multiple code reordering algorithms and their accurate cycle based evaluation on a modern out-of-order superscalar ....
J. Torrellas, C. Xia, and R. Daigle. Optimizing the Instruction Cache Performance of an Operating System. IEEE Transactions on Computers, 47(12):1363--1381, December 1998.
.... sequence of instructions is called the code layout problem; algorithms that attack code layout attempt to reduce cache misses or pipeline stalls in the program by changing the order of basic blocks or procedures [Calder and Grunwald 1994; Hwu and Chang 1989; McFarling 1993; Pettis and Hansen 1990; Torrellas et al. 1995; Young et al. 1997]. In the first conference paper on SCBP [Young and Smith 1994], we attempted to improve branch prediction accuracy without increasing instruction count. To do this, we used a simple and inefficient layout algorithm that copied code until an unconditional branch or the end of a ....
Torrellas, J., Xia, C., and Daigle, R. 1995. Optimizing instruction cache performance for operating system intensive workloads. In Proceedings of the First IEEE Symposium on High-Performance Computer Architecture. IEEE, Los Alamitos, California, 360--369.
....techniques define a logical ordering of basic blocks and reorder instructions in those logical groups crossing the basic block boundary to optimize instruction scheduling for VLIW processors. There are also other code reordering algorithms which target an increase in the instruction cache hit rate [7, 11, 16]. In order to exploit more spatial locality, these techniques also increase the code sequentiality, but they target a single basic block fetch per cycle. The full potential of these techniques has not been exploited. The software trace cache [12, 13, 14] is the first code reordering targeting the ....
Josep Torrellas, Chun Xia, and Russell Daigle. Optimizing instruction cache performance for operating system intensive workloads. Proceedings of the 1st Intl. Conference on High Performance Computer Architecture, pages 360--369, January 1995.
....the probability of a conflict, we will attempt to keep temporally local cache lines as spatially local as possible when mapping them in the memory address space. Past work in code reordering includes dynamically remapping addresses during runtime [2] and relinking the program before runtime [3, 16, 18, 21, 22, 28]. Code reordering algorithms for improved memory performance can span several different levels of granularity, from basic blocks, to loops, and to procedures. Research has shown that basic block reordering and procedure reordering can significantly improve a program's execution performance. Pettis ....
....algorithm to the gzip program, and obtained a 95% reduction in the cache miss rate versus an unoptimized compilation of the program. As we have found in our work, gzip has 2 important procedures which, if laid out properly, can remove most of the instruction cache misses. Torrellas, Xia and Daigle [28] (TXD) also described an algorithm for code layout for operating system intensive workloads. Their work takes into consideration the size of the cache and the popularity of code. Their algorithm partitions the operating system code into executed and non executed parts at the basic block level. It ....
J. Torrellas, C. Xia, and R. Daigle. Optimizing instruction cache performance for operating system intensive workloads. In Proceedings of the First International Symposium on High-Performance Computer Architecture, pages 360--369, January 1995.
....miss rate, but their basic block ordering technique also works quite well to reduce control penalties. Since their work appears to be the basis for many existing commercial code placement tools, we have used their technique as a basis for our greedy implementation. More recently, Torrellas et al. [28] have investigated code placement to improve operating system performance. Their paper describes an algorithm that parti.... [Figure: Control Penalties and Execution Times panels comparing the TSP based and Greedy methods; x-axis: benchmark and data set; y-axis: normalized execution time; series: self, cross] ....
J. Torrellas, C. Xia, and R. Daigle. "Optimizing Instruction Cache Performance for Operating System Intensive Workloads," Proc. First Intl. Symp. on High-Performance Computer Architecture, pp. 360--369, January 1995.
....the probability of a conflict, we will attempt to keep temporally local cache lines as spatially local as possible when mapping them in the memory address space. Past work in code reordering includes dynamically remapping addresses during runtime [2] and relinking the program before runtime [3, 16, 18, 21, 22, 28]. Code reordering algorithms for improved memory performance can span several different levels of granularity, from basic blocks, to loops, and to procedures. Research has shown that basic block reordering and procedure reordering can significantly improve a program's execution performance. Pettis ....
....[18]. Mendelson et al. also examined using static information (i.e. locations of program loops) to avoid cache conflicts [19]. Their work was at the instruction level of a program and both replicates and rearranges code in order to provide an improved program layout. Torrellas, Xia and Daigle [28] (TXD) also described an algorithm for code layout for operating system intensive workloads. Their work takes into consideration the size of the cache and the popularity of code. The TXD algorithm is designed for mapping operating system code to increase performance, by keeping commonly used ....
J. Torrellas, C. Xia, and R. Daigle. Optimizing instruction cache performance for operating system intensive workloads. In Proceedings of the First International Symposium on High-Performance Computer Architecture, pages 360--369, January 1995.
.... require complex hardware as in [32] or detailed cache profiling as in [44]. The proposed scheme requires only a profile file that contains basic block execution counts as in McFarling's scheme [31]. Generation of such a profile file with block counts is simple, as demonstrated by past research [31] [42]. Simple block counts also permit using an average if the behavior of the program varies widely depending on the input. To our knowledge, no past study quantifies the performance improvement possible by simply excluding the low usage references from the cache. The scheme requires the cache to have ....
....of various data structures referenced by the program are analyzed and the heavily referenced data structures identified. In practice, it will be difficult to find clear demarcations between HU, MU, or LU categories. However, this is true of the code layout schemes in Torrellas et al. as well [42]. They took a size of 1k for their SelfConffree area which is the equivalent of our heavy use (HU) area. The threshold for setting each demarcation does not need to be fixed, but it could be variable depending on the cache size and program size. It may be noted that if the code data set is ....
[Article contains additional citation context not shown here]
J. Torrellas, C. Xia, and R. Daigle, "Optimizing Instruction Cache Performance for Operating System Intensive Workloads", Proceedings of the High Performance Computer Architecture Symposium, Jan 1995, pp. 360-369.
....protocol stacks. Many parallels to this work can be found in software techniques developed for improving instruction cache performance. Techniques such as basic block reordering [15, 28], function grouping [34, 15, 28], reordering based on control structure [24], and reordering of system code [33] have all been shown to significantly improve instruction cache performance. Like this work, the approaches usually rely on profile information to guide heuristic algorithms in placing instructions to minimize instruction cache conflicts, and maximize cache line utilization and block prefetch. ....
J. Torrellas, C. Xia, and R. Daigle. Optimizing instruction cache performance for operating system intensive workloads. In Proceedings of the First International Symposium on High-Performance Computer Architecture, pages 360--369, January 1995.
....spatial locality for all levels of memory. Many software techniques have been developed for improving instruction cache performance. Techniques such as basic block reordering [14, 22], function grouping [10, 12, 14, 22], reordering based on control structure [20], and reordering of system code [25] have all been shown to significantly improve instruction cache performance. The increasing latency of second level caches means that expensive cache usage patterns, such as ping ponging between code laid out on the same cache line, can have dramatic effects on program performance. Most of the ....
J. Torrellas, C. Xia, and R. Daigle. Optimizing instruction cache performance for operating system intensive workloads. In Proceedings of the First International Symposium on High-Performance Computer Architecture, pages 360--369, January 1995.
....two or more procedures whose addresses map to overlapping sets of cache lines. Several compile time code placement techniques have been developed that use heuristics and profile information to reduce the number of conflict misses in the instruction cache by a reordering of the program code blocks [5,6,7,8,11]. Though these techniques successfully remove a sizeable number of the conflict misses when compared to the default code layout produced during the typical compilation process, it is possible to do even better if we gather improved profile information and consider the specifics of the hardware ....
....to collect temporal interleaving information, his algorithm assumes and optimizes for a worst case interleaving of blocks. Finally, his algorithm is unique in its ability to determine which portions of the text segment should be excluded from the instruction cache. Torrellas, Xia, and Daigle [11] propose a code placement technique for kernel intensive applications. Their algorithm considers the cache address mapping when performing code placement. They define an array of logical caches, equal in size and address alignment to the hardware cache. Code placed within a single logical cache is ....
J. Torrellas, C. Xia, and R. Daigle. "Optimizing Instruction Cache Performance for Operating System Intensive Workloads," Proc. First Intl. Symp. on High-Performance Computer Architecture, pp. 360--369, January 1995.
....strategy as was used in Hwu & Chang (1989). Averaging over seven small benchmarks (15 kB to 140 kB static code size, fewer than 150 million instructions executed), they found a 25% improvement in the miss rate for moderate (5%) miss rates, and no improvement for low (<1%) miss rates. Torrellas, Xia & Daigle (1998) used a procedure similar to that used in (Hwu & Chang 1989) to determine whether reordering the operating system will reduce the instruction cache miss rate; the primary difference is they reorder basic blocks across the entire program, not just within procedures. They did not actually reorder ....
Torrellas, J., Xia, C. & Daigle, R. (1998). Optimizing the instruction cache performance of the operating system, IEEE Transactions on Computers 47(12): 1363--1381.
....on cache organization we concentrate on the layout of a program on the memory space. Bershad et al. suggested remapping cache addresses dynamically to avoid conflict misses in large direct mapped caches [3]. An alternative approach is to perform code repositioning at compile or link time [4, 9, 11, 12, 16, 20]. The idea is to place frequently used sections of a program next to each other in the address space, thereby reducing the chances of cache conflicts while increasing spatial locality within the program. Code reordering algorithms for improved memory performance can span several different levels ....
....cache lines used by each procedure in the mapping. This allows our algorithm to effectively eliminate first generation cache conflicts, even when the popular subgraph size is larger than the instruction cache, by using the color mapping and the unavailable set of colors. Torrellas, Xia and Daigle [20] (TXD) also described an algorithm for code layout for operating system intensive workloads. Their work takes into consideration the size of the cache and the popularity of code. Their algorithm partitions the operating system code into executed and non executed parts at the basic block level. It ....
J. Torrellas, C. Xia, and R. Daigle. Optimizing instruction cache performance for operating system intensive workloads. In Proceedings of the First International Symposium on High-Performance Computer Architecture, pages 360--369, January 1995.
....work where the critical path is narrow, getting more out of an underutilized machine. Accurate statistics can also point out instructions that are likely to be executed in the future. The compiler can do a better job tailoring code to the target machine, allocating registers [25], laying out code [12, 18, 48, 56, 65], and scheduling instructions [11, 21, 33, 39, 44, 52] to closely match the implementation's fetch, issue, and execution units. Combined, these kinds of optimizations give much larger benefits than simply avoiding processor stalls. In a hybrid static dynamic approach, some architectures, such as ....
....to ensure correct program semantics. Transforming a control flow graph into a linear sequence of instructions is called the code layout problem; algorithms that attack code layout attempt to reduce cache misses or pipeline stalls in the program by changing the order of basic blocks or procedures [12, 31, 43, 48, 56, 65]. [Figure 30. CFG resulting from global reconciliation of the corr program. Shaded blocks predict that their terminal conditional branch will jump.] In the first ....
J. Torrellas, C. Xia, and R. Daigle. "Optimizing Instruction Cache Performance for Operating System Intensive Workloads," Proc. First Intl. Symp. on High-Performance Computer Architecture. Los Alamitos, Calif.: IEEE Computer Society, Jan. 1995.
....the remaining portion of X) placing X1 with respect to Y is only constrained by the size of X1, and the same is true for placing X2 with respect to Z. 7. Comparison to previous work There has been a considerable amount of work done on code positioning for improved instruction cache performance [2, 3, 4, 6, 7, 8, 9, 10, 11]. We next discuss some of this work, as it relates to our algorithm. In [9] McFarling proposes a basic block remapping algorithm which captures control flow in the form of a Directed Acyclic Graph. The algorithm partitions the graph, paying special attention to loop nodes, with the goal of ....
....counts to minimize instruction cache conflicts. They build a call graph by traversing edges in decreasing edge weight order using a closest is best placement strategy. They form chains by merging nodes, laying them out next to each other until the entire graph is processed. Torrellas et al. [6] propose an algorithm for repositioning operating system codes. They identify the most frequently executed paths spanning several procedures, and then attempt to lay them out contiguously in the cache. They also try to fit loops in the cache. They manage to avoid a large number of cache conflicts ....
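The chain-forming step described in the excerpt above can be sketched roughly as follows. This is a simplified illustration only: edges are visited in decreasing weight order and the chains containing the two endpoints are concatenated, whereas the "closest is best" strategy in the cited work also decides how to orient the chains being merged. The function name and the example call counts are hypothetical.

```python
def greedy_chains(edges):
    """Merge call-graph nodes into layout chains.

    edges maps (caller, callee) to a profiled call count. Heavier
    edges are processed first, so procedures that call each other
    often end up in the same chain.
    """
    chain_of = {}
    for a, b in edges:
        for node in (a, b):
            chain_of.setdefault(node, [node])

    for (a, b), _weight in sorted(edges.items(), key=lambda kv: -kv[1]):
        chain_a, chain_b = chain_of[a], chain_of[b]
        if chain_a is chain_b:
            continue  # already laid out together
        chain_a.extend(chain_b)
        for node in chain_b:
            chain_of[node] = chain_a

    # Collect each distinct chain once, in first-seen order.
    seen, layout = set(), []
    for chain in chain_of.values():
        if id(chain) not in seen:
            seen.add(id(chain))
            layout.append(chain)
    return layout


# Hypothetical call counts for a small program.
calls = {("main", "parse"): 10, ("parse", "lex"): 50, ("main", "emit"): 5}
print(greedy_chains(calls))
```

Simple concatenation keeps the sketch short; a fuller version would compare the possible orientations of the two chains and pick the one that places the heaviest connecting edge's endpoints closest together.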
[Article contains additional citation context not shown here]
J. Torrellas, C. Xia, and R. Daigle. Optimizing Instruction Cache Performance for Operating System Intensive Workloads. In Proceedings of the Int. Conference on High Performance Computer Architecture, pages 360--369, January 1995.
....in x4.1. 2.2 Procedure Placement Algorithms Many software techniques have been developed for improving instruction cache performance. Techniques such as basic block reordering [9, 13], function grouping [7, 8, 9, 13], reordering based on control structure [12], and reordering of system code [16] have all been shown to significantly improve instruction cache performance. The increasing latency of second level caches means that cache conflicts have a dramatic effect on program performance. Recent work on procedure placement to improve instruction cache performance shows that further ....
J. Torrellas, C. Xia, and R. Daigle. Optimizing instruction cache performance for operating system intensive workloads. In 1st Intl. Symp. on High Performance Computer Architecture, pages 360--369, January 1995.
....basic blocks. In an operating system, temporal relationships change with each application, but each application has few temporal dependencies. Once again, if different applications can be associated with differently laid out kernel modules, each will observe improved performance. Torrellas et al. [Torr95] show that code layout is beneficial for improving instruction cache behavior. Other compiler techniques, such as procedure inlining [Chang91] and global register allocation [Wall86] can all take advantage of a global view of the entire computation. We have listed just a few of the more obvious ....
J. Torrellas, C. Xia, and R. Daigle, "Optimizing Instruction Cache Performance for Operating System Intensive Workloads," Proceedings of the First Symposium on High-Performance Computer Architecture, pp. 360--369 (January 1995).
....system can observe all the accesses issued by the processor because the processor monitored does not have on-chip caches. In general, systems based on performance monitors tend to work with samples because they have a very limited trace storage capacity. The two systems described in this paper [18, 19], however, can generate continuous and complete traces. Finally, there are hardware monitors that simply count events. One example is Monster [14], a logic analyzer carefully programmed to count cache misses, TLB misses, write buffer stalls, and other events. Counters, however, only allow the ....
....the past few years, we have developed two trace generation systems that have these characteristics and are connected to two bus-based shared-memory multiprocessors. The first one is connected to a Silicon Graphics multiprocessor [18], while the second one is connected to an Alliant multiprocessor [19]. In the remainder of this section, we describe these two systems. Example 1: Silicon Graphics. This trace generation system [18] is connected to a Silicon Graphics POWER Station 4D 340 [5]. The machine is a bus-based cache-coherent multiprocessor with four 33 MHz MIPS R3000 processors. Each ....
[Article contains additional citation context not shown here]
J. Torrellas, C. Xia, and R. Daigle. Optimizing Instruction Cache Performance for Operating System Intensive Workloads. In Proceedings of the 1st International Symposium on High-Performance Computer Architecture, pages 360--369, January 1995.
....and to predict multiple branches per cycle [23, 26]. Instruction cache misses have been addressed with software and hardware techniques. Software solutions include code reordering based on procedure placement [8, 7] or basic block mapping, either procedure oriented [18] or using a global scope [11, 24]. Hardware solutions include set associative caches, hardware prefetching, victim caches and other classic techniques. Finally, the number of instructions provided by the fetch unit each cycle can also be improved with software or hardware techniques. Software solutions include trace scheduling ....
....frequently used functions, placing functions which reference each other close in memory. They also reorder the basic blocks in a procedure, moving unused basic blocks to the bottom of the function code, even splitting the procedures in two, and moving away the unused basic blocks. Torrellas et al. [24] designed a basic block reordering algorithm for Operating System code, running on a very conservative vector processor. They map the code in the form of sequences of basic blocks spanning several functions, and keep a section of the cache address space reserved for the most frequently referenced ....
Josep Torrellas, Chun Xia, and Russell Daigle. Optimizing instruction cache performance for operating system intensive workloads. Proceedings of the 1st Intl. Conference on High Performance Computer Architecture, pages 360--369, January 1995.
....and to predict multiple branches per cycle [22, 28]. Instruction cache misses have been addressed with software and hardware techniques. Software solutions include code reordering based on procedure placement [8, 7] or basic block mapping, either procedure oriented [18] or using a global scope [9, 24]. Hardware solutions include set associative caches, hardware prefetching, victim caches and other techniques. Finally, the number of instructions provided by the fetch unit each cycle can also be improved with software or hardware techniques. Software solutions include trace scheduling [5] and ....
....basic blocks to the bottom of the function code, even splitting the procedures in two, and moving away the unused basic blocks. Their algorithm did not consider the target cache information, and their basic block reordering was limited to the basic blocks within a function body. Torrellas et al. [24] designed a basic block reordering algorithm for Operating System code, running on a very conservative vector processor. They map the code in the form of sequences of basic blocks spanning several functions, and keep a section of the cache address space reserved for the most frequently referenced ....
Josep Torrellas, Chun Xia and Russell Daigle, Optimizing Instruction Cache Performance for Operating System Intensive Workloads, Proceedings of the 1st Intl. Conference on High Performance Computer Architecture, pp 360-369, January 1995.
....blocks in consecutive memory positions and, therefore, increase the number of useful instructions fetched per access. Unfortunately, past work on code reordering techniques has largely focused on simply reducing the instruction cache miss rate [9, 11, 14, 6, 5, 8]. This approach made sense in the context of the simple, less aggressive processors for which past work was done. However, in the modern, wide-issue superscalars, ensuring that sequentially executed instructions are mapped in consecutive memory positions can be more crucial than keeping the number ....
....basic blocks to the bottom of the function code, even splitting the procedures in two, and moving away the unused basic blocks. Their algorithm did not consider the target cache information, and their basic block reordering was limited to the basic blocks within a function body. Torrellas et al. [14] designed a basic block reordering algorithm for Operating System code, running on a very conservative vector processor. They map the code in the form of sequences of basic blocks spanning several functions, and keep a section of the cache address space reserved for the most frequently referenced ....
Josep Torrellas, Chun Xia, and Russell Daigle. Optimizing instruction cache performance for operating system intensive workloads. Proceedings of the 1st Intl. Conference on High Performance Computer Architecture, pages 360--369, January 1995.
....and to predict multiple branches per cycle [20, 24]. Instruction cache misses have been addressed with software and hardware techniques. Software solutions include code reordering based on procedure placement [7, 6] or basic block mapping, either procedure oriented [16] or using a global scope [8, 21]. Hardware solutions include set associative caches, hardware prefetching, victim caches and other techniques. Finally, the number of instructions provided by the fetch unit each cycle can also be improved with software or hardware techniques. Software solutions include trace scheduling [4] and ....
....the sequences in the first pass free of code in all logical caches. This way, the first sequences will not be replaced from the cache by any other code, and so will be free of interference. We call this area the Conflict Free Area (CFA); it derives directly from the SelfConfFree area proposed in [21]. The size of this CFA is determined by the Exec and Branch Thresholds used for the first pass of our sequence building algorithm. [Figure labels: Most popular traces; Least popular traces] ....
[Article contains additional citation context not shown here]
J. Torrellas, C. Xia, and R. Daigle. Optimizing instruction cache performance for operating system intensive workloads. Proceedings of the 1st Intl. Conference on High Performance Computer Architecture, pages 360--369, Jan. 1995.
....than loop intensive engineering codes. Since these applications are used frequently in real life, it is important to understand and improve their cache performance. The first step to improve the instruction cache performance of these codes is to optimize the layout of their instructions in memory [12]. The purpose of this step is to expose more locality in the code and, as a result, minimize conflicts in the cache. The approach taken usually involves building the basic block graph of the code and then, based on profile information, carefully placing the basic blocks in memory. For example, ....
....in memory. For example, basic blocks that are usually fetched in sequence are laid out in sequence, while basic blocks that form a loop are placed to avoid any conflicts within the loop. Overall, it has been shown that these schemes work well, both for engineering [3, 7] and for systems codes [12]. They work particularly well for systems codes because the original unoptimized layout has poor performance. After this optimization, the misses that remain tend to be spread out in the code in a uniform manner; there are no obvious hot spot areas of conflict misses. Consequently, no simple ....
[Article contains additional citation context not shown here]
J. Torrellas, C. Xia, and R. Daigle. Optimizing Instruction Cache Performance for Operating System Intensive Workloads. In IEEE Trans. on Computers, to appear. A shorter version appeared in Proceedings of the 1st International Symposium on High-Performance Computer Architecture, pages 360-369, January 1995.
....into hardware based and software based. Hardware based systems rely on a trace gathering hardware device, either attached to the machine or built into the architecture, while software based systems are purely based on software instrumentation of the code to be traced. Hardware based systems [23, 26, 27] can potentially gather very detailed information with practically no perturbation. However, they can be difficult to use. Indeed, often, they collect physical addresses only, are prevented from recording all the references by one or more levels of caches, and have very limited trace storage ....
....Unfortunately, the extra work slowed down the machine by 20 times. Furthermore, the traces were collected in a buffer that filled in less than a second and had to be dumped to disk. While the buffer was being dumped, the references issued by the processors were lost. The other two systems [26, 27] are the ones described in this paper. They can generate continuous traces without perturbing the system much. They are described in the next section. Other trace collecting hardware devices have been used in uniprocessors. One example is ATUM [1], the uniprocessor predecessor of ATUM 2. Other ....
[Article contains additional citation context not shown here]
J. Torrellas, C. Xia, and R. Daigle. Optimizing Instruction Cache Performance for Operating System Intensive Workloads. In Proceedings of the 1st Symposium on High-Performance Computer Architecture, pages 360--369, January 1995.
....For this experiment, we have selected the 12 most active miss hot spots. They account for 29%, 44%, 22%, and 51% of the remaining operating system data misses in the primary caches in TRFD 4, TRFD Make, ARC2D Fsck, and Shell respectively. These hot spots are 5 loops and 7 sequences. A sequence [20] is an ordered set of basic blocks executed with high frequency and in the same order. These miss hot spots do the following: • Four loops loop over the array of page table entries, performing initialization or copies of several entries. One loop traverses a linked list of pages to find a free ....
....system call, perform context switching, and schedule a process. Typical prefetchable data structures are the table of system call functions or the timer data structure. Note that many miss hot spots are not loops. This is a result of the relatively low frequency of loops in the operating system [20]. We expect that other UNIX systems will have quite similar miss hot spots. Once the miss hot spots have been identified, we manually insert the prefetches. We use a prefetch instruction like the one used by Blk Pref in Section 4.2 and supported, for example, by Alpha [18]. For the loops, we ....
J. Torrellas, C. Xia, and R. Daigle. Optimizing Instruction Cache Performance for Operating System Intensive Workloads. In IEEE Trans. on Computers, To appear 1995. A shorter version appeared in Proceedings of the 1st International Symposium on High-Performance Computer Architecture, pages 360-369, January 1995.
....corresponding solutions; Chapter 7 combines all effective optimization schemes together and discusses the cost performance trade-offs among them; and, finally, Chapter 8 concludes this work and discusses issues to be considered in future work. [Footnote 1: Most part of Chapter 3 to 6 can be found in [42, 45, 44]. The electronic form of this thesis and papers is available at http: www.csrd.uiuc.edu iacoma iacomapapers.html.] Chapter 2 Experiment Method and Setup. 2.1 Methodology. This thesis study is conducted using empirical methodology. We carry out a series of experiments using the following ....
....5.5 Related Work. To our knowledge, none of the previous work has focused on prefetching codes with optimized layouts, especially systems codes like OS. This is our main contribution. There is, however, a large body of related work. First, McFarling [30], Hwu and Chang [24], and Torrellas et al. [42] studied code layout optimization for cache performance. However, they did not investigate instruction prefetching on the optimized codes. Instruction prefetching without layout optimization has been a topic frequently addressed in the past. For example, Smith examined the three next line ....
J. Torrellas, C. Xia, and R. Daigle. Optimizing Instruction Cache Performance for Operating System Intensive Workloads. In IEEE Trans. on Computers, to appear. A shorter version appeared in Proceedings of the 1st International Symposium on High-Performance Computer Architecture, pages 360-369, January 1995.
....task, even in the presence of aiding devices like the Hardware Trace Cache (HTC) [4, 12]. On the software side, it is possible to reorder the code in memory so that it is easier to supply useful instructions to the execution unit. Code reordering can target the elimination of cache conflicts [5, 6, 8, 7, 10, 13]. In addition, it can also map sequentially executed basic blocks in consecutive memory positions [7, 10, 13]. Both aspects may increase the number of useful instructions fetched per access for future wide-issue superscalars. In this paper, we focus on the interaction between hardware and software ....
....it is possible to reorder the code in memory so that it is easier to supply useful instructions to the execution unit. Code reordering can target the elimination of cache conflicts [5, 6, 8, 7, 10, 13]. In addition, it can also map sequentially executed basic blocks in consecutive memory positions [7, 10, 13]. Both aspects may increase the number of useful instructions fetched per access for future wide-issue superscalars. In this paper, we focus on the interaction between hardware and software to provide a high instruction bandwidth. We start presenting a fully automated, compile time code reordering ....
[Article contains additional citation context not shown here]
J. Torrellas, C. Xia, and R. Daigle. Optimizing instruction cache performance for operating system intensive workloads. Proceedings of the 1st Intl. Conference on High Performance Computer Architecture, pages 360--369, Jan. 1995.
....buffers have been emptied, processors are restarted via another hardware interrupt. With this approach, we can trace an unbounded continuous stretch of the workload. Furthermore, this is done with negligible perturbation because the processors are stopped in hardware. More details are presented in [15]. 2.2 Software Setup. The multiprocessor operating system used in our experiments is a slightly modified version of Alliant's Concentrix 3.0. Concentrix is symmetric and is based on Unix BSD 4.2. All processors share all operating system data structures. The performance monitor cannot capture ....
....inform the buffer (via escape references in the code segment) when a page fault occurs. These escapes will be encoded to tell the trace buffer what virtual to physical page mapping has occurred. Hence, when analyzing the address trace, we can reconstruct the virtual addresses of the application [15]. Using this approach, we first insert escape sequences at the entry and exit of each routine. With this information, we gather statistics such as the most frequently executed routines, and the common paths through the operating system and application code. This minimal amount of instrumentation ....
[Article contains additional citation context not shown here]
J. Torrellas, R. Daigle, and C. Xia. Optimizing Instruction Cache Performance for Operating System Workloads. Technical Report 1387, Center for Supercomputing Research and Development, November 1994.
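Note on the excerpt above that describes escape references: the idea of recording each virtual-to-physical page mapping in the trace, then using those records to recover virtual addresses from a physical-address trace, can be sketched as below. This is an illustration only. The record format (`"map"` and `"ref"` tuples), the 4 KB page size, and the function name `reconstruct_virtual` are assumptions made for the example and are not taken from the cited papers.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


def reconstruct_virtual(trace):
    """Recover virtual addresses from a physical-address trace.

    trace: iterable of records in a hypothetical format:
      ("map", virtual_page, physical_page)  -- escape record for a new mapping
      ("ref", physical_address)             -- ordinary memory reference
    Returns one entry per "ref" record: the virtual address, or None when no
    mapping for that physical page has been seen yet.
    """
    phys_to_virt = {}
    virtual_addresses = []
    for record in trace:
        if record[0] == "map":
            _, vpage, ppage = record
            phys_to_virt[ppage] = vpage  # a later mapping overrides an older one
        else:
            _, paddr = record
            ppage, offset = divmod(paddr, PAGE_SIZE)
            vpage = phys_to_virt.get(ppage)
            virtual_addresses.append(
                None if vpage is None else vpage * PAGE_SIZE + offset
            )
    return virtual_addresses


trace = [
    ("map", 0x10, 0x2),
    ("ref", 0x2004),
    ("ref", 0x5000),   # physical page 0x5 not mapped yet
    ("map", 0x11, 0x5),
    ("ref", 0x5008),
]
recovered = reconstruct_virtual(trace)
```

References to a physical page that appear before any escape record for that page cannot be translated, which is why the sketch returns `None` for them rather than guessing.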
No context found.
J. Torrellas, C. Xia, and R. Daigle, "Optimizing Instruction Cache Performance for Operating System Intensive Workloads," Proceedings of the First International Symposium on High-Performance Computer Architecture, pp. 360--369, January 1995.
No context found.
J. Torrellas et al. Optimizing instruction cache performance for operating system intensive workloads. IEEE HPCA, pages 360--369, 1995.
No context found.
J. Torrellas et al. Optimizing instruction cache performance for operating system intensive workloads. IEEE HPCA, pages 360--369, 1995.
No context found.
J. Torrellas, C. Xia, and R. Daigle. Optimizing instruction cache performance for operating system intensive workloads. In Proceedings of the First International Symposium on High-Performance Computer Architecture, pages 360--369, January 1995.
No context found.
J. Torrellas, C. Xia, R. Daigle, "Optimizing Instruction Cache Performance for Operating System Intensive Workloads," in Proceedings of the 1st International Symposium on High-Performance Computer Architecture, pp. 360-369, January 1995.
No context found.
J. Torrellas, C. Xia, and R. Daigle. Optimizing instruction cache performance for operating system intensive workloads. In Proceedings of the First International Symposium on High-Performance Computer Architecture, pages 360--369, January 1995.
No context found.
J. Torrellas, C. Xia, and R. L. Daigle. Optimizing the instruction cache performance of the operating system. IEEE Transactions on Computers, 47(12):1363--1381, 1998.
No context found.
Josep Torrellas, Chun Xia, and Russell Daigle, "Optimizing instruction cache performance for operating system intensive workloads," Proceedings of the 1st Intl. Conference on High Performance Computer Architecture, pp. 360-369, Jan. 1995.
No context found.
J. Torrellas, C. Xia, and R. Daigle. Optimizing instruction cache performance for operating system intensive workloads. In Proceedings of the First International Symposium on High-Performance Computer Architecture, pages 360--369, Jan. 1995.
No context found.
Torrellas, J., Xia, C. and Daigle, R. Optimizing instruction cache performance for operating system intensive workloads. In Proceedings of the 1st International Symposium on High-Performance Computer Architecture, Raleigh, North Carolina, to appear, 1995.
No context found.
J. Torrellas, C. Xia, and R. Daigle, "Optimizing Instruction Cache Performance for Operating System Intensive Workloads", Proceedings of the High Performance Computer Architecture Symposium, Jan 1995, pp. 360-369.