S. T. Srinivasan and A. R. Lebeck. Load latency tolerance in dynamically scheduled processors. Journal of Instruction Level Parallelism, 1:1--24, 1999.
....potential of aggressive hardware prefetching, bringing the performance gap versus a perfect L2 cache under 20%. 1 Introduction Modern out-of-order processors can tolerate latencies for multicycle level one cache hits, and many of the level one cache misses that result in level two hits [42]. However, the hundreds of cycles that result from DRAM accesses cannot be tolerated, thus causing significant performance degradations. For the SPEC2000 benchmarks running on a modern, high performance microprocessor, over half of the time is spent stalling for loads that miss in the level two ....
S. T. Srinivasan and A. R. Lebeck. Load latency tolerance in dynamically scheduled processors. Journal of Instruction Level Parallelism, 1:1--24, 1999.
....of misses and miss latency can be observed with a contemporary processor model. On an out-of-order execution processor, some cache misses are more critical than others. In particular, some architectures can fully tolerate the cache miss if there is sufficient work to hide the memory latency [Srinivasan 98]. Cache miss rate alone fails to distinguish critical misses. 2.2 Simulators The strategies in this dissertation were primarily evaluated with two forms of simulators, trace-driven and microarchitecture simulators. I used trace-driven simulations to primarily measure cache miss rates. ....
S. T. Srinivasan and A. R. Lebeck. Load Latency Tolerance in Dynamically Scheduled Processors. In Proceedings of the 31st Annual International Symposium on Microarchitecture, pages 148--159. December 1998.
....of the controller is such that this relationship will always hold, feedback to tighten this bound would result in improved energy savings. We are exploring mechanisms to measure the actual delay costs as they relate to the instruction commit rate. Incorporating information from critical loads [8, 14] may help. ....
S. T. Srinivasan and A. R. Lebeck. Load latency tolerance in dynamically scheduled processors. In Journal of Instruction-Level Parallelism, October 1999.
....have been a primary determinant of application performance. As the processor memory performance gap has widened, increasingly aggressive techniques have been proposed for understanding and improving cache memory performance. These techniques have included prefetching, victim caches, and more [3, 7, 8, 12, 15, 16, 17, 18]. More recently, the static and dynamic power dissipation of cache structures has also been a vexing problem, and has led to another set of techniques for improving cache behavior from the power perspective [6, 9, 13, 19, 23]. For most of these cache power and performance optimizations, a key ....
S. T. Srinivasan and A. R. Lebeck. Load latency tolerance in dynamically scheduled processors. In International Symposium on Microarchitecture, pages 148--159, 1998.
....stalls. The processor stall depends on data dependency in the target program, so that some cache misses which affect the data dependency are more critical than others. In addition, if there are enough instructions which can be issued, the cache miss might not affect ILP. Actually, Srinivasan et al. [55] showed that not all data accesses need to occur immediately if there are enough ready instructions to execute. Non Critical Buffer proposed by Fisk et al. [12] is a small associative buffer (for example 16 entry) which works in parallel with level 1 data cache (main cache). The Non Critical ....
S. T. Srinivasan and A. R. Lebeck, "Load Latency Tolerance In Dynamically Scheduled Processors," Proc. of the 31st International Symposium on Microarchitecture, Nov./Dec. 1998.
....Equation 1 by thus gives: 2) Unfortunately, even this simplified equation is not directly useful as an on line estimate. Determining the effect of cache misses on total runtime ( is extremely difficult, given the widely varying latency tolerance of loads both within and across benchmarks [14]. Measuring the number of additional writebacks caused by resizing (W) is also challenging; we must distinguish writebacks that would not have occurred in the full size cache, i.e. a dirty block that is written back, reloaded, and modified again in an interval where it would have remained in the ....
Srikanth T. Srinivasan and Alvin R. Lebeck. Load latency tolerance in dynamically scheduled processors. In 30th Annual International Symposium on Microarchitecture, pages 148--159, November 1998.
....(e.g. branch predictors, out-of-order issue queues) that are purely used for performance purposes. We enumerate some of the many applications for such on-line data dependence information. These include dynamic scheduling, selective value prediction [6], criticality measures and their application [11, 29, 30], and decoupled architectures [3, 33] to name a few. We then investigate in depth how dynamic data dependence information can be exploited to provide another dimension for branch prediction. Our approach, called ARVI, bases its prediction on partial register values along the data dependence chain ....
....from the Reorder Buffer. Dependence chain information can potentially provide a more accurate parallelism estimate to guide these and other parallelism based optimizations. Improving the accuracy of criticality measures: Load criticality was originally investigated by Srinivasan and Lebeck [29, 30] in order to improve load performance. Other researchers, including Bodik [11], have proposed techniques for identifying critical instructions. Cycle by cycle dependence chain information can potentially improve the accuracy of critical instruction detection. For instance, Bodik's random sampling ....
S. Srinivasan and A. Lebeck. Load Latency Tolerance in Dynamically Scheduled Processors. 31st International Symposium on Microarchitecture, pages 148--159, November 1998.
....back to tighten this bound would improve the energy savings. We are exploring how to measure the actual delay costs as they relate to the instruction commit rate. Information from critical loads [10, 19] may help. Due to the mismatch between reward and penalty in a serial tag data access cache, the adaptive accounting cache design is most appropriate when a cache offers the parallel tag data access option. As an extended policy, a metacontroller could activate the full LRU state and the ....
S. T. Srinivasan and A. R. Lebeck. Load latency tolerance in dynamically scheduled processors. In Journal of Instruction-Level Parallelism, October 1999.
....classify instructions for use in some optimization. These papers differ in what instructions are targeted (loads vs. all instructions), how critical instructions are identified or predicted, and how the criticality information is applied. Srinivasan et al. study the latency tolerance of loads in [16]. In their work, latency tolerance refers to the longest latency that a load instruction could have before impacting performance. They find, for many loads, that the latency tolerance of a load does not match the level of the memory hierarchy where its data resides. Building on this ....
S. T. Srinivasan and A. R. Lebeck. Load latency tolerance in dynamically scheduled processors. Journal of Instruction Level Parallelism, (1):1--24, 1999.
....their instructions are fetched, reordered, executed, and committed in parallel. We argue that the level of their parallelism and sophistication has grown enough to justify the use of critical path analysis of their microarchitectural execution. This view is shared by Srinivasan and Lebeck [18], who computed an indirect measure of the critical path, called latency tolerance, that provided non-trivial insights into the parallelism in the memory system, such as that up to 37% of L1 cache hits have enough latency tolerance to be satisfied by a lower level cache. The goal of this paper is ....
....Nonetheless, the potential for using the critical path to improve speculation techniques via misspeculation reduction is illustrated by 5 times more effective value prediction for perl and 7 20 more effectiveness for the rest of the benchmarks. 5 Related Work Srinivasan and Lebeck [18] defined an alternative measure of the critical path, called latency tolerance, that provided non-trivial insights into the performance characteristics of the memory system. Their methodology illustrated how difficult it is to measure criticality even in a simulator wherein a complete execution ....
S. T. Srinivasan and A. R. Lebeck. Load latency tolerance in dynamically scheduled processors. In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-98), pages 148--159, Los Alamitos, November 30--December 2 1998.
....the data shows that no L1 data cache is necessary at all. These results are somewhat surprising, but there are a couple of reasons for them. First, many loads are latency tolerant and as long as they are satisfied within the second-level cache hit time, performance will not be degraded much [Srin98]. Second, other nonideal effects in the processor attenuate the effects of cache latency. For example, imperfect branch prediction effectively shrinks the size of the instruction window because of frequent processor restarts. This reduces the pressure on the cache to provide results immedi ....
Srikanth T. Srinivasan and Alvin R. Lebeck. Load Latency Tolerance in Dynamically Scheduled Processors. Proc. 31st Intl. Symp. Microarchitecture, pp. 148-159, Nov. 30-Dec 2, 1998.
....increase and the performance impact of the compensation code, Nakra et al. describe a two execution unit system in which the second unit generates and executes the compensation code on-the-fly and concurrently with the regular program execution on the first execution engine. Srinivasan and Lebeck [SrLe98] show that in some programs over sixty percent of the executed load instructions produce a value that is already needed in the next cycle. They further found that up to thirty six percent of the loads miss in the L1 data cache but have a latency demand that is lower than the L2 cache's access ....
S. T. Srinivasan, A. R. Lebeck. "Load Latency Tolerance in Dynamically Scheduled Processors". Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, 148-159. 1998.
....[1] This IPC decay comes from the dispersion of instruction latencies. In particular, load latencies in CPU cycles tend to increase across technology generations, while ALU operation latency remains one cycle. A solution for overcoming the IPC decay is to enlarge the processor instruction window [6, 15], both physically (issue buffer, physical registers) and logically, through better branch prediction accuracy or branches removed by predication. However, the instruction window should be enlarged without impairing the clock cycle. In particular, the issue buffer and issue logic are among the ....
S.T. Srinivasan and A.R. Lebeck. Load latency tolerance in dynamically scheduled processors. In Proceedings of the 31st Annual International Symposium on Microarchitecture, 1998.
....frequency of the processor increases, in order to keep up, the cache will become smaller and thereby suffer more misses for a given memory organization scheme. The fact that all cache hits are not vital to the performance of a high performance superscalar processor is not as well known [7][18]. Loads that are vital cause performance degradation if the latencies of their DL1 accesses are increased. Conversely, loads that are not vital result in negligible performance degradation if the latencies of their DL1 accesses are increased. We will introduce a classification of all dynamic ....
....high bandwidth and high speed by partitioning the DL1 into subcaches based on temporality use. Similarly to Cho, the design's complexity includes identification of local accesses. [1] commented on the predictability of load latencies. [13] showed some effects of memory latencies. But it was [18] to first release the latency tolerance that can be exhibited by a microprocessor. The work showed that loads that lead to mispredicted branches or to a slowing down of the machine are loads that are critical. This work is built on the same concept as [18]. This work further identifies many more ....
[Article contains additional citation context not shown here]
Srinivasan and A. Lebeck, "Load latency tolerance in dynamically scheduled processors.", in Proceedings of the Thirty-First International Symposium on Microarchitecture, pp. 148--159, 1998.
....is important because the indirect branch is attached to the data dependence graph of those load instructions. Therefore, if the load misses in the D cache, accurately predicting the branch may allow some of the memory latency to be overlapped by fetching useful instructions from the target address [124]. On the other hand, if the resolution of the indirect branch is delayed because of a D cache miss (of the dependent load) then the misprediction penalty for that branch may be higher than normal. The goal of this section is to reveal the impact of ib loads on the D cache, and to study the ....
....to access the pointer. In general, the penalty differs among different indirect branches, as well as between dynamic instances of the same branch. Thus, indirect branches can be categorized based on their misprediction latency. The same argument has been manifested for conditional branches [124]. In this section we study a method that attempts to improve the predictability of indirect branches by classifying them at run time based on their misprediction latency. If we know whether an ibload ibload sequence will miss hit in the D cache at the indirect branch is predicted, then we can use ....
S.T. Srinivasan and A.R. Lebeck. Load Latency Tolerance in Dynamically Scheduled Processors. In Proceedings of the International Symposium on Microarchitecture, pages 148--159, December 1998.
....executions, speculative executions, and non blocking load store. A superscalar processor may issue multiple memory requests simultaneously. Although the processor can keep running before the outstanding memory requests are finished, its ability to tolerate long memory latency is still limited [22]. All concurrent memory accesses can be classified into the following three categories: 1. Accesses to the same page in the same bank. These accesses fully exploit the spatial locality and can be well pipelined. Precharge and row access are needed to initiate the first access. Subsequent ....
S. T. Srinivasan and A. R. Lebeck. Load latency tolerance in dynamically scheduled processors. In Proceedings of the 31st International Symposium on Microarchitecture, 1998.
....locality by retaining recently accessed cache blocks. Most cache management schemes try to exploit locality to increase the fraction of memory accesses satisfied by the cache (i.e. cache hit ratio). Although increasing the overall number of cache hits is usually desirable, recent research [19] shows that not all memory accesses are equal. In dynamically scheduled processors, the latency of some memory load operations can have a much larger influence on overall performance than other loads. Therefore, it may be possible to improve overall performance by decreasing the latency of these ....
....is a strong enough program property to warrant a change in memory hierarchy management techniques for practical implementations. Specifically, we investigate if practical criticality based approaches can equal or surpass the performance of existing locality based techniques. The previous analysis [19] relied on a sophisticated simulator with rollback to determine load criticality. A practical implementation requires hardware support for on line computation of load criticality. This must be augmented by realistic load latency reduction techniques that can exploit this information. In this ....
[Article contains additional citation context not shown here]
S. T. Srinivasan and A. R. Lebeck. Load Latency Tolerance in Dynamically Scheduled Processors. In Proceedings of the 31st Annual International Symposium on Microarchitecture, pages 148--159, December 1998.
No context found.
S. T. Srinivasan and A. R. Lebeck. Load latency tolerance in dynamically scheduled processors. Journal of Instruction Level Parallelism, 1:1--24, 1999.
No context found.
S. T. Srinivasan and A. R. Lebeck, "Load Latency Tolerance in Dynamically Scheduled Processors", in 31st International Symposium on Microarchitecture, December 1998.
No context found.
S. T. Srinivasan and A. R. Lebeck. Load latency tolerance in dynamically scheduled processors. Journal of Instruction Level Parallelism, (1):1--24, 1999.
No context found.
S. T. Srinivasan and A. R. Lebeck. Load latency tolerance in dynamically scheduled processors. Journal of Instruction Level Parallelism, 1:1--24, 1999.
No context found.
Srikanth Srinivasan and Alvin Lebeck. Load latency tolerance in dynamically scheduled processors. In 31st International Symposium on Microarchitecture, pages 148--159, Nov 1998.
No context found.
S. T. Srinivasan and A. R. Lebeck. Load latency tolerance in dynamically scheduled processors. In 31st International Symposium on Microarchitecture, Nov 1998.
No context found.
S. T. Srinivasan and A. R. Lebeck. Load latency tolerance in dynamically scheduled processors. In Proceedings of the 31st Annual International Symposium on Microarchitecture, Nov--Dec 1998.
No context found.
[SRI98] S. Srinivasan and A. Lebeck. "Load Latency Tolerance in Dynamically Scheduled Processors", International Symposium on Microarchitecture (MICRO), November 1998.