| Y. Jegou and O. Temam, "Speculative prefetching," in Proceedings of the International Conference on Supercomputing, 1992, pp. 1--11. |
....instructions. One of the simplest hardware-based prefetching schemes is sequential prefetching [Smi82]. Whenever a cache line l is accessed, the cache line l+1 and possibly some subsequent cache lines are prefetched. More sophisticated prefetch schemes have been invented by researchers [Jou90, JT93, CB95], but most microprocessors still implement only stride-one stream detection or even no prefetching. In general, prefetching will only be successful when the data stream is predicted correctly (in hardware or by a compiler) and if there is enough space left in the cache to keep the prefetched ....
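The sequential scheme described in this context can be illustrated with a minimal sketch. The function name, the default line size of 32 bytes, and the `degree` parameter are assumptions made for illustration; they do not come from the cited papers.

```python
def sequential_prefetch(address, line_size=32, degree=1):
    """Return the cache-line numbers to prefetch after an access to `address`.

    Hypothetical helper: an access to line l triggers prefetches of
    lines l+1 .. l+degree (degree=1 is plain one-line lookahead).
    """
    line = address // line_size  # cache line containing the accessed byte
    return [line + k for k in range(1, degree + 1)]


# Byte address 100 lies in line 3 when lines are 32 bytes long.
print(sequential_prefetch(100, line_size=32, degree=2))
```

A scheme like this only helps stride-one streams, which is why the contexts below go on to discuss stride-directed mechanisms.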
Y. Jegou and O. Temam. Speculative Prefetching. In Proceedings of the International Conference on Supercomputing, pages 57--66, Tokyo, Japan, July 1993.
....also becomes smaller. One technique to address the speed gap between processors and memory that has been advocated by many researchers is hardware data prefetching. Many hardware techniques have been proposed that target a specific access pattern: sequential prefetching [1, 2], stride prefetchers [3, 4, 5, 6], and prefetchers targeting pointer chasing, namely Mehrotra's technique [7] and dependence-based prefetching (DBP) [8]. There are also hardware techniques targeted at multiple reference patterns, e.g. Markov prefetching [9]. One limitation of previous work is that the workloads that have been ....
....prefetching opportunities. Finally, for six of the applications, prefetches were not launched early enough. Having isolated the reasons for the limited performance improvements, we sought remedies for some of the problems. We investigated the use of victim caches and prefetch buffers [4] as a remedy for cache pollution and found that a large 4KB prefetch buffer significantly improved the efficiency of the prefetch techniques. 16KB prefetch buffers improved the efficiency even more when executing some of the applications. When simulating a system with a limited bandwidth ....
[Article contains additional citation context not shown here]
Y. Jegou and O. Temam. Speculative Prefetching. In Proceedings of the International Conference on Supercomputing (ICS-93), pages 1--11, December 1992.
....occurs, a stream buffer is assigned, and the stream buffer prefetches successive cache lines starting at the miss address. Both of these methods cannot deal with strides longer than a cache line. More sophisticated hardware prefetching schemes use stride directed prefetching and are described in [1, 8, 11, 16]. These can deal with arbitrary constant strides and are discussed in more detail below. Finally, it is possible to attack the problem through a combination of hardware and software. A general prefetch engine using software instructions to supply the stride and start prefetching for a stream is ....
....service time, especially when using a small, fast L1 cache. Both of these goals can be achieved by prefetching at the L2 cache. The goal of this paper is to find an L2 cache architecture to effectively handle this case. Of the various hardware data prefetching architectures proposed in the literature [18, 1, 8, 11, 12, 16], only the last two concentrate on L2 prefetching. The rest focus on primary, on-chip (L1) prefetching. [12] proposes one L2 prefetching approach using stream buffers. [16] improves on it and compares it with a non-prefetching L2 cache. Our research was motivated in part by the need for more ....
[Article contains additional citation context not shown here]
Y. Jegou and O. Temam. Speculative prefetching. In Supercomputing, 1993.
....Most prefetching approaches show limited adaptivity [38]. 1.4.1 Hardware Based Prefetching. Hardware prefetching schemes are used in scalar and vector machines. A special hardware mechanism is used to decide how to prefetch data based on history information and simple lookahead information [23, 5, 16, 69, 32, 45, 60, 78, 50]. It is transparent to the programmer. But sometimes, it fails to adapt itself to application programs and results in low efficiency and high cost. 1.4.2 Software Based Prefetching. If a machine, such as the DEC Alpha [24], provides an instruction suitable to be used as a prefetching instruction, data ....
Y. Jegou and O. Temam. Speculative prefetching. In International Conference on Supercomputing, 1993.
....predictability of memory references. The mechanism proposed in this paper is described in section 4 and evaluated in section 5. Finally, the main conclusions of this work are summarized in section 6. 2 Related work. Data prefetching has been the focus of a plethora of works [1][4][5][6][7][11][14][17][22], among others. Data prefetching tries to hide the latency of memory instructions by bringing data to the highest levels of the memory hierarchy before it is required by the processor. Compiler-directed techniques [1][5][6][14][22] are based on the fact that some memory references are ....
....memory patterns predictable at compile time and some side effects, such as increased register pressure, can appear as a result of these optimizations. Some hardware-based techniques keep track of data access patterns (last effective address and stride) in a Reference Prediction Table [4][11][17]. Using this information, the effective address of load/store instructions can be predicted before they are executed and the corresponding prefetch request can be issued if the referenced data is not in the cache. In [7], three variations of this design (basic, lookahead and correlated) are ....
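To make the Reference Prediction Table idea concrete, here is a minimal sketch under stated assumptions. It keeps only the two fields named in the context (last effective address and stride), indexed by instruction address. The class name, the unbounded dictionary, and the "same stride seen twice" confirmation rule are illustrative choices, not the design of any cited paper; real proposals use a small fixed-size table and a richer per-entry state, and they also check the cache before issuing the prefetch.

```python
class ReferencePredictionTable:
    """Toy stride predictor indexed by load/store instruction address (pc)."""

    def __init__(self):
        # pc -> (last effective address, last observed stride)
        self.entries = {}

    def access(self, pc, addr):
        """Record an access and return a predicted next address, or None."""
        prev = self.entries.get(pc)
        if prev is None:
            # First time this instruction is seen: no stride known yet.
            self.entries[pc] = (addr, 0)
            return None
        last_addr, old_stride = prev
        stride = addr - last_addr
        self.entries[pc] = (addr, stride)
        # Assumption: predict only once the same non-zero stride repeats.
        if stride != 0 and stride == old_stride:
            return addr + stride
        return None


rpt = ReferencePredictionTable()
for a in (1000, 1008, 1016):
    prediction = rpt.access(pc=0x40, addr=a)
print(prediction)
```

In this sketch the third access from the same instruction confirms the stride of 8, so the next address (1024) becomes a prefetch candidate.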
Y. Jegou and O. Temam "Speculative Prefetching" Proc. of the 1993 Int. Conf. on Supercomputing, pp. 57-66, 1993.
....buffer is assigned on a cache miss and the stream buffer prefetches successive cache lines starting at the miss address. Both of these methods cannot handle strides longer than a cache line. More sophisticated hardware prefetching schemes use stride directed prefetching and are described in [1, 8, 11, 16]. These can handle arbitrary constant strides and are discussed in more detail later. Finally, it is possible to attack the problem through a combination of hardware and software. A general prefetch engine using software instructions to supply the stride and start prefetching for a stream is ....
....In stride-directed prefetching, a memory reference stream, i.e. a sequence of data addresses generated by a given memory access instruction, is detected by hardware and its stride calculated. The calculated stride is used to predict and prefetch future memory accesses. In most previous studies [1, 8, 11], a memory reference stream is detected by using instruction addresses. This approach, denoted instruction-address-based prefetching, is quite successful in L1 prefetching. However, it has not been shown whether instruction-address-based or stride-directed prefetching is effective at L2 cache ....
[Article contains additional citation context not shown here]
Y. Jegou and O. Temam. Speculative prefetching. In Supercomputing, 1993.
....spatial cache designed to exploit mostly spatial locality and some temporal locality and a temporal cache designed exclusively for temporal locality. In this scheme the type of access is determined dynamically using a locality prediction table which is similar in nature to the schemes proposed in [8, 9, 10]. Furthermore, the temporal and spatial caches are not exclusive in that a data element could be present in both. The main difference between this work and the one reported here is the dynamic categorization of data accesses as temporal or spatial and the nonexclusivity of the two caches. The ....
Y. Jegou and O. Temam. Speculative prefetching. In Proc. Int. Conf. on Supercomputing, pages 57--66, 1993.
....line. Jouppi [2] addresses the former problem. He presents a scheme based on multiple prefetch buffers that can prefetch data with a lookahead greater than one. The latter problem is addressed in several similar schemes proposed by Baer and Chen [3], Fu, Patel and Janssens [4], and Jegou and Temam [5]. These schemes use tables that keep track of the access history of a load instruction in an attempt to predict the reference's stride. Dahlgren, Dubois and Stenstrom [6] present a prefetching method that varies the size of a block that is prefetched on a miss, depending on what percentage of ....
Y. Jegou and O. Temam, "Speculative prefetching," in International Conference on Supercomputing, 1993.
....high performance processors. Hardware, software, and hybrid hardware software schemes have all been explored for prefetching instructions and data references, both in the context of uniprocessors and multiprocessors [CB94a, BL94, CR94, Chi94a, Chi94b, DS95, Gor95, PK94, SMH94, YGHH94, DDS93, EV93, JT93, TE93, CB92, FPJ92, SH92, McF92, MLG92, Sel92, Skl92, VS92, CMH 92, BC91, CKP91, FP91, KL91, MG91, CMCH91, GGV90, Jou90, SR88, LYL87, Smi78]. Unfortunately, most published data prefetching schemes are generally limited to generating prefetches for arithmetic progressions of memory addresses ....
Yvon Jegou and Olivier Temam. Speculative Prefetching. In Proceedings of the 1993 ACM International Conference on Supercomputing, pages 57--66, July 1993.
....of long latency memory operations. Prefetching has been researched heavily, and for a long time. Hardware, software, and hybrid hardware software schemes have all been extensively explored for prefetching instructions and data references, both in the context of uniprocessors and multiprocessors [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37]. 2.2 Instruction prefetching and branch prediction The instruction prefetching and branch prediction problems are closely related. Efficient solutions for both are critical, because they affect the degree to which speculative execution is effective in current machines. In their research, ....
....to consider hardware, software, and hybrid hardware software techniques as three separate categories. The work by Tse and Smith [7], Chen and Baer [8, 22], Dahlgren, Dubois, and Stenstrom [9], Charney and Reeves [12], Palacharla and Kessler [16], Eickemeyer and Vassiliadis [20], Temam and Jegou [21], Fu [24], Varma and Sinha [29], Jouppi [33], and Lee, Yew, and Lawrie [36] represents a broad cross section of research in hardware data prefetching. Most of this work can be divided broadly into two categories: schemes that react to patterns of cache misses, and schemes that detect linear ....
[Article contains additional citation context not shown here]
Y. Jegou and O. Temam, "Speculative Prefetching," in Proceedings of the 1993 ACM International Conference on Supercomputing, pp. 57--66, July 1993.
....of long latency memory operations. Prefetching has been researched heavily, and for a long time. Hardware, software, and hybrid hardware software schemes have all been extensively explored for prefetching instructions and data references, both in the context of uniprocessors and multiprocessors [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37]. 2.2 Instruction prefetching and branch prediction The instruction prefetching and branch prediction problems are closely related. Efficient solutions for both are critical, because they affect the degree to which speculative execution is effective in current machines. In their research, ....
....to consider hardware, software, and hybrid hardware software techniques as three separate categories. The work by Tse and Smith [7], Chen and Baer [8, 22], Dahlgren, Dubois, and Stenstrom [9], Charney and Reeves [12], Palacharla and Kessler [16], Eickemeyer and Vassiliadis [20], Temam and Jegou [21], Fu [24], Varma and Sinha [29], Jouppi [33], and Lee, Yew, and Lawrie [36] represents a broad cross section of research in hardware data prefetching. Most of this work can be divided broadly into two categories: schemes that react to patterns of cache misses, and schemes that detect linear ....
[Article contains additional citation context not shown here]
Y. Jegou and O. Temam, "Speculative Prefetching," in Proceedings of the 1993 ACM International Conference on Supercomputing, pp. 57--66, July 1993.
....have a small but fast first-level cache on a processor chip and a larger second-level cache outside of the processor chip. Data prefetching can be used at both levels of a memory hierarchy to hide memory access latency for accessing secondary caches or main memory. Most data prefetching studies [32, 33, 34, 35, 36] focus on first-level, on-chip prefetching. Although some prefetching schemes can work without any problems at the second level and off chip, others such as stride-directed prefetching [32, 33, 34] may not work effectively. Stride-directed prefetching is necessary to prefetch data for long stride ....
....latency for accessing secondary caches or main memory. Most data prefetching studies [32, 33, 34, 35, 36] focus on first-level, on-chip prefetching. Although some prefetching schemes can work without any problems at the second level and off chip, others such as stride-directed prefetching [32, 33, 34] may not work effectively. Stride-directed prefetching is necessary to prefetch data for long-stride memory accesses and is effective at the first level and on chip. However, at the second level of a memory hierarchy and off chip, instruction addresses are not easily available and only first level ....
[Article contains additional citation context not shown here]
Y. Jegou and O. Temam, "Speculative prefetching," in Supercomputing, pp. 57--66, 1993.
....It is also an extensively researched subject. Hardware, software, and hybrid hardware software schemes have all been extensively explored for prefetching instructions and data references, both in the context of uniprocessors and multiprocessors [CB95, CR94, Chi94a, Chi94b, PK94, YGHH94, EV93, JT93, FPJ92, SH92, McF92, MLG92, Sel92, Skl92, VS92, CKP91, KL91, CMCH91, Jou90, SR88, Smi78]. Unfortunately, most published data prefetching schemes are generally limited to generating prefetches for arithmetic progressions of memory addresses in loop nests, typically those found in dense scientific ....
....not worth analyzing it further. Our assumptions for the IRB are described in Section 4. Our results show that the IRB holds considerable promise for use in future CPU implementations. In related work, several researchers have proposed stride-directed hardware data prefetch mechanisms [CB95, EV93, JT93, FPJ92, Skl92]. Such schemes attempt to use runtime information gathered from a program's execution to predict its future memory access requirements. Other schemes react to sequences of misses experienced by the cache in determining when and what to prefetch [PK94, CR94, VS92, Jou90, SR88, ....
[Article contains additional citation context not shown here]
Yvon Jegou and Olivier Temam. Speculative Prefetching. In Proceedings of the 1993 ACM International Conference on Supercomputing, pages 57--66, July 1993.
....a promising technique for tolerating the cache miss latency in high performance processors. Hardware, software, and hybrid hardware software schemes have all been extensively explored, both in the context of uniprocessors and multiprocessors [CB95, DS95, Gor95, CR94, Mow94, PK94, YGHH94, EV93, JT93, FPJ92, SH92, Sel92, CKP91, KL91, GGV90, Jou90, Smi78]. For some programs, particularly scientific codes operating on dense arrays or matrices, data reference pattern prediction is easy. Consequently, several hardware prefetch mechanisms have been proposed for such codes, and effective compiler ....
Yvon Jegou and Olivier Temam. Speculative Prefetching. In Proceedings of the 1993 ACM International Conference on Supercomputing, pages 57--66, July 1993.
....the cache memory and, in case of a miss, a request to the external memory may be issued. Three of the most significant contributions in the area of hardware prefetching are the Preloading Scheme proposed by J.-L. Baer and T.-F. Chen [1], the Speculative Prefetching proposed by Y. Jegou and O. Temam [7], and the Stride Directed Prefetching developed by J.W. Fu, J.H. Patel and B.L. Janssens [5]. The Preloading Scheme is based on having two program counters. The first points to the instruction currently in execution while the second points some instructions ahead. When the look-ahead PC finds a ....
....a vector element, it also depends on the stride and the size of the vector. These attributes are estimated by means of the locality prediction table. The locality prediction table is based on the history table that was proposed by Baer and Chen [1], Fu, Patel and Janssens [5], and Jegou and Temam [7] as part of a hardware mechanism for prefetching vector data. The locality prediction table has a small number of entries. Each entry contains information about a recently executed load/store instruction. This information consists of the following fields: a) Instruction address: The address of the ....
[Article contains additional citation context not shown here]
Y. Jegou and O. Temam, Speculative Prefetching in Proc. of the 1993 Int. Conf. on Supercomputing, pp. 57-66, 1993.
....for these data and, in case of a miss, a request to the external memory may be issued. Three of the most significant contributions in the area of hardware prefetching are the Preloading Scheme proposed by J.-L. Baer and T.-F. Chen [3], the Speculative Prefetching proposed by Y. Jegou and O. Temam [9], and the Stride Directed Prefetching developed by J.W. Fu, J.H. Patel and B.L. Janssens [7]. The Preloading Scheme is based on having two program counters. The first one points to the instruction currently in execution while the other one points to several instructions ahead. When the ....
....the latter case, it also depends on the stride and the size of the vector. These attributes are estimated by means of the locality prediction table. The locality prediction table is based on the history table that was proposed by Baer and Chen [3], Fu, Patel and Janssens [7], and Jegou and Temam [9] as part of a hardware mechanism for prefetching vector data. The locality prediction table has a moderate number of entries (it is suggested in [9] that about 256 entries could be sufficient to catch most of the references). Each entry contains information about a recently executed load/store ....
[Article contains additional citation context not shown here]
Y. Jegou and O. Temam, Speculative Prefetching in Proc. of the 1993 Int. Conf. on Supercomputing, pp. 57-66, 1993.
....In addition, following each incorrect branch prediction the LAPC needs to reset and start building up the distance between itself and the main program counter. During this period, prefetches might not be issued early enough. Sklenar [19], Fu, Patel and Janssens [20], and Jegou and Temam [21] present schemes similar to the one described above. However, their schemes do not include branch prediction; rather, the schemes simply prefetch data one iteration ahead. Therefore, these schemes are very limited in the amount of adaptivity they can perform. Jouppi [22] presents a scheme based on ....
....[23] extend the work of Jouppi. They evaluate stream buffers as an off chip replacement for second level cache. They also show how to enhance the stream buffer operation by eliminating useless prefetches and extending prefetching to include LCSASs. The prefetching for LCSASs differs from that in [17, 19, 20, 21] in not using load instruction addresses to predict what data to prefetch. Varma and Sinha [24] present a method for prefetching SCSASs by providing two additional copies of the cache tag RAM. When an access is made to line l, the additional tag RAM copies allow comparisons to be made for lines l ....
[Article contains additional citation context not shown here]
Y. Jegou and O. Temam, "Speculative prefetching," in International Conference on Supercomputing, 1993.
....of long latency memory operations. Prefetching has been researched heavily, and for a long time. Hardware, software, and hybrid hardware software schemes have all been extensively explored for prefetching instructions and data references, both in the context of uniprocessors and multiprocessors [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37]. 2.2 Instruction prefetching and branch prediction The instruction prefetching and branch prediction problems are closely related. Efficient solutions for both are critical, because they affect the degree to which speculative execution is effective in current machines. In their research, ....
....to consider hardware, software, and hybrid hardware software techniques as three separate categories. The work by Tse and Smith [7], Chen and Baer [8, 22], Dahlgren, Dubois, and Stenstrom [9], Charney and Reeves [12], Palacharla and Kessler [16], Eickemeyer and Vassiliadis [20], Temam and Jegou [21], Fu [24], Varma and Sinha [29], Jouppi [33], and Lee, Yew, and Lawrie [36] represents a broad cross section of research in hardware data prefetching. Most of this work can be divided broadly into two categories: schemes that react to patterns of cache misses, and schemes that detect linear ....
[Article contains additional citation context not shown here]
Y. Jegou and O. Temam, "Speculative Prefetching," in Proceedings of the 1993 ACM International Conference on Supercomputing, pp. 57--66, July 1993.
....the larger the line size, the smaller the efficiency, i.e. the smaller the ratio of words fetched over words used as seen above. Finally, large line sizes do not prevent periodic cache misses during a vector access. Stride prefetching An alternative to large line sizes has been proposed in [8, 3, 13, 9]. The principle is to detect the access stride of a reference, then predict the next address to be used based on the address being currently referenced and the stride, and then prefetch the corresponding data. Though simulations proved the efficiency of such schemes, they require relatively heavy ....
....was fetched in C 2 is not used, the next line will never be prefetched, since prefetch occurs only when a line of C 2 (that was never used before) is transferred in C 1 . This principle limits the amount of additional memory traffic and wrong predictions. Besides, as for the mechanisms proposed in [8, 3, 13, 9], regularity appearing in complex codes (non rectangular loops, loops with if statements) can also be exploited. As can be seen on figure 9.9, prefetching can increase the performance of codes with strong spatial locality (LL,MM) but degrades the performance of other codes (with respect to the ....
Y. Jegou and O. Temam. Speculative Prefetching. In Proceedings of the ACM International Conference on Supercomputing, 1993.
....when the number of simultaneous streams is higher than the number of stream buffers or when stream strides are large. Now, several schemes based on prediction tables (one table entry per load/store instruction) have been proposed where the stride of a load/store reference is automatically computed [7, 5, 8]. Such schemes exhibit high prefetch efficiency (accuracy of prediction) but their hardware cost is very significant since the table must have about 256 entries [5, 8]. Software prefetching [1] exploits the subscript expression for address predictions by prefetching A(J+d,I) for reference A(J,I) ....
.... entry per load/store instruction) have been proposed where the stride of a load/store reference is automatically computed [7, 5, 8]. Such schemes exhibit high prefetch efficiency (accuracy of prediction) but their hardware cost is very significant since the table must have about 256 entries [5, 8]. Software prefetching [1] exploits the subscript expression for address predictions by prefetching A(J+d,I) for reference A(J,I) in an I,J loop nest (where d is the prefetch distance). The prediction is based on the inner loop index. In addition to the significant compiler overhead of software ....
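The software prefetching rule in this context (prefetch distance d applied to the inner loop index) can be sketched as follows. The function name, the 0-based indices, and the choice to skip prefetches that would run past the end of the inner loop are assumptions for illustration; the citing paper does not specify boundary handling.

```python
def loop_with_prefetch(n_i, n_j, d):
    """Trace an I,J loop nest over A(J, I) with prefetch distance d.

    Returns a list of (reference, prefetch_target) pairs, where the
    target is the (J + d, I) element or None near the end of the
    inner loop (hypothetical boundary rule).
    """
    trace = []
    for i in range(n_i):          # outer loop index I
        for j in range(n_j):      # inner loop index J drives the prediction
            target = (j + d, i) if j + d < n_j else None
            trace.append(((j, i), target))
    return trace


for ref, target in loop_with_prefetch(n_i=1, n_j=4, d=2):
    print(ref, "->", target)
```

Because the target depends only on the inner index, the prefetch for the first elements of the next outer iteration is never issued in this sketch, which is one reason a larger d is not always better.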
[Article contains additional citation context not shown here]
Y. Jegou and O. Temam. Speculative Prefetching. In Proceedings of the ACM International Conference on Supercomputing, 1993.
....the larger the line size, the smaller the efficiency, i.e. the smaller the ratio of words fetched over words used as seen above. Finally, large line sizes do not prevent periodic cache misses during a vector access. Stride prefetching An alternative to large line sizes has been proposed in [9, 3, 14, 10]. The principle is to detect the access stride of a reference, then predict the next address to be used based on the address being currently referenced and the stride, and then prefetch the corresponding data. Though simulations proved the efficiency of such schemes, they require relatively heavy ....
....was fetched in C 2 is not used, the next line will never be prefetched, since prefetch occurs only when a line of C 2 (that was never used before) is transferred in C 1 . This principle limits the amount of additional memory traffic and wrong predictions. Besides, as for the mechanisms proposed in [9, 3, 14, 10], regularity appearing in complex codes (non rectangular loops, loops with if statements) can also be exploited. As can be seen on figure 9, prefetching can increase the performance of codes with strong spatial locality (LL,MM) but degrades the performance of other codes (with respect to the ....
Y. Jegou and O. Temam. Speculative Prefetching. In Proceedings of the ACM International Conference on Supercomputing, 1993.
No context found.
Y. Jegou and O. Temam, "Speculative prefetching," in Proceedings of the International Conference on Supercomputing, 1992, pp. 1--11.
No context found.
Y. Jegou and O. Temam. "Speculative Prefetching". Proc. of ICS-93, pp.1-11, Dec. 1992.
No context found.
Y. Jegou and O. Temam. "Speculative Prefetching". Proc. ICS-93, Dec. 1992: 1-11.
No context found.
Yvon Jegou and Olivier Temam. Speculative Prefetching. In Proc. Int. Conf. on Supercomputing, pages 57--66, 1993.