| T. C. Mowry, "Tolerating latency through software-controlled data prefetching," Ph.D. dissertation, Department of Electrical Engineering, Stanford University, March 1994. |
....loop induction variables. Other researchers investigate optimizing high performance Java applications using traditional loop optimizations [4, 9]. Their work is complementary to our work. Mowry, Lam, and Gupta describe and evaluate compiler techniques for data prefetching in array based codes [19, 18]. Their paper is one of the first to report execution times for compiler inserted prefetching. The algorithm works on affine array accesses, and involves several steps. First, the compiler performs locality analysis to determine array accesses that are likely to be cache misses. The compiler uses ....
T. C. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Stanford University, Department of Electrical Engineering, Mar. 1994.
....locations. The compiler also uses standard cache improvement techniques such as loop unrolling and tiling. Simulation results show improvements in cache utilization and execution speed. Mowry, Lam, and Gupta describe and evaluate compiler techniques for adding prefetching to array based codes [79, 78]. This paper is one of the first that reports execution times for compiler inserted prefetching. The algorithm works on affine array accesses within 25 scientific codes. The algorithm significantly improves performance by as much as a factor of 2. They also show that their algorithm is better ....
....several researchers have investigated prefetching of array based codes on multiprocessors. Fu and Patel evaluate two hardware prefetching schemes on a vector multiprocessor system [39]. Mowry and Gupta evaluate software prefetching for array based programs on shared memory multiprocessors [77, 78]. Gornish, Granston, and Veidenbaum implement prefetching for shared memory multiprocessors [42]. Dahlgren, Dubois, and Stenstrom evaluate sequential hardware prefetching and stride prefetching on a shared memory multiprocessor [32, 33]. In his thesis, Gornish compares software and hardware ....
Todd C. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Stanford University, Department of Electrical Engineering, March 1994.
....several researchers have investigated prefetching of array based codes on multiprocessors. Fu and Patel evaluate two hardware prefetching schemes on a vector multiprocessor system [39]. Mowry and Gupta evaluate software prefetching for array based programs on shared memory multiprocessors [77, 78]. Gornish, Granston, and Veidenbaum implement prefetching for shared memory multiprocessors [42]. Dahlgren, Dubois, and Stenstrom evaluate sequential hardware prefetching and stride prefetching on a shared memory multiprocessor [32, 33]. In his thesis, Gornish compares software and hardware ....
Todd Mowry and Anoop Gupta. Tolerating latency through software-controlled prefetching in shared-memory multiprocessors. Journal of Parallel and Distributed Computing, 12(2):87--106, June 1991.
.... B 1; n) do instruction is often handled as a hint for the processor to load a certain data item but the fulfillment of the prefetch is not guaranteed by the CPU. Prefetch instructions can be inserted into the code manually by the programmer or automatically by a compiler [Por89, KL91, CKP91, Mow94]. In both cases prefetching involves overhead. The prefetch instructions themselves have to be executed, i.e. pipeline slots will be filled with prefetch instructions instead of other instructions ready to be executed. Furthermore, the memory address of the prefetched data must be calculated and ....
T.C. Mowry. Tolerating Latency Through Software--Controlled Data Prefetching. PhD thesis, Computer Systems Laboratory, Stanford University, March 1994.
....overcomes these shortcomings by means of prefetching. VSCAP does not solely rely on overlapping communication with computation, it also overlaps communication operations with other communication operations to hide even more network latency. Research in prefetching can be divided into software [13], hardware [4,9,5] and hybrid prefetching [16,10,12]. VSCAP's prefetching approach is quite different as it does not address parallel architectures with cache coherent memory, it rather targets machines with distributed memory and explicit communication operations where data distribution is the ....
T. Mowry. Tolerating Latency Through Software Controlled Data Prefetching. PhD thesis, Department of Computer Science, Stanford University, March 1994.
....tain array data in registers instead of cache, reducing latency for accesses to this data. We hope to use static performance estimation along with our out of core transformations to identify regions of code where the balance between I/O and computation can be improved. Cache Management. Mowry [15] describes a method for software controlled data prefetching which focuses on issues pertinent to the data cache. The algorithm contains three stages: identify reuse; isolate predicted misses; and schedule prefetches. Identification of reuse is done using a matrix representation of dependence ....
Todd C. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Department of Electrical Engineering, Stanford University, March 1994.
....by the effectiveness of its compiler support. The CCDP scheme relies on the compiler to identify potentially stale and nonstale data references, and to generate and schedule the appropriate prefetch operations. Several compiler techniques have been developed for software initiated data prefetching [2, 12, 13, 23, 24]. However, as these data prefetching schemes are used solely for memory latency hiding, the data prefetching operations are determined based on data locality considerations alone. Since these techniques do not distinguish between potentially stale and nonstale references, they cannot be applied ....
T. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Stanford University, Dept. of Electrical Engineering, March 1994.
....practice the use of linked data structures, which requires dynamic memory allocation. The proximity of storage layout of such applications does not imply the same degree of spatial locality that array based applications do. More recent approaches such as Multithreading [2,22,30], Prefetching [6,18,21], Jump Pointers [28] and Memory Forwarding [19] have been explored to address memory latency in pointer based applications. Multithreading tends to combat latency by passing the control of execution to other threads when a long latency operation is encountered. Prefetching tries to predict the ....
T. C. Mowry. "Tolerating Latency Through Software-Controlled Data Prefetching", PhD thesis, Stanford University, March 1994.
....For instance, aggressive prefetching may evict useful data from the cache before it is needed. In addition, adding unnecessary prefetch instructions may hinder instruction cache performance and saturate memory queues. The Open Research Compiler (ORC) [19] uses an extension of Mowry's algorithm [18] to insert prefetch instructions. ORC uses a priority function that assigns a Boolean confidence to prefetching a given address. Subsequent passes use this value to determine whether or not to prefetch the address. Currently, the priority function is simply based upon how well the compiler can ....
T. C. Mowry. Tolerating Latency through Software-Controlled Data Prefetching. PhD thesis, Stanford University, Department of Electrical
....to the scheduler and code generator. Scalar replacement of memory references [14, 15] is a technique to replace memory references by compiler generated temporary scalar variables, which are eventually mapped to registers. Finally, the compiler also inserts the appropriate type of prefetches [7, 19, 20] for data references so as to overlap the memory access latency with computation. These transformations are described in detail in later sections of this paper. A primary objective of scalar optimizations is to minimize the number of computations and the number of references to memory. Scalar ....
....exposes new opportunities for scalar replacement of memory references. DATA PREFETCHING Data prefetching is an effective technique to hide memory access latency. It works by overlapping time to access a memory location with time to compute as well as time to access other memory locations [7, 19, 20]. Data prefetching inserts prefetch instructions for selected data references at carefully chosen points in the program, so that referenced data items are moved as close to the processor as possible before the data items are actually used. Note that the data prefetch instructions do not normally ....
T. Mowry, "Tolerating Latency Through Software-Controlled Data Prefetching," Ph.D. Thesis, Stanford University, March 1994, Technical Report CSL-TR-94-626.
....those objects can be obtained in advance. Once available in a local cache, those objects can be retrieved with minimal delay, enhancing the user experience. Prefetching is a well known approach to decrease access times in the memory hierarchy of modern computer architectures (e.g. [Smi82, Mow94, KCK 96, VL97, KW98a, LM99, JG99, SSS99]) and has been proposed by many as a mechanism for the same in the World Wide Web (e.g. [PM96, KLM97]). While the idea is promising, prefetching is not straightforward to evaluate (as we point out later in this thesis and then propose an alternative ....
Todd C. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Computer Systems Laboratory, Stanford University, March 1994.
....to the remote memory. Several techniques have been presented to tolerate or reduce this memory access latency, which may become significant for remote memory accesses. These include processor caches [64], relaxed memory consistency models [1, 22, 38, 33], and software controlled data prefetch [55]. The remainder of this chapter describes the problem introduced by processor caches in shared memory multiprocessors, and chapter 2 describes the relaxed memory consistency models and software controlled data prefetch in greater detail. ....
....and relaxes the ordering of these accesses. In the remainder of this dissertation, the applications studied will use a release consistency memory model. 2.2 Software Controlled Data Prefetch. Another technique to reduce the memory access latency is software controlled, nonbinding data prefetch [55]. The data prefetch moves the requested data to the cache nearest the processor. This data prefetch can be used to hide a portion of the data miss latency in a typical producer and consumer interaction [55] as shown in figure 2.3. In the figure, the producer and consumer share a block of data ....
[Article contains additional citation context not shown here]
Todd Mowry and Anoop Gupta. Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors. Journal of Parallel and Distributed Computing, pages 87--106, 1991.
.... of these work have been presented in [30]. While many of these work concentrate on getting a rate optimal schedule, other equally important issues to achieve high performance including register allocation and spill code generation [40, 24], prefetching in both numerical and non numerical programs [28, 34] have been getting recent attention. Integer linear programming formulation is widely used to derive rate optimal schedules [11, 12, 1]. Comparison between the rate optimal scheduling formulation and the software pipelining in MIPSpro, which is a production quality compiler, has been made in [32] ....
T. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Stanford University, 1994.
....has been devoted to the problem of making up for these deficiencies in caches. Many domains of programs have compile time analyzable reuse patterns. This includes the multimedia and other streaming applications which have become of great importance in recent years. Software controlled prefetching [9] uses this reuse analysis to push a data value up the hierarchy. By overriding the standard replacement policy, prefetching can minimize the performance impact of poor replacement choices while improving overall performance by bringing items into the cache before they are needed. By analyzing the ....
....that repeats throughout the iteration space. This stencil of dependencies represents the regular and predictable interaction between any given iteration and the iterations which depend on it. See figure 2 3 for our example's dependence vector stencil. 2.4 Reuse Analysis. Array reuse analysis [9, 14] is a means of statically determining how and when array elements will be reused in a loop nest. The basis of reuse analysis is conversion of affine array accesses 3 into vector expressions. By solving for properties of these vector expressions, 2 Dependence analysis does not usually talk about ....
[Article contains additional citation context not shown here]
Todd C. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Stanford University, 1994.
.... at a later time, with the goal of bringing the location into the processor's cache before it issues a demand memory access [17]. Previous studies have shown that software controlled non binding prefetching can eliminate a large fraction of memory stall time in shared memory multiprocessors [84, 124, 100]. However, these studies have been mainly limited to conventional scientific and engineering workloads. In this section, we study the effectiveness of software prefetching for our media processing workloads. As discussed in Section 2.3.3, we follow the well known software prefetching compiler ....
Todd Mowry and Anoop Gupta. Tolerating Latency Through Software-Controlled Prefetching. Journal of Parallel and Distributed Computing, pages 87--106, June 1991.
....the memory accesses that dominate the cache miss stall time, and inserted prefetches by hand (since we did not have access to a C compiler that implements software prefetching) for these accesses. We followed the well known software prefetching compiler algorithm developed by Mowry et al. [83]. The prefetch algorithm handles both affine and indirect addresses to identify and schedule prefetches for possible cache misses. The algorithm is loop based, and consists of an analysis phase and a scheduling phase. The analysis phase identifies the accesses that do not exhibit locality for the ....
....mainly limited to conventional scientific and engineering workloads. In this section, we study the effectiveness of software prefetching for our media processing workloads. As discussed in Section 2.3.3, we follow the well known software prefetching compiler algorithm developed by Mowry et al. [83] to insert prefetches by hand for our benchmarks. Overall results. Figure 2.4 summarizes the execution time reductions from software prefetching relative to the base system with VIS (with 64K L1 and 128K L2 caches). We do not report results for cjpeg np, djpeg np and mpeg enc since these ....
[Article contains additional citation context not shown here]
Todd Mowry. Tolerating Latency through Software-controlled Data Prefetching. PhD thesis, Stanford University, 1994.
....techniques previously proposed for prefetching different types of memory references. 3.1 Affine Array Prefetching. To perform software prefetching for affine array references commonly found in scientific codes, the well known compiler algorithm for inserting prefetches proposed by Mowry and Gupta [32] is followed. In this algorithm, locality analysis is used to determine which array references are likely to suffer cache misses. The cache missing memory references are then isolated by performing loop unrolling and loop peeling transformations. Finally, prefetch instructions are inserted for the ....
T. Mowry and A. Gupta. Tolerating latency through software-controlled prefetching in shared-memory multiprocessors. Journal of Parallel and Distributed Computing, 12(2):87--106, June 1991.
....of the data by the processor (known as the prefetch distance) to overlap the latency of the memory access. 3.2 Indexed Array Prefetching. Indexed array accesses, of the form A(B(i)), are common in irregular scientific codes. The prefetch algorithm for indexed array accesses, originally proposed in [30], is similar to the algorithm for affine array prefetching. The main difference lies in how prefetch requests are scheduled. In affine array prefetching, each prefetch is scheduled early enough to tolerate the latency of a single cache miss. For indexed array references, the memory indirection ....
T. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching, PhD Thesis. Technical report, Stanford University, March 1994.
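Illustrative sketch (not part of the cited excerpts): the indexed-array context above is truncated before it says how the indirection is scheduled, so the following C fragment only shows one commonly described arrangement, assumed here for illustration. The index array is prefetched two prefetch distances ahead and the indirectly referenced element one distance ahead, so that the index value is likely to be cached by the time it is needed to form the second prefetch address. The function name gather_sum, the parameter dist, and the use of the GCC/Clang builtin __builtin_prefetch are all assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a[b[i]] for i in [0, n). `dist` is a hypothetical prefetch distance
 * measured in loop iterations; a real compiler would derive it from the
 * memory latency and the loop body length. */
double gather_sum(const double *a, const int *b, size_t n, size_t dist)
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        /* Prefetch the index itself two distances ahead (assumed schedule). */
        if (i + 2 * dist < n)
            __builtin_prefetch(&b[i + 2 * dist], 0, 3);

        /* Prefetch the indirect target one distance ahead; this reads
         * b[i + dist], which the earlier prefetch was meant to bring in. */
        if (i + dist < n)
            __builtin_prefetch(&a[b[i + dist]], 0, 3);

        sum += a[b[i]];
    }
    return sum;
}
```

The bounds checks keep the sketch safe for any n and dist; a production compiler would more likely peel the final iterations into an epilogue instead of testing inside the loop.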
....intrusive, i.e. managed by the processor and affect the remainder of his behavior. negative effects in front of temporal locality in cache hierarchy, Theoretically, compiler directed prefetching is able to manage a wide range of array access functions and handle multiple nested loops [CKP91] [Mow94]. Even if the current SGI compiler only supports linear access functions [Sil97b], since we mainly focus on scientific computing with regular access patterns, this is sufficient for our needs. Pragmas utilization may help the compiler to insert prefetch instructions, but it implies a careful ....
Todd C. Mowry. Tolerating latency through software-controlled data prefetching. PhD thesis, Stanford University, 1994.
....the corresponding meeting graph has one circuit of weight 24/8 = 3. It can be decomposed as shown in 5.c, yielding three circuits of weight 1. In both cases, three colors are required. 4.3 Determining an alternative memory mapping. Memory is virtually subdivided into memory lines in analogy to [11]. Technically, attention must be paid that any memory space delivered by routines such as malloc is aligned to element size only, whereas we need alignment to cache line boundaries. This makes it pos ....
T. Mowry. Tolerating Latency Through Software Controlled Data Prefetching. PhD thesis, Dept. of Computer Science, Stanford University, Mar. 1994.
....distributed shared memory (DSM) system. DSMs are software systems that emulate shared memory semantics in software over hardware that provides support only for message passing. Multi threading for latency hiding is a well-known technique for hiding cache miss latencies in the hardware environment [2, 3]. However, the software environment presents special challenges. The paradigm usually assumed in DSM related literature is that of a distributed system containing a single thread on each processor. This arrangement is simple, and yet allows reasonably high processor efficiency. However, DSMs ....
T. Mowry and A. Gupta, "Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors," in Journal of Parallel and Distributed Computing, June 1991.
....to SMT machines are required to support our technique. Despite its simplicity, our technique offers an average speedup of 24% in a set of irregular applications, which is a 19% speedup over state of the art software controlled prefetching. 1. Introduction. Multithreading [1, 32] and prefetching [8, 20, 22] are two major techniques for tolerating ever increasing memory latency. Multithreading tolerates latency by executing instructions from another concurrent thread when the running thread encounters a cache miss. In contrast, prefetching tolerates latency by anticipating what data is needed and ....
....quite common in scientific and engineering applications such as sparse matrix algorithms and windtunnel simulations. Another common example is arrays of pointers. To cope with these cases, the compiler usually needs to heuristically decide how to prefetch, perhaps based on profiling information [22]. In contrast, these cases do not present a problem to pre execution as it directly runs the code and hence does not need to make any compile time assumptions. Among our benchmarks, we find that indirect array references contribute significantly to the cache misses in the Spec2000 application ....
[Article contains additional citation context not shown here]
T. C. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Stanford University, March 1994.
....understand the potential gains of our technique. An important question is: can these analyses be automated? The most challenging step in the analysis is the identification of sparse memory references. Existing compilers for instrumenting software prefetching in affine loops [9], indexed array loops [10], and pointer chasing loops [8] already extract this information automatically. Extraction of size information is a simple mechanical process once the sparse memory references have been identified, following the steps in Section 2.3. Consequently, we believe the analyses outlined in this paper are ....
....but our simulator does not quantify these effects. Our evaluation considers the impact of annotated memory instructions on software prefetching, so we created software prefetching versions of our benchmarks. For affine array and indexed array references, we use the prefetching algorithms in [9] and [10], respectively. For pointer chasing references, we use the prefetch arrays technique [6]. Instrumentation of annotated memory instructions for prefetch, load, and store instructions occurs after software prefetching has been applied. 5.2 Performance of Annotated Memory Instructions. Figure 8 ....
Todd C. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching, PhD Thesis. Technical report, Stanford University, March 1994.
....code can prevent the needed code restructuring. Software prefetching is a widely used latency tolerance method implemented on many commercial systems. The most commonly known and implemented software prefetching algorithms apply software pipelining to the innermost loop for a given miss reference [17, 18, 19]. Such software pipelining creates a prologue, which prefetches data for the first iterations; a steady state, which includes computation along with prefetches scheduled ahead by a certain number of iterations termed the prefetch distance; and an epilogue, with only computation for the last ....
....known and implemented software prefetching algorithm for regular references, as well as algorithms for irregular references. 2.2.1 Algorithm for Adding and Scheduling Prefetches. The best known software prefetching algorithm implemented in a compiler is the loop based algorithm of Mowry et al. [17, 18, 19]. The analysis phase of the algorithm identifies the static references that can miss (leading references). Then, the scheduling phase uses loop peeling, .... [Figure residue: (a) Base code] ....
T. Mowry and A. Gupta. Tolerating Latency Through Software-Controlled Prefetching. Journal of Parallel and Distributed Computing, pages 87--106, June 1991.
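Illustrative sketch (not part of the cited excerpts): the prologue, steady state, and epilogue described in the contexts above can be pictured with a small C loop. This is a simplified, hand-written rendering under stated assumptions, not the cited compiler algorithm: it issues one prefetch per element rather than one per cache line (the cited work isolates the missing references by unrolling and peeling), and the function name sum_prefetched, the parameter dist, and the GCC/Clang builtin __builtin_prefetch are choices made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n doubles while prefetching `dist` iterations ahead. `dist` stands
 * for the prefetch distance; its value is a hypothetical tuning parameter. */
double sum_prefetched(const double *a, size_t n, size_t dist)
{
    assert(a != NULL);
    double sum = 0.0;
    size_t i;

    /* Prologue: prefetch the data used by the first `dist` iterations. */
    size_t pro = dist < n ? dist : n;
    for (i = 0; i < pro; i++)
        __builtin_prefetch(&a[i], 0, 3);

    /* Steady state: compute iteration i, prefetch for iteration i + dist. */
    size_t steady = n > dist ? n - dist : 0;
    for (i = 0; i < steady; i++) {
        __builtin_prefetch(&a[i + dist], 0, 3);
        sum += a[i];
    }

    /* Epilogue: the last iterations perform computation only. */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```

Splitting the loop this way keeps every prefetch address inside the array without a bounds test in the steady-state body, which is the usual motivation for the three-part structure.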
....code can prevent the needed code restructuring. Software prefetching is a widely used latency tolerance method implemented on many commercial systems. The most commonly known and implemented software prefetching algorithms apply software pipelining to the innermost loop for a given miss reference [17, 18, 19]. Such software pipelining creates a prologue, which prefetches data for the first iterations; a steady state, which includes computation along with prefetches scheduled ahead by a certain number of iterations termed the prefetch distance; and an epilogue, with only computation for the last ....
....We evaluate these latency tolerance techniques both with a detailed simulator (RSIM) and on a real system (Convex Exemplar) applying miss clustering by hand in both cases and applying prefetching by hand for the simulation. We consider prefetching for both regular and irregular applications [15, 16, 17, 27]. For the applications and systems we study, clustering alone outperforms prefetching alone for most cases. This result, however, is sensitive to system trends, and there may be some applications where clustering is not applicable but prefetching is. More importantly, this paper finds that the ....
[Article contains additional citation context not shown here]
T. Mowry. Tolerating Latency through Software-controlled Data Prefetching. PhD thesis, Stanford University, 1994.
....such a prefetching request will be of no use to an executing program. A possible solution to this problem is not to discard a prefetching request on a TLB miss but to prefetch the translation information from a page table if it is not available in the TLB and then use it to prefetch the data [29, 13]. Obviously, such TLB prefetching can only work for the data residing on pages that can be accessed and are available in memory. 4.8 Classification of data related misses. We classify the cause of a miss to TLB and L2 cache that occurs on an access to heap allocated data into the following ....
T. Mowry. Tolerating latency through software-controlled data prefetching. PhD thesis, Stanford University, Mar. 1994.
....iterations in advance. There are three proposed solutions to this. First, the use of an adaptive prefetching mechanism that changes the prefetch distance dynamically [15]. However, these techniques have not proven to be stable performance-wise. The second option is to use a compiler technique [25, 26]. However, it is an open question if these techniques can be applied successfully to the loops in this application. The last option is to perform loop unrolling on the loops to help the stride prefetcher to put more distance between two executions ....
T. Mowry. Tolerating Latency through Software-Controlled Data Prefetching. PhD thesis, Stanford University, March 1994.
....latency of memory accesses by bringing data into cache before it is accessed by the CPU. Cache prefetching comes in three varieties. Hardware cache prefetchers [4, 10, 9] observe the data stream and use past access patterns and or miss patterns to predict future misses. Software prefetchers [13] insert prefetch directives into the code with enough lead time to allow the cache to acquire the data before the actual access is executed. Recently, the expected emergence of multithreading processors [21, 20] has led to thread based prefetchers [6, 23, 11, 2, 16] which execute code in another ....
T. Mowry and A. Gupta. Tolerating latency through software-controlled prefetching in shared-memory multiprocessors. In Journal of Parallel and Distributed Computing, pages 87--106, June 1991.
....cache conflicts. Software prefetching requires the compiler to determine when the instruction overhead of prefetching outweighs the benefit of potentially avoiding cache misses. It also requires the compiler to determine where to insert the prefetch instructions in order to cover a potential miss. [Mowr94] and [SaPa96] propose methods for varying prefetch distance at run time. [WaRa95] proposes to use a processor for prefetching in the context of virtual shared memory. [Figure 1. Qualitative Sketch of the Effect of Cache Block Size on Execution Time] ....
T.C. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. Ph.D. Thesis, Stanford University, 1994.
No context found.
T. C. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Stanford University, Mar. 1994.
....whether a new interface is actually needed. There are two reasons why existing read/write I/O interfaces are unacceptable for our purposes. First, for the compiler to successfully move prefetches back far enough to hide the large latency of I/O, it is essential that prefetches be non binding [19]. The non binding property means that when a given reference is prefetched, the data value seen by that reference is bound at reference time; in contrast, with a binding prefetch, the value is bound at prefetch time. The problem with a binding prefetch is that if another store to the same location ....
....be used to implement prefetch and release in UNIX. 2.2.2 Minimizing Prefetch Overhead. Earlier studies on compiler based prefetching to hide cache to memory latency have demonstrated the importance of avoiding the overhead of unnecessarily prefetching data that already resides in the cache [19, 20]. To address this problem, compiler algorithms have been developed for inserting prefetches only for those references that are likely to suffer misses. An analogous situation exists with I/O prefetching, since we do not want to prefetch data that already resides in main memory; hence, we perform ....
[Article contains additional citation context not shown here]
T. C. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Stanford University, March 1994. Technical Report CSL-TR-94-626.
....been proposed. Coherent caches [3, 4, 18, 30] allow shared read write data to be cached and significantly reduce the memory latency seen by the processors. Relaxed memory consistency models [1, 5, 8] hide latency by allowing buffering and pipelining of memory references. Prefetching techniques [11, 16, 21, 23] hide the latency by bringing data close to the processor before it is actually needed. Multiple contexts [3, 12, 13, 26, 29] allow a processor to hide latency by switching from one context to another when a high latency operation is encountered. Our primary objective in this paper is to ....
....while the home node is the node that contains the main memory and directory for the given physical memory address. A remote node is any node, other than the local or home node. The latency The architectural parameters and benchmark data sets in this paper differ from those in previous papers [7, 21], and therefore results cannot be directly compared. Table 1: Latency for various memory system operations in processor clock cycles (1 pclock = 30 ns) Read Operations Hit in Primary Cache 1 pclock Fill from Secondary Cache 14 pclock Fill from Local Node 26 pclock Fill from Home Node (Home ....
[Article contains additional citation context not shown here]
T. Mowry and A. Gupta. Tolerating latency through software-controlled prefetching in shared-memory multiprocessors. J. Paral. Distr. Computing, to appear in June 1991.
....of their time stalled for memory accesses. 1.2 Memory Hierarchy Optimizations. Various hardware and software approaches to improve the memory performance have been proposed recently [15]. A promising technique to mitigate the impact of long cache miss penalties is software-controlled prefetching [5, 13, 16, 22, 23]. Software controlled prefetching requires support from both hardware and software. The processor must provide a special prefetch instruction. The software uses this instruction to inform the hardware of its intent to use a particular data item; if the data is not currently in the cache, the ....
.... that is targeted for many different memory systems, this lack of flexibility can be a serious limitation not only in terms of tuning for different memory latencies, but also a prefetching scheme that is appropriate for a uniprocessor may be entirely inappropriate for a multiprocessor [22]. Finally, while hardware based schemes have no software cost, they may have a significant hardware cost, both in terms of chip area and possibly gate delays. 6 Future Work The scope of this compiler algorithm was limited to affine array accesses within scientific applications. By prefetching ....
T. Mowry and A. Gupta. Tolerating latency through softwarecontrolled prefetching in shared-memory multiprocessors. Journal of Parallel and Distributed Computing, 12(2):87 106, 1991.
....or whether a new interface is actually needed. There are two reasons why existing read/write I/O interfaces are unacceptable for our purposes. First, for the compiler to successfully move prefetches back far enough to hide the large latency of I/O, it is essential that prefetches be non-binding [19]. The non-binding property means that when a given reference is prefetched, the data value seen by that reference is bound at reference time; in contrast, with a binding prefetch, the value is bound at prefetch time. The problem with a binding prefetch is that if another store to the same ....
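The non-binding property described in the excerpt above can be sketched in C as follows. This is an illustrative example only, assuming a GCC- or Clang-compatible compiler where `__builtin_prefetch` acts as a pure cache hint; the function name `load_after_prefetch_and_store` is hypothetical. Because the hint captures no value, a store issued between the prefetch and the later load is still observed by that load.

```c
#include <assert.h>
#include <stddef.h>

/* Prefetch *p, overwrite it, then load it.
 * Returns the value seen by the load at reference time. */
int load_after_prefetch_and_store(int *p, int new_value)
{
    assert(p != NULL);
    __builtin_prefetch(p, 0, 3); /* non-binding hint: no value is captured */
    *p = new_value;              /* store between the prefetch and the use */
    return *p;                   /* value is bound here, so the store is seen */
}
```

With a binding prefetch, by contrast, the value would have been fixed at the first line of the body, and the later store would not be reflected in the result.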
....be used to implement prefetch and release in UNIX. 2.2.2 Minimizing Prefetch Overhead Earlier studies on compiler-based prefetching to hide cache-to-memory latency have demonstrated the importance of avoiding the overhead of unnecessarily prefetching data that already resides in the cache [19, 20]. To address this problem, compiler algorithms have been developed for inserting prefetches only for those references that are likely to suffer misses. An analogous situation exists with I/O prefetching, since we do not want to prefetch data that already resides in main memory; hence, we perform ....
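One simple way to picture the overhead reduction mentioned in the excerpt above is to issue a single prefetch per cache line instead of one per element, since the remaining elements of a line arrive with the first. The C sketch below is a hand-written illustration, not the compiler algorithm from the cited work. It assumes a GCC- or Clang-compatible compiler, a 64-byte cache line, and an array that starts on a line boundary; `LINE_BYTES`, `AHEAD_LINES`, and `sum_prefetch_per_line` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64  /* assumed cache line size in bytes */
#define AHEAD_LINES 4  /* assumed prefetch distance, in cache lines */

/* Sum an array, prefetching once per cache line rather than per element.
 * If `prefetches` is non-NULL, it receives the number of hints issued. */
int64_t sum_prefetch_per_line(const int64_t *a, size_t n, size_t *prefetches)
{
    const size_t per_line = LINE_BYTES / sizeof(int64_t); /* 8 elements */
    const size_t ahead = AHEAD_LINES * per_line;
    int64_t sum = 0;
    size_t issued = 0;

    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Only the first element of each line triggers a prefetch. */
        if (i % per_line == 0 && i + ahead < n) {
            __builtin_prefetch(&a[i + ahead], 0, 3);
            issued++;
        }
        sum += a[i];
    }
    if (prefetches != NULL) {
        *prefetches = issued;
    }
    return sum;
}
```

For a 64-element array this issues 4 hints where a per-element scheme with the same distance would issue 32. A compiler would typically get the same effect by unrolling the loop rather than testing `i % per_line` on every iteration.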
T. C. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Stanford University, March 1994. Technical Report CSL-TR-94-626.
No context found.
T. C. Mowry, "Tolerating latency through software-controlled data prefetching," Ph.D. dissertation, Department of Electrical Engineering, Stanford University, March 1994.
No context found.
T. C. Mowry and A. Gupta, "Tolerating latency through software-controlled prefetching in shared-memory multiprocessors," Journal of Parallel and Distributed Computing, vol. 12, no. 2, pp. 87-106, June 1991.
No context found.
Todd C. Mowry. Tolerating Latency through Software Controlled Data Prefetching. PhD Thesis Stanford University, March 1994.
No context found.
T. Mowry. Tolerating Latency Through Software Controlled Data Prefetching. PhD thesis, Dept. of Computer Science, Stanford University, Mar. 1994.
No context found.
Todd Mowry and Anoop Gupta. Tolerating latency through software-controlled prefetching in shared-memory multiprocessors. J. Parallel Distrib. Comput., 12(2):87--106, 1991.
No context found.
T. Mowry and A. Gupta. "Tolerating Latency through Software-Controlled Prefetching in Scalable Shared-Memory Multiprocessors". In Jour. of Parallel and Distributed Computing (12) 2, 1991: 87-106.
No context found.
Mowry, T. "Tolerating Latency Through Software Controlled Data Prefetching," Ph.D. dissertation, Stanford University, March 1994.
No context found.
T. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Stanford University, 1994.
No context found.
T. Mowry and A. Gupta. Tolerating latency through software-controlled prefetching in shared-memory multiprocessors. Journal of Parallel and Distributed Computing, 12(2):87--106, June 1991.
No context found.
T. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Stanford University, 1994.
No context found.
T. Mowry. Tolerating Latency Through Software Controlled Data Prefetching. PhD thesis, Dept. of Computer Science, Stanford University, Mar. 1994.
No context found.
T. C. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Stanford University, March 1994.
No context found.
T. Mowry. Tolerating Latency Through Software Controlled Data Prefetching. PhD thesis, Dept. of Computer Science, Stanford University, March 1994.
No context found.
T. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Stanford University, 1994.
No context found.
T.C. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Computer Systems Laboratory, Stanford University, 1994.
No context found.
T. Mowry and A. Gupta, "Tolerating Latency through Software-controlled Prefetching in Shared-memory Multiprocessors," Journal of Parallel and Distributed Computing, Vol. 12, No. 2, June 1991, pp. 87-106.