H. Cheong and V. Veidenbaum. Compiler-Directed Cache Management in Multiprocessors. IEEE Computer, 23(6):39--48, June 1990.
....implementations that select positive broadcast as the method of distributing tuples must also select a coherence protocol as tuples are duplicated on all nodes. The available protocols are essentially the same as those developed by the cache coherency research (for background one can start with [28, 13, 1, 15], and for more current research see [14, 30]). Bjornson [6] noted the advantages of replicating read-only tuples; thus implementations that provide this optimization must also decide on an appropriate coherence protocol. Tuple Transfer Protocol: Tuple transfer is the movement of a tuple among ....
Hoichi Cheong and Alexander V. Veidenbaum. Compiler-directed cache management in multiprocessors. IEEE Computer, 23(6):39--48, June 1990.
....location, seem more promising for large scale systems. However, directories can require large amounts of additional storage and directory maintenance operations may substantially increase network traffic. Other researchers suggest that caches include version-number-based support for coherence [4, 12]. Drawbacks to these schemes include dedication of precious cache real estate to version numbers (decreasing the amount of useful data that the cache can hold) and the additional hardware complexity. A promising alternative to hardware based solutions for coherence is to use compilers to ....
....of the coherence algorithm. Similar observations apply for the GWE. 3.1 Related Work Here we examine two previously proposed software schemes in terms of the aforementioned trade-offs in the positioning of coherence operations. The first is Cheong and Veidenbaum's fast selective invalidation [4] (hereafter referred to as the FSI method). The second is a method proposed by Cytron, Karlovsky, and McAuliffe [5] (hereafter referred to as the CKM method). Both methods were developed for programs with fork-join parallelism expressed in the form of parallel loops. Evictions and cache line sizes ....
[Article contains additional citation context not shown here]
H. Cheong and A. Veidenbaum. Compiler-directed cache management for multiprocessors. Computer, 23(6):39--47, June 1990.
....shared bus. Hardware solutions to the cache coherency problem for multiprocessors with point-to-point connections more commonly employ a directory based scheme [27, 2, 6, 18, 16]. Due to the increased complexity of hardware solutions to the cache coherency problem, software assisted schemes 1 [7, 10, 13, 26, 29, 25, 21, 8] have been proposed, which are under supervision of the compiler (static schemes) or supported by the operating system kernel (dynamic schemes). To support scalability to a large number of processors, overheads such as storage requirements and run time related to the cache coherency scheme should be ....
H. Cheong and A. V. Veidenbaum. Compiler-Directed Cache Management in Multiprocessors. IEEE Computer, 23(6):39--47, June 1990.
....The compiler is aggressive in its analysis as it does not have to make the conservative assumptions of compiler directed schemes where the compiler is entirely responsible for coherence. In the case of compile time unknowns, the directory structure guarantees correctness. Existing analyses [4], [2], [6] for compiler directed coherence assume a fork-join model. Parallel regions are separated into epochs by a synchronisation even though there may be no cross processor dependence between regions. Scheduling within an epoch is considered to be dynamic so it is impossible to determine the ....
....CC NUMAs and VSM architectures are very similar. Compiler directed coherence places the entire burden of maintaining cache coherence on the compiler. Some schemes use a compiler controlled directory to help in runtime dependence analysis, whilst others remove the need for a directory altogether [2], [6]. Early work invalidated all cached data at each epoch. More recent schemes have used tags or timestamps to maintain cached data across epochs. This method relies on the compiler analysis for correctness and is limited in the amount of inter epoch locality that can be exploited by the size of ....
[Article contains additional citation context not shown here]
Cheong H., Veidenbaum A.V., Compiler Directed Cache Management in Multiprocessors, IEEE Computer, 23(6):39-48, June 1990.
....as the problem size. Therefore, with increasing cache sizes and numbers of processors, cache affinity becomes more important. Several existing software schemes attempt to retain cache affinity across parallel loops. They include the Fast Selective Invalidation (FSI) [9, 10], the Version Control [7], the Time Stamp [22, 23], and the Life Span [8] schemes. But all these approaches require additional bits to be added into each cache line. Thus they cannot be implemented on current NUMA multiprocessors (such as BBN TC2000 and Hector). In this paper, we propose a new software cache coherence ....
....that can be used in the current loop. Since the cache affinity in the example is between the two parallel loops, the Life Span strategy [8] can also capture such cache affinity and achieve the same cache hit ratios as those in the optimal scheme. The Version Control and the Time Stamp strategies [7, 22] use several bits per cache line, thus they can retain cache affinity across many parallel loops. The Cache Affinity based Software cache coherence scheme (CAS) captures cache affinity across parallel loops by software. We will discuss it in the next section. In summary, by considering cache ....
H. Cheong and A.V. Veidenbaum. Compiler-Directed Cache Management in Multiprocessors. Computer, 23(6):39--47, 1990.
....these potential stale references at compile time, the system can be forced to get up to date data directly from the main memory instead of from the cache. Stale reference prevention techniques invalidate or update stale cache entries before stale references occur. The Simple Invalidation scheme [3], the Fast Selective Invalidation scheme [3] and the Parallel Explicit Invalidation scheme [10] make use of this technique. Most hardware cache coherence schemes also use prevention techniques. Stale reference avoidance is a more relaxed coherence model than the prevention technique. It allows ....
....time, the system can be forced to get up to date data directly from the main memory instead of from the cache. Stale reference prevention techniques invalidate or update stale cache entries before stale references occur. The Simple Invalidation scheme [3], the Fast Selective Invalidation scheme [3], and the Parallel Explicit Invalidation scheme [10] make use of this technique. Most hardware cache coherence schemes also use prevention techniques. Stale reference avoidance is a more relaxed coherence model than the prevention technique. It allows the existence of stale data while avoiding ....
[Article contains additional citation context not shown here]
H. Cheong and A. Veidenbaum. Compiler-Directed Cache Management In Multiprocessors. IEEE Computer, 23(6):39--47, June 1990.
....framework, this problem can be intuitively phrased as maximizing the sum of the size of the intersections between all the polyhedra associated with each loop nest and each array. 3.3.2 Software coherence Software managed coherence has been proposed as an inexpensive solution to cache coherence [13]. One of the main issues of cache coherence is to identify the data elements shared by two consecutive loop nests in order to invalidate them. The lack of efficient techniques for precisely identifying shared elements results in excessive invalidations and performance degradations. The proposed ....
Hoichi Cheong and Alexander V. Veidenbaum. Compiler-Directed Cache Management in Multiprocessors. IEEE Computer, 23(6):39--47, June 1990.
....to a cache line before it is invalidated, performance is quite good. Reference patterns resulting in frequent cache invalidation traffic perform poorly, however. Cheong and Veidenbaum have proposed compiler directed management of the hardware caches for the Cedar multiprocessor [CV88, CV90]. If eventually proven successful, it would be worthwhile to consider applying this strategy to the management of main memory in NUMA multiprocessors. Another reason that the hardware caching literature is relevant to our research is that it is the primary source of information on parallel ....
H. Cheong and A. Veidenbaum. Compiler-directed cache management in multiprocessors. IEEE Computer, 23(6):39--47, June 1990.
....in most cases to bring software cache coherence within sight of the hardware alternatives. 2 We are speaking here of behavior driven coherence mechanisms that move and replicate data at run time in response to observed patterns of program behavior as opposed to compiler based techniques [13, 15]. 2 We also report on the impact of several architectural alternatives on the effectiveness of software coherence. These alternatives include the choice of write policy (write through, write back, writethrough with a write collect buffer) and the availability of a remote reference facility, ....
....coherence messages to propagate in the background of computation (possibly at the expense of extra coherence traffic) in order to avoid a higher waiting penalty at synchronization operations. Coherence for distributed memory with per processor caches can also be maintained entirely by a compiler [13, 15]. Under this approach the compiler inserts the appropriate cache flush and invalidation instructions in the code, to enforce data consistency. The static nature of the approach, however, and the difficulty of determining access patterns for arbitrary programs, often dictates conservative decisions ....
H. Cheong and A. V. Veidenbaum. Compiler-Directed Cache Management in Multiprocessors. Computer, 23(6):39--47, June 1990.
....for convenience, nonlocal is referred to as if it were a specific memory unit in which nonlocal variables can be stored. 2.1.2 Generation of Communication Statements Producers use atomic statements to set the states of shared variables to loop specific values (this is similar to what is done in [7] and [16]); the consumer waits for the state to reach the appropriate value. The first parallelization method discussed in this paper, the basic method, does not distinguish between array elements shared between just two PEs and elements shared among more than two PEs; as a result, a PE producing ....
Hoichi Cheong and Alexander V. Veidenbaum. Compiler-Directed Cache Management in Multiprocessors. IEEE Computer, pages 39--47, June 1990.
....eliminated. The latter depends on the number of correctly identified self invalidations that actually reach the directory prior to subsequent requests for the self invalidated blocks. There are a myriad of proposals for self invalidation. Software-driven approaches typically use either the compiler [1], the programmer [5], or the binary rewriter (through profiling) [2] to identify opportunities for self invalidation and insert self invalidation directives in the code. To self invalidate blocks accurately, however, software driven approaches require either complex compiler algorithms or careful ....
H. Cheong and A. V. Veidenbaum. Compiler-directed cache management in multiprocessors. IEEE Computer, 23(6):39--48, June 1990.
....46, 36, 72, 65, 75] for systems with a broadcast medium such as a bus interconnection network have been proposed. For more scalable multiprocessors with general interconnection network between processors, directory based protocols [2, 4, 14, 44, 73, 86] and compiler assisted software protocols [20, 51, 57, 79] have been suggested. Recently, dynamically tagged directory protocols [15, 41, 55, 56] have evolved from previous directory based schemes. Snoopy Protocols: Snoopy protocols are also called bus based protocols. All processors in the system can observe any memory access by snooping on the bus. ....
H. Cheong and A. V. Veidenbaum. Compiler-directed cache management in multiprocessors. Computer, 23(6):39--47, 1990.
....techniques will be inapplicable to commodity workstations. 7.2 Other Approaches to Software Coherence Solving the data coherence problem by software means has been a topic of research for many years. Early proposals forced the programmer or the compiler to flush caches at synchronization points [14]. However, subsequent work has concentrated on system level approaches. Kai Li has pioneered Virtual Shared Memory (VSM) systems [49] for distributed systems. Research in this area has grown continuously, fueled by the emergence of networks of workstations. Since the first VSM, Ivy [50] which was ....
H. Cheong and A.V. Veidenbaum. Compiler-Directed Cache Management in Multiprocessors. IEEE Computer 23(6):39-47, June 1990.
....To do so, two locality preserving analysis techniques [5] are used. These techniques help to reduce the number of unnecessary coherence operations. We use array data flow analysis [5] to perform stale reference detection more accurately compared to early stale reference detection algorithms [4]. In previous algorithms, an array is treated as a single variable to simplify the compiler analysis. However, this is likely to overestimate the amount of potentially stale references. The array data flow analysis algorithm solves this problem by treating different regions of arrays referenced in ....
....data flow analysis algorithm solves this problem by treating different regions of arrays referenced in the program as distinct symbolic variables. Finally, procedure calls in a program introduce side effects which complicate stale reference detection. Previous stale reference detection algorithms [4] avoided this problem by invalidating the caches at procedure boundaries or by inlining the procedures. However, cache invalidations at procedure boundaries can increase the number of unnecessary cache misses at run time. Inlining might lead to excessive growth in code size, which in turn ....
H. Cheong and A. Veidenbaum. Compiler-directed cache management in multiprocessors. IEEE Computer, 23(6):39--47, June 1990.
....generating unnecessary invalidation or misses. Compiler directed coherence places the entire burden of maintaining cache coherence on the compiler. Some schemes use a compiler controlled directory to help in run time dependence analysis, whilst others remove the need for a directory altogether [2, 4]. Early work invalidated all cached data at each epoch. More recent schemes have used tags or timestamps [3] to maintain cached data across epochs. This method relies on the compiler analysis for correctness and is limited in the amount of inter epoch locality that can be exploited by the size of ....
Cheong H., Veidenbaum A.V., Compiler Directed Cache Management in Multiprocessors, IEEE Computer, 23(6):39-48, June 1990.
.... on processor 2 ENDDO DOALL I A(I 1) A(2) write allocated on processor 1 ENDDO DOALL I B(I) A(I) 1 Access stale A(2) on processor 2 ENDDO Figure 1: Example of stale access in the absence of coherence control. 2 Definitions and framework Fork-join programs are composed of a series of epochs [4]. Each epoch consists of one or more instances which run in parallel. Each epoch is either a (fork-join) parallel loop with no internal synchronization, e.g. a Fortran DOALL, or a serial region between parallel loops. Serial regions can be nested serial loops and/or those parts of serial loops ....
....the first and second epochs will not change the situation. If A(2) were written on p i between the second and third epochs (not shown in the figure) it would again be valid (on p i ) without coherence control. Numerous previous authors have tried to capture this notion in various analytical ways [4, 6, 9]. Here we simply note the dynamic behavior that causes staleness without addressing its detection at compile time. Coherence is maintained by making sure that values are communicated between caches when necessary. Values are updated from cache to main memory either by using write thru cache or ....
[Article contains additional citation context not shown here]
H. Cheong and A. Veidenbaum. Compiler-directed cache management for multiprocessors. Computer, 23(6):39--47, June 1990.
....as few as three additional instructions for each shared memory write. Maintaining coherence related information increases this overhead to about eleven instructions per shared memory write. Coherence for distributed memory with per processor caches can also be maintained entirely by a compiler [5]. Under this approach the compiler inserts the appropriate cache flush and invalidation instructions in the code, to enforce data consistency. The static nature of the approach, however, and the difficulty of determining access patterns for arbitrary programs, often dictates conservative decisions ....
H. Cheong and A. V. Veidenbaum. Compiler-Directed Cache Management in Multiprocessors. Computer, 23(6):39--47, Jun. 1990.
....extra hardware to overlap coherence processing and computation (possibly at the expense of extra coherence traffic) in order to avoid a higher waiting penalty at synchronization operations. Coherence for distributed memory with per processor caches can also be maintained entirely by a compiler [8]. Under this approach the compiler inserts the appropriate cache flush and invalidation instructions in the code, to enforce data consistency. The static nature of the approach, however, and the difficulty of determining access patterns for arbitrary programs, often dictates conservative decisions ....
H. Cheong and A. V. Veidenbaum. Compiler-Directed Cache Management in Multiprocessors. Computer, 23(6):39--47, June 1990.
....data, the corresponding TLB entry is loaded again from the OTI register. Thus each subsequent reference to a stale cache line is treated as a miss because the OTI entry pair does not match. Version Control scheme: Cheong and Veidenbaum proposed a static scheme called the Version Control scheme [43]. In this scheme, data dependency between two tasks is represented by a directed graph. Two tasks on the same graph level are independent and can be run on two processors without any coherence check. To move from one level to another, a synchronisation action is needed. Each task that writes to a ....
H. Cheong and V. Veidenbaum. Compiler-Directed Cache Management in Multiprocessors. IEEE Computer, 23(6):39--48, June 1990.
....using either a hardware or software mechanism [21]. Several schemes [12, 16, 30, 34, 36] have been proposed for bus based systems. For more scalable systems that use general interconnection networks, directory based hardware schemes [2, 3, 5, 15, 35, 38] and compiler assisted software schemes [7, 18, 17, 25, 37] have been suggested. Recently, several authors have proposed dynamically tagged directories [6, 14, 22, 23, 24, 30] in which pointers to processors with a copy of a memory block are allocated only when the block is actually cached. These directories maintain a cache of pointers in each memory ....
....implemented our algorithm for marking scalar references only. All array references are conservatively marked as needing a directory pointer. We also briefly discuss how to adapt our algorithm to mark array references, given array data flow information. 1.1 Background Software coherence schemes [7, 18, 17, 25, 37] analyze the source program at compile time to predict memory reference behavior and potential cache incoherence. Software schemes typically insert special instructions in the program to invalidate a cache line before it is referenced, or to clear the whole cache at appropriate times. However, due ....
H. Cheong and A. V. Veidenbaum. Compiler-directed cache management in multiprocessors. Computer, 23(6):39--47, June 1990.
....perfectly, read requests always find the block in state Idle or Shared, and write requests always find the block in state Idle. Previous self invalidation techniques rely on memory system directives inserted by the compiler, profile based tools, or the programmer. Compiler directed coherence [10, 16, 18, 54] eliminates the directory, placing the entire burden of maintaining cache coherence on the compiler. Unfortunately, this technique requires sophisticated analysis, and has only been demonstrated to work well for regular scientific applications and one-word cache blocks. 1. It is possible to have ....
Hoichi Cheong and Alexander V. Veidenbaum. Compiler-Directed Cache Management in Multiprocessors. IEEE Computer, 23(6):39--48, June 1990.
....mechanisms can be easily added in software controlled protocols. 7. COMPARISON WITH OTHER PROJECTS Solving the data coherence problem by software means has been a topic of research for many years. Early proposals forced the programmer or the compiler to flush caches at synchronization points [5]. However, recent work has concentrated on system level approaches. A major difference among approaches is whether the hardware substrate is shared memory (NCC NUMA) or distributed (message passing). The first paper to trigger interest in system level approaches on top of a NCC NUMA hardware ....
H. Cheong and A.V. Veidenbaum. Compiler-directed Cache Management in Multiprocessors. IEEE Computer 23(6), pp. 39-47, June 1990.
No context found.
H. Cheong and V. Veidenbaum. Compiler-Directed Cache Management in Multiprocessors. IEEE Computer, 23(6):39--48, June 1990.
No context found.
Cheong H., Veidenbaum A.V., Compiler Directed Cache Management in Multiprocessors, IEEE Computer, 23(6):39-48, June 1990.
No context found.
H. Cheong and A. Veidenbaum. Compiler-directed cache management for multiprocessors. IEEE Computer. 23(6), pages 39-47, Jun 1990.