Darnell E., Mellor-Crummey J.M., Kennedy K., Automatic Software Cache Coherence through Vectorisation, Proc. of Int. Conf. on SuperComp., July 1992.
....s7 and s8. A more sophisticated form of analysis [4] based on reads after RDS (relaxed determining sequence) 2 examines each epoch for the first occurrence of an upwardly exposed read which could be preceded by an RDS. Such reads will require a special cache read to determine their validity. In [7], vectorisation is used to minimise the overhead of redundant invalidation when using RDS analysis. In our example, the data will be considered invalid on the read access to gg in s5, e in s6, e in s7, z in s8 and d in s9. Thus previous compiler-based schemes invalidate or consider stale between 3 and 21 of ....
Darnell E., Mellor-Crummey J.M., Kennedy K., Automatic Software Cache Coherence through Vectorisation, Proceedings of ICS, pages 129-138, July 1992.
....the effectiveness of software cache coherence schemes depends on the cache hit ratios and the invalidation overhead. To reduce the invalidation overhead, Cytron, Karlovsky, and McAulife (CKM) [12] attempt to minimize the amount of data to be invalidated, and Darnell, Mellor-Crummey, and Kennedy [13] apply vectorization techniques to eliminate redundant invalidations. The second scheme is called Software Cache Coherence through Vectorization (SCTV). Both of these schemes, however, only consider single parallel loops, so the cached data cannot be reused across parallel loops. Without ....
....size is two words, and each processor executes 100 iterations of each parallel loop. Then under the optimal scheme discussed above, the cache hit ratios for arrays A and B are one and 0.99 respectively. Table 1 indicates how various proposed schemes do relative to the optimal. The SCTV scheme [13] does not exploit cache affinity across parallel loops. It conservatively assumes that every variable (except read-only variables) may be modified by other processors, so upon entering each parallel loop, the cache lines corresponding to the variable are invalidated before the variable is ....
Ervan Darnell, John M. Mellor-Crummey, and Ken Kennedy. Automatic Software Cache Coherence through Vectorization. In Proc. International Conference on Supercomputing, 1992.
....in most cases to bring software cache coherence within sight of the hardware alternatives. [Footnote 2: We are speaking here of behavior-driven coherence mechanisms that move and replicate data at run time in response to observed patterns of program behavior, as opposed to compiler-based techniques [13, 15].] We also report on the impact of several architectural alternatives on the effectiveness of software coherence. These alternatives include the choice of write policy (write-through, write-back, write-through with a write-collect buffer) and the availability of a remote reference facility, ....
....coherence messages to propagate in the background of computation (possibly at the expense of extra coherence traffic) in order to avoid a higher waiting penalty at synchronization operations. Coherence for distributed memory with per-processor caches can also be maintained entirely by a compiler [13, 15]. Under this approach the compiler inserts the appropriate cache flush and invalidation instructions in the code to enforce data consistency. The static nature of the approach, however, and the difficulty of determining access patterns for arbitrary programs often dictate conservative decisions ....
E. Darnell, J. M. Mellor-Crummey, and K. Kennedy. Automatic Software Cache Coherence Through Vectorization. In 1992 ACM International Conference on Supercomputing, Washington, DC, July 1992.
....introduces too much complexity to the programming interface and is contrary to the fundamental principles underlying the shared memory abstraction. The compiler: Within the compiler, coherence is enforced by statically inserting flush and invalidation instructions into the program stream [18] 1] 1][24]. Among the different classes of coherence strategies, compiler-based solutions are the most attractive because they take the burden of coherence away from both the user and the architecture. From the user perspective, this implies a simpler programming interface. From the architectural ....
E. Darnell, J. Mellor-Crummey, and K. Kennedy. Automatic software cache coherence through vectorization. In Proceedings of Supercomputing, 1992.
....coherence messages to propagate in the background of computation (possibly at the expense of extra coherence traffic) in order to avoid a higher waiting penalty at synchronization operations. Coherence for distributed memory with per-processor caches can also be maintained entirely by a compiler [19, 21]. Under this approach the compiler inserts the appropriate cache flush and invalidation instructions in the code to enforce data consistency. The static nature of the approach, however, and the difficulty of determining access patterns for arbitrary programs often dictate conservative decisions ....
E. Darnell, J. M. Mellor-Crummey, and K. Kennedy. Automatic Software Cache Coherence Through Vectorization. In 1992 ACM International Conference on Supercomputing, Washington, DC, July 1992.
....speed and precision. Implementing and designing Ped also provided insight into the analysis and transformations and a testbed for experimenting with different automatic techniques. Indeed, Ped is proving to be a valuable platform for compiling for other parallel architectures as well [HKK 91, DKMC92, HK92], illustrating the usefulness of this type of tool for developing compilers for new types of architectures. At the same time, however, we pursued more advanced and general compiler techniques for shared memory multiprocessors. We first focused on generalizing existing compiler methods to ....
E. Darnell, K. Kennedy, and J. Mellor-Crummey. Automatic software cache coherence through vectorization. In Proceedings of the 1992 ACM International Conference on Supercomputing, Washington, DC, July 1992.
....which could access potentially stale data [13]. Later, Veidenbaum's approach was improved by Darnell, Mellor-Crummey, and Kennedy by moving aggregated cache invalidation instructions as close as possible to parallel loop boundaries, at the same time trying not to invalidate data unnecessarily [14]. In the scheme proposed by Lee et al. [30], the compiler divides programs into segments called epochs, and tags variables within an epoch as non-cacheable if they are shared and written, and cacheable otherwise. At the end of each epoch caches are invalidated, and write-back is used to keep main ....
Ervan Darnell, John M. Mellor-Crummey, and Ken Kennedy. Automatic Software Cache Coherence through Vectorization. In Proceedings of the 1992 International Conference on Supercomputing, pages 129--138, July 1992.
....In general, hardware-based protocols require special-purpose hardware that is difficult to design, requires additional chip real estate, and might be restricted to bus-based interconnections. Several cache coherence schemes based on compiler support have been proposed [Vei86, LYL87, CV88, Che92, DMCK92]. An important advantage of these software-based methods is that they need little hardware support: only a few instructions used by the compilers to invalidate and flush the cache. Previous research [OA89, AAHV91] shows that the performance of compiler-based software schemes is comparable to ....
Ervan Darnell, John M. Mellor-Crummey, and Ken Kennedy. Automatic Software Cache Coherence through Vectorization. In Proceedings of the 1992 International Conference on Supercomputing, pages 129--138, July 1992.
[Figure 2: Some elements of global array B read at R1 might be locally available as a result of R0.] ....of the existing software and hardware-based strategies for maintaining cache coherence. With compiler-directed strategies, for example [Vei86, CKM88, MB89, CV90, DMCK92], the cost of the technique is inversely proportional to the precision of the information available at compile time. By combining flow and dependence analysis, the techniques presented in this paper provide more detailed information on stale data than was previously available [CV88] ....
Ervan Darnell, John M. Mellor-Crummey, and Ken Kennedy. Automatic Software Cache Coherence Through Vectorization. In Proceedings of the International Conference on Supercomputing, pages 129--138, July 1992.
....data is cached by default. The compiler can override this default for sets of shared data by using directives or making private copies of data to be cached, but then it assumes the responsibility of maintaining coherence for these data. Compiler directed coherence algorithms [Vei86, CKM88, CV88, DMCK92] must be conservative. Therefore, the number of unnecessary invalidations that the compiler inserts is inversely proportional to the precision of the information available at compile time. Because the results of redundancy analysis provide more detailed information on stale data (equivalently, ....
Ervan Darnell, John M. Mellor-Crummey, and Ken Kennedy. Automatic Software Cache Coherence Through Vectorization. In Proceedings of the International Conference on Supercomputing, pages 129--138, July 1992.
....cache invalidation instructions at parallel loop boundaries in [Vei86]. This algorithm was improved by Darnell, Mellor-Crummey, and Kennedy by moving aggregated cache invalidation instructions as close as possible to parallel loop boundaries, but trying not to invalidate data unnecessarily [DMCK92]. In [LYL87] the compiler divides programs into segments called epochs, and tags variables within an epoch as non-cacheable if they are shared and written, and cacheable otherwise. At the end of each epoch caches are invalidated, and write-back is used to keep main memory updated. Similar ....
Ervan Darnell, John M. Mellor-Crummey, and Ken Kennedy. Automatic Software Cache Coherence through Vectorization. In Proceedings of the 1992 International Conference on Supercomputing, pages 129--138, July 1992.
....the usefulness of caches for remote memory access. For example, in the Cray T3D, the lack of a cache coherence mechanism forces each cache line loaded by a remote read either not to be cached (by an uncacheable load instruction) or to be flushed [16]. Several compiler-directed coherence schemes have been proposed [6, 8, 9, 12, 18]. In this approach, cache coherence is maintained locally without directory hardware, avoiding the complexity and the overhead associated with the hardware directories. Although the performance of such schemes has been demonstrated through simulations, most of those studies assume either perfect ....
E. Darnell, K. Kennedy, and J. Mellor-Crummey. Automatic software cache coherence through vectorization. Proceedings of the International Conference on Supercomputing, Nov. 1992.
....the effectiveness of pre-defined memory management policy. An alternative strategy is to control directly the movement of data across the different levels of the memory hierarchy. Such techniques have been studied by Cytron et al. [96], Gornish et al. [97], Callahan et al. [98], and Darnell et al. [99]. 3.2.5 Dependence Breaking Techniques. In this section we discuss the two transformations most frequently used to eliminate cross-iteration dependences. The first eliminates from a loop L all assignments to induction variables. The sequence of values of an induction variable is computed by means ....
E. Darnell, J. M. Mellor-Crummey, and K. Kennedy. Automatic Software Cache Coherence through Vectorization. In Proceedings of Int'l. Conf. on Supercomputing, pages 129--139, 1992.
....tradeoff is that dynamic strategies require some additional hardware support to handle marking bits. Static strategies require no hardware support other than the ability to invalidate cache lines under software control. Static strategies can be used on some existing machines, such as the BBN TC2000 [6, 7]. Figure 3 shows an example for which static strategies are inherently inefficient. [Figure 3: Dynamic versus static coherence. Three consecutive DOALL loops over I=1,N referencing A(I) and B(I).] The value of A written in the first epoch cannot be allowed to reach the third ....
....line or a particular page. The high-level invalidate would then loop over the proper range of pages and lines. Even though this would take O(|section|) time, acceptable performance could still be achieved. Previously, we examined the efficiency of this kind of invalidate for a static strategy [7]. A faster, but more complex, invalidate could work by using a bit mask to determine which addresses to invalidate. With only = comparators and no extra storage, a section could be invalidated in O(log |section|) time. Special layouts and strides could reduce this further. This is similar to ....
E. Darnell, J. Mellor-Crummey, and K. Kennedy. Automatic software cache coherence through vectorization. In Proceedings of 1992 International Conference on Supercomputing, July 1992. Also available as expanded Technical Report CRPC-TR92197, Center for Research on Parallel Computation, January 1992.
....in which this situation occurs. If the architecture does not permit such an allocation, some of the parallelism will have to be sacrificed for a completely automatic technique. This must be handled before the coherence graph is built. A more detailed treatment of the algorithm can be found in [6]. 4.3 Example. Consider a simple matrix multiply (figure 4) where the outer loop is parallel. The J2 assignment in statement 4 is added 1 PARALLEL DO I=1,N 2 J2=0 3 DO J=1,N 4 J2=J2+1 5 DO K=1,N 6 Inv A(I,K) 7 Inv B(K,J) 8 Inv C(I,J2) 9 C(I,J)=C(I,J2)+A(I,K)*B(K,J) 10 Upd C(I,J) 11 ....
E. Darnell, J. M. Mellor-Crummey, and K. Kennedy. Automatic software cache coherence through vectorization. Technical Report CRPC-TR92197, Computer Science Department, Rice University, Jan. 1992.