Vijay Karamcheti, John Plevyak, and Andrew A. Chien. Runtime Mechanisms for Efficient Dynamic Multithreading. Journal of Parallel and Distributed Computing, 37(1):21--40, August 1996.
....of proposed multithreaded architectures [1, 6, 16, 19]. Efficient software implementations have been proposed for commodity architectures that introduce a compromise between stack and heap allocation techniques. For example, Lazy Threads [8] are based on stacklets, and the Illinois Concert system [9] employs a hybrid stack-heap execution mechanism. The multithreaded Cilk language [7] exhibits a striking balance between versatility of threads, portability, and efficiency. Cilk's implementation is based on the cactus stack semantics proposed by Moses in 1970 [12]. As operating systems and ....
....such as Mul-T [11], Id [13], or the parallel Haskell dialect pH [14]. A variety of approaches to low-overhead implementations of multithreaded languages have been studied on commodity computers. I discuss only a subset of them and refer the reader to the papers cited therein. Both papers [8] and [9] discuss a large body of related work. Lazy Threads [8] extend the work on the Threaded Abstract Machine (TAM) [5], a compilation target for parallel nonstrict functional languages. Lazy Threads are based on compiler support which implements customized memory management of activation frames with ....
[Article contains additional citation context not shown here]
Vijay Karamcheti, John Plevyak, and Andrew A. Chien. Runtime Mechanisms for Efficient Dynamic Multithreading. Journal of Parallel and Distributed Computing, 37(1):21--40, August 1996.
....of the system at a high level. In a later section, we present descriptions of the primitives in more detail as well as some performance measurements. 2.3.1 System Design The following is a brief description of the Illinois Concert System implementation. More in-depth descriptions are available at [15] and [7]. Keep in mind that these concepts are at the runtime system level, and are not exposed to the user of the system. The Illinois Concert System uses a dynamic multithreading execution model. The following terms are defined for purposes of our description. - Address Space: An address ....
....the system allows the user to elide explicit freeing from code. In executions of our SAMR code, we have seen that garbage collection takes up an insignificant amount of time, even for the largest hierarchies we create. 4.2.3 Thread Scheduling In addition to the runtime support detailed in [15], our shared memory version of the Concert runtime system has the additional dimension of multiple workers per address space. A worker is a kernel-level thread of execution which executes Concert threads (e.g. units of work). Locally enqueuing invokes and remote invokes create parallel work that ....
Vijay Karamcheti, John Plevyak, and Andrew A. Chien. Runtime mechanisms for efficient dynamic multithreading. Journal of Parallel and Distributed Computing, 37(1):21--40, 1996. Available from http://www-csag.cs.uiuc.edu/papers/rtperf.ps.
....For example, in a divide-and-conquer computation (such as quicksort) where a new thread is forked for each recursive call, a thread shares data with all its descendant threads. Therefore, many parallel implementations of lightweight threads use per-processor data structures to store ready threads [17, 20, 24, 25, 39, 42, 44]. Threads created on a processor are stored locally and moved only when required to balance the load. This technique effectively increases scheduling granularity, and therefore provides good locality [7] and low scheduling contention. Another approach for obtaining good locality is to allow the ....
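The per-processor scheme described in this excerpt can be sketched as a small single-threaded simulation. This is only an illustration under stated assumptions, not the mechanism of any cited system: the names `Processor`, `spawn`, `next_task`, and `run` are hypothetical, tasks are plain strings, and the policy of running the newest local task while moving the oldest task from a loaded processor is one common choice among several.

```python
from collections import deque


class Processor:
    """One simulated processor with its own queue of ready threads."""

    def __init__(self, pid):
        self.pid = pid
        self.ready = deque()

    def spawn(self, task):
        # Newly created threads stay on the creating processor.
        self.ready.append(task)

    def next_task(self, others):
        # Prefer local work, newest first (assumed policy).
        if self.ready:
            return self.ready.pop()
        # Only when idle, move one thread from a loaded processor,
        # taking its oldest entry.
        for victim in others:
            if victim.ready:
                return victim.ready.popleft()
        return None


def run(processors):
    """Round-robin simulation; returns {pid: [tasks run, in order]}."""
    log = {p.pid: [] for p in processors}
    progress = True
    while progress:
        progress = False
        for p in processors:
            others = [q for q in processors if q is not p]
            task = p.next_task(others)
            if task is not None:
                log[p.pid].append(task)
                progress = True
    return log


procs = [Processor(0), Processor(1)]
for name in ["a", "b", "c", "d"]:
    procs[0].spawn(name)      # all work is created on processor 0
result = run(procs)
```

In this run processor 0 keeps working from its own queue, and processor 1 receives threads only because it is idle, which is the "moved only when required to balance the load" behaviour. A real runtime would also need synchronization on the queues, which the simulation omits.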
V. Karamcheti, J. Plevyak, and A. A. Chien. Runtime mechanisms for efficient dynamic multithreading. J. Parallel and Distributed Computing, 37(1):21--40, August 1996.
....due to unpredictable latencies. An instantiation of the code block running on a processing node is called a thread, thus the name multithreading for these systems. Threads, and not individual instructions, are enabled by synchronization signals. The central idea behind many multithreaded models [7, 11, 19, 27, 43, 46, 113, 114, 157, 96, 167] is to allow the execution of these threads (code blocks) to overlap with communication and synchronization latencies. Around the same time that architectures derived from the dataflow model were proposed, the term thread started to be used to refer to multiple contexts of computation in ....
....of threads. In EARTH, the ready queue (FIFO) and the token queue (DEQUE) are used for local and remote scheduling of threads, whereas complex entry and exit codes have to be generated for each quantum by the compiler in TAM. 8.3.3 The Illinois Concert C++ Language The Concert runtime system [95, 96] proposes close coupling with the compiler and hardware to overcome overheads associated with thread management and communication in a distributed-memory environment, especially when dealing with fine-grain threads for dynamic and irregular applications. The hybrid stack-heap execution mechanism ....
Vijay Karamcheti, John Plevyak, and Andrew A. Chien. Runtime Mechanisms for Efficient Dynamic Multithreading. In Journal of Parallel and Distributed Computing, volume 37, pages 21--40, Aug. 1996.
....a global pointer and waiting for its completion) 0 Word Simple (no thread switches at the sender ....
4 The CC version of these applications is heavily based on the original Split-C implementations [23] to allow for a fair comparison.
Split-C definitions: double lx; double global gpY; double lA[20]; void global gpA;
  0 Word: N/A
  1 Word: N/A
  2 Word: N/A
  0 Word Atomic RPC: atomic(foo, 0)
  GP 2 Word Read: lx = gpY;
  GP 2 Word Write: gpY = lx;
  Bulk Read: bulk read( lA,gpA,20 sizeof(double)
  Bulk Write: bulk write(gpA, lA,20 sizeof(double)
Figure 6. Split-C ....
....and object caching, that support efficient multithreading through a stack-heap execution model and dynamic load balancing, and that provide efficient communication. The performance of RPC using Concert is documented in the context of two languages: Concurrent Aggregates (CA) [19] and ICC++ [20]. The round trip time of a 2-word RPC in CA running on the CM-5 with 33 MHz Sparc .... Active Messages. Most of the overhead is due to activation frame creation, frame scheduling, ....
V. Karamcheti, J. Plevyak, and A. Chien. Runtime Mechanisms for Efficient Dynamic Multi-threading. In Journal of Parallel and Distributed Computing (JPDC), 37(1):21-40, 1996.
....For example, in a divide-and-conquer computation (such as quicksort) where a new thread is forked for each recursive call, a thread shares data with all its descendant threads. Therefore, many parallel implementations of lightweight threads use per-processor data structures to store ready threads [18, 21, 25, 26, 41, 43, 45]. Threads created on a processor are stored locally and moved only when required to balance the load. This technique effectively increases scheduling granularity, and therefore provides good locality [7] and low scheduling contention. Another approach for obtaining good locality is to allow the ....
V. Karamcheti, J. Plevyak, and A. A. Chien. Runtime mechanisms for efficient dynamic multithreading. J. Parallel and Distributed Computing, 37(1):21--40, August 1996.
....other hand, if a messaging layer's guarantees are too strong (i.e. they provide more functionality than is generally needed), the messaging layer's common-case performance may be needlessly degraded. Analysis of the literature and our ongoing studies to support fine-grained parallel computing [5, 12, 13, 14] have led to the conclusion that a low-level messaging layer should provide the following key guarantees: reliable delivery, in-order delivery, and control over scheduling of communication work (decoupling). As mentioned in the previous section, studies of communication software costs [12] ....
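The three guarantees named in this excerpt can be illustrated with a toy sender and receiver. This is a minimal sketch under stated assumptions, not the interface of Fast Messages or any other real messaging layer: the classes `Sender` and `Receiver`, the methods `on_packet`, `extract`, `ack`, and `retransmit`, and the use of sequence numbers with cumulative acknowledgements are all hypothetical choices made for the example.

```python
class Receiver:
    """Receiving endpoint that delivers messages in order, on demand."""

    def __init__(self):
        self.expected = 0   # next sequence number to deliver
        self.pending = {}   # out-of-order arrivals, keyed by sequence number
        self.ready = []     # in-order payloads awaiting extract()

    def on_packet(self, seq, payload):
        # Called by the (simulated) network. Duplicates caused by
        # retransmission are dropped; the return value acts as a
        # cumulative acknowledgement.
        if seq >= self.expected and seq not in self.pending:
            self.pending[seq] = payload
        while self.expected in self.pending:
            self.ready.append(self.pending.pop(self.expected))
            self.expected += 1
        return self.expected

    def extract(self, handler):
        # Decoupling: handlers run only when the application asks.
        delivered = len(self.ready)
        for payload in self.ready:
            handler(payload)
        self.ready = []
        return delivered


class Sender:
    """Sending endpoint that keeps each message until it is acknowledged."""

    def __init__(self):
        self.next_seq = 0
        self.unacked = {}

    def send(self, payload):
        seq = self.next_seq
        self.unacked[seq] = payload
        self.next_seq += 1
        return seq

    def ack(self, upto):
        # Everything below `upto` has been received.
        for seq in [s for s in self.unacked if s < upto]:
            del self.unacked[seq]

    def retransmit(self):
        return sorted(self.unacked.items())


sender, receiver = Sender(), Receiver()
for word in ["x", "y", "z"]:
    sender.send(word)

# Simulated network: packet 0 is lost, and packet 2 arrives before packet 1.
receiver.on_packet(2, "z")
sender.ack(receiver.on_packet(1, "y"))
seen = []
early = receiver.extract(seen.append)   # nothing is deliverable yet
for seq, payload in sender.retransmit():
    sender.ack(receiver.on_packet(seq, payload))
late = receiver.extract(seen.append)
```

Retransmission stands in for reliable delivery, the `pending` buffer for in-order delivery, and the explicit `extract` call for control over when communication work is scheduled.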
V. Karamcheti, J. Plevyak, and A. A. Chien. Runtime mechanisms for efficient dynamic multithreading. Journal of Parallel and Distributed Computing, 37(1):21-40, 1996. Available from http://www-csag.cs.uiuc.edu/papers/rtperf.ps.
....by the programmer or automatically by a compiler or runtime system. The Illinois Concert System is a programming environment that harnesses the benefits of COOP, with a goal of high performance. It consists of the ICC++ language [12], Concert compiler [4, 15, 3], and the Concert runtime system [9]. Concert supports fine-grained, concurrent object-oriented programming on Actors [1]. Computation is expressed as method invocations on objects or collections of objects. Concurrent method invocations operate against state stored in dynamically created thread data structures. Synchronization of ....
Vijay Karamcheti, John Plevyak, and Andrew A. Chien. Runtime mechanisms for efficient dynamic multithreading. Journal of Parallel and Distributed Computing, 37(1):21--40, 1996. Available from http://www-csag.cs.uiuc.edu/papers/rtperf.ps.
....the other hand, if a messaging layer's guarantees are too strong (i.e. they provide more functionality than is generally needed), the messaging layer's common-case performance may be needlessly degraded. Analysis of the literature and our ongoing studies to support fine-grained parallel computing [5, 14-16] have led to the conclusion that a low-level messaging layer should provide the following key guarantees: ....
[Page header: M. Lauria, S. Pakin, A. Chien. Efficient Layering: MPI over FM]
[Figure, panel (a): Bandwidth (MB/s) versus Msg Size; legend: Link Mgmt, I/O Bus Mgmt, Flow Control] ....
Vijay Karamcheti, John Plevyak, and Andrew A. Chien. Runtime mechanisms for efficient dynamic multithreading. Journal of Parallel and Distributed Computing, 37(1):21--40, 1996. Available from http://www-csag.cs.uiuc.edu/papers/rtperf.ps.
....using application information) 4.4 Efficient Dynamic Multithreading We exploit close coupling between the compiler and runtime systems to optimize logical threads in our nonbinding concurrency model with respect to both sequential and parallel efficiency. Our hybrid stack-heap execution model [28, 22] provides a flexible runtime interface to the compiler, shown in Table 1, allowing it to generate code which optimistically executes a logical thread sequentially on its caller's stack, lazily creating a different thread only when the callee computation needs to suspend or be scheduled separately. ....
....For example, robust communication is important at small numbers of processors when communication traffic is high, and load balancing is essential for large numbers of processors. Space limitations prevent us from a detailed analysis for the other applications; the reader is referred elsewhere [38, 22] for additional details. 6 Related Work The Concert system is related to a wide variety of work on concurrent object-oriented languages that can be loosely classified as actor-based, task-parallel, and data-parallel. Actor-based languages [1, 17, 36, 25] are most similar in terms of high level ....
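The idea of running a logical thread on the caller's stack and creating a separate context only when the callee must suspend can be sketched as follows. This is a loose illustration under stated assumptions and not the Concert runtime interface (the Table 1 mentioned in the excerpt is not reproduced here): `Future`, `HeapContext`, and `invoke` are hypothetical names, and because Python allocates every object on the heap, the counter `HeapContext.created` merely stands in for the cost of lazily creating a thread context.

```python
class Future:
    """A value that may not be available yet."""

    def __init__(self):
        self.ready = False
        self.value = None
        self.waiters = []   # contexts suspended on this value

    def resolve(self, value):
        self.ready, self.value = True, value
        waiters, self.waiters = self.waiters, []
        for ctx in waiters:
            ctx.resume(value)


class HeapContext:
    """Saved continuation, created only when a callee must suspend."""

    created = 0

    def __init__(self, continuation, reply):
        HeapContext.created += 1
        self.continuation = continuation
        self.reply = reply

    def resume(self, value):
        self.reply.resolve(self.continuation(value))


def invoke(fn, arg):
    """Run the logical thread fn(arg.value); returns a Future for its result."""
    reply = Future()
    if arg.ready:
        # Optimistic path: an ordinary call on the caller's stack.
        reply.resolve(fn(arg.value))
    else:
        # Fallback: the callee would block, so save a continuation instead.
        arg.waiters.append(HeapContext(fn, reply))
    return reply


available = Future()
available.resolve(20)
fast = invoke(lambda v: v + 1, available)   # completes immediately

missing = Future()
slow = invoke(lambda v: v * 2, missing)     # input not there yet
suspended = not slow.ready
missing.resolve(4)                          # resumes the saved context
```

Only the second invocation pays for a saved context; the first behaves like a sequential procedure call, which is the common case the excerpt describes optimizing for.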
Vijay Karamcheti, John Plevyak, and Andrew A. Chien. Runtime mechanisms for efficient dynamic multithreading. Journal of Parallel and Distributed Computing, 37(1):21--40, 1996. Available from http://www-csag.cs.uiuc.edu/papers/rtperf.ps.
....issues. Specific techniques include:
- Aggressive flow-sensitive interprocedural analysis [17, 18]
- Directed cloning and optimization (procedure and object inlining) [6, 20]
- Compiler-managed locality and memory latency management [27]
- Efficient, robust communication primitives [12, 14]
- Hybrid stack-heap execution (efficient multithreading) [14, 19]
- View Caching [13]
1.2 Application Suite In this paper, we use a suite of seven irregular applications to evaluate parallel programming support in ICC++. Table 1 briefly describes the applications. Although spanning ....
.... interprocedural analysis [17, 18]
- Directed cloning and optimization (procedure and object inlining) [6, 20]
- Compiler-managed locality and memory latency management [27]
- Efficient, robust communication primitives [12, 14]
- Hybrid stack-heap execution (efficient multithreading) [14, 19]
- View Caching [13]
1.2 Application Suite In this paper, we use a suite of seven irregular applications to evaluate parallel programming support in ICC++. Table 1 briefly describes the applications. Although spanning diverse computational domains, these applications share common ....
Vijay Karamcheti, John Plevyak, and Andrew A. Chien. Runtime mechanisms for efficient dynamic multithreading. Journal of Parallel and Distributed Computing, 37(1):21--40, 1996. Available from http://www-csag.cs.uiuc.edu/papers/rtperf.ps.
....using application information) 3.4 Efficient Dynamic Multithreading We exploit close coupling between the compiler and runtime systems to optimize logical threads in our non-binding concurrency model with respect to both sequential and parallel efficiency. Our hybrid stack-heap execution model [33, 26] provides a flexible runtime interface to the compiler, shown in Table 1, allowing it to generate code which optimistically executes a logical thread sequentially on its caller's stack, lazily creating a different thread only when the callee computation needs to suspend or be scheduled separately. ....
....For example, robust communication is important at small numbers of processors when communication traffic is high, and load balancing is essential for large numbers of processors. Space limitations prevent us from a detailed analysis for the other applications; the reader is referred elsewhere [44, 26] for additional details. 5 Related Work The Concert system is related to a wide variety of work on concurrent object-oriented languages that can be loosely classified as actor-based, task-parallel, and data-parallel.
[Table fragment: PROGRAM INPUT Grobner pavelle5 [4] Grobner basis IC Cedar Myoglobin ....]
V. Karamcheti, J. Plevyak, and A. A. Chien. Runtime mechanisms for efficient dynamic multithreading. Journal of Parallel and Distributed Computing, 37:21--40, 1996.
....the other hand, if a messaging layer's guarantees are too strong (i.e. they provide more functionality than is generally needed), the messaging layer's common-case performance may be needlessly degraded. Analysis of the literature and our ongoing studies to support fine-grained parallel computing [5, 14-16] have led to the conclusion that a low-level messaging layer should provide the following key guarantees:
- Reliable delivery,
- In-order delivery, and
- Control over scheduling of communication work (decoupling)
As mentioned in the previous section, studies of communication software ....
Vijay Karamcheti, John Plevyak, and Andrew A. Chien. Runtime mechanisms for efficient dynamic multithreading. Journal of Parallel and Distributed Computing, 37(1):21--40, 1996. Available from http://www-csag.cs.uiuc.edu/papers/rtperf.ps.
....range of irregular applications. Our application study is done in the context of the Illinois Concert system, a high-performance compiler and runtime for parallel computers which has been the vehicle for extensive research on compiler optimization and runtime techniques over the past five years [9, 32, 29, 30, 33, 28, 31, 12, 44, 23, 22]. While no system contains all known optimizations, the Concert system contains a wide range of aggressive optimizations, and has been used to demonstrate high performance in absolute terms on a wide range of applications [23, 45, 10]. In effect, the Concert system automatically addresses many of ....
.... over the past five years [9, 32, 29, 30, 33, 28, 31, 12, 44, 23, 22]. While no system contains all known optimizations, the Concert system contains a wide range of aggressive optimizations, and has been used to demonstrate high performance in absolute terms on a wide range of applications [23, 45, 10]. In effect, the Concert system automatically addresses many of the concerns which programmers explicitly manage in lower level programming models (e.g. explicit threads or message passing). Using a rich suite of irregular parallel applications, and a mature Concert system, we evaluate the ....
[Article contains additional citation context not shown here]
Vijay Karamcheti, John Plevyak, and Andrew A. Chien. Runtime mechanisms for efficient dynamic multithreading. Journal of Parallel and Distributed Computing, 37(1):21--40, 1996. Available from http://www-csag.cs.uiuc.edu/papers/rtperf.ps.
....the other hand, if a messaging layer's guarantees are too strong (i.e. they provide more functionality than is generally needed), the messaging layer's common-case performance may be needlessly degraded. Analysis of the literature and our ongoing studies to support fine-grained parallel computing [12, 28, 29, 30] have led to the conclusion that a low-level messaging layer should provide the following key guarantees:
- Reliable delivery,
- Ordered delivery, and
- Control over scheduling of communication work (decoupling)
Previous studies of communication cost in the CM-5 multicomputer system [28] ....
....cost (SHMEM Put). (While it may appear that Pull FM is unnecessary, since Push FM exhibits superior latency and bandwidth, it performs much better than Push FM in the presence of heavy network traffic, especially when communication patterns are irregular. See [29] for details.) The Concert runtime [30], for instance, uses Pull FM exclusively because the communication irregularity inherent to Concert programs makes communication robustness dominate overall performance more than baseline latency or bandwidth. The two implementations of FM on the T3D demonstrate the advantages of a well-defined ....
Vijay Karamcheti, John Plevyak, and Andrew A. Chien. Runtime mechanisms for efficient dynamic multithreading. Journal of Parallel and Distributed Computing, 37:21--40, 1996.