| H. F. Jordan. Performance measurement on HEP---A pipelined MIMD computer. Proc. of the 10th Annual Int. Symp. on Comp. Arch., Stockholm, Sweden, June 1983. |
....Engineers, Inc. instruction. However, once a context is activated, its execution is controlled by the availability of operands and processors (data flow). The data flow paradigm is used here to support medium-grain parallelism. This architecture has a number of similarities with the HEP machine [9]. Both architectures have tagged memory locations. The HEP architecture uses these tags to interlock pipeline stages. In the queue machine, the tags are used to synchronize contexts. The HEP processing element executes a number of tasks simultaneously, effectively doing a context switch on each ....
Harry F. Jordan, "Performance Measurements on HEP---A Pipelined MIMD Computer," Conference Proceedings of the 10th Annual Symposium on Computer Architecture, June 1983, pp. 207-212.
....execution model on any shared memory multiprocessor system that supports either C or Java. V. Related Work A variety of multithreaded architectures have been proposed both to tolerate long memory delays and to increase the total number of instructions that can be issued in each cycle. The HEP [26], Horizon [27] and Tera [28] machines, for instance, maintain hundreds of thread contexts in each processor to allow switching between threads every cycle. With no data caches, this approach allows the processors to tolerate long memory delays. Instead of switching contexts each cycle, a processor ....
Harry F. Jordan, "Performance measurements on HEP --- a pipelined MIMD computer," in Proceedings of the 10th Annual International Symposium on Computer Architecture, June 13-- 17, 1983, pp. 207--212.
....time the lock is unset, each processor makes at least two bus accesses (one for Test and one for Test&Set) but only one processor is successful in setting the lock. The synchronization primitives provided in the HEP multiprocessor operate on a Full/Empty bit associated with each word in memory [Jord83]. The bit is tested before a read or write operation if a special symbol is prepended to the variable name. The read or write operation blocks until the test succeeds. When the test succeeds, the bit is set to the opposite value, indivisibly with the read or write operation. These primitives are ....
Jordan, H. F., "Performance Measurements on HEP -- a Pipelined MIMD Computer," Proceedings of the 10th Annual International Symposium on Computer Architecture, June 1983, pp. 207-212.
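The Full/Empty behaviour described in the excerpt above can be modelled in software. What follows is a minimal Python sketch, not HEP code: the class `FullEmptyCell`, its method names, and the `demo` helper are hypothetical, and a `threading.Condition` stands in for the hardware's indivisible test-and-flip of the bit.

```python
import threading


class FullEmptyCell:
    """Software model of one memory word with a full/empty bit (illustrative only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False  # the full/empty bit; the word starts empty
        self._value = None

    def write_when_empty(self, value):
        # Block until the bit reads "empty", then store the value and flip
        # the bit to "full" while holding the lock, so the test and the
        # write appear indivisible to other threads.
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_when_full(self):
        # Block until the bit reads "full", then fetch the value and flip
        # the bit back to "empty" under the same lock.
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            value = self._value
            self._full = False
            self._cond.notify_all()
            return value


def demo(n=5):
    """Pass n integers from a producer thread to a consumer through one cell."""
    cell = FullEmptyCell()
    received = []

    def producer():
        for i in range(n):
            cell.write_when_empty(i)

    def consumer():
        for _ in range(n):
            received.append(cell.read_when_full())

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Because a write can only follow a read (and vice versa), the single cell acts as a one-slot producer-consumer channel, and the values arrive in the order they were written.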
....of varying sizes, ranging from rather small to quite large. This suggests a large amount of memory in the computer. However, memory access speed slows down as memory size grows. Thus, we try to exploit some program properties to speed up program execution on general purpose computers [DCC 87, Jor83, LLG 90, PC90]. A well-known heuristic property is locality of reference in a program: Data that has been referenced recently or data that is near recently referenced data will tend to be referenced soon. Another intrinsic program property is parallelism in data processing. Parallelism exists ....
Harry F. Jordan. Performance measurements on HEP --- a pipelined MIMD computer. In Proceedings of the 10th Annual International Symposium on Computer Architecture, pages 207--212. Computer Architecture News, 11(3), June 1983.
....In dataflow programs, latency determines the delay between the computation of a data value and the time when the value can actually be used. Data parallel operations are limited by the rate at which processors can obtain access to the data on which they need to operate. Multithreaded ([Smi78], [Jor83], [ALKK90], [SBCvE90], [CSS 91], [NPA92]) and dataflow ([ACM88], [AI87], [PC90]) architectures have been developed to mitigate communication latency by hiding its effects. These techniques all rely on an abundance of parallelism to provide useful processing to perform while waiting on slow ....
H. F. Jordan. Performance Measurement on HEP -- A pipelined MIMD Computer. In Proceedings of the 10th Annual International Symposium on Computer Architecture. IEEE, June 1983.
.... swap, for the IBM 370[5] are read-modify-write operations. As opposed to spin lock, suspend lock employs interprocessor interrupts. If its first test-and-set fails, a processor waits for an interrupt[5]. Less general than read-modify-write is the full/empty tag on each word in HEP memory[12]. The tag can be tested before a producer-consumer write or read operation: only a full word can be read; only an empty one, written. If the test succeeds, the tag value is reversed and the operation is performed. An expanded set of memory tags is included for fast synchronization in the new Tera ....
H.F. Jordan. Performance Measurements on HEP - a Pipelined MIMD Computer. 10th Int. Symp. on Comp. Arch., 207--212, June 1983.
.... and list scheduling the instructions as if optimizing for a pipelined processor (average latencies can be assumed by the scheduler). The thread scheduling policy of a multiple-context processor lies between two extremes: interleaving, that is, context switching on every instruction, as in HEP [Jord83] and P-RISC [NiAr89], and blocking, that is, executing each thread to completion or suspension before switching to another context, as in Alewife and T. Interleaving can potentially lower context switch costs and, given enough active contexts, completely hide the latency of even intermediate-length ....
....A context comprises a file of local general purpose registers, instruction fetch and dispatch logic, and a branch unit for executing control instructions. Duplicating contexts to support a number of active threads is not excessively costly, given that several previous processors, including HEP [Jord83], MASA [HaFu88], PRISC 1 [ShNi89] and Sparcle [Agar 93], have also supported multiple register files or register cache and multiple instruction pointers. Concurro has a relatively minor additional burden of a small instruction buffer, simple pre-decoder, and control logic for each context. It is ....
[Article contains additional citation context not shown here]
H.F. Jordan, "Performance measurements on HEP---a pipelined MIMD computer," Proc. 10th Ann. Int'l Symp. on Computer Architecture, pp. 207--212, June 1983.
....enforce access to critical sections. Special atomic synchronization primitives such as Test&Set and Unset can be provided to acquire and release locks. In addition to critical sections, there are numerous other paradigms for synchronizing processes and sharing data. In barrier synchronization [Jor83], a number of processes may wish to guarantee that all have reached a specific point in their execution before any can proceed. As a second example, processes may wish to perform enqueue and dequeue operations in parallel on a queue whose entries represent separate units of work. Although numerous ....
H. F. Jordan. "Performance Measurements on HEP -- a Pipelined MIMD Computer". In Proceedings of the 10th Annual International Symposium on Computer Architecture, pages 207--212. ACM, June 1983.
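The barrier synchronization mentioned in the excerpt above can be illustrated with Python's standard `threading.Barrier`. This is a small sketch under stated assumptions, not the HEP mechanism: the helper name `run_barrier_demo` and the log format are hypothetical and exist only for this illustration.

```python
import threading


def run_barrier_demo(n_workers=4):
    """Each worker logs 'before', waits at the barrier, then logs 'after'."""
    barrier = threading.Barrier(n_workers)  # releases once n_workers have arrived
    log = []
    log_lock = threading.Lock()  # protects the shared log list

    def worker(worker_id):
        with log_lock:
            log.append(("before", worker_id))
        # No worker gets past this call until all n_workers have reached it.
        barrier.wait()
        with log_lock:
            log.append(("after", worker_id))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Since every worker records its "before" entry prior to calling `barrier.wait()`, all "before" entries appear in the log ahead of any "after" entry, which is the guarantee the excerpt describes.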
....Both fast context switching and prefetching in superscalar processors are active topics of research. Fast context switching hides latency by switching to another task while the request is outstanding. Examples of custom multiprocessor designs with fast context switching mechanisms include the HEP [Jor83] and the Tera [Smi90] machines. Prefetching hides latency by initiating the remote request far enough in advance of its use. Prefetching has met with mixed success. While some programs greatly benefit [ABC 95], others do not [MG91]. It is important to point out that fast context switching ....
Harry F. Jordan. Performance measurements on HEP - a pipelined MIMD computer. In Proceedings of the 10th Annual International Symposium on Computer Architecture, pages 207--212, Stockholm, Sweden, June 1983. (Also published as SIGARCH Newsletter, Volume 11, Number 3, June 1983.).
....aggressive data communication and synchronization mechanisms between threads to exploit more fine-grained parallelism. In addition, multiple functional units can be shared among threads for better utilization. Many concurrent multiple threaded processor architectures have been proposed and studied [1, 3, 4, 5, 6, 7, 8, 10, 11, 12, 14, 15, 16, 19]. Some of them [4, 7, 11, 14] are primarily for increasing system throughput by allowing multiple programs (one program for each thread) to be run concurrently. In this paper, we focus on models that are primarily for speeding up the execution of one single program. Among them, models such as ....
....one single program. Among them, models such as Simultaneous Multithreading [16] and SPSM [3] allow tasks that are independent, such as the iterations of a do-all loop, to be executed in parallel. This restriction can simplify the design, but limits the exploitable parallelism. Models such as HEP [10], Tera [1], XIMD [19], Elementary Multithreading [8] and M-machine [5, 12] allow data synchronization and communication between threads. These models rely on compilers to detect dependences between threads, and to insert explicit data synchronization and communication commands in a program. They ....
Harry F. Jordan. Performance measurements on HEP --- a pipelined MIMD computer. In Proceedings of the 10th Annual International Symposium on Computer Architecture, pages 207--212, June 13--17, 1983.
....a key, and a value. The synchronization instruction consists of an evaluation of a condition, followed by an operation on the key and the value. A special processor is needed at each memory module in order to implement the generalized synchronization mechanism. The HEP (developed by Denelcor [49]) associates a Full/Empty bit with every memory location, as does the April processor [3]. The bit is tested before a read or write operation if a special symbol is prepended to the variable name. The operation blocks until the test succeeds, at which time the Full/Empty bit is complemented ....
H. F. Jordan. Performance Measurements on HEP - a Pipelined MIMD Computer. In Proceedings of the 10th Annual International Symposium on Computer Architecture, pages 207--212, June 1983.
....control transfer, scheduling, and synchronization involved. Much previous work has sought to reduce this cost by using a combination of compiler techniques and clever runtime representations [16, 36, 44, 48, 49, 53, 57, 61, 63] or by supporting fine-grained parallel execution directly in hardware [3, 34, 50]. These approaches, among others, have been used in implementing parallel programming languages such as ABCL [65], CC [13], Charm [35], Cid [48], Cilk [7], Concert [36], Id90 [16, 49], Mul-T [39] and Olden [12]. In some cases, the cost of the fork is reduced by severely restricting what can be ....
....parallelism is actually needed. Chapter 8 Related Work. Attempts to accommodate logical parallelism include thread packages [20, 54, 14, 28], compiler techniques and clever runtime representations [16, 49, 44, 63, 61, 53, 30], and direct hardware support for fine-grained parallel execution [34, 3]. These approaches have been used to implement many parallel languages, e.g. Mul-T [39], Id90 [16, 49], CC [13], Charm [35], Opus [43], Cilk [7], Olden [12] and Cid [48]. The common goal is to reduce the overhead associated with managing the logical parallelism. While much of this work overlaps ....
H. F. Jordan. Performance measurement on HEP --- a pipelined MIMD computer. In Proc. of the 10th Annual Int. Symp. on Comp. Arch., Stockholm, Sweden, June 1983.
....data transfer, scheduling, and synchronization involved. Previous work has sought to reduce this cost by using a combination of compiler techniques and clever run time representations [9, 21, 26, 29, 30, 33, 35, 37, 39] and by supporting fine-grained parallel execution directly in hardware [3, 19, 31]. In many cases, the cost of the fork is reduced by severely restricting what can be done in a thread. These approaches, among others, have been used in implementing parallel programming languages such as ABCL [40], CC [6], Charm [20], Cid [29], Cilk [4], Concert [21], Id90 [9, 30], Mul-T [23] ....
.... We use the synthetic benchmark Grain [26, 39]. 7 Related Work. Attempts to accommodate logical parallelism include thread packages [11, 34, 7, 16], compiler techniques and clever run time representations [9, 30, 26, 39, 37, 33, 17], and direct hardware support for fine-grained parallel execution [19, 3]. These approaches have been used to implement many parallel languages, e.g. Mul-T [23], Id90 [9, 30], CC [6], Charm [20], Opus [25], Cilk [4], Olden [5] and Cid [29]. The common goal is to reduce the overhead associated with managing the logical parallelism. While much of this work overlaps ....
H. F. Jordan. Performance measurement on HEP --- a pipelined MIMD computer. In Proc. of the 10th Annual Int. Symp. on Comp. Arch., Stockholm, Sweden, June 1983.
....because of the storage management, data transfer, scheduling, and synchronization involved. This cost has been reduced with a combination of compiler techniques and clever run time representations [7, 19, 23, 16, 25, 20, 18] and by supporting fine-grained parallel execution directly in hardware [13, 2]. These approaches, among others, have been used in implementing the parallel programming languages Mul-T [15], Id90 [7, 19], CC [5], Charm [14], Cilk [3], Cid [18] and Olden [4]. In many cases, the cost of the parallel call is reduced by severely restricting what can be done in a thread. In ....
....and special cases can be avoided. 1.2 Related Work. Attempts to accommodate logical parallelism include thread packages [8, 21, 6], compiler techniques and clever run time representations [7, 19, 16, 25, 23, 20, 10], and direct hardware support for fine-grained parallel execution [13, 2]. These approaches have been used to implement many parallel languages, e.g. Mul-T [15], Id90 [7, 19], CC [5], Charm [14], Cilk [3], Olden [4] and Cid [18]. The common goal is to reduce the overhead associated with managing the logical parallelism. While much of this work overlaps ours, none ....
H. F. Jordan. Performance measurement on HEP --- a pipelined MIMD computer. In Proc. of the 10th Annual Int. Symp. on Comp. Arch., Stockholm, Sweden, June 1983.
....assigned to a new y thread. 4 Related Work. Most of the previous or ongoing research in multithreaded architectures uses multiple threads for only one specific purpose: to tolerate memory latency or to increase instruction issue rate. Systems that adopt multithreading for hiding latency include HEP[8], Horizon[9], Tera[2] and Alewife[1]. To achieve high instruction issue rate, the XIMD[15], Elementary Multithreading[6], M-machine[4], Simultaneous Multithreading[14], Multiscalar[5, 12] and SPSM[3] machines allow instructions from different threads to be issued and executed concurrently. The ....
Harry F. Jordan. Performance measurements on HEP --- a pipelined MIMD computer. In Proceedings of the 10th Annual International Symposium on Computer Architecture, pages 207--212, June 13--17, 1983.
....receiving acknowledgments from every node in the system. The lack of a broadcast mechanism renders snoopy protocols infeasible for large multiprocessor designs. Some methods for solving the cache coherence problem in large multiprocessors bypass the problem entirely. For example, the Denelcor HEP [25] avoids the use of caches by hiding memory access latency with fine grain multitasking. However, this system requires an interconnection network with a very high bandwidth. The NYU Ultracomputer [23] and the IBM RP3 [39] use caches, but avoid the coherence problem by not caching shared data. ....
Harry F. Jordan. Performance Measurements on HEP - A Pipelined MIMD Computer. In Proceedings 10th Annual International Symposium on Computer Architecture, IEEE, New York, June 1983.
....dependences applicable to an operation at the time it issues the operation. Data dependences are typically enforced with semaphores or barriers. Hardware support includes synchronization flags or key fields for controlling access to the associated variable, e.g. the full/empty flags on the HEP [Jor83], the empty bit associated with each element of the I-structure proposed by Arvind for dataflow computation [ANP89], and the synchronization key fields proposed by Zhu and Yew and by Peir [Pei83, ZhY84]. Other proposals for hardware support of version consistency include algorithms for improving ....
H. F. Jordan, Performance Measurements on HEP --- A Pipelined MIMD Computer, Proc. of the 10th ISCA, 1983, 207-212.
....hardware support or merely an appropriate compilation strategy and program representation. 1 Introduction. Multithreading at the instruction level may provide the key to general purpose parallel computing[26] because it allows the processor to tolerate long, unpredictable communication latency [2, 4, 17, 24, 29]. In addition, this level of multithreading is required to support certain modern parallel programming languages[28], such as Id[20] and Multilisp[18], and extensions of more conventional languages with synchronizing data structures, e.g. I-structures[6]. On the other hand, asynchronous transfer ....
....in Section 3. Also, the set of enabled threads is maintained in a special hardware token queue. Several multithreaded architectures have been proposed as generalizations of conventional single-threaded machines, with register sets (i.e. frames) multiplexed to hide memory and communication latency[1, 14, 17, 27, 29]. In most cases, only one thread of execution per frame is supported. Thus, each outstanding reference has an entire register set standing idle behind it. With the exception of MASA[14], the number of frames per processor is static, thus the mechanism does not directly support language models with ....
H. F. Jordan. Performance Measurement on HEP --- A Pipelined MIMD Computer. In Proc. of the 10th Annual Int. Symp. on Comp. Arch., Stockholm, Sweden, June 1983.
....Switching cheaply on a cache miss is rather difficult, for example, because the miss is discovered very late in the processor pipeline. Switching on every load or on every instruction can be initiated much earlier in the pipeline. This is the primary motivation for interleaved pipelines, as in HEP[15], Tera[3] and Monsoon[16]. However, in these machines there are really two kinds of switches. When the switch does not involve a remote access, the VP immediately re-enters the pipeline. On a remote reference the essential state of the VP is packaged and delivered into the network with the ....
....cost will appear as the miss penalty if the number of resident VPs is large. Since the size of the top level of the storage hierarchy determines the amount of latency that can be effectively tolerated, what if the top level is simply eliminated? This is essentially the approach adopted in HEP[15] and Tera[3], which use register addressing modes to access a sizable SRAM. The register access requires multiple cycles, but by interleaving VPs across multiple banks a new access can be initiated every cycle. The top level of the physical storage hierarchy contains only pipeline latches. The ....
H. F. Jordan. Performance Measurement on HEP --- A Pipelined MIMD Computer. In Proc. of the 10th Annual Int'l Symp. on Comp. Arch., Stockholm, Sweden, June 1983.
....incurring a high cost for context switching. Therefore, these languages have been accompanied by the development of specialized computer architectures, e.g. graph reduction machines [PCSH87, Kie87], dataflow machines [ACI 83, GKW85, SYH 89, PC90] and multithreaded architectures [Jor83, NPA92]. Much research has been done in compiling lenient languages for dataflow architectures [ACI 83, Tra86, AN90, GKW85, Cul90]. As a clearer separation of language and architecture has been obtained, attention has shifted to compilation aspects of these languages for commodity processors ....
H. F. Jordan. Performance Measurement on HEP --- A Pipelined MIMD Computer. In Proc. of the 10th Annual Int. Symp. on Comp. Arch., Stockholm, Sweden, June 1983.
....handler runs as a separate thread or in processors which execute only one instruction per thread at a time. The extreme example of such a processor is the dataflow processor, such as Monsoon [Pap91], which can execute a new one-instruction thread on every cycle. Another example is the HEP [Jor83] processor, which interleaves instructions from multiple threads, and the J-Machine multicomputer, in which the state of a supervisor thread and a user thread coexist in the processor [DFK 92]. While such processors are capable of providing efficient interrupts even in the presence of many ....
Harry F. Jordan. Performance measurements on HEP - a pipelined MIMD computer. In Proceedings of the 10th Annual International Symposium on Computer Architecture, pages 207--212, Stockholm, Sweden, June 1983. (Also published as SIGARCH Newsletter, Volume 11, Number 3, June 1983.).
....mechanisms between threads to exploit more fine grained parallelism. In addition, multiple functional units can be shared among threads for better utilization in some multiple threaded architectures. Many concurrent multiple threaded processor architectures have been proposed and studied [2, 8, 9, 10, 11, 14, 15, 17, 18, 19, 31, 36, 47, 46, 52, 32, 38, 28, 29, 20, 21]. Some of them [9, 14, 18, 31, 47, 46] are primarily for increasing system throughput by allowing multiple programs (e.g. one program on each thread) to be run concurrently. In this thesis, we focus on models that are primarily for speeding up the execution of one single program. In this ....
....instructions issued per machine cycle. Most of the previous or ongoing research in multithreaded architectures uses multiple threads for only one specific purpose: to tolerate memory latency or to increase instruction issue rate. Systems that adopt multithreading for hiding latency include HEP [17], Horizon [22], Tera [2] and Alewife [1]. To tolerate memory latency, HEP, Horizon and Tera can accommodate more than one hundred threads per processor and allow fine-grained and fast context switching at every cycle. These machines superpipeline both the memory and ALU operations to increase ....
Harry F. Jordan. Performance measurements on HEP --- a pipelined MIMD computer. In Proceedings of the 10th Annual International Symposium on Computer Architecture, pages 207--212, June 13--17, 1983.
No context found.
H. F. Jordan. Performance measurement on HEP---A pipelined MIMD computer. Proc. of the 10th Annual Int. Symp. on Comp. Arch., Stockholm, Sweden, June 1983.