Agarwal A., "Performance Tradeoffs in Multithreaded Processors," Private Communication, 1989.
....a register is marked as busy by setting the busy bits of all of its sub registers. [Figure 4-1: Multigranular general purpose registers.] 4.2 Multithreading and Event Handling. Multithreading is a very well known technique. In [Agarwal92] and [Thekkath94] it is shown that hardware multithreading can significantly improve processor utilization. A large number of designs have been proposed and/or implemented which incorporate hardware multithreading; examples include HEP [Smith81], Horizon [Thistle88], MASA [Halstead88], Tera ....
Anant Agarwal, "Performance Tradeoffs in Multithreaded Processors", IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 5, September 1992, pp. 525-539.
....of issues such as the number of threads, thread granularity, memory latencies on the proposed architecture. 3 An Analytical Model For Evaluation. There have been many analytical formulations to predict the performance of multithreaded programs on conventional architecture (see for example [Agarwal 92], [Culler 91]). In this paper, we will use a closed form queuing network model to compare the performance of Scheduled Dataflow with conventional processors, ETS like dataflow architecture and hybrid systems utilizing separate processors for thread execution and thread scheduling (e.g. EARTH ....
A. Agarwal, "Performance tradeoffs in multithreaded processors," IEEE Transactions on Parallel and Distributed Systems, vol. 3(5), pp.525-539, September 1992.
....loaded by multiple contexts, and thus latencies may increase. We presented a preliminary investigation of the benefits of multiple context processors in cache coherent multiprocessors in a previous study [29]. More recently, there have also been two analytical evaluations of multiple contexts [2, 24]. In this study we present a more detailed evaluation of the performance of multiple context processors, and we also consider the combined effect with other latency hiding techniques. We use processors with two and four contexts. We do not consider more contexts per processor because 16 4 context ....
A. Agarwal. Performance tradeoffs in multithreaded processors. MIT VLSI Memo 89-566, Lab. for Comput. Sci., Submitted for publication, September 1989.
.... with a 66 80 hit ratio with a release consistency memory model [64]. Multithreading has been introduced to tolerate the network latency by overlapping remote memory access of one thread with the computation of other threads [61], [65]. Analysis of the multithreaded processor has been studied in [66], [67]. However, we believe that it would be preferable to reduce the long latency rather than to tolerate it as the variance of the communication latency becomes large [68]. We plan to employ a multitasking scheme as a complementary technique to the conventional multithreading technique in order to increase ....
A. Agarwal, "Performance Tradeoffs in Multithreaded Processors," IEEE Trans. Parallel and Distributed Systems, Vol. 3, pp. 525-539, Sep. 1992.
....attempt to eliminate or to reduce the number of remote memory references, as in prefetching [1] and weak memory consistency [2]. In the latter, remote memory accesses are tolerated, provided the memory latency is hidden. The multithreading approach belongs to this group. In a multithreaded processor [3, 4], the program is presented with a collection of threads. A thread consists of a sequence of instructions and can communicate with other threads (inter-thread communication). Memory latency hiding in a multithreaded processor is realized by switching to another thread when the ....
A. Agarwal, "Performance Tradeoffs in Multithreaded Processors," IEEE Trans. on Parallel and Distributed Systems, vol. 3, no. 5, Sept. 1992.
....Computational models and simulation are used frequently in the development and analysis of architectures both to determine design parameters and to analyze performance. These models frequently provide numbers which fairly accurately reflect actual execution data, as illustrated by Agarwal in [1], where he proposed and validated an analytical performance model for multithreaded processors that included cache interference, network contention, and context switching overhead effects. In order to analyze the molecular dynamics simulation on a PIM array we must con ....
Agarwal A., "Performance Tradeoffs in Multithreaded Processors," Private Communication, 1989.
....simultaneously or sequentially, to evaluate a system whenever possible. The results of simulation and analytical modeling would be more convincing if the inputs and parameters are determined based on previous measurement [Jain91] and analytical models could be validated by simulations (e.g. [Agarwal92; Menasce92]). 3 Fundamental Laws and Scalability Analysis. One key problem faced by parallel programmers is how to effectively utilize the processing power provided by the underlying architecture in order to gain performance. Obviously, good performance can easily be achieved when multiple ....
....reported that the improvement in speedup keeps diminishing as more processors are used. This phenomenon, or performance loss, is caused by so called parallel overhead. Rayfield and Silverman attributed this to interprocessor communication and the sequential part of computation. Agarwal [Agarwal92] gave two reasons for the decreasing processor utilization. First, the cost of each memory access increases because network delays increase with system size. Second, as we strive for greater speedups through fine grain parallelism, the number of network transactions and synchronization delays ....
Agarwal, A., "Performance tradeoffs in multithreaded processors," IEEE Transactions on Parallel and distributed Systems, vol. 3, no. 5, pp. 525-539, September 1992.
....with similar multiplexing of the execution pipeline include MASA [29] and METRIC [68], though these machines context switch on a demand basis rather than in fixed rotation. Processors with short but nonzero latency context switches have been proposed to mask long latencies of multiprocessor arrays [1], [82], [2], [51]. Sharing of the instruction issue unit is not necessarily a cost efficient approach to a single chip multiprocessor design. First, while one task is in execution, other tasks, some in a ready to execute state, can make no progress. This contention can keep the issue unit ....
Anant Agarwal. Performance tradeoffs in multithreaded processors. Technical Report VLSI Memo, Massachusetts Institute of Technology, 1989. Submitted to IEEE TPDS.
....Cache Coherence Multithreaded architectures have been intensively studied as a way of reducing long latencies in DSM multiprocessors, and may also be interesting for bus based multiprocessors. Many solutions have been proposed and studied for the event(s) triggering the context switch operation [2, 24]: switch on every instruction fetch is limited by the large number of contexts necessary to keep the execution pipeline full [25]; switch on miss works with a smaller degree of multithreading but causes inter-thread conflict misses, which can break the locality of each thread [26] and also may ....
....of private data generated by process migration. Process migration plays an important role in a general purpose multiprocessor, since it allows the programmer to develop his applications without caring about load balance. The interaction between multithreading and coherence has been pointed out in [24] and is therefore explored here by highlighting two new aspects: i) the kind of machine, which is the shared bus shared memory multiprocessor, ii) the kind of workload, which consists of a mix of both uniprocess and multiprocess applications, including kernel aspects. The shared bus multithreaded ....
A. Agarwal, "Performance tradeoffs in multithreaded processors," IEEE Transactions on Parallel and Distributed Systems, vol. 3 (5), pp. 525--539, September 1992.
....switches that control the traffic between the nodes. Instruction reordering is one of the approaches used to alleviate the problem of divergent processor and memory performances. Multithreading is another approach which combines software (compilers) and hardware (multiple thread contexts) means [1, 5, 15]. Multithreading is an architectural approach to tolerating long latency memory accesses and synchronization delays in distributed memory systems. The general idea is quite straightforward. When a long latency memory operation occurs, the processor instead of waiting for its completion ....
Agarwal, A., "Performance tradeoffs in multithreaded processors"; IEEE Trans. on Parallel and Distributed Systems, vol.3, no.5, pp.525-539, 1992.
....(i.e. the average number of thread instructions issued before a context switch takes place) the average number of available threads, the probability of long latency access to local memory, etc. A number of analytical studies of multithreaded architectures has been reported in the literature [Ag92, AB91, KD92, NG93, SB90, ZG97]. This paper derives simple models of the multithreaded architectures and uses them to perform approximate performance evaluation of the system. In particular, the utilization of the processors is studied in greater detail although the same approach can be used to study the performance of the ....
Agarwal, A., "Performance tradeoffs in multithreaded processors"; IEEE Trans. on Parallel and Distributed Systems; vol.3, no.5, pp.525-539, 1992.
.... FX 8 [TMS87], the Briarcliff machine [GEW90] and related work by Lee and Gupta [LG91], VLIW in the Large [DV90], single chip multiprocessors [ONH+96, Fei94], STAMPede [SM97], and even the multiscalar and related architectures [FS92, SBV95, RJSS97]. Multithreaded machines of various flavors [Aga92, Chi91, CSS+91, JT91, Kow85, HKN+92, Ian88b, NBW91, PT91, TEE+96, TFP92] also fall into this category. 1.2.3. Compilation. The compiler exposes parallelism available in the application, analyzes the target architecture, and chooses how to exploit parallelism in order to minimize ....
Anant Agarwal. Performance Tradeoffs in Multithreaded Processors. Parallel and Distributed Systems, 3(5):525--539, September 1992.
....for the Monte Carlo simulations described in the next section. 4 Analytical Model: Evaluating the New Architecture. 4.1 Overview of the experiment. There have been many analytical formulations to predict the performance of multithreaded programs on conventional architectures (see for example [2], [4]). In this section, we will show the preliminary performance analysis on our Scheduled Dataflow architecture using Monte Carlo simulations. In order to analyze the architecture in a more realistic light, we generated synthetic workloads and applied these workloads to the simulations ....
....be found in [12]. 4.2 Thread parallelism. In order to measure the effect of thread level parallelism on the performance of the different architectures, we generated a sequence of threads for each architecture. We took the simple performance model for multithreaded processors suggested by Agarwal [2] to introduce the latency between a pair of threads (the time difference between the termination of a thread and the initiation of a successive thread). We considered three values for latencies: 1, 3, and 5 times the length of a thread (L = 1R, L = 3R, L = 5R in figure 4 above). Note that the figure 4 ....
A. Agarwal, "Performance tradeoffs in multithreaded processors," IEEE Transactions on Parallel and Distributed Systems, vol. 3(5), pp.525--539, September 1992.
....[9] and the Tera MTA [4] can be mentioned as examples of multiprocessors whose nodes support fine multithreading. Block multithreading explicitly aims at tolerating long remote memory latency or synchronization latency in large scale multiprocessors. Studies of this multithreading technique [1, 2, 6, 8, 26, 31] support the conclusion that a block multithreaded processor with as few as 2-4 hardware supported contexts can achieve high efficiency by switching contexts on cache misses. However, as mentioned in [19], one can intuitively assume that more threads are required to cover long ....
....techniques, and their impact on the architecture performance, ii) investigation of the dependence between performance and characteristics of various multithreaded workloads. A few attempts at mathematical evaluation of block multithreaded MTAs have resulted in deterministic analytical models [1, 25, 28] and queuing models [1, 20, 21, 25, 26, 28]. A technique of analytical modelling of such MTAs is mainly based on the consideration of a set of thread states and state transitions. A thread, during its life time, cyclically passes through four main states: switching, running, suspended and ....
A. Agarwal, "Performance Tradeoffs in Multithreaded Processors", IEEE Transactions on Parallel and Distributed Systems, 3(5), pp. 525-539, September 1992.
....occurs when v is insufficient to hide the entire load latency, reducing scalar utilization to $U(v) = vR/(R+L)$; this is called, appropriately, linear mode multithreading, since utilization is proportional to the number of contexts. More detailed analytical models of multithreading can be found in [SaCE90, Agar92]. In Concurro the minimum number of contexts required for saturation is increased by C times over that required for scalar saturation. Thus in the worst case, setting S to zero, $v \geq C(1 + L/R)$ to achieve saturation. With $L/R$ expected to be in the neighbourhood of 10, it is readily apparent that ....
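The linear-mode and saturation relations quoted in the snippet above can be checked numerically. The following is a minimal sketch, assuming the garbled expressions read U(v) = vR/(R+L) and v >= C(1 + L/R) with the switch cost S set to zero; the function names are hypothetical and do not come from the cited paper.

```python
def linear_mode_utilization(v: int, run_length: float, latency: float) -> float:
    """Utilization with v contexts, run length R and latency L, assuming S = 0.

    Assumed form: U(v) = v*R / (R + L), capped at 1.0 once the latency
    is fully hidden (saturation).
    """
    if v < 1 or run_length <= 0 or latency < 0:
        raise ValueError("need v >= 1, run_length > 0, latency >= 0")
    return min(1.0, v * run_length / (run_length + latency))


def min_contexts_for_saturation(run_length: float, latency: float, c: float = 1.0) -> float:
    """Smallest v reaching saturation under the assumed bound v >= C*(1 + L/R).

    c = 1.0 corresponds to the scalar case; c > 1 models the factor C
    mentioned for Concurro in the snippet.
    """
    if run_length <= 0 or latency < 0 or c <= 0:
        raise ValueError("need run_length > 0, latency >= 0, c > 0")
    return c * (1.0 + latency / run_length)
```

With L/R = 10, as the snippet suggests, `min_contexts_for_saturation(10.0, 100.0)` gives 11 contexts for the scalar case, and the requirement scales linearly with C.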
A. Agarwal, "Performance tradeoffs in multithreaded processors," IEEE Trans. on Parallel and Distributed Sys., vol. 3, no. 5, pp. 525--539, Sept. 1992.
....pattern on a case by case basis, a general solution is desirable. In this paper we examine the effect of producer consumer data on network and memory contention in large scale, direct connect, distributed shared memory multiprocessors like the Stanford DASH [Lenoski et al. 1993] and MIT Alewife [Agarwal et al. 1992] machines. We use detailed execution driven simulation of parallel programs to quantify the performance impact of producer consumer data as a function of the network and memory bandwidth and the number of processors in the machine. Our experiments show that, over a wide range of network and ....
....broadcasting. We conclude, in Section 7, with a summary of our results. 2 Relationship to Previous Work A uniform distribution of accesses (and therefore uniform utilization of memory modules) is a common assumption in analytical models used to guide multiprocessor and network design. In [Agarwal, 1992], for example, this assumption is used in the calculation of the round trip latency of a non local memory reference, which is then used in the calculation of the number of processor contexts for a multi threaded multiprocessor. In this paper we show that this assumption is not valid for many ....
A. Agarwal, "Performance Tradeoffs in Multithreaded Processors," IEEE Transactions on Parallel and Distributed Systems, 3(5):525--539, Sept 1992.
....Cache Coherence Multithreaded architectures have been intensively studied as a way of reducing long latencies in DSM multiprocessors, and may also be interesting for bus based multiprocessors. Many solutions have been proposed and studied for the event(s) triggering the context switch operation [2, 24]: switch on every instruction fetch is limited by the large number of contexts necessary to keep the execution pipeline full [25]; switch on miss works with a smaller degree of multithreading but causes inter-thread conflict misses, which can break the locality of each thread [26] and also may ....
....invalidates on the first write on a shared copy. Process migration plays an important role in a general purpose multiprocessor, since it allows the programmer to develop his applications without caring about load balance. The interaction between multithreading and coherence has been pointed out in [24] and is therefore explored here, highlighting two new aspects: i) the kind of machine, which is the shared bus shared memory multiprocessor, ii) the kind of workload, which consists of a mix of both uniprocess and multiprocess applications, including kernel aspects. The shared bus multithreaded ....
A. Agarwal, "Performance tradeoffs in multithreaded processors," IEEE Transactions on Parallel and Distributed Systems, vol. 3, pp. 525--539, September 1992.
....indeed reduce the average memory access latency significantly, they tend to handle sharing induced misses poorly. To deal with sharing induced misses, we often need techniques that overlap memory accesses with computation or with other memory accesses. These techniques include multithreading [2], relaxed memory consistency models [1, 7], data prefetching [13], and data forwarding [19]. In both prefetching and forwarding, the data is moved close to the consumer processors before it ....
A. Agarwal. Performance Tradeoffs in Multithreaded Processors. In IEEE Transactions on Parallel and Distributed Systems, volume 3, pages 525--539, September 1992.
....such as the number of hardware contexts, thread granularity, memory accesses, context switching overhead. 3 An Analytical Model For Evaluation. There have been many analytical formulations to predict the performance of multithreaded programs on conventional architecture (see for example [3], [5]). In this paper, we will use a closed form queuing network model to compare the performance of Scheduled Dataflow with conventional processors, ETS like dataflow architecture and hybrid systems utilizing separate processors for thread execution and thread scheduling (e.g. EARTH [8]). It may ....
A. Agarwal, "Performance tradeoffs in multithreaded processors," IEEE Transactions on Parallel and Distributed Systems, vol. 3(5), pp.525--539, September 1992.
....the reduced effectiveness of the cache. In light of the above observations, it would be interesting to see the effects of multithreading on the operation of a superscalar processor. The methodology for multithreading must address the above considerations, and keep the following two goals in mind [1]: 1. To have a low cost of switching contexts, since this is simply an overhead. 2. To have good single thread performance, so that applications with low parallelism, and inherently sequential code like critical sections can execute efficiently. One way of executing several threads on a processor ....
....efficiently. One way of executing several threads on a processor is to load the state of a particular thread from memory, execute that thread, and switch contexts by storing back its new state to memory before loading that of another one. This is the technique used by many multithreaded processors [1, 3, 2]. The state of a thread in a superscalar processor consists of the following: its set of registers, program counter, reorder buffer, instruction window, and store buffer. Saving and restoring all this information at every context switch constitutes an enormous overhead. Therefore, a different ....
Anant Agarwal. "Performance tradeoffs in multithreaded processors," IEEE Transactions on Parallel and Distributed Systems, 3(5):525--539, September 1992.
....Figure 1-3 illustrates one alternative to idling a processor on communication and synchronization points. Efficient context switching allows a processor to very quickly switch to another thread and continue running. The less time spent context switching, the greater a processor's utilization [1]. Equation 1-1 shows the utilization of a processor as a function of average context switch time, T_switch, and average run length of threads, T_run, assuming enough concurrent threads. (EQ 1-1) 1.2.4 Thread Scheduling. Scheduling threads to run in parallel computer systems is an active area ....
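The body of the equation referred to in the snippet above did not survive extraction. As a hedged illustration only, the sketch below uses the commonly stated saturated-processor form, utilization = T_run / (T_run + T_switch); this form is an assumption, not recovered from the citing document, and the function name is hypothetical.

```python
def saturated_utilization(t_run: float, t_switch: float) -> float:
    """Fraction of time spent on useful work when enough threads are ready.

    Assumed form: U = T_run / (T_run + T_switch), where T_run is the
    average run length between switches and T_switch is the average
    context switch time (both in the same unit, e.g. cycles).
    """
    if t_run <= 0 or t_switch < 0:
        raise ValueError("need t_run > 0 and t_switch >= 0")
    return t_run / (t_run + t_switch)
```

Under this assumed form, a zero-cost switch gives a utilization of 1.0, and utilization falls as the switch time grows relative to the run length, which matches the qualitative claim in the snippet.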
....data cache must be able to quickly respond to register spills and reloads from the NSF, to prevent long pipeline stalls. This section addresses each of these issues in turn. Several recent studies have investigated the effect of multithreading on cache miss rates and processor utilization. Agarwal [1] has shown that for data caches much larger than the total working set of all processes, the miss rate due to multithreading increases linearly with the number of processes being supported. Typical cache miss rates due to multithreading range from 1 to 3 , depending on the application. Weber and ....
Anant Agarwal. "Performance tradeoffs in multithreaded processors." IEEE Transactions on Parallel and Distributed Systems, 3(5):525--539, September 1992.
....environment and perhaps on a multi threaded processor. Running multiple threads in a uniprocessor requires context switching. The overhead of switching between the two threads can be significant when performed by the operating system. This reduces system throughput and increases response times [Agar 92], [Boot 92]. Alternatively, a uniprocessor capable of supporting simultaneous multithreading can be used. However, a recent study [Chou 96] indicates that significant hardware and possibly cycle time overheads are required to implement a uniprocessor that is capable of supporting simultaneous ....
A. Agarwal. "Performance Trade-offs in Multithreaded Processors," IEEE Transactions on Parallel and Distributed Systems, 3(5):525--539, September 1992.
....these techniques indeed reduce average latencies significantly, they tend to handle sharing induced misses poorly. To deal with this last situation, other techniques that overlap memory accesses with computation or other memory accesses are often necessary. These techniques include multithreading [2], relaxed memory consistency models [1, 6], data prefetching [11], and data forwarding [16]. In both prefetching and forwarding, the data is moved close to the consumer processor before it is actually needed. Therefore, when the processor finally accesses the data, it can do so with low latency. In ....
A. Agarwal. Performance Tradeoffs in Multithreaded Processors. In IEEE Transactions on Parallel and Distributed Systems, volume 3, pages 525--539, September 1992.
....event [Iann88] These proposals appear to offer the same level of utilization, but without overly increasing the turnaround time of a task. Even though multithreading offers to improve processor utilization, there have been few studies that actually attempt to evaluate its potential benefits [Webe89, Agar89]. In our work we have tried to reverse this situation by proposing a model of multithreading based on a small number of significant parameters and studying their effect. We want to use the model to answer some of the most interesting questions confronting multithreading, such as: what is the ....
....is a direct result of a higher number of cache misses, as more contexts start competing for cache resources. In addition, we showed that there exists a close agreement between our results and trace driven simulations when network congestion is moderate. [Footnote 1: Our model did include an explicit dependency between the number of contexts and network delays.] Other recent studies of multithreading [Agar89] have also derived expressions for processor utilization under the assumption that latency and run lengths are constant for a fixed number of contexts. The assumption that ....
Agarwal, A., "Performance Tradeoffs in Multithreaded Processors", Laboratory for Computer Science, Massachusetts Institute of Technology, November 1989.
....a task for local contour from ready queue and perform approximation. Step 6: Check termination condition locally. If FALSE, go to Step 3. end. Figure 7: An outline of Asynchronous Algorithm for linear approximation. The above task scheduling can be regarded as an emulation of multithreading [Agarwal, 1992] at an algorithmic level. Early work in multithreading focused on operating systems and shared memory machines to hide I/O latency and cache miss latency, respectively. Recently, a message driven technique has been used in designing a parallel compiler [Holm et al. 1994], a parallel programming ....
A. Agarwal, "Performance Tradeoffs in Multithreaded Processors," IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 5, pp. 525-539, 1992.
....smaller than time slices and I/O time [21]. More recently, multithreading has been used in shared memory machines to avoid idling when the cache misses. Since cache misses can occur as frequently as one every few cycles [22], special hardware support is required to support fast context switching [23]. Some processors used in DMMs also provide support for multithreading, including the Transputer [12], the START project [17], Monsoon [15], and the Message driven Machine [16]. The Transputer and START both provide processor level hardware support for context switching and threads in general ....
Anant Agarwal. Performance tradeoffs in multithreaded processors. IEEE Transactions on Parallel and Distributed Systems, 3(5):525--539, September 1992.
....utilization is achieved, as depicted in (4). The processor utilization increases linearly with an increase in either R or n_t. This happens until the maximum memory capacity is used, i.e. n_t = n_p. 3.1 Discussions. Earlier works on performance analysis of multithreaded architectures [6, 1, 2] have overlooked the effect of memory. These results suggest that the use of a large number of threads would mask the long memory latency. This increase with respect to the number of threads is possible only if all requests sent to the memory are served with a constant latency and there is no ....
....utilization. 4) Low values of effective memory latency give better processor utilization. This can be achieved by a low L and/or large n_p. 5.2 Related Work. There have been a number of research works reported on the performance evaluation of multithreaded architecture using analytical methods [6, 2, 1, 8]. These results indicate (i) increasing the number of threads, (ii) increasing the size of thread cache, (iii) increasing the network bandwidth, as possible ways of improving the utilization of a multithreaded architecture. The results of [8] suggest that two to four threads per processor is ....
A. Agarwal. Performance tradeoffs in multithreaded processors. MIT VLSI Memo 89-566, Lab. for Computer Science, MIT, September 1989.
....and the processor may have to remain idle during this transfer. This problem is partially eliminated with the presence of caches or local memories, which try to keep the working set close to the processor. Alternative or complementary approaches to caches are prefetching [5, 9] and multithreading [2]. Prefetching consists of fetching instructions or data before they are needed while overlapping the accesses with other computation. Both hardware and software approaches are possible. Multithreading consists of switching processes when an instruction or memory access by a process would stall the ....
A. Agarwal. Performance Tradeoffs in Multithreaded Processors. In IEEE Transactions on Parallel and Distributed Systems, volume 3, pages 525--539, September 1992.
....data types, it becomes important to bound the worst case delay. Moreover, controllable or predictable delays can also be used to optimize communication protocols and their implementations. While there has been a wide array of analyses for wormhole routed networks to model average case performance [93, 94, 95, 96], we know of none that have focused on worst case latency. Our efforts at analysis indicate that even for a bandwidth allocation algorithm as simple as ALU biasing, system dynamics can be remarkably subtle and complicated. The analysis shown below is more complicated than those done for other ....
A. Agarwal. "Performance tradeoffs in multithreaded processors," IEEE Transactions on Parallel and Distributed Systems, vol. 3, no. 5, pp. 525--539, September 1992.
....time is small useful work can be performed during the memory accesses. Low latency multithreaded processors provide multiple sets of registers and other processor state storage so that context switches take little time, in some schemes zero cycles. See [9] for an early description and [1,2,8,23,30] for some recent work. In a system using out of order completion and a relaxed consistency model, instructions following memory accesses that do not immediately complete (e.g. due to a cache miss or some consistency action) would not necessarily stall the CPU [11,13] In prefetching, data is ....
A. Agarwal, "Performance tradeoffs in multithreaded processors," IEEE Trans. on Parallel and Distributed Systems, vol. 3, no. 5, pp. 525-539, Sep. 1992.
....lengths and longer latencies. Our experiments show that compared to fixed size hardware contexts, register relocation can improve processor utilization by a factor of two for many workloads. 1 Introduction Multithreading is an important technique for tolerating latency in multiprocessor systems [3, 7, 19, 21]. Support for multiple contexts and rapid context switching permits high latency operations such as remote memory references and synchronization events to be overlapped with computation, which improves processor utilization. Because the number of registers required by thread contexts varies across ....
....cache fault latency (L) is constant. Thus, there is a fixed probability of a cache miss on each execution cycle, and network response time is uniform, which is reasonable for lightly loaded networks. These distributions are also consistent with the assumptions and models used in earlier studies [3, 19]. The context switch cost is set to S = 6 cycles, which is consistent with the code presented in Figure 3, and better than the 11 cycle cost incurred by the current APRIL implementation [2] To avoid effects due to the selection of a particular thread unloading policy, contexts are never unloaded. ....
A. Agarwal. "Performance Tradeoffs in Multithreaded Processors", IEEE Trans. on Parallel and Distributed Systems, September 1992.
....these techniques indeed reduce average latencies significantly, they tend to handle sharing induced misses poorly. To deal with this last situation, other techniques that overlap memory accesses with computation or other memory accesses are often necessary. These techniques include multithreading [2], relaxed memory consistency models [1, 6], data prefetching [11], and data forwarding [16]. In both prefetching and forwarding, the data is moved close to the consumer processor before it is actually needed. Therefore, when the processor finally accesses the data, it can do so with low latency. In ....
A. Agarwal. Performance Tradeoffs in Multithreaded Processors. In IEEE Transactions on Parallel and Distributed Systems, volume 3, pages 525--539, September 1992.
....a task for local contour from ready queue and perform approximation. Step 6: Check termination condition locally. If FALSE, go to Step 3. end. Figure 6: An outline of Asynchronous Algorithm for linear approximation. The above task scheduling can be regarded as an emulation of multithreading [1] at an algorithmic level. Early work in multithreading focused on operating systems and shared memory machines to hide I/O latency and cache miss latency, respectively. Recently, a message driven technique has been used in designing a parallel compiler [8], a parallel programming language [7], a ....
A. Agarwal, "Performance Tradeoffs in Multithreaded Processors," IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 5, pp. 525-539, 1992.
....put, we ignore resurrections. Allowing blocks to be resurrected in later time intervals introduces significant complexity in the model; our experiments show that resurrections materially change the results only when context switch intervals are very small (say, less than a few hundred cycles) [9]. • Processes switch in a round robin fashion. Round robin switching results in worst case miss rates, but simplifies the model. 3 A Multiprogrammed Cache Model. This section derives from first principles a very simple expression for the miss rate in multiprogrammed caches. The same ....
....carry-over set, because the second term in Equation 2 is very small compared to the first term for reasonable cache sizes. As in the multiprogrammed model derived by Thiebaut and Stone [8], when self-interference is not explicitly considered, the carry-over set can be used instead. (Footnote 5: When the context switching interval is smaller, as in multithreaded caches, the analysis must take into account partial replenishment of the cached set of a process [9].) We can simplify the expression for v(D) when D = 1 as v(1) SP (1) u (4) When S 1 and u 1, we can simplify the above ....
Anant Agarwal. Performance Tradeoffs in Multithreaded Processors. 1992.
....the deleterious effects of cache, network, and context switching overhead, the processor utilization is itself a good measure. We have developed a model for multithreaded processor utilization that includes the cache, network, and switching overhead effects. A detailed analysis is presented in [1]. This section will summarize the model and our chief results. Processor utilization U as a function of the number of threads resident on a processor p is derived as a function of the cache miss rate m(p), the network latency T(p), and the context switching overhead C: U(p) 8 : p ....
Anant Agarwal. Performance Tradeoffs in Multithreaded Processors. September 1989. MIT VLSI Memo 89-566, Laboratory for Computer Science.
No context found.
Anant Agarwal. Performance Tradeoffs in Multithreaded Processors. September 1989. MIT VLSI Memo 89-566, Laboratory for Computer Science.
....via asynchronous traps. 5.2 Simulation Results and Analysis We compare the behavior of a multithreaded architecture to a standard configuration, and analyze how synchronization, local memory access latency, and remote memory access latency contribute to the run time of each application. See [3] for additional analyses. A thorough evaluation of multithreading will require a large parallel machine and a scheduler optimized for multithreaded multiprocessors. On the largest machines we can reasonably simulate (around 64 processors) and with our current scheduler, the scheduling cost of ....
Anant Agarwal. Performance Tradeoffs in Multithreaded Processors. September 1989. MIT VLSI Memo 89-566. Submitted for publication.
....in this study for three reasons. First, threads in parallel processing environments share significant portions of their code and data sets. Second, if the combined working set sizes of the threads is not significantly greater than the cache size, cache performance is not adversely impacted [11], and third, because all our experiments compare the various synchronization hiding mechanisms using the same number of threads, cache effects are expected to be the same in all cases. 3.2 Operation of D Registers The processor state machine (henceforth termed the dribbler) must initiate two ....
Anant Agarwal. Performance Tradeoffs in Multithreaded Processors. In IEEE Transactions on Parallel and Distributed Systems, October, 1992.
No context found.
Agarwal A., "Performance Tradeoffs in Multithreaded Processors," Private Communication, 1989.
No context found.
Anant Agarwal. "Performance tradeoffs in multithreaded processors". IEEE Transactions on Parallel and Distributed Systems, 3(5):525-539, September 1992.
No context found.
Agarwal, A., "Performance Tradeoffs in Multithreaded Processors," IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 5, Sept. 1992, pp. 525-539.
No context found.
A. Agarwal, Performance Tradeoffs in Multithreaded Processors, IEEE Trans on Parallel and Distributed Systems, vol 3, no 5, September 1992, pp 525--539.
No context found.
AGA92: A. Agarwal, "Performance Tradeoffs in Multithreaded Processors", IEEE Trans on Parallel and Distributed Systems, Sep 1992.