43 citations found.
Agarwal A., "Performance Tradeoffs in Multithreaded Processors," Private Communication, 1989.

This paper is cited in the following contexts:

First 50 documents

Design and Evaluation of the Hamal Parallel Computer - Grossman (2002)   (1 citation)  (Correct)

....a register is marked as busy by setting the busy bits of all of its sub-registers. [Figure 4-1: Multigranular general purpose registers.] 4.2 Multithreading and Event Handling. Multithreading is a very well known technique. In [Agarwal92] and [Thekkath94] it is shown that hardware multithreading can significantly improve processor utilization. A large number of designs have been proposed and/or implemented which incorporate hardware multithreading; examples include HEP [Smith81], Horizon [Thistle88], MASA [Halstead88], Tera ....

Anant Agarwal, "Performance Tradeoffs in Multithreaded Processors", IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 5, September 1992, pp. 525-539.


Scheduled Dataflow Architecture: A Synchronous Paradigm for.. - Kavi, Kim, Hurson   (Correct)

....of issues such as the number of threads, thread granularity, memory latencies on the proposed architecture. 3 An Analytical Model For Evaluation. There have been many analytical formulations to predict the performance of multithreaded programs on conventional architecture (see for example in [Agarwal 92], [Culler 91]). In this paper, we will use a closed form queuing network model to compare the performance of Scheduled Dataflow with conventional processors, ETS like dataflow architecture and hybrid systems utilizing separate processors for thread execution and thread scheduling (e.g. EARTH ....

A. Agarwal, "Performance tradeoffs in multithreaded processors," IEEE Transactions on Parallel and Distributed Systems, vol. 3(5), pp.525-539, September 1992.


Comparative Evaluation of Latency Reducing and.. - Gupta, Hennessy.. (1991)   (103 citations)  (Correct)

....loaded by multiple contexts, and thus latencies may increase. We presented a preliminary investigation of the benefits of multiple context processors in cache coherent multiprocessors in a previous study [29]. More recently, there have also been two analytical evaluations of multiple contexts [2, 24]. In this study we present a more detailed evaluation of the performance of multiple context processors, and we also consider the combined effect with other latency hiding techniques. We use processors with two and four contexts. We do not consider more contexts per processor because 16 4 context ....

A. Agarwal. Performance tradeoffs in multithreaded processors. MIT VLSI Memo 89-566, Lab. for Comput. Sci., Submitted for publication, September 1989.


Processor Management Policies for Multiprocessors - Yu (1994)   (Correct)

.... with a 66 80 hit ratio with a release consistency memory model [64]. Multithreading has been introduced to tolerate the network latency by overlapping remote memory access of one thread with the computation of other threads [61], [65]. Analysis of the multithreaded processor has been studied in [66][67]. However, we believe that it would be preferable to reduce the long latency than to tolerate it as the variance of the communication latency becomes large [68]. We plan to employ a multitasking scheme as a complementary technique to conventional multithreading technique in order to increase ....

A.Agarwal, "Performance Tradeoffs in Multithreaded Processors," IEEE Trans. Parallel and Distributed Systems, Vol.3 pp.528-539, Sep.1992.


A Simulator for a Multithreaded Processor - Adda, Niar, Bleuel, Lopez   (Correct)

....attempt to eliminate or to reduce the number of remote memory references, as in prefetching [1] and weak memory consistency [2]. In the latter, remote memory accesses are tolerated, provided the memory latency is hidden. Multithreading approach belongs to this group. In a multithreaded processor [3, 4], the program is presented with a collection of threads. A thread consists of a sequence of instructions and has the possibility to communicate to other threads (inter thread communication). Memory latency hiding in a multithreaded processor is realized by switching to another thread, when the ....

A. Agarwal, Performances Tradeoffs in Multithreaded Processors. IEEE trans. On Paral. and Distri. Systems, vol. 3 no 5 Sept. 1992


Petaflop Computing for Protein Folding - Shannon Kuntz Richard   (Correct)

....Computational models and simulation are used frequently in the development and analysis of architectures both to determine design parameters and to analyze performance. These models frequently provide numbers which fairly accurately reflect actual execution data as illustrated by Agarwal in [1] where he proposed and validated an analytical performance model for multithreaded processors that included cache interference, network contention, and context switching overhead effects. In order to analyze the molecular dynamics simulation on a PIM array we must con ....

Agarwal A., "Performance Tradeoffs in Multithreaded Processors," Private Communication, 1989.


Performance Evaluation for Parallel Systems: A Survey - Hu, Gorton (1997)   (2 citations)  (Correct)

....simultaneously or sequentially, to evaluate a system whenever possible. The results of simulation and analytical modeling would be more convincing if the inputs and parameters are determined based on previous measurement [Jain91] and analytical models could be validated by simulations (e.g. [Agarwal92; Menasce92]). 3 Fundamental Laws and Scalability Analysis. One key problem faced by parallel programmers is how to effectively utilize the processing power provided by the underlying architecture in order to gain performance. Obviously, good performance can easily be achieved when multiple ....

....reported that the improvement in speedup keeps diminishing as more processors are used. This phenomenon, or performance loss, is caused by so called parallel overhead. Rayfield and Silverman attributed this to interprocessor communication and the sequential part of computation. Agarwal [Agarwal92] gave two reasons for the decreasing processor utilization. First, the cost of each memory access increases because network delays increase with system size. Second, as we strive for greater speedups through fine grain parallelism, the number of network transactions and synchronization delays ....

[Article contains additional citation context not shown here]

Agarwal, A., "Performance tradeoffs in multithreaded processors," IEEE Transactions on Parallel and distributed Systems, vol. 3, no. 5, pp. 525-539, September 1992.


Instruction-Processing Optimization Techniques For VLSI.. - Bunda (1993)   (1 citation)  (Correct)

....with similar multiplexing of the execution pipeline include MASA [29] and METRIC [68], though these machines context switch on a demand basis rather than in fixed rotation. Processors with short but nonzero latency context switches have been proposed to mask long latencies of multiprocessor arrays [1], [82], [2], [51]. Sharing of the instruction issue unit is not necessarily a cost efficient approach to a single chip multiprocessor design. First, while one task is in execution, other tasks, some in a ready to execute state, can make no progress. This contention can keep the issue unit ....

Anant Agarwal. Performance tradeoffs in multithreaded processors. Technical Report VLSI Memo, Massachusetts Institute of Technology, 1989. Submitted to IEEE TPDS.


Bus Utilization Analysis of Multithreaded Shared-Bus.. - Giorgi, Foglia, Prete   (Correct)

....Cache Coherence. Multithreaded architectures have been intensively studied as a way for reducing long latencies in DSM multiprocessors, and may be interesting also for bus based multiprocessor. Many solutions have been proposed and studied for the event(s) triggering the context switch operation [2, 24]: switch on every instruction fetch is limited by the great amount of contexts necessary to keep the execution pipeline full [25]; switch on miss works with a smaller degree of multithreading but causes inter thread conflict misses, which can break the locality of each thread [26], and also may ....

....of private data generated by process migration. Process migration plays an important role in a general purpose multiprocessor, since it allows the programmer to develop his applications without caring about load balance. The interaction between multithreading and coherence has been pointed out in [24] and is therefore explored here by highlighting two new aspects: i) the kind of machine, which is the shared bus shared memory multiprocessor, ii) the kind of workload, which consists of a mix of both uniprocess and multiprocess applications, including kernel aspects. The shared bus multithreaded ....

A. Agarwal, "Performance tradeoffs in multithreaded processors," IEEE Transactions on Parallel and Distributed Systems, vol. 3 (5), pp. 525--539, September 1992.


Performance Modeling of Multithreaded Distributed Memory.. - Zuberek Department Of (1999)   (Correct)

....switches that control the traffic between the nodes. Instruction reordering is one of the approaches used to alleviate the problem of divergent processor and memory performances. Multithreading is another approach which combines software (compilers) and hardware (multiple thread contexts) means [1, 5, 15]. Multithreading is an architectural approach to tolerating long latency memory accesses and synchronization delays in distributed memory systems. The general idea is quite straightforward. When a long latency memory operation occurs, the processor instead of waiting for its completion ....

Agarwal, A., "Performance tradeoffs in multithreaded processors"; IEEE Trans. on Parallel and Distributed Systems, vol.3, no.5, pp.525-539, 1992.


Approximate Performance Evaluation of Multi- Threaded Distributed .. - Zuberek (1999)   (Correct)

....(i.e. the average number of thread instructions issued before a context switch takes place) the average number of available threads, the probability of long latency access to local memory, etc. A number of analytical studies of multithreaded architectures has been reported in the literature [Ag92, AB91, KD92, NG93, SB90, ZG97]. This paper derives simple models of the multithreaded architectures and uses them to perform approximate performance evaluation of the system. In particular, the utilization of the processors is studied in greater detail although the same approach can be used to study the performance of the ....

Agarwal, A., "Performance tradeoffs in multithreaded processors"; IEEE Trans. on Parallel and Distributed Systems; vol.3, no.5, pp.525-539, 1992.


Exploiting Multi-Grained Parallelism For.. - Newburn (1997)   (2 citations)  (Correct)

.... FX 8 [TMS87], the Briarcliff machine [GEW90] and related work by Lee and Gupta [LG91], VLIW in the Large [DV90], single chip multiprocessors [ONH 96, Fei94], STAMPede [SM97], and even the multiscalar and related architectures [FS92, SBV95, RJSS97]. Multithreaded machines of various flavors [Aga92, Chi91, CSS 91, JT91, Kow85, HKN 92, Ian88b, NBW91, PT91, TEE 96, TFP92] also fall into this category. 1.2.3. Compilation. The compiler exposes parallelism available in the application, analyzes the target architecture, and chooses how to exploit parallelism in order to minimize ....

Anant Agarwal. Performance Tradeoffs in Multithreaded Processors. Parallel and Distributed Systems, 3(5):525--539, September 1992.


A Decoupled Scheduled Dataflow Multithreaded Architecture - Kavi, Kim, Arul, Hurson (2000)   (Correct)

....for the Monte Carlo simulations described in the next section. 4 Analytical Model: Evaluating the New Architecture. 4.1 Overview of the experiment. There have been many analytical formulations to predict the performance of multithreaded programs on conventional architectures (see for examples in [2], [4]). In this section, we will show the preliminary performance analysis on our Scheduled Dataflow architecture using Monte Carlo simulations. In order to analyze the architecture in a more realistic light, we generated synthetic workloads and applied these workloads to the simulations ....

....be found in [12]. 4.2 Thread parallelism. In order to measure the effect of thread level parallelism on the performance of the different architectures, we generated a sequence of threads for each architecture. We took the simple performance model for multithreaded processors suggested by Agarwal [2] to introduce the latency between a pair of threads (the time difference between the termination of a thread and the initiation of a successive thread). We considered three values for latencies, 1, 3, 5 times the length of a thread (L = 1R, L = 3R, L = 5R in figure 4 above). Note that the figure 4 ....

A. Agarwal, "Performance tradeoffs in multithreaded processors, " IEEE Transactions on Parallel and Distributed Systems, vol. 3(5), pp.525--539, September 1992.


A Queuing Model of Multithreading: A Case Study - Vlassov Thorelli   (Correct)

....[9] and the Tera MTA [4] can be mentioned as examples of multiprocessors whose nodes support fine multithreading. Block multithreading explicitly aims at tolerating long remote memory latency or synchronization latency in large scale multiprocessors. Studies of such multithreading technique [1, 2, 6, 8, 26, 31] allowed the conclusion that a block multithreaded processor with as small number of hardware supported contexts as 2-4 can achieve high efficiency by switching contexts on cache misses. However, as mentioned in [19], one can intuitively assume that more threads are required to cover long ....

....techniques, and their impact on the architecture performance, ii) investigation of the dependence between performance and characteristics of various multithreaded workloads. A few attempts of mathematical evaluation of block multithreaded MTAs have resulted in deterministic analytical models [1, 25, 28] and queuing models [1, 20, 21, 25, 26, 28]. A technique of analytical modelling of such MTAs is mainly based on the consideration of a set of thread states and state transitions. A thread, during its life time, cyclically passes through four main states: switching, running, suspended and ....

[Article contains additional citation context not shown here]

A. Agarwal, "Performance Tradeoffs in Multithreaded Processors", IEEE Transactions on Parallel and Distributed Systems, 3(5), pp. 525-539, September 1992.


Superscalar Performance in a Multithreaded Microprocessor - Gunther (1993)   (3 citations)  (Correct)

....occurs when v is insufficient to hide the entire load latency, reducing scalar utilization to U(v) = vR/(R + L); this is called, appropriately, linear mode multithreading, since utilization is proportional to the number of contexts. More detailed analytical models of multithreading can be found in [SaCE90, Agar92]. In Concurro the minimum number of contexts required for saturation is increased by C times over that required for scalar saturation. Thus in the worst case, setting S to zero, v = C(1 + L/R) to achieve saturation. With L/R expected to be in the neighbourhood of 10, it is readily apparent that ....

A. Agarwal, "Performance tradeoffs in multithreaded processors," IEEE Trans. on Parallel and Distributed Sys., vol. 3, no. 5, pp. 525--539, Sept. 1992.
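
The linear-mode relation quoted in the Gunther context above can be illustrated with a short sketch. It is not code from any of the cited papers. It assumes the garbled formula reads U(v) = vR/(R + L), with v the number of contexts, R the run length between long-latency operations, and L the latency, and it takes the switch overhead S as zero, as the context does. The function names and the cap at 1.0 are illustrative choices.

```python
import math


def linear_mode_utilization(v: int, R: float, L: float) -> float:
    """Utilization with v contexts, assuming U(v) = v*R / (R + L).

    The value is capped at 1.0, the saturated case in which there are
    enough contexts to hide the whole latency (assumption: S = 0).
    """
    if v < 1 or R <= 0 or L < 0:
        raise ValueError("need v >= 1, R > 0, L >= 0")
    return min(1.0, v * R / (R + L))


def contexts_for_saturation(R: float, L: float, C: int = 1) -> int:
    """Smallest whole number of contexts with v >= C * (1 + L/R).

    C = 1 gives the scalar case; C > 1 scales the requirement as the
    quoted context describes for Concurro (assumed reading).
    """
    if R <= 0 or L < 0 or C < 1:
        raise ValueError("need R > 0, L >= 0, C >= 1")
    return math.ceil(C * (1 + L / R))
```

With L/R = 10, the value the context mentions, the scalar case needs 11 contexts under these assumptions, and each additional context below that point adds the same increment of utilization.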


Eager Combining: A Coherency Protocol for Increasing.. - Ricardo Bianchini (1994)   (6 citations)  (Correct)

....pattern on a case by case basis, a general solution is desirable. In this paper we examine the effect of producer consumer data on network and memory contention in large scale, direct connect, distributed shared memory multiprocessors like the Stanford DASH [Lenoski et al. 1993] and MIT Alewife [Agarwal et al. 1992] machines. We use detailed execution driven simulation of parallel programs to quantify the performance impact of producer consumer data as a function of the network and memory bandwidth and the number of processors in the machine. Our experiments show that, over a wide range of network and ....

....broadcasting. We conclude, in Section 7, with a summary of our results. 2 Relationship to Previous Work A uniform distribution of accesses (and therefore uniform utilization of memory modules) is a common assumption in analytical models used to guide multiprocessor and network design. In [Agarwal, 1992], for example, this assumption is used in the calculation of the round trip latency of a non local memory reference, which is then used in the calculation of the number of processor contexts for a multi threaded multiprocessor. In this paper we show that this assumption is not valid for many ....

A. Agarwal, "Performance Tradeoffs in Multithreaded Processors," IEEE Transactions on Parallel and Distributed Systems, 3(5):525--539, Sept 1992.


Bus Utilization Analysis of Multithreaded Shared-Bus.. - Giorgi, Foglia, Prete (1997)   (Correct)

....Cache Coherence. Multithreaded architectures have been intensively studied as a way for reducing long latencies in DSM multiprocessors, and may be interesting also for bus based multiprocessor. Many solutions have been proposed and studied for the event(s) triggering the context switch operation [2, 24]: switch on every instruction fetch is limited by the great amount of contexts necessary to keep the execution pipeline full [25]; switch on miss works with a smaller degree of multithreading but causes inter thread conflict misses, which can break the locality of each thread [26], and also may ....

....invalidates on the first write on a shared copy. Process migration plays an important role in a general purpose multiprocessor, since it allows the programmer to develop his applications without caring about load balance. The interaction between multithreading and coherence has been pointed out in [24] and is therefore explored here highlighting two new aspects: i) the kind of machine, which is the shared bus shared memory multiprocessor, ii) the kind of workload, which consists of a mix of both uniprocess and multiprocess applications, including kernel aspects. The shared bus multithreaded ....

A. Agarwal, "Performance tradeoffs in multithreaded processors," IEEE Transactions on Parallel and Distributed Systems, vol. 3, pp. 525--539, September 1992.


Data Forwarding in Scalable Shared-Memory Multiprocessors - Koufaty, Chen, Poulsen.. (1995)   (14 citations)  (Correct)

....indeed reduce the average memory access latency significantly, they tend to handle sharing induced misses poorly. To deal with sharing induced misses, we often need techniques that overlap memory accesses with computation or with other memory accesses. These techniques include multithreading [2], relaxed memory consistency models [1, 7], data prefetching [13], and data forwarding [19]. In both prefetching and forwarding, the data is moved close to the consumer processors before it [Footnote 1: This work was supported in part by the National Science Foundation under grants NSF Young Investigator ....

A. Agarwal. Performance Tradeoffs in Multithreaded Processors. In IEEE Transactions on Parallel and Distributed Systems, volume 3, pages 525--539, September 1992.


Scheduled Dataflow Architecture: A Synchronous Execution.. - Kavi, Kim, Hurson (1999)   (Correct)

....such as the number of hardware contexts, thread granularity, memory accesses, context switching overhead. 3 An Analytical Model For Evaluation. There have been many analytical formulations to predict the performance of multithreaded programs on conventional architecture (see for example in [3], [5]). In this paper, we will use a closed form queuing network model to compare the performance of Scheduled Dataflow with conventional processors, ETS like dataflow architecture and hybrid systems utilizing separate processors for thread execution and thread scheduling (e.g. EARTH [8]). It may ....

A. Agarwal, "Performance tradeoffs in multithreaded processors," IEEE Transactions on Parallel and Distributed Systems, vol. 3(5), pp.525--539, September 1992.


Performance Study of a Multithreaded Superscalar Microprocessor - Manu Gulati (1996)   (20 citations)  (Correct)

....the reduced effectiveness of the cache. In light of the above observations, it would be interesting to see the effects of multithreading on the operation of a superscalar processor. The methodology for multithreading must address the above considerations, and keep the following two goals in mind [1]: 1. To have a low cost of switching contexts, since this is simply an overhead. 2. To have good single thread performance, so that applications with low parallelism, and inherently sequential code like critical sections can execute efficiently. One way of executing several threads on a processor ....

....efficiently. One way of executing several threads on a processor is to load the state of a particular thread from memory, execute that thread, and switch contexts by storing back its new state to memory before loading that of another one. This is the technique used by many multithreaded processors [1, 3, 2]. The state of a thread in a superscalar processor consists of the following: its set of registers, program counter, reorder buffer, instruction window, and store buffer. Saving and restoring all this information at every context switch constitutes an enormous overhead. Therefore, a different ....

[Article contains additional citation context not shown here]

Anant Agarwal. "Performance tradeoffs in multithreaded processors,". IEEE Transactions on Parallel and Distributed Systems, 3(5):525--539, September 1992.


The Named-State Register File - Nuth (1993)   (2 citations)  (Correct)

....Figure 1-3 illustrates one alternative to idling a processor on communication and synchronization points. Efficient context switching allows a processor to very quickly switch to another thread and continue running. The less time spent context switching, the greater a processor's utilization [1]. Equation 1-1 shows the utilization of a processor as a function of average context switch time T_switch, and average run length of threads, T_run, assuming enough concurrent threads. (EQ 1-1) 1.2.4 Thread Scheduling. Scheduling threads to run in parallel computer systems is an active area ....

....data cache must be able to quickly respond to register spills and reloads from the NSF, to prevent long pipeline stalls. This section addresses each of these issues in turn. Several recent studies have investigated the effect of multithreading on cache miss rates and processor utilization. Agarwal [1] has shown that for data caches much larger than the total working set of all processes, the miss rate due to multithreading increases linearly with the number of processes being supported. Typical cache miss rates due to multithreading range from 1 to 3 , depending on the application. Weber and ....

[Article contains additional citation context not shown here]

Anant Agarwal. "Performance tradeoffs in multithreaded processors." IEEE Transactions on Parallel and Distributed Systems, 3(5):525--539, September 1992.


Modeling Multiprogrammed Caches - Agarwal   Self-citation (Agarwal)   (Correct)

....put, we ignore resurrections. Allowing blocks to be resurrected in later time intervals introduces significant complexity in the model; our experiments show that resurrections materially change the results only when context switch intervals are very small (say, less than few hundreds of cycles) [9]. • Processes switch in a round robin fashion. Round robin switching results in worst case miss rates, but simplifies the model. 3 A Multiprogrammed Cache Model. This section derives from first principles a very simple expression for the miss rate in multiprogrammed caches. The same ....

....carry over set, because the second term in Equation 2 is very small compared to the first term for reasonable cache sizes. As in When the context switching interval is smaller, as in multithreaded caches, the analysis must take into account partial replenishment of the cached set of a process [9]. 5 the multiprogrammed model derived by Thiebaut and Stone [8] when self interference is not explicitly considered, the carry over set can be used instead. We can simplify the expression for v(D) when D = 1 as v(1) SP (1) u (4) When S 1 and u 1, we can simplify the above ....

Anant Agarwal. Performance Tradeoffs in Multithreaded Processors. 1992.


Unknown - Petaflop Computing For   (Correct)

No context found.

Agarwal A., "Performance Tradeoffs in Multithreaded Processors," Private Communication, 1989.


C.A. Waldspurger and W.E. Weihl. "Register relocation.. - Steven Wallace And   (Correct)

No context found.

Anant Agarwal. "Performance tradeoffs in multithreaded processors". IEEE Transactions on Parallel and Distributed Systems, 3(5):525--539, September 1992.


Simulation Study of Multithreaded Virtual Processor - Lee, Kwak, Carlson, Al. (1998)   (Correct)

No context found.

Agarwal, A., "Performance Tradeoffs in Multithreaded Processors," IEEE Transactions on Parallel and Distributed Systems, Vol. 3, No. 5, Sept. 1992, pp. 525-539.

CiteSeer.IST - Copyright Penn State and NEC