R. Thekkath and S. J. Eggers, "Impact of sharing-based thread placement on multithreaded architectures," in Proceedings of the 21st Annual International Symposium on Computer Architecture, pp. 176--186, April 1994.
....protocol, Invalidation transactions are shown instead of Write transactions, since the protocol uses this method to maintain coherence. Binding more parallel processes to the same processor, while diminishing parallelism, has the benefit of reducing the bus load. This result seems to contradict [37], but in that study the workloads consisted of parallel applications only, whereas here sequential programs are considered too. Thus, the designer may accept a tradeoff between increased scalability and greater speed for the parallel application. This tradeoff may have a difficult ....
R. Thekkath and S. J. Eggers, "Impact of sharing-based thread placement on multithreaded architectures," in Proceedings of the 21st Annual International Symposium on Computer Architecture, pp. 176--186, April 1994.
....an additional reason is that, in some cases, processes belonging to the parallel application are executed on the same processor. Thus, binding more parallel processes to the same processor, while diminishing parallelism, has the benefit of reducing the bus load. This result seems to contradict [37], but in that study the workloads consisted of parallel applications only, whereas here sequential programs are considered too. Thus, the designer may accept a tradeoff between increased scalability and greater speed for the parallel application. This tradeoff may have a difficult ....
R. Thekkath and S. J. Eggers, "Impact of sharing-based thread placement on multithreaded architectures," in Proceedings of the 21st Annual International Symposium on Computer Architecture, pp. 176--186, April 1994.
....91, GNL95, GB96], most focusing on extending high-performance RISC cores with extra instructions or synchronization primitives to exploit thread-level parallelism. Designs combine several degrees of hardware and software cooperation to detect, schedule and execute threads from applications [LC95, TE94, KD92, FD95]. There are several ongoing projects that have built (or are in the process of building) prototypes. The *T machine being developed at MIT [NPA92] is a multithreaded massively parallel architecture built around a commodity microprocessor with extra register files and special ....
Radhika Thekkath and Susan J. Eggers. Impact of sharing-based thread placement on multithreaded architectures. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 176--186, Chicago, Illinois, April 18--21, 1994. IEEE Computer Society TCCA and ACM SIGARCH. Computer Architecture News, 22(2), April 1994.
....of the external Register Use Cache. The synchronization unit will prioritize threads based on whether or not the register set associated with a particular thread's frame lives in the register window. A study of co-locating threads based on the amount of sharing between threads is discussed in [TE94]. This should reduce compulsory and invalidation cache misses and thus improve performance. However, their results suggest that load balancing is the most important determinant of performance. In part, their results are due to the large threads they used (ranging in upwards of a million ....
R. Thekkath and S. J. Eggers. Impact of sharing-based thread placement on multithreaded architectures. In Proc. 21st Ann. Int. Symp. on Computer Architecture, Chicago, Illinois, 1994.
....analysis algorithm is applicable to programs written under a different model of parallel programming. Finally, the barrier analysis algorithm is simpler and therefore more efficient. 7.3 Applications The research described in this dissertation has already been applied in two studies [TE93, TE94]. Tullsen and Eggers [TE93] studied the effectiveness of prefetching on bus-based multiprocessors. They found that the limit to effective prefetching (and performance in general) is invalidation misses on shared data, something that traditional (uniprocessor-based) prefetching algorithms ....
....in this dissertation they were able to eliminate false sharing, a significant source of invalidation misses, in their workload to the point where the performance of the (simpler) traditional prefetching algorithms equaled that of more complex, multiprocessor-based algorithms. Thekkath and Eggers [TE94] studied placement of threads based on inter-thread sharing on a parallel architecture in which processors have multiple hardware contexts [ALKK90]. They found that for programs that had been optimized for locality, either manually or using the compiler-directed approach presented in this ....
R. Thekkath and S.J. Eggers. Impact of sharing-based thread placement on multithreaded architectures. In 21st Annual International Symposium on Computer Architecture, pages 176--186, April 1994.
....benchmarks. The caching mechanism attempts to preserve the working set of the program in the cache. Kavi et al. [KHP 95] use a cache with frame-based storage (called SuperBlocks) in a dataflow execution model. Their mechanism uses a Cold Store bit to identify compulsory misses. In [TE94], Thekkath and Eggers show that inter-thread communication causes only a minimal number of contention misses. Therefore, thread co-placement strategies designed to enhance inter-thread locality and reduce such misses have minimal effect. From a cache design perspective these results are quite ....
R. Thekkath and S. J. Eggers. Impact of sharing-based thread placement on multithreaded architectures. In Proc. 21st Int. Symp. on Computer Architecture, Chicago, Illinois, 1994.
....of more work, tol network increases. However, the disadvantage is a significant increase in response times at the switches (collectively S_obs) and at the local memory module (L_obs). 4 4 Agarwal [3] reports a deteriorating effect of partitioning of a cache at a large n_t. Thekkath et al. [28] and Eickemeyer et al. [11] report little variation in cache miss rates (1/R) due to multithreading. In this paper, we do not explore this application-dependent phenomenon. In summary, we note the following points for the network latency tolerance: • Workload characteristics, and not the ....
R. Thekkath and S. Eggers. Impact of sharing-based thread placement on multithreaded architectures. In Proceedings of the 21st International Symposium on Computer Architecture. ACM, April 1994.
....application. The DASH project manages cache coherency at cache line granularity, while the granularity of the Markatos study is an abstract coherency block. Thekkath and Eggers examine cache affinity scheduling (which they call sharing-based placement) on multithreaded, shared-memory architectures [25], and show that load balancing rather than sharing considerations determines whether a particular placement policy performs well. Their work is of particular interest since they also use the SPLASH benchmarks as part of their workload, and because in some respects their multithreaded CC-UMA ....
....limits the amount of locality management possible. Our results are in agreement with the work of Thekkath and Eggers, who examined the impact of sharing-based thread placement on multithreaded architectures using our four benchmark applications (among others) and concluded that it had no impact [25]. Work that has reported significant performance gains with locality management, such as [10] and [21], has generally assumed a CC-NUMA model, in which shared data remains on a fixed home node and is neither replicated nor migrated. Thekkath and Eggers assume a multithreaded CC-UMA model. Since ....
Radhika Thekkath and Susan J. Eggers. Impact of sharing-based thread placement on multithreaded architectures. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 176--186, April 1994.
.... effectiveness with other latency hiding schemes [6]. More recently, we have shown how multithreading can be used effectively on the DDM [2]. Note that within the execution of a single task one can distribute threads over the processors in many different ways, as evaluated by Thekkath and Eggers [8]. In our experiments using a t-threaded processor, we allocate the t threads that are created first to the first processor, the next t threads to the next processor, and so on. 3.2 Striped execution Under striped execution, all tasks are executed in parallel on as many nodes as possible: if ....
Radhika Thekkath and Susan J. Eggers. Impact of Sharing-Based Thread Placement on Multithreaded Architectures. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 176--186, Chicago, Illinois, April 1994. IEEE Computer Society Press.
....multithreading. This paper does not analyze the full costs and benefits of multithreaded processors, but only introduces a more efficient register file for such a processor. Other research has studied how instruction issue logic and caches are shared among the threads of a multithreaded processor [17,29]. 3. The Named State Register File The Named State Register File (NSF) is an alternative register file organization. It is not divided into large frames for each thread. Instead, the NSF is a fully associative structure with very small lines. A thread's registers may be distributed anywhere in ....
R. Thekkath and S.J. Eggers. Impact of sharing-based thread placement on multithreaded architectures. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 176--186. IEEE, April 1994.
....[35] it is shown that by using certain simple hardware support, the cache performance can be increased by an order of magnitude. Some blocks of the cache are reserved for satisfying the compulsory misses. This scheme, called the reserve block scheme, reduces the miss penalty on compulsory misses. In [41], Thekkath and Eggers show that inter-thread communication causes only a minimal number of contention misses. Therefore, thread co-placement strategies designed to enhance inter-thread locality and reduce cache misses have minimal effect. Among the proposed multithreaded architectures that are ....
R. Thekkath and S. J. Eggers. Impact of sharing-based thread placement on multithreaded architectures. In Proc. 21st Int. Symp. on Computer Architecture, pages 176--186, Chicago, Illinois, 1994.
....S_obs). This value represents the wait time for a thread at all queueing nodes. Each thread spends a duration R at the processor, so for n_t threads, we obtain U_p as n_t R / (wait time at all queueing nodes). A similar approach of an assumed, fixed network load is used by Boothe [10] and Thekkath [29] to study various aspects of multithreading. This naive model works well when n_t = 1. Let us assume that S_obs is 27.33, i.e., its unloaded value. Substituting values of R, L and p_remote in R / (R + L + 2 p_remote S_obs), we obtain U_p as 10 / (10 + 10 + 2 × p_remote × 27.33) = 21.1%. Our closed ....
....approximation to apply AMVA. Our results show the effectiveness of multithreading to tolerate long latencies. In particular, we have identified the role of network capacity on the network latency and processor utilization. Simulation studies also report the performance benefits of multithreading [31, 10, 29]. Weber [31] shows the differences in performance gains due to multithreading, because of variations in the bus traffic. While Thekkath's results [29] indicate the need for tuning multithreaded workload, Boothe [10] suggests compiling techniques for multithreading. While confirming these results, ....
[Article contains additional citation context not shown here]
R. Thekkath and S. Eggers. Impact of sharing-based thread placement on multithreaded architectures. In the 21st ISCA, ACM, April 1994.
....of the data they access. In this paper we examine the effect of task placement on memory system overhead on DSM platforms. Previous work has shown that even under optimal conditions, applications implemented with heavyweight threads benefit very little from sharing-based placement policies [19, 7]. However, previous work has also shown that applications based on a task queue model can show appreciable performance gains from policies that place tasks near the data they reference or near other tasks that reference the same data [6, 15]. We have therefore restricted our work to applications ....
....parameters. Results are presented in Section 5, and Section 6 concludes the paper. 2 Related Work Several recent papers have explicitly or implicitly addressed the impact of task scheduling on multiprocessor memory system overhead, application performance, and data sharing. Thekkath and Eggers [19] use trace-driven simulation to examine the effect of data-sharing-based placement on the execution time of applications from the SPLASH [17] and the PRESTO [3] suites running on a multithreaded, multiprocessor architecture. Threads scheduled on the same processor share a cache, and the authors ....
[Article contains additional citation context not shown here]
Radhika Thekkath and Susan J. Eggers. Impact of sharing-based thread placement on multithreaded architectures. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 176--186, April 1994.
....idle when a miss occurs in the instruction cache. The situation becomes more serious as processors are built to exploit instruction-level parallelism: 1) multithreaded processors place a greater demand on the instruction cache, as code from multiple working sets needs to coexist in the cache [Thekkath Eggers 94], and 2) aggressive superscalar processors will increase the performance penalty associated with an instruction cache miss, as more functional units idle while the instruction is fetched from memory. The only way to deal with instruction cache misses is to either hide the latency of the miss ....
Thekkath, R. and Eggers, S. J. Impact of Sharing-Based Thread Placement on Multithreaded Architectures. In Proc. 21st Annual International Symposium On Computer Architecture, pages 176--186, April 1994.
[Figure 7: Speedup graph for Barnes 4000. Horizontal axis: nodes; vertical axis: speedup relative to single node, single thread; series T.1, T.2, T.4, T.6.] ....tios of the previous experiment. These results are partly in accordance with an extensive study on multithreading and locality by Thekkath and Eggers [14]. They have shown that load imbalance is the major factor affecting execution time under multithreading. However, in contrast with that study, we observe a clear deviation in miss ratios. We expect that this is due to machine and associative memory size, as their miss ratios are reported for small ....
R. Thekkath and S. J. Eggers. Impact of Sharing-Based Thread Placement on Multithreaded Architectures. In Proc. of the 21st ISCA, pp. 176--186, Chicago, Illinois, Apr. 1994. IEEE Computer Society Press.
.... flavors of multithreaded architectures have been proposed or built [HF88, Ian88, KS88, ALKK90, ACC 90, Chi91, KD92, HKN 92, LGH94, TEL95]. There have also been many studies on the performance of multithreaded architectures [WG89, SBCvE90, Aga92, PW91, DT91, YST 94, BR92, NGGA93, TE94b, TE94a]. A recent example of a fine-grained multithreaded multiprocessor is the Tera computer [ACC 90]. Like the HEP, the Tera's processor has 128 hardware contexts and switches every cycle with a zero-cycle penalty, ensuring that all instructions in the pipeline are from different ....
R. Thekkath and S. J. Eggers. Impact of sharing-based thread placement on multithreaded architectures. 21st Annual International Symposium on Computer Architecture, pages 176--186, April 1994.
....application and number of processors. We use a load-balanced thread placement. Since thread lengths are known from the trace file lengths, an oracle load balancer distributes threads across processors, attempting to equalize the number of instructions executed per processor. In a previous study [23], we showed that load balancing was the superior placement algorithm on a wide range of processor hardware context configurations for a similar suite of applications. Therefore we use it here. Load-balanced placements are practical to implement, and are frequently used in real systems by both ....
R. Thekkath and S. J. Eggers. Impact of sharing-based thread placement on multithreaded architectures. 21st Annual International Symposium on Computer Architecture, pages 176--186, April 1994.
....only when appropriate. 2 The Applications Table 1 lists the applications described in this report. The table also shows the number of lines of source code, and a broad classification of each program in the application domain. Several research studies have used subsets of these applications [3, 11, 12], mainly in simulation work. These publications provide a detailed analysis of the characteristics of some of the applications. [Table 1, columns: Program Name, Lines of C Source, Application Domain. Rows: Grav, 1013, Scientific; Patch, 2746, Graphics; Pdsa, 3952, CAD; Vandermonde, 319, Mathematics; Health, 511, Simulation; FullConn ....]
R. Thekkath and S. J. Eggers. Impact of sharing-based thread placement on multithreaded architectures. 21st Annual International Symposium on Computer Architecture, pages 176--186, April 1994.
No context found.
Thekkath, R., and Eggers, S. J., "Impact of Sharing-Based Thread Placement on Multithreaded Architectures," Proceedings of the 21st International Symposium on Computer Architecture, 1994.
No context found.
Thekkath, R., Eggers, S.J. "Impact of Sharing-Based Thread Placement on Multithreaded Architectures," 21st Annual International Symposium on Computer Architecture, pp. 176-186, 1994.