35 citations found.
D.-K. Chen, H.-M. Su, and P.-C. Yew, "The impact of synchronization and granularity on parallel systems," Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 239-249, 1990.

This paper is cited in the following contexts:

Exploiting Fine-Grain Thread Level Parallelism on.. - Keckler, Dally.. (1998)   (14 citations)

....on problem size. Based on examination of the code, we expect that fine thread parallelism will continue to scale with more processors and that more aggressive parallelization can yield both greater concurrency and smaller grain sizes. 5 Related Work The study of synchronization cost performed in [2] explored a spectrum of granularities including instruction, statement, and loop level parallelism. They found that statement oriented parallelism was far more sensitive to synchronization overhead than loop level parallelism. However, even with substantial synchronization overhead the statement ....

CHEN, D.-K., SU, H.-M., AND YEW, P.-C. The impact of synchronization and granularity on parallel systems. In Proceedings of the 17th International Symposium on Computer Architecture (May 1990), pp. 239-248.


A Distributed Memory LAPSE: Parallel Simulation of.. - Dickens.. (1993)   (18 citations)

....Table 1 uses these attributes to categorize relevant existing work, and our own tool, LAPSE, the Large Application Parallel Simulation Environment, that we are currently implementing on the Intel Paragon multicomputer.

Tool          communication                   simulator
LAPSE         message passing network         parallel
MaxPar[4]     shared memory (no cacheing)     serial
Proteus[2]    cache coherent shared memory    serial
RPPT[5]       message passing network         serial
Simon[8]      message passing network         serial
Tango[7]      cache coherent shared memory    serial
WWT[17]       cache coherent shared memory    parallel

Table 1: Direct Execution Simulation ....

D.-K. Chen, H.-M. Su, and P.-C. Yew. The impact of synchronization and granularity on parallel systems. In Int'l. Symp. on Computer Architecture, pages 239--248, May 1990.


Performance Prediction for MPI Programs Executing on.. - Phillip Dickens..

....can be estimated. We defer for now the discussion on including the costs of multiprogramming into the simulation model. We will do so in the final paper. 3 Related Work Several other projects use direct execution of application processes to drive simulations of multiprocessor systems [1, 2, 3, 9]. The Wisconsin Wind Tunnel (WWT) [9] is, to our knowledge, the only multiprocessor simulator that uses a multiprocessor (the CM 5) to execute the simulation. It is worthwhile to note two important differences between LAPSE and WWT. First is the issue of purpose. The WWT is a tool for ....

Chen, D., Su, H., and P. Yew. The impact of synchronization and granularity on parallel systems. In Int'l. Symp. on Computer Architecture, pages 239--248, May 1990.


Architectural And Software Support For Executing Numerical.. - Anik (1993)   (6 citations)

....passing paradigm, and concluded that synchronization does not introduce significant overhead [2]. The characteristics of parallel FORTRAN programs are very different from those programs using message passing paradigm. In these programs the basic form of parallelism is the parallel loop structure [3]. The parallel FORTRAN dialect used in this work is Cedar FORTRAN (CF) which has two basic parallel loop structures, DOALL and DOACROSS [4]. There is no dependence between iterations of a DOALL loop; therefore, the iterations can be executed in arbitrary order. A DOACROSS loop can have a ....

....by the KAP Cedar source to source parallelizer [20] [4] which generates a parallel FORTRAN dialect, Cedar FORTRAN. This process exploits parallelism at the loop level, which has been shown by Chen, Su, and Yew to capture most of the available parallelism for Perfect Club benchmark set programs [3]. They measured the instruction level parallelism by trace based data flow analysis and concluded that parallel loop structures sufficiently exploit this parallelism. However this assumes that all memory and control dependences can be resolved in the parallelization process. In practice, ....

[Article contains additional citation context not shown here]

D. Chen, H. Su, and P. Yew, "The impact of synchronization and granularity on parallel systems," Tech. Rep. CSRD Rpt. no. 942, Center for Supercomputing Research and Development, University of Illinois, 1989.


Compiler Optimizations for Eliminating Barrier Synchronization - Tseng (1995)   (43 citations)

....[8, 15] but in most cases compilers will identify many parallel inner loops rather than a few large parallel outer loops. When the amount of computation in a parallel loop (also known as granularity) is small, parallel speedup can be significantly limited due to barrier synchronization overhead [10]. Barriers are expensive for two reasons. First, executing a barrier has some run time overhead that typically grow quickly as the number of processors increases. Second, executing a barrier requires all processors to idle This research was supported in part by ARPA contract DABT63 94 C0054 and an ....

D. Chen, H. Su, and P. Yew. The impact of synchronization and granularity on parallel systems. In Proceedings of the 17th International Symposium on Computer Architecture, Seattle, WA, May 1990.


An Approach to Scalability Study of Shared Memory.. - Sivasubramaniam.. (1994)   (3 citations)

....in the context of overall application performance. There are studies that use real applications to address specific issues like the effect of sharing in parallel programs on the cache and bus performance [11] and the impact of synchronization and task granularity on parallel system performance [6]. Cypher et al. [10] identify the architectural requirements such as floating point operations, communications, and input output for message-passing scientific applications. Rothberg et al. [24] conduct a similar study towards identifying the cache and memory size requirements for several ....

D. Chen, H. Su, and P. Yew. The Impact of Synchronization and Granularity on Parallel Systems. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 239--248, 1990.


Timing Simulation Of Paragon Codes Using Workstation.. - Dickens, Heidelberger.. (1994)   (1 citation)

....parallelized. Table 1 uses these attributes to categorize relevant existing work, and LAPSE. LAPSE and HASE (Howell et al. 1994) simulate a message passing network with a parallelized simulator. WWT (Reinhardt et al.) simulates a shared memory environment with a parallelized simulator. MaxPar (Chen et al. 1990), Maya (Agrawal et al. 1994), Proteus (Brewer et al. 1991) and Tango (Davis et al. 1991) simulate a shared memory network with a serial simulator. RPPT (Covington et al. 1991) and Simon (Fujimoto 1983) simulate a message passing network with a serial simulator. Table 1: Direct Execution ....

Chen, D., H. Su, and P. Yew, 1990. The impact of synchronization and granularity on parallel systems.


A Workstation-Based Parallel Direct-Execution Simulator - Phillip Dickens

....by LAPSE which may provide the functionality we seek. We briefly outline this proposed approach at the end of the paper. Before leaving this section, it is important to note that several other projects use direct execution of application processes to drive simulations of multiprocessor systems [1, 2, 3, 5, 11]. The Wisconsin Wind Tunnel (WWT) [11] is, to our knowledge, the only multiprocessor simulator that uses a multiprocessor (the CM 5) to execute the simulation. It is worthwhile to note the differences between LAPSE and WWT. The first is the issue of purpose of the system. The WWT is a tool for ....

D.-K. Chen, H.-M. Su, and P.-C. Yew. The impact of synchronization and granularity on parallel systems. In Int'l. Symp. on Computer Architecture, pages 239--248, May 1990.


Analysis and Transformation in the ParaScope Editor - Kennedy, McKinley, Tseng (1991)   (9 citations)

....location. They are utilized by transformations to exploit parallelism [9, 39, 54, 56] and the memory hierarchy [12, 24] 3 Work Model Ped is designed to exploit loop level parallelism, which comprises most of the usable parallelism in scientific codes when synchronization costs are considered [18]. In the work model best supported by Ped, the user first selects a loop for parallelization. Ped then displays all of its carried dependences. The user may sort or filter the dependences to help discover and delete dependences that are due to overly conservative dependence analysis. Ped also ....

D. Chen, H. Su, and P. Yew. The impact of synchronization and granularity on parallel systems. In Proceedings of the 17th Annual International Symposium on Computer Architecture, Seattle, WA, May 1990.


ThreadMon A Tool for Monitoring Multithreaded Program Performance - Bryan Cantrill   (2 citations)

....attention on compute intensive applications on multiprocessors. We took as the archetypical compute intensive application the matrix mult example of Section 5.3 and reduced it to a program whose threads make successive iterations of an arbitrary computation followed by synchronization at a barrier [Chen et al. 1990]. The amount of computation per iteration was made a parameter, so that we could adjust the granularity of the synchronization. Figure 7 gives log-log plots of the performance of a fine grained barrier on a four processor machine with four bound threads and the performance of four unbound threads ....

CHEN, D.K., SU, H.H., AND YEW, P.C. 1990. The impact of synchronization and granularity in parallel systems. In Proceedings of the 17th Annual International Symposium on Computer Architecture.


Performance Evaluation of Hierarchical Ring-Based Shared.. - Mark Holliday (1992)   (12 citations)

....Research [8, 6]. There have been performance studies of single slotted rings, but not of ring based hierarchies. Previous studies of other shared memory architectures with system models at the level of detail of our simulator have tended to only examine small systems (100 processors or less) [31, 9]. The performance of branching factor topologies has been considered for bus hierarchies. The experiments by Vernon, Jog, and Sohi [28] indicate that the best topologies had large branching factors at the low levels and small branching factors near the root. In contrast, our contention-based ....

D.-K. Chen, H.-M. Su, and P.-C. Yew. The impact of synchronization and granularity on parallel systems. In Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 239--248, Seattle, WA, May 1990.


Tetra: Evaluation of Serial Program Performance on Fine-Grain.. - Austin, Sohi (1993)

....to look at the effects of the control model on available parallelism, we also look at control barrier from another viewpoint. By measuring the control distance of dependencies, we can then show for a given level of parallelism, how effective speculation must be to expose that parallelism. MaxPar [CSY90] is very similar to Kumar's COMET in that it also rewrites programs. Where Kumar's COMET was limited to scheduling program statements, MaxPar has the ability to schedule at the granularity of operation level, statement level, loop level, or subprogram level. It can also limit computing resources ....

D. K. Chen, H. M. Su, and P. C. Yew. The impact of synchronization and granularity on parallel systems. In Conference Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 239--248. Association for Computing Machinery, May 1990.


Performance Implications of Synchronization Support for Parallel .. - Anik, Hwu

.... Yew to capture most of the available parallelism for Perfect Club benchmark set programs [5]. They measured the instruction level parallelism by trace based data flow analysis and concluded that parallel loop structures sufficiently exploit this parallelism. However this assumes that all memory and control dependences can be resolved in the parallelization process. In practice, compile ....

[Figure 1: A DOALL loop: DOALL 30 J=1,J1 X(II1 J) X(II1 J) SC1 Y(II1 J) Y(II1 J) SC1 Z(II1 J) Z(II1 J) SC1 30 CONTINUE. Page footer: Submitted for publication, Journal of Parallel and Distributed Processing.]

D. Chen, H. Su, and P. Yew. The impact of synchronization and granularity on parallel systems. Technical Report CSRD Rpt. No. 942, Center for Supercomputing Research and Development, University of Illinois, 1989.


Evaluation Of Parallelizing Compilers - David Padua (1992)   (1 citation)

....execution time of the automatically parallelized program and an approximation of the shortest possible parallel execution time on the ideal machine. The rest of the section describes a method to compute this value. This method was introduced by Kumar [6] and later was used by Chen and Yew [7] to measure important characteristics of sequential programs. Any particular execution of a program consists of a series of operations whose results are written to an external device or used by other operations. For a given input data set, the execution of a program can be represented by an ....

D.-K. Chen, H.-M. Su, and P.-C. Yew, "The Impact of Synchronization and Granularity on Parallel Systems," Proceedings of the 17th Int'l. Symp. on Computer Architecture, Seattle, WA, December 1989.


A Simulation-based Scalability Study of Parallel Systems - Anand Sivasubramaniam (1993)

....in the context of overall application performance. There are studies that use real applications to address specific issues like the effect of sharing in parallel programs on the cache and bus performance [10] and the impact of synchronization and task granularity on parallel system performance [7]. Cypher et al. [9] identify the architectural requirements such as floating point operations, communications, and input output for scientific applications. However, there have been very few attempts at quantifying the effects of algorithmic and architectural interactions in a parallel system. ....

D. Chen, H. Su, and P. Yew. The Impact of Synchronization and Granularity on Parallel Systems. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 239--248, 1990.


Memory Latency Reduction via Data Prefetching and Data Forwarding .. - Poulsen (1994)   Self-citation (Yew)

No context found.

D.-K. Chen, H.-M. Su, and P.-C. Yew, "The impact of synchronization and granularity on parallel systems," Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 239-249, 1990.


Compiler-Directed Run-Time Monitoring of Program Data Access - Chen Ding Yutao (2002)   Self-citation (Chen)

No context found.

D.K. Chen, H.H. Su, and P.C. Yew. The impact of synchronization and granularity in parallel systems. In Proceedings of the 17th Annual International Symposium on Computer Architecture, 1990.


Chief: A Simulation Environment for Studying Parallel Systems - Pavlos Konas (1994)   (1 citation)  Self-citation (Yew)

....applications and the effects of operating systems and multiprogramming on system performance [9]. Critical path simulation directly executes instrumented serial application codes such that they are implicitly, optimistically parallelized when executed on a serial host machine. EPG sim and MaxPar [12] provide critical path simulation capabilities in Chief. This type of simulation measures the minimum parallel execution time and maximum speedup of optimistically parallelized codes through the use of dynamic, interprocedural dependence analysis. Dynamic dependence analysis reveals parallelism ....

D.-K. Chen, H.-M. Su, and P.-C. Yew, "The Impact of Synchronization and Granularity on Parallel Systems," in Proceedings of International Symposium on Computer Architecture, pp. 239--249, 1990.


Execution-Driven Tools for Parallel Simulation of Parallel.. - Poulsen, Yew (1993)   (8 citations)  Self-citation (Yew)

....execution time and maximum speedup or parallelism of optimistically parallelized codes. This type of simulation allows high level parallel architecture and application performance data to be obtained efficiently without the need for complex system simulation models, or even parallel codes. MaxPar [ChSY90] is a CPS tool that measures the potential performance, parallelism, and behavior of optimistically parallelized codes given various architectural parameters. Other CPS tools have been used to measure program parallelism [Laru90, Chan91] and to perform dynamic dependence analysis [PePa92, ....

....or fixed, allowing the effects of processor scheduling policies to be observed. The processor scheduler is implemented as a runtime library routine that can be used to implement arbitrary scheduling policies. Scheduling strategies include round robin, earliest available processor, and near optimal [ChSY90], which attempts to maximize processor utilization. Intelligent instrumentation is used to make event generation more efficient. Instrumentation is guided by the wealth of information contained in the Parafrase 2 intermediate representation, allowing events to be more easily identified and ....

[Article contains additional citation context not shown here]

Chen, D.-K., Su, H.-M., and Yew, P.-C., "The Impact of Synchronization and Granularity on Parallel Systems", Proceedings of ISCA 1990, pp. 239-249.


The Effect Of Compiler Optimizations On Available Parallelism.. - Scott Mahlke (1991)   (3 citations)  Self-citation (Chen)

....and thus there is more than one machine model to optimize for. Therefore, it is important to understand the interactions of these optimizations and their effect on available parallelism and speedup. There has been significant research done to analyze the available parallelism in numeric programs [1] [3]. Previous researchers have shown that for numeric programs the most parallelism can be found at either the instruction level or the loop level [1]. However for scalar programs, Larus suggests that there is not much loop level parallelism available because the loops tend to be small and have ....

....effect on available parallelism and speedup. There has been significant research done to analyze the available parallelism in numeric programs [1] [3]. Previous researchers have shown that for numeric programs the most parallelism can be found at either the instruction level or the loop level [1]. However for scalar programs, Larus suggests that there is not much loop level parallelism available because the loops tend to be small and have few iterations [5]. Therefore, it may be better to exploit fine grain parallelism for scalar applications. In this paper we analyze the effect of ....

D.-K. Chen, H.-M. Su, and P.-C. Yew, "The Impact of Synchronization and Granularity on Parallel Systems", Proceedings of the 17th Annual International Symposium on Computer Architecture, June 1990, pp. 239-248.


Dataflow: A Complement to Superscalar - Budiu, Artigas, Goldstein (2005)

No context found.

D.-K. Chen, H.-H. Su, et al. The impact of synchronization and granularity in parallel systems. In International Symposium on Computer Architecture (ISCA), pages 239--248, 1990.


Spatial Computation - Budiu (2003)

No context found.

Ding-Kai Chen, Hong-Hen Su, and Pen-Chung Yew. The impact of synchronization and granularity in parallel systems. In International Symposium on Computer Architecture (ISCA), pages 239--248, 1990.


D.M. Nicol. The cost of conservative synchronization in.. - Press Flannery Teukolsky (1995)

No context found.

D.-K. Chen, H.-M. Su, and P.-C. Yew. The impact of synchronization and granularity on parallel systems. In Int'l. Symp. on Computer Architecture, pages 239--248, May 1990.


Parallelized Direct Execution Simulation of.. - Dickens.. (1994)   (19 citations)

No context found.

D.-K. Chen, H.-M. Su, and P.-C. Yew. The impact of synchronization and granularity on parallel systems. In Int'l. Symp. on Computer Architecture, pages 239--248, May 1990.


Towards A Thread-Based Parallel Direct Execution Simulator - Phillip Dickens (1996)   (2 citations)

No context found.

D.-K. Chen, H.-M. Su, and P.-C. Yew. The impact of synchronization and granularity on parallel systems. In Int'l. Symp. on Computer Architecture, pages 239--248, May 1990.
