18 citations found.
Grunwald D, Vajracharya S. Efficient barriers for distributed shared memory computers. Proceedings of the 8th International Parallel Processing Symposium, April 1994.


This paper is cited in the following contexts:
Threshold Counters with Increments and Decrements - Busch, Demetriou, Herlihy.. (1999)   (1 citation)

....is the current counter value and w is a fixed constant. Thus, the Read operation returns the approximate value of the counter to within the constant w. Threshold counters have a variety of potential uses, most obviously software barrier synchronization (see, for example, [11, Section 4.2.5] or [6, 7]). Threshold counters are interesting because they can sometimes be implemented more efficiently than exact counters. The most obvious way to implement a shared counter, whether threshold or exact, is to use a single shared variable protected by a lock. However, such centralized data structures ....

D. Grunwald and S. Vajracharya, "Efficient Barriers for Distributed Shared Memory Computers," Proceedings of the 8th International Parallel Processing Symposium, IEEE Computer Society Press, April 1994.
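One simple way to obtain the threshold property described above is batching. The sketch below is our own illustration, not the construction of Busch, Demetriou and Herlihy; it assumes a fixed set of threads numbered 0 to nThreads - 1 and operations of exactly +1 or -1, and the class name BatchedThresholdCounter is hypothetical. Each thread accumulates a private delta and adds it to one shared AtomicLong only when that delta reaches +batch or -batch, so read() stays within w = nThreads * (batch - 1) of the true value.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical batched threshold counter (illustration only).
 * Each unflushed per-thread delta satisfies |delta| <= batch - 1,
 * so read() differs from the exact count by at most nThreads * (batch - 1).
 */
final class BatchedThresholdCounter {
    private final AtomicLong shared = new AtomicLong();
    private final long[] local; // local[id] is read and written only by thread id
    private final int batch;

    BatchedThresholdCounter(int nThreads, int batch) {
        if (nThreads < 1 || batch < 1) throw new IllegalArgumentException();
        this.local = new long[nThreads];
        this.batch = batch;
    }

    /** Applies an increment (+1) or decrement (-1) on behalf of thread id. */
    void add(int id, int delta) {
        if (delta != 1 && delta != -1) throw new IllegalArgumentException("delta must be +/-1");
        long d = local[id] + delta;
        if (Math.abs(d) >= batch) { // flush the accumulated delta in one atomic update
            shared.addAndGet(d);
            d = 0;
        }
        local[id] = d;
    }

    /** Approximate value of the counter. */
    long read() { return shared.get(); }

    /** The threshold w guaranteed by this scheme. */
    long maxError() { return (long) local.length * (batch - 1); }
}
```

With batch = 1 every operation goes straight to the shared variable and the counter is exact; larger batches trade accuracy for fewer updates to the single shared location.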


A Multithreaded Java Grande Benchmark Suite - Smith, Bull (2001)   (1 citation)

....parallel programming model, such as spawning and joining threads. Barrier Benchmark: This measures the performance of barrier synchronisation. Two types of barriers have been implemented: the Simple Barrier uses a shared counter, while the Tournament Barrier uses a lock-free 4-ary tree algorithm [10]. The Tournament Barrier is used wherever barrier synchronisation is required in Sections II and III of the suite. ForkJoin Benchmark: This benchmark measures the performance of creating and joining threads. Synchronisation Benchmark: This benchmark measures the performance of synchronised ....

Grunwald, D. and Vajracharya, S. (1994) Efficient Barriers for Distributed Shared Memory Computers, in Proceedings of 8th International Parallel Processing Symposium, IEEE Computer Society, April 1994, pp 202-213.
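A centralised shared-counter barrier of the kind the "Simple Barrier" above refers to can be sketched in Java as follows. This is a minimal sketch assuming sense reversal and yielding spin-waits; the class name CentralBarrier and the method await() are our own illustrations, not the Java Grande suite's code. Every arriving thread decrements one shared counter and every waiter polls one shared flag, which is the contention that tree-based (tournament) barriers are designed to avoid.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical centralised sense-reversing barrier (illustration only). */
final class CentralBarrier {
    private final int n;
    private final AtomicInteger remaining;        // threads still to arrive in this episode
    private volatile boolean globalSense = false;  // flipped once per episode
    private final ThreadLocal<Boolean> localSense = ThreadLocal.withInitial(() -> false);

    CentralBarrier(int n) {
        if (n < 1) throw new IllegalArgumentException();
        this.n = n;
        this.remaining = new AtomicInteger(n);
    }

    void await() {
        boolean mySense = !localSense.get(); // the sense value this episode will end with
        localSense.set(mySense);
        if (remaining.decrementAndGet() == 0) {
            remaining.set(n);      // reset for the next episode before releasing anyone
            globalSense = mySense; // volatile write releases every waiter
        } else {
            while (globalSense != mySense) {
                Thread.yield();    // all waiters poll the same shared flag
            }
        }
    }
}
```

Sense reversal lets the same counter be reused immediately: a thread released from one episode cannot be confused with the next one, because it waits for the opposite flag value.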


An OpenMP-like Interface for Parallel Programming in Java - Kambites, Obdrzalek, Bull (2001)

....in its own team of size one, and the task is executed. Finally, the original values of the thread-specific data are restored. The setNested() method does nothing, and the getNested() method always returns false. 3.6 Barriers: The Barrier class implements a simple, static 4-way tournament barrier [4] for an arbitrary number of threads. Its constructor takes as a parameter the number of threads to use. The doBarrier() method takes as a parameter a thread number, and causes the calling thread to block until it has been called the same number of times for each possible thread number. To avoid ....

D. Grunwald and S. Vajracharya. Efficient Barriers for Distributed Shared Memory Computers. In Proceedings of 8th International Parallel Processing Symposium, April 1994.
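The static F-way tournament barrier referred to in these contexts can be illustrated with the sketch below, which mirrors the described interface (a constructor taking the thread count, and doBarrier taking a thread number). It is our own reconstruction under stated assumptions, not the JOMP code or the algorithm as published by Grunwald and Vajracharya: the class name TournamentBarrier and the fanIn parameter are hypothetical, and for brevity the release phase uses one shared episode word rather than a tree-structured wake-up. In each round with stride s, a thread whose number is a multiple of s*F "wins" and waits for the F - 1 threads id + s, id + 2s, ... whose subtrees must finish first; losers report upward, and thread 0 releases everyone. Arrival spinning uses only atomic reads, so no locks are involved.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Hypothetical static F-way tournament barrier (illustration only). */
final class TournamentBarrier {
    private final int n, fanIn;
    private final AtomicIntegerArray arrived;  // arrived[i] = last episode in which i's subtree finished
    private final AtomicInteger released = new AtomicInteger(); // last episode released by thread 0
    private final int[] episode;               // episode[i] is touched only by thread i

    TournamentBarrier(int nThreads, int fanIn) {
        if (nThreads < 1 || fanIn < 2) throw new IllegalArgumentException();
        this.n = nThreads;
        this.fanIn = fanIn;
        this.arrived = new AtomicIntegerArray(nThreads);
        this.episode = new int[nThreads];
    }

    /** Blocks thread id (0 <= id < nThreads) until every thread has called doBarrier. */
    void doBarrier(int id) {
        int ep = ++episode[id];
        // Static pairing: at stride s, id waits for id+s, id+2s, ... only if id is a multiple of s*fanIn.
        for (int stride = 1; stride < n && id % (stride * fanIn) == 0; stride *= fanIn) {
            for (int k = 1; k < fanIn; k++) {
                int child = id + k * stride;
                if (child >= n) break;
                while (arrived.get(child) < ep) Thread.yield();
            }
        }
        if (id == 0) {
            released.set(ep);      // champion: every thread has arrived
        } else {
            arrived.set(id, ep);   // lost this round: report to the winner
            while (released.get() < ep) Thread.yield();
        }
    }
}
```

With n threads the arrival phase takes ceil(log_F n) rounds, in line with the logarithmic scaling reported for the static F-way tournament in the contexts below; new TournamentBarrier(nThreads, 4) corresponds to the 4-way variant.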


JOMP-an OpenMP-like Interface for Java - Bull, Kambites (2000)   (3 citations)

....in its own team of size one, and the task is executed. Finally, the original values of the thread-specific data are restored. The setNested() method does nothing, and the getNested() method always returns false. 3.7 Barriers: The Barrier class implements a simple, static 4-way tournament barrier [3] for an arbitrary number of threads. Its constructor takes as a parameter the number of threads to use. The DoBarrier() method takes as a parameter a thread number, and causes the calling thread to block until it has been called the same number of times for each possible thread number. To avoid ....

D. Grunwald and S. Vajracharya. Efficient Barriers for Distributed Shared Memory Computers. In Proceedings of 8th International Parallel Processing Symposium, April 1994.


Parallel Implementation of a Multilevel Modelling Package - Bull, Riley, Rasbash.. (1999)   (3 citations)

....doParallel method. It is also available to be called directly by the user, though this is not required in MLn. Since barrier synchronisation is often a significant source of overheads in shared memory programs, we have implemented [11] an efficient barrier algorithm (the static F-way tournament of [10]) as well as a simple centralised counter algorithm. The static F-way tournament algorithm is lock-free, and scales as the logarithm of the number of threads. Even for small numbers of threads, we have found it to be faster than a simple centralised counter scheme. 3) Lock Class: This class ....

Grunwald, D. and S. Vajracharya, Efficient barriers for distributed shared memory computers, (Tech. Rep. CU-CS-703-94-93, Department of Computer Science, University of Colorado, Boulder, CO, 1993)


Measuring Synchronisation and Scheduling Overheads in OpenMP - Bull (1999)   (12 citations)

....models (see, for example, [1], [3]), the historical lack of standardisation in shared memory programming has meant that little work has been done in this area. Barrier synchronisation is an important feature of this programming model, and previous studies of barrier performance include [2] and [5]. Loop scheduling methods have an extensive literature, with most authors reporting performance studies, though the emphasis is on comparing algorithms rather than implementations (see, for example, [4], [8], [9]). The remainder of this paper is organised as follows: Section II describes ....

Grunwald, D. and S. Vajracharya (1994) Efficient Barriers for Distributed Shared Memory Computers in Proceedings of 8th International Parallel Processing Symposium, April 1994.


The Parallel Fast Multipole Method In Molecular Dynamics - Singer (1995)   (4 citations)

....of levels of the Fast Multipole Method to processors. extensions to the POSIX standard to bind our threads to processors. For synchronization we have implemented the most simple-minded barriers possible; for better performance on larger numbers of processors, dynamic f-way barriers, as described in [31], should be employed. The new code necessary for the parallelization was fewer than 200 lines, while only about 50 lines in the existing code needed to be changed. The sub-boxes on each level are linked in the natural order implied by the recursive parent-child relations (cf. [42, 8, 63, 64] ....

Dirk Grunwald and Suvas Vajracharya. Efficient barriers for distributed shared memory computers. Technical Report CU-CS-703-94-93, University of Colorado at Boulder, Department of Computer Science, Campus Box 430, University of Colorado, Boulder, Colorado 80309, September 1993. Email: grunwald,suvas@cs.colorado.edu.


The Parallel Fast Multipole Method in Three Dimensions - Singer (1995)

....standard. On the KSR we also use KSR-specific extensions to the POSIX standard to bind our threads to processors. For synchronization we have implemented the most simple-minded barriers possible; for better performance on larger numbers of processors, dynamic f-way barriers, as described in [GV93], should be employed. The new code necessary for the parallelization was fewer than 200 lines, while only about 50 lines in the existing code needed to be changed. The sub-boxes on each level are linked in the natural order implied by the recursive parent-child relations ....

Dirk Grunwald and Suvas Vajracharya. Efficient barriers for distributed shared memory computers. Technical Report CU-CS-703-94-93, University of Colorado at Boulder, Department of Computer Science, Campus Box 430, University of Colorado, Boulder, Colorado 80309, September 1993. Email: grunwald,suvas@cs.colorado.edu.


Parallel Implementation of the Fast Multipole Method with.. - Singer (1995)   (1 citation)

....standard. On the KSR we also use KSR-specific extensions to the POSIX standard to bind our threads to processors. For synchronization we have implemented the most simple-minded barriers possible; for better performance on larger numbers of processors, dynamic f-way barriers, as described in [GV93], should be employed. The new code necessary for the parallelization was fewer than 200 lines, while only about 50 lines in the existing code needed to be changed. The sub-boxes on each level are linked in the natural order implied by the recursive parent-child relations (cf. [LB92, BHE 94, ....

Dirk Grunwald and Suvas Vajracharya. Efficient barriers for distributed shared memory computers. Technical Report CU-CS-703-94-93, University of Colorado at Boulder, Department of Computer Science, Campus Box 430, University of Colorado, Boulder, Colorado 80309, September 1993. Email: grunwald,suvas@cs.colorado.edu.


Parallelising Serial Code: A Comparison Of Three.. - MacLaren (1997)   (1 citation)

....processor was. Releasing a lock costs around 25 processor cycles (1.25 µs). These times were obtained experimentally. The overhead for a barrier synchronisation on the KSR1 ranges from 12000 processor cycles (0.6 ms) to 20000 cycles (1 ms), with a logarithmic dependency on the number of processors [Grunwald93]. Fortran Compiler Parallel Support: The KSR1 Fortran 77 [ANSI78] compiler provides support for parallelism via the ksr tile and ksr end tile directives, which are embedded in comments placed in the code around a DO loop. This directive is capable of making DO loops run in parallel, and ....

Grunwald D and Vajracharya S, Efficient Barriers for Distributed Shared Memory Computers, Technical Report CU-CS-703-94-93, Department of Computer Science, University of Colorado, Sep. 1993.
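As a quick consistency check on the KSR1 figures in the MacLaren context (our arithmetic, not the author's), the barrier timings imply a 20 MHz processor clock, which in turn puts the 25-cycle lock release at about 1.25 µs:

```latex
f = \frac{12000\ \text{cycles}}{0.6\ \text{ms}} = 2 \times 10^{7}\ \text{cycles/s} = 20\ \text{MHz},
\qquad
t_{\text{release}} = \frac{25\ \text{cycles}}{20\ \text{MHz}} = 1.25\ \mu\text{s},
\qquad
\frac{20000\ \text{cycles}}{20\ \text{MHz}} = 1\ \text{ms}.
```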


Global Arrays: A Portable "Shared-Memory".. - Nieplocha, al. (1994)

....into atomic mode with similar cost to an ordinary non-atomic access to that page. This facility is used to provide fine-grain locking in the accumulate operation, which increases scalability. Also on the KSR, we use a dynamic F-way barrier, which is claimed to be the fastest barrier for this machine [14]. On other machines, the central barrier algorithm is used. 5. Performance of Communication Primitives: The efficiency of the elementary communication operations, get, put and accumulate, might be crucial to the overall performance of the applications that use the toolkit. We demonstrate ....

D. Grunwald and S. Vajracharya, "Efficient barriers for distributed shared memory computers," Proceedings of 8th IPPS, pp. 604-608, 1994.


Global Arrays: A Non-Uniform-Memory-Access.. - Nieplocha..   (22 citations)

....put into atomic mode with similar cost to an ordinary non-atomic access to that page. This facility might be used to provide fine-grain locking in the accumulate operation, which increases scalability. Also on the KSR, we use the dynamic F-way barrier, the fastest barrier algorithm for this machine [12]. On other shared memory machines, the central barrier algorithm is used. The code for globally addressable distributed memory and shared memory implementations is almost identical, since it uses generalized locking, copy and memory allocation abstractions. Our generalized copy mechanisms ....

D. Grunwald and S. Vajracharya. Efficient barriers for distributed shared memory computers. In Proc. of 8th IPPS, pages 202-213. IEEE Computer Society, 1994.


Dependence Driven Execution for Data Parallelism - Vajracharya   Self-citation (Vajracharya)

....does not improve locality because the whole matrix must be traversed before the data is re-accessed; all the red computations must complete and synchronize at the barrier before any element is re-accessed by black computations. The improved performance is largely due to a more efficient barrier [36], which takes the hierarchical interconnect topology of the KSR1 into account. 5.1.3 Awesime Threads Dynamic Block Scheduling: This is the chunking method described in Section 3.1.1. Work is not statically assigned; instead, workers (implemented as Awesime threads) grab a block of the matrix ....

Dirk Grunwald and Suvas Vajracharya. Efficient barriers for distributed shared memory computers. In Intl. Parallel Processing Symposium. IEEE, IEEE Computer Society, April 1994. (to appear).


The Design of an Object-Oriented Runtime System For.. - Baillie, Grunwald.. (1995)   Self-citation (Grunwald Vajracharya)

....different work-sharing strategies. The most common work-sharing mechanism uses a separate scheduler for each CpuMux, and CpuMuxes steal from each other if they are idle. As another example of dynamic dispatch, users can select a barrier algorithm that is most appropriate to the architecture [12] or problem. The DUDE runtime system uses the abstraction and inheritance constructs of C++ to keep the scheduling policy, the underlying hardware, the type of objects being scheduled, the type of synchronization and other aspects of the system mutually orthogonal. As we will see, we need not ....

Dirk Grunwald and Suvas Vajracharya. Efficient barriers for distributed shared memory computers. In Intl. Parallel Processing Symposium. IEEE, IEEE Computer Society, April 1994. (to appear).


Application of an Object-Oriented Parallel Run-Time.. - Baillie, Grunwald.. (1995)   Self-citation (Grunwald Vajracharya)

....different work-sharing strategies. The most common work-sharing mechanism uses a separate scheduler for each CpuMux, and CpuMuxes steal from each other if they are idle. As another example of dynamic dispatch, users can select a barrier algorithm that is most appropriate to the architecture [10] or problem. The DUDE runtime system uses the abstraction and inheritance constructs of C++ to keep the scheduling policy, the underlying hardware, the type of objects being scheduled, the type of synchronization and other aspects of the system mutually orthogonal. As we will see, we need not ....

Dirk Grunwald and Suvas Vajracharya. Efficient barriers for distributed shared memory computers. In Intl. Parallel Processing Symposium. IEEE, IEEE Computer Society, April 1994. (to appear).


The DUDE Runtime System: An Object-Oriented Macro-Dataflow .. - Grunwald, Vajracharya (1995)   (5 citations)  Self-citation (Grunwald Vajracharya)

....different work-sharing strategies. The most common work-sharing mechanism uses a separate scheduler for each CpuMux, and CpuMuxes steal from each other if they are idle. As another example of dynamic dispatch, users can select a barrier algorithm that is most appropriate to the architecture [16] or problem. The Dude runtime system uses the abstraction and inheritance constructs of C++ to keep the scheduling policy, the underlying hardware, the type of objects being scheduled, the type of synchronization and other aspects of the system mutually orthogonal. As we will see, we need not ....

....rows does not improve locality because the whole matrix must be traversed before the data is re-accessed; all the red computation must complete and synchronize at the barrier before any element is re-accessed by black computation. The improved performance is largely due to a more efficient barrier [16], which takes the hierarchical interconnect topology of the KSR1 into account. With dynamically scheduled Dude threads, threads grab a block of the matrix from a global descriptor containing information on what work remains to be done. While data or work is not bound to a particular thread, the ....

Dirk Grunwald and Suvas Vajracharya. Efficient barriers for distributed shared memory computers. In Intl. Parallel Processing Symposium. IEEE, IEEE Computer Society, April 1994. (to appear).


Application of an Object-Oriented Parallel Run-Time System to.. - Clive Baillie   Self-citation (Grunwald Vajracharya)

....different work-sharing strategies. The most common work-sharing mechanism uses a separate scheduler for each CpuMux, and CpuMuxes steal from each other if they are idle. As another example of dynamic dispatch, users can select a barrier algorithm that is most appropriate to the architecture [15] or problem. The Dude runtime system uses the abstraction and inheritance constructs of C++ to keep the scheduling policy, the underlying hardware, the type of objects being scheduled, the type of synchronization and other aspects of the system mutually orthogonal. As we will see, we need not ....

D. Grunwald and S. Vajracharya. Efficient barriers for distributed shared memory computers. In 8th Intl. Parallel Processing Symposium, pages 604-608. IEEE Computer Society, April 1994.


An OpenMP-like interface for parallel programming in Java - Kambites, Obdrzalek, Bull (2001)

No context found.

Grunwald D, Vajracharya S. Efficient barriers for distributed shared memory computers. Proceedings of the 8th International Parallel Processing Symposium, April 1994.


CiteSeer.IST - Copyright Penn State and NEC