Girija J. Narlikar and Guy E. Blelloch. Space-efficient implementation of nested parallelism. In Proceedings of the 6th ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, June 1997.

This paper is cited in the following contexts:
Scalable Real-time Parallel Garbage Collection for Symmetric.. - Cheng (2001)   (1 citation)

....children threads have terminated. The scheduler always executes ready threads at the head of the queue. Activating threads in this order ensures that the parallel computation takes roughly the space of the sequential computation plus an additional factor proportional to the number of processors [57]. The parallel let construct is elaborated by the TILT compiler into nonparallel constructs by using the two extra primitives. The sumTree example above would be compiled into the following code, where a1 and a2 are gensym'ed names. The elaboration replaces each binding in the parallel construct ....

Girija J. Narlikar and Guy Blelloch. Space-efficient implementation of nested parallelism. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1997.


Low-Contention Depth-First Scheduling of Parallel Computations.. - Fatourou (2001)   (1 citation)

....number IST-1999-14186 (ALCOM-FT) dynamic, unstructured parallelism. During the execution of a multithreaded computation, a thread may spawn child threads which can be executed in parallel, and it can synchronize with other currently executing threads. In most of the work in the literature [1, 4, 5, 6, 7, 9, 15, 16, 24, 25, 26, 27], a multithreaded computation is modeled as a directed acyclic graph (see Figure 1(a)). Of much concern is how a multithreaded computation can be executed efficiently on a parallel computer. A parallel execution of a multithreaded computation specifies which processor executes each thread and ....

....a scheduling algorithm to achieve all of the above goals is not a trivial task. Several algorithms [7, 9, 15, 16] employ work stealing, a technique in which underutilized processors try to steal work from overutilized ones, to achieve the above scheduling goals. Recently, a flurry of research [4, 6, 24, 26] has resulted in depth-first schedulers, which schedule threads prioritized by their (serial) left-to-right depth-first execution order and are highly space-efficient; the space complexity of an algorithm is the total amount of memory used by all processors to execute the computation. ....
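
As a rough illustration of the depth-first idea described in this excerpt, the Python sketch below numbers threads by their serial left-to-right depth-first order and lets p processors pick the ready threads that come earliest in that order. The tree layout and the names `dfs_priorities` and `pick_ready` are hypothetical, chosen only for this example; it is not the algorithm of any of the cited papers.

```python
import heapq


def dfs_priorities(tree, root):
    """Number each thread by its position in the serial, left-to-right
    depth-first (preorder) traversal of the spawn tree.

    `tree` maps a thread name to the list of children it spawns,
    listed left to right (an assumed representation).
    """
    order = {}
    stack = [root]
    while stack:
        node = stack.pop()
        order[node] = len(order)
        # Push children right to left so the leftmost child is visited first.
        stack.extend(reversed(tree.get(node, [])))
    return order


def pick_ready(ready, order, p):
    """Choose up to p ready threads that are earliest in the serial order."""
    return heapq.nsmallest(p, ready, key=order.__getitem__)


tree = {"r": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
order = dfs_priorities(tree, "r")
print(order)                                   # {'r': 0, 'a': 1, 'a1': 2, 'a2': 3, 'b': 4, 'b1': 5}
print(pick_ready(["b1", "a2", "b"], order, 2))  # ['a2', 'b']
```

A work-stealing scheduler, by contrast, would keep a separate queue per processor instead of one global priority order.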

[Article contains additional citation context not shown here]

G. Narlikar and G. Blelloch. Space-efficient implementation of nested parallelism. In Proceedings of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, June 1997.


Achieving High Performance for Parallel Programs that Contain.. - Oyama (2000)

....performance model for a given program. In particular, tools for measuring the critical path of a given program are essential. Determining the critical path length is a well-known and extremely effective way of understanding the performance of parallel programs [BL94, BJK 95, BJK 96, BGM95, NB97, Nar99]. 1.4 Contributions. The principal contributions of our work are as follows. • We propose a technique for achieving efficient execution in bottlenecks. It reduces the number of mutual exclusion operations that accompany bottleneck modules and enhances the cache efficiency in the ....
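
To make the notion of critical path length concrete, here is a minimal Python sketch that computes the work (number of nodes) and the depth (longest path) of a small computation DAG. It assumes unit-cost nodes and a dictionary-of-successors representation; the function name `work_and_depth` and the example graph are hypothetical and not taken from the cited paper.

```python
from functools import lru_cache


def work_and_depth(dag):
    """Return (work, depth) for a DAG given as {node: [successors]}.

    Every node is assumed to cost one unit, so work is the node count
    and depth is the number of nodes on the longest path.
    """
    @lru_cache(maxsize=None)
    def longest_from(node):
        # Length of the longest path that starts at `node`.
        return 1 + max((longest_from(s) for s in dag[node]), default=0)

    work = len(dag)
    depth = max(longest_from(n) for n in dag)
    return work, depth


# A fork with one long branch (a, a2) and one short branch (b), then a join.
example = {
    "fork": ["a", "b"],
    "a": ["a2"],
    "a2": ["join"],
    "b": ["join"],
    "join": [],
}
print(work_and_depth(example))  # (5, 4)
```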

....on memory because processors can rapidly send repeated requests to an owner, resulting in the creation of a huge number of data structures containing the information needed for the execution of the requested method. See Cilk's work [BL94, BJK 95, BJK 96] and NESL's work [BGM95, BGMN97, NB97] for a theoretical background on space efficiency. Local-based execution, however, is not always the best choice. If an object is updated frequently by multiple processors, for example, local-based execution will be subject to serious slowdowns caused by overheads, such as cache misses in reading ....

Girija J. Narlikar and Guy E. Blelloch. Space-Efficient Implementation of Nested Parallelism. In Proceedings of the sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '97), pages 25--36, Las Vegas, June 1997.


Expressing Irregular Computations in Modern Fortran Dialects - Prins, Chatterjee, Simons (1998)   (4 citations)

....a FORALL construct. The compile time component constructs the threads from the nested loops. A run time component dynamically schedules these threads across processors. Recent work has resulted in run time scheduling techniques that minimize completion time and memory use of the generated threads [9, 6, 20]. Scheduling very fine grained threads (e.g. a single multiplication in the sparse matrix vector product example) is impractical, hence compile time techniques are required to increase thread granularity, although this may result in lost parallelism and increased load imbalance. FORALL (i = ....

....summed [25] On the SX 4, is typically not needed since the operating system performs gang scheduling and the threads experience very similar progress rates. [Figure: performance in MFLOPS on an SGI Origin 200 with 1, 2, and 4 processors, series labeled flat and pointer. One panel: constant number of nonzeros (20), varying matrix size (number of rows/columns). Other panel: varying number of nonzeros per row, constant matrix size (20000).] ....

[Article contains additional citation context not shown here]

G. J. Narlikar and G. E. Blelloch. Space-efficient implementation of nested parallelism. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 25--36, Las Vegas, NV, June 1997. ACM.


Irregular Computations in Fortran - Expression and.. - Prins, Chatterjee..

....is impractical, hence compile time techniques are required to increase thread granularity, although this may result in lost parallelism and increased load imbalance. Recent work has resulted in run time scheduling techniques that minimize completion time and memory use of the generated threads [11, 8, 27]. The automatic construction of threads of appropriate granularity is currently being investigated by several researchers [26, 19]. In Fig. 4(c) we show a decomposition of the total work into four parallel threads T1, ..., T4. In this decomposition the body of the inner FORALL loop has been ....

G. J. Narlikar and G. E. Blelloch. Space-efficient implementation of nested parallelism. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 25--36, Las Vegas, NV, June 1997. ACM.


Parallel Implementation - Gautier, Hong, Roch, Schreiner

....execution order is known a priori, managing a priority queue according to this order makes it possible to bound the memory space with respect to that required by the serial execution. Such a serialization restricts the programming model: fully strict computations in Cilk [5], nested computations in Nesl [31], planar graphs [4], and non-blocking tasks in Athapascan [15]. Serialization versus parallelization is also related to time efficiency concerning the tuning of the granularity: how to decide whether a computation should be split into subsequent concurrent tasks or not. Assuming the correctness of ....

Girija J. Narlikar and Guy E. Blelloch. Space-efficient implementation of nested parallelism. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 25--36, June 1997.


Breadth-First with Depth-First BDD Construction: A Hybrid.. - Chen, Yang, Bryant (1997)   (1 citation)

....approaches. This algorithm has good memory access locality while keeping the memory overhead below a fixed fraction of the total memory usage. Combining of breadth first and depth first approaches to bound memory overhead has been proven successful in the parallel computation communities [7, 2, 12]. Experimental results on ISCAS85 [4] and multiplier circuits [6] show that our new approach is generally faster than other breadth first and depth first implementations, while keeping memory overhead comparable to the depth first approach. In particular, for the 13 bit multiplier circuit, our ....

NARLIKAR, G. N., AND BLELLOCH, G. E. Space-efficient implementation of nested parallel languages. Draft (available from the authors) (1996).


Practical Parallel Divide-and-Conquer Algorithms - Hardwick (1997)   (1 citation)  Self-citation (Blelloch)

No context found.

Girija J. Narlikar and Guy E. Blelloch. Space-efficient implementation of nested parallelism. In Proceedings of the 6th ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, June 1997.


Pthreads for Dynamic and Irregular Parallelism - Narlikar, Blelloch (1998)   (4 citations)  Self-citation (Narlikar Blelloch)

....in high memory allocation, high resource contention, and poor speedup. We then describe simple modifications we made to the Solaris Pthreads implementation to improve space and time performance. The modified version of the Pthreads implementation uses a space-efficient scheduling mechanism [35] that results in a good speedup, while keeping memory allocation low. For example, for the dense matrix multiply program, the modified Pthreads scheduler reduces the running time on 8 processors compared to the original scheduler by 44%, and the memory requirement by 63%; this allows the program ....

....implementation supports the full functionality of the original Pthreads library. Therefore, any existing Pthreads programs can be executed using our space-efficient scheduler, including programs with blocking locks and condition variables. Some previous implementations of efficient schedulers [11, 29, 35] do not support such blocking synchronizations. Our results indicate that, provided we use a good scheduler, the rich functionality and standard API of Pthreads can be combined with the advantages of dynamic, lightweight threads to result in high performance. The remainder of this paper is ....

[Article contains additional citation context not shown here]

Girija J. Narlikar and Guy E. Blelloch. Space-efficient implementation of nested parallelism. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 25--36, June 1997.


Scheduling Threads for Low Space Requirement and Good Locality - Narlikar (1999)   (6 citations)  Self-citation (Narlikar)

....where S1 is the serial, depth-first space requirement [9]. A computation with W work (total number of operations) and D depth (length of the critical path) was shown to require W/p + O(D) time on p processors [9]. We will henceforth refer to such schedulers as work-stealing schedulers. Recent work [6, 34] has resulted in depth-first scheduling algorithms that require S1 + O(p · D) space for nested-parallel computations with depth D. For programs that have a low depth (a high degree of parallelism), such as all programs in the class NC [14], the space bound of S1 + O(p · D) is ....
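
A small worked example may help in reading the two bounds in this excerpt, taken as W/p + O(D) time and S1 + O(p · D) space. The Python sketch below simply evaluates both expressions for sample values; the constant `c` hidden inside the O(...) terms is an assumption set to 1 for illustration, and the function names are hypothetical.

```python
def time_bound(work, depth, p, c=1):
    """Evaluate W/p + c*D, the time bound quoted for work-stealing schedulers.

    `c` stands in for the constant hidden by O(D); its value is assumed.
    """
    return work / p + c * depth


def space_bound(serial_space, depth, p, c=1):
    """Evaluate S1 + c*p*D, the space bound quoted for depth-first schedulers.

    `c` stands in for the constant hidden by O(p * D); its value is assumed.
    """
    return serial_space + c * p * depth


# Sample values: W = 1,000,000 operations, D = 100, S1 = 1000 units, p = 8.
print(time_bound(10**6, 100, 8))   # 125100.0
print(space_bound(1000, 100, 8))   # 1800
```

With these sample numbers the depth term adds only 100 steps to the 125,000 steps of perfectly divided work, which is why low-depth programs fare well under such bounds.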

.... depth first approach has been extended to handle computations with futures [39] or I structures [16] resulting in similar space bounds [4] Experiments showed that an asynchronous, depth first scheduler often results in lower space requirement in practice, compared to a work stealing scheduler [34]. However, since depth first schedulers use a globally ordered queue, they do not provide some of the practical advantages enjoyed by work stealing schedulers. When the threads expressed by the user are fine grained, the performance may suffer due to poor locality and high scheduling contention ....

[Article contains additional citation context not shown here]

G. J. Narlikar and G. E. Blelloch. Space-efficient implementation of nested parallelism. In Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, pages 25--36, June 1997.


Space-Efficient Scheduling of Parallelism with.. - Blelloch, Gibbons, .. (1997)   (13 citations)  Self-citation (Narlikar Blelloch)

....[4] showed that for nested computations, the time bounds can be maintained while bounding the space by s1 + O(pd), which for sufficient parallelism is just an additive factor over the sequential space. This was used to bound the space of the NESL programming language [5]. Narlikar and Blelloch [30] showed that this same bound can be achieved in a non-preemptive manner (threads are only moved from a processor when synchronizing, forking or allocating memory) and gave experimental results showing the effectiveness of the technique. All this work, however, has been limited to computations in ....

G. J. Narlikar and G. E. Blelloch. Space-efficient implementation of nested parallelism. In Proc. Symposium on Principles and Practice of Parallel Programming, June 1997.


Scheduling Threads for Low Space Requirement and Good Locality - Girija Narlikar (1999)   (6 citations)  Self-citation (Narlikar)

....where S1 is the serial, depth-first space requirement [9]. A computation with W work (total number of operations) and D depth (length of the critical path) was shown to require W/p + O(D) time on p processors [9]. We will henceforth refer to such schedulers as work-stealing schedulers. Recent work [6, 36] has resulted in depth-first scheduling algorithms that require S1 + O(p · D) space for nested-parallel computations with depth D. For programs that have a low depth (a high degree of parallelism), such as all programs in the class NC [15], the space bound of S1 + O(p · D) is ....

.... depth first approach has been extended to handle computations with futures [41] or I structures [17] resulting in similar space bounds [4] Experiments showed that an asynchronous, depth first scheduler often results in lower space requirement in practice, compared to a work stealing scheduler [36]. However, since depth first schedulers use a globally ordered queue, they do not provide some of the practical advantages enjoyed by work stealing schedulers. When the threads expressed by the user are fine grained, the performance may suffer due to poor locality and high scheduling contention ....

[Article contains additional citation context not shown here]

G. J. Narlikar and G. E. Blelloch. Space-efficient implementation of nested parallelism. In Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, pages 25--36, June 1997.


Provably Efficient Scheduling for Languages with.. - Blelloch, Gibbons (1995)   (28 citations)  Self-citation (Blelloch)

....or end of a block then grouping computations into fixed-size blocks does not alter our space bounds. This, however, would require that a compiler break up threads into fixed-size blocks so they can be preempted at the end of each block. Following up the results in this paper, Narlikar and Blelloch [NB97] presented a scheduling algorithm that runs jobs mostly nonpreemptively, while maintaining the same space bounds as presented in this paper. The basic idea is to allocate a fixed pool of memory to a thread when it starts and then allow it to run nonpreemptively until it either terminates, forks ....
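
The basic idea in this excerpt can be sketched in a few lines of Python. The excerpt is cut off after "forks", so the third stopping condition used here (the thread asks for more memory than remains in its pool) is an assumption, suggested by another excerpt on this page which notes that threads are only moved when synchronizing, forking or allocating memory. The action encoding and the name `run_nonpreemptively` are hypothetical and do not reproduce the algorithm of [NB97].

```python
def run_nonpreemptively(actions, quota):
    """Run a thread until it terminates, forks, or outgrows its memory pool.

    `actions` is a list of (kind, amount) pairs, where kind is "work",
    "alloc", or "fork" (an assumed encoding). `quota` is the fixed pool
    of memory handed to the thread when it starts.
    Returns (reason, index of the action at which the thread stopped).
    """
    remaining = quota
    for i, (kind, amount) in enumerate(actions):
        if kind == "fork":
            # Forking hands control back to the scheduler.
            return "fork", i
        if kind == "alloc":
            if amount > remaining:
                # Assumed rule: exceeding the pool also returns control.
                return "quota_exhausted", i
            remaining -= amount
    return "terminated", len(actions)


thread = [("work", 0), ("alloc", 3), ("alloc", 4), ("fork", 0)]
print(run_nonpreemptively(thread, quota=5))   # ('quota_exhausted', 2)
print(run_nonpreemptively(thread, quota=10))  # ('fork', 3)
```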

G. J. Narlikar and G. E. Blelloch. Space-efficient implementation of nested parallelism. In Proc. 6th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, pages 25--36, June 1997.


Pthreads for Dynamic Parallelism - Narlikar, Blelloch (1998)   Self-citation (Narlikar Blelloch)

....and high resource contention. This prevents the compute-intensive benchmark from scaling well. We then describe several simple modifications we make to the Pthreads library that improve space and time performance. The final version of the scheduler uses a provably efficient scheduling mechanism [32] that results in a good speedup for the matrix multiply benchmark, while keeping memory allocation low. The simple and portable code for matrix multiply runs within 10% of hand-optimized BLAS3 code for small matrices, and outperforms it for larger matrices. We also describe a set of 6 additional ....

....fork join or loop parallelism using non preemptive, stateless threads; it further reduces overheads by coarsening and pruning excess parallelism. Recent work has resulted in provably efficient scheduling techniques that provide upper bounds on the space required by the parallel computation [11, 12, 10, 8, 32]. Since there are several possible execution orders for lightweight threads in a computation with a high degree of parallelism, the provably space efficient schedulers restrict the execution order for the threads to bound the space requirement. For example, the Cilk multithreaded system [10] ....

[Article contains additional citation context not shown here]

Girija J. Narlikar and Guy E. Blelloch. Space-efficient implementation of nested parallelism. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 25--36, June 1997.

