Robert C. Miller. A type-checking preprocessor for Cilk 2, a multithreaded C language. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1995.
....as [2] a collection into a whole, of definite, well distinguished objects of our perception or of our thought. Although this intuitive definition is perceived to be correct, it is a circular definition. To avoid such a situation in set theory, we shall choose set and element of a set as axioms [25]. In the above definition, there exist many other things we should not overlook. First, Cantor did not specify how many kinds of sets we can have. Second, Cantor did not put any conditions on how to construct sets. Third, sets are very closely connected with our perception and our thought. ....
Peter W. Zehna and Robert L. Johnson. Elements of set theory. College mathematics series. Allyn and Bacon, Boston, 1962. Department of Electrical Engineering and Computer Sciences, Yang's Scientific Research Institute (http://www.YangSky.com), 741 East First Street, Tucson, Arizona 85719-4830, USA E-mail address: taoyang@yangsky.com
....published in [12] 63 5.1 The Cilk language and runtime system The Cilk language [10] extends C with primitives to express parallelism, and the Cilk runtime system maps the expressed parallelism into parallel execution. A Cilk program is preprocessed to C using the cilk2c translator [76] and then compiled and linked with a runtime library to run on the target platform. Currently supported targets include the Connection Machine CM5 MPP, the Intel Paragon MPP, the Sun SparcStation SMP, the Silicon Graphics Power Challenge SMP, and the Cilk NOW network of workstations (Chapter 6) ....
Robert C. Miller. A type-checking preprocessor for Cilk 2, a multithreaded C language. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1995.
....computation on a computer with an infinite number of processors and a perfect scheduler (imagine God's computer). Work and critical path are properties of the computation alone, and they do not depend on the number of processors executing the computation. In previous work, Blumofe and Leiserson [30, 25] designed Cilk's work-stealing scheduler and proved that it executes a Cilk program on $P$ processors in time $T_P$, where $T_P \le T_1/P + O(T_\infty)$ (1.1). In this dissertation we improve on their work by observing that Equation (1.1) suggests both an efficient implementation strategy for Cilk and an ....
....used this idea in the FFTW codelet generator (see Chapter 6), which generates cache-oblivious fast Fourier transform programs. 1.1.3 Coping with parallelism and memory hierarchy together. What happens when we parallelize a cache-oblivious algorithm with Cilk? The execution-time upper bound from [25] (that is, Equation (1.1)) does not hold in the presence of caches, because the proof does not account for the time spent in servicing cache misses. Furthermore, cache-oblivious algorithms are not necessarily cache-optimal when they are executed in parallel, because of the communication among ....
[Article contains additional citation context not shown here]
R. D. BLUMOFE, Executing Multithreaded Programs Efficiently, PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.
....of $\epsilon$ as small as $e^{-T_\infty}$, the space complexity of our algorithm is at most $S_1 + O(P T_\infty \log(P T_\infty))$ with probability at least $1 - e^{-T_\infty}$. These bounds include all time and space costs for both the computation and the scheduler. 1 Introduction Many parallel programming languages [3, 7, 11, 12, 14, 18, 20, 22, 28] support dynamic threads. The multithreaded model of parallel computation is a general approach to model This work was done while the author was affiliated with the Max-Planck-Institut für Informatik, Saarbrücken, Germany and was partially supported by the IST Programme of the European Union under ....
....number IST-1999-14186 (ALCOM-FT) dynamic, unstructured parallelism. During the execution of a multithreaded computation, a thread may spawn child threads which can be executed in parallel, and it can synchronize with other currently executing threads. In most of the work in the literature [1, 4, 5, 6, 7, 9, 15, 16, 24, 25, 26, 27], a multithreaded computation is modeled as a directed acyclic graph (see Figure 1(a)). Of much concern is how a multithreaded computation can be executed efficiently on a parallel computer. A parallel execution of a multithreaded computation specifies which processor executes each thread and ....
[Article contains additional citation context not shown here]
R. D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, September 1995.
.... is scalable, adaptive, portable, and yet efficient, without even modifying the JVM. We believe that by means of a user-level thread scheduler that employs the work-stealing algorithm, which has been shown to be both efficient and adaptive in scheduling multithreaded programs over multiprocessors [4,5], we can turn Java into a parallel programming environment. Similar projects such as Javalin [6] and ATLAS [3] focus on global computing infrastructure while JAWS's focus is on local area clusters. 2 2.0 Design Overview JAWS allows programmers to write parallel programs in pure Java that can ....
Robert D. Blumofe. Executing Multithreaded Programs Efficiently. Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.
....user does not have to know whether the library is parallel or not. As for the implementation cost, a large body of existing literature in parallel programming language community indicates that nested parallelism, and dynamic parallelism management in general, can be implemented efficiently [10, 2, 19, 8, 5, 13, 6]. In particular, our previous work on fine grain thread library StackThreads MP [18, 17, 20] shows that this can be done using regular sequential compilers, imposing almost no overhead for serial execution. This paper describes implementation of nested parallelism using fine grain thread library ....
....kinetic and potential energy. We parallelize this program by inserting parallel for directive only at outermost level. In the experiments, the number of molecules is 8192 and we simulate for ten steps. The following four programs are taken from KAI benchmark [16] which was originally from Cilk [2] benchmark suite. They do not have enough parallelism in the outermost level, but do have nested parallelism, so should benefit from nested parallelism support. Since they were originally written using KAI's workqueuing pragmas, we simply replaced them with standard OpenMP pragmas. Queens: The ....
[Article contains additional citation context not shown here]
Robert D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.
....Cilk's runtime system takes care of details like load balancing and communication protocols. Unlike other multithreaded languages, however, Cilk is algorithmic in that the runtime system's scheduler guarantees provably efficient and predictable performance. Cilk grew out of theoretical work [1, 5, 6] on the scheduling of multithreaded computations. The basis of Cilk is a provably good scheduling algorithm that has been the cornerstone of Cilk system development. Cilk's provably good scheduler engendered a performance model that accurately predicts the efficiency of a Cilk program using two ....
....The basis of Cilk is a provably good scheduling algorithm that has been the cornerstone of Cilk system development. Cilk's provably good scheduler engendered a performance model that accurately predicts the efficiency of a Cilk program using two simple parameters: work and critical path length [1, 4, 6]. More recent research has included page faults as a measure of locality [2, 3, 12]. The first implementation of Cilk was a direct descendant of PCM Threaded C [10, 12], a C-based package which provided continuation-passing-style threads on Thinking Machines Corporation's Connection Machine Model ....
[Article contains additional citation context not shown here]
Robert D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995. Available as MIT Laboratory for Computer Science Technical Report MIT/LCS/TR-677.
....of the program can be viewed as a set of operations on memory that obey the dependencies imposed by the fork-join constructs. Although the issues of how the computation is expressed and scheduled are extremely important, they are outside the scope of this paper. The reader is referred to [Blu95, Joe96] for one way to deal with these issues. In this paper, we consider the computation as fixed and given a priori. In this paper, we consider only read-write memories. We denote reads and writes to location $l$ by $R(l)$ and $W(l)$ respectively. For the rest of the paper, the set of instructions ....
....$C'$ such that $(C', \Phi') \in \Delta$ and the restriction of $\Phi'$ to $C$ is $\Phi$, i.e., $\Phi'|_C = \Phi$. Completeness follows immediately from constructibility, since the empty computation is a prefix of all computations [footnote 3: This is the case with multithreaded languages, such as Cilk [Blu95, Joe96]] and, together with its unique observer function, belongs to every memory model. Not all memory models are constructible; we shall discuss some nonconstructible memory models in Section 5. However, a nonconstructible model $\Delta$ can be strengthened in an essentially unique way until it ....
[Article contains additional citation context not shown here]
Robert D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.
....for foo can be reused for bar. By checking the variable SYNCHED, the procedure detects this situation, and the space is reused. Otherwise, it allocates new space for bar. The variable SYNCHED is actually more like a macro than an honest-to-goodness variable. The Cilk type-checking preprocessor [22] actually produces two clones of each Cilk procedure: a fast clone that executes common-case serial code, and a slow clone that worries about parallel communication. In the slow clone, which is rarely executed, SYNCHED inspects Cilk's internal state to determine if any children are ....
Robert C. Miller. A type-checking preprocessor for Cilk 2, a multithreaded C language. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1995.
....parallelism, since many parallel programming languages allow the expression of dynamic threads. Such languages include data-parallel languages like High Performance Fortran [31] and NESL [5], dataflow languages like ID [21], control-parallel languages with fork-join constructs like CILK [8, 27], CC [19] and Proteus [40], languages with synchronization primitives like Multilisp [30], Mul-T [35], COOL [20], SISAL [25], Jade [49], and various other user-level thread libraries [3, 46, 53]. For the execution of a multithreaded computation on a parallel computer, one should specify which ....
....27, 30, 35] have used work stealing to achieve the scheduling goals described above and to provide high performance. Work stealing is a technique in which underutilized processors try to steal work from other, hopefully overutilized processors. Indeed, work stealing has been proved (see e.g. [1, 8, 12, 41]) to achieve a fair combination of the above objectives and to be quite efficient in terms of these performance parameters. Several programming models have been proposed for parallel computing. The model of parallelism supported by a language determines the style in which threads may be created or ....
[Article contains additional citation context not shown here]
R. D. Blumofe, Executing Multithreaded Programs Efficiently, Ph.D. Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.
....statement before returning to C 1 . 8.5.2 Performance Model The Cilk runtime system provides an algorithmic model of performance based on two parameters: $T_1$ (work) and $T_\infty$ (critical path length). The execution of any parallel program can be measured in terms of these two parameters [BJK 95,Blu95,BL94,KR90]. The work of a program corresponds to its execution time on one processor. The critical path length corresponds to its execution time on an infinite number of processors, which is also the time required to execute the threads along the longest dependency path in the Cilk dag. The ....
....which is also the time required to execute the threads along the longest dependency path in the Cilk dag. The performance of the Cilk runtime system depends on the efficiency of its work scheduler. Cilk uses a provably efficient scheduler which is based on the concept of randomized work stealing [Blu95,BL94]. With this technique, a processor (the thief) who runs out of work selects another processor (the victim) randomly, from whom to steal a ready thread. It has been shown that the scheduler guarantees that the expected execution time of a lock-free Cilk program on $P$ processors is ....
[Article contains additional citation context not shown here]
Robert D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.
....but he has no direct control over the scheduling of his application on a given number of processors. It is up to the runtime scheduler to map the dynamically unfolding computation onto the available processors so that the computation executes efficiently. Good on-line schedulers are known [3, 4, 5], but their analysis is complicated. For simplicity, we'll illustrate the principles behind these schedulers using an off-line greedy scheduler. A greedy scheduler schedules as much as it can at every time step. On a P-processor computer, time steps can be classified into two types. If there ....
Robert D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.
....computation on a computer with an infinite number of processors and a perfect scheduler (imagine God's computer). Work and critical path are properties of the computation alone, and they do not depend on the number of processors executing the computation. In previous work, Blumofe and Leiserson [30, 25] designed Cilk's work-stealing scheduler and proved that it executes a Cilk program on $P$ processors in time $T_P$, where $T_P \le T_1/P + O(T_\infty)$ (1.1). In this dissertation we improve on their work by observing that Equation (1.1) suggests both an efficient implementation strategy for Cilk and an ....
....used this idea in the FFTW codelet generator (see Chapter 6), which generates cache-oblivious fast Fourier transform programs. 1.1.3 Coping with parallelism and memory hierarchy together. What happens when we parallelize a cache-oblivious algorithm with Cilk? The execution-time upper bound from [25] (that is, Equation (1.1)) does not hold in the presence of caches, because the proof does not account for the time spent in servicing cache misses. Furthermore, cache-oblivious algorithms are not necessarily cache-optimal when they are executed in parallel, because of the communication among ....
[Article contains additional citation context not shown here]
R. D. BLUMOFE, Executing Multithreaded Programs Efficiently, PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.
....protocol for implementing the ready deque in the work-stealing scheduler. Unlike in Cilk 1, where the Cilk scheduler was an identifiable piece of code, in Cilk 5 both the compiler and runtime system bear the responsibility for scheduling. Cilk 5's compiler cilk2c is a source-to-source translator [74, 24] which converts the [footnote 1: The contents of this chapter are joint work with Matteo Frigo and Charles Leiserson and will appear at PLDI '98 [41].] [Figure 3-1: Generating an executable from a Cilk program (fib.cilk, cilk2c, fib.c, gcc, fib.o, ld with rts.a, fib).] Our compiler cilk2c translates Cilk code into regular ....
Robert C. Miller. A type-checking preprocessor for Cilk 2, a multithreaded C language. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1995.
....a Cilk programmer to predict the performance of his Cilk programs on parallel machines. 2.1 The history of Cilk Cilk is a multithreaded language for parallel programming that generalizes the semantics of C by introducing linguistic constructs for parallel control. The original Cilk 1 release [11, 14, 58] featured a provably efficient, randomized, work stealing scheduler [11, 16] but the language was clumsy, because parallelism was exposed by hand using explicit continuation passing. The Cilk 2 language provided better linguistic support by allowing the user to write code using natural ....
....machines. 2.1 The history of Cilk Cilk is a multithreaded language for parallel programming that generalizes the semantics of C by introducing linguistic constructs for parallel control. The original Cilk 1 release [11, 14, 58] featured a provably efficient, randomized, work stealing scheduler [11, 16], but the language was clumsy, because parallelism was exposed by hand using explicit continuation passing. The Cilk 2 language provided better linguistic support by allowing the user to write code using natural spawn and sync keywords, which the compiler then converted to the Cilk 1 ....
[Article contains additional citation context not shown here]
Robert D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.
....in our scheme. Our algorithm is described in detail in Section 5. In Section 6 we present experimental evidence that the loss of efficiency traded in for portability is acceptable. 2 Related Work Much of recent related work is in the area of checkpointing distributed programs, for example [2, 4, 5, 6, 9, 11, 16, 19, 20]. In addition, projects described in [10, 12, 15, 18] concentrate on shared or distributed shared memory (DSM) systems. Many of the optimizations discussed in the work cited above are applicable to checkpointing sequential programs. We restrict our discussion below to these optimizations. ....
....objects can be copied as entities, because they are UCF aligned. Consequently, array a is not pushed element-wise, which would reverse the order of the elements on the shadow stack. [Figure 7: Checkpointing the stack in the presence of a call by reference.] After saving, the restore phase begins. The stack is restored by calling the returned functions and ....
[Article contains additional citation context not shown here]
Robert D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 1995.
.... The expected running time on $P$ processors, including scheduling overhead, is $T_P = T_1/P + O((\log P)\,T_\infty)$. Moreover, for any $\epsilon > 0$, with probability at least $1 - \epsilon$, the execution time is $T_P = T_1/P + O((\log P)(T_\infty + \lg(1/\epsilon)))$. Proof: The proofs of Blumofe and Leiserson [Blu95, BL94] work with some simplifications and minor modifications. In those proofs, an accounting scheme is used where processor time is accounted for by putting a dollar in one of three buckets on every time unit. The first bucket, the work bucket, gets a dollar whenever a processor is actually ....
Robert D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, September 1995. (http://www.cs.utexas.edu/users/rdb/papers/PhDthesis.ps.gz).
....at least largely protocol-free, multithreaded code. Our second version of Cilk provides a call-return semantics for parallelism using simple spawn and sync keywords, features that remain in today's Cilk 5. Instead of being a simple C preprocessor, Rob Miller implemented Cilk 2's compiler cilk2c [32] as a type-checking source-to-source translator which compiles a Cilk source into a C postsource. The C postsource is then run through an ordinary C compiler and linked with the Cilk runtime system to produce object code. Cilk 2 was a resounding success. Its call-return parallelism simplified the ....
Robert C. Miller. A type-checking preprocessor for Cilk 2, a multithreaded C language. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1995.
....the Cilk language, illustrating how Cilk supports the programming of parallel game tree search and other chess mechanisms. 1 Introduction The Supercomputing Technologies (Supertech) Research Group in the MIT Laboratory for Computer Science began developing the Cilk multithreaded language [5, 8, 22, 27] in 1994. Development of Cilk has been intertwined with the development of a series of computer chess programs: StarTech, Socrates, and Cilkchess. Although the development of Cilk itself has been funded by the U.S. Defense Advanced Research Projects Agency (DARPA), all of our chess programs have ....
....including support for parallel I/O and streams, job scheduling, and fault tolerance. Cilk software, documentation, publications, and up-to-date information are available via the Web at http://supertech.lcs.mit.edu/cilk. Detailed descriptions of the foundation and history of Cilk can be found in [5, 27, 37, 21]. Acknowledgments Many people have contributed to the series of Cilk chess programs produced by the Supertech research group of the MIT Laboratory for Computer Science. Special thanks go to Chris Joerg, who has continually provided solid input to every program, even after receiving his Ph.D. ....
Robert D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995.
No context found.
R. D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Sept. 1995.
.... Cilk 1 was an adaptively parallel and fault-tolerant network-of-workstations implementation, called Cilk NOW [1, 7]. The next release, Cilk 2, featured full type checking, supported all of ANSI C in its C language subset, and offered call-return semantics for writing multithreaded procedures [14]. Instead of the C This research was supported in part by the Defense Advanced Research Projects Agency (DARPA) under Grant N00014-94-1-0985. Aske Plaat was supported in part by a postdoctoral fellowship from the Erasmus University Rotterdam, the Netherlands. An early version of this paper ....
Robert C. Miller. A type-checking preprocessor for Cilk 2, a multithreaded C language. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1995.
No context found.
R. D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Sept. 1995.
....currently working on improving these results. Appendix. During the time between our results becoming publicly known [7] and this journal publication, we have explored multithreaded computing more fully. We have been able to characterize the performance of a distributed thread-stealing algorithm [5, 8]. For the class of fully strict (well-structured) computations, this randomized algorithm achieves execution space bounded by $S_1 P$ and expected execution time bounded by $O(T_1/P + T_\infty)$, including scheduling overheads. Additionally, in contrast to Algorithm LDF, this thread-stealing algorithm is ....
....by $S_1 P$ and expected execution time bounded by $O(T_1/P + T_\infty)$, including scheduling overheads. Additionally, in contrast to Algorithm LDF, this thread-stealing algorithm is efficient with respect to communication. We have implemented this thread-stealing algorithm in the runtime system for Cilk [5, 6], a parallel multithreaded extension of the C language. By employing a provably efficient scheduler, Cilk is able to deliver efficient and predictable performance, guaranteed. Moreover, structure in the Cilk programming model facilitates the implementation of adaptive parallelism and transparent ....
[Article contains additional citation context not shown here]
R. D. Blumofe, Executing Multithreaded Programs Efficiently, Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 1995.
....apply for Algorithm GDF' and for the distributed algorithm that will be presented in the next section. Alternatively, by sequestering the nonstrictly spawned threads, the scheduler itself can budget the nonstrict spawns and achieve these same time and space bounds; details can be found in [4]. 6. Distributed scheduling algorithms. In a distributed thread scheduling algorithm, each processor works depth-first out of its own local priority queue. Specifically, to get a thread to work on, a processor removes the deepest ready thread from its local queue. Ideally, we would like the ....
....the naive adoption of this technique does not work. In particular, threads must migrate occasionally and some degree of synchronization is needed to avoid the large deviations that result if this random process is run over a long period of time. Further discourse on these problems can be found in [4]. In order to achieve the desired result, we modify the Karp and Zhang technique by incorporating a new mechanism to enforce a modest degree of synchrony among the processors. Algorithm LDF (which stands for local depth first) operates in iterations, with each iteration consisting of a ....
[Article contains additional citation context not shown here]
R. D. Blumofe, Managing Storage for Multithreaded Computations, Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 1992; Tech. Report MIT/LCS/TR-552, MIT Laboratory for Computer Science, Cambridge, MA, 1992.
....apply for Algorithm GDF' and for the distributed algorithm that will be presented in the next section. Alternatively, by sequestering the nonstrictly spawned threads, the scheduler itself can budget the nonstrict spawns and achieve these same time and space bounds; details can be found in [4]. 6. Distributed scheduling algorithms. In a distributed thread scheduling algorithm, each processor works depth-first out of its own local priority queue. Specifically, to get a thread to work on, a processor removes the deepest ready thread from its local queue. Ideally, we would like the ....
....the naive adoption of this technique does not work. In particular, threads must migrate occasionally and some degree of synchronization is needed to avoid the large deviations that result if this random process is run over a long period of time. Further discourse on these problems can be found in [4]. In order to achieve the desired result, we modify the Karp and Zhang technique by incorporating a new mechanism to enforce a modest degree of synchrony among the processors. Algorithm LDF (which stands for local depth first) operates in iterations with each iteration consisting of a ....
[Article contains additional citation context not shown here]
R. D. Blumofe, Managing storage for multithreaded computations, master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Sept. 1992. Also available as MIT Laboratory for Computer Science Technical Report MIT/LCS/TR-552.