S.L. Peyton Jones, C. Clack, and J. Salkild, "High-Performance Parallel Graph Reduction", Proc. PARLE '89 -- Conf. on Parallel Architectures and Languages Europe, Springer-Verlag LNCS 365, pp. 193-206, 1989.
....this synchronous communication into an asynchronous form, where computation can continue once a message has been sent. We are now examining reimplementing RVL using a form of parallel graph reduction [22] to allow computation on the client machines to continue while network requests are pending. One of the pleasant advantages of semi-replicated distribution is that the system is robust to failure. The server never waits for clients, so a client crash cannot cause the server to hang. Clients ....
Peyton Jones, S.L., Clack, C., and Salkild, J. High-performance parallel graph reduction. In Proceedings of PARLE 89, pages 193--206, 1989.
....Calls the ( function. In this way, the evaluation of A and B can occur in parallel. Note that a task is sparked only when a value is definitely needed. This is called conservative parallelism. It would also be possible to spark tasks speculatively, when a value might be needed. [Figure 3: Graph for the expression ( x 3) y)] As mentioned in [19], speculative parallelism introduces various administrative problems, such as: 1. Ensuring that speculative tasks do not take priority over ....
....nodes in a graph: each node represents a function application, and the child nodes represent its arguments (see figure 3). When an expression is evaluated, the code associated with the expression's root node is executed, and then the root node is overwritten with the result value. As described in [19], the runtime system keeps a pool of ready tasks and schedules them as follows. If a task sparks a child, then the child is placed in the ready pool. When a task requires the value of a child it previously sparked, it evaluates the node normally, causing one of the following cases to apply: 1. ....
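The overwrite-with-result behaviour described in this context can be illustrated with a small sketch. The Python below is not taken from any of the cited papers: `App`, `evaluate`, and the `reductions` counter are hypothetical names, and arguments are evaluated eagerly for brevity rather than on demand.

```python
import operator


class App:
    """Hypothetical application node: a function applied to child nodes."""

    def __init__(self, fn, *args):
        self.fn = fn
        self.args = args
        self.value = None
        self.reduced = False


reductions = 0  # counts how many nodes were actually reduced


def evaluate(node):
    """Reduce a node, overwriting it with its result so sharers see the value."""
    global reductions
    if not isinstance(node, App):
        return node  # plain values are already in normal form
    if not node.reduced:
        arg_values = [evaluate(arg) for arg in node.args]
        node.value = node.fn(*arg_values)
        node.reduced = True
        node.fn = None  # the node now only holds its result
        node.args = ()
        reductions += 1
    return node.value


# The subgraph (3 * 4) is shared by both arguments of the addition.
shared = App(operator.mul, 3, 4)
root = App(operator.add, shared, shared)
result = evaluate(root)
```

Because `shared` is overwritten after its first reduction, the second reference to it reuses the stored value, so only two reductions happen in total.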
Simon L. Peyton Jones, Chris Clack, Jon Salkild, "High-performance parallel graph reduction," in LNCS, PARLE'89, Vol 1, pp. 193-206.
....blocked task pool hold tasks which wait for a result from another task. A five-node prototype implementation compares favourably with the sequential implementation of the LML. GRIP: Graph Reduction in Parallel (GRIP) is a shared-memory architecture for parallel functional programs. The hardware [101] consists of up to eighty M68020 Processing Elements (PEs), up to twenty microprogrammable Intelligent Memory Units (IMUs), and a high-bandwidth packet-switched bus to interconnect them. This enables a two-level address space: fast, private memory is held in the PEs as a local address space and ....
....at a higher level than read/write instructions of conventional memory; it performs variable-sized node allocation, garbage collection and thread scheduling. Each PE uses its private memory as a local heap; it allocates new closures and caches copies of global closures from IMUs. GRIP [101, 43, 3] runs parallel Haskell programs. Its design has been adapted to the evaluation model of the STG machine [105, 100] by adding a special field in every closure to indicate the global closure of which it is a copy. If the closure has been locally allocated, this field is set to zero. A thread is a ....
[Article contains additional citation context not shown here]
Simon L. Peyton Jones, Chris Clack, and Jon Salkild. High-performance parallel graph reduction. In E. Odijk, M. Rem, and J.-C. Syre, editors, Parallel Architectures and Languages Europe, volume 365 of LNCS, pages 193--206, Eindhoven, The Netherlands, June 1989. Springer Verlag.
....detail are needed. Slowly new languages are being developed that hide various complexities of parallel programming. Functional programming languages have often been proposed and used for parallel programming as they are semantically clean and offer a non-von Neumann abstraction of computation [3, 20, 72]. The whole range of parallel programming styles has been tried using functional languages, from message passing libraries (e.g. SML with message passing libraries [50]) through to automatic parallelisation (e.g. Goldberg's Buckwheat and Alfalfa [37]), with varying degrees of success. Most ....
S. Peyton Jones, C. Clack, and J. Salkild. High-performance parallel graph reduction. In PARLE'89, volume 365(1) of LNCS, pages 193--206. Springer Verlag, 1989.
....these systems implement a shared heap with near-ideal performance, but the number of processors is limited by contention for the snooping bus. Meanwhile there have been numerous attempts to implement PGR on more scalable architectures, interesting examples being Alfalfa [19], Alice [23], GRIP [37], and George's Butterfly implementation [18]. None of these has achieved performance nearly as satisfactory as Augustsson and Johnsson's ⟨ν,G⟩-machine. A key reason for this has been the difficulty of communicating and sharing data structures. Recent developments in the design of large ....
....shared abstract data types) where cache coherency can be optimised. 7.1 Related work. Several authors have studied the cache performance issues with heaps and garbage collection, most interestingly Wilson et al. [46] and Appel [3]. Our work concentrates instead on coherency. The GRIP machine [37] uses a kind of cache mechanism: nodes are built in local memory and a subgraph is flushed to the globally accessible memory only when another processor might access it. In the Glasgow group's more recent work (see, for example, [22]) they propose a more cache-like scheme where, when a node is ....
Simon L. Peyton Jones, Chris Clack, and Jon Salkild. High-performance parallel graph reduction. In E. Odijk, M. Rem, and J.-C. Syre, editors, PARLE 89 Parallel Architectures and Languages Europe, Eindhoven, June 1989, volume 365 of Lecture Notes in Computer Science, pages 193--206, Berlin, 1989. Springer-Verlag.
....is the time required to perform all its computations, not including the overhead of creating the thread and the other overheads imposed by parallel execution, such as communication costs. We use a non-strict, purely functional language (Haskell) with an evaluate-and-die mechanism of computation [10]. In this model it is possible to dynamically create new subsidiary threads to evaluate sub-expressions that are found to be needed, or to entirely avoid creating threads by absorbing the work which they would have done into a parent thread. The optimal granularity for all threads is a compromise ....
....construct is the one which sparks a closure. Sparks are similar to lazy futures [9] in that they could potentially be turned into parallel threads. If so they compute the result and terminate without having to notify the parent thread. It is important to note that this evaluate-and-die mechanism [10] dynamically increases the granularity of the threads: a parent process may subsume the computation of a child thread. However, this does not prevent the system from producing many small threads if the overall workload is low. Therefore, our granularity control mechanisms aim at increasing thread ....
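The evaluate-and-die behaviour quoted above can be sketched roughly as follows. This is an illustrative Python model rather than code from any cited system: `Node`, `spark`, `force`, `idle_worker`, and the three state names are all assumptions made for the example. A sparked node sits in a pool; if the parent needs it before any worker has started it, the parent simply evaluates it itself, and the spark is later discarded.

```python
import threading
from collections import deque


class Node:
    """Hypothetical closure: a suspended computation overwritten by its value."""

    def __init__(self, thunk):
        self.thunk = thunk
        self.state = "unevaluated"  # then "under_evaluation", then "done"
        self.value = None
        self.evaluations = 0  # for demonstration: how often the thunk ran
        self.cond = threading.Condition()


spark_pool = deque()  # potential parallelism, not yet turned into threads


def spark(node):
    """Record that a node could usefully be evaluated in parallel."""
    spark_pool.append(node)
    return node


def force(node):
    """Return the node's value, evaluating it in the calling thread if needed."""
    with node.cond:
        while node.state == "under_evaluation":
            node.cond.wait()  # someone else is working on it: block
        if node.state == "done":
            return node.value
        node.state = "under_evaluation"  # interlock against double evaluation
        thunk = node.thunk
    result = thunk()
    with node.cond:
        node.value, node.thunk = result, None  # overwrite with the result
        node.state = "done"
        node.evaluations += 1
        node.cond.notify_all()  # wake blocked threads; the evaluator just ends
    return result


def idle_worker():
    """An idle processor turns sparks into work until the pool is empty."""
    while True:
        try:
            node = spark_pool.popleft()
        except IndexError:
            return
        if node.state == "unevaluated":  # otherwise the spark is discarded
            force(node)


def demo():
    a = spark(Node(lambda: sum(range(1000))))
    b = Node(lambda: 6 * 7)
    worker = threading.Thread(target=idle_worker)
    worker.start()
    total = force(b) + force(a)  # the parent may subsume the spark for a
    worker.join()
    return total, a.evaluations, b.evaluations
```

Whichever thread reaches `a` first evaluates it; the other either blocks until the value is written or finds it already done, so the thunk runs exactly once in both schedules.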
S.L. Peyton Jones, C. Clack, and J. Salkild. High-Performance Parallel Graph Reduction. Proc. PARLE '89, LNCS 365, pp. 193--206, 1989.
....a) and continues with the evaluation of b, which will normally contain a use of a. a seq b evaluates first a then b in turn. It is thus an exact equivalent of the normal sequential composition operator ; in an imperative language. This represents the evaluate-and-die model of parallelism [PCS89]. A precise operational semantics for these primitives has been constructed by Hall et al. [HBT 98]. [Footnote 1] With apologies to Pascal Serrarens [Ser98]. For example, a parallel Fibonacci function can be described as follows. Note the use of seq to ensure that n1 is evaluated before n2 in the body ....
S.L. Peyton Jones, C. Clack, and J. Salkild, "High-Performance Parallel Graph Reduction", Proc. PARLE '89 -- Conf. on Parallel Architectures and Languages Europe, Springer-Verlag LNCS 365, pp. 193--206, 1989.
....the local transformations and a host that implements the global transformation (see figure 1). A reduction agent is typically a processor-memory pair with communication capabilities. 6.1 Parallel evaluation model. Each reduction agent evaluates programs using parallel graph reduction techniques [1, 18, 19]. In graph reduction, a program is represented by a graph of expressions and the execution of this program consists of reducing the corresponding graph until the normal form, i.e. the result, is reached. This process may be carried out in parallel since any subgraph can be ....
S.L. Peyton-Jones, C. Clack and J. Salkild, High Performance Parallel Graph Reduction, In Proc. PARLE '89 Parallel Architectures and Languages Europe, June 1989, Odijk E. et al (Eds), Springer Verlag (LNCS 365), pp. 193-206.
....languages. In the first place, of all implementations, there are not many aimed at distributed memory architectures. Most recent research in parallel functional programming focuses on implementations for (virtual) shared memory architectures (⟨ν,G⟩, AMPGR, GAML, GRIP, Flagship, and HyperM) [5, 6, 7, 8, 9, 10, 11, 12, 13]. And secondly, the exact costs of copying have not been made explicit for distributed memory implementations. Sometimes the use of an interpreter causes uncertainty about realistic copying costs (PAM, π-RED) [14, 15]. In other cases only some simple divide and conquer programs have been ....
S. L. Peyton Jones, C. Clack, J. Salkild (1989). High Performance Parallel Graph Reduction. In Proceedings of Parallel Architecture and Languages Europe (PARLE `89), Eindhoven, the Netherlands, Springer LNCS 365/366, page 193-206.
....the game is seeing how far it extends to more realistic programs. Nevertheless, such tests provide an important sanity check: if the system does badly here then all is lost. garbage collect its local heap independently of any other PE, a property we found to be crucial on the GRIP multiprocessor [30]. • Thread distribution is performed lazily, but data distribution is performed somewhat eagerly. Threads are never exported to other PEs to try to balance the load. Instead, work is only moved when a processor is idle (Section 2.2). Moving work prematurely can have a very bad effect on ....
....strategy, tasks are statically allocated to processors by means of annotations. Relative speedups of 8.2 to 14.8 are reported for simple benchmarks on a 16-processor Transputer system [17]. 5.3 GRIP. GUM's design is a development and simplification of our earlier work on the GRIP multiprocessor [30]. GRIP's memory was divided into fast unshared memory that was local to a PE, with separate banks of globally addressed memory that could be accessed through a fast packet-switched network. Closures were fetched from global memory singly on demand rather than using GUM-style bulk fetching. While ....
Peyton Jones SL, Clack C, Salkild J, "High-performance parallel graph reduction", Proc PARLE '89, Springer Verlag LNCS 365 (June 1989).
....run time. • An expression only needs to be marked as being evaluated when its evaluation begins, and unmarked at the end. At the time of doing the work, many of these features were novel; some similar ideas now appear in publications of work that was proceeding concurrently with ours; see [PCS89, AJ89, LKID89] for example. [Footnote 1] We will see that the code produced for will spawn one of the argument expressions and have the process evaluating the application of try to evaluate the other itself. It is therefore possible that the spawned argument will be evaluated while the process ....
....EVAL instruction is at the front of the list of processors). We assume that a number of auxiliary functions have been defined: [Footnote 10] This idea grew up during the time when we were holding regular meetings with Simon Peyton Jones and his GRIP team. It has already been reported elsewhere, in [PCS89, Bur88b] for example. [Footnote 11] Miranda has lists as a built-in data type, with : as an infix Cons, so that Cons h t is written h:t, and the empty list is [] rather than Nil. It also allows a shorthand notation for finite lists, so that the list: Cons a (Cons b (Cons c Nil)) can be written: [a, b, ....
S.L. Peyton Jones, C. Clack, and J. Salkild. High-performance parallel graph reduction. In E. Odijk, M. Rem, and J.-C. Syre, editors, Proceedings of PARLE 89, volume 1, pages 193--206, Eindhoven, The Netherlands, 12--16 June 1989. Springer-Verlag LNCS 365.
....tolerates latency. Traditionally in functional languages, distribution has been considered as providing support for parallelism and thus, tolerating latency has been a major issue in improving efficiency. This distributed system is not aimed at providing a high performance parallel application [25, 16, 7, 32], but a new model of distribution and computational mobility for loosely coupled systems. However, tolerating latency is also a major issue since a distributed system needs always to respond quickly. As already explained, introducing additional demand which speculatively evaluates expressions ....
Simon L. Peyton Jones, Chris Clack, and Jon Salkild. High-performance parallel graph reduction. In E. Odijk, M. Rem, and J.-C. Syre, editors, Parallel Architectures and Languages Europe, volume 365 of LNCS, pages 193--206, Eindhoven, The Netherlands, June 1989. Springer Verlag.
....implementation techniques which have already been developed for functional languages. In particular, pseudo parallel execution on a single processor can be introduced with only minor changes to the usual graph reduction model, and adds only a very small overhead to ordinary sequential execution [5]. Parallel execution can be introduced by implementing separate lightweight threads of execution to evaluate subexpressions, with simple interlocks to prevent multiple evaluation of the same subexpression. The threads share a single heap, but each has its own stack. If several threads are ....
S.L. Peyton Jones & C. Clack & J. Salkild, "High-performance parallel graph reduction", Proc. Parallel Architectures and Languages Europe, Lecture Notes in Computer Science 365, Springer Verlag, 1989
....of aggregate structures. The second phase in the compiler generates code from the annotated program, according to some abstract parallel evaluation model (e.g. graph reduction). Several implementations have been developed around this scheme (e.g. Augustsson and Johnsson 1989, Burn 1989, Peyton Jones et al. 1989). However, their success has been limited due to the following reasons (Rabhi and Manson 1991a): • the compiler fails to detect parallelism, or there is no parallelism because of the way the program is written; • the compiler generates too much parallelism and the resources of the machine ....
Peyton-Jones S.L. et al. (1989). High Performance Parallel Graph Reduction, In Proc. PARLE' 89, Odijk E. et al. (Eds), LNCS 365, pp.
.... issue demands for remote data in a dynamic and unpredictable fashion; the resulting long-latency communication operations have adverse effects on distributed implementations of abstract machines, such as the Spineless Tagless G-machine (STGM) [PS88, Pey92] with its implementations for GRIP [PCS89] and GUM [THJ 96]. The subject of this paper is a variant of the STGM that is designed to reduce the impact of long-latency communication on the execution time; to this end, it exploits the inherent fine-grain parallelism contained in functional programs by employing multithreading and ....
Simon L. Peyton Jones, Chris Clack, and Jon Salkild. High-performance parallel graph reduction. In E. Odijk, M. Rem, and J.-C. Syre, editors, Proceedings of PARLE (Volume 1). Springer Verlag, 1989. LNCS 365.
....in this straightforward extension there would be a pair of stacks, or perhaps only a single stack, for each thread of control. Such systems have been designed and implemented by Maranget [Mar91] and [Geo89], apparently with good performance. This is also the approach taken in the GRIP project [Jon87, JCS89]. However, there are some properties of the standard G-machine that made us want to try a different approach for a parallel implementation. Firstly, in the G-machine, when reduction of a function application starts, the arguments of the application node (either in the form of a chain of binary ....
S. L. Peyton Jones, C. Clack, and J. Salkild. High-Performance Parallel Graph Reduction. In Proceedings of PARLE'89 Parallel Architectures and Languages Europe (Vol I), volume LNCS 365, pages 193--206. Springer-Verlag, June 1989.
....on performance, as long as sufficient resources are made available for the truly necessary computations. Although this strategy appears simple at first, speculative evaluation is difficult to implement, and some researchers have predicted that the resulting overhead will outweigh its benefits [Peyton Jones et al. 89] [Roe 91]. Functional programming languages are amenable to parallel processing and to speculative evaluation, because a function must always produce the same mapping from inputs to outputs, and no function may ever produce any other side effects. In other words, the expressions in a functional ....
....(n - 2) + 1 otherwise = p par (p + nfib (n - 2) + 1) where p = nfib (n - 1) [Figure 5.9: Execution Times for Nfib 30; x-axis: Number of Processors; series: Conservative, Speculative] On GRIP, Cfib 30 runs approximately twice as fast as Nfib 30 [Hammond Peyton Jones 90], but on the BBN TC2000, Cfib is only marginally better than Nfib. This result indicates that the overhead for task creation and scheduling is significantly less for the TC2000 implementation than for the GRIP implementation. Low-overhead task creation and scheduling is a crucial part of the ....
S.L. Peyton Jones, C. Clack, and J. Salkild. High-performance parallel graph reduction. In [Odijk et al. 89], pages (1):193--206.
....GUM Runtime System. 2.1 The Design of GUM. GUM is the runtime system for GpH [THM 96], a parallel variant of the Haskell lazy functional language [PHA 97]. The design was based on our experience developing parallel functional runtime systems for the GRIP distributed memory machine (DMP) [PCS89] and the BBN Butterfly shared memory machine (SMP) [Mat93]. GUM represents an abstract machine model appropriate to both architectures. More concretely, GUM is a parallel graph reduction machine [PCS89]. In this model both data and program are represented via graph structures. Executing a program ....
....developing parallel functional runtime systems for the GRIP distributed memory machine (DMP) [PCS89] and the BBN Butterfly shared memory machine (SMP) [Mat93]. GUM represents an abstract machine model appropriate to both architectures. More concretely, GUM is a parallel graph reduction machine [PCS89]. In this model both data and program are represented via graph structures. Executing a program means replacing a graph with its result. Semi-explicit parallelism in GpH requires the programmer to annotate expressions that can be evaluated in parallel. The runtime system then dynamically ....
[Article contains additional citation context not shown here]
S.L. Peyton Jones, C. Clack, and J. Salkild, "High Performance Parallel Graph Reduction", Proc. PARLE'89, Springer-Verlag LNCS 365, June 1989.
....our earlier research on parallelising substantial Haskell applications [1] and developing a suite of simulation and profiling tools. 2 The GUM Runtime System. GUM is the runtime system for GpH [7], a parallel variant of the Haskell lazy functional language. Being a parallel graph reduction machine [3], GUM represents an architecture-independent abstract machine model appropriate to both shared and distributed memory architectures. In this model both data and program are represented via graph structures. Executing a program means rewriting a graph with its result. Semi-explicit parallelism in ....
....a program means rewriting a graph with its result. Semi-explicit parallelism in GpH requires the programmer to annotate expressions that can be evaluated in parallel. The runtime system then dynamically distributes data and work among the available processors. Potential parallelism may be subsumed [3] by existing threads in a similar way as in the lazy task creation mechanism [2], thereby dynamically increasing thread granularity. This dynamic granularity control, together with overlapping computation with communication (latency hiding), is crucial for achieving high performance on very ....
S.L. Peyton Jones, C. Clack, and J. Salkild. High Performance Parallel Graph Reduction. In Parallel Architectures and Languages Europe (PARLE'89), LNCS 365, pp. 193-206, Eindhoven, The Netherlands, Jun. 1989. Springer-Verlag.
No context found.
SL Peyton Jones, C Clack & J Salkild [June 1989], "High-performance parallel graph reduction," in Proc Parallel Architectures and Languages Europe (PARLE), E Odijk, M Rem & J-C Syre, eds., LNCS 365, Springer Verlag.
....local addresses, within a PE's local heap, from global addresses, that point between local heaps. The management of global addresses is such that each PE can garbage collect its local heap without synchronising with other PEs, a property we found to be crucial on the GRIP multiprocessor [22]. • Thread distribution is performed lazily, but data distribution is performed somewhat eagerly. Threads are never exported to another PE to try to balance the load. Instead, work is only moved when a processor is idle (Section 2.2). Moving work prematurely can have a very bad effect on ....
....to the GUM fishing strategy, tasks are statically allocated to processors by means of annotations. Relative speedups of 8.2 to 14.8 are reported for simple benchmarks on a 16-processor Transputer system [10]. 4.3 GRIP. GUM's design is a development of our earlier work on the GRIP multiprocessor [22]. GRIP's memory was divided into fast unshared memory that was local to a PE, with separate banks of globally addressed memory that could be accessed through a fast packet-switched network. Objects were fetched from global memory singly on demand rather than using GUM-style bulk fetching. While ....
Peyton Jones SL, Clack C, Salkild J, "High-performance parallel graph reduction", Proc PARLE '89, Springer Verlag LNCS 365 (June 1989).
....distinguishes local addresses, within a PE's local heap, from global addresses, that point between local heaps. The management of global addresses is such that each PE can garbage collect its local heap independently of any other PE, a property we found to be crucial on the GRIP multiprocessor [20]. • Thread distribution is performed lazily, but data distribution is performed somewhat eagerly. Threads are never exported to other PEs to try to balance the load. Instead, work is only moved when a processor is idle (Section 2.2.2). Moving work prematurely can have a very bad effect on ....
Peyton Jones SL, Clack C, Salkild J, "High-performance parallel graph reduction", Proc PARLE, Odijk E, Rem M, Syre J-C (Eds) Springer Verlag LNCS 365 (June 1989)
No context found.
SL Peyton Jones, C Clack & J Salkild [June 1989], "High-performance parallel graph reduction," in Proc Parallel Architectures and Languages Europe (PARLE), E Odijk, M Rem & J-C Syre, eds., LNCS 365, Springer Verlag, 193--206.
No context found.
S.L. Peyton Jones, C. Clack, and J. Salkild, "High-Performance Parallel Graph Reduction", Proc. PARLE '89 -- Conf. on Parallel Architectures and Languages Europe, Springer-Verlag LNCS 365, pp. 193-206, 1989.
No context found.
Peyton Jones, S. L., Clack, C. and Salkild, J., "High-Performance Parallel Graph-Reduction", Proceedings of PARLE, pages 193-206, 1989.