18 citations found.
A. G. Ranade. "Fluent Parallel Computation". PhD thesis, Yale University, 1989.


This paper is cited in the following contexts:
Program Development and Performance Prediction on BSP Machines.. - Knee (1994)   (9 citations)

....much work has been done in this area [JáJá, 1992]. The major disadvantage of the PRAM model is that these machines are very difficult to construct. Designing a machine that has a global shared memory with access to any element within one instruction cycle is an extremely hard problem. [Ranade, 1989] proposes a possible implementation of the PRAM model with the Fluent Abstract Machine. This uses combining networks on a butterfly topology with a hashed address space to try to hide the network latency. [Abolhassan et al. 1991] analyses Ranade's approach in a quantitative way by giving cost ....

....be true so therefore the model limits the number of packets that can be injected into the network to one every L g steps. 2.4 Comparison The PRAM model, although excellent for analyzing algorithms, is unrealistic for real parallel machines. Using the techniques of [ Abolhassan et al. 1991, Ranade, 1989 ] could lead to an implementation of a PRAM machine but this is still to be proved. The logP model seems excellent for predicting the performance of algorithms and even captures the concept of prefetching. This cost of prefetching cannot be estimated so directly in the BSP model, although in ....

A G Ranade. Fluent Parallel Computation. PhD thesis, Yale University, May 1989.


Models and Languages for Parallel Computation - Skillicorn, Talia (1996)   (51 citations)

....by the architecture of the Connection Machine 2. These often included a map operation, some form of reduction, perhaps using only a fixed set of operators, and later scans (parallel prefixes) and permutation operations. In approximately chronological order, these models are: scan [32], multiprefix [170], paralations [100, 171], the C* data parallel language [111, 165], the scan vector model and NESL [33-38], and CamlFlight [109]. As for other data parallel languages, these models are simple and fairly abstract. For instance, C* is an extension of the C language that incorporates features of the ....

A.G. Ranade. Fluent Parallel Computation. PhD thesis, Yale University, 1989.


Asynchronous Shared Memory Search Structures - Adler (1996)   (7 citations)

....to the search structure. Processors access a butterfly graph stored in the shared memory and are routed by following pointers between memory locations to a 2-3 tree that contains all the keys from a small region of the set U. Butterfly networks have been shown to efficiently route packets by [Ran88], [Lei92], [ST91] and many others. The idea of using butterfly graphs in shared memory has been used previously in counting networks [HLS92]. Following the lead of the packet routing literature, we analyze the behavior of search structure algorithms in two cases: the fixed case, where p ....

....for any strongly independent search structure. Proof: The largest number of processors that access any given output of the butterfly is Y . Since every processor accesses a randomly chosen input node of the butterfly, w.h.p. at most Y log n processors access any input node of the butterfly. [Ran88] gives a proof that realizing a p packet per input routing problem to random destinations in a butterfly with any non predictive contention resolution scheme requires only time O(log n Y ) Note that FIFO contention resolution with up to present arbitrary resolution of ties is non predictive. This ....

A. Ranade. Fluent Parallel Computation. Ph.D. Thesis, Yale University, 1988.


A General Purpose Shared-Memory Model For Parallel Computation - Ramachandran   (3 citations)

....not truly general purpose. Dept. of Computer Sciences, University of Texas at Austin, Austin, TX 78712. email: vlr@cs.utexas.edu. This work was supported in part by NSF grant CCR GER 90 23059. Thus it is not surprising that a variety of other models have been proposed in the literature (e.g. [2, 5, 6, 7, 9, 13, 15, 18, 23, 29, 32, 34, 35, 38, 40, 45, 46]) to address specific drawbacks of the pram, although none of these are general purpose models. In recent years, distributed memory models that characterize the interconnection network abstractly by parameters that capture its performance have gained much attention. An early work along these lines ....

A. G. Ranade. Fluent parallel computation. PhD thesis, Department of Computer Science, Yale University, New Haven, CT, May 1989.


Performance Modeling of Distributed Memory Architectures - Johnsson (1991)   (9 citations)

....randomized routing [43, 42] or randomized address maps with deterministic routing [35, 37] can be used. Using either of these strategies guarantees an asymptotically optimal worst case behavior. The routing time is a small multiple of the network diameter, with high probability. Simulations [43, 36, 37] have shown that the proven bounds are pessimistic for many types of permutations, and that the routing time is very insensitive to the type of randomization used. As an example we consider sparse matrix vector multiplication. The vector elements required for the multiplication are collected with ....

Abhiram G. Ranade. Fluent Parallel Computation. PhD thesis, Yale University, 1988.


General Purpose Parallel Computing - McColl (1993)   (64 citations)

....more processors can write so long as they write the same value, one of the processors attempting to write will succeed but the choice of which one will succeed will be made nondeterministically, the lowest numbered processor will succeed (assuming some appropriate numbering). In other CRCW models [221] one might have the possibility of concurrent writing in which the memory location is updated to the sum of the written values, or to the minimum of the written values. As a simple example of a CREW PRAM computation, consider the problem of computing ab ac bd cd from inputs a, b, c, d. Let p i t ....

....of using a hashed address space is that we do not need then to resort to randomising to avoid bottlenecks in packet routing; simple deterministic methods will suffice. Detailed technical accounts of the role of hashing in achieving efficient general purpose parallel computing can be found in [145, 147, 220, 221, 242, 258, 259]. We will mention only the following two results which demonstrate that distributed memory architectures can efficiently simulate PRAMs. Let EPRAM(p, t), CPRAM(p, t), HYPERCUBE(p, t), COMPLETE(p, t) denote the class of problems which can be solved on a p processor EREW PRAM [ CRCW PRAM, ....

[Article contains additional citation context not shown here]

A G Ranade. Fluent parallel computation. Ph.D. Thesis, Department of Computer Science, Yale University, May 1989.
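
The concurrent-write rules listed in the first McColl excerpt above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function name `resolve_writes` and the rule labels are hypothetical and do not come from the cited paper, and a single PRAM step is simulated sequentially.

```python
import random

def resolve_writes(writes, rule):
    """Return the value stored in one memory cell after one CRCW step.

    writes: non-empty list of (processor_id, value) pairs aimed at the cell.
    rule:   hypothetical label for the conflict rule being simulated.
    """
    if not writes:
        raise ValueError("no writes to resolve")
    values = [value for _, value in writes]
    if rule == "common":
        # All writers must agree on the value.
        if len(set(values)) != 1:
            raise ValueError("common rule requires identical values")
        return values[0]
    if rule == "arbitrary":
        # Some writer succeeds; the choice is nondeterministic.
        return random.choice(values)
    if rule == "priority":
        # The lowest-numbered processor succeeds.
        return min(writes, key=lambda w: w[0])[1]
    if rule == "sum":
        # The cell is updated to the sum of the written values.
        return sum(values)
    if rule == "min":
        # The cell is updated to the minimum of the written values.
        return min(values)
    raise ValueError(f"unknown rule: {rule}")

writes = [(3, 10), (1, 7), (2, 10)]
print(resolve_writes(writes, "priority"))  # 7, written by processor 1
print(resolve_writes(writes, "sum"))       # 27
```

The "sum" and "min" rules are the ones a combining network can support by merging requests on their way to memory.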


The Queue-Read Queue-Write PRAM Model: Accounting for.. - Gibbons, Matias (1994)   (6 citations)

....at a cost, but enforces the exclusive rule on reads and writes occurring between synchronization points. indicated by the crcw model. Hardware approaches for executing high contention crcw steps without hot spots incorporate combining logic into the interconnection network. Ranade's work [Ran89] shows that any crcw step can be simulated on certain hypercube-based networks in the same asymptotic time as an erew step, and development of machines based on his technique has been reported (e.g. [AKP91, DS92]). It is an open question whether the system cost of supporting crcw efficiently in ....

.... Intel Paragon [Bel92]: A, qrqw; Kendall Square KSR1 [FBR93]: A, crqw; MasPar MP-1 [Mas91], MP-2: global router S, qrqw; xnet S, limited crew; nCUBE 2S [SV94]: A, qrqw; Thinking Machines CM-5 [Lei92b]: data network A, qrqw; control network S, fast scan ops; Bus-based machines: A, limited crqw; Fluent [Ran89, AKP91]: P, S, crcw; MIT J-Machine [DKN93]: P, A, qrqw; Stanford DASH [LLG+92]: P, A, qrqw; Tera Computer [ACC+90]: P, A, qrqw. Table 3: Contention rules of some existing multiprocessors. We have included message passing machines, as well as shared memory ones, since they are often used to run ....

A. G. Ranade. Fluent parallel computation. PhD thesis, Department of Computer Science, Yale University, New Haven, CT, May 1989.


The Bird-Meertens Formalism as a Parallel Model - Skillicorn (1993)   (35 citations)

....connected with some bounded transit time. Another kind of solution involves restricting the full generality of programs that can be written to certain primitives with known computation or communication patterns. Experiments with this idea include adding operations such as scan [8], multiprefix [23], skeletons [14], P3L [15, 16], paralations [24], and the scan vector model [9]. In these approaches, the complexity of the mapping problem is avoided by reducing the topological structure of the program. 2 Some Proposed Models. Let us consider some proposed models to see how well they ....

A.G. Ranade. Fluent Parallel Computation. PhD thesis, Yale University, 1989.


Scans as Primitive Parallel Operations - Blelloch (1987)   (97 citations)

....implementation and they can be used to implement many other useful scan operations. On a P RAM, each element a i is placed in a separate processor, and the scan executes over a fixed order of the processors the prefix operation on a linked list [48, 27] and the fetch and op type instructions [21, 20, 37] are not considered. 1 The AKS sorting network [1] takes O(lg n) time deterministically, but is not practical. 2 The appendix gives a short history of the scan operations. Model Algorithm EREW CRCW Scan Graph Algorithms (n vertices, m edges, m processors) Minimum Spanning Tree O(lg 2 n) ....

....are (11, 2)-(23, 14), (2, 13)-(13, 8), and (16, 4)-(31, 4). The algorithm allocates 12, 11 and 16 pixels respectively for the three lines. [Figure 10 data: processor 0 holds [4 7 1], processor 1 holds [0 5 2], processor 2 holds [6 4 8], processor 3 holds [1 9 5]; Sum = [12 7 18 15]; scan(Sum) = [0 12 19 37]; result: processor 0 holds [0 4 11], processor 1 holds [12 12 17], processor 2 holds [19 25 29], processor 3 holds [37 38 47].] Figure 10: When operating on vectors with more elements than processors (long vectors) each processor is assigned to a contiguous block of elements. To execute an ....

Abhiram G. Ranade. Fluent Parallel Computation. PhD thesis, Yale University, Department of Computer Science, New Haven, CT, 1989.
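
The long-vector scan in the Figure 10 excerpt above can be sketched in a few lines of Python. This is a sequential simulation under stated assumptions, not the paper's implementation: the function names `exclusive_scan` and `blocked_scan` are hypothetical, the operator is fixed to integer addition, and each inner list stands for the contiguous block held by one processor.

```python
def exclusive_scan(xs):
    """Exclusive +-scan: element i receives the sum of xs[0..i-1]."""
    out, running = [], 0
    for x in xs:
        out.append(running)
        running += x
    return out

def blocked_scan(blocks):
    """Scan a long vector that is split into one contiguous block per processor."""
    # Step 1: each processor sums its own block.
    sums = [sum(block) for block in blocks]
    # Step 2: a scan across processors yields each block's starting offset.
    offsets = exclusive_scan(sums)
    # Step 3: each processor scans its block locally, starting from its offset.
    return [[offset + partial for partial in exclusive_scan(block)]
            for block, offset in zip(blocks, offsets)]

# The values from the Figure 10 excerpt.
blocks = [[4, 7, 1], [0, 5, 2], [6, 4, 8], [1, 9, 5]]
print(blocked_scan(blocks))
# [[0, 4, 11], [12, 12, 17], [19, 25, 29], [37, 38, 47]]
```

Only step 2 needs communication between processors; steps 1 and 3 are local to each block.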


Scheduling Parallel Communication: The h-relation Problem - Adler, Byers, Karp (1995)   (12 citations)

....in order of the chosen priorities, from highest to lowest. Theorem 5 An h relation can be routed under the priority queue discipline in time (2e Gamma 1)h o(h) log n w.h.p. Proof: We begin by providing a simple proof for a bound of (2e ffl)h which uses the delay sequence argument of [Ran 88] We focus attention on the last message to arrive at its destination, and retrace the sequence of delays which resulted in this message s delayed arrival. In the context of this problem, a delay sequence for a message p 0 with priority r 0 consists of a sequence of messages with increasing ....

A. Ranade. Fluent Parallel Computation. PhD thesis, Yale University, New Haven, CT, 1988.


A Provably Time-Efficient Parallel Implementation of Full.. - Greiner, Blelloch (1996)   (1 citation)

....basic idea is to start with an array of fixed size. When the array overflows, we move the elements to a new array of twice the number of elements in the queue. Adding to the array, growing of the array, and dequeuing from the array can all be implemented in parallel using a fetch-and-add operation [11, 31]. To account for the cost of growing the array, we amortized it against the cost of originally inserting into the queue. As well as handling the queues, the implementation also has to handle the scheduling of the threads. This is somewhat more complicated than in the call-by-value ....

....underlying tools. Since speculative evaluation centers on the use of queues, the operations on queues are of particular importance. Our simulation bounds are parameterized by the asymptotic time T_fetchadd(p) required to implement a fetch-and-add operation ([11], also called a multiprefix [31]) on p processors. In a fetch-and-add operation, each processor has an address and an integer value i. In parallel all processors can atomically fetch the value from the address while incrementing the value by i. This can be implemented in a butterfly or hypercube network by combining requests as ....

[Article contains additional citation context not shown here]

Abhiram G. Ranade. Fluent Parallel Computation. PhD thesis, Yale University, New Haven, CT, 1989.
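
The array-doubling queue described in the first Greiner and Blelloch excerpt above might look roughly like the following Python sketch. It is a sequential stand-in under stated assumptions: the class name `GrowableQueue` and its methods are hypothetical, the single update of `tail` stands in for the fetch-and-add that would hand out slots to parallel inserters, and growing to twice the combined count of live and incoming elements is an assumption made so that a whole batch always fits.

```python
class GrowableQueue:
    """Array-backed queue that doubles its storage on overflow."""

    def __init__(self, capacity=4):
        self.slots = [None] * capacity
        self.head = 0  # index of the next element to dequeue
        self.tail = 0  # index of the next free slot

    def __len__(self):
        return self.tail - self.head

    def _grow(self, incoming):
        # Move the live elements into a new array twice as large as needed.
        live = self.slots[self.head:self.tail]
        new_capacity = 2 * (len(live) + incoming)
        self.slots = live + [None] * (new_capacity - len(live))
        self.head, self.tail = 0, len(live)

    def enqueue_batch(self, items):
        if self.tail + len(items) > len(self.slots):
            self._grow(len(items))
        # Reserve a contiguous range of slots; in a parallel setting each
        # inserter would obtain its slot index from a fetch-and-add on tail.
        base = self.tail
        self.tail += len(items)
        for offset, item in enumerate(items):
            self.slots[base + offset] = item

    def dequeue_batch(self, count):
        count = min(count, len(self))
        out = self.slots[self.head:self.head + count]
        self.head += count
        return out

q = GrowableQueue(capacity=2)
q.enqueue_batch([1, 2, 3])   # overflows the initial array, so it grows
print(q.dequeue_batch(2))    # [1, 2]
```

Because each growth copies at most the elements already inserted and at least doubles the space, the copying cost can be charged to the earlier insertions, which is the amortization argument the excerpt mentions.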


Parallel Programming Models Based on the.. - Danelutto.. (1994)   (1 citation)

....structures in the model and can be used (almost) everywhere in the code to express the corresponding paradigm of parallel computation. This is the case of data parallel extensions to conventional sequential languages and of the models derived by the extension of PRAM with collective operations [41]. • by adopting a structured parallel language: in this case, the language provides a fixed set of forms of parallelism as primitives in the language and the structure of a parallel computation can only be expressed by means of the hierarchical nesting of the basic forms in the language. In all ....

.... operator ⊕ and a list of elements [a_1, ..., a_n] computes a_1 ⊕ ... ⊕ a_n, • scan or parallel prefix [9], which given an associative operator ⊕ and a list of elements [a_1, ..., a_n] computes a list [a_1, a_1 ⊕ a_2, ..., a_1 ⊕ ... ⊕ a_n], • multiprefix [41], which is a generalisation of the scan primitive. Suppose that some set of k processors references a variable x. The processors are ordered and ⊕ is a binary associative operator. If the initial value of x is a then the execution of multiprefix mp(x, v_i, ⊕) by processor i results in it ....

A.G. Ranade. Fluent Parallel Computation. PhD thesis, Yale University, 1989.
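
A sequential Python sketch may help make the multiprefix description in the excerpt above concrete. Several points are assumptions rather than facts from the source: the excerpt is truncated before it states what processor i receives, so the sketch follows the fetch-and-op convention in which each processor receives the accumulated value before its own contribution is applied; the function name `multiprefix` is hypothetical; and the processors are taken to be ordered by their numeric ID.

```python
from operator import add

def multiprefix(initial, requests, op):
    """Simulate one multiprefix step on a single shared variable.

    initial:  value a of the variable before the step.
    requests: list of (processor_id, value) pairs, one per processor.
    op:       binary associative operator.
    Returns (received, final): the value handed to each processor, and the
    value left in the variable after all contributions are combined.
    """
    accumulated = initial
    received = {}
    # Serve the processors in their fixed order (assumed here: by ID).
    for pid, value in sorted(requests, key=lambda r: r[0]):
        received[pid] = accumulated          # assumed: prefix before own value
        accumulated = op(accumulated, value)
    return received, accumulated

received, final = multiprefix(10, [(2, 3), (0, 1), (1, 2)], add)
print(received)  # {0: 10, 1: 11, 2: 13}
print(final)     # 16
```

With `op` set to addition this behaves like the fetch-and-add operation mentioned in other excerpts on this page; other associative operators such as `max` work the same way.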


A Parallel Complexity Model for Functional Languages - Guy Blelloch, John Greiner (1994)   (3 citations)

....log p) 2k 0 v e (w=p s log p) kv e (w=p s log p) where we have set k = 2k 0 . 2 5 Simulating a PRAM on an A PAL In this section we consider simulating a PRAM on an A PAL. The simulation we use gives the same results for the EREW, CREW, and CRCW PRAM as well as for the multiprefix [29] and scan models [4]. The simulation is optimal in terms of work for all the PRAM variants since there is a lower bound of O(log M) work required for each random access (this is the same as for pointer machines [2]). Since we don't know how to do better for the weaker models, we will base our ....

Abhiram G. Ranade. Fluent Parallel Computation. PhD thesis, Yale University, Department of Computer Science, New Haven, CT, 1989.


A Provable Time and Space Efficient Implementation of NESL - Blelloch, Greiner (1996)   (12 citations)

.... for memory latency in the butterfly and hypercube, and for the latency in the fetch-and-add operation for all three machines, we process p log p states on each step instead of just p (i.e. we use a P CEK(p log p) machine). Our simulation uses the fetch-and-add operation [16] (or multiprefix [26]). In this operation, each processor has an address and an integer value i. In parallel all processors can atomically fetch the value from the address while incrementing the value by i. In our case it is important that the fetch-and-add is stable: if two processors make a request simultaneously, ....

....that the fetch-and-add is stable: if two processors make a request simultaneously, the processor with the smaller ID will access the counter first. The stable fetch-and-add operation can be implemented in a butterfly or hypercube network by combining requests as they go through the network [26], and on a PRAM by various other techniques [24, 15]. For all machines, if each processor makes up to m fetch-and-add requests, all requests can be processed in O(m log p) time with high probability (the bounds can be slightly improved on the CRCW PRAM [15]). These bounds assume the butterfly ....

Abhiram G. Ranade. Fluent Parallel Computation. PhD thesis, Yale University, New Haven, CT, 1989.


Parallelism in Sequential Functional Languages - Blelloch, Greiner (1995)   (8 citations)

....d(1 log p) 2k 0 ve(w=p d log p) kve(w=p d log p) where we have set k = 2k 0 . 2 4 Simulating a PRAM on an A PAL In this section we consider simulating a PRAM on an APAL. The simulation we use gives the same results for the EREW, CREW, and CRCW PRAM as well as for the multiprefix [32] and scan models [4]. The simulation is optimal in terms of work for all the PRAM variants. This is because it takes logarithmic work to simulate each random access into memory (this is the same as for pointer machines [2]). Since we don't know how to do better for the weaker models, we will base ....

Abhiram G. Ranade. Fluent Parallel Computation. PhD thesis, Yale University, Department of Computer Science, New Haven, CT, 1989.


Parallelism and the Bird-Meertens Formalism - Skillicorn (1992)   (6 citations)

.... completely abstract graph reduction [20, 30, 47, 40] OBJ and its derivatives [24, 23] UNITY [13] action systems [4] partly abstract dataflow [22, 25] Linda [2] actors [1] low level PMI [15] Pi [48] the PRAM model models restricting the form of computations PRAM extensions scan [10] multiprefix [41] XPRAM, bulk synchronous parallelism [46, 45] YPRAM [21] hierarchical PRAM [29] VLIW in the large [16] control structured skeletons [14] P 3 L [17, 18, 5] data parallel vectors, architecture independent scan vector model, Vcode [11, 12] arrays, SIMD Fortran90 [32, 3, 39] paralations [42] ....

.... how this power can be used to compute the parallel prefix [33] or scan of a list of values, that is, for a list of elements [a_0, a_1, ..., a_n] the list [a_0, a_0 ⊕ a_1, ..., a_0 ⊕ a_1 ⊕ ... ⊕ a_n]. Ranade suggested a further operation that relies on the computing power of the switch [41]: the multiprefix. Suppose that some set of k processors references a variable A, the processors are ordered, and ⊕ is a binary operation. If the initial value of A is a then the execution of the multiprefix MP(A, v_i, ⊕) by processor i results in it acquiring the value a ⊕ v_1 ⊕ v_2 ⊕ ... ⊕ v_i ....

A.G. Ranade. Fluent Parallel Computation. PhD thesis, Yale University, 1989.


Randomized Routing and Sorting on Fixed-Connection Networks - Leighton, Maggs, Ranade, Rao (1994)   (62 citations)  Self-citation (Ranade)

....of the rank must be transmitted before the less significant ones. With the message format as above, it is possible for each node to send outgoing message packets as soon as the corresponding packets arrive on all incoming edges. In fact message combining can also be made to work with pipelining [30, 31]. It is possible to show that Theorem 2.12 still applies. The analysis involves constructing a delay sequence and a counting argument similar to that for Theorem 2.9. [Figure 2: A 5 × 5 mesh, with rows and columns numbered 0 to 4.] 3 Routing on meshes. In this section we apply the O(c L log N) ....

A. G. Ranade. Fluent Parallel Computation. PhD thesis, Yale University, New Haven, CT, 1988.


Practical Structured Parallelism Using BMF - Crooke (1998)

No context found.

A. G. Ranade. "Fluent Parallel Computation". PhD thesis, Yale University, 1989.


CiteSeer.IST - Copyright Penn State and NEC