ABOLHASSAN, F., KELLER, J., AND PAUL, W. 1999. On the cost-effectiveness of PRAMs. Acta Informatica 36, 6, 463--487.


This paper is cited in the following contexts:
Descriptive Simplicity in Parallel Computing - Marr (1997)   (Correct)

....does not have to be SIMD based, as demonstrated by the Fork95 shared memory programming language. The FORK language was developed by Hagerup et al. [HSS92, HSS90, Sch91, RS92, Sei93] for expressing PRAM algorithms for a physical realisation of a PRAM machine called the Saarbrücken PRAM (SB PRAM) [AKP91a, AKP91b, DS92]. The language has been developed into the more realistic and usable Fork95 language by Kessler and Seidl [KS95a, KS95b, KS97]. Fork95 is based on the C language, with additional constructs and type specifiers. The most important new constructs are the start {}, farm {} and fork ( ....

F Abolhassan, J Keller, and W J Paul. On the cost-effectiveness of PRAMs. In Proceedings of the Third IEEE Symposium on Parallel and Distributed Processing, pages 2--9. IEEE; ACM, IEEE Comput. Soc., December 1991.


Program Development and Performance Prediction on BSP Machines.. - Knee (1994)   (9 citations)  (Correct)

....to any element within one instruction cycle is an extremely hard problem. [Ranade, 1989] proposes a possible implementation of the PRAM model with the Fluent Abstract Machine. This uses combining networks on a butterfly topology with a hashed address space to try and hide the network latency. [Abolhassan et al., 1991] analyses Ranade's approach in a quantitative way by giving cost models for implementing various parts of the PRAM machine. This is then used to demonstrate an improvement on Ranade's Fluent machine using multiple butterflies and parallel slackness. It is then shown that the proposed improved ....

....the PRAM model involves its simulation on conventional distributed memory architectures. This method usually involves hashing the address space of the PRAM across the distributed memory of the machine and replication of variables [Mehlhorn and Vishkin, 1984] or using multiple hash functions [Abolhassan et al., 1991]. 2.2 BSP. A Bulk Synchronous Parallel machine consists of a number of processor-memory pairs connected by a communications network [Valiant, 1990; Valiant, 1989]. This network is assumed to be able to deliver messages from point to point with a uniform cost; this means the cost of ....

[Article contains additional citation context not shown here]

F Abolhassan, J Keller, and W J Paul. On the cost-effectiveness of PRAMs. In Proc. 3rd IEEE Symposium on Parallel and Distributed Processing, pages 2--9, 1991.


A Shared-Memory Implementation of the Hierarchical.. - Podehl, Rauber, Rünger (1997)   (Correct)

....that the granularity of tasks enables a good load balance. Usually, the competition between load balance (or granularity) and data locality hinders a concise study of scalability. To overcome this difficulty, we use the SB PRAM which provides a large number of processors and a global shared memory [1]. Because the SB PRAM has uniform access time, the implementation can concentrate on the efficient exploitation of task granularity and can ignore issues of locality. The original implementation is optimized on the algorithmic level, on the design level for tasks (to achieve a finer granularity) ....

....support by the machine for a coordinated access to shared data. An example for such support is the multiprefix operation provided by the SB PRAM, as described in the following subsection. 3.2 Execution platform and software support. The SB PRAM is an implementation of a modified fluent machine [1]. A number p of physical processors have access to p memory modules which are connected to the processors via a butterfly interconnection network. The memory is accessed as a virtual linear shared memory distributed among the modules. To avoid congestion of a memory module, logical addresses are ....

F. Abolhassan, J. Keller, and W.J. Paul. On the Cost--Effectiveness of PRAMs. In Proceedings of the 3rd IEEE Symposium on Parallel and Distributed Processing, pages 2--9, 1991.


Scalability and Granularity Issues of the Hierarchical.. - Podehl, Rauber, Rünger (1996)   (4 citations)  (Correct)

....task queue. Usually, the competition between load balance (and granularity) and data locality hinders a concise study for scalability. This limitation vanishes when using an execution platform like the SB PRAM providing a large number of processors and a global shared memory with unit access time [1]. Thus, the implementation can concentrate on the efficient exploitation of the task granularity and can neglect effects of locality. The original implementation is optimized on the algorithmic level, on the design level for tasks (towards a finer granularity) and on the task administration ....

....Portions of the program that can be executed independently from each other in parallel are separated by a ∥ sign. 3 Parallel implementation. The hierarchical radiosity method has been implemented in a task oriented shared memory model on the SB PRAM, a realization of a modified fluent machine [1]. A number p of physical processors have access to p memory modules each consisting of m memory cells. The processors are connected to the memory modules via a butterfly interconnection network. Thus, the memory is accessed as a virtual linear shared memory distributed among the modules. Besides ....

F. Abolhassan, J. Keller, and W.J. Paul. On the Cost--Effectiveness of PRAMs. In Proceedings of the 3rd IEEE Symposium on Parallel and Distributed Processing, pages 2--9, 1991.


Experimental Results for Four Work-Optimal PRAM Simulation.. - Leppänen (1994)   (Correct)

....proposed in the literature. The letters E (Exclusive) and C (Concurrent) define the access modes of allowed read (R) and write (W) instructions per memory cell per step basis. EREW and CRCW are the most popular. ERCW has received only a little attention in the literature, although in [2] it is considered to be easier to implement than the more popular (and perhaps natural) CREW. The meaning of concurrent reading is immediately clear, whereas the meaning of concurrent writing is not. Several write conflict resolution rules have been proposed according to which the CRCW and ERCW ....

....this holds when the hardware requirements (implied by the algorithms) are taken into account. We also would like to know whether the combining queues method can be beaten. The original stimulation to study coated meshes was to find alternatives to butterfly based work optimal PRAM constructions [2, 1, 18, 20]. In [16] such a construction for coated meshes was presented, although the constant factors left a lot to hope for. The advantage of coated butterflies over coated meshes is the asymptotically small diameter O(log N) which allows the overloading factor also to be relatively small. However, the ....

[Article contains additional citation context not shown here]

F. Abolhassan, J. Keller, and W.J. Paul. On the Cost-Effectiveness of PRAMs. In Proceedings, 3rd IEEE Symposium on Parallel and Distributed Processing, ACM Special Interest Group on Computer Architecture, and IEEE Computer Society, pages 2--9, 1991.


Can Parallel Algorithms Enhance Serial Implementation? (Extended.. - Vishkin (1996)   (15 citations)  (Correct)

....if p_1 is fixed, having larger and larger p_2 (and therefore larger processor slackness) leads to improvements in efficiency. Several parallel machine designs take advantage of processor slackness; this includes the 1981 HEP design by B.J. Smith and the more recent design [ACCKPS] as well as [AKP]. In other words, the concept of processor slackness considers parallelism as a resource. Our thesis takes advantage of this resource for the degenerate case where the computer has a single processor. Informally, when replacing 1 by p_1 processors, requests to remote memories will be managed ....

F. Abolhassan, J. Keller and W.J. Paul. On the cost-effectiveness of PRAMs. In Proc. 3rd IEEE Symp. on Par. and Dist. Proc., 1991, Dallas, TX.


Shared-Memory Implementation of an Irregular Particle.. - Rauber, Rünger, Scholtes (1996)   (3 citations)  (Correct)

....balancing issues have a much larger influence on the efficiency than for a smaller number of processors. The investigations are executed on a simulator of the SB PRAM. The SB PRAM is designed for 128 physical processors which provide a total number of 4096 virtual processors seen by the programmer [1]. The machine provides a global shared memory with uniform access time, i.e. from a virtual processor's point of view, an access to the global memory takes the same time as two arithmetic operations, independently from the memory location that is addressed. Because of this memory organization, ....

F. Abolhassan, J. Keller, and W.J. Paul. On the Cost--Effectiveness of PRAMs. In Proceedings of the 3rd IEEE Symposium on Parallel and Distributed Processing, pages 2--9, 1991.


On Implementing EREW Work-Optimally on Mesh of Trees - Ville Leppänen (1995)   (Correct)

.... memory technique to the p-port memory technique (this does not seem to hold for relatively small p [Forsell 93]). Therefore, the implementation of PRAM is usually considered on distributed memory machines (DMMs) where processor-memory pairs are connected by some interconnection network [Abolhassan et al. 91, Karp et al. 92, Leppanen and Penttonen 94a, Ranade 91, Valiant 90]. Simulation of PRAM on a 2-dimensional Mesh of Trees (MT) based DMM has been considered previously in [Luccio et al. 88, Pucci 93] probabilistically and in [Luccio et al. 90, Pucci 93] deterministically. The probabilistic ....

Abolhassan, F., Keller, J., Paul, W.J.: "On the Cost-Effectiveness of PRAMs"; Proc. 3rd IEEE Symposium on Parallel and Distributed Processing, ACM Special Interest Group on Computer Architecture, and IEEE Computer Society (1991), 2--9.


The Programming Environment of the SB-PRAM - Grün, Rauber, Röhrig   (Correct)

....i.e. read or write accesses of different processors to the same memory cell at the same time. The CREW (concurrent read, exclusive write) model allows concurrent read accesses, but forbids concurrent write accesses. The CRCW model allows both concurrent read and write accesses. The SB PRAM [1] realizes a priority CRCW PRAM, i.e. for concurrent write operations to the same memory cell the processor with the highest processor number is allowed to write into the memory cell; the values of the other processors are thrown away. The global shared memory can be accessed in unit time, i.e. in ....

F. Abolhassan, J. Keller, and W.J. Paul. On the Cost--Effectiveness of PRAMs. In Proceedings of the 3rd IEEE Symposium on Parallel and Distributed Processing, pages 2--9, 1991.
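
The priority rule quoted in the context above (for concurrent writes to one cell, the highest processor number wins and the other values are discarded) can be illustrated with a small sketch. This is only an illustration under stated assumptions: the function name `resolve_priority_writes`, the dict-based memory, and the request format `(processor_number, address, value)` are hypothetical and are not taken from the SB PRAM or its software.

```python
def resolve_priority_writes(memory, requests):
    """Apply one step of concurrent writes under a priority CRCW rule.

    memory   -- dict mapping address -> value (hypothetical model of shared memory)
    requests -- iterable of (processor_number, address, value) triples
                issued in the same step (hypothetical request format)

    For every address, only the request from the highest processor
    number takes effect; all other values are discarded.
    """
    winners = {}  # address -> (processor_number, value) of the current winner
    for proc, addr, value in requests:
        if addr not in winners or proc > winners[addr][0]:
            winners[addr] = (proc, value)
    for addr, (_, value) in winners.items():
        memory[addr] = value
    return memory


if __name__ == "__main__":
    mem = {}
    step = [(0, 5, "a"), (3, 5, "b"), (1, 5, "c"), (2, 7, "d")]
    print(resolve_priority_writes(mem, step))
```

In the usage example, processors 0, 1 and 3 all write to address 5 in the same step, and only the value from processor 3 survives.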


Performance of Work-Optimal PRAM Simulation Algorithms on Coated .. - Leppänen (1996)   (Correct)

....relative cost of routing phase O( ffi load load ) thus decreases as the load increases. When load exceeds the routing machinery capacity (per processor) the relative cost can be expected to show asymptotic behavior. Is high multithreading level an unrealistic assumption? Tera [3] and SB PRAM [1, 2] support 128 and 1024 threads per each physical processor, respectively. Supporting even more is not unrealistic from hardware point of view, since a thread basically requires only few tens of words of memory. From algorithm design point of view, the answer is not as clear. However, the facts that ....

....each axis. The purpose of having 4 bidirectional connections between neighboring nodes is to build a virtual acyclic directed graph (VDAG) from processors to memory modules and vice versa. Such a VDAG can be used to implement more sophisticated synchronization mechanism: synchronization wave [2, 11]. The idea is that when a source has sent all its packets on their way, it sends a synchronization packet. Synchronization packets from various sources push on the actual packets, and spread to all possible paths that the actual packets could go. When a (logical) node receives a synchronization ....

[Article contains additional citation context not shown here]

F. Abolhassan, J. Keller, and W.J. Paul. On the Cost-Effectiveness of PRAMs. In Proceedings, 3rd IEEE Symposium on Parallel and Distributed Processing, ACM Special Interest Group on Computer Architecture, and IEEE Computer Society, pages 2--9, 1991.


Parallel Implementation of Functional Languages - Wilhelm, Alt, Martin, Raber (1997)   (2 citations)  (Correct)

....during our discussion open questions for which no efficient solution is known so far by: This is an open problem. 2 Target architectures and fair measurements. As target architectures we consider distributed memory machines. Until PRAM architectures become commercially available [ACC 90, AKP91] these are the favourite target architectures for reasons of scalability and availability in a rather large variety (at least until most makers went bankrupt). Virtual shared memory machines often are not scalable and expose a nonuniform memory access behaviour, which makes it hard to predict ....

F. Abolhassan, J. Keller, and W.J. Paul. On the Cost-Effectiveness of PRAMs. In SPDP, pages 2--9. IEEE Computer Society, 1991.


General Purpose Parallel Computing - McColl (1993)   (64 citations)  (Correct)

....result is the same as if a single prefix operation were performed with the processors ordered by their index. A parallel architecture which is very close in design to the Fluent machine is currently under construction at the University of Saarbrücken [1, 2]. Valiant [258] has investigated the extent to which concurrent access to shared variables can be provided without the use of combining networks. Working with the BSP model, he has shown that if one has enough parallel slackness, then one can support concurrent accesses in software on networks ....

....to try to obtain the extreme case of the PRAM, where l and g are both 1. At any given point in time, the capabilities and economics of the technologies available will determine the most cost effective values of such parameters. An important advantage of the BSP model [258] over the PRAM [1, 2, 221] is that it provides an architecture independent framework which allows us to take full advantage of whichever values of l and g are the most cost effective at a given point in time. Large general purpose parallel computer systems will inevitably suffer hardware faults of various kinds during ....

F Abolhassan, J Keller, and W J Paul. On the cost-effectiveness of PRAMs. In Proc. 3rd IEEE Symposium on Parallel and Distributed Processing, pages 2--9, 1991.


The Queue-Read Queue-Write PRAM Model: Accounting for.. - Gibbons, Matias (1994)   (6 citations)  (Correct)

....hot spots incorporate combining logic into the interconnection network. Ranade's work [Ran89] shows that any crcw step can be simulated on certain hypercube based networks in the same asymptotic time as an erew step, and development of machines based on his technique have been reported (e.g. [AKP91, DS92]). It is an open question whether the system cost of supporting crcw efficiently in hardware is justified, particularly on mimd machines, and work continues in this area (e.g. [DK92]). Existing commercial machines are primarily designed to process low contention steps efficiently; high ....

.... [Bel92] A qrqw
Kendall Square KSR1 [FBR93] A crqw
MasPar MP 1 [Mas91] MP 2
    global router S qrqw
    xnet S limited crew
nCUBE 2S [SV94] A qrqw
Thinking Machines CM 5 [Lei92b]
    data network A qrqw
    control network S fast scan ops
Bus based machines A limited crqw
Fluent [Ran89, AKP91] P S crcw
MIT J Machine [DKN93] P A qrqw
Stanford DASH [LLG 92] P A qrqw
Tera Computer [ACC 90] P A qrqw
Table 3: Contention rules of some existing multiprocessors. We have included message passing machines, as well as shared memory ones, since they are often used to run (slightly ....

F. Abolhassan, J. Keller, and W. J. Paul. On the cost-effectiveness of PRAMs. In Proc. 3rd IEEE Symp. on Parallel and Distributed Processing, pages 2--9, December 1991.


Language Support for Synchronous Parallel Critical Sections - Keßler, Seidl (1995)   (Correct)

....shared memory MIMD machines (also known as PRAMs). PRAMs are particularly well suited for the implementation of irregular numerical computations, non-numerical algorithms, and database applications. One such machine currently under construction at Saarbrücken University is the SB PRAM [1, 2]. The SB PRAM is a lock-step synchronous, massively parallel multiprocessor with up to 4096 RISC style processing elements and with a (from the programmer's view) physically shared memory of up to 2GByte with uniform memory access time. In Fork95, processors are organized in groups. Groups may be ....

F. Abolhassan, J. Keller, and W. Paul. On the cost--effectiveness of PRAMs. In Proc. 3rd IEEE Symp. on Parallel and Distr. Processing, pages 2--9. IEEE, Dec. 1991.


Work-Optimal Simulation of PRAM Models on Meshes - Ville Leppänen, Martti.. (1994)   (Correct)

....there has been only modest interest in actually implementing PRAM. The approach in the long running Ultracomputer project [23] is PRAM style, but the machine itself must still be regarded experimental. A remarkable effort of implementation is being taken at the University of Saarbrücken [2, 1, 16]. The underlying interconnection network in that 128 processor machine is the butterfly. Other efforts to implement PRAMs also exist [38]. This work was financially supported by the Academy of Finland. A preliminary version of this report was published in Proceedings of the Seventh Finnish ....

....for most of the CRCW PRAMs in two ways: by extending the routing machinery with on route combining mechanism [44] or by requiring that N = P^(1+ε) for some ε > 0 [51]. In [30] some of the above results were also given. An implementation of the ideas of [44, 45, 51] is described in [2, 1, 16]. In addition to the above work optimal simulations, it is easy to construct others simply by taking some routing mechanism R_{N,P}, which has P inputs and outputs, and can deliver N messages in time O(N/P) from input to outputs (with high probability). 1.1.4 Some remarks. What are we allowed to ....

[Article contains additional citation context not shown here]

F. Abolhassan, J. Keller, and W.J. Paul. On the Cost-Effectiveness of PRAMs. In Proceedings, 3rd IEEE Symposium on Parallel and Distributed Processing, ACM Special Interest Group on Computer Architecture, and IEEE Computer Society, pages 2--9, 1991.


Scalability and Granularity Issues of the.. - Axel Podehl.. (1996)   (4 citations)  (Correct)

....task queue. Usually, the competition between load balance (and granularity) and data locality hinders a concise study for scalability. This limitation vanishes when using an execution platform like the SB PRAM providing a large number of processors and a global shared memory with unit access time [1]. Thus, the implementation can concentrate on the efficient exploitation of the task granularity and can neglect effects of locality. The original implementation is optimized on the algorithmic level, on the design level for tasks (towards a finer granularity) and on the task administration ....

....radiosity method with maximum degree of parallelism. 3 Parallel implementation For the parallel implementation of the hierarchical radiosity method we used a task oriented shared memory model and the SB PRAM as execution platform. The SB PRAM is a realization of a modified fluent machine [1]. A number p of physical processors has access to p memory modules each consisting of m memory cells. The processors are connected to the memory modules via a butterfly interconnection network. Thus, the memory is accessed as a virtual linear shared memory distributed among the modules. Besides ....

F. Abolhassan, J. Keller, and W.J. Paul. On the Cost--Effectiveness of PRAMs. In Proceedings of the 3rd IEEE Symposium on Parallel and Distributed Processing, pages 2--9, 1991.


Simulation of PRAM Models on Meshes - Leppänen, Penttonen (1994)   (1 citation)  (Correct)

....of the loose implementation (in the sense of Valiant's parallel slackness) of the PRAM models on meshes is the subject of [50]. 1.1 Previous work. The authors are aware of only one implementation project that aims at the hardware implementation of PRAM style parallel computation [2, 1, 18]. This is a little bit alarming, since in the literature the PRAM model has been a very popular platform for the design of parallel algorithms. The underlying physical structure of that implementation is the butterfly interconnection structure. The PRAM approach is not all that strange although ....

.... optimal, because time cannot be smaller than the diameter, which is Omega Gamma 33 N ). Work optimal randomized simulations of PRAM models on low degree networks (butterfly, cube connected cycles, hypercubes) have been shown in [68]. An implementation of the ideas of [56, 57, 68] is described in [2, 1, 18]. The deterministic simulation of PRAM models has also been investigated. Unfortunately, a fast deterministic simulation is not possible: If the shared memory is not small and several copies of each shared memory location are not maintained, then one cannot 3 Routing machinery and processors are ....

[Article contains additional citation context not shown here]

F. Abolhassan, J. Keller, and W.J. Paul. On the Cost-Effectiveness of PRAMs. In Proceedings, 3rd IEEE Symposium on Parallel and Distributed Processing, ACM Special Interest Group on Computer Architecture, and IEEE Computer Society, pages 2--9, 1991.


Ray Tracing Complex Scenes: Sequential or In Parallel? - Arno Formella   (Correct)

....The communication overhead is too large to be handled by the message passing systems. Our implementations on a KSR 1 [2] show that with the shared memory approach at least a cost effective relative speedup can be obtained on that machine. However, the SB Pram [3] provides a cost effective absolute speedup, which is higher than any published data of other machines. Cost effective speedup and efficiency are defined in Section 2. Section 3 introduces ray tracing, optimization methods and approaches how the problem can be implemented in parallel. Section 4 ....

....Hierarchical data structures which allow to trace large scenes on a single processor can only be handled by the shared memory approach efficiently. 4 Machines and Benchmarks. We have implemented ray trace programs to run on the architectures listed in Tab. 1. The programs for the SB Pram [3] were executed on the simulator (the machine is still under construction). We measured the run times as wall clock time while being single user on the machines. We use two benchmark suites to analyze the performance of the different algorithms and machines. The set of nine images of the ....

F. Abolhassan, J. Keller, and W. J. Paul. On the Cost--Effectiveness of PRAMs. IEEE Proc. of the 3rd Symp. on Para. and Dist. Proc., pp. 2--9, 1991.


Performance of MP3D on the SB-PRAM Prototype - Dementiev, Klein, Paul (2002)   Self-citation (Paul)   (Correct)

No context found.

F. Abolhassan, J. Keller, and W. J. Paul. On the cost-effectiveness of PRAMs. Acta Informatica, 36(6):463--487, 1999.


Reduction of network cost and wiring in Ranade's.. - Cross, Drefenstedt.. (1993)   (5 citations)  Self-citation (Keller)   (Correct)

....parallel architectures. Ranade used his algorithm for the design of a very elegant emulation of a shared memory parallel machine on a processor network [6]. A reengineered version of his emulation was shown to have an emulation overhead of O(log n) where the constant involved is very small [2]. This makes the emulation interesting for practical use. Before, shared memory emulations were thought to have very large constant factors involved. Therefore they only seemed to be of academical interest. Because shared memory parallel machines are easier to program than distributed memory ....

F. Abolhassan, J. Keller, W. J. Paul, On the cost--effectiveness of PRAMs, in: Proc. 3rd Symposium on Parallel and Distributed Processing (IEEE, 1991) 2--9.


Hashing and Rehashing in Emulated Shared Memory - Keller (1992)   (2 citations)  Self-citation (Keller)   (Correct)

....theoretical and practical computer scientists. Valiant gives a good overview of theoretical results [21, 20]. Ranade developed one of the first optimal routing algorithms for machines of that kind [16, 17]. Abolhassan, Keller and Paul showed the efficiency of these approaches in a formal model [2] (it was formerly believed that despite optimal asymptotic efficiency the constants in these constructions were too large for practical use) and presented the design of a prototype [1]. On the practical side some of these features were implemented already in the IBM RP3 [15]. The Tera Computer ....

....; 10 processors and 5 jobs per processor. This serves to hide the network latency from processes. More exactly, the number c of processes per processor is proportional to log n. We chose a fixed c to obtain comparable results. The value c = 5 was taken as an average from a machine size of n = 128 [2]. Therefore in each step 5n requests are made. Step in this context means synchronous execution of one instruction on each process. As polynomials we used functions of degree = 2; 10; 20. Thus we used 6 different classes of hash functions on 7 workloads and for 6 machine sizes. Each of the 6 ....

Ferri Abolhassan, Jorg Keller, and Wolfgang J. Paul. On the cost--effectiveness of PRAMs. In Proceedings of the 3rd IEEE Symposium on Parallel and Distributed Processing, pages 2--9. IEEE, December 1991.


Optimal Sorting in Linear Arrays With Minimum Global Control - Abolhassan, Keller.. (1992)   Self-citation (Abolhassan Keller)   (Correct)

....fast. The ability to eliminate duplicates can be used in combining networks. The ability to perform prefix computations allows for support of parallel prefix computations during routing [6]. In this surrounding the algorithm is used to implement a sorting chip for a parallel machine architecture [1]. Acknowledgements: The authors want to thank Werner Massonne for helpful discussions. ....

Ferri Abolhassan, Jorg Keller, and Wolfgang J. Paul. On the cost--effectiveness of PRAMs. In Proceedings of the 3rd IEEE Symposium on Parallel and Distributed Processing, pages 2--9. IEEE, December 1991.


Conservative Circuit Simulation on Shared-Memory.. - Keller, Rauber.. (1996)   (6 citations)  Self-citation (Keller)   (Correct)

....from Silicon Graphics. The disadvantage of bus-based systems is that they usually can only provide a small number of processors. The SB PRAM which is currently under construction at the University of Saarbrücken is a UMA machine that provides a shared address space with a fast memory access time [1]. The latency of the network between the processors and the memory modules is hidden by pipelining of processors, i.e. each physical processor simulates a number of virtual processors. Thus, a write operation to the global memory by a virtual processor takes the same time as an arithmetic ....

F. Abolhassan, J. Keller, and W.J. Paul. On the Cost--Effectiveness of PRAMs. In Proc. 3rd IEEE Symp. on Parallel and Distributed Processing, pages 2--9, 1991.


Fast Rehashing in PRAM Emulations - Keller (1993)   (2 citations)  Self-citation (Keller)   (Correct)

....A second approach for shared memory emulations uses caches to avoid using the network. An example is the DASH multiprocessor [12]. We do not consider that approach here. To obtain unit memory access time when emulating a PRAM, multiple threads are run per processor to mask the network latency L [2, 5]. Each thread has its own register set. The threads are executed in a round robin manner with one instruction per turn. The processors are pipelined with pipeline depth L. Hence every L steps of the machine, each thread has executed another instruction. We will call the N = Lp threads of the ....

....network latency L and a shared memory of size m = 2^u can be done in time O(m/p + log m + L). Each processor needs local storage of size O(log m + L). If we only consider polynomial time algorithms, we can assume that m is polynomial in p. Furthermore, there are PRAM emulations with L = O(log p) [2, 15]. With these assumptions the runtime is O(m/p + log p), the storage requirements are O(log p). Proof: We assume that multiplication, shifts of integers, ⌊log_2(x)⌋ for positive integers x and x mod 2^(u−j) can be computed in one instruction. All operations during the preprocessing phase ....

[Article contains additional citation context not shown here]

F. Abolhassan, J. Keller and W. J. Paul, On the cost--effectiveness of PRAMs, in: Proc. 3rd Symp. on Parallel and Distributed Processing (IEEE CS Press, Los Alamitos, CA, 1991) 2--9.
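
The round-robin scheme quoted in the first context above (one instruction per thread per turn, so that each thread issues again only after L machine steps) can be sketched as follows. This is a minimal illustration, assuming one physical processor whose number of threads equals the pipeline depth L; the function name `round_robin_schedule` and the representation of a thread as an instruction count are hypothetical and not taken from the cited paper.

```python
def round_robin_schedule(instruction_counts, steps):
    """Simulate round-robin issue on one physical processor.

    instruction_counts -- list with one entry per thread: how many
                          instructions that thread wants to execute
                          (hypothetical stand-in for a real program)
    steps              -- number of machine steps to simulate

    Returns a list of (step, thread_id, instruction_index) tuples.
    With L = len(instruction_counts) threads, thread t issues at steps
    t, t + L, t + 2L, ..., so a memory request issued by a thread has
    L steps to complete before that thread issues again.
    """
    L = len(instruction_counts)
    next_instruction = [0] * L  # per-thread program counter
    trace = []
    for step in range(steps):
        tid = step % L  # the thread whose turn it is
        if next_instruction[tid] < instruction_counts[tid]:
            trace.append((step, tid, next_instruction[tid]))
            next_instruction[tid] += 1
    return trace


if __name__ == "__main__":
    for entry in round_robin_schedule([2, 2, 2], 6):
        print(entry)
```

With three threads, consecutive instructions of the same thread are exactly three steps apart, which is the window in which the network latency is hidden.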


A Note on Implementing Combining Networks - Keller, Walle (1995)   (1 citation)  Self-citation (Keller)   (Correct)

....routing algorithm uses six phases, i.e. six traversals of butterfly networks to route and combine requests from processors to memory modules and to re-duplicate and route answers back to processors. Routing only occurs in phases 2 and 5, the other phases can be implemented by dedicated hardware [1]. In Ranade's scheme, each butterfly node contains a processor and a memory module. This can be changed such that processors (together with dedicated hardware for phases 1 and 6) are only placed at the inputs of phase 2 and the outputs of phase 5. Memory modules with multiple banks (implementing ....

....(implementing phases 3 and 4) are only placed at the outputs of phase 2 and the inputs of phase 5. One physical processor simulates a number of Ranade's processors. We call the execution of one instruction of each simulated processor a processor round. For details of the processor architecture see [1, 4]. We will focus on phase 2 because combining happens here. Phase 2 is implemented on a butterfly network as given by Def. 1. Definition 1: A butterfly network with N = 2^n inputs and outputs is a graph G_n that consists ....

F. Abolhassan, J. Keller and W. J. Paul, On the cost--effectiveness of PRAMs, in: Proc. 3rd IEEE Symp. on Parallel and Distributed Processing (1991) 2--9.


Parallel Software Caches - Formella, Keller (1996)   Self-citation (Keller)   (Correct)

....should be avoided if high performance is the aim [3, 8]. But with the upcoming of shared memory architectures both as massively parallel multiprocessors or as small scale bus oriented multiprocessors the concept of a parallel cache promises additional performance. We show that the SB PRAM [1, 2] is a good platform to investigate the numerous tradeoffs that one encounters while implementing such a parallel data structure. Some of the concepts might be transferable to other architectures such as NYU Ultracomputer [9], Tera MTA [4] and Stanford DASH [13]. We define the notion of a cache ....

Ferri Abolhassan, Jorg Keller, and Wolfgang J. Paul. On the cost--effectiveness of PRAMs. In Proceedings of the 3rd IEEE Symposium on Parallel and Distributed Processing, pages 2--9. IEEE, December 1991.


Reduction of Network Cost and Wiring in Ranade's.. - Cross, Drefenstedt.. (1993)   (5 citations)  Self-citation (Keller)   (Correct)

....could end in a deadlock. Ranade used his algorithm for the design of a very elegant emulation of a shared memory parallel machine on a processor network [7]. The emulation overhead is c log n. A reengineered version of his emulation was shown to have an emulation overhead where c is very small [2], making the emulation interesting for practical use. Prior to [2] shared memory emulations were thought to be impractical because of large constant factors involved. Because shared memory parallel machines are easier to program than distributed memory machines, they could become a serious ....

....of a very elegant emulation of a shared memory parallel machine on a processor network [7]. The emulation overhead is c log n. A reengineered version of his emulation was shown to have an emulation overhead where c is very small [2], making the emulation interesting for practical use. Prior to [2], shared memory emulations were thought to be impractical because of large constant factors involved. Because shared memory parallel machines are easier to program than distributed memory machines, they could become a serious competitor to the latter, if there are prac ....

F. Abolhassan, J. Keller, W. J. Paul, On the cost--effectiveness of PRAMs, in: Proc. 3rd Symposium on Parallel and Distributed Processing (IEEE, 1991) 2--9.


Realization of PRAMs: Processor Design - Jörg Keller, Wolfgang J. Paul.. (1994)   (15 citations)  Self-citation (Keller Paul)   (Correct)

....on each physical processor. A job that accesses memory is descheduled until the answer from the remote module is back. This requires a minimum number of ready to run processes. Valiant calls this parallel slackness [8]. Smith used this idea for the HEP and the TERA [3]. Abolhassan et al. [1] used this idea in an architecture called SB PRAM when investigating whether PRAM emulations are feasible with today's technology. There the access time is uniformly large which makes scheduling much simpler. We will present a processor architecture for the SB PRAM. There were several design ....

....In section 5 we sketch the processor board. 2 SB PRAM Architecture The SB PRAM is an example of the shared memory emulations described in the introduction. Here we will sketch some of its features necessary to understand the processor architecture. A more detailed description can be found in [1]. The SB PRAM uses a linear hash function of the form H(x) = a · x mod m where m, the size of the shared address space, is assumed to be a power of two. The factor a is an odd integer between 1 and m − 1, that is chosen randomly before the start of an application. The module h(x) that ....

F. Abolhassan, J. Keller, and W. J. Paul. On the cost--effectiveness of PRAMs. In Proc. 3rd Symp. on Parallel and Distributed Processing, pages 2--9. IEEE, Dec. 1991.
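
As a minimal sketch of the linear hash quoted in the excerpt above, the following Python fragment draws a random odd factor a and maps an address x to a * x mod m. The name make_linear_hash, the fixed seed, and the toy size m = 16 are assumptions for illustration and do not come from the cited paper. How the SB PRAM derives a memory module from the hashed address is not shown, because the excerpt is truncated at that point.

```python
import random


def make_linear_hash(m, rng=None):
    """Return (a, H) where H(x) = a * x mod m for a random odd a in [1, m - 1].

    m is the size of the address space and must be a power of two,
    as the excerpt assumes.
    """
    if m < 2 or m & (m - 1) != 0:
        raise ValueError("m must be a power of two, at least 2")
    rng = rng or random.Random()
    a = rng.randrange(1, m, 2)  # odd integers 1, 3, ..., m - 1

    def H(x):
        return (a * x) % m

    return a, H


# Toy address space; the seed only makes the illustration repeatable.
m = 16
a, H = make_linear_hash(m, random.Random(0))

# Collect the hashed addresses of the whole address space.
image = sorted(H(x) for x in range(m))
is_permutation = image == list(range(m))
```

Because a is odd and m is a power of two, a is invertible modulo m, so H permutes the address space and no two addresses collide; is_permutation checks this for the toy size.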


Isolating the Reasons for the Performance of Parallel .. - Formella, Müller.. (1992)   (5 citations)  Self-citation (Paul)   (Correct)

....compiler, i.e. code generation strategies. Built in library functions exploiting the underlying hardware are implicitly tested too. In the end we want to be able to model the machines based on the gathered information in a 1 1) At the workshop a survey talk with results from [ADK 91] and [AKP91] was given by the third author. 2) This research is part of the PARANUSS project, which is funded by BMFT and DLR. Book title and editor name © 1992 John Wiley & Sons Ltd 2 way, that a run time prediction is reasonably close to the measured run time, even for parallel algorithms. At the ....

F. Abolhassan, J. Keller, and W.J. Paul. On the Cost--Effectiveness of PRAMs. In Proc. 3rd IEEE Symposium on Parallel and Distributed Processing, pages 2--9, 1991.


Applications of PRAMs in Telecommunications - Drefenstedt, Keller, Paul (1994)   (3 citations)  Self-citation (Keller Paul)   (Correct)

....completed in time O(log p) leading to O(p) requests completed per time unit. As with many new approaches, it was first unclear whether the emulations were practical in terms of constant factors, or of pure theoretical interest. Ranade's emulation [9] was re-engineered and found to be practical [12]. A prototype of the resulting architecture with p = 128 processors, called the SB PRAM, is currently being constructed [13]. The prototype does not use state of the art technology. Memory access latency in this prototype is 3.8 s, all data paths are 32 bits wide, a request can be sent every 140 ....

F. Abolhassan, J. Keller, and W. J. Paul, On the cost--effectiveness of PRAMs. In Proc. 3rd Symp. on Parallel and Distributed Processing, pp. 2--9. IEEE Computer Society Press, Los Alamitos, 1991.


A Note on Implementing Combining Networks - Keller, Walle (1995)   (1 citation)  Self-citation (Keller)   (Correct)

....algorithm uses six phases, i.e. six traversals of Butterfly networks to route and combine requests from processors to memory modules and to route and re-duplicate answers back to processors. However, routing only occurs in phases 2 and 5, the other phases can be implemented by dedicated hardware [1]. In Ranade's scheme, each butterfly node contains a processor and a memory module, however this can be changed such that processors (together with dedicated hardware for phases 1 and 6) are only placed at the inputs of phase 2. Memory modules with multiple banks are only placed at the outputs of ....

....inputs of phase 2. Memory modules with multiple banks are only placed at the outputs of phase 2. One physical processor simulates a number of Ranade's processors. We call the execution of one instruction of each simulated processor a processor round. For details of the processor architecture see [1, 4]. We will focus on phase 2 because combining happens here. Phase 2 is implemented on a butterfly network as given by Def. 1. Definition 1 A butterfly network with N = 2^n inputs and outputs is a graph G_n that consists of n + 1 stages, numbered from 0 to n, with N nodes per stage, numbered 1 ....

F. Abolhassan, J. Keller and W. J. Paul, On the cost--effectiveness of PRAMs, in: Proc. 3rd IEEE Symp. on Parallel and Distributed Processing (1991) 2--9.


On the Physical Design of PRAMs - Abolhassan, Drefenstedt, Keller.. (1993)   (36 citations)  Self-citation (Abolhassan Keller Paul)   (Correct)

....behaviour of a shared memory are called PRAMs (Parallel Random Access Machine) in the theoretical literature. The problem of simulating PRAMs on processor networks has been studied in depth [10, 17, 20, 24]. A re-engineered version of Ranade's Fluent machine construction [20, 21] was proven in [2] to be cost effective at the gate level, even in comparison with multi computers. This motivated the present effort to design and construct a prototype, called the SB PRAM [1]. The prototype will have 128 processors. The current designs assume a clock speed of 7 MHz for processors and 28 MHz for ....

....sorted, in phases 3 and 4, rows are shifted. We therefore will use two butterfly networks to realize phases 2 and 5, use linear sorting arrays [14] for phases 1 and 6, and use 2^n modules with multiple banks to omit phases 3 and 4. A more detailed description of the changes made can be found in [2]. We realize the processors of one row by one physical processor that runs cn virtual processors in a pipeline. We obtain a total of p = 2^n physical processors and N = cnp virtual processors. Each virtual processor has its own register set in hardware. The instruction set is similar to that of ....

[Article contains additional citation context not shown here]

F. Abolhassan, J. Keller and W. J. Paul, On the cost--effectiveness of PRAMs. In Proc. 3rd Symp. on Parallel and Distributed Processing, pp. 2--9. IEEE CS Press, Los Alamitos (1991).


HPP: A High Performance PRAM - Formella, Keller, Walle (1996)   (4 citations)  Self-citation (Keller)   (Correct)

....how processors, network nodes and network links can be improved. In section 4, we investigate our benchmark applications and show which performance gain is possible by careful instruction scheduling. In section 5, we conclude and present further directions of research. 2 SB PRAM The SB PRAM [2] is a massively parallel multiprocessor architecture with p processors providing users with a virtual shared memory. It is based on Ranade's Fluent Machine [15] 1 In [3, p. 379] an Intel Paragon XPS with 1872 processors is reported to obtain a performance of 36.45 GFlop/s on Linpack with a ....

Ferri Abolhassan, Jörg Keller, and Wolfgang J. Paul. On the cost--effectiveness of PRAMs. In Proceedings of the 3rd IEEE Symposium on Parallel and Distributed Processing, pages 2--9. IEEE, December 1991.


Logic of Global Synchrony - Yifeng Chen University (2000)   (Correct)

No context found.

ABOLHASSAN, F., KELLER,J.,AND PAUL, W. 1999. On the cost-effectiveness of PRAMs. Acta Informatica 36, 6, 463--487.


The Queue-Read Queue-Write PRAM Model: Accounting for.. - Gibbons, al. (1996)   (6 citations)  (Correct)

No context found.

F. Abolhassan, J. Keller, and W. J. Paul. On the cost-effectiveness of PRAMs. In Proc. 3rd IEEE Symp. on Parallel and Distributed Processing, pages 2--9, December 1991.


The Queue-Read Queue-Write PRAM Model: Accounting for Contention. .. - Gibbons (1996)   (6 citations)  (Correct)

No context found.

F. Abolhassan, J. Keller, and W. J. Paul. On the cost-effectiveness of PRAMs. In Proc. 3rd IEEE Symp. on Parallel and Distributed Processing, pages 2--9, December 1991.


SB-PRAM - Instruction Set Simulator System Software - Keßler   (Correct)

No context found.

F. Abolhassan, J. Keller, W.J. Paul. On the Cost--Effectiveness of PRAMs. Proc. 3rd IEEE Symp. on Par. and Distr. Processing, IEEE CS press, 1991.


[MPS92] C. Martel, A. Park, and R. Subramonian. Work-optimal.. - Siam Journal   (Correct)

No context found.

F. Abolhassan, J. Keller, and W. J. Paul. On the cost-effectiveness of PRAMs. In Proc. 3rd IEEE Symp. on Parallel and Distributed Processing, pages 2--9, December 1991.


CiteSeer.IST - Copyright Penn State and NEC