Selim G. Akl, The Design and Analysis of Parallel Algorithms, Prentice Hall, New Jersey, 1989
....Key words. Parallel computing, PRAM model, randomized PRAM simulation algorithms, module parallel computer, universal hashing, parallel slackness, T9000 transputer 1 Introduction A parallel random access machine (PRAM) is the most commonly used model for describing parallel computations [10, 4, 1]. Although the model has many advantages, it is unrealistic from the technological point of view, since on large scale machines a parallel shared memory access can only be accomplished at the cost of a substantial time delay. Several PRAM simulation algorithms have been proposed in the literature ....
Akl, S.G., The Design and Analysis of Parallel Algorithms, Prentice-Hall, Englewood Cliffs, N.J., (1989).
....strands at the same time. It corresponds to that a single instruction stream is executed by all processors to manipulate their local data synchronously. According to the architecture classification of Flynn, DNA computing can be classified into the class of Single Instruction Multiple Data (SIMD) [2]. Therefore, we could say DNA computing has the capability of parallel processing. There are two main strategies in DNA computation. One is to implement the brute force algorithm for solving NP complete problems which are still difficult to be solved in silicon based computers [8, 9, 11, 12, 14, ....
S. G. Akl, The Design and Analysis of Parallel Algorithms. Prentice-Hall, Englewood Cliffs, New Jersey, USA, first ed., 1989.
....or write to any variable. There is also a further classification of the CRCW PRAM model based on a writing conflict resolution strategy which specifies what is written when more than one processor writes to a particular variable on a given step. For more details regarding this classification see [8, 3, 1]. 3 Module parallel computer A module parallel computer (MPC) consists of n RAM processors, each of which has an associated memory module [7] A memory module is a collection of variables. Every processor may access every memory module via a fully connected network linking the processors (see ....
....C104 can route packets of any length [6] The SN9500 contains five C104s and up to 32 fully interconnected T9000s (Fig. 3) Each data link of each T9000 is connected to one of the C104 routing devices. Except for two of the T9000s, data link 0 of each T9000 is connected to C104[0] link 1 to C104[1], etc. This means that every T9000 is connected to every other T9000 via only one C104. The data links of the two T9000s and of the interface card are connected to the fifth C104 which in turn is connected via four pairs of its data links to each of the other routing devices [5] 6 PRAM ....
[Article contains additional citation context not shown here]
Akl, S.G., The Design and Analysis of Parallel Algorithms, Prentice-Hall, Englewood Cliffs, N.J., (1989).
....of map. Thus, the algorithm is asymptotically optimal. However, its performance in practice can suffer from the fact that it exploits two collective operations, each of which involves a considerable amount of interprocessor communication. Note that the algorithm presented in a popular textbook [1] may have an even higher cost than mss alg2, since it exploits three collective operations: two scans and one reduce. The user's responsibility is not only to produce an algorithm but also to understand whether it is really usable in practice and, even more importantly, how it can be transformed ....
S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall, 1989.
....Complexity of Some Basic Operations A two-dimensional Fast Fourier Transform (FFT) is a commonly used technique in digital image processing, and several algorithms in this paper make use of it. The FFT is well suited for parallel applications because it is efficient and inherently parallel ([20], [1], [22], [23], [42]). With an image size of n elements, O(n log n) operations are needed for an FFT. On a parallel machine with p processors, O computational steps are required. The communications needed for an FFT are determined by the FFT algorithm implemented on a particular parallel ....
....parameters f ff g and array f Theta g. 2.2 Calculate p[0] 1 e T 2.3 For all gray levels {g} from [1, G − 1] do: 2.3.1 Psi(T; g) gT 1 e 2.3.2 p[g] = C(G − 1, g) Psi^g (1 − Psi)^(G − 1 − g) (The Binomial Distribution) 2.4 Generate a random number in the interval [0,1] at each pixel location and use this to select the new gray level {g} from p[g]. An example of a binary synthetic texture generated by the Gibbs Sampler is given in Figure 5. Figure 5: Isotropic Inhibition Texture using Gibbs Sampler (Texture 9b from [14]). With p I Theta J processing ....
S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1989.
....prefix sum information, each processor easily determines where these elements are located and issues read primitives for the respective remote locations to fill the [1x p distributed output array. The analysis for the dynamic data redistribution algorithm shows that [4] Tcomm(r ,p) 2r maxi N[i]) p; Tcomp(n,p) O(maxi N[i] 1) Note that the input distribution N for dynamic data re distribution can range from already balanced data (N[i] ra, Vi) to the case where all data is located on a single processor ( VIi] N,i = i;N[il = 0, Vi i) For a large class of irregular problems such ....
....processor easily determines where these elements are located and issues read primitives for the respective remote locations to fill the [1x p distributed output array. The analysis for the dynamic data redistribution algorithm shows that [4] Tcomm(r ,p) 2r maxi N[i] p; Tcomp(n,p) O(maxi N[i] ) 1) Note that the input distribution N for dynamic data re distribution can range from already balanced data (N[i] ra, Vi) to the case where all data is located on a single processor ( VIi] N,i = i;N[il = 0, Vi i) For a large class of irregular problems such that data are distributed ....
[Article contains additional citation context not shown here]
S. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1989.
....paper is the implementation of parallel sorting algorithms on RASOB. Because of its fundamental importance, sorting is one of the most extensively studied computing problems. Many researchers have developed various parallel algorithms to speed up sorting on different parallel computation models [1, 19]. In particular, fast state of the art sorting algorithms were presented recently for various models of processor arrays with reconfigurable electronic buses [21] Wang et al. proposed a constant time algorithms using O(N [39] Using the Columnsort technique proposed by Leighton [18] Ben Asher ....
.... For example, what if we want to sort N data items on a P-processor 1D RASOB, where P < N? Fortunately, there are several good methods for converting an algorithm that was designed for a P1-processor network so that it can run on a P2-processor network (where P2 < P1) with minimum slowdown [1, 21]. The only requirement is that the processors of G2 be coarser grained than the processors of G1. For example, in order to sort N data items on a P-processor 1D RASOB, where P < N, each processor would have to store at least N/P data items for the RASOB to be able to sort N items. As a result, ....
[Article contains additional citation context not shown here]
S. G. Akl, The Design and Analysis of Parallel Algorithms, Prentice-Hall, Englewood Cliffs, NJ, 1989.
....this thesis took a more theoretical direction. 1.2 Parallelism The field of parallel algorithms is relatively new. In this field there are two mainstreams. One mainstream tries to parallelize sequential algorithms in a rather straightforward way, keeping the processor-time product constant (see [Qui87, Akl89]), while the other tries to find parallel algorithms that run in polylogarithmic time (see [GR88, JáJ92]). Sequential algorithms are considered fast (or feasible) when they solve their problems in polynomial time. Parallel algorithms, however, are only called fast when they solve their problems ....
S.G. Akl. The Design and Analysis of Parallel Algorithms. Prentice Hall, 1989.
.... problems of parallel processing: computing the prefix sums of a bit sequence (BPS, for short). The BPS problem is fundamental, for its solution is the principal ingredient in arithmetic expression evaluation, storage and data compaction, processor assignment, and routing, among many others [1, 5, 7, 24, 25]. The remainder of the paper is organized as follows: Section 2 reviews shift switching and established terminology; Section 3 presents a naive design for BPS computation; Section 4 discusses the first proposed VLSI architecture for BPS a complete example of the working of the first proposed ....
S. G. Akl, The design and analysis of parallel algorithms, Prentice-Hall, Englewood Cliffs, New Jersey, 1989.
....Phase Abstractions Language C, Fortran, Pascal, etc. A ZPL viously ignores the cost of communication between processors, which has yet and is unlikely to be realized in hardware. In fact, with the PRAM, absurd conclusions may be drawn, including constant time sorting algorithms [Akl89] Although the PRAM is useful in studies of the limits of parallelism, it is not a practical tool for guiding algorithm design. The LogP model contains P serial processors, each with a local memory, connected via a network of unspecified network topology. Rather than a unit cost memory access, ....
Selim G. Akl. The design and analysis of parallel algorithms. Prentice Hall, Englewood Cliffs, N.J., 1989.
....or write to any variable. There is also a further classification of the CRCW PRAM model based on a writing conflict resolution strategy which specifies what is written when more than one processor writes to a particular variable on a given step. For more details regarding this classification see [10, 4, 1]. 3 Module parallel computer A module parallel computer (MPC) consists of n synchronously working RAM processors, each of which has an associated memory module [8] A memory module is a collection of variables. Every processor may access every memory module via a fully connected network linking ....
....C104 can route packets of any length [7] The SN9500 contains five C104s and up to 32 fully interconnected T9000s (Fig. 3) Each data link of each T9000 is connected to one of the C104 routing devices. Except for two of the T9000s, data link 0 of each T9000 is connected to C104[0] link 1 to C104[1], etc. This means that every T9000 is connected to every other T9000 via only one C104. The data links of the two T9000s and of the interface card are connected to the fifth C104 which in turn is connected via four pairs of its data links to each of the other routing devices [6] 6 PRAM ....
[Article contains additional citation context not shown here]
Akl, S.G., The Design and Analysis of Parallel Algorithms, Prentice-Hall, Englewood Cliffs, N.J., (1989).
....migration of the quicksort algorithm to hardware. The algorithm and its mapping to hardware are discussed. Keywords: sorting, VHDL, FPGA, digital systems, fast prototyping. 1 Introduction Ordering data according to some criterion is one of the most basic tasks used in processing information [1]. Although this ordering may be a goal in itself, the main reason for ordering data is mostly to enhance the performance of other processing tasks. The data classification problem is usually stated formally using set theory. We use the designation sorting for the theoretical problem associated to ....
....to exactly one element t of the sequence Q, and additionally, the ordering holds for any pair of elements of SS, i.e. t_i ≤ t_(i+1) for i = 1, 2, 3, ..., p − 1. The last 40 years have seen an enormous research effort in developing, implementing and testing sorting algorithms in real life problems [1]. Yet, every month new results are published suggesting enhancements in previously published sorting algorithms, or problem specific solutions that enhance performance of some real life applications needing to use sorting techniques. This work investigates strategies for accelerating sorting using ....
S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall, Inc., NJ, 1989.
....stored in M 0 repeat log kA times the following steps: a. For i = 1 to kA: Sort all numbers in column i of the submesh M 0 from the bottom if i is odd and from the top if i is even; b. In parallel sort the numbers in each row of M 0 with kA steps of odd-even transposition sort (see e.g. [1]). As a result the kA · n numbers are sorted columnwise in M 0. Now, we give each number an additional index in such a way that after sorting the numbers again but this time according to the added indices, the numbers will be sorted rowwise in M 0. Altogether, this sorting takes time ....
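The odd-even transposition sort named in step b of the excerpt above can be sketched as follows. This is a minimal sequential simulation in Python, not the mesh algorithm of the citing paper; the function name and the choice to run a full n phases (which is sufficient to sort n items) are assumptions made for illustration.

```python
def odd_even_transposition_sort(values):
    """Sort a sequence with n phases of odd-even transposition.

    Sequential simulation: on a linear array of processors, all
    compare-exchanges within one phase would run in parallel.
    """
    a = list(values)
    n = len(a)
    for phase in range(n):
        # Even phases compare pairs (0,1), (2,3), ...;
        # odd phases compare pairs (1,2), (3,4), ...
        start = phase % 2
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

The excerpt runs only kA steps per row inside a larger merge loop; the sketch runs all n phases so that it sorts on its own.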
S.G. Akl. The design and analysis of parallel algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1989.
.... 6, 8, 10, 14, 19, 29, 30, 31, 42, 43] Simultaneously, the structure of sets of all (n,k) combinations and related combinatorial objects like combinations with repetitions and integer compositions was investigated, and new ranking unranking techniques were proposed satisfying various requirements [3, 14, 17, 21, 23, 44]. Most known generation algorithms use the conventional representation of combinations, which is not suitable for applications, where fast generation in the binary representation is required, and therefore objects generated in one representation have to be converted into another. One instance of ....
Akl S.G.: Design and analysis of parallel algorithms, Prentice Hall, Englewood Cliffs, N.J., 1989, pp. 148-150.
....basic problem and its parallel solution lies behind our methods. Given K inputs x_1, ..., x_K and an associative operator o, the parallel prefix problem is to compute the K partial products x_1, x_1 o x_2, ..., x_1 o x_2 o ... o x_K; see Ladner and Fischer (1980) and pp. 47, 341 of Akl (1989). The problem can be solved in O(K/P + log P) time using P processors; see Kruskal, Rudolph and Snir (1985). Moreover, as shown by Leighton (1992), very efficient solutions can be tailored to a wide variety of parallel architectures, and special hardware or microcode is commonly used to effect ....
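The parallel prefix problem described in the excerpt above can be illustrated with a blocked scheme: each of P workers scans its own block, the block totals are scanned, and each block is then offset by the total of everything before it. The sketch below is a sequential Python simulation under those assumptions; the name blocked_prefix and the block layout are hypothetical and are not taken from the cited works. In a real parallel setting the two block-local steps cost about K/P each and the scan over P totals can be done in about log P steps.

```python
from itertools import accumulate
from operator import add


def blocked_prefix(xs, op=add, p=4):
    """Prefix 'products' of xs under an associative operator op,
    organised as p independent blocks (sequential simulation)."""
    xs = list(xs)
    if not xs:
        return []
    p = max(1, min(p, len(xs)))
    size = -(-len(xs) // p)  # ceiling division: items per block
    blocks = [xs[i:i + size] for i in range(0, len(xs), size)]

    # Step 1: each block computes its local prefixes (parallel across blocks).
    local = [list(accumulate(block, op)) for block in blocks]

    # Step 2: prefix over the block totals (the log P part on a real machine).
    totals = list(accumulate((scan[-1] for scan in local), op))

    # Step 3: combine each block with the total of all earlier blocks.
    # The carry goes on the left so non-commutative operators stay in order.
    out = list(local[0])
    for j in range(1, len(local)):
        carry = totals[j - 1]
        out.extend(op(carry, v) for v in local[j])
    return out
```

For example, blocked_prefix([1, 2, 3, 4, 5, 6, 7], p=3) gives the running sums of the input, and the same function works with string concatenation as the operator.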
....a second batch of values, A(B+1), ..., A(2B) and D(B+1), ..., D(2B) using the same algorithm, and retaining only A(B) and D(B) playing the roles of A(0) and D(0) resp. from the first batch. Processing in this way, the time used becomes O(N( K P) K B) log P) References Akl, S. G. (1989) The Design and Analysis of Parallel Algorithms, Prentice Hall, Englewood Cliffs, NJ. Aldous, D. and Diaconis, P. (1987) Shuffling cards and stopping times. Amer. Math. Monthly 93, 333-348. Apostolico, A., Atallah, M. J., Larmore, L. L. and McFaddin, H. S. (1988) Efficient parallel ....
[Article contains additional citation context not shown here]
Akl, S. G. (1989). The Design and Analysis of Parallel Algorithms, Prentice Hall, Englewood Cliffs, NJ.
....of whether memory is shared or distributed, and independent of the control mode (SIMD or MIMD) of a parallel machine. Similar extensions could easily be included in other languages. 1 Introduction While the area of parallel algorithms is quite well developed (see for instance, textbooks [1, 7]) the state of the art in compiler technology for translating parallel programs for highly parallel computers is unsatisfactory. This paper describes some of the problems that compilers for problem oriented parallel programming languages must solve, and suggests possible solutions. It also ....
....locus of control, which dramatically simplifies the state space of a program compared to that of an MIMD program with thousands of independent loci of control. 3) There is a wide range of data parallel algorithms. Most parallel algorithms in textbooks are data parallel (compare for instance [1, 7]) According to Fox [6] more than 80 of the 84 existing, parallel applications he examined fall in the class of synchronous, data parallel programs. Furthermore, systolic algorithms as well as vector algorithms are special cases of data parallel algorithms. But data parallelism, at least as ....
[Article contains additional citation context not shown here]
Akl SG. The Design and Analysis of Parallel Algorithms. Prentice Hall, Englewood Cliffs, New Jersey, 1989.
....minor extensions of existing programming languages are required to express highly parallel programs. Thus, programmers will need only moderate additional training, mainly in the area of parallel algorithms and their analysis. This area, fortunately, is well developed; see for instance textbooks [1] and [5] In compiler technology, however, new techniques must be found to map machine independent programs to existing architectures, while at the same time parallel machine architecture must evolve to efficiently support the features that are required for problem oriented programming styles. We ....
....locus of control, which dramatically simplifies the state space of a program compared to that of an MIMD program with thousands of independent loci of control. 3) There is a wide range of data parallel algorithms. Most parallel algorithms in textbooks are data parallel (compare for instance [1, 5]) According to Fox [4] more than 80 of the 84 existing, parallel applications he examined fall in the class of synchronous, data parallel programs. Furthermore, systolic algorithms as well as vector algorithms are special cases of data parallel algorithms. But data parallelism, at least as ....
Selim G. Akl. The Design and Analysis of Parallel Algorithms. Prentice Hall, Englewood Cliffs, New Jersey, 1989.
....cause a significant slowdown. We are convinced that performance will improve when our Modula-2 compiler implements alignment and exploits locality of reference in the near future. 4 Test Problems and Results At this time, our benchmark suite consists of nine problems collected from literature [1, 4, 8, 5]. For each problem, we implemented the same algorithm in Modula-2, in C, and in MPL. Then we measured the runtimes of our implementations on a 16k MasPar MP-1 and a Sparc 1 for widely ranging problem sizes. In the Modula-2 programs, we use highly efficient library routines such as reductions ....
....the error b 0 Gamma a 0 ffl. Approach II: Again, the interval [a; b] is divided evenly over all processes. Then each process performs Newton's iteration. The algorithm terminates when a process finds the root. Note: This problem occurs frequently in science and engineering applications [1]. [Plots: Problem RSI and Problem RSII; x-axis: problem size (2^6 to 2^24); labels: t(c), t(m2), t(mpl)] 4.4 Point in Polygon Problem: A simple ....
[Article contains additional citation context not shown here]
Selim G. Akl. The Design and Analysis of Parallel Algorithms. Prentice Hall, Englewood Cliffs, New Jersey, 1989.
....of nodes have been proposed and recently built. One of the earliest and the most prominent one is a complete binary tree [2, 5] Many algorithms can be naturally programmed on complete binary trees (e.g. algorithms using a divide and conquer strategy) and these networks arise in many applications [1, 2, 5, 8, 13]. In some situations not all the nodes of a complete binary tree machine are of the same type; i.e. some nodes may have more memory and or processing power than others. In particular, the leaf nodes of a complete binary tree may do the actual processing, while the interior nodes may be simply ....
....trees. Let A be the class of algorithms in which the computation is performed level by level on a complete binary tree. There are many fundamental problems that have a solution in class A (e.g. the problems of broadcasting or the problems of computing reduction functions such as min and max [1, 9, 13]). We show that any algorithm from class A can be simulated to run on a compressed tree with no slowdown. In [6] we considered an example of a class A algorithm, namely the parallel prefix computation [1, 9, 13], and showed that given an algorithm to compute parallel prefix on a complete binary ....
[Article contains additional citation context not shown here]
S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice Hall, 1989.
....load among the processors in H. When the load of an embedding is balanced, we have = d n m e. In some architectures not all processors may have 2 identical capabilities. For example, tree networks in which the leaf processors are different from the interior processors have been designed [1, 3]. We thus investigate embeddings in which the difference between the number of leaf (resp. interior) processors assigned to any two processors of H is at most 1. More formally, we say an embedding achieves a balanced l i load if every processor of H is assigned (i) at least b n 1 2m c and at ....
S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice Hall, 1989.
....The array P is then used to permute the input array resulting in a sorted form of array V. Because the algorithm computes over a 2-dimensional problem space, it has Θ(n^2) work complexity. Similar algorithms have been described as constant time sorting algorithms for unrealizable CRCW machines [3, 22]. 2 For simplicity we assume that the input sequence contains no duplicates. The structure of the algorithm is unchanged when extended to handle duplicates. 1. n] begin S : 0; for i : 1 to n do [i. n] S = i] V) V) end; frequency of mode count : max S; get actual mode ....
Selim G. Akl. The Design and Analysis of Parallel Algorithms. Prentice Hall, Englewood Cliffs, New Jersey, 1989.
....parallel paradigm. Preliminaries and Operational Specifications Given a set S = fs 1 ; s 2 ; s 2n g of points in the plane, the convex hull of S is the smallest convex polygon P , for which each point in S is either on the boundary of P or in its interior. The following analogy given in [Akl89] might be useful: Assume that the points of S are nails driven halfway into a wooden board. A rubber band is now stretched around the set of nails and then released. When the band settles, it has the shape of a polygon. Those nails touching the band at the corners of that polygon are the vertices ....
....g pq y s pq y = reduce max y n 1 s 0 This algorithm uses all those higher order functions on sequences, which can immediately be rewritten as skeletons for a particular massively parallel architecture. The algorithm we have derived here differs from those in the parallel literature (cf. [JáJ92, Akl89]). In particular, it does not need unrealistic assumptions like a concurrent read access to shared memory variables as e.g. given by the PRAM model, but is well suited for massively parallel computation on distributed memory architectures by making efficient use of the underlying interconnection ....
S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall, 1989.
....and Creative Activities Support Funds WMU FRCASF 90 15, WMU 89 225274, and by the National Science Foundation under grant USE 90 52346. can be naturally programmed on complete binary tree machines (e.g. algorithms using a divide and conquer strategy) and these networks arise in many applications [2, 5, 4, 7, 10, 14, 20, 25, 27]. In some situations not all the processors of a complete binary tree machine are of the same type; i.e. some processors may have more memory and or processing power than others. In particular, the leaf processors of a complete binary tree may do the actual processing, while the interior ....
....level by level at any given time on a complete binary tree. We show that any algorithm from class A can be simulated to run on a compressed tree machine with no slow down. We first consider one of the fundamental problems in parallel computation, namely the computation of parallel prefix sum [2, 8, 15, 27], and show that given an algorithm to compute parallel prefix sum on a complete binary tree T (h) the algorithm can be converted with the same time complexity, rather in a straight forward way, to run on a compressed tree CT k (h) Note that the well known parallel prefix sum algorithms ....
[Article contains additional citation context not shown here]
S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice Hall, 1989.
....on a 448 × 448 matrix. Sor computes the steady state temperature of a metal sheet using a banded parallelization of red-black successive overrelaxation on a 640 × 640 grid. Fft computes a one-dimensional FFT on a 65536-element array of complex numbers, using the algorithm described in [2]. Mp3d and water are part of the SPLASH suite [29]. Mp3d is a wind tunnel airflow simulation. We simulated 40000 particles for 10 steps in our studies. Water is a molecular dynamics simulation computing inter- and intra-molecule forces for a set of water molecules. We used 256 molecules and 3 ....
S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice Hall, Inc., Englewood Cliffs, NJ, 1989.
....schemes: linear (PN0 Vertex0, PN1 Vertex1, etc.), random (random(PN) Vertex0, random(PN) Vertex1, etc.), semi-linear (for hypercube and 3-d mesh), interleaved (for the 2-d and 3-d meshes) and infix (for tree). Note that all these schemes are 1-1 mappings. We embedded 2-d and 3-d meshes [1, 2] with end-around cyclic connections using linear, interleaved and random embedding schemes. The linear embedding corresponds to the layout the CM Fortran compiler uses for meshes (multi-dimensional arrays). Measurements for each scheme are shown individually for each dimension of the mesh (see ....
....3.331 3.174 3.821 3.320 6 3.464 3.685 2.232 3.075 3.509 2.864 7 3.194 3.270 1.747 2.930 3.268 2.383 8 3.004 2.955 1.393 2.735 2.929 1.978 9 2.556 2.366 1.289 2.242 2.262 1.273 Table 8: Effective bandwidth per PN of different hypercube embeddings for CMMD V 1.3. 1 (unit = MB s) 18 A hypercube[1, 7] was embedded using linear, random and semi linear schemes ( see Table 8) A semi linear scheme (see Figure 9) involves swapping one processing node in each group of 4 nodes with another processing node in a neighboring group of 4 (PN3 Vertex4, PN4 Vertex3) At the next higher level in the ....
S.G. Akl. The design and analysis of parallel algorithms. Prentice-Hall, 1989.
....derives a better strategy in the program for improving the performance. 6.4 Performance Tuning advice Figure 8 shows the broadcasting strategy suggested by PPA after the system observed that there was a bad broadcasting strategy in the example program. The suggested strategy was proposed by Akl [10] and the operations of this strategy include 1. P 0 sends data b to P1; 2. P0 and P1 send data b to P2 and P3 respectively in parallel; 3. P0, P1, P2, and P3 send data b to P4, P5, P6, and P7 respectively in parallel; and so on. procedure broadcast (b, N) for i = 0 to (log N 1) do for j ....
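The doubling broadcast listed in the excerpt above (P0 to P1, then P0 and P1 to P2 and P3, and so on) can be written out as a schedule of rounds. The Python sketch below only generates the sender and receiver pairs; the function name and the handling of processor counts that are not powers of two are assumptions for illustration, not part of the quoted procedure.

```python
def broadcast_schedule(n):
    """Return a list of rounds; each round is a list of (sender, receiver)
    pairs, and all sends within one round may proceed in parallel."""
    rounds = []
    have = 1  # processors 0 .. have-1 already hold the data
    while have < n:
        # Every holder j forwards the data to processor j + have, if it exists.
        pairs = [(j, j + have) for j in range(have) if j + have < n]
        rounds.append(pairs)
        have *= 2
    return rounds
```

For 8 processors this yields three rounds, matching steps 1 to 3 of the excerpt; in general the number of rounds is the ceiling of log2(n).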
S. G. Akl, "The Design And Analysis of Parallel Algorithms", Prentice-Hall International Editions, 1989.
....n p m;im) 6.48) The cost for the input independent case is given by: C hc g scan = C f (2 n p 2d Gamma 4) d Gamma1 X i=0 T 2 i n p m com (d Gamma 1)T 2 (d Gamma1) n p m com : 6.49) 6.2.3. 3 scan on the 2 D Torus Mesh The algorithm is based on the one described in [Akl89] ffl s scan If the wrap around links are not used, then the cost is given by: C1 m s scan = C f ( n p Gamma 1) z step 1 (T 1 com C f ) p 1 Gamma 1) z step 2 (C f (p 2 Gamma 1) T 1 com (p 2 Gamma 1) T 1 com (p 1 Gamma 1) z step 3 C f ....
Selim G Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall, 1989.
....on a larger number of processors. The scalability analysis of FFT on hypercube provides several important insights. On the hypercube architecture, a commonly used parallel formulation of the FFT algorithm (which we shall refer to as the binary exchange algorithm in the rest of the paper) [3, 4, 6, 11, 21, 32, 41, 36, 31] can obtain linearly increasing speedup with respect to the number of processors with only a moderate increase in problem size. This is not surprising in the light of the fact that the FFT computation maps naturally to the hypercube architecture [35] However, there is a limit on the achievable ....
....of the Cooley-Tukey algorithm. As the analysis of Sections 4 and 5 will show, each of these formulations minimizes the cost due to one of these constants. 3.1 The Binary-Exchange Algorithm In the most commonly used mapping that minimizes communication for the binary exchange algorithm [25, 3, 4, 6, 11, 21, 32, 41, 36, 31], if (b_0 b_1 ... b_(r−1)) is the binary representation of i, then for all i, R[i] and S[i] are mapped to processor number (b_0 ... b_(d−1)). With this mapping, processors need to communicate with each other in the first d iterations of the main loop (starting at line 3) of the algorithm. For the ....
S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1989.
....to be handled by the routers, and can be done concurrently with the computation. The general outline of the algorithm is as follows: first A_1(x, y) is determined, which requires the sum of pixels in a 3 × 3 square, which can be done in 4 steps using parallel reduction on 3 × 3 PEs [1]. For computing any A_k(x, y), the row sums computed for A_k(x, y) and a certain number of its neighbors are used as shown in Fig. 5(b) to get a column of sums (Fig. 5(c)), which are summed to get the final A_k. These steps are depicted in Figure 5, and are described in the following ....
Akl, S.G. The Design and Analysis of Parallel Algorithms, Prentice-Hall, 1989.
....the adjacency matrix A for G is needed. The elements of A are defined as a_ij = 1 if there is a path of length zero or one from v_i to v_j, and a_ij = 0 otherwise. By repeatedly performing boolean matrix multiplication of the matrix A, C is derived in ⌈log(n − 1)⌉ matrix multiplications. Akl [1] describes an algorithm for a cube-connected SIMD computer that has complexity O(n 3 log 2 n). Using matrix multiplication, the load balancing of the algorithm would be evenly distributed. This boolean matrix multiplication is a variant of the block matrix multiplication algorithm discussed ....
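The repeated boolean squaring described in the excerpt above can be sketched sequentially. In the Python sketch below, the matrix A has a_ij true when there is a path of length zero or one from v_i to v_j, and each squaring doubles the path length covered, so ⌈log2(n − 1)⌉ squarings reach every simple path. The function names are hypothetical, and this is a plain sequential version rather than the cube-connected SIMD algorithm attributed to Akl.

```python
def bool_matmul(x, y):
    """Boolean matrix product: (x . y)[i][j] = OR over k of (x[i][k] AND y[k][j])."""
    n = len(x)
    return [[any(x[i][k] and y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transitive_closure(adj):
    """Reflexive-transitive closure of a directed graph given as a 0/1 matrix."""
    n = len(adj)
    # a[i][j] is True for paths of length zero (i == j) or one (an edge).
    a = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    # ceil(log2(n - 1)) computed with integers; 0 squarings when n <= 2.
    squarings = max(n - 2, 0).bit_length()
    for _ in range(squarings):
        a = bool_matmul(a, a)
    return a
```

On the chain v0 -> v1 -> v2 -> v3, two squarings suffice to mark v3 as reachable from v0.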
....program communication. Perhaps another asset based on the round table or board room meeting analogy should be developed. The pre-compiler output did not need further modification to run correctly. 4.4 Alpha-Beta Search The following is a summary of the discussion found in Nilsson [22] and Akl [1] on alpha-beta search of combinatorial spaces. A min-max tree can represent the state space of a game. The state of the game is represented by a node, while the arc connecting two nodes represents a legal move. That is, the possible combinations of legal moves that lead from a given state of ....
[Article contains additional citation context not shown here]
Selim G. Akl. The Design and Analysis of Parallel Algorithms. Prentice Hall, Englewood Cliffs, NJ., 1989.
....we address of determining when all message activity has ceased. 2.4 Convergence Checking As shown in [11], noncommittal synchronization also arises in numerical contexts, when convergence determines termination. A process whose subdomain has converged enters the noncommittal barrier, but can receive a message containing boundary interface values from a neighboring process whose subdomain has not. [Figure 1: Balanced tree created by splitting sets of process ids.] This ....
....determining when all message activity has ceased. 2.4 Convergence Checking As shown in [11], noncommittal synchronization also arises in numerical contexts, when convergence determines termination. A process whose subdomain has converged enters the noncommittal barrier, but can receive a message containing boundary interface values from a neighboring process whose subdomain has not. [Figure 1: Balanced tree created by splitting sets of process ids.] This unblocks ....
S.G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1989.
....participants are working full time on computer science, and they have already had an advanced Logo programming course the previous year. The algorithms course is taught by Selim Akl, a professor at Queen's University in Ontario and the author of three texts in the area of algorithms [Akl, 1985; Akl, 1989; Akl, 1992]. The introduction to Pascal has been taught by Michael Levy, professor at the University of Victoria, and by Darrell Turnidge, an Assistant Dean of Arts and Sciences at Kent State University and an associate director of IFSMACSE. Each summer's programming course runs in parallel with ....
Akl, Selim. The Design and Analysis of Parallel Algorithms, Prentice Hall, 1989.
....C(W) depends only on the parallel algorithm, and is independent of the architecture. For example, for multiplying two $N \times N$ matrices using Fox's parallel matrix multiplication algorithm [37], $W = N^3$ and $C(W) = N^2 = W^{2/3}$. It is easily seen that if the processor-time product [5] is $\Theta(W)$ (i.e., the algorithm is cost optimal), then $C(W) = O(W)$. Maximum Number of Processors Usable, $p_{max}$: The number of processors that yield maximum speedup $S_{max}$ for a given W. This is the maximum number of processors one would like to use because using more processors will not ....
....Figure 3.1: The Cooley-Tukey algorithm for single-dimensional unordered FFT. The Binary Exchange Algorithm. In the most commonly used mapping that minimizes communication for the binary exchange algorithm [81, 5, 11, 17, 29, 72, 106, 133, 116, 94], if $(b_0 b_1 \ldots b_{r-1})$ is the binary representation of i, then for all i, R[i] and S[i] are mapped to processor number $(b_0 \ldots b_{d-1})$. With this mapping, processors need to communicate with each other in the first d iterations of the main loop (starting at line 3) of the algorithm. ....
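The mapping in this excerpt can be checked with a short sketch. It assumes that $b_0$ is the most significant bit of the r-bit index i and that iteration l of the main loop pairs element i with the element whose bit $b_l$ is flipped; the second assumption is the usual property of the unordered Cooley-Tukey FFT and is not visible in the truncated excerpt. The function names are hypothetical.

```python
def processor_for(i, r, d):
    """Processor number (b_0 ... b_{d-1}): the d most significant bits of the r-bit index i."""
    return i >> (r - d)


def needs_communication(i, l, r, d):
    """True if element i and its iteration-l partner live on different processors.

    Assumption: in iteration l (0-based) the partner of i differs from i
    only in bit b_l, i.e. the bit of weight 2**(r - 1 - l).
    """
    partner = i ^ (1 << (r - 1 - l))
    return processor_for(i, r, d) != processor_for(partner, r, d)
```

With r = 4 and d = 2 (16 elements on 4 processors), only iterations 0 and 1 cross processor boundaries, which agrees with the statement that communication is needed in the first d iterations.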
S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1989.
....making each physical processor emulate N/p processors. If context-switching costs are ignored, such a technique will result in a slowdown of at most a factor of N/p over the original parallel algorithm. If the original parallel algorithm is cost optimal (i.e., if the processor-time product [2] of the original parallel algorithm is the same as the sequential time complexity of the best sequential algorithm), then this technique works well; the resulting scaled-down parallel algorithm still has the same processor-time product. Thus, one could choose the parallel algorithm with the best ....
....were used as the sequential sort. ponents in the recursive steps, communication overheads, etc., the problem size must grow at least exponentially to mask the effect of the large sequential component. The reader should note that there are other parallel formulations of quicksort on PRAM [2, 10, 5] which can be shown to have much better scalability. 5 More Scalable Formulations of Quicksort. Here we present a new parallel quicksort algorithm and some of its variations and show that all of them are more scalable on a mesh than the naive parallel quicksort (on a PRAM). All time complexity ....
S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall, 1989.
....like BBN Butterfly 4 can be done using the same model as the Cube; however, the proportionality constants are much smaller for BBN Butterfly compared to a Cube such as Intel iPSC/2. 4.3 Shared Memory Parallel Architectures. There are many different models of shared memory parallel processors [2]. The one we consider here is the CREW (concurrent read, exclusive write) PRAM model [2]. Memory latency for both reads and writes, when they are allowed, is uniform for any location in memory. Concurrent reads are allowed but only one write to a given location can take place at one time. ....
....constants are much smaller for BBN Butterfly compared to a Cube such as Intel iPSC/2. 4.3 Shared Memory Parallel Architectures. There are many different models of shared memory parallel processors [2]. The one we consider here is the CREW (concurrent read, exclusive write) PRAM model [2]. Memory latency for both reads and writes, when they are allowed, is uniform for any location in memory. Concurrent reads are allowed but only one write to a given location can take place at one time. Shared Memory parallel processor is abbreviated as SM in the rest of the paper. 5 Sequential ....
S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall, 1989.
....or write to any variable. There is also a further classification of the CRCW PRAM model based on a writing conflict resolution strategy which specifies what is written when more than one processor writes to a particular variable on a given step. For more details regarding this classification see [11, 4, 1]. 3 Module parallel computer. A module parallel computer (MPC) consists of n synchronously working RAM processors, each of which has an associated memory module [9]. A memory module is a collection of variables. Every processor may access every memory module via a fully connected network ....
....through all its links at the same time. The SN9500 contains five C104s and up to 32 fully connected T9000s (Fig. 5) [7]. Each data link of each T9000 is connected to one of the C104 routing devices. Except for two of the T9000s, data link 0 of each T9000 is connected to C104[0], link 1 to C104[1], etc. This means that every T9000 is connected to every other T9000 via only one C104. The data links of the two T9000s and of the interface card are connected to the fifth C104, which in turn is connected via four pairs of its data links to each of the other routing devices. 6 PRAM simulators ....
Akl, S.G., The Design and Analysis of Parallel Algorithms, Prentice-Hall, Englewood Cliffs, N.J., (1989).
....then quicksort(i,r,a) END PROCEDURE The routine COMPARE returns SMALLER if the value of the first argument is smaller than the second argument. Figure 5: The sequential quicksort algorithm. 2.1.2 Parallel Mergesort Algorithm. The parallel sorting algorithm is based on the technique described in [1, 2] under Merge-Splitting Sort. This algorithm is based on the Odd-Even Transposition Sort but takes into account that not every element can be stored on a separate processor. This is a very realistic restriction since MIMD machines are built only with a medium number of processing elements. An ....
....the data elements from one processor is done in $O(n/p)$ steps. A mergesort of two lists requires at most $2n/p$ steps. Thus, the two steps are done in $O(n/p)$ steps. Since they are repeated $p/2$ times, the total running time is $t(n, p) = O((n/p) \log(n/p)) + O(n)$. In contrast to [1, 2], the following mergesort algorithms are used in the hope to improve the average ....
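The merge-splitting idea referred to in these excerpts can be sketched sequentially as follows. This is an illustrative simulation under stated assumptions, not the citing paper's implementation: each inner list stands for the block held by one processor, all blocks are assumed to have the same length n/p, and the name `merge_split_sort` is hypothetical.

```python
import heapq


def merge_split_sort(blocks):
    """Simulate merge-splitting sort on p equally sized blocks.

    Each block is first sorted locally. Then, in alternating even and odd
    steps, neighbouring blocks are merged and split again: the left
    neighbour keeps the smaller half, the right neighbour the larger half.
    """
    blocks = [sorted(b) for b in blocks]  # local sort: O((n/p) log(n/p)) per processor
    p = len(blocks)
    for _ in range((p + 1) // 2):  # about p/2 repetitions of the two steps
        for start in (0, 1):  # even step, then odd step
            for left in range(start, p - 1, 2):
                merged = list(heapq.merge(blocks[left], blocks[left + 1]))
                k = len(blocks[left])
                blocks[left], blocks[left + 1] = merged[:k], merged[k:]
    return blocks
```

In a real parallel setting, all pairs of one step would exchange and merge their blocks at the same time; the inner loop here only simulates that step by step.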
Akl, S. G. The Design and Analysis of Parallel Algorithms. Prentice Hall, New Jersey, 1989.
....standard deviation is relative to the mean. This can cause a slowdown on an Ethernet-based network of workstations (NOW) used for the parallel processing of such applications, for the following reasons. Suppose, for example, we are using a message-passing paradigm in a parallel root-finding program [2]. Here, a function is known to have a single root somewhere in a given interval, which the program finds (to the desired level of accuracy) in a parallel iterative procedure. In any given iteration, the current interval to be searched is divided into m subintervals, where m is the total number ....
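The iterative subdivision described in this excerpt might look roughly like the sketch below. It runs sequentially and only marks where the m evaluations could be handed to separate processes. It assumes a continuous function whose values at the two interval ends have opposite signs; the name `parallel_root_search` and the `tol` parameter are hypothetical.

```python
def parallel_root_search(f, lo, hi, m, tol=1e-9):
    """Narrow [lo, hi] around the single root of f by m-way subdivision.

    In each iteration the interval is cut into m subintervals and the one
    whose endpoints show a sign change becomes the next search interval.
    """
    if m < 2:
        raise ValueError("m must be at least 2 so that the interval shrinks")
    while hi - lo > tol:
        width = (hi - lo) / m
        points = [lo + k * width for k in range(m)] + [hi]
        # In a message-passing program, each process would evaluate one point.
        values = [f(x) for x in points]
        for k in range(m):
            if values[k] == 0.0:
                return points[k]
            if values[k] * values[k + 1] <= 0:
                lo, hi = points[k], points[k + 1]
                break
        else:
            raise ValueError("no sign change found in the interval")
    return (lo + hi) / 2
```

Each iteration shrinks the interval by a factor of m, so a larger m means fewer iterations but more function evaluations (and, in a distributed setting, more messages) per iteration.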
S.G. Akl. "The Design and Analysis of Parallel Algorithms", Prentice Hall, Inc, 1989.
....sorted can be extended to $n^c$ for arbitrary c; in this case the complexity of our algorithm is O(c) on a reconfigurable mesh of size $n \times n$. 5.2. Representation conversion for graphs and trees. A most desirable representation of a directed graph is the well-known adjacency list representation [11]. However, as it turns out, this is not always the input format of the graph, particularly when the graph is produced in an intermediate step of some computation. Typically, a directed graph is specified by giving a list of nodes along with a list of directed edges (i.e. ordered pairs of nodes of ....
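To make the two representations concrete, the following sketch converts a node list plus a list of directed edges into adjacency lists. It is a plain sequential conversion, not the reconfigurable-mesh algorithm of the citing paper, and the function name `adjacency_lists` is hypothetical.

```python
def adjacency_lists(nodes, edges):
    """Build adjacency lists from a node list and ordered pairs (u, v) meaning u -> v."""
    adj = {v: [] for v in nodes}  # every node gets an entry, even without outgoing edges
    for u, v in edges:
        adj[u].append(v)
    return adj
```

For example, the nodes `["a", "b", "c"]` with edges `[("a", "b"), ("a", "c"), ("c", "a")]` yield `{"a": ["b", "c"], "b": [], "c": ["a"]}`.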
S. G. Akl, The design and analysis of parallel algorithms, Prentice-Hall, 1989.
....3.2 Communication and Synchronization Overhead. The conversion of MEs to LEs in steps 3 and 4 requires communication among processors, and this can be accomplished by using either broadcast or a butterfly switch. Broadcast is easy to implement but expensive. The butterfly switch communication scheme [1], as shown in Figure 4, is better than broadcast and has time complexity of $O(\log_2 P)$, where P is the number of processors. However, it requires synchronization among processors at each level, which is expensive when the number of processors is large. To minimize the communication and ....
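The logarithmic depth of a butterfly exchange can be seen in a small simulation. The sketch assumes that P is a power of two, that at level s processor i exchanges with processor i XOR 2^s, and that the `combine` operation is associative and commutative; the name `butterfly_allreduce` is hypothetical and the citing paper's actual scheme (its Figure 4) is not shown in the excerpt.

```python
def butterfly_allreduce(values, combine):
    """Combine one value per processor in log2(P) butterfly levels.

    After the last level every processor holds the combination of all
    P inputs. Each pass of the loop corresponds to one synchronized level.
    """
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("P must be a power of two")
    current = list(values)
    stage = 1
    while stage < p:
        # Processor i pairs with processor i XOR stage at this level.
        current = [combine(current[i], current[i ^ stage]) for i in range(p)]
        stage <<= 1
    return current
```

With four processors holding 1, 2, 3 and 4 and addition as the combining operation, two levels suffice and every processor ends up with 10.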
S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall, 1989.
....handled via explicit calls to message passing directives. This concept of independent instruction execution, when combined with the distribution of different data to each processor, is defined by another model that governs this implementation, the Multiple Instruction, Multiple Data (MIMD) model [1]. With this conceptual framework established, the rest of this section will describe the implementation of the primary functions of the parallel refactorization code. This discussion will begin with a description of the primary data objects used by the parallel refactorization code followed by an ....
S. G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1989.
No context found.
Selim G. Akl, The Design and Analysis of Parallel Algorithms, Prentice Hall, New Jersey, 1989
No context found.
Akl, S.G.: Design and analysis of parallel algorithms. Prentice Hall (1989) 148-150
No context found.
S.G. Akl, The Design and Analysis of Parallel Algorithms. Englewood Cliffs, N.J.: Prentice Hall, 1989.
No context found.
Akl, S.G. Design and Analysis of Parallel Algorithms, Prentice-Hall, 1989.
No context found.
S.G. Akl. The Design and Analysis of Parallel Algorithms. Prentice Hall, 1989.
No context found.
S.G. Akl. The Design and Analysis of Parallel Algorithms. Prentice Hall, Englewood Cliffs, New Jersey 07632, 1989.
No context found.
S.G. Akl. The Design and Analysis of Parallel Algorithms. Prentice-Hall, 1989.