Will Physical Scalability Sabotage Performance Gains?
, 1997
Cited by 88 (1 self)
Many designers expect processor performance to keep improving at the current rate indefinitely as feature sizes shrink. However, as wire delays become a larger percentage of overall signal delay and as clock speeds grow faster than transistor speed, I believe performance increases will ultimately fall off. These delays are inevitable simply because wires are not keeping pace with the scaling of other features. In fact, for CMOS processes below 0.25 micron, the physical limits of wire scaling may begin to change high-speed processor design. That is, an unacceptably small percentage of the die will be reachable during a single clock cycle. To support my prediction, I have mapped trends in a metric that relates time and distance and projections in clock speed across eight processor generations, from 0.6 to 0.06 micron. During this span (probably at 0.1 micron), we'll see a billion-transistor processor. To illustrate how physical scalability could affect the design of processors on this scale, I also compared signal drive distance and clock speed for the span endpoints, 0.6 and 0.06 micron.
BSP-Like External-Memory Computation
 IN PROC. 3RD ITALIAN CONFERENCE ON ALGORITHMS AND COMPLEXITY
Cited by 24 (0 self)
In this paper we present a paradigm for solving external-memory problems, and illustrate it by algorithms for matrix multiplication, sorting, list ranking, transitive closure and FFT. Our paradigm is based on the use of BSP algorithms. The correspondence is almost perfect, and especially the notion of xoptimality carries over to algorithms designed according to our paradigm. The advantages of the approach are similar to the advantages of BSP algorithms for parallel computing: scalability, portability, predictability. The performance measure here is the total work, not only the number of I/O operations as in previous approaches. The predicted performances are therefore more useful for practical applications.
Bandwidth-Optimal Complete Exchange on Wormhole-Routed 2D/3D Torus Networks: A Diagonal-Propagation Approach
, 1997
Cited by 20 (5 self)
All-to-all personalized communication, or complete exchange, is at the heart of numerous applications in parallel computing. Several complete exchange algorithms have been proposed in the literature for wormhole meshes. However, these algorithms, when applied to tori, cannot take advantage of wraparound interconnections to implement complete exchange with reduced latency. In this paper, a new diagonal-propagation approach is proposed to develop a set of complete exchange algorithms for 2D and 3D tori. This approach exploits the symmetric interconnections of tori and allows the development of a communication schedule consisting of several contention-free phases. These algorithms are indirect in nature and they use message combining to reduce the number of phases (message startups). It is shown that these algorithms effectively use the bisection bandwidth of a torus, which is twice that of an equal-sized mesh, to achieve complete exchange in time which is almost half of the best known complet...
Implementing the Hierarchical PRAM on the 2D Mesh: Analyses and Experiments
, 1995
Cited by 12 (2 self)
We investigate aspects of the performance of the EREW instance of the Hierarchical PRAM (HPRAM) model, a recursively partitionable PRAM, on the 2D mesh architecture via analysis and simulation experiments. Since one of the ideas behind the HPRAM is to systematically exploit locality in order to negate the need for expensive communication hardware and thus promote cost-effective scalability, our design decisions are based on minimizing implementation costs. The Peano indexing scheme is used as a simple and natural means of allowing the dynamic, recursive partitioning of the mesh into arbitrarily sized submeshes, as required by the HPRAM. We show that for any submesh the ratio of the largest Manhattan distance between two nodes of the submesh to that of the square mesh with an identical number of processors is at most 3/2, thereby demonstrating the locality-preserving properties of the Peano scheme for arbitrary partitions of the mesh. We provide matching analytical and experimenta...
Processor-Time Tradeoffs under Bounded-Speed Message Propagation: Part I, Upper Bounds
 Theory of Computing Systems
, 1995
Cited by 12 (3 self)
Upper bounds are derived for the processor-time tradeoffs of machines such as linear arrays and two-dimensional meshes, which are compatible with the physical limitation expressed by bounded-speed propagation of messages (due to the finiteness of the speed of light). It is shown that parallelism and locality combined may yield speedups superlinear in the number of processors. The speedups are inherent, due to the optimality of the obtained tradeoffs as established in a companion paper. Simulations are developed of multiprocessor machines by analogous machines with fewer processors. A crucial role is played by the hierarchical nature of the memory system. A divide-and-conquer technique for hierarchical memories is developed, based on the graph-theoretic notion of topological separator. For multiprocessors, this technique also requires a careful balance of memory access and interprocessor communication costs, which leads to nonintuitive orchestrations of the simulation process. Dipart...
Integer Sorting and routing in arrays with reconfigurable optical buses
 Proceedings of International Conference of Parallel Processing
, 1996
Incomplete k-ary n-cube and Its Derivatives
 J. Parallel and Distributed Computing
, 2004
Cited by 10 (8 self)
Incomplete or pruned k-ary n-cube, $n \ge 3$, is derived as follows. All links of dimension $n-1$ are left in place and links of the remaining $n-1$ dimensions are removed, except for one, which is chosen periodically from the remaining dimensions along the intact dimension $n-1$. This leads to a node degree of 4 instead of the original $2n$ and results in regular networks that are Cayley graphs, provided that $n-1$ divides $k$. For $n = 3$ ($n = 5$), the preceding restriction is not problematic, as it only requires that $k$ be even (a multiple of 4). In other cases, changes to the basis network to be pruned, or to the pruning algorithm, can mitigate the problem. Incomplete k-ary n-cube maintains a number of desirable topological properties of its unpruned counterpart despite having fewer links. It is maximally connected, has diameter and fault diameter very close to those of k-ary n-cube, and an average internode distance that is only slightly greater. Hence, the cost/performance tradeoffs offered by our pruning scheme can in fact lead to useful, and practically realizable, parallel architectures. We study pruned k-ary n-cubes in general and offer some additional results for the special case $n = 3$.
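The pruning described in this abstract can be illustrated with a small sketch. The abstract does not spell out how the surviving dimension is "chosen periodically", so the rule used here is an assumption: a node whose coordinate along the intact dimension n-1 is c keeps the links of dimension c mod (n-1). The function names `pruned_kary_ncube` and `is_connected` are hypothetical and not taken from the paper.

```python
from collections import deque
from itertools import product


def pruned_kary_ncube(k, n):
    """Return adjacency sets of a pruned k-ary n-cube.

    Assumed pruning rule (not confirmed by the abstract): every node keeps
    its links in dimension n-1, plus the links of the single dimension
    (x[n-1] mod (n-1)), where x[n-1] is its coordinate along dimension n-1.
    """
    if n < 3 or k % (n - 1) != 0:
        raise ValueError("this sketch needs n >= 3 and (n - 1) dividing k")
    adj = {node: set() for node in product(range(k), repeat=n)}
    for node in adj:
        kept_dims = (n - 1, node[n - 1] % (n - 1))
        for dim in kept_dims:
            for step in (1, -1):
                neighbor = list(node)
                neighbor[dim] = (neighbor[dim] + step) % k  # torus wraparound
                adj[node].add(tuple(neighbor))
    return adj


def is_connected(adj):
    """Breadth-first search from an arbitrary node; True if all nodes are reached."""
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        for neighbor in adj[queue.popleft()]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(adj)
```

Under this assumed rule, both endpoints of a kept link share the same coordinate along dimension n-1, so the adjacency is symmetric, and for k > 2 every node ends up with degree 4 rather than 2n.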
Lower Bounds on Processor-Time Tradeoffs under Bounded-Speed Message Propagation
, 1995
Cited by 8 (0 self)
Upper bounds are derived for the processor-time tradeoffs of machines such as linear arrays and two-dimensional meshes, which are compatible with the physical limitation expressed by bounded-speed propagation of messages (due to the finiteness of the speed of light). It is shown that parallelism and locality combined may yield speedups superlinear in the number of processors. The speedups are inherent, due to the optimality of the obtained tradeoffs as established in a companion paper. Simulations are developed of multiprocessor machines by analogous machines with fewer processors. A crucial role is played by the hierarchical nature of the memory system. A divide-and-conquer technique for hierarchical memories is developed, based on the graph-theoretic notion of topological separator. For multiprocessors, this technique also requires a careful balance of memory access and interprocessor communication costs, which leads to nonintuitive orchestrations of the simulation process.
Augmented Ring Networks
, 1999
Cited by 7 (1 self)
We study four augmentations of ring networks which are intended to enhance a ring's efficiency as a communication medium significantly, while increasing its structural complexity
On the Manhattan-Distance Between Points on Space-Filling Mesh-Indexings
, 1996
Cited by 6 (0 self)
Indexing schemes based on space-filling curves like the Hilbert curve are a powerful tool for building efficient parallel algorithms on mesh-connected computers. The main reason is that they are locality-preserving, i.e., the Manhattan distance between processors grows only slowly with increasing index differences. We present a simple and easy-to-verify proof that the Manhattan distance of any indices i and j is bounded by $3\sqrt{|i-j|} - 2$ for the 2D Hilbert curve. The technique used for the proof is then generalized for a large class of self-similar curves. We use this result to show a (quite tight) bound of $4.73458\,\sqrt[3]{|i-j|} - 3$ for a 3D Hilbert curve. 1 Introduction It has become increasingly clear that mesh-connected processor arrays, grids for short, are among the most realistic models of parallel computation [1, 4, 14, 18]. The indexing of the processors is an important aspect in the design of mesh algorithms. Several indexing schemes are well-known. Mos...
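The 2D bound quoted in this abstract can be checked numerically on small grids. The sketch below is not the authors' proof technique: it uses the widely known iterative index-to-coordinate conversion for the Hilbert curve and tests every index pair by brute force. The names `hilbert_d2xy` and `bound_holds` are hypothetical. To avoid floating-point rounding, the inequality d <= 3*sqrt(|i-j|) - 2 is tested in the equivalent integer form (d + 2)^2 <= 9*|i-j|.

```python
from itertools import combinations


def hilbert_d2xy(side, d):
    """Map Hilbert index d (0 <= d < side*side) to grid coordinates (x, y).

    side must be a power of two. This is the standard iterative conversion,
    building the position one quadrant level at a time.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:  # reflect the sub-square before transposing it
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def bound_holds(side):
    """Check (dist + 2)^2 <= 9 * |i - j| for all index pairs on a side x side grid."""
    pts = [hilbert_d2xy(side, d) for d in range(side * side)]
    return all(
        (manhattan(pts[i], pts[j]) + 2) ** 2 <= 9 * (j - i)
        for i, j in combinations(range(len(pts)), 2)
    )
```

Calling `bound_holds(16)` examines all pairs on a 16 x 16 grid; a brute-force check of this kind only covers the grid sizes tried and is no substitute for the proof.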