Removing Randomness in Parallel Computation Without a Processor Penalty
Journal of Computer and System Sciences, 1988
Cited by 49 (1 self)
We develop some general techniques for converting randomized parallel algorithms into deterministic parallel algorithms without a blowup in the number of processors. One of the requirements for the application of these techniques is that the analysis of the randomized algorithm uses only pairwise independence. Our main new result is a parallel algorithm for coloring the vertices of an undirected graph using at most $\Delta + 1$ distinct colors in such a way that no two adjacent vertices receive the same color, where $\Delta$ is the maximum degree of any vertex in the graph. The running time of the algorithm is $O(\log^3 n \log\log n)$ using a linear number of processors on a concurrent read, exclusive write (CREW) parallel random access machine (PRAM). Our techniques also apply to several other problems, including the maximal independent set problem and the maximal matching problem. The application of the general technique to these last two problems is mostly of academic interest because...
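To illustrate why $\Delta + 1$ colors always suffice, here is a minimal sequential sketch. It is a plain greedy coloring, not the parallel derandomized algorithm the abstract describes; the function names and the adjacency-dictionary input format are assumptions made for illustration.

```python
def max_degree(adj):
    """Maximum degree (Delta) of a graph given as {vertex: [neighbors]}."""
    return max((len(nbrs) for nbrs in adj.values()), default=0)


def greedy_coloring(adj):
    """Sequential greedy coloring (illustrative only, not the PRAM algorithm).

    Each vertex takes the smallest color not used by an already-colored
    neighbor. A vertex has at most Delta neighbors, so one of the colors
    0..Delta is always free, giving at most Delta + 1 colors overall.
    """
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


# Hypothetical example graph: a triangle 0-1-2 with a pendant vertex 3.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
coloring = greedy_coloring(graph)
```

The hard part addressed by the paper is achieving the same guarantee deterministically in polylogarithmic parallel time; this sketch only shows the counting argument behind the color bound.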
Planar Separators and Parallel Polygon Triangulation
1992
Cited by 46 (7 self)
We show how to construct an $O(\sqrt{n})$-separator decomposition of a planar graph $G$ in $O(n)$ time. Such a decomposition defines a binary tree where each node corresponds to a subgraph of $G$ and stores an $O(\sqrt{n})$-separator of that subgraph. We also show how to construct an $O(n^{\epsilon})$-way decomposition tree in parallel in $O(\log n)$ time so that each node corresponds to a subgraph of $G$ and stores an $O(n^{1/2+\epsilon})$-separator of that subgraph. We demonstrate the utility of such a separator decomposition by showing how it can be used in the design of a parallel algorithm for triangulating a simple polygon deterministically in $O(\log n)$ time using $O(n/\log n)$ processors on a CRCW PRAM. Keywords: Computational geometry, algorithmic graph theory, planar graphs, planar separators, polygon triangulation, parallel algorithms, PRAM model. 1 Introduction. Let $G = (V, E)$ be an $n$-node graph. An $f(n)$-separator is an $f(n)$-sized subset of $V$ whose removal disconnects $G$ into two subgraphs $G_1$ and $G_2$ each...
Parallel Algorithms for Higher-Dimensional Convex Hulls
Cited by 42 (14 self)
We give fast randomized and deterministic parallel methods for constructing convex hulls in $\mathbb{R}^d$, for any fixed $d$. Our methods are for the weakest shared-memory model, the EREW PRAM, and have optimal work bounds (with high probability for the randomized methods). In particular, we show that the convex hull of $n$ points in $\mathbb{R}^d$ can be constructed in $O(\log n)$ time using $O(n \log n + n^{\lfloor d/2 \rfloor})$ work, with high probability. We also show that it can be constructed deterministically in $O(\log^2 n)$ time using $O(n \log n)$ work for $d = 3$ and in $O(\log n)$ time using $O(n^{\lfloor d/2 \rfloor} \log^{c(\lceil d/2 \rceil - \lfloor d/2 \rfloor)} n)$ work, for $d \ge 4$, where $c > 0$ is a constant, which is optimal for even $d \ge 4$. We also show how to make our 3-dimensional methods output-sensitive with only a small increase in running time. These methods can be applied to other problems as well. A variation of the convex hull algorithm for even dimensions deterministically constructs a $(1/r)$-cutting of $n$ hyperplanes in $\mathbb{R}^d$ in $O(\log n)$ time using optimal $O(n r^{d-1})$ work; when $r = n$, we obtain their arrangement and a point location data structure for it. With appropriate modifications, our deterministic 3-dimensional convex hull algorithm can be used to compute, in the same resource bounds, the intersection of $n$ balls of equal radius in $\mathbb{R}^3$. This leads to a sequential algorithm for computing the diameter of a point set in $\mathbb{R}^3$ with running time $O(n \log^3 n)$, which is arguably simpler than an algorithm with the same running time by Brönnimann et al.
Can a Shared-Memory Model Serve as a Bridging Model for Parallel Computation?
1999
Cited by 41 (11 self)
There has been a great deal of interest recently in the development of general-purpose bridging models for parallel computation. Models such as the BSP and LogP have been proposed as more realistic alternatives to the widely used PRAM model. The BSP and LogP models imply a rather different style for designing algorithms when compared with the PRAM model. Indeed, while many consider data parallelism as a convenient style, and the shared-memory abstraction as an easy-to-use platform, the bandwidth limitations of current machines have diverted much attention to message-passing and distributed-memory models (such as the BSP and LogP) that account more properly for these limitations. In this paper we consider the question of whether a shared-memory model can serve as an effective bridging model for parallel computation. In particular, can a shared-memory model be as effective as, say, the BSP? As a candidate for a bridging model, we introduce the Queuing Shared-Memory (QSM) model, which accounts for limited communication bandwidth while still providing a simple shared-memory abstraction. We substantiate the ability of the QSM to serve as a bridging model by providing a simple work-preserving emulation of the QSM on both the BSP and a related model, the (d, x)-BSP. We present evidence that the features of the QSM are essential to its effectiveness as a bridging model. In addition, we describe scenarios
Doubly Logarithmic Communication Algorithms for Optical Communication Parallel Computers
In Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures, 1994
Cited by 38 (4 self)
In this paper we consider the problem of interprocessor communication on parallel computers that have optical communication networks. We consider the Completely Connected Optical Communication Parallel Computer (OCPC), which has a completely connected optical network, and also the Mesh of Optical Buses Parallel Computer (MOBPC), which has a mesh of optical buses as its communication network. The particular communication problem that we study is that of realizing an $h$-relation. In this problem, each processor has at most $h$ messages to send and at most $h$ messages to receive. It is clear that any 1-relation can be realized in one communication step on an OCPC. However, the best previously known $p$-processor OCPC algorithm for realizing an arbitrary $h$-relation for $h > 1$ requires $\Theta(h + \log p)$ expected communication steps. (This algorithm is due to Valiant and is based on earlier work of Anderson and Miller.) Valiant's algorithm is optimal only for $h = \Omega(\log p)$ and it is an op...
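The definition of an $h$-relation can be made concrete with a small sketch. It only checks the definition (at most $h$ messages sent and at most $h$ received per processor); it does not implement any routing algorithm from the paper. The representation of messages as `(source, destination)` pairs and the function names are assumptions made for illustration.

```python
from collections import Counter


def relation_degree(messages):
    """Smallest h for which `messages` forms an h-relation.

    `messages` is a list of (source, destination) processor pairs
    (a hypothetical representation chosen for this sketch).
    """
    sent = Counter(src for src, _ in messages)
    received = Counter(dst for _, dst in messages)
    return max(
        max(sent.values(), default=0),
        max(received.values(), default=0),
    )


def is_h_relation(messages, h):
    """True if every processor sends and receives at most h messages."""
    return relation_degree(messages) <= h


# A permutation is a 1-relation: each processor sends one and receives one.
permutation = [(0, 1), (1, 2), (2, 0)]
# Two processors targeting processor 2 make this a 2-relation.
contended = [(0, 2), (1, 2)]
```

In the contended case two messages share a destination, which is the kind of conflict that prevents realizing relations with $h > 1$ in a single communication step.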
Improved Parallel Integer Sorting without Concurrent Writing
1992
Cited by 38 (4 self)
We show that $n$ integers in the range $1..n$ can be sorted stably on an EREW PRAM using $O(t)$ time and $O(n(\sqrt{\log n \log\log n} + (\log n)^2/t))$ operations, for arbitrary given $t \ge \log n \log\log n$, and on a CREW PRAM using $O(t)$ time and $O(n(\sqrt{\log n} + \log n / 2^{t/\log n}))$ operations, for arbitrary given $t \ge \log n$. In addition, we are able to sort $n$ arbitrary integers on a randomized CREW PRAM within the same resource bounds with high probability. In each case our algorithm is a factor of almost $\Theta(\sqrt{\log n})$ closer to optimality than all previous algorithms for the stated problem in the stated model, and our third result matches the operation count of the best previous sequential algorithm. We also show that $n$ integers in the range $1..m$ can be sorted in $O((\log n)^2)$ time with $O(n)$ operations on an EREW PRAM using a nonstandard word length of $O(\log n \log\log n \log m)$ bits, thereby greatly improving the upper bound on the word length necessary to sort integers with a linear t...
Designing Efficient Sorting Algorithms for Manycore GPUs
2009
Cited by 34 (2 self)
We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix sort is up to 4 times faster than the graphics-based GPUSort and more than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, we carefully design our algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. We exploit the high-speed on-chip shared memory provided by NVIDIA's GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well-suited for other manycore processors.
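The role of scan in radix sorting can be sketched as follows. This is a sequential Python simulation of the textbook one-bit-per-pass "split" formulation, not the paper's CUDA implementation; the function names and the `num_bits` parameter are assumptions made for illustration, and keys are assumed to be non-negative integers below `2**num_bits`.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] = sum(values[:i])."""
    if not values:
        return []
    return [0] + list(accumulate(values))[:-1]


def split_by_bit(keys, bit):
    """Stable partition of keys by one bit, with positions computed by a scan.

    On a data-parallel machine each loop body below would run as one
    thread per element; here the steps are simulated sequentially.
    """
    flags = [1 - ((k >> bit) & 1) for k in keys]  # 1 where the bit is 0
    zeros_before = exclusive_scan(flags)          # rank among zero-bit keys
    total_zeros = sum(flags)
    out = [None] * len(keys)
    for i, k in enumerate(keys):
        if flags[i]:
            out[zeros_before[i]] = k
        else:
            ones_before = i - zeros_before[i]     # rank among one-bit keys
            out[total_zeros + ones_before] = k
    return out


def radix_sort(keys, num_bits=32):
    """LSD radix sort: one stable split per bit, least significant first."""
    for bit in range(num_bits):
        keys = split_by_bit(keys, bit)
    return keys


sorted_keys = radix_sort([5, 3, 8, 1, 3, 0], num_bits=4)
```

Because every output position comes from a prefix sum rather than from sequential insertion, each element's destination can be computed independently, which is what makes the scan primitive a natural fit for fine-grained parallel hardware.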
On the Cost-Effectiveness of PRAMs
1991
Cited by 33 (12 self)
We introduce a formalism that allows computer architecture to be treated as a formal optimization problem. We apply this to the design of shared memory parallel machines. Present computers of this type support the programming model of a shared memory, but simultaneous access to the shared memory by several processors is in many situations processed sequentially. Asymptotically good solutions for this problem are offered by theoretical computer science. We modify these constructions under engineering aspects and improve the price/performance ratio by roughly a factor of 6. The resulting machine has a surprisingly good price/performance ratio even when compared with distributed memory machines. For almost all access patterns of all processors into the shared memory, access is as fast as the access of only a single processor. 1 Introduction. Commercially available parallel machines can be classified as distributed memory machines or shared memory machines. Exchange of data between different proce...
A New Parallel Algorithm For The Maximal Independent Set Problem
1989
Cited by 33 (2 self)
A new parallel algorithm for the maximal independent set problem is constructed. It runs in $O(\log^4 n)$ time when implemented on a linear number of EREW processors. This is the first deterministic algorithm for the maximal independent set problem (MIS) whose running time is polylogarithmic and whose processor-time product is optimal up to a polylogarithmic factor.
Efficient Piecewise-Linear Function Approximation Using the Uniform Metric
Discrete & Computational Geometry, 1994
Cited by 33 (0 self)
We give an $O(n \log n)$-time method for finding a best $k$-link piecewise-linear function approximating an $n$-point planar data set using the well-known uniform metric to measure the error, $\epsilon \ge 0$, of the approximation. Our method is based upon new characterizations of such functions, which we exploit to design an efficient algorithm using a plane sweep in "$\epsilon$ space" followed by several applications of the parametric searching technique. The previous best running time for this problem was $O(n^2)$. 1 Introduction. Approximating a set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ of points in the plane by a function is a classic problem in applied mathematics. The general goals in this area of research are to find a function $F$ belonging to a class of functions $\mathcal{F}$ such that each $F \in \mathcal{F}$ is simple to describe, represent, and compute and such that the chosen $F$ approximates $S$ well. For example, one may desire that $\mathcal{F}$ be the class of linear or piecewise-linear functions, and, for any parti...

