Results 1 - 10
of
14
Models and Languages for Parallel Computation
- ACM COMPUTING SURVEYS
, 1998
"... We survey parallel programming models and languages using 6 criteria [:] should be easy to program, have a software development methodology, be architecture-independent, be easy to understand, guranatee performance, and provide info about the cost of programs. ... We consider programming models in ..."
Abstract
-
Cited by 121 (4 self)
- Add to MetaCart
We survey parallel programming models and languages using 6 criteria [:] should be easy to program, have a software development methodology, be architecture-independent, be easy to understand, guranatee performance, and provide info about the cost of programs. ... We consider programming models in 6 categories, depending on the level of abstraction they provide.
Scalable Load Balancing Techniques for Parallel Computers
, 1994
"... In this paper we analyze the scalability of a number of load balancing algorithms which can be applied to problems that have the following characteristics : the work done by a processor can be partitioned into independent work pieces; the work pieces are of highly variable sizes; and it is not po ..."
Abstract
-
Cited by 89 (16 self)
- Add to MetaCart
In this paper we analyze the scalability of a number of load balancing algorithms which can be applied to problems that have the following characteristics : the work done by a processor can be partitioned into independent work pieces; the work pieces are of highly variable sizes; and it is not possible (or very difficult) to estimate the size of total work at a given processor. Such problems require a load balancing scheme that distributes the work dynamically among different processors. Our goal here is to determine the most scalable load balancing schemes for different architectures such as hypercube, mesh and network of workstations. For each of these architectures, we establish lower bounds on the scalability of any possible load balancing scheme. We present the scalability analysis of a number of load balancing schemes that have not been analyzed before. This gives us valuable insights into their relative performance for different problem and architectural characteristi...
An Empirical Comparison of the Kendall Square Research KSR-1 and Stanford DASH Multiprocessors
, 1993
"... Two interesting variants of large-scale shared-addressspace parallel architectures are cache-coherent non-uniformmemory -access machines (CC-NUMA) and cache-only memory architectures (COMA). Both have distributed main memory and use directory-based cache coherence. While both architectures migrate a ..."
Abstract
-
Cited by 26 (3 self)
- Add to MetaCart
Two interesting variants of large-scale shared-addressspace parallel architectures are cache-coherent non-uniformmemory -access machines (CC-NUMA) and cache-only memory architectures (COMA). Both have distributed main memory and use directory-based cache coherence. While both architectures migrate and replicate data at the cache level automatically under hardware control, COMA machines do this at the main memory level as well. Previous work had discussed the general advantages and disadvantages of the two types of architectures, and presented results comparing the performance of small problems on simulated architectures of the two types. In this paper, we compare the parallel performance of a recent realization of each type of architecture---the Stanford DASH multiprocessor (CC-NUMA) and the Kendall Square Research KSR-1 (COMA). Using a suite of important computational kernels and complete scientific applications, we examine performance differences resulting both from the CCNUMA /COMA ...
A Combining Mechanism for Parallel Computers
- IN PROCEEDINGS OF THE FIRST HEINZ NIXDORF SYMPOSIUM
, 1992
"... In a multiprocessor computer communication among the components may be based either on a simple router, which delivers messages point-to-point like a mail service, or on a more elaborate combining network that, in return for a greater investment in hardware, can combine messages to the same addre ..."
Abstract
-
Cited by 22 (0 self)
- Add to MetaCart
In a multiprocessor computer communication among the components may be based either on a simple router, which delivers messages point-to-point like a mail service, or on a more elaborate combining network that, in return for a greater investment in hardware, can combine messages to the same address prior to delivery. This paper describes a mechanism for recirculating messages in a simple router so that the added functionality of a combining network, for arbitrary access patterns, can be achieved by it with provable efficiency. The method brings together the messages with the same destination address in more than one stage, and at a set of components that is determined by a hash function and decreases in number at each stage.
Parallel Processing of Discrete Optimization Problems
- IN ENCYCLOPEDIA OF MICROCOMPUTERS
, 1993
"... Discrete optimization problems (DOPs) arise in various applications such as planning, scheduling, computer aided design, robotics, game playing and constraint directed reasoning. Often, a DOP is formulated in terms of finding a (minimum cost) solution path in a graph from an initial node to a goa ..."
Abstract
-
Cited by 19 (6 self)
- Add to MetaCart
Discrete optimization problems (DOPs) arise in various applications such as planning, scheduling, computer aided design, robotics, game playing and constraint directed reasoning. Often, a DOP is formulated in terms of finding a (minimum cost) solution path in a graph from an initial node to a goal node and solved by graph/tree search methods such as branch-and-bound and dynamic programming. Availability of parallel computers has created substantial interest in exploring the use of parallel processing for solving discrete optimization problems. This article provides an overview of parallel search algorithms for solving discrete optimization problems.
Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers
- In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture MICRO 39
, 2006
"... We examine the ability of CMPs, due to their lower onchip communication latencies, to exploit data parallelism at inner-loop granularities similar to that commonly targeted by vector machines. Parallelizing code in this manner leads to a high frequency of barriers, and we explore the impact of diffe ..."
Abstract
-
Cited by 9 (0 self)
- Add to MetaCart
We examine the ability of CMPs, due to their lower onchip communication latencies, to exploit data parallelism at inner-loop granularities similar to that commonly targeted by vector machines. Parallelizing code in this manner leads to a high frequency of barriers, and we explore the impact of different barrier mechanisms upon the efficiency of this approach. To further exploit the potential of CMPs for fine-grained data parallel tasks, we present barrier filters, a mechanism for fast barrier synchronization on chip multi-processors to enable vector computations to be efficiently distributed across the cores of a CMP. We ensure that all threads arriving at a barrier require an unavailable cache line to proceed, and, by placing additional hardware in the shared portions of the memory subsytem, we starve their requests until they all have arrived. Specifically, our approach uses invalidation requests to both make cache lines unavailable and identify when a thread has reached the barrier. We examine two types of barrier filters, one synchronizing through instruction cache lines, and the other through data cache lines. 1.
Architectural Support for Parallel Reductions
- in Scalable Shared-Memory Multiprocessors,” Proc. 10th Int’l Conf. Parallel Architectures and Compilation Techniques
, 2001
"... Reductions are important and time-consuming operations in many scientific codes. Effective parallelization of reductions is a critical transformation for loop parallelization, especially for sparse, dynamic applications. Unfortunately, conventional reduction parallelization algorithms are not scalab ..."
Abstract
-
Cited by 8 (2 self)
- Add to MetaCart
Reductions are important and time-consuming operations in many scientific codes. Effective parallelization of reductions is a critical transformation for loop parallelization, especially for sparse, dynamic applications. Unfortunately, conventional reduction parallelization algorithms are not scalable. In this paper, we present new architectural support that significantly speeds-up parallel reduction and makes it scalable in shared-memory multiprocessors. The required architectural changes are mostly confined to the directory controllers. Experimental results based on simulations show that the proposed support is very effective. While conventional software-only reduction parallelization delivers average speedups of only 2.7 for 16 processors, our scheme delivers average speedups of 7.6. 1
Using Simple Abstraction to Guide the Reinvention of Computing for Parallelism
"... The sudden shift from single-processor computer systems to many-processor parallel ones requires reinventing much of Computer Science (CS): how to actually build and program the new parallel systems. CS urgently requires convergence to a robust parallel general-purpose platform that provides good pe ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
The sudden shift from single-processor computer systems to many-processor parallel ones requires reinventing much of Computer Science (CS): how to actually build and program the new parallel systems. CS urgently requires convergence to a robust parallel general-purpose platform that provides good performance and is easy enough to program by at least all CS majors. Unfortunately, lesser ease-ofprogramming objectives have eluded decades of parallel computing research. The idea of starting with an established easy parallel programming model and build an architecture for it has been treated as radical by vendors. This article advocates a more radical idea. Start with a minimalist stepping-stone: a simple abstraction that encapsulates the desired interface between programmers and system builders.
Representing Control in Parallel Applicative Programming
, 1994
"... This research is an attempt to reason about the control of parallel computation in the world of applicative programming languages. Applicative languages, in which computation is performed through function application and in which functions are treated as first-class objects, have the benefits of ele ..."
Abstract
- Add to MetaCart
This research is an attempt to reason about the control of parallel computation in the world of applicative programming languages. Applicative languages, in which computation is performed through function application and in which functions are treated as first-class objects, have the benefits of elegance, expressiveness and having clean semantics. Parallel computation and real-world concurrent activities are much harder to reason about than the sequential counterparts. Many parallel applicative languages have thus hidden most control details with their declarative programming styles, but they are not expressive enough to characterize many real world concurrent activities that can be easily explained with concepts such as message passing, pipelining and so on. Ease of programming should not come at the expense of expressiveness. Therefore, we design a parallel applicative language Pscheme such that programmers can express explicitly the control of parallel computation while maintaining ...
Object-Oriented Parallel Parsing for Context-lee Grammars
- In Proc. of COLING'88
, 1988
"... This paper describes a new parallel parsing scheme for contextkce grammars and our experience of implementing this scheme, and it also reports the result of our simulation for running the parsing program on a massive parallel processor. ..."
Abstract
- Add to MetaCart
This paper describes a new parallel parsing scheme for contextkce grammars and our experience of implementing this scheme, and it also reports the result of our simulation for running the parsing program on a massive parallel processor.

