MagPIe: MPI’s Collective Communication Operations for Clustered Wide Area Systems
- Proc PPoPP'99
, 1999
Cited by 138 (26 self)
Writing parallel applications for computational grids is a challenging task. To achieve good performance, algorithms designed for local area networks must be adapted to the differences in link speeds. An important class of algorithms are collective operations, such as broadcast and reduce. We have developed MAGPIE, a library of collective communication operations optimized for wide area systems. MAGPIE's algorithms send the minimal amount of data over the slow wide area links, and only incur a single wide area latency. Using our system, existing MPI applications can be run unmodified on geographically distributed systems. On moderate cluster sizes, using a wide area latency of 10 milliseconds and a bandwidth of 1 MByte/s, MAGPIE executes operations up to 10 times faster than MPICH, a widely used MPI implementation; application kernels improve by up to a factor of 4. Due to the structure of our algorithms, MAGPIE's advantage increases for higher wide area latencies.
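The claim that a cluster-aware broadcast incurs only a single wide area latency can be illustrated with a toy, latency-only model. The sketch below is not MAGPIE's implementation: the function names, the contiguous rank-to-cluster placement, and the assumption that a node's sends cost nothing beyond link latency are all illustrative choices.

```python
def cluster_ids(sizes):
    """Map contiguous ranks to cluster indices, e.g. [2, 1] -> [0, 0, 1]."""
    return [c for c, n in enumerate(sizes) for _ in range(n)]


def flat_binomial_time(sizes, lan_us, wan_us):
    """Completion time of a topology-unaware binomial-tree broadcast from rank 0.

    Latency-only model (an assumption): a message arrives at a rank one link
    latency after it arrived at that rank's parent in the binomial tree.
    """
    ids = cluster_ids(sizes)
    arrival = [0] * len(ids)
    for rank in range(1, len(ids)):
        # Parent in the binomial tree: clear the highest set bit of the rank.
        parent = rank - (1 << (rank.bit_length() - 1))
        hop = lan_us if ids[parent] == ids[rank] else wan_us
        arrival[rank] = arrival[parent] + hop
    return max(arrival)


def cluster_aware_time(sizes, lan_us, wan_us):
    """Completion time when the root (in cluster 0) sends once to a coordinator
    in every other cluster, and each cluster then broadcasts locally."""
    times = []
    for index, size in enumerate(sizes):
        wide_area = 0 if index == 0 else wan_us  # exactly one WAN hop per remote cluster
        times.append(wide_area + flat_binomial_time([size], lan_us, wan_us))
    return max(times)
```

Under these assumptions, four clusters of eight nodes with a 20 microsecond local latency and a 10 millisecond wide area latency give 20060 microseconds for the flat tree (two wide area hops on the critical path) against 10060 microseconds for the cluster-aware scheme.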
Efficient load balancing for wide-area divide-and-conquer applications
- In: Proc. PPoPP’01, Snowbird, UT (2001)
Cited by 46 (16 self)
Divide-and-conquer programs are easily parallelized by letting the programmer annotate potential parallelism in the form of spawn and sync constructs. To achieve efficient program execution, the generated work load has to be balanced evenly among the available CPUs. For single cluster systems, Random Stealing (RS) is known to achieve optimal load balancing. However, RS is inefficient when applied to hierarchical wide-area systems where multiple clusters are connected via wide-area networks (WANs) with high latency and low bandwidth. In this paper, we experimentally compare RS with existing load-balancing strategies that are believed to be efficient for multi-cluster systems, Random Pushing and two variants of Hierarchical Stealing. We demonstrate that, in practice, they obtain less than optimal results. We introduce a novel load-balancing algorithm, Cluster-aware Random Stealing (CRS), which is highly efficient and easy to implement. CRS adapts itself to network conditions and job granularities, and does not require manually-tuned parameters. Although CRS sends more data across the WANs, it is faster than its competitors for 11 out of 12 test applications with various WAN configurations. It has at most 4% overhead in run time compared to RS on a single, large cluster, even with high wide-area latencies and low wide-area bandwidths. These strong results suggest that divide-and-conquer parallelism is a useful model for writing distributed supercomputing applications on hierarchical wide-area systems.
Wide-Area Parallel Computing in Java
- In ACM SIGPLAN Java Grande Conference
, 1999
Cited by 26 (8 self)
Java's support for parallel and distributed processing makes the language attractive for metacomputing applications, such as parallel applications that run on geographically distributed (wide-area) systems. To obtain actual experience with a Java-centric approach to metacomputing, we have built and used a high-performance wide-area Java system, called Manta. Manta implements the Java RMI model using different communication protocols (active messages and TCP/IP) for different networks. The paper shows how wide-area parallel applications can be expressed and optimized using Java RMI. Also, it presents performance results of several applications on a wide-area system consisting of four Myrinet-based clusters connected by ATM WANs. 1 Introduction Metacomputing is an interesting research area that tries to integrate geographically distributed computing resources into a single powerful system. Many applications can benefit from such an integration [11, 22]. Metacomputing systems support such...
Bandwidth-efficient Collective Communication for Clustered Wide Area Systems
- In Proc. International Parallel and Distributed Processing Symposium (IPDPS 2000), Cancun
, 1999
Cited by 24 (3 self)
Metacomputing infrastructures couple multiple clusters (or MPPs) via wide-area networks and thus allow parallel programs to run on geographically distributed resources. A major problem in programming such wide-area parallel applications is the difference in communication costs inside and between clusters. Latency and bandwidth of WANs often are orders of magnitude worse than those of local networks. Our MagPIe library eases wide-area parallel programming by providing an efficient implementation of MPI's collective communication operations. MagPIe exploits the hierarchical structure of clustered wide-area systems and minimizes the communication overhead over the WAN links. In this paper, we present improved algorithms for collective communication that achieve shorter completion times by simultaneously using the aggregate bandwidth of the available wide-area links. Our new algorithms split messages into multiple segments that are sent in parallel over different WAN links, thus resulting ...
Wide-Area Parallel Programming using the Remote Method Invocation Model
, 1999
Cited by 16 (10 self)
Java’s support for parallel and distributed processing makes the language attractive for metacomputing applications, such as parallel applications that run on geographically distributed (wide-area) systems. To obtain actual experience with a Java-centric approach to metacomputing, we have built and used a high-performance wide-area Java system, called Manta. Manta implements the Java Remote Method Invocation (RMI) model using different communication protocols (active messages and TCP/IP) for different networks. The paper shows how wide-area parallel applications can be expressed and optimized using Java RMI. Also, it presents performance results of several applications on a wide-area system consisting of four Myrinet-based clusters connected by ATM WANs. We finally discuss alternative programming models, namely object replication, JavaSpaces, and MPI for Java.
MPI's Reduction Operations in Clustered Wide Area Systems
- In Proc. MPIDC'99, Message Passing Interface Developer's and User's Conference
, 1999
Cited by 10 (3 self)
The emergence of meta computers and computational grids makes it feasible to run parallel programs on large-scale, geographically distributed computer systems. Writing parallel applications for such systems is a challenging task which may require changes to the communication structure of the applications. MPI's collective operations (such as broadcast and reduce) allow for some of these changes to be hidden from the applications programmer. We have developed MAGPIE, a library of collective communication operations optimized for wide area systems. MAGPIE's algorithms are designed to send the minimal amount of data over the slow wide area links, and to only incur a single wide area latency. This paper discusses MPI's collective reduction operations. Compared to systems that do not take the topology into account, such as MPICH, large performance improvements are possible. For larger messages, best performance is achieved when the reduction function is associative. 1 Introduction Severa...
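Why associativity matters for a topology-aware reduction can be shown with a small sketch. It is not taken from MAGPIE: the function names and the representation of clusters as nested lists are assumptions made for illustration. Reducing inside each cluster first, and only then combining one partial result per cluster, regroups the operands, which preserves the result only when the operator is associative.

```python
from functools import reduce
import operator


def flat_reduce(op, clusters):
    """Left-to-right reduction over all values, ignoring cluster boundaries."""
    values = [value for cluster in clusters for value in cluster]
    return reduce(op, values)


def cluster_aware_reduce(op, clusters):
    """Reduce inside each cluster, then combine one partial result per cluster.

    Only the per-cluster partial results would need to cross the wide area links.
    """
    partials = [reduce(op, cluster) for cluster in clusters]
    return reduce(op, partials)


clusters = [[1, 2], [3, 4]]
# Addition is associative, so regrouping does not change the result.
same = flat_reduce(operator.add, clusters) == cluster_aware_reduce(operator.add, clusters)
# Subtraction is not associative: ((1 - 2) - 3) - 4 differs from (1 - 2) - (3 - 4).
differs = flat_reduce(operator.sub, clusters) != cluster_aware_reduce(operator.sub, clusters)
```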
Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer
- Research Letters in the Information and Mathematical Sciences, Volume 5, June 2003
, 2003
Cited by 8 (4 self)
This paper presents some performance results obtained from a new Beowulf cluster, the Helix, built at Massey University, Auckland, and funded by the Allan Wilson Center for Evolutionary Ecology. Issues concerning network latency and the effect of the switching fabric and network topology on performance are discussed. In order to assess how the system performed using the message passing interface (MPI), two test suites (mpptest and jumpshot) were used to provide a comprehensive network performance analysis. The performance of an older fast-ethernet/single processor based cluster is compared to the new Gigabit/SMP cluster. The Linpack performance of Helix is investigated. The Linpack Rmax rating of 234.8 Gflops puts the cluster at third place in the Australia/New Zealand sublist of the Top500 supercomputers, an extremely good performance considering the commodity parts and its low cost (US$125,000).
Parallel Computing on Wide-Area Clusters: the Albatross Project
- In Extreme Linux Workshop
, 1999
Cited by 7 (0 self)
The aim of the Albatross project is to study applications and programming environments for wide-area cluster computers, which consist of multiple clusters connected by wide-area networks. Parallel processing on such systems is useful but challenging, given the large differences in latency and bandwidth between LANs and WANs. We apply application-level optimizations that exploit the hierarchical structure of wide-area clusters to minimize communication over the WANs. In addition, we use highly efficient local-area communication protocols. We illustrate this approach using a high-performance Java system that is implemented on a collection of four Myrinet-based clusters connected by wide-area ATM networks. The optimized applications obtain high speedups on this wide-area system.
Wide-Area Transposition-Driven Scheduling
- In IEEE International Symposium on High Performance Distributed Computing
, 2001
Cited by 5 (2 self)
Distributed search of state spaces containing cycles is a challenging task and has been studied for years. Traditional parallel search algorithms either ignore the cyclic nature of the state space and waste much time in duplicated search effort, or rely on heavy communication to reduce duplicate work, resulting in a large communication overhead. Both methods perform poorly, even when using a fast, local interconnect. A recently developed task-distribution scheme, Transposition-Driven Scheduling (TDS), performs much better, since it communicates asynchronously and efficiently suppresses duplicate search effort. TDS, however, requires bandwidths of megabytes per second per processor. In this paper, we investigate how cyclic state spaces can be searched efficiently on a meta-computing system containing multiple clusters, connected by high-latency, low-bandwidth wide-area links. This is quite a challenge, because the wide-area links provide neither the bandwidth required for TDS, nor the latency required for traditional distributed search algorithms. We propose a scheme that strongly reduces communication between clusters, at the expense of possibly duplicate search effort. Performance measurements for several applications show that the new scheme outperforms traditional schemes by a wide margin. Keywords: meta computing, distributed search, distributed supercomputing, Transposition-Driven Scheduling
BSP Algorithms Design for Hierarchical Supercomputers
- Submitted for publication
, 2002
Cited by 4 (0 self)
In recent years there has been a trend towards using standard workstation components to construct parallel computers, due to the enormous costs involved in designing and manufacturing special-purpose hardware. In particular, we can expect to see a large population of SMP clusters emerging in the next few years. These are local-area networks of workstations, each containing around four parallel processors with a single shared memory. To use such machines effectively will be a major headache for programmers and compiler-writers. Here we consider how well-suited the BSP model might be for these two-tier architectures, and whether it would be useful to extend the model to allow for non-uniform communication behaviour.

