Results 1 - 10 of 36
MagPIe: MPI’s Collective Communication Operations for Clustered Wide Area Systems
- In Proc. PPoPP '99, 1999
"... Writing parallel applications for computational grids is a challenging task. To achieve good performance, algorithms designed for local area networks must be adapted to the differences in link speeds. An important class of algorithms are collective operations, such as broadcast and reduce. We have d ..."
Abstract
-
Cited by 172 (27 self)
- Add to MetaCart
(Show Context)
Writing parallel applications for computational grids is a challenging task. To achieve good performance, algorithms designed for local area networks must be adapted to the differences in link speeds. An important class of algorithms is collective operations, such as broadcast and reduce. We have developed MagPIe, a library of collective communication operations optimized for wide area systems. MagPIe's algorithms send the minimal amount of data over the slow wide-area links, and incur only a single wide-area latency. Using our system, existing MPI applications can be run unmodified on geographically distributed systems. On moderate cluster sizes, using a wide-area latency of 10 milliseconds and a bandwidth of 1 MByte/s, MagPIe executes operations up to 10 times faster than MPICH, a widely used MPI implementation; application kernels improve by up to a factor of 4. Due to the structure of our algorithms, MagPIe's advantage increases for higher wide-area latencies.
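The "single wide-area latency" property can be made concrete with a small scheduling sketch. The Java below is my own minimal illustration, not MagPIe's actual code, with a hypothetical schedule helper and a list-of-lists cluster layout: the root sends exactly one WAN message per remote cluster, and each cluster coordinator fans out locally, so every receiver is at most one wide-area hop from the root.

    import java.util.*;

    // Hierarchical broadcast schedule: one WAN send per remote cluster,
    // then local fan-out inside each cluster.
    public class HierBcast {
        // Each inner list holds one cluster's node ids; its first entry
        // is taken as that cluster's coordinator.
        static List<int[]> schedule(int root, List<List<Integer>> clusters) {
            List<int[]> sends = new ArrayList<>();          // {from, to} pairs
            for (List<Integer> cluster : clusters) {
                boolean rootHere = cluster.contains(root);
                int coord = rootHere ? root : cluster.get(0);
                if (!rootHere) sends.add(new int[]{root, coord});   // one WAN send
                for (int node : cluster)
                    if (node != coord) sends.add(new int[]{coord, node}); // LAN fan-out
            }
            return sends;
        }

        public static void main(String[] args) {
            List<List<Integer>> clusters = List.of(List.of(0, 1, 2), List.of(3, 4, 5));
            for (int[] s : schedule(0, clusters))
                System.out.println(s[0] + " -> " + s[1]);
        }
    }

With two clusters and root 0, only the single message 0 -> 3 crosses the WAN; all other traffic stays on fast local links.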
Sensitivity of Parallel Applications to Large Differences in Bandwidth and Latency in Two-Layer Interconnects
- In Fifth International Symposium on High-Performance Computer Architecture, 1999
"... This paper studies application performance on systems with strongly non-uniform remote memory access. In current generation NUMAs the speed difference between the slowest and fastest link in an interconnect---the "NUMA gap"---is typically less than an order of magnitude, and many convent ..."
Abstract
-
Cited by 40 (11 self)
- Add to MetaCart
This paper studies application performance on systems with strongly non-uniform remote memory access. In current-generation NUMAs the speed difference between the slowest and fastest link in an interconnect (the "NUMA gap") is typically less than an order of magnitude, and many conventional parallel programs achieve good performance. We study how different NUMA gaps influence application performance, up to and including typical wide-area latencies and bandwidths. We find that for gaps larger than those of current-generation NUMAs, performance suffers considerably for applications that were designed for a uniform-access interconnect. For many applications, however, performance can be greatly improved with comparatively simple changes: traffic over slow links can be reduced by making communication patterns hierarchical, like the interconnect itself. We find that in four out of our six applications the size of the gap can be increased by an order of magnitude or more without severely ...
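A back-of-envelope cost model shows why flat communication patterns are so sensitive to the gap while hierarchical ones are not. The unit costs and the model itself are my own illustration, not the paper's methodology:

    // Toy cost model: two clusters of n nodes doing an all-to-all
    // exchange. A fast-link message costs 1 unit, a slow-link message
    // costs `gap` units; we simply total the message costs.
    public class GapModel {
        public static void main(String[] args) {
            int n = 16;                                    // nodes per cluster
            for (int gap : new int[]{1, 10, 100, 1000}) {
                // Flat: every node talks to every other node directly,
                // so 2*n*n messages cross the slow inter-cluster link.
                long flat = 2L * n * (n - 1)               // intra-cluster pairs
                          + 2L * n * n * gap;              // inter-cluster pairs
                // Hierarchical: combine locally, exchange once between
                // the cluster coordinators, then redistribute locally.
                long hier = 4L * (n - 1)                   // local gather + scatter
                          + 2L * gap;                      // one coordinator exchange each way
                System.out.printf("gap=%4d  flat=%7d  hier=%5d%n", gap, flat, hier);
            }
        }
    }

As gap grows by three orders of magnitude, the flat pattern's cost grows with it almost proportionally, while the hierarchical pattern pays the gap only once in each direction.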
The Performance of Processor Co-Allocation in Multicluster Systems
2003
"... In systems consisting of multiple clusters of processors which are interconnected by relatively slow communication links and which employ space sharing for scheduling jobs, such as our Distributed ASCI Supercomputer (DAS), coallocation, i.e., the simultaneous allocation of processors to single jo ..."
Abstract
-
Cited by 39 (8 self)
- Add to MetaCart
In systems that consist of multiple clusters of processors interconnected by relatively slow communication links and that employ space sharing for scheduling jobs, such as our Distributed ASCI Supercomputer (DAS), co-allocation, i.e., the simultaneous allocation of processors to a single job in different clusters, may be required. We use simulations to study the mean response time of jobs under co-allocation, as a function of the structure and sizes of jobs, the scheduling policy, and the communication speed ratio. Our main conclusion is that for the communication speed ratios found in current multiclusters, co-allocation is a viable option.
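A toy discrete-event simulation conveys what such a study measures. Everything below (cluster count, job shape, rates, the FCFS rule) is my own hypothetical construction, far simpler than the paper's model: jobs with several equal components are placed FCFS, one component per cluster, and response time is waiting time plus service time.

    import java.util.*;

    public class CoAllocSim {
        // Exponential random variate with the given rate.
        static double expo(Random r, double rate) { return -Math.log(1 - r.nextDouble()) / rate; }

        public static void main(String[] args) {
            final int C = 4, CAP = 32, COMPS = 4, SIZE = 8, JOBS = 20000;
            final double LAMBDA = 0.9, MU = 1.0;           // arrival rate, service rate
            Random rnd = new Random(42);

            int[] free = new int[C]; Arrays.fill(free, CAP);
            // Running jobs: {finishTime, clusterOfComp1, ..., clusterOfCompN}.
            PriorityQueue<double[]> running =
                new PriorityQueue<>(Comparator.comparingDouble(d -> d[0]));
            ArrayDeque<Double> fcfs = new ArrayDeque<>();  // arrival times of queued jobs
            double clock = 0, nextArrival = expo(rnd, LAMBDA), respSum = 0;
            int arrived = 0, done = 0;

            while (done < JOBS) {
                // Advance to the next event: an arrival or the earliest departure.
                if (arrived < JOBS && (running.isEmpty() || nextArrival < running.peek()[0])) {
                    clock = nextArrival; fcfs.add(clock);
                    arrived++; nextArrival = clock + expo(rnd, LAMBDA);
                } else {
                    double[] dep = running.poll(); clock = dep[0];
                    for (int i = 1; i <= COMPS; i++) free[(int) dep[i]] += SIZE;
                    done++;
                }
                // Co-allocation, FCFS: the head job starts only when each of
                // its components fits in a distinct cluster at the same moment.
                while (!fcfs.isEmpty()) {
                    Integer[] byFree = new Integer[C];
                    for (int i = 0; i < C; i++) byFree[i] = i;
                    Arrays.sort(byFree, (a, b) -> free[b] - free[a]);
                    if (free[byFree[COMPS - 1]] < SIZE) break;     // head must wait
                    double arrival = fcfs.poll();
                    double[] dep = new double[COMPS + 1];
                    dep[0] = clock + expo(rnd, MU);
                    for (int i = 0; i < COMPS; i++) { free[byFree[i]] -= SIZE; dep[i + 1] = byFree[i]; }
                    running.add(dep);
                    respSum += dep[0] - arrival;                   // response = wait + service
                }
            }
            System.out.printf("mean response time %.3f vs. mean service time %.3f%n",
                              respSum / JOBS, 1 / MU);
        }
    }

Varying the job structure and sizes in such a model is exactly the kind of parameter sweep the paper performs, with its communication speed ratio additionally stretching the service times of co-allocated jobs.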
Bandwidth-efficient Collective Communication for Clustered Wide Area Systems
- In Proc. International Parallel and Distributed Processing Symposium (IPDPS 2000), Cancun, 2000
"... Metacomputing infrastructures couple multiple clusters (or MPPs) via wide-area networks and thus allow parallel programs to run on geographically distributed resources. A major problem in programming such wide-area parallel applications is the difference in communication costs inside and between clu ..."
Abstract
-
Cited by 36 (3 self)
- Add to MetaCart
(Show Context)
Metacomputing infrastructures couple multiple clusters (or MPPs) via wide-area networks and thus allow parallel programs to run on geographically distributed resources. A major problem in programming such wide-area parallel applications is the difference in communication costs inside and between clusters: the latency and bandwidth of WANs are often orders of magnitude worse than those of local networks. Our MagPIe library eases wide-area parallel programming by providing an efficient implementation of MPI's collective communication operations. MagPIe exploits the hierarchical structure of clustered wide-area systems and minimizes the communication overhead over the WAN links. In this paper, we present improved algorithms for collective communication that achieve shorter completion times by simultaneously using the aggregate bandwidth of the available wide-area links. Our new algorithms split messages into multiple segments that are sent in parallel over different WAN links, thus resulting ...
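The bandwidth argument is simple arithmetic. The sketch below is mine, not the library's; the message size and link count are hypothetical, and the latency and per-link bandwidth reuse the 10 ms / 1 MByte/s WAN figures quoted for MagPIe above:

    // Splitting one large message into segments that travel in parallel
    // over k wide-area links lets the transfer see the links' aggregate
    // bandwidth instead of a single link's.
    public class SegmentedTransfer {
        public static void main(String[] args) {
            double sizeMB = 64, latency = 0.010, mbPerSec = 1.0;   // per link
            int links = 4;
            double single = latency + sizeMB / mbPerSec;           // one link
            double split  = latency + (sizeMB / links) / mbPerSec; // k segments in parallel
            System.out.printf("one link: %.2f s, %d links: %.2f s%n", single, links, split);
        }
    }

For bandwidth-bound transfers the completion time drops almost by the number of links, while the single added latency term stays negligible.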
Wide-Area Parallel Computing in Java
- In ACM SIGPLAN Java Grande Conference, 1999
"... Java's support for parallel and distributed processing makes the language attractive for metacomputing applications, such as parallel applications that run on geographically distributed (wide-area) systems. To obtain actual experience with a Java-centric approach to metacomputing, we have built ..."
Abstract
-
Cited by 26 (8 self)
- Add to MetaCart
Java's support for parallel and distributed processing makes the language attractive for metacomputing applications, such as parallel applications that run on geographically distributed (wide-area) systems. To obtain actual experience with a Java-centric approach to metacomputing, we have built and used a high-performance wide-area Java system, called Manta. Manta implements the Java RMI model using different communication protocols (active messages and TCP/IP) for different networks. The paper shows how wide-area parallel applications can be expressed and optimized using Java RMI. Also, it presents performance results of several applications on a wide-area system consisting of four Myrinet-based clusters connected by ATM WANs.
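In RMI terms, the programming model boils down to invoking methods on remote objects. The interface below is a hypothetical example of mine using the standard java.rmi API, not Manta's own API; Manta compiles and transports such calls with its own protocols:

    import java.rmi.Remote;
    import java.rmi.RemoteException;

    // Hypothetical remote interface: parallelism is expressed as method
    // invocations on objects that live on other machines.
    public interface Worker extends Remote {
        // Over a WAN each invocation costs a full round-trip, so shipping
        // a whole partition in one coarse-grained call matters far more
        // than it does on a local Myrinet network.
        double[] process(double[] partition) throws RemoteException;
    }

Optimizing a wide-area RMI application then largely means restructuring calls like this one so that as few round-trips as possible cross the slow links.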
Parallel Application Experience with Replicated Method Invocation
2001
"... We describe and evaluate a new approach to object replication in Java, aimed at improving the performance of parallel programs. Our programming model allows the programmer to define groups of objects that can be replicated and updated as a whole, using totally-ordered broadcast to send update method ..."
Abstract
-
Cited by 22 (12 self)
- Add to MetaCart
(Show Context)
We describe and evaluate a new approach to object replication in Java, aimed at improving the performance of parallel programs. Our programming model allows the programmer to define groups of objects that can be replicated and updated as a whole, using totally-ordered broadcast to send update methods to all machines containing a copy. The model has been implemented in the Manta high-performance Java system. We evaluate system performance both with microbenchmarks and with a set of five parallel applications. For the applications, we also evaluate ease of programming, compared to RMI implementations. We present performance results for a Myrinet-based workstation cluster as well as for a wide-area distributed system consisting of four such clusters. The microbenchmarks show that updating a replicated object on 64 machines only takes about three times the RMI latency in Manta. Applications using Manta's object replication mechanism perform at least as fast as manually optimized versions based on RMI, while keeping the application code as simple as with naive versions that use shared objects without taking locality into account. Using a replication mechanism in Manta's runtime system enables several unmodified applications to run efficiently even on the wide-area system.
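The model is easy to state in code. The sketch below is a single-process stand-in of my own, not Manta's implementation: update methods pass through one totally-ordered "broadcast", here simulated by a synchronized method that defines a single global order, while reads touch only the local copy.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;

    // Replicated method invocation in miniature: reads are local and
    // free; updates are applied to every replica in one global order,
    // keeping all copies identical.
    public class ReplicatedCounter {
        static class Replica { long value = 0; }

        static final List<Replica> replicas = new ArrayList<>();

        // Stand-in for totally-ordered broadcast: synchronization yields
        // a single global order of updates across all replicas.
        static synchronized void broadcastUpdate(Consumer<Replica> update) {
            for (Replica r : replicas) update.accept(r);
        }

        public static void main(String[] args) {
            for (int i = 0; i < 4; i++) replicas.add(new Replica());
            broadcastUpdate(r -> r.value += 5);       // update: sent to all copies
            broadcastUpdate(r -> r.value *= 2);       // applied in the same order everywhere
            long local = replicas.get(2).value;       // read: local copy, no communication
            System.out.println("replica 2 reads " + local);   // prints 10
        }
    }

The payoff matches the abstract's numbers: reads cost nothing over the network, and a write costs one ordered broadcast rather than a lock acquisition plus remote accesses.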
A measurement-based simulation study of processor co-allocation in multicluster systems
- In Scheduling Strategies for Parallel Processing, 2003
"... In systems consisting of multiple clusters of processors interconnected by relatively slow connections such as our Distributed ASCI Supercomputer (DAS), applications may benefit from the availability of processors in multiple clusters. However, the performance of single-application multicluster exec ..."
Abstract
-
Cited by 21 (3 self)
- Add to MetaCart
(Show Context)
In systems consisting of multiple clusters of processors interconnected by relatively slow connections, such as our Distributed ASCI Supercomputer (DAS), applications may benefit from the availability of processors in multiple clusters. However, the performance of single-application multicluster execution may be degraded by the slow wide-area links. In addition, scheduling policies for such systems have to deal with more restrictions than schedulers for single clusters, in that every component of a job has to fit in a separate cluster. In this paper we present a measurement study of the total runtime of two applications, and of the communication time of one of them, both on single clusters and on multicluster systems. In addition, we perform simulations of several multicluster scheduling policies based on our measurement results. Our results show that even though inter-cluster communication is much slower than intra-cluster communication, the performance of multicluster operation can be very reasonable compared to single-cluster execution.
The Distributed ASCI Supercomputer Project
2000
"... The Distributed ASCI Supercomputer (DAS) is a homogeneous wide-area distributed system consisting of four cluster computers at different locations. DAS has been used for research on communication software, parallel languages and programming systems, schedulers, parallel applications, and distributed ..."
Abstract
-
Cited by 20 (0 self)
- Add to MetaCart
(Show Context)
The Distributed ASCI Supercomputer (DAS) is a homogeneous wide-area distributed system consisting of four cluster computers at different locations. DAS has been used for research on communication software, parallel languages and programming systems, schedulers, parallel applications, and distributed applications. The paper gives an overview of the most interesting research results obtained so far in the DAS project.
The Maximal Utilization of Processor Co-Allocation in Multicluster Systems
- In Proc. International Parallel and Distributed Processing Symposium (IPDPS), 2003
"... In systems consisting of multiple clusters of proces-sors which employ space sharing for scheduling jobs, such as our Distributed ASCI1 Supercomputer (DAS), co-allocation, i.e., the simultaneous allocation of processors to single jobs in multiple clusters, may be required. In studies of scheduling i ..."
Abstract
-
Cited by 18 (5 self)
- Add to MetaCart
(Show Context)
In systems consisting of multiple clusters of processors which employ space sharing for scheduling jobs, such as our Distributed ASCI Supercomputer (DAS), co-allocation, i.e., the simultaneous allocation of processors to single jobs in multiple clusters, may be required. Studies of scheduling in single clusters have shown that the achievable (maximal) utilization may be much less than 100%, a problem that may be aggravated in multicluster systems. In this paper we study the maximal utilization when co-allocating jobs in multicluster systems, both analytically (we derive exact and approximate formulas when the service-time distribution is exponential) and with simulations, using synthetic workloads as well as workloads derived from the logs of actual systems.
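Why the maximal utilization sits below 100% is easy to demonstrate with a saturated-queue Monte Carlo. The toy below is entirely my own construction (two clusters, uniform component sizes, exponential service), not the paper's analysis: with FCFS, the head job often cannot start even though processors are idle, because each of its components must fit in its own cluster at the same moment.

    import java.util.*;

    // Saturated FCFS queue over two clusters: measure the long-run
    // fraction of busy processors, i.e., the maximal utilization.
    public class MaxUtilSim {
        public static void main(String[] args) {
            final int CAP = 16;                        // processors per cluster
            Random rnd = new Random(1);
            PriorityQueue<double[]> running =          // {finish, size1, size2}
                new PriorityQueue<>(Comparator.comparingDouble(d -> d[0]));
            int[] free = {CAP, CAP};
            double clock = 0, busy = 0;                // busy processor-seconds
            int[] head = null;                         // head job's component sizes

            for (long events = 0; events < 2_000_000; events++) {
                if (head == null)                      // the queue never empties
                    head = new int[]{1 + rnd.nextInt(CAP), 1 + rnd.nextInt(CAP)};
                if (free[0] >= head[0] && free[1] >= head[1]) {
                    free[0] -= head[0]; free[1] -= head[1];
                    double finish = clock - Math.log(1 - rnd.nextDouble()); // Exp(1) service
                    running.add(new double[]{finish, head[0], head[1]});
                    head = null;
                } else {                               // blocked: advance to a departure
                    double[] dep = running.poll();
                    busy += (dep[0] - clock) * (2 * CAP - free[0] - free[1]);
                    clock = dep[0];
                    free[0] += (int) dep[1]; free[1] += (int) dep[2];
                }
            }
            System.out.printf("maximal utilization ~ %.1f%%%n", 100 * busy / (clock * 2 * CAP));
        }
    }

The estimate lands well below 100%: head-of-line blocking plus the co-allocation constraint leaves processors idle even under an infinite backlog, which is the fragmentation effect the paper quantifies.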
Wide-Area Parallel Programming using the Remote Method Invocation Model
1999
"... Java’s support for parallel and distributed processing makes the language attractive for metacomputing applications, such as parallel applications that run on geographically distributed (wide-area) systems. To obtain actual experience with a Java-centric approach to metacomputing, we have built and ..."
Abstract
-
Cited by 17 (10 self)
- Add to MetaCart
(Show Context)
Java's support for parallel and distributed processing makes the language attractive for metacomputing applications, such as parallel applications that run on geographically distributed (wide-area) systems. To obtain actual experience with a Java-centric approach to metacomputing, we have built and used a high-performance wide-area Java system, called Manta. Manta implements the Java Remote Method Invocation (RMI) model using different communication protocols (active messages and TCP/IP) for different networks. The paper shows how wide-area parallel applications can be expressed and optimized using Java RMI. Also, it presents performance results of several applications on a wide-area system consisting of four Myrinet-based clusters connected by ATM WANs. Finally, we discuss alternative programming models, namely object replication, JavaSpaces, and MPI for Java.