Transparent fault-tolerance in parallel Orca programs
- In Proceedings of the Symposium on Experiences with Distributed and Multiprocessor Systems III, 1992
- Cited by 30 (3 self)
With the advent of large-scale parallel computing systems, making parallel programs fault-tolerant becomes an important problem, because the probability of a failure increases with the number of processors. In this paper, we describe a very simple scheme for rendering a class of parallel Orca programs fault-tolerant. Also, we discuss our experience with implementing this scheme on Amoeba. Our approach works for parallel applications that are not interactive. The approach is based on making a globally consistent checkpoint from time to time and rolling back to the last checkpoint when a processor fails. Making a consistent global checkpoint is easy in Orca, because its implementation is based on reliable broadcast. The advantages of our approach are its simplicity, ease of implementation, low overhead, and transparency to the Orca programmer.
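The checkpoint-and-rollback idea in this abstract can be illustrated with a toy sketch. It is not Orca or Amoeba code: the class and function names are hypothetical, the whole "global state" is a single Python dict, and a checkpoint is simply a deep copy of it. In the system the abstract describes, consistency comes from reliable broadcast; this sketch sidesteps that by being single-threaded.

```python
import copy


class CheckpointedComputation:
    """Toy model: one dict holds all state; a checkpoint is a deep copy of it."""

    def __init__(self, state, checkpoint_interval):
        self.state = state
        self.step_count = 0
        self.interval = checkpoint_interval
        # The initial state counts as checkpoint number zero.
        self._saved = (0, copy.deepcopy(state))

    def step(self, update):
        """Apply one update, then checkpoint every `interval` steps."""
        update(self.state)
        self.step_count += 1
        if self.step_count % self.interval == 0:
            self._saved = (self.step_count, copy.deepcopy(self.state))

    def rollback(self):
        """Restore the last checkpoint and return the step it was taken at."""
        self.step_count, saved_state = self._saved
        self.state = copy.deepcopy(saved_state)
        return self.step_count


def run_with_failures(total_steps, interval, fail_at):
    """Sum 1..total_steps, simulating a crash just before each step in fail_at."""
    comp = CheckpointedComputation({"sum": 0}, interval)
    pending_failures = set(fail_at)
    while comp.step_count < total_steps:
        next_step = comp.step_count + 1
        if next_step in pending_failures:
            pending_failures.discard(next_step)  # each simulated failure fires once
            comp.rollback()                      # work since the checkpoint is redone
            continue
        comp.step(lambda s, n=next_step: s.__setitem__("sum", s["sum"] + n))
    return comp.state["sum"]


if __name__ == "__main__":
    # The result is the same with or without the simulated failure.
    print(run_with_failures(10, 3, {5}))
    print(run_with_failures(10, 3, set()))
```

The work done between the last checkpoint and the failure is repeated after rollback, which is why such a scheme suits non-interactive applications: redoing steps has no externally visible side effects.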
PLinda 2.0: A Transactional/Checkpointing Approach to Fault Tolerant Linda
- In Proceedings of the 13th Symposium on Reliable Distributed Systems, 1994
- Cited by 22 (2 self)
Robust parallel computation in Linda requires both tuple space and processes to be resilient to failure. In this paper, we present PLinda 2.0, a set of extensions to Linda to support robust parallel computation on loosely coupled processors communicating over a network. The principal extensions of PLinda 2.0 to Linda are transaction mechanisms for reliable tuple space and process-private logging mechanisms for resilient processes. The transaction mechanisms support two kinds of tuple space: stable tuple space, which is always guaranteed to reflect the state as of the last committed transaction, and unstable tuple space, which is protected by a transaction-consistent checkpoint. The process-private logging mechanisms are provided as tools for a process checkpointing scheme. These mechanisms allow the customization of checkpointing and recovery operations in each process to achieve low runtime overhead. 1 Introduction One of the issues that distributed programming systems must address is fault tolerance [4]. On loos...
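The notion of a stable tuple space that always reflects the last committed transaction can be sketched as follows. This is an illustration only, not the PLinda 2.0 API: `TupleSpace`, `Transaction`, `out`, `take`, `commit`, and `abort` are hypothetical names, `None` is assumed to act as a wildcard in patterns, `take` returns `None` instead of blocking as Linda's `in` would, and only one transaction at a time is supported (no concurrency control).

```python
class TupleSpace:
    """Holds committed tuples only; uncommitted changes live in a Transaction."""

    def __init__(self):
        self._tuples = []

    def snapshot(self):
        """Return a copy of the committed contents."""
        return list(self._tuples)


def _matches(pattern, tup):
    # None in the pattern is treated as a wildcard (an assumption of this sketch).
    return len(pattern) == len(tup) and all(
        p is None or p == v for p, v in zip(pattern, tup)
    )


class Transaction:
    """Buffers insertions and withdrawals until commit; abort discards them."""

    def __init__(self, space):
        self._space = space
        self._added = []
        self._removed = set()  # indices into the committed tuple list

    def out(self, tup):
        """Stage a tuple for insertion."""
        self._added.append(tup)

    def take(self, pattern):
        """Stage withdrawal of the first committed tuple matching pattern."""
        for idx, tup in enumerate(self._space._tuples):
            if idx not in self._removed and _matches(pattern, tup):
                self._removed.add(idx)
                return tup
        return None

    def commit(self):
        """Apply all staged changes to the committed state in one step."""
        kept = [
            t for i, t in enumerate(self._space._tuples) if i not in self._removed
        ]
        self._space._tuples = kept + self._added
        self._added, self._removed = [], set()

    def abort(self):
        """Drop staged changes; the committed state is untouched."""
        self._added, self._removed = [], set()


if __name__ == "__main__":
    space = TupleSpace()
    producer = Transaction(space)
    producer.out(("task", 1))
    producer.commit()

    worker = Transaction(space)
    worker.take(("task", None))
    worker.out(("result", 1))
    worker.abort()           # simulated worker failure
    print(space.snapshot())  # the task tuple is still available
```

Because a failed worker's withdrawal and partial results never reach the committed state, another worker can pick up the same task tuple later.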
Fault-tolerant Parallel Processing Combining Linda, Checkpointing, and Transactions
- 1996
- Cited by 12 (2 self)
With the advent of high performance workstations and fast LANs, networks of workstations have recently emerged as a promising computing platform for long-running coarse-grain parallel applications. Their advantages are wide availability and cost-effectiveness, as compared to massively parallel computers. Long-running computation in the workstation environment, however, requires both fault tolerance and the effective utilization of idle workstations. In this dissertation, we present a variant of Linda, called Persistent Linda (PLinda), that treats these two issues uniformly: specifically, PLinda treats non-idleness as failure. PLinda provides a combination of checkpointing and transaction support on both data and program state (an encoding of continuations). The traditional transaction model is optimized and extended to support robust parallel computation. Treatable failures include processor and main memory hard and slowdown failures, and network omission and corruption failures. The programmer can customize fault tolerance when constructing an application, trading failure-free performance against recovery time. When creating a PLinda program,
All-Pairs: An Abstraction for Data-Intensive Cloud Computing
- Cited by 12 (5 self)
Although modern parallel and distributed computing systems provide easy access to large amounts of computing power, it is not always easy for non-expert users to harness these large systems effectively. A large workload composed in what seems to be the obvious way by a naive user may accidentally abuse shared resources and achieve very poor performance. To address this problem, we propose that production systems should provide end users with high-level abstractions that allow for the easy expression and efficient execution of data-intensive workloads. We present one example of an abstraction, All-Pairs, that fits the needs of several data-intensive scientific applications. We demonstrate that an optimized All-Pairs abstraction is easier to use than the underlying system, achieves performance orders of magnitude better than the obvious but naive approach, and is twice as fast as a hand-optimized conventional approach.
All-Pairs: An Abstraction for Data Intensive Computing on Campus Grids
- IEEE Transactions on Parallel and Distributed Systems
- Cited by 9 (1 self)
Today, campus grids provide users with easy access to thousands of CPUs. However, it is not always easy for non-expert users to harness these systems effectively. A large workload composed in what seems to be the obvious way by a naive user may accidentally abuse shared resources and achieve very poor performance. To address this problem, we argue that campus grids should provide end users with high-level abstractions that allow for the easy expression and efficient execution of data-intensive workloads. We present one example of an abstraction, All-Pairs, that fits the needs of several applications in biometrics, bioinformatics, and data mining. We demonstrate that an optimized All-Pairs abstraction is easier to use than the underlying system, achieves performance orders of magnitude better than the obvious but naive approach, and is both faster and more efficient than a tuned conventional approach. This abstraction has been in production use for one year on a 500-CPU campus grid at the University of Notre Dame, and has been used to carry out a groundbreaking analysis of biometric data.
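The abstract does not spell out the interface, so the following is a minimal sketch under the usual reading of the name: given two sets A and B and a user-supplied function F, produce the matrix M with M[i][j] = F(A[i], B[j]). The names `all_pairs`, `compare`, and `hamming` are hypothetical, and a local thread pool stands in for the campus grid; none of the data placement or scheduling work that the paper's optimized implementation performs is modelled here.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def all_pairs(set_a, set_b, compare, max_workers=4):
    """Return M where M[i][j] = compare(set_a[i], set_b[j])."""
    a, b = list(set_a), list(set_b)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results come back row-major.
        flat = list(pool.map(lambda pair: compare(*pair), product(a, b)))
    width = len(b)
    return [flat[i * width:(i + 1) * width] for i in range(len(a))]


def hamming(x, y):
    """Example comparison: number of differing positions in two equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(x, y))


if __name__ == "__main__":
    matrix = all_pairs(["abc", "abd"], ["abc", "xyz"], hamming)
    print(matrix)  # [[0, 3], [1, 3]]
```

The user writes only the sequential comparison function; how the pairs are distributed is left to the abstraction, which is the point the abstract makes about shielding non-expert users from the underlying system.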
Harnessing Parallelism in Multicore Clusters with the All-Pairs and Wavefront Abstractions
- Cited by 2 (2 self)
Both distributed systems and multicore computers are difficult programming environments. Although the expert programmer may be able to tune distributed and multicore computers to achieve high performance, the non-expert may struggle to achieve a program that even functions correctly. We argue that high-level abstractions are an effective way of making parallel computing accessible to the non-expert. An abstraction is a regularly structured framework into which a user may plug in simple sequential programs to create very large parallel programs. By virtue of a regular structure and declarative specification, abstractions may be materialized on distributed, multicore, and distributed multicore systems with robust performance across a wide range of problem sizes. In previous work, we presented the All-Pairs abstraction for computing on distributed systems of single CPUs. In this paper, we extend All-Pairs to multicore systems, and introduce Wavefront, which represents a number of problems in economics and bioinformatics. We demonstrate good scaling of both abstractions up to 32 cores on one machine and hundreds of cores in a distributed system.
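The abstract names Wavefront without defining it. A sketch under one common reading, stated here as an assumption: given the first row and first column of a matrix and a user-supplied function F, each remaining cell is computed from its upper, left, and upper-left neighbours. The function name `wavefront` and its parameters are hypothetical, and the sketch is sequential; a parallel implementation could evaluate all cells on the same anti-diagonal at once, since they do not depend on each other.

```python
def wavefront(first_row, first_col, f):
    """Fill R so that R[i][j] = f(R[i-1][j], R[i][j-1], R[i-1][j-1]).

    first_row supplies R[0][*] and first_col supplies R[*][0]; both must
    agree on the shared corner cell R[0][0].
    """
    if first_row[0] != first_col[0]:
        raise ValueError("first_row[0] and first_col[0] must be the same cell")
    rows, cols = len(first_col), len(first_row)
    r = [[None] * cols for _ in range(rows)]
    r[0] = list(first_row)
    for i in range(rows):
        r[i][0] = first_col[i]
    # Row-major order guarantees all three neighbours are ready before use.
    for i in range(1, rows):
        for j in range(1, cols):
            r[i][j] = f(r[i - 1][j], r[i][j - 1], r[i - 1][j - 1])
    return r


if __name__ == "__main__":
    # Summing the upper and left neighbours yields a Pascal-style table.
    table = wavefront([1, 1, 1], [1, 1, 1], lambda up, left, diag: up + left)
    print(table)  # [[1, 1, 1], [1, 2, 3], [1, 3, 6]]
```

As with All-Pairs, the user supplies only the sequential cell function, while the dependency structure is fixed by the abstraction.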
Karpjoo Jeong, January 1996
Fault-tolerant Parallel Processing Combining Linda, Checkpointing, and Transactions. Karpjoo Jeong. Research Advisor: Professor Dennis Shasha. With the advent of high performance workstations and fast LANs, networks of workstations have recently emerged as a promising computing platform for long-running coarse-grain parallel applications. Their advantages are wide availability and cost-effectiveness, as compared to massively parallel computers. Long-running computation in the workstation environment, however, requires both fault tolerance and the effective utilization of idle workstations. In this dissertation, we present a variant of Linda, called Persistent Linda (PLinda), that treats these two issues uniformly: specifically, PLinda treats non-idleness as failure. PLinda provides a combination of checkpointing and transaction support on both data and program state (an encoding of continuations). The traditional transaction model is optimized and extended to support robust parallel c...
Scalable Modular Genome Assembly on Campus Grids
Bioinformatics researchers need efficient means to process large collections of sequence data. One application of interest, genome assembly, is naturally parallel; however, most current implementations are tied to uncommon high-end hardware. We solve this problem by introducing a modular, scalable framework for genome assembly that runs on a wide variety of distributed environments without forcing end users to purchase specialized hardware or become experts in parallel programming. For large problems, the framework carefully handles task and data management while also achieving fault-tolerant speedup with good efficiency on several scales of resources. We show results for several assembly-related problems ranging from 738 thousand to over 84 million alignments using campus grid resources ranging from a small cluster to several hundred nodes at each of three institutions. These results show strong scaling beyond 512 nodes using a custom alignment module.
to appear in:
, 2009
A cloud computer provides a simple interface that allows end users to allocate large amounts of computing power and storage space at the touch of a button. However, many potential users of cloud computers have needs much more complex than simply the ability to allocate resources. In scientific domains, it is easy to find examples of workloads that consist of hundreds or thousands

