Results 1 - 10 of 26
ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors
- In ISCA-02
, 2002
"... This paper presents ReVive, a novel general-purpose rollback recovery mechanism for shared-memory multiprocessors. ReVive carefully balances the conflicting requirements of availability, performance, and hardware cost. ReVive performs checkpointing, logging, and distributed parity protection, all me ..."
Abstract - Cited by 120 (13 self)
This paper presents ReVive, a novel general-purpose rollback recovery mechanism for shared-memory multiprocessors. ReVive carefully balances the conflicting requirements of availability, performance, and hardware cost. ReVive performs checkpointing, logging, and distributed parity protection, all memory-based. It enables recovery from a wide class of errors, including the permanent loss of an entire node. To maintain high performance, ReVive includes specialized hardware that performs frequent operations in the background, such as log and parity updates. To keep the cost low, more complex checkpointing and recovery functions are performed in software, while the hardware modifications are limited to the directory controllers of the machine. Our simulation results on a 16-processor system indicate that the average error-free execution time overhead of using ReVive is only 6.3%, while the achieved availability is better than 99.999% even when the errors occur as often as once per day.
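
To make the checkpoint-and-undo-log interplay concrete, the following is a minimal C sketch of memory-based rollback recovery. It is an illustration only: the array names and sizes are hypothetical, and ReVive itself performs the logging and parity updates in directory-controller hardware rather than in software like this.

    /* Minimal software sketch of checkpoint-plus-undo-logging (hypothetical). */
    #include <stdio.h>
    #include <string.h>

    #define MEM_WORDS 8
    #define LOG_CAP   64

    static long memory[MEM_WORDS];      /* "main memory" being protected    */
    static long checkpoint[MEM_WORDS];  /* last committed checkpoint image  */

    struct log_entry { int addr; long old; };
    static struct log_entry undo_log[LOG_CAP];
    static int log_len = 0;

    /* Each write saves the old value first (a real system would log only
     * the first write to a memory line per checkpoint interval). */
    static void logged_store(int addr, long value) {
        if (log_len < LOG_CAP)
            undo_log[log_len++] = (struct log_entry){ addr, memory[addr] };
        memory[addr] = value;
    }

    static void take_checkpoint(void) {    /* commit: discard the undo log  */
        memcpy(checkpoint, memory, sizeof memory);
        log_len = 0;
    }

    static void rollback(void) {           /* error: undo writes in reverse */
        for (int i = log_len - 1; i >= 0; i--)
            memory[undo_log[i].addr] = undo_log[i].old;
        log_len = 0;
    }

    int main(void) {
        take_checkpoint();
        logged_store(0, 42);
        logged_store(1, 7);
        rollback();                         /* as if an error were detected */
        printf("mem[0]=%ld mem[1]=%ld\n", memory[0], memory[1]);  /* 0 0 */
        return 0;
    }

A full mechanism would additionally protect the memory and log contents themselves, for example with the distributed parity the paper describes, so that recovery survives the loss of an entire node.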
Process migration
- ACM Computing Surveys
, 2000
"... A process is an operating system abstraction representing an instance of a running computer program. Process migration is the act of transferring a process between two machines during its execution. Several implementations ..."
Abstract - Cited by 104 (1 self)
A process is an operating system abstraction representing an instance of a running computer program. Process migration is the act of transferring a process between two machines during its execution. Several implementations
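
As a toy illustration of the concept only (not any system covered by the survey), the sketch below serializes a hypothetical process-state record on the source machine and rebuilds it on the destination; real migration must also transfer the address space, open files, and communication channels.

    /* Toy illustration of process migration: capture state, move, restore. */
    #include <stdio.h>
    #include <string.h>

    struct proc_state {              /* hypothetical, greatly simplified    */
        long pc;                     /* program counter                     */
        long regs[4];                /* a few general-purpose registers     */
        char heap[32];               /* a tiny stand-in "address space"     */
    };

    /* Source machine: freeze the process and serialize its state. */
    static size_t capture(const struct proc_state *p, char *buf) {
        memcpy(buf, p, sizeof *p);
        return sizeof *p;
    }

    /* Destination machine: rebuild the process from the transferred bytes. */
    static void restore(struct proc_state *p, const char *buf) {
        memcpy(p, buf, sizeof *p);
    }

    int main(void) {
        struct proc_state src = { .pc = 0x400123, .regs = {1, 2, 3, 4} };
        strcpy(src.heap, "in-flight data");

        char wire[sizeof src];       /* stands in for a network transfer    */
        size_t n = capture(&src, wire);

        struct proc_state dst;
        restore(&dst, wire);
        printf("resumed at pc=%#lx with heap \"%s\" (%zu bytes moved)\n",
               dst.pc, dst.heap, n);
        return 0;
    }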
PRISM: An Integrated Architecture for Scalable Shared Memory
- IN PROCEEDINGS OF THE FOURTH IEEE SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE
, 1998
"... This paper describes PRISM, a distributed shared-memory architecture that relies on a tightly integrated hardware and operating system design for scalable and reliable performance. PRISM's hardware provides mechanisms for flexible management and dynamic configuration of shared-memory pages with ..."
Abstract - Cited by 9 (0 self)
This paper describes PRISM, a distributed shared-memory architecture that relies on a tightly integrated hardware and operating system design for scalable and reliable performance. PRISM's hardware provides mechanisms for flexible management and dynamic configuration of shared-memory pages with different behaviors. As an example, PRISM can configure individual shared-memory pages in both CC-NUMA and Simple-COMA styles, maintaining the advantages of both without incorporating any of their disadvantages. PRISM's operating system is structured as multiple independent kernels, where each kernel manages the resources on its local node. PRISM's system structure minimizes the amount of global coordination when managing shared memory: page faults do not involve global TLB invalidates, and pages can be replicated and migrated without requiring global coordination. The structure also provides natural fault containment boundaries around each node because physical addresses do not address re...
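
A rough sketch of the per-page configuration idea, using a hypothetical page descriptor with a mode field; PRISM realizes this with hardware mechanisms and per-node kernels rather than a single software structure like this.

    /* Sketch of per-page memory-behavior configuration (hypothetical types). */
    #include <stdio.h>

    enum page_mode { CC_NUMA, SIMPLE_COMA };   /* coherence/placement style  */

    struct page_desc {
        unsigned long  vpn;        /* virtual page number                    */
        int            home_node;  /* home for CC-NUMA pages                 */
        enum page_mode mode;       /* per-page, changeable at runtime        */
    };

    /* Reconfigure one page locally; no global TLB shootdown is implied,
     * echoing the claim that page management avoids global coordination. */
    static void set_mode(struct page_desc *pd, enum page_mode m) {
        pd->mode = m;              /* a real system would also migrate or
                                    * replicate the page's data as needed    */
    }

    int main(void) {
        struct page_desc pd = { .vpn = 0x1234, .home_node = 3, .mode = CC_NUMA };
        set_mode(&pd, SIMPLE_COMA);        /* switch this page to S-COMA style */
        printf("page %#lx now mode=%d (home node %d)\n",
               pd.vpn, (int)pd.mode, pd.home_node);
        return 0;
    }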
Parallel SimOS: scalability and performance for large system simulation
, 2007
"... ii ..."
(Show Context)
Hive: Operating System Fault Containment For Shared-Memory Multiprocessors
, 1997
"... Reliability and scalability are major concerns when designing general-purpose operating systems for large-scale shared-memory multiprocessors. This dissertation describes Hive, an operating system with a novel kernel architecture that addresses these issues. Hive is structured as an internal distrib ..."
Abstract - Cited by 4 (0 self)
Reliability and scalability are major concerns when designing general-purpose operating systems for large-scale shared-memory multiprocessors. This dissertation describes Hive, an operating system with a novel kernel architecture that addresses these issues. Hive is structured as an internal distributed system of independent kernels called cells. This architecture improves reliability because a hardware or software error damages only one cell rather than the whole system. The architecture improves scalability because few kernel resources are shared by processes running on different cells. The Hive prototype is a complete implementation of UNIX SVR4 and is targeted to run on the Stanford FLASH multiprocessor. The research described in the dissertation makes three primary contributions: (1) it demonstrates that distributed system mechanisms can be used to provide fault containment inside a shared-memory multiprocessor; (2) it provides a specification for a set of hardware features, imple...
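
The cell structure can be pictured with a small sketch; the types and counts below are hypothetical and only illustrate how a hardware or software fault is confined to the cell in which it occurs while the other kernels keep running.

    /* Sketch of fault containment across independent kernels ("cells"). */
    #include <stdio.h>
    #include <stdbool.h>

    #define NCELLS 4

    struct cell { int id; bool alive; int procs; };
    static struct cell cells[NCELLS];

    /* An error in one cell damages only that cell, not the whole system. */
    static void fault_in_cell(int id) {
        cells[id].alive = false;     /* contained: other cells keep running */
    }

    int main(void) {
        for (int i = 0; i < NCELLS; i++)
            cells[i] = (struct cell){ .id = i, .alive = true, .procs = 10 + i };

        fault_in_cell(2);

        int survivors = 0;
        for (int i = 0; i < NCELLS; i++)
            survivors += cells[i].alive;
        printf("%d of %d cells still running after a fault in cell 2\n",
               survivors, NCELLS);
        return 0;
    }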
Simple Deadlock-Free Dynamic Network Reconfiguration
"... Abstract. Dynamic reconfiguration of interconnection networks is defined as the process of changing from one routing function to another while the network remains up and running. The main challenge is in avoiding deadlock anomalies while keeping restrictions on packet injection and forwarding minima ..."
Abstract - Cited by 4 (2 self)
Dynamic reconfiguration of interconnection networks is defined as the process of changing from one routing function to another while the network remains up and running. The main challenge is in avoiding deadlock anomalies while keeping restrictions on packet injection and forwarding minimal. Current approaches fall into one of two categories: either they require extra network resources such as virtual channels, or their complexity is so high that their practical applicability is limited. In this paper we describe a simple and powerful method for dynamic network reconfiguration. It guarantees a fast and deadlock-free transition from the old to the new routing function, it works for any topology and between any pair of old and new routing functions, and it guarantees in-order packet delivery when used between deterministic routing functions.
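
One way to picture a transition between routing functions is the simplified, hypothetical epoch-and-drain discipline sketched below; it is not the paper's exact scheme, only an illustration of switching routing functions while traffic keeps flowing.

    /* Hypothetical sketch: tag packets with the routing function that routed
     * them, and complete the switch once the old epoch has drained. */
    #include <stdio.h>

    typedef int (*routing_fn)(int src, int dst);    /* returns next hop */

    static int route_old(int src, int dst) { (void)dst; return (src + 1) % 8; }
    static int route_new(int src, int dst) { return dst > src ? src + 1 : src - 1; }

    static routing_fn current = route_old;
    static int in_flight_old = 0;          /* packets routed by the old fn  */

    static void inject_packet(int src, int dst) {
        if (current == route_old) in_flight_old++;
        printf("packet %d->%d, next hop %d\n", src, dst, current(src, dst));
    }

    static void deliver_packet(routing_fn routed_by) {
        if (routed_by == route_old) in_flight_old--;
    }

    /* Reconfiguration: new traffic uses the new function, and the transition
     * is declared complete only when the old epoch has drained. */
    static void reconfigure(void) {
        current = route_new;
        while (in_flight_old > 0)
            deliver_packet(route_old);     /* stand-in for real draining    */
        printf("transition complete: new routing function active\n");
    }

    int main(void) {
        inject_packet(0, 5);
        inject_packet(3, 1);
        reconfigure();
        inject_packet(2, 6);
        return 0;
    }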
Parallel computing in the commercial marketplace: Research and innovation at work
- Proceedings of the IEEE
, 1999
"... ..."
LIVE DISTRIBUTED OBJECTS
, 2008
"... Distributed multiparty protocols such as multicast, atomic commit, or gossip are currently underutilized, but we envision that they could be used pervasively, and that developers could work with such protocols similarly to how they work with CORBA/COM/.NET/Java objects. We have created a new program ..."
Abstract - Cited by 3 (3 self)
Distributed multiparty protocols such as multicast, atomic commit, or gossip are currently underutilized, but we envision that they could be used pervasively, and that developers could work with such protocols similarly to how they work with CORBA/COM/.NET/Java objects. We have created a new programming model and a platform in which protocol instances are represented as objects of a new type called live distributed objects: strongly-typed building blocks that can be composed in a type-safe manner through a drag-and-drop interface. Unlike most prior object-oriented distributed protocol embeddings, our model appears to be flexible enough to accommodate most popular protocols, and to be applied uniformly to any part of a distributed system, to build not only front-end, but also back-end components, such as multicast channels, naming, or membership services. While the platform is not limited to applications based on multicast, it is replication-centric, and reliable multicast protocols are important building blocks that can be used to create a variety of scalable components, from shared documents to fault-tolerant storage or scalable role delegation. We propose a new multicast architecture compatible with
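
A minimal sketch of the "protocol instance as a typed object" idea, with hypothetical names rather than the platform's actual API: a multicast-channel object exposes typed endpoints that other components connect to.

    /* Sketch of a protocol instance represented as an object with endpoints. */
    #include <stdio.h>

    typedef void (*deliver_cb)(const char *msg);   /* typed "receive" endpoint */

    struct multicast_object {
        const char *group;         /* identity of this protocol instance      */
        deliver_cb  on_deliver;    /* wired to a consumer object              */
    };

    /* "Composing" two objects is just binding matching endpoints. */
    static void connect_consumer(struct multicast_object *mc, deliver_cb cb) {
        mc->on_deliver = cb;
    }

    static void multicast_send(struct multicast_object *mc, const char *msg) {
        /* A real instance would run a reliable multicast protocol here; the
         * sketch delivers locally just to show the composition. */
        if (mc->on_deliver) mc->on_deliver(msg);
    }

    static void shared_document_apply(const char *msg) {
        printf("shared document applies update: %s\n", msg);
    }

    int main(void) {
        struct multicast_object channel = { .group = "doc-updates" };
        connect_consumer(&channel, shared_document_apply);   /* typed wiring  */
        multicast_send(&channel, "insert 'hello' at offset 0");
        return 0;
    }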
Strong partitioning protocol for a multiprocessor VME system
- In Digest of Papers, Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
, 1998
"... The trend in implementing today’s embedded applications is toward the use of commercial-off-the-shelf open architecture. Reducing costs and facilitating systems integration are among the motives for that trend. The use of the VME bus becomes very common in many industrial applications. The VME bus a ..."
Abstract - Cited by 1 (0 self)
The trend in implementing today’s embedded applications is toward the use of commercial-off-the-shelf open architecture. Reducing costs and facilitating systems integration are among the motives for that trend. The use of the VME bus has become very common in many industrial applications. The VME bus attracts developers with its rigorous specifications, multiprocessing support, and board availability through multiple vendors. However, the VME bus standard supports multiprocessing through shared memory, which does not impose strong function partitioning and allows fault propagation from one board to another. Such weakness limits the use of the VME bus in highly critical applications such as avionics. This paper presents techniques for strong partitioning of multiprocessor applications that maintain fault containment on the VME bus. The suggested techniques do not require any modification to the standard or to existing boards, and consequently preserve the plug-and-play advantage of VME bus hardware products. The techniques are equally applicable to other tightly coupled multiprocessor systems. In addition, the paper describes the implementation of these techniques and reports performance results. Finally, the benefits of this technology for a space vehicle and commercial avionics are discussed.
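
As one illustration of software fault containment over a shared bus (not the paper's specific protocol), the sketch below confines each board to its own mailbox window in the shared region and has the receiver validate a checksum before trusting the data.

    /* Sketch of mailbox-style fault containment over shared memory. */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define NBOARDS 4
    #define MSG_LEN 16

    struct mailbox { uint8_t data[MSG_LEN]; uint32_t csum; };
    static struct mailbox shared_region[NBOARDS];   /* stands in for VME RAM */

    static uint32_t checksum(const uint8_t *p, size_t n) {
        uint32_t s = 0;
        while (n--) s = s * 31 + *p++;
        return s;
    }

    /* Sender: writes only into its own slot in the shared region. */
    static void board_send(int board, const char *msg) {
        struct mailbox *mb = &shared_region[board];
        memset(mb->data, 0, MSG_LEN);
        memcpy(mb->data, msg, strlen(msg) < MSG_LEN ? strlen(msg) : MSG_LEN - 1);
        mb->csum = checksum(mb->data, MSG_LEN);
    }

    /* Receiver: rejects a corrupted mailbox instead of propagating the fault. */
    static int board_receive(int from, char *out) {
        struct mailbox *mb = &shared_region[from];
        if (checksum(mb->data, MSG_LEN) != mb->csum) return -1;
        memcpy(out, mb->data, MSG_LEN);
        return 0;
    }

    int main(void) {
        char buf[MSG_LEN];
        board_send(1, "sensor ok");
        shared_region[1].data[0] ^= 0xFF;           /* simulate a faulty write */
        printf("receive from board 1: %s\n",
               board_receive(1, buf) == 0 ? buf : "rejected (contained)");
        return 0;
    }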
Communication Across Fault-Containment Firewalls on the SGI Origin
"... Scalability and reliability are inseparable in high-performance computing. Fault-isolation through hardware is a popular means of providing reliability. Unfortunately, such isolation also increases communication latencies: typically, one has to drop into and out of the kernel to communicate between ..."
Abstract - Cited by 1 (0 self)
Scalability and reliability are inseparable in high-performance computing. Fault isolation through hardware is a popular means of providing reliability. Unfortunately, such isolation also increases communication latencies: typically, one has to drop into and out of the kernel to communicate between failure domains. On the other hand, relaxing fault isolation domains allows efficient communication, but at the risk of failure propagation, and thus reduced reliability. We are concerned with finding a middle ground between these extremes. We first review a few salient aspects of the SGI Origin-2000 architecture, mentioning the hardware features germane to efficient communication and to building protection firewalls. Then, we describe a mechanism for risk-free, point-to-point communication between processes on distinct failure domains. Quoting performance numbers, we show that the overheads of crossing domains render this mechanism unattractive for small messages. To address this issue, we describe a mechanism for controlled opening of the firewalls, thereby achieving explicit inter-partition shared memory for communication. We describe the kernel software that addresses the resulting reliability issues, and discuss how familiar IPC mechanisms such as MPI and SysV shared memory can use the explicit shared memory to advantage. Finally, based on the lessons learnt, we discuss some future directions and draw concluding remarks.
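
To illustrate how explicit shared memory can back cross-domain communication, the sketch below uses standard SysV IPC as a stand-in for the opened-firewall region; the Origin-specific kernel mechanism described in the paper is not shown, and the two processes here merely stand in for two failure domains.

    /* Sketch: two processes communicating through an explicitly shared region. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        /* Create a small region that both "partitions" agree to expose. */
        int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
        if (id < 0) { perror("shmget"); return 1; }

        char *region = shmat(id, NULL, 0);
        if (region == (void *)-1) { perror("shmat"); return 1; }

        if (fork() == 0) {                   /* the "remote" failure domain  */
            strcpy(region, "hello across the firewall");
            _exit(0);
        }
        wait(NULL);

        printf("peer wrote: %s\n", region);  /* direct read of the peer data */

        shmdt(region);
        shmctl(id, IPC_RMID, NULL);          /* tear the window back down    */
        return 0;
    }

In the paper's setting, the interesting part is what happens when one domain fails while the window is open; the kernel software described there exists precisely to handle those reliability issues.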