Symbiotic Routing in Future Data Centers
Cited by 36 (3 self)
Building distributed applications that run in data centers is hard. The CamCube project explores the design of a shipping-container-sized data center with the goal of building an easier platform on which to build these applications. CamCube replaces the traditional switch-based network with a 3D torus topology, with each server directly connected to six other servers. As in other proposals, e.g. DCell and BCube, multi-hop routing in CamCube requires servers to participate in packet forwarding. To date, as in existing data centers, these approaches have all provided a single routing protocol for the applications. In this paper we explore whether allowing applications to implement their own routing services is advantageous, and whether we can support it efficiently. This is based on the observation that, due to the flexibility offered by the CamCube API, many applications implemented their own routing protocol in order to achieve specific application-level characteristics, such as trading off higher latency for better path convergence. Using large-scale simulations we demonstrate the benefits and network-level impact of running multiple routing protocols. We demonstrate that applications are more efficient and do not generate additional control traffic overhead. This motivates us to design an extended routing service allowing easy implementation of application-specific routing protocols on CamCube. Finally, we demonstrate that the additional performance overhead incurred when using the extended routing service on a prototype CamCube is very low.
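As an illustration of the topology described in this abstract, the following minimal sketch lists the six servers directly connected to a given server in a 3D torus. The function name `torus_neighbors`, the coordinate representation, and the example dimensions are assumptions made for illustration; they are not taken from the CamCube API. The sketch also assumes every dimension has at least three servers, so that the six neighbors are distinct.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def torus_neighbors(node: Coord, dims: Coord) -> List[Coord]:
    """Return the one-hop neighbors of `node` in a 3D torus of size `dims`.

    Hypothetical helper: each server has one neighbor in each direction
    along each of the three axes, with coordinates wrapping around at
    the edges (this wraparound is what makes the mesh a torus).
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            # Modular arithmetic implements the wraparound links.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


if __name__ == "__main__":
    # A corner server in an assumed 3 x 3 x 3 torus still has six neighbors.
    for n in torus_neighbors((0, 0, 0), (3, 3, 3)):
        print(n)
```

Because every server has the same number of links regardless of position, multi-hop traffic necessarily passes through intermediate servers, which is why the servers themselves take part in packet forwarding.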
Immunet: Dependable Routing for Interconnection Networks with Arbitrary Topology
IEEE Transactions on Computers, TC-2007-07-0304, 2007
Cited by 6 (0 self)
A complete mechanism for tolerating multiple failures in parallel computer systems, denoted as Immunet, is described in this paper. Immunet can be applied to arbitrary topologies, either regular or irregular, exhibiting in both cases graceful performance degradation. Provided that the network remains connected, Immunet is able to deal with any number of failures regardless of their spatial and temporal distribution. Our mechanism operates on the basis of a dynamic network reconfiguration in response to failures. The network reconfiguration only employs local information recorded at the router nodes which leads to a highly scalable system. In addition, its low cost and overhead permit a practicable hardware implementation. Finally, as Immunet does not require in-flight traffic to be discarded, the parallel applications running in the system can transparently circumvent network failures. Only packets stored in or traveling through a broken component need to be recovered by higher system levels.
Practical Deadlock-Free Fault-Tolerant Routing in Meshes Based on the Planar Network Fault Model
2009
Cited by 2 (0 self)
The number of virtual channels required for deadlock-free routing is important for cost-effective and high-performance system design. The planar adaptive routing scheme is an effective deadlock avoidance technique using only three virtual channels for each physical channel in 3D or higher-dimensional mesh networks with a very simple deadlock avoidance scheme. However, there exists one idle virtual channel for all physical channels along the first dimension and two idle virtual channels for channels along the last dimension in a mesh network based on the planar adaptive routing algorithm. A new deadlock avoidance technique is proposed for 3D meshes using only two virtual channels by making full use of the idle channels. The deadlock-free adaptive routing scheme is then modified to a deadlock-free adaptive fault-tolerant routing scheme based on a planar network (PN) fault model. The proposed deadlock-free adaptive routing scheme is also extended to n-dimensional meshes still using two virtual channels. Sufficient simulation results are presented to demonstrate the effectiveness of the proposed algorithm.
A probabilistic characterization of fault rings in adaptively-routed mesh interconnection networks. Paper presented at the
2008
Cited by 1 (0 self)
With increasing concern for reliability in the current and next generation of Multiprocessor Systems-on-Chip (MP-SoCs), multi-computers, cluster computers, and peer-to-peer communication networks, fault-tolerance has become an integral part of these systems. One of the fundamental issues regarding fault-tolerance is how to efficiently route messages in a faulty network where each component is associated with some probability of failure. Adaptive fault-tolerant routing algorithms have been frequently suggested in the literature as means of improving communication performance and fault-tolerant demands in computer systems. Also, several results have been reported on the use of fault rings in providing detours to messages blocked by faults and in routing messages adaptively around the rectangular faulty regions. In order to analyze the performance of such routing schemes, one must investigate the characteristics of fault rings. In this paper, we derive mathematical expressions to compute the probability of a message encountering fault rings in the well-known mesh interconnection network. We also conduct extensive simulation experiments using a variety of faults, the results of which are used to confirm the accuracy of the proposed models.
CamCubeOS: A Key-based Network Stack for 3D Torus Cluster Topologies
Cluster fabric interconnects that use 3D torus topologies are increasingly being deployed in data center clusters. In our prior work, we demonstrated that by using these topologies and letting applications implement custom routing protocols and perform operations on path, it is possible to increase performance and simplify development. However, these benefits cannot be achieved using mainstream point-to-point networking stacks such as TCP/IP or MPI, which hide the underlying topology and do not allow the implementation of any in-network operations. In this paper we describe CamCubeOS, a novel key-based communication stack, purposely designed from scratch for 3D torus fabric interconnects. We note that many of the applications used in clusters are key-based. Therefore, we designed CamCubeOS to natively support key-based operations. We select a virtual topology that perfectly matches the underlying physical topology and we use the keyspace to expose the physical locality, thus avoiding the typical overhead incurred by overlay-based approaches. We report on our experience in building several applications on top of CamCubeOS and we evaluate their performance and feasibility using a prototype and large-scale simulations.
Declaration
2006
These doctoral studies were conducted under the supervision of Prof. Kenneth G.
Understanding the Interconnection Network of SpiNNaker
SpiNNaker is a massively parallel architecture designed to model large-scale spiking neural networks in (biological) real-time. Its design is based around ad-hoc multi-core System-on-Chips which are interconnected using a two-dimensional toroidal triangular mesh. Neurons are modeled in software and their spikes generate packets that propagate through the on- and inter-chip communication fabric relying on custom-made on-chip multicast routers. This paper models and evaluates large-scale instances of its novel interconnect (more than 65 thousand nodes, or over one million computing cores), focusing on real-time features and fault-tolerance. The key contribution can be summarized as understanding the properties of the feasible topologies and establishing the stable operation of the SpiNNaker under different levels of degradation. First we derive analytically the topological
A Multipath Routing Method for Tolerating Permanent and Non-Permanent Faults
The intensive and continuous use of high-performance computers for executing computationally intensive applications, coupled with the large number of elements that make them up, dramatically increases the likelihood of failures during their operation. The interconnection network is a critical part of such systems; therefore, network faults have an extremely high impact because most routing algorithms are not designed to tolerate faults. In such algorithms, just a single fault may stall messages in the network, preventing the finalization of applications, or may lead to deadlocked configurations. This work focuses on the problem of fault tolerance for high-speed interconnection networks by designing a fault-tolerant routing method to solve an unbounded number of dynamic faults (permanent and non-permanent). To accomplish this task we take advantage of the communication path redundancy, by means of a multipath routing approach. Experiments show that our method allows applications to finalize their execution in the presence of several faults, with an average performance value of 97% compared to the fault-free scenarios.
Simulation Modelling Practice and Theory
... operating system, high performance libraries and parallel applications. The performance of all these components has to be properly evaluated in order to select the most effective (again, the exact meaning of effective depends on the context) taking into account the purpose of the system and the workloads that are planned to be executed on them. Furthermore, the complete