Results 1 - 10 of 36
VL2: A Scalable and Flexible Data Center Network
- ACM SIGCOMM Computer Communication Review
, 2009
"... Abstract To be agile and cost e ective, data centers should allow dynamic resource allocation across large server pools. In particular, the data center network should enable any server to be assigned to any service. To meet these goals, we present VL, a practical network architecture that scales t ..."
Abstract - Cited by 461 (12 self)
To be agile and cost effective, data centers should allow dynamic resource allocation across large server pools. In particular, the data center network should enable any server to be assigned to any service. To meet these goals, we present VL2, a practical network architecture that scales to support huge data centers with uniform high capacity between servers, performance isolation between services, and Ethernet layer-2 semantics. VL2 uses (1) flat addressing to allow service instances to be placed anywhere in the network, (2) Valiant Load Balancing to spread traffic uniformly across network paths, and (3) end-system based address resolution to scale to large server pools, without introducing complexity to the network control plane. VL2's design is driven by detailed measurements of traffic and fault data from a large operational cloud service provider. VL2's implementation leverages proven network technologies, already available at low cost in high-speed hardware implementations, to build a scalable and reliable network architecture. As a result, VL2 networks can be deployed today, and we have built a working prototype. We evaluate the merits of the VL2 design using measurement, analysis, and experiments. Our VL2 prototype shuffles 2.7 TB of data among 75 servers in 395 seconds, sustaining a rate that is 94% of the maximum possible.
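The Valiant Load Balancing idea at the heart of VL2 is easy to illustrate: each flow is bounced off a randomly chosen intermediate switch, so load spreads evenly regardless of the traffic matrix, while hashing keeps a single flow on one path. The sketch below only illustrates that idea, not code from the paper; the switch names, the flow identifier, and the hash choice are all assumptions.

```python
import hashlib

# Hypothetical intermediate (spine) switches in a VL2-style Clos network.
INTERMEDIATE_SWITCHES = [f"int-{i}" for i in range(8)]

def vlb_intermediate(src, dst, flow_id):
    """Pick the intermediate switch a flow is bounced off.

    Valiant Load Balancing routes each flow via a randomly selected
    intermediate switch, which spreads load across all paths independent
    of the traffic matrix. Hashing a flow identifier (standing in for the
    5-tuple) keeps packets of one flow on one path, avoiding reordering,
    while different flows are spread out.
    """
    digest = hashlib.sha256(f"{src}|{dst}|{flow_id}".encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(INTERMEDIATE_SWITCHES)
    return INTERMEDIATE_SWITCHES[index]

def route(src_tor, dst_tor, flow_id):
    """Two-hop VLB route: up to a random intermediate, then down to the destination."""
    mid = vlb_intermediate(src_tor, dst_tor, flow_id)
    return [src_tor, mid, dst_tor]

if __name__ == "__main__":
    for flow in range(5):
        print(route("tor-3", "tor-17", flow))
```

In a real Clos network the hash would cover the flow 5-tuple and be computed by ECMP in the switches; the Python stand-in just makes the path selection explicit.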
Better Never than Late: Meeting Deadlines in Datacenter Networks
"... The soft real-time nature of large scale web applications in today’s datacenters, combined with their distributed workflow, leads to deadlines being associated with the datacenter application traffic. A network flow is useful, and contributes to application throughput and operator revenue if, and on ..."
Abstract - Cited by 104 (5 self)
The soft real-time nature of large scale web applications in today’s datacenters, combined with their distributed workflow, leads to deadlines being associated with the datacenter application traffic. A network flow is useful, and contributes to application throughput and operator revenue if, and only if, it completes within its deadline. Today’s transport protocols (TCP included), given their Internet origins, are agnostic to such flow deadlines. Instead, they strive to share network resources fairly. We show that this can hurt application performance. Motivated by these observations, and other (previously known) deficiencies of TCP in the datacenter environment, this paper presents the design and implementation of D3, a deadline-aware control protocol that is customized for the datacenter environment. D3 uses explicit rate control to apportion bandwidth according to flow deadlines. Evaluation from a 19-node, two-tier datacenter testbed shows that D3, even without any deadline information, easily outperforms TCP in terms of short flow latency and burst tolerance. Further, by utilizing deadline information, D3 effectively doubles the peak load that the datacenter network can support.
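The core of deadline-aware rate control can be shown with a toy calculation: a flow needs at least remaining_bytes / time_to_deadline, and a link grants the most urgent requests first. This is a simplified sketch of that intuition under assumed names and a made-up greedy policy, not the D3 protocol itself.

```python
from dataclasses import dataclass

@dataclass
class Flow:
    name: str
    remaining_bytes: float
    deadline_s: float  # time remaining until the deadline expires

    @property
    def desired_rate(self) -> float:
        """Minimum rate (bytes/s) needed to finish before the deadline."""
        return self.remaining_bytes / self.deadline_s

def allocate(flows, link_capacity):
    """Greedy deadline-based allocation (illustrative only).

    Serve flows with the earliest deadlines first, granting each its desired
    rate while capacity remains; leftover capacity is then shared equally so
    the link stays fully utilized. Flows that cannot get their desired rate
    are unlikely to make their deadline.
    """
    remaining = link_capacity
    rates = {}
    for f in sorted(flows, key=lambda f: f.deadline_s):
        grant = min(f.desired_rate, remaining)
        rates[f.name] = grant
        remaining -= grant
    if remaining > 0 and flows:
        bonus = remaining / len(flows)
        for f in flows:
            rates[f.name] += bonus
    return rates

if __name__ == "__main__":
    flows = [Flow("query", 1e6, 0.01), Flow("background", 5e8, 10.0)]
    print(allocate(flows, link_capacity=1.25e9))  # ~10 Gbps link in bytes/s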
Managing data transfers in computer clusters . . .
, 2011
"... Cluster computing applications like MapReduce and Dryad transfer massive amounts of data between their computation stages. These transfers can have a significant impact on job performance, accounting for more than 50 % of job completion times. Despite this impact, there has been relatively little wo ..."
Abstract - Cited by 61 (6 self)
Cluster computing applications like MapReduce and Dryad transfer massive amounts of data between their computation stages. These transfers can have a significant impact on job performance, accounting for more than 50% of job completion times. Despite this impact, there has been relatively little work on optimizing the performance of these data transfers, with networking researchers traditionally focusing on per-flow traffic management. We address this limitation by proposing a global management architecture and a set of algorithms that (1) improve the transfer times of common communication patterns, such as broadcast and shuffle, and (2) allow scheduling policies at the transfer level, such as prioritizing a transfer over other transfers. Using a prototype implementation, we show that our solution improves broadcast completion times by up to 4.5× compared to the status quo in Hadoop. We also show that transfer-level scheduling can reduce the completion time of high-priority transfers by 1.7×.
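The difference between flow-level fairness and transfer-level scheduling can be made concrete with a small sketch: give each transfer a weighted share of a link and let its flows split that share, instead of giving every flow an equal share. The function and its inputs are hypothetical, not Orchestra's actual algorithm.

```python
def transfer_shares(transfers, link_capacity):
    """Split a link's capacity across whole transfers, not individual flows.

    `transfers` maps a transfer name to (priority_weight, active_flow_count).
    Each transfer receives bandwidth proportional to its weight; its flows
    then split that share equally. Flow-fair sharing (as in TCP) would instead
    give each flow an equal share, letting a transfer with many flows crowd
    out a high-priority transfer with few flows.
    """
    total_weight = sum(w for w, _ in transfers.values())
    per_flow_rate = {}
    for name, (weight, n_flows) in transfers.items():
        share = link_capacity * weight / total_weight
        per_flow_rate[name] = share / n_flows
    return per_flow_rate

if __name__ == "__main__":
    # A high-priority shuffle with 4 flows vs. a background broadcast with 40 flows.
    print(transfer_shares({"shuffle": (3, 4), "broadcast": (1, 40)}, link_capacity=10e9))
```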
Understanding network failures in data centers: measurement, analysis, and implications
- In Proc. of SIGCOMM. ACM
, 2011
"... We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic and how effective is network redundancy? We ans ..."
Abstract - Cited by 49 (3 self)
We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic, and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as ToRs and AggS are highly reliable, (3) load balancers dominate in terms of failure occurrences with many short-lived software-related faults, (4) failures have the potential to cause loss of many small packets such as keep-alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.
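Finding (5) hinges on a redundancy-effectiveness metric; one plausible way to compute such a number is to compare the traffic a redundancy group carries during a failure with what it carried just before, and take the median across failure events. The sketch below follows that reading, which may differ from the paper's exact methodology; all names and the toy data are assumptions.

```python
from statistics import median

def redundancy_effectiveness(events):
    """Estimate how well redundancy masks failures (illustrative metric).

    `events` is a list of (bytes_before, bytes_during) pairs measured across a
    redundancy group (the failed link plus its backups) in equal-length windows
    before and during a failure. A ratio of 1.0 means redundancy fully masked
    the failure; the median across events summarizes typical effectiveness.
    """
    ratios = [during / before for before, during in events if before > 0]
    return median(ratios)

if __name__ == "__main__":
    # Toy data: three failures where the redundancy group carried 90%, 40%,
    # and 60% of the pre-failure traffic.
    print(redundancy_effectiveness([(100, 90), (100, 40), (100, 60)]))
```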
Programming Your Network at Run-time for Big Data Applications
"... Recent advances of software defined networking and optical switching technology make it possible to program the network stack all the way from physical topology to flow level traffic control. In this paper, we leverage the combination of SDN controller with optical switching to explore the tight int ..."
Abstract - Cited by 20 (3 self)
Recent advances in software-defined networking and optical switching technology make it possible to program the network stack all the way from the physical topology to flow-level traffic control. In this paper, we leverage the combination of an SDN controller with optical switching to explore the tight integration of application and network control. We particularly study run-time network configuration for big data applications to jointly optimize application performance and network utilization. We use Hadoop as an example to discuss the integrated network control architecture, job scheduling, and topology and routing configuration mechanisms for Hadoop jobs. Our analysis suggests that such integrated control has great potential to improve application performance with relatively small configuration overhead. We believe our study shows early promise of achieving the long-term goal of tight network and application integration using SDN.
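One concrete form such run-time configuration could take is a greedy assignment of a limited pool of optical circuits to the rack pairs with the heaviest pending traffic (e.g., aggregated from shuffle metadata). The sketch below is a hypothetical illustration under that assumption, not the scheduling mechanism proposed in the paper.

```python
def greedy_circuit_assignment(demand, n_circuits):
    """Assign a limited number of optical circuits to the hottest rack pairs.

    `demand` maps (src_rack, dst_rack) pairs to pending bytes. Each rack is
    assumed to have one optical port, so it can appear in at most one circuit;
    traffic not covered by a circuit would fall back to the packet-switched
    network.
    """
    used = set()
    circuits = []
    for (src, dst), volume in sorted(demand.items(), key=lambda kv: -kv[1]):
        if len(circuits) == n_circuits:
            break
        if src in used or dst in used:
            continue  # each rack's single optical port is already committed
        circuits.append((src, dst, volume))
        used.update((src, dst))
    return circuits

if __name__ == "__main__":
    demand = {("r1", "r2"): 800e9, ("r1", "r3"): 500e9, ("r3", "r4"): 300e9}
    print(greedy_circuit_assignment(demand, n_circuits=2))
```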
Small-world Datacenters
- In SOCC
, 2011
"... In this paper, we propose an unorthodox topology for datacenters that eliminates all hierarchical switches in favor of connecting nodes at random according to a small-worldinspired distribution. Specifically, we examine topologies where the underlying nodes are connected at the small scale in a regu ..."
Abstract - Cited by 18 (1 self)
In this paper, we propose an unorthodox topology for datacenters that eliminates all hierarchical switches in favor of connecting nodes at random according to a small-world-inspired distribution. Specifically, we examine topologies where the underlying nodes are connected at the small scale in a regular pattern, such as a ring, torus or cube, such that every node can route efficiently to nodes in its immediate vicinity, and amended by the addition of random links to nodes throughout the datacenter, such that a greedy algorithm can route packets to far away locations efficiently. Coupled with geographical address assignment, the resulting network can provide content routing in addition to traditional routing, and thus efficiently implement key-value stores. The irregular but self-similar nature of the network facilitates constructing large networks easily using prewired, commodity racks. We show that Small-World Datacenters can achieve higher bandwidth and fault tolerance compared to both conventional hierarchical datacenters as well as the recently proposed CamCube topology. Coupled with hardware acceleration for packet switching, small-world datacenters can achieve an order of magnitude higher bandwidth than a conventional datacenter, depending on the network traffic.
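A minimal sketch of the routing idea, assuming a plain ring augmented with a few random long-range links per node: addresses are positions on the ring, and each hop greedily forwards to whichever neighbor is closest to the destination. The construction and parameters are illustrative, not the paper's specific torus or cube designs.

```python
import random

def build_small_world_ring(n_nodes, extra_links_per_node=2, seed=0):
    """Ring topology plus a few random long-range links per node."""
    rng = random.Random(seed)
    adj = {i: {(i - 1) % n_nodes, (i + 1) % n_nodes} for i in range(n_nodes)}
    for i in range(n_nodes):
        for _ in range(extra_links_per_node):
            j = rng.randrange(n_nodes)
            if j != i:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def ring_distance(a, b, n):
    """Shortest distance between two positions on the ring."""
    return min((a - b) % n, (b - a) % n)

def greedy_route(adj, src, dst):
    """Forward to whichever neighbor is closest to the destination on the ring.

    With geographic addresses (positions on the ring), each hop only needs
    local information; the random links act as shortcuts across the datacenter.
    """
    n = len(adj)
    path = [src]
    current = src
    while current != dst:
        nxt = min(adj[current], key=lambda v: ring_distance(v, dst, n))
        if ring_distance(nxt, dst, n) >= ring_distance(current, dst, n):
            break  # no progress possible; a real design needs a fallback here
        path.append(nxt)
        current = nxt
    return path

if __name__ == "__main__":
    adj = build_small_world_ring(1024)
    print(greedy_route(adj, 3, 700))
```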
Decentralized Task-aware Scheduling for Data Center Networks
, 2013
"... ABSTRACT Many data center applications perform rich and complex tasks (e.g., executing a search query or generating a user's news-feed). From a network perspective, these tasks typically comprise multiple flows, which traverse different parts of the network at potentially different times. Most ..."
Abstract - Cited by 17 (2 self)
Many data center applications perform rich and complex tasks (e.g., executing a search query or generating a user's news-feed). From a network perspective, these tasks typically comprise multiple flows, which traverse different parts of the network at potentially different times. Most network resource allocation schemes, however, treat all these flows in isolation, rather than as part of a task, and therefore only optimize flow-level metrics. In this paper, we show that task-aware network scheduling, which groups the flows of a task and schedules them together, can reduce both the average and the tail completion time for typical data center applications. To achieve these benefits in practice, we design and implement Baraat, a decentralized task-aware scheduling system. Baraat schedules tasks in a FIFO order but avoids head-of-line blocking by dynamically changing the level of multiplexing in the network. Through experiments with Memcached on a small testbed and large-scale simulations, we show that Baraat outperforms state-of-the-art decentralized schemes (e.g., pFabric) as well as centralized schedulers (e.g., Orchestra) for a wide range of workloads (e.g., search, analytics, etc.).
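The FIFO-with-limited-multiplexing idea can be sketched in a few lines: order tasks by arrival and let only a bounded number of the earliest tasks use the network at once, so a heavy task at the head cannot block everything behind it. The function below is a toy illustration with assumed inputs, not Baraat's actual admission logic (which adjusts the multiplexing level dynamically).

```python
from collections import deque

def schedule_tasks(pending_task_ids, max_active):
    """Pick which tasks may send, in arrival (FIFO) order.

    Tasks carry a globally increasing id assigned at arrival. Strict FIFO
    would let a single heavy task block everything behind it; allowing up to
    `max_active` tasks to share the network at once avoids head-of-line
    blocking while still favoring earlier tasks. All flows of an admitted
    task would then be scheduled together at the same priority.
    """
    queue = deque(sorted(pending_task_ids))
    return [queue.popleft() for _ in range(min(max_active, len(queue)))]

if __name__ == "__main__":
    print(schedule_tasks([17, 5, 9, 23, 11], max_active=3))  # -> [5, 9, 11]
```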
On the feasibility of completely wireless data centers
- Cornell CIS
, 2011
"... Conventional datacenters, based on wired networks, entail high wiring costs, suffer from performance bottlenecks, and have low resilience to network failures. In this paper, we in-vestigate a radically new methodology for building wire-free datacenters based on emerging 60GHz RF technology. We propo ..."
Abstract - Cited by 12 (1 self)
Conventional datacenters, based on wired networks, entail high wiring costs, suffer from performance bottlenecks, and have low resilience to network failures. In this paper, we investigate a radically new methodology for building wire-free datacenters based on emerging 60GHz RF technology. We propose a novel rack design and a resulting network topology inspired by Cayley graphs that provide a dense interconnect. Our exploration of the resulting design space shows that wireless datacenters built with this methodology can potentially attain higher aggregate bandwidth, lower latency, and substantially higher fault tolerance than a conventional wired datacenter while improving ease of construction and maintenance.
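The topology concept can be illustrated with a textbook Cayley graph over the cyclic group Z_n: every node applies the same set of connection offsets, so the wiring pattern is identical at each node. The group, the generator offsets, and the scale below are assumptions for illustration, not the rack geometry used in the paper.

```python
def cayley_graph(n, generators):
    """Cayley graph of the cyclic group Z_n with the given generator set.

    Vertices are group elements 0..n-1; vertex v connects to v+g (mod n) for
    every generator g. Because the same generator set is applied at every
    vertex, the graph is vertex-transitive: every node sees an identical local
    wiring pattern, which is what makes such topologies easy to prewire rack
    by rack.
    """
    gens = set(generators) | {(-g) % n for g in generators}  # make it undirected
    return {v: sorted((v + g) % n for g in gens) for v in range(n)}

if __name__ == "__main__":
    # 20 nodes, each wired to neighbors at offsets 1, 2, and 5 around the cycle.
    topo = cayley_graph(20, [1, 2, 5])
    print(topo[0])  # -> [1, 2, 5, 15, 18, 19]
```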
On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-Core Interconnects
"... In this paper, we present network-on-chip (NoC) design and contrast it to traditional network design, highlighting similarities and differences between the two. As an initial case study, we examine network congestion in bufferless NoCs. We show that congestion manifests itself differently in a NoC t ..."
Abstract - Cited by 7 (2 self)
In this paper, we present network-on-chip (NoC) design and contrast it to traditional network design, highlighting similarities and differences between the two. As an initial case study, we examine network congestion in bufferless NoCs. We show that congestion manifests itself differently in a NoC than in traditional networks. Network congestion reduces system throughput in congested workloads for smaller NoCs (16 and 64 nodes), and limits the scalability of larger bufferless NoCs (256 to 4096 nodes) even when traffic has locality (e.g., when an application’s required data is mapped nearby to its core in the network). We propose a new source throttling-based congestion control mechanism with application-level awareness that reduces network congestion to improve system performance. Our mechanism improves system performance by up to 28% (15% on average in congested workloads) in smaller NoCs, achieves linear throughput scaling in NoCs up to 4096 cores (attaining similar performance scalability to a NoC with large buffers), and reduces power consumption by up to 20%. Thus, we show an effective application of a network-level concept, congestion control, to a class of networks – bufferless on-chip networks – that has not been studied before by the networking community.
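A minimal sketch of application-aware source throttling, under assumed inputs: when measured congestion crosses a threshold, scale back each application's injection rate in proportion to how network-intensive it is, leaving light applications untouched. The threshold, intensity metric, and scaling rule are illustrative assumptions, not the paper's mechanism.

```python
def throttle_rates(apps, congestion_level, threshold=0.3):
    """Application-aware source throttling (illustrative sketch).

    `apps` maps an application name to its measured network intensity
    (fraction of injected traffic it is responsible for, 0..1). When overall
    congestion exceeds a threshold, the most network-intensive applications
    are throttled hardest, while light applications keep injecting at full
    rate.
    """
    if congestion_level <= threshold:
        return {name: 1.0 for name in apps}  # no throttling needed
    overload = congestion_level - threshold
    allowed = {}
    for name, intensity in apps.items():
        # Scale back injection in proportion to both the overload and how
        # much of the traffic this application contributes, with a floor so
        # no application is starved entirely.
        allowed[name] = max(0.1, 1.0 - overload * intensity)
    return allowed

if __name__ == "__main__":
    apps = {"streaming_kernel": 0.9, "cache_friendly_app": 0.2}
    print(throttle_rates(apps, congestion_level=0.8))
```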