Results 1 - 4 of 4
WANalytics: Analytics for a Geo-distributed Data-intensive World - In CIDR, 2015
"... ABSTRACT Large organizations today operate data centers around the globe where massive amounts of data are produced and consumed by local users. Despite their geographically diverse origin, such data must be analyzed/mined as a whole. We call the problem of supporting rich DAGs of computation acros ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
Abstract: Large organizations today operate data centers around the globe where massive amounts of data are produced and consumed by local users. Despite their geographically diverse origin, such data must be analyzed and mined as a whole. We call the problem of supporting rich DAGs of computation across geographically distributed data Wide-Area Big Data (WABD). To the best of our knowledge, WABD is neither supported by currently deployed systems nor sufficiently studied in the literature; it is addressed today by continuously copying raw data to a central location for analysis. We observe from production workloads that WABD is important for large organizations, and that centralized solutions incur substantial cross-data center network costs. We argue that these trends will only worsen as the gap between data volumes and transoceanic bandwidth widens. Further, emerging concerns over data sovereignty and privacy may trigger government regulations that could threaten the very viability of centralized solutions. To address WABD we propose WANalytics, a system that pushes computation to edge data centers, automatically optimizing workflow execution plans and replicating data when needed. Our Hadoop-based prototype delivers a 257× reduction in WAN bandwidth on a production workload from Microsoft. We round out our evaluation by demonstrating substantial gains on three standard benchmarks: TPC-CH, Berkeley Big Data, and BigBench.
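The abstract's core mechanism, pushing computation to edge data centers rather than centralizing raw data, comes down to a per-stage bandwidth comparison. A minimal sketch of that trade-off in Python; the plan_stage helper, its cost model, and the numbers are hypothetical illustrations, not WANalytics' actual optimizer:

    # Illustrative sketch: decide, for one DAG stage, whether to ship raw
    # inputs to a central site or push the operator to each edge data
    # center and ship only its output. (Hypothetical helper, not the
    # system's real API.)
    def plan_stage(raw_bytes_per_site, selectivity, num_sites):
        """selectivity = output_size / input_size of the operator:
        well below 1 for filters and partial aggregates, possibly above 1
        for expanding operators such as some joins."""
        centralize_cost = raw_bytes_per_site * num_sites           # ship all raw data
        push_cost = raw_bytes_per_site * selectivity * num_sites   # ship outputs only
        return "push" if push_cost < centralize_cost else "centralize"

    # A selective aggregate over 1 TB per site at 5 sites: pushing the
    # operator moves ~50 GB across the WAN instead of ~5 TB.
    print(plan_stage(raw_bytes_per_site=10**12, selectivity=0.01, num_sites=5))  # push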
Toward High-Performance Distributed Stream Processing via Approximate Fault Tolerance
"... Abstract Fault tolerance is critical for distributed stream processing systems, yet achieving error-free fault tolerance often incurs substantial performance overhead. We present AF-Stream, a distributed stream processing system that addresses the trade-off between performance and accuracy in fault ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract: Fault tolerance is critical for distributed stream processing systems, yet achieving error-free fault tolerance often incurs substantial performance overhead. We present AF-Stream, a distributed stream processing system that addresses the trade-off between performance and accuracy in fault tolerance. AF-Stream builds on a notion called approximate fault tolerance, whose idea is to mitigate backup overhead by issuing backups adaptively while ensuring that the errors upon failures are bounded with theoretical guarantees. Our AF-Stream design provides an extensible programming model for incorporating general streaming algorithms, and exports only a few threshold parameters for configuring approximate fault tolerance. Experiments on Amazon EC2 show that AF-Stream maintains high performance (compared to no fault tolerance) and high accuracy after multiple failures (compared to no failures) across various streaming algorithms.
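The adaptive-backup idea the abstract describes, checkpointing operator state only once it has drifted more than a user-set threshold from the last backup so that a failure loses at most that much, can be illustrated in a few lines. A minimal sketch assuming a single numeric operator state; the class ApproxBackupOperator and parameter theta are hypothetical, not AF-Stream's actual programming model:

    # Illustrative sketch of approximate fault tolerance: back up state
    # only when its divergence from the last backup exceeds a threshold
    # theta, so the state lost on a failure is bounded by theta.
    class ApproxBackupOperator:
        def __init__(self, theta):
            self.theta = theta        # user-configured error bound
            self.state = 0.0          # toy numeric state (e.g. a running sum)
            self.backed_up = 0.0      # state as of the last backup
            self.backups = 0          # backups actually issued

        def process(self, value):
            self.state += value
            # Adaptive backup: checkpoint only when the un-backed-up delta
            # exceeds theta, so at most ~theta of state is ever at risk.
            if abs(self.state - self.backed_up) > self.theta:
                self.backed_up = self.state   # stands in for a remote checkpoint
                self.backups += 1

        def recover(self):
            # Restart from the last backup; the recovered state is off by
            # at most theta, the configured approximation bound.
            self.state = self.backed_up

    op = ApproxBackupOperator(theta=100.0)
    for _ in range(10_000):
        op.process(0.5)
    print(op.backups)  # ~50 backups for 10,000 updates, not one per update
    assert abs(op.state - op.backed_up) <= op.theta  # bounded loss on failure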
WANalytics: Geo-Distributed Analytics for a Data Intensive World
"... ABSTRACT Many large organizations collect massive volumes of data each day in a geographically distributed fashion, at data centers around the globe. Despite their geographically diverse origin the data must be processed and analyzed as a whole to extract insight. We call the problem of supporting ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract: Many large organizations collect massive volumes of data each day in a geographically distributed fashion, at data centers around the globe. Despite their geographically diverse origin, the data must be processed and analyzed as a whole to extract insight. We call the problem of supporting large-scale geo-distributed analytics Wide-Area Big Data (WABD). To the best of our knowledge, WABD is currently addressed by copying all the data to a central data center where the analytics are run. This approach consumes expensive cross-data center bandwidth and is incompatible with the data sovereignty restrictions that are starting to take shape. We instead propose WANalytics, a system that solves the WABD problem by orchestrating distributed query execution and adjusting data replication across data centers to minimize bandwidth usage while respecting sovereignty requirements. WANalytics achieves up to a 360× reduction in data transfer cost compared to the centralized approach, on both real Microsoft production workloads and standard synthetic benchmarks including TPC-CH and Berkeley Big-Data. In this demonstration, attendees will interact with a live geo-scale, multi-data center deployment of WANalytics, experiencing the data transfer reduction the system achieves and exploring how it dynamically adapts its execution strategy in response to changes in the workload and environment.
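The demo's other lever, "adjusting data replication", can be pictured as a break-even test: copy a dataset to the querying data center once the recurring cross-data center traffic of re-shipping query inputs exceeds the one-time cost of a replica, unless sovereignty rules forbid moving the data. A minimal sketch under those assumptions; should_replicate and its inputs are hypothetical, not the system's actual policy:

    # Illustrative sketch: replicate a remote dataset when re-shipping
    # bytes for recurring queries costs more than a one-time copy, and
    # policy allows it. (Hypothetical helper, not WANalytics' algorithm.)
    def should_replicate(dataset_bytes, bytes_shipped_per_run,
                         runs_per_month, sovereignty_ok=True):
        if not sovereignty_ok:
            return False  # sovereignty restrictions can forbid moving raw data
        recurring_cost = bytes_shipped_per_run * runs_per_month
        return recurring_cost > dataset_bytes

    # Shipping 50 GB daily out of a 1 TB dataset: a replica pays for
    # itself within a month, so replication is the cheaper plan.
    print(should_replicate(10**12, 50 * 10**9, 30))  # True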
unknown title
"... In the last fifteen years, systems software has increased dramatically in complexity and power. Not so long ago, most computer programs ran on one machine. Distributed computing was a specialized problem for developers of networked services or supercomputer software. Today, cloud computing has made ..."
Abstract
- Add to MetaCart
(Show Context)
In the last fifteen years, systems software has increased dramatically in complexity and power. Not so long ago, most computer programs ran on one machine. Distributed computing was a specialized problem for developers of networked services or supercomputer software. Today, cloud computing has made it convenient to rent computational resources as needed, spread across hundreds or thousands of nodes. Modern systems software has made it feasible to harness these resources for data-intensive computing. As a result, many organizations are starting to rely on complex systems software to process big data.

Past work: Complex software systems can be dauntingly hard to design, manage, and use. These challenges can all be classed under the heading of "reliability." My past work has addressed reliability challenges from a variety of perspectives. I have built systems that cope gracefully with failures or resource shortages. I have used static analysis to understand how programs use and misuse configuration. And I have examined why developers use the tools they do, which often contribute to reliability challenges. These topics are interlinked. For example, resource management is one of the major sources of configuration trouble [14].