Results 1 - 10 of 16
Decentralized Task-Aware Scheduling for Data Center Networks
Cited by 14 (2 self)
Many data center applications perform rich and complex tasks (e.g., executing a search query or generating a user's news-feed). From a network perspective, these tasks typically comprise multiple flows, which traverse different parts of the network at potentially different times. Most network resource allocation schemes, however, treat all these flows in isolation – rather than as part of a task – and therefore only optimize flow-level metrics. In this paper, we show that task-aware network scheduling, which groups flows of a task and schedules them together, can reduce both the average as well as tail completion time for typical data center applications. To achieve these benefits in practice, we design and implement Baraat, a decentralized task-aware scheduling system. Baraat schedules tasks in a FIFO order but avoids head-of-line blocking by dynamically changing the level of multiplexing in the network. Through experiments with Memcached on a small testbed and large-scale simulations, we show that Baraat outperforms state-of-the-art decentralized schemes (e.g., pFabric) as well as centralized schedulers (e.g., Orchestra) for a wide range of workloads (e.g., search, analytics, etc.).
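The core idea of the abstract above – ordering whole tasks FIFO instead of scheduling each flow independently – can be sketched as follows. This is a minimal illustration, not Baraat's actual algorithm or API; the `Flow` class and `task_aware_order` function are assumptions made for the example.

```python
from collections import defaultdict

class Flow:
    def __init__(self, flow_id, task_id, task_arrival):
        self.flow_id = flow_id
        self.task_id = task_id
        self.task_arrival = task_arrival  # arrival time of the parent task

def task_aware_order(flows):
    """Group flows by task; serve whole tasks FIFO by task arrival time."""
    tasks = defaultdict(list)
    for f in flows:
        tasks[f.task_id].append(f)
    # FIFO over tasks: an earlier-arriving task finishes all its flows
    # before any flow of a later task is served
    ordered = sorted(tasks.values(), key=lambda fl: fl[0].task_arrival)
    return [f for task_flows in ordered for f in task_flows]

flows = [
    Flow("a1", task_id=2, task_arrival=5.0),
    Flow("b1", task_id=1, task_arrival=1.0),
    Flow("a2", task_id=2, task_arrival=5.0),
    Flow("b2", task_id=1, task_arrival=1.0),
]
# Both flows of task 1 come before any flow of task 2
print([f.flow_id for f in task_aware_order(flows)])
```

A flow-level scheduler might interleave `a1` and `b1`, delaying both tasks' completion; grouping by task lets the earlier task finish first, which is what reduces task completion times.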
Achieving Cost-efficient, Data-intensive Computing in the Cloud
Cited by 1 (1 self)
Cloud computing providers have recently begun to offer high-performance virtualized flash storage and virtualized network I/O capabilities, which have the potential to increase application performance. Since users pay for only the resources they use, these new resources have the potential to lower overall cost. Yet achieving low cost requires choosing the right mixture of resources, which is only possible if their performance and scaling behavior is known. In this paper, we present a systematic measurement of recently introduced virtualized storage and network I/O within Amazon Web Services (AWS). Our experience shows that there are scaling limitations in clusters relying on these new features. As a result, provisioning for a large-scale cluster differs substantially from small-scale deployments. We describe the implications of this observation for achieving efficiency in large-scale cloud deployments. To confirm the value of our methodology, we deploy cost-efficient, high-performance sorting of 100 TB as a large-scale evaluation.
NetAgg: Using Middleboxes for Application-specific On-path Aggregation in Data Centres
Cited by 1 (0 self)
Data centre applications for batch processing (e.g. map/reduce frameworks) and online services (e.g. search engines) scale by distributing data and computation across many servers. They typically follow a partition/aggregation pattern: tasks are first partitioned across servers that process data locally, and then those partial results are aggregated. This data aggregation step, however, shifts the performance bottleneck to the network, which typically struggles to support many-to-few, high-bandwidth traffic between servers. Instead of performing data aggregation at edge servers, we show that it can be done more efficiently along network paths. We describe NETAGG, a software platform that supports on-path aggregation for network-bound partition/aggregation applications. NETAGG exploits a middlebox-like design, in which dedicated servers (agg boxes) are connected by high-bandwidth links to network switches. Agg boxes execute aggregation functions provided by applications, which alleviates network hotspots because only a fraction of the incoming traffic is forwarded at each hop. NETAGG requires only minimal application changes: it uses shim layers on edge servers to redirect application traffic transparently to the agg boxes. Our experimental results show that NETAGG improves substantially the throughput of two sample applications, the Solr distributed search engine and the Hadoop batch processing framework. Its design allows for incremental deployment in existing data centres and incurs only a modest investment cost.
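The partition/aggregation pattern described above can be sketched with a toy word-count example: rather than all workers sending full partial results to one edge server, intermediate "agg boxes" combine results hop by hop, so each hop forwards only a fraction of the incoming traffic. The function names and topology here are illustrative assumptions, not NETAGG's interface.

```python
def aggregate(partials, combine):
    """Fold partial results with an application-supplied combiner."""
    acc = partials[0]
    for p in partials[1:]:
        acc = combine(acc, p)
    return acc

def merge_counts(a, b):
    """Combiner for word-count partials: sum counts per key."""
    out = dict(a)
    for k, v in b.items():
        out[k] = out.get(k, 0) + v
    return out

# Four workers each produce a partial word count; two hypothetical agg
# boxes combine pairs before the final merge at the edge server.
workers = [{"x": 1}, {"x": 2, "y": 1}, {"y": 3}, {"x": 1, "z": 2}]
box1 = aggregate(workers[:2], merge_counts)    # first agg box
box2 = aggregate(workers[2:], merge_counts)    # second agg box
final = aggregate([box1, box2], merge_counts)  # edge server merge
print(final)
```

The edge server receives two pre-aggregated results instead of four raw ones; in a real deployment the reduction factor at each hop is what relieves the many-to-few bottleneck.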
A Parallel Distributed Weka Framework for Big Data Mining using Spark
Cited by 1 (0 self)
Effective Big Data Mining requires scalable and efficient solutions that are also accessible to users of all levels of expertise. Despite this, many current efforts to provide effective knowledge extraction via large-scale Big Data Mining tools focus more on performance than on usability and tuning, which are complex problems even for experts. Weka is a popular and comprehensive Data Mining workbench with a well-known and intuitive interface; nonetheless it supports only sequential single-node execution. Hence, the size of the datasets and processing tasks that Weka can handle within its existing environment is limited both by the amount of memory in a single node and by sequential execution. This work discusses DistributedWekaSpark, a distributed framework for Weka which maintains its existing user interface. The framework is implemented on top of Spark, a Hadoop-related distributed framework with fast in-memory processing capabilities and support for iterative computations. By combining Weka's usability and Spark's processing power, DistributedWekaSpark provides a usable prototype distributed Big Data Mining workbench that achieves near-linear scaling when executing various real-world-scale workloads: 91.4% weak scaling efficiency on average, and up to 4x faster on average than Hadoop.
HFSP: Bringing Size-Based Scheduling To Hadoop
Size-based scheduling with aging has been recognized as an effective approach to guarantee fairness and near-optimal system response times. We present HFSP, a scheduler that brings this technique to a real, multi-server, complex, and widely used system: Hadoop. Size-based scheduling requires a priori job size information, which is not available in Hadoop; HFSP builds such knowledge by estimating it online during job execution. Our experiments, based on realistic workloads generated via a standard benchmarking suite, point to a significant decrease in system response times with respect to the widely used Hadoop Fair Scheduler, without impacting fairness, and show that HFSP is largely tolerant to job size estimation errors.
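The combination named above – shortest-size-first scheduling tempered by aging so that large jobs are not starved by a stream of small ones – can be illustrated with a toy priority function. This is a minimal sketch, not HFSP's actual policy; the linear `aging_rate` and the job tuples are assumptions for the example.

```python
def pick_next(jobs, now, aging_rate=0.5):
    """Pick the next job to run.

    jobs: list of (name, estimated_size, arrival_time) tuples.
    Priority = estimated size minus an aging credit for time waited;
    the job with the lowest priority value runs next.
    """
    def priority(job):
        name, est_size, arrival = job
        waited = now - arrival
        return est_size - aging_rate * waited  # lower is better
    return min(jobs, key=priority)[0]

# A fresh small job beats a big job that has waited only briefly...
jobs_early = [("big", 100.0, 0.0), ("small", 10.0, 50.0)]
print(pick_next(jobs_early, now=50.0))

# ...but once the big job has waited long enough, aging lets it beat
# a newly arrived small job, bounding starvation.
jobs_late = [("big", 100.0, 0.0), ("small", 10.0, 200.0)]
print(pick_next(jobs_late, now=200.0))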
CAST: Tiering Storage for Data Analytics in the Cloud
Enterprises are increasingly moving their big data analytics to the cloud with the goal of reducing costs without sacrificing application performance. Cloud service providers offer their tenants a myriad of storage options, which while flexible, makes the choice of storage deployment non trivial. Crafting deployment scenarios to leverage these choices in a cost-effective manner — under the unique pricing models and multi-tenancy dynamics of the cloud environment — presents unique challenges in designing cloud-based data analytics frameworks. In this paper, we propose Cast, a Cloud Analytics Storage Tiering solution that cloud tenants can use to reduce monetary cost and improve performance of analytics workloads. The approach takes the first step towards providing storage tiering support for data analytics in the cloud. Cast performs offline workload profiling to construct job performance prediction models on different cloud storage services, and combines these models with workload specifications and high-level tenant goals to generate a cost-effective data placement and storage provisioning plan. Furthermore, we build Cast++ to enhance Cast's optimization model by incorporating data reuse patterns and cross-job interdependencies common in realistic analytics workloads. Tests with production workload traces from Facebook and a 400-core Google Cloud based Hadoop cluster demonstrate that Cast++ achieves 1.21× performance and reduces deployment costs by 51.4% compared to local storage configuration.
RCMP: Enabling Efficient Recomputation Based Failure Resilience for Big Data Analytics
Data replication, the main failure resilience strategy used for big data analytics jobs, can be unnecessarily inefficient. It can cause serious performance degradation when applied to intermediate job outputs in multi-job computations. For instance, for I/O-intensive big data jobs, data replication is especially expensive because very large datasets need to be replicated. Reducing the number of replicas is not a satisfactory solution as it only aggravates a fundamental limitation of data replication: its failure resilience guarantees are limited by the number of available replicas. When all replicas of some piece of intermediate job output are lost, cascading job recomputations may be required for recovery. In this paper we show how job recomputation can be made a first-order failure resilience strategy for big data analytics. The need for data replication can thus be significantly reduced. We present RCMP, a system that performs efficient job recomputation. RCMP can persist task outputs across jobs and leverage them to minimize the work performed during job recomputations. More importantly, RCMP addresses two important challenges that appear during job recomputations. The first is efficiently utilizing the available compute node parallelism. The second is dealing with hot-spots. RCMP handles both by switching to a finer-grained task scheduling granularity for recomputations. Our experiments show that RCMP's benefits hold across two different clusters, for job inputs as small as 40GB or as large as 1.2TB. Compared to RCMP, data replication is 30%-100% worse during failure-free periods. More importantly, by efficiently performing recomputations, RCMP is comparable or better even under single and double data loss events.
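The two mechanisms the abstract names – reusing persisted task outputs and recomputing only the lost partitions at a finer task granularity – can be sketched together. This is an illustrative toy, not RCMP's implementation; the partition numbering, the `recompute` callback, and the `split` factor are assumptions for the example.

```python
def recover(partitions, persisted, recompute, split=4):
    """Rebuild a dataset after data loss.

    partitions: ids of all partitions the downstream job needs.
    persisted:  {partition_id: value} outputs that survived the failure.
    recompute:  callback (partition, sub_index, split) -> partial value;
                each lost partition is split into 'split' finer sub-tasks
                so recovery can use all available node parallelism.
    """
    out = {}
    for p in partitions:
        if p in persisted:
            out[p] = persisted[p]  # reuse surviving output, no recompute
        else:
            # finer granularity: the sub-tasks of one lost partition can
            # run on different nodes, avoiding recovery hot-spots
            subs = [recompute(p, i, split) for i in range(split)]
            out[p] = sum(subs)
    return out

persisted = {0: 10, 2: 30}  # partition 1 was lost in the failure
rebuilt = recover([0, 1, 2], persisted,
                  recompute=lambda p, i, n: (p + 1) * 10 // n)
print(rebuilt)
```

The contrast with replication is that nothing is copied during failure-free operation; the cost is paid only when a loss actually occurs, and only for the partitions that were lost.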
AJIRA: a Lightweight Distributed Middleware for MapReduce and Stream Processing
Currently, MapReduce is the most popular programming model for large-scale data processing, and this has motivated the research community to improve its efficiency with new extensions, algorithmic optimizations, or hardware. In this paper we address two main limitations of MapReduce: one relates to the model's limited expressiveness, which prevents the implementation of complex programs that require multiple steps or iterations. The other relates to the efficiency of its most popular implementations (e.g., Hadoop), which provide good resource utilization only for massive volumes of input, operating suboptimally for smaller or rapidly changing input. To address these limitations, we present AJIRA, a new middleware designed for efficient and generic data processing. At a conceptual level, AJIRA replaces the traditional map/reduce primitives with generic operators that can be dynamically allocated, allowing the execution of more complex batch and stream processing jobs. At a more technical level, AJIRA adopts a distributed, multi-threaded architecture that strives to minimize overhead for non-critical functionality. These characteristics allow AJIRA to be used as a single programming model for both batch and stream processing. To this end, we evaluated its performance against Hadoop, Spark, Esper, and Storm, which are state-of-the-art systems for batch and stream processing. Our evaluation shows that AJIRA is competitive in a wide range of scenarios both in terms of processing time and scalability, making it an ideal choice where flexibility, extensibility, and the processing of both large and dynamic data with a single programming model are either desirable or even mandatory requirements.
Scale up Vs. Scale out in Cloud Storage and Graph Processing Systems
Deployers of cloud storage and iterative processing systems typically have to deal with either dollar budget constraints or throughput requirements. This paper examines the question of whether such cloud storage and iterative processing systems are more cost-efficient when scheduled on a COTS (scale out) cluster or a single beefy (scale up) machine. We experimentally evaluate two systems: 1) a distributed key-value store (Cassandra), and 2) a distributed graph processing system (GraphLab). Our studies reveal scenarios where each option is preferable over the other. We provide recommendations for deployers of such systems to decide between scale up vs. scale out, as a function of their dollar or throughput constraints. Our results indicate that there is a need for adaptive scheduling in heterogeneous clusters containing scale up and scale out nodes.
Modeling the Impact of Workload on Cloud Resource Scaling
Cloud computing offers the flexibility to dynamically size the infrastructure in response to changes in workload demand. While both horizontal and vertical scaling of infrastructure is supported by major cloud providers, these scaling options differ significantly in terms of their cost, provisioning time, and their impact on workload performance. Importantly, the efficacy of horizontal and vertical scaling critically depends on the workload characteristics, such as the workload's parallelizability and its core scalability. In today's cloud systems, the scaling decision is left to the users, requiring them to fully understand the tradeoffs associated with the different scaling options. In this paper, we present our solution for optimizing the resource scaling of cloud deployments via implementation in OpenStack. The key component of our solution is the modeling engine that characterizes the workload and then quantitatively evaluates different scaling options for that workload. Our modeling engine leverages Amdahl's Law to model service time scaling in scale-up environments and queueing-theoretic concepts to model performance scaling in scale-out environments. We further employ Kalman filtering to account for inaccuracies in the model-based methodology, and to dynamically track changes in the workload and cloud environment.
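The two models the abstract names can be written out directly: Amdahl's Law for scale-up (more cores shrink only the parallelizable fraction of service time) and a simple M/M/1 approximation for scale-out (spreading the arrival rate over more servers lowers each server's load). The parameter values below are illustrative assumptions, not figures from the paper.

```python
def amdahl_speedup(cores, parallel_frac):
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n), where p is the
    parallelizable fraction of the work and n the number of cores."""
    return 1.0 / ((1.0 - parallel_frac) + parallel_frac / cores)

def scale_out_response_time(arrival_rate, service_rate, servers):
    """M/M/1 approximation per server: T = 1 / (mu - lambda / k),
    with the total arrival rate split evenly across k servers."""
    per_server_load = arrival_rate / servers
    assert per_server_load < service_rate, "system must be stable"
    return 1.0 / (service_rate - per_server_load)

# Scale-up: with a 90% parallelizable workload, 8 cores give well under
# 8x speedup -- the serial 10% dominates as cores are added.
print(round(amdahl_speedup(8, 0.9), 2))

# Scale-out: doubling servers at 8 req/s arrival, 10 req/s service rate
# cuts per-server load from 0.8 to 0.4 and shortens response time.
print(round(scale_out_response_time(8.0, 10.0, 1), 3))
print(round(scale_out_response_time(8.0, 10.0, 2), 3))
```

A modeling engine like the one described would fit `parallel_frac`, `service_rate`, and the arrival rate from measurements (the paper uses Kalman filtering to keep those estimates current) and then compare the two projections per workload.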