Results 1 - 8 of 8
Falkon: a Fast and Light-weight tasK executiON framework
- IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (SC07), 2007
Cited by 44 (20 self)
To enable the rapid execution of many tasks on compute clusters, we have developed Falkon, a Fast and Light-weight tasK executiON framework. Falkon integrates (1) multi-level scheduling to separate resource acquisition (via, e.g., requests to batch schedulers) from task dispatch, and (2) a streamlined dispatcher. Falkon’s integration of multi-level scheduling and streamlined dispatchers delivers performance not provided by any other system. We describe the Falkon architecture and implementation, and present performance results for both microbenchmarks and applications. Microbenchmarks show that Falkon throughput (487 tasks/sec) and scalability (to 54,000 executors and 2,000,000 tasks processed in just 112 minutes) are one to two orders of magnitude better than other systems used in production Grids. Large-scale astronomy and medical applications executed under Falkon by the Swift parallel programming system achieve up to a 90% reduction in end-to-end run time, relative to versions that execute tasks via separate scheduler submissions.
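As a rough reading aid for the scalability figure above, the average rate implied by 2,000,000 tasks in 112 minutes can be computed directly. This is plain arithmetic on the numbers quoted in the abstract, not a result taken from the paper; the function name `average_task_rate` is introduced here purely for illustration.

```python
def average_task_rate(total_tasks: int, minutes: float) -> float:
    """Average tasks per second over a whole run (illustrative helper)."""
    return total_tasks / (minutes * 60)


# Figures quoted in the abstract: 2,000,000 tasks processed in 112 minutes.
rate = average_task_rate(2_000_000, 112)
print(f"{rate:.1f} tasks/sec averaged over the run")
```

The average over a long run comes out lower than the 487 tasks/sec throughput figure, which the abstract reports as a separate microbenchmark number.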
Accelerating Large-Scale Data Exploration through Data Diffusion
- ACM International Workshop on Data-Aware Distributed Computing 2008
Cited by 15 (12 self)
Data-intensive applications often require exploratory analysis of large datasets. If analysis is performed on distributed resources, data locality can be crucial to high throughput and performance. We propose a “data diffusion” approach that acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data. As demand increases, more resources are acquired, thus allowing faster response to subsequent requests that refer to the same data; when demand drops, resources are released. This approach can provide the benefits of dedicated hardware without the associated high costs, depending on workload and resource characteristics. The approach is reminiscent of cooperative caching, web caching, and peer-to-peer storage systems, but addresses different application demands. Other data-aware scheduling approaches assume dedicated resources, which can be expensive and/or inefficient if load varies significantly. To explore the feasibility of the data diffusion approach, we have extended the Falkon resource provisioning and task scheduling system to support data caching and data-aware scheduling. Performance results from both microbenchmarks and a large-scale astronomy application demonstrate that our approach improves performance relative to alternative approaches and provides improved scalability, as aggregated I/O bandwidth scales linearly with the number of data cache nodes.
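The idea of scheduling computations close to cached data can be illustrated with a minimal sketch. This is not Falkon's scheduler: the class `DataAwareDispatcher`, its fields, and the tie-breaking rule (least-loaded executor) are all assumptions made for illustration, and the sketch ignores cache eviction and resource acquisition or release.

```python
from collections import defaultdict


class DataAwareDispatcher:
    """Toy dispatcher (hypothetical): prefer executors that already cache the file."""

    def __init__(self, executors):
        self.executors = list(executors)
        self.cache = defaultdict(set)              # executor -> cached file names
        self.load = {e: 0 for e in self.executors}  # executor -> tasks dispatched

    def dispatch(self, filename):
        """Pick an executor for a task that reads `filename`.

        Returns (executor, cache_hit).
        """
        holders = [e for e in self.executors if filename in self.cache[e]]
        # Fall back to any executor when no one holds the file yet.
        candidates = holders or self.executors
        chosen = min(candidates, key=lambda e: self.load[e])
        hit = filename in self.cache[chosen]
        self.cache[chosen].add(filename)  # on a miss, the file is now cached here
        self.load[chosen] += 1
        return chosen, hit


d = DataAwareDispatcher(["n1", "n2"])
print(d.dispatch("a.fits"))  # first access: cache miss
print(d.dispatch("a.fits"))  # repeat access goes to the node holding the file
print(d.dispatch("b.fits"))  # new file goes to the less loaded node
```

In this toy version a file is only ever cached on one node; replicating popular files across several nodes in response to demand, as the abstract describes, would need an additional policy.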
Harnessing Grid Resources with Data-Centric Task Farms
, 2007
Cited by 5 (4 self)
As the size of scientific data sets and the resources required for analysis increase, data locality becomes crucial to the efficient use of large-scale distributed systems for scientific and data-intensive applications. In order to support interactive analysis of large quantities of data in many scientific disciplines, we propose a data diffusion approach, in which the resources required for data analysis are acquired dynamically, in response to demand. Acquired resources (compute and storage) can be “cached” for some time, thus allowing more rapid responses to subsequent requests. We define an abstract model for data-centric task farms as a common parallel pattern that drives the independent computational tasks, taking into consideration the data locality in order to optimize the performance of the analysis of large datasets. This approach can provide the benefits of dedicated hardware without the associated high costs. We will validate our abstract model through discrete-event simulations; we expect simulations to show the model is both efficient and scalable given a wide range of simulation parameters. To explore the practical realization of our abstract model, we have developed a Fast and Light-weight tasK executiON framework (Falkon). Falkon provides for dynamic acquisition and release of resources, data management capabilities, and the dispatch of analysis tasks via a data-aware scheduler. We have
Dynamic Resource Provisioning in Grid Environments
, 2007
Cited by 4 (3 self)
Batch schedulers commonly used to manage access to parallel computing clusters are not typically configured to enable easy configuration of application-specific scheduling policies. In addition, their sophisticated scheduling algorithms can be relatively expensive to execute. Thus, for example, applications that require the rapid execution of many small tasks often do not perform well. It has been proposed that these problems be overcome by separating the two tasks of provisioning and scheduling. This paper focuses on resource provisioning, the various allocation and de-allocation policies, and how dynamic and adaptive provisioning can be under varying workloads. We couple the proposed dynamic resource provisioning (DRP) with an existing system, Falkon, which is used for the scheduling of tasks to the provisioned resources. We describe the DRP architecture and implementation, and present performance results for both microbenchmarks and applications. Microbenchmarks show that DRP can allocate resources on the order of tens of seconds across multiple Grid sites and can reduce average queue wait times by up to 95% (effectively yielding queue wait times within 3% of ideal); furthermore, large-scale astronomy and medical applications (executed by the Swift parallel programming system) reduce end-to-end run time by up to 90%, relative to versions that execute tasks via separate scheduler submissions.
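To make the notion of allocation and de-allocation policies concrete, here is a minimal sketch of one possible policy step. It is a hypothetical example, not one of the policies evaluated in the paper: the function `provisioning_step`, its parameters, and the idle-timeout rule are all assumptions chosen for illustration.

```python
def provisioning_step(queued_tasks, idle_seconds, allocated, max_executors,
                      tasks_per_executor=1, idle_timeout=60):
    """Decide one step of a hypothetical provisioning policy.

    queued_tasks:  number of tasks waiting for dispatch
    idle_seconds:  dict mapping executor id -> seconds it has been idle
    allocated:     number of executors currently allocated
    Returns (number_to_allocate, list_of_executor_ids_to_release).
    """
    # Allocation: ask for enough executors to cover the queue, up to a cap.
    wanted = min(max_executors, -(-queued_tasks // tasks_per_executor))  # ceiling division
    to_allocate = max(0, wanted - allocated)

    # De-allocation: release executors idle past the timeout, but only
    # when nothing is waiting in the queue.
    to_release = []
    if queued_tasks == 0:
        to_release = [e for e, idle in idle_seconds.items() if idle >= idle_timeout]
    return to_allocate, to_release


print(provisioning_step(10, {}, allocated=4, max_executors=8))
print(provisioning_step(0, {"e1": 120, "e2": 5}, allocated=2, max_executors=8))
```

Separating this decision from task dispatch is the point of the two-level design the abstract describes: the provisioner only decides how many resources to hold, while the scheduler decides which task runs where.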
Storage and Compute Resource Management via DYRE, 3DcacheGrid, and CompuStore
, 2006
Cited by 3 (2 self)
Both industry and academia have an increasing demand for good policies and mechanisms to efficiently manage large datasets and the large pools of compute resources that ultimately perform computations on them. We define large datasets to contain millions of objects and terabytes of data; these kinds of datasets are common in the Astronomy domain [1, 2, 3], the Medical Imaging domain [4], and many other science domains. A large pool of compute resources could be anything from 10s to 1000s of separate, geographically distributed physical resources. In order to enable the efficient dynamic analysis of large datasets, we propose three different systems that can interoperate with each other in order to offer a complete storage and resource management solution. The first system is DYRE, DYnamic Resource pool Engine, which is an AstroPortal-specific implementation of dynamic resource provisioning [9]. DYRE essentially handles all the necessary tasks associated with state monitoring, resource allocation based on observed state, resource de-allocation based on observed state, and exposing relevant information to other systems. The main motivations behind dynamic resource provisioning are:
• It allows for finer-grained resource management, including the control of priorities and usage policies
• It optimizes for the grid user’s perspective by reducing per-job scheduling delays through the use of pre-reserved resources
• It gives the Resource Provider the perception that resource utilization is higher than it would normally be
Accelerating Large Scale Scientific Exploration through Data Diffusion
- IEEE International Workshop on Data-Aware Distributed Computing, 2008
Cited by 3 (1 self)
Scientific and data-intensive applications often require exploratory analysis on large datasets, which is often carried out on large-scale distributed resources where data locality is crucial to achieving high system throughput and performance. We propose a “data diffusion” approach that acquires resources for data analysis dynamically, schedules computations as close to data as possible, and replicates data in response to workloads. As demand increases, more resources are acquired and “cached” to allow faster response to subsequent requests; resources are released when demand drops. This approach can provide the benefits of dedicated hardware without the associated high costs, depending on the application workloads and the performance characteristics of the underlying infrastructure. This data diffusion concept is reminiscent of cooperative Web-caching and peer-to-peer storage systems. Other data-aware scheduling approaches assume static or dedicated resources, which can be expensive and inefficient if load varies significantly. The challenge to our approach is that we need to co-allocate storage resources with computation resources in order to enable the efficient analysis of possibly terabytes of data without prior knowledge of the characteristics of application workloads. To explore the proposed data diffusion, we have developed Falkon, which provides dynamic acquisition and release of resources and the dispatch of analysis tasks to those resources. We have extended Falkon to allow the compute resources to cache data to local disks, and perform task dispatch via a data-aware scheduler. The integration of Falkon and the Swift parallel programming system provides us with access to a large number of applications from astronomy, astrophysics, medicine, and other domains, with varying datasets, workloads, and analysis codes.
Data Diffusion Delivers Dynamic Digging
Cited by 1 (1 self)
We want to support interactive analysis (“digging”) of large quantities of data, a requirement that arises, for example, in many scientific disciplines. Such analyses require turnaround measured in minutes or seconds. Achieving this performance can demand hundreds of computers to process what may be many terabytes of data. As the applications scale, data sets grow, and resources used increase, data locality will be crucial to the successful and efficient use of large-scale distributed systems for many scientific and data-intensive applications. [9] One approach to delivering such performance, adopted, for example, by Google [5, 13], is to build large compute-storage farms dedicated to storing data and responding to user requests for processing. However, such approaches can be expensive (in terms of idle resources) if load varies significantly over the
Data Diffusion: Dynamic Resource Provision and Data-Aware Scheduling for Data-Intensive Applications
Data-intensive applications often involve the analysis of large datasets that require large amounts of compute and storage resources. While dedicated compute and/or storage farms offer good task/data throughput, they suffer from low resource utilization under varying workload conditions. If we instead move such data to distributed computing resources, then we incur expensive data transfer costs. In this paper, we propose a data diffusion approach that combines dynamic resource provisioning, on-demand data replication and caching, and data locality-aware scheduling to achieve improved resource efficiency under varying workloads. We define an abstract “data diffusion model” that takes into consideration the workload characteristics, data accessing cost, application throughput, and resource utilization; we validate the model using a real-world large-scale astronomy application. Our results show that data diffusion can increase the performance index by as much as 34X, and improve application response time by over 506X, while achieving near-optimal throughputs and execution times.

