DAGuE: A generic distributed DAG engine for high performance computing
, 2010
Cited by 67 (21 self)
The frenetic development of current architectures places a strain on state-of-the-art programming environments. Harnessing the full potential of such architectures has been a tremendous task for the whole scientific computing community. We present DAGuE, a generic framework for architecture-aware scheduling and management of micro-tasks on distributed many-core heterogeneous architectures. The applications we consider can be represented as a Directed Acyclic Graph (DAG) of tasks with labeled edges designating data dependencies. DAGs are represented in a compact, problem-size-independent format that can be queried on demand to discover data dependencies in a totally distributed fashion. DAGuE assigns computation threads to the cores, overlaps communications and computations, and uses a dynamic, fully distributed scheduler based on cache awareness, data locality, and task priority. We demonstrate the efficiency of our approach using several micro-benchmarks to analyze the performance of different components of the framework, and a linear algebra factorization as a use case.
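To make the idea of dependency-driven, priority-based task scheduling concrete, here is a minimal single-process sketch in Python. It is not DAGuE's scheduler or its compact DAG format: the explicit `deps` dictionary, the `priority` table, and the function name `schedule` are all assumptions made for illustration, and cache awareness, data locality, and distribution are left out.

```python
import heapq


def schedule(tasks, deps, priority):
    """Return an execution order that respects deps, preferring high priority.

    tasks:    iterable of task names (hypothetical example input)
    deps:     dict mapping a task to the set of tasks it depends on
    priority: dict mapping a task to a number (larger runs first)
    """
    tasks = list(tasks)
    # Number of unfinished predecessors for each task.
    remaining = {t: len(deps.get(t, ())) for t in tasks}
    # Reverse edges: which tasks become closer to ready when t finishes.
    successors = {t: [] for t in tasks}
    for t, preds in deps.items():
        for p in preds:
            successors[p].append(t)

    # Min-heap on negated priority, so the highest priority pops first.
    ready = [(-priority[t], t) for t in tasks if remaining[t] == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, t = heapq.heappop(ready)
        order.append(t)
        for s in successors[t]:
            remaining[s] -= 1
            if remaining[s] == 0:
                heapq.heappush(ready, (-priority[s], s))

    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected")
    return order
```

For example, with tasks `a`, `b`, `c`, `d`, where `b` and `c` depend on `a` and `d` depends on both, giving `c` a higher priority than `b` makes it run as soon as `a` completes. A real engine would hand ready tasks to worker threads instead of appending them to a list.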
The Grid Workloads Archive
, 2008
Cited by 42 (12 self)
While large grids are currently supporting the work of thousands of scientists, very little is known about their actual use. Because of strict organizational permissions, there are few or no traces of grid workloads available to the grid researcher and practitioner. To address this problem, in this work we present the Grid Workloads Archive (GWA), which is at the same time a workload data exchange and a meeting point for the grid community. We define the requirements for building a workloads archive, and describe the approach taken to meet these requirements with the GWA. We introduce a format for sharing grid workload information, and tools associated with this format. Using these tools, we collect and analyze data from nine well-known grid environments, with a total content of more than 2000 users submitting more than 7 million jobs over a period of over 13 operational years, and with working environments spanning over 130 sites comprising 10000 resources. We show evidence that grid workloads are very different from those encountered in other large-scale environments, and in particular from the workloads of parallel production environments: they comprise almost exclusively single-node jobs, and jobs arrive in "bags-of-tasks". Finally, we present the immediate applications of the GWA and of its content in several critical grid research and practical areas: research in grid resource management, and grid design, operation, and maintenance.
BlobCR: Efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots
- in SC ’11: 24th International Conference for High Performance Computing, Networking, Storage and Analysis
, 2011
Cited by 24 (7 self)
Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running scientific applications. Given the dynamic nature of IaaS clouds and the long runtime and resource utilization of such applications, an efficient checkpoint-restart mechanism becomes paramount in this context. This paper proposes a solution to the aforementioned challenge that aims at minimizing the storage space and performance overhead of checkpoint-restart. We introduce an approach that leverages virtual machine (VM) disk-image multi-snapshotting and multi-deployment inside checkpoint-restart protocols running at guest level in order to efficiently capture and potentially roll back the complete state of the application, including file system modifications. Experiments on the G5K testbed show substantial improvement for MPI applications over existing approaches, both for the case when customized checkpointing is available at application level and the case when it needs to be handled at process level. Categories and Subject Descriptors: D.3.4 [Systems and Software]: Distributed systems
The characteristics and performance of groups of jobs in grids
- In Euro-Par, volume 4641 of LNCS
, 2007
Cited by 17 (9 self)
Even though, with few exceptions, grid workloads are dominated by single-node jobs, not all of these jobs are necessarily independent or unrelated. For instance, sets of jobs may be grouped because they are submitted by users in batches, e.g., to perform parameter sweeps. However, there is no reported data to confirm the presence and structure of these groupings, despite the large potential impact of such information. To address this lack of information, in this work we present a first investigation into the characteristics of groups of jobs present in grid workloads. First, we define three types of job groupings: batch, continued, and bursty submissions. Then, we analyze the characteristics of these groupings for three long-term traces from currently deployed grid environments. Notably, our results show that the various groupings are responsible for up to 96% of the total CPU time consumption. Finally, we present insights into the performance of real grids in dealing with grouped jobs.
Enabling high data throughput in desktop grids through decentralized data and metadata management: The BlobSeer approach
Cited by 17 (12 self)
Whereas traditional Desktop Grids rely on centralized servers for data management, some recent progress has been made to enable distributed, large input data, using peer-to-peer (P2P) protocols and Content Distribution Networks (CDN). We go a step further and propose a generic yet efficient data storage service that enables the use of Desktop Grids for applications with high output data requirements, where the access grain and the access patterns may be random. Our solution builds on a blob management service enabling a large number of concurrent clients to efficiently read, write, and append huge data that are fragmented and distributed at a large scale. Scalability under heavy concurrency is achieved thanks to an original metadata scheme using a distributed segment tree built on top of a Distributed Hash Table (DHT). The proposed approach has been implemented and its benefits have been successfully demonstrated within our BlobSeer prototype on the Grid’5000 testbed.
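The metadata scheme mentioned above, a segment tree whose nodes live in a hash table, can be sketched roughly as follows. This is an illustration under stated assumptions and not BlobSeer's actual data layout: a plain Python `dict` stands in for the DHT, nodes are keyed by a hypothetical `(offset, size)` pair counted in pages, and the provider names are made up.

```python
def build(dht, offset, size, page_locations):
    """Store the segment-tree node covering pages [offset, offset + size)."""
    if size == 1:
        # Leaf: records which (hypothetical) provider holds this page.
        dht[(offset, size)] = {"provider": page_locations[offset]}
        return
    half = size // 2
    left = (offset, half)
    right = (offset + half, size - half)
    # Inner node: stores only the keys of its two children.
    dht[(offset, size)] = {"children": [left, right]}
    build(dht, left[0], left[1], page_locations)
    build(dht, right[0], right[1], page_locations)


def lookup(dht, root, start, end):
    """Return (page, provider) pairs for all pages in [start, end)."""
    result = []
    stack = [root]
    while stack:
        offset, size = stack.pop()
        if offset >= end or offset + size <= start:
            continue  # node does not intersect the requested range
        node = dht[(offset, size)]  # one hash-table fetch per visited node
        if "provider" in node:
            result.append((offset, node["provider"]))
        else:
            # Push the right child first so the left child is visited first.
            stack.extend(reversed(node["children"]))
    return result
```

Because every node is an independent key/value pair, clients reading disjoint ranges fetch mostly different keys, which is one plausible reading of why such a scheme helps under heavy concurrency. A range read touches only the nodes whose interval intersects the request.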
Experiments in parallel constraint-based local search
- Evolutionary Computation in Combinatorial Optimization - 11th European Conference, EvoCOP 2011
Cited by 13 (11 self)
We present a parallel implementation of a constraint-based local search algorithm and investigate its performance on hardware with several hundreds of processors. As the basic constraint solving algorithm for these experiments we choose the "adaptive search" method, an efficient sequential local search method for Constraint Satisfaction Problems (CSPs). The implemented algorithm is a parallel version of adaptive search in a multiple independent-walk manner; that is, each process is an independent search engine and there is no communication between the simultaneous computations. Preliminary performance evaluation on a variety of classical CSP benchmarks shows that speedups are very good for a few tens of processors, and good up to a few hundreds of processors.
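The multiple independent-walk scheme can be illustrated with a small Python sketch. Several assumptions apply: the local search here is a plain min-conflicts heuristic on n-queens, swapped in for the adaptive search method the paper uses; the walks are run one after another as a stand-in for separate processes; and the names `walk` and `multi_walk` are hypothetical.

```python
import random


def conflicts(queens, col):
    """Number of queens attacking the queen in column col."""
    return sum(
        1
        for other in range(len(queens))
        if other != col
        and (queens[other] == queens[col]
             or abs(queens[other] - queens[col]) == abs(other - col))
    )


def walk(n, seed, max_iters=10_000):
    """One independent walk: min-conflicts local search on n-queens."""
    rng = random.Random(seed)  # private generator, no shared state
    queens = [rng.randrange(n) for _ in range(n)]
    for iteration in range(max_iters):
        bad = [c for c in range(n) if conflicts(queens, c) > 0]
        if not bad:
            return queens, iteration
        col = rng.choice(bad)
        # Score every row for this column, then move to a least-conflict row.
        scores = []
        for row in range(n):
            queens[col] = row
            scores.append(conflicts(queens, col))
        best = min(scores)
        queens[col] = rng.choice([r for r, s in enumerate(scores) if s == best])
    return None, max_iters


def multi_walk(n, seeds):
    """Run one walk per seed and keep the one that needed fewest iterations.

    On a parallel machine each walk would run on its own processor and the
    first to finish would stop the others; this loop only simulates that.
    """
    results = [walk(n, seed) for seed in seeds]
    solved = [(iters, q) for q, iters in results if q is not None]
    return min(solved) if solved else None
```

Since the walks share nothing but the problem instance, the parallel running time is governed by the luckiest walk, which is why adding processors can help even without any communication.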
Performance Analysis of Allocation Policies for InterGrid Resource Provisioning
, 2008
Cited by 13 (4 self)
Several Grids have been established and used for varying science applications during the last years. Most of these Grids, however, work in isolation and with different utilisation levels. Previous work introduced an architecture and a mechanism to enable resource sharing amongst Grids. It demonstrated that there can be benefits for a Grid to offload requests or provide spare resources to another Grid, thus reducing the cost of over-provisioning. These benefits derive from the fact that resource utilisation within a Grid has fixed and operational costs such as those with electricity providers and system administrators. In this work, we address the problem of resource provisioning to Grid applications in multiple-Grid environments. The provisioning is carried out based on availability information obtained from queueing-based resource management systems deployed at the provider sites that participate in the Grids. We evaluate the performance of different allocation policies. In contrast to existing work on load sharing across Grids, the policies described here take into account the local load of resource providers, imprecise availability information, and the monetary compensation of providers. In addition, we evaluate these policies along with a mechanism that allows resource sharing amongst Grids. Experimental results obtained through simulation show that the mechanism and policies are effective in redirecting requests, thus improving the applications’ average weighted response time.
Scheduling Parallel Task Graphs on (Almost) Homogeneous Multi-cluster Platforms
- In Proceedings of the 10th Workshop on Job Scheduling Strategies for Parallel Processing
, 2004
Cited by 10 (3 self)
Applications structured as parallel task graphs exhibit both data and task parallelism, and arise in many domains. Scheduling these applications efficiently on parallel platforms has been a long-standing challenge. In the case of a single homogeneous platform, such as a cluster, results have been obtained both in theory, i.e., guaranteed algorithms, and in practice, i.e., pragmatic heuristics. Due to task parallelism these applications are well suited for execution on distributed platforms that span multiple clusters possibly in multiple institutions. However, the only available results in this context are non-guaranteed heuristics. In this paper we develop a scheduling algorithm, MCGAS, which is applicable to multi-cluster platforms that are almost homogeneous. Such platforms are often found as large subsets of multi-cluster platforms. Our novel contribution is that MCGAS computes task allocations so that a (tunable) performance guarantee is provided. Since a performance guarantee does not necessarily imply good average performance in practice, we also compare MCGAS with a recently proposed non-guaranteed algorithm. Using simulation over a wide range of experimental scenarios, we find that MCGAS leads to better average application makespans than its competitor. Index Terms: Mixed parallelism, parallel task graph scheduling, performance guarantee, multi-cluster platform
Investigating self-similarity and heavy-tailed distributions on a large scale experimental facility
- IEEE/ACM Transactions on Networking
Cited by 10 (2 self)
After the seminal work by Taqqu et al. relating self-similarity to heavy-tailed distributions, a number of research articles verified that aggregated Internet traffic time series show self-similarity and that Internet attributes, like Web file sizes and flow lengths, were heavy-tailed. However, the validation of the theoretical prediction relating self-similarity and heavy tails remains unsatisfactorily addressed, being investigated either using numerical or network simulations, or from uncontrolled Web traffic data. Notably, this prediction has never been conclusively verified on real networks using controlled and stationary scenarios, prescribing specific heavy-tailed distributions, and estimating confidence intervals. With this goal in mind, we use the potential and facilities offered by the large-scale, deeply reconfigurable and fully controllable experimental Grid5000 instrument, to investigate the prediction observability on real networks. To this
Cluster-wide context switch of virtualized jobs
- in: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC ’10)
, 2010
"... Research report ..."