File Grouping for Scientific Data Management: Lessons from Experimenting with Real Traces
Cited by 16 (2 self)
The analysis of data usage in a large set of real traces from a high-energy physics collaboration revealed the existence of an emergent grouping of files that we coined “filecules”. This paper presents the benefits of using this file grouping for prestaging data and compares it with previously proposed file grouping techniques along a range of performance metrics. Our experiments with real workloads demonstrate that filecule grouping is a reliable and useful abstraction for data management in science Grids; that preserving time locality for data prestaging is highly recommended; that job reordering with respect to data availability has significant impact on throughput; and finally, that a relatively short history of traces is a good predictor for filecule grouping. Our experimental results provide lessons for workload modeling and suggest design guidelines for data management in data-intensive resource-sharing environments.
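The abstract does not define how filecules are derived from a trace, so the following is only a minimal sketch under one assumption: a filecule is a maximal set of files that are requested by exactly the same set of jobs. The function name `group_filecules` and the trace layout (a mapping from job id to requested files) are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def group_filecules(jobs):
    """Group files that are requested by exactly the same set of jobs.

    ASSUMPTION: this equivalence-class rule is an illustrative reading of
    "filecule"; the abstract itself does not spell out the definition.
    `jobs` maps a job id to an iterable of file names.
    """
    # Step 1: for every file, collect the ids of the jobs that request it.
    jobs_per_file = defaultdict(set)
    for job_id, files in jobs.items():
        for name in files:
            jobs_per_file[name].add(job_id)

    # Step 2: files with an identical job set fall into the same group.
    groups = defaultdict(set)
    for name, job_ids in jobs_per_file.items():
        groups[frozenset(job_ids)].add(name)

    # Return the groups in a deterministic order.
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    trace = {"j1": ["a", "b", "c"], "j2": ["a", "b"], "j3": ["c", "d"]}
    print(group_filecules(trace))
```

In this toy trace, files a and b are always requested together, so they form one group, while c and d each end up alone. A prestaging policy could then fetch a whole group when any one of its files is requested.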
The Small World of File Sharing
Cited by 8 (2 self)
Web caches, content distribution networks, peer-to-peer file sharing networks, distributed file systems, and data grids all have in common that they involve a community of users who use shared data. In each case, overall system performance can be improved significantly by first identifying and then exploiting the structure of the community’s data access patterns. We propose a novel perspective for analyzing data access workloads that considers the implicit relationships that form among users based on the data they access. We propose a new structure, the interest-sharing graph, that captures common user interests in data, and we justify its utility with studies on four data-sharing systems: a high-energy physics collaboration, the Web, the Kazaa peer-to-peer network, and a BitTorrent file-sharing community. We find small-world patterns in the interest-sharing graphs of all four communities. We investigate analytically and experimentally some of the potential causes that lead to this pattern and conclude that user preferences play a major role. The significance of small-world patterns is twofold: they provide rigorous support for intuition, and they suggest the potential to exploit these naturally emerging patterns. As a proof of concept, we design and evaluate an information dissemination system that exploits the small-world interest-sharing graphs by building an interest-aware network overlay. We show that this approach leads to improved information dissemination performance.
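As a rough illustration of the interest-sharing graph, the sketch below connects two users whenever they have accessed at least `min_shared` common items, and computes the local clustering coefficient that small-world analyses typically report. The threshold rule, the parameter `min_shared`, and both function names are assumptions made for this example; the abstract does not give the exact construction.

```python
from itertools import combinations


def interest_sharing_graph(accesses, min_shared=1):
    """Build an undirected graph over users from their accessed items.

    ASSUMPTION: two users are linked when they share at least `min_shared`
    items; this threshold rule is illustrative only.
    `accesses` maps a user to an iterable of item identifiers.
    """
    items = {user: set(seen) for user, seen in accesses.items()}
    graph = {user: set() for user in items}
    for u, v in combinations(items, 2):
        if len(items[u] & items[v]) >= min_shared:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def clustering_coefficient(graph, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return 2.0 * links / (k * (k - 1))


if __name__ == "__main__":
    accesses = {
        "u1": {"f1", "f2"},
        "u2": {"f2", "f3"},
        "u3": {"f1", "f3"},
        "u4": {"f9"},
    }
    g = interest_sharing_graph(accesses)
    print(g["u1"], clustering_coefficient(g, "u1"))
```

A small-world graph combines high clustering with short average path lengths; the sketch covers only the clustering half of that test.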
A unified format for traces of peer-to-peer systems
- In: LSAP, ACM
Cited by 6 (4 self)
Peer-to-Peer (P2P) systems have recently emerged as a scalable platform for which costs are shared between the system users. Today, P2P technology is serving millions of users world-wide, with applications such as file sharing, video streaming, grid computing, and massively multiplayer online games. Such diversity and scale pose important research and technical problems, which in turn require a much better understanding of the usage patterns and of the performance bottlenecks. However, the large amounts of P2P monitoring and measurement data that already exist have not been made public, both for fear of compromising anonymity and for lack of a standard format. To address this problem, in this work we propose a unified format for workloads of P2P systems. Our format stores information coming from many types of P2P applications at several levels of detail, has a structure that balances generic and application-specific data, and protects the anonymity of the peers whose personal information was captured in monitoring and measurement data. Using two large traces taken from real P2P systems, we show evidence of the usefulness of the proposed format and substantiate the hope that our unified format has the potential to become a standard for sharing P2P traces.
New worker-centric scheduling strategies for data-intensive grid applications
- in: Proc. ACM/IFIP/USENIX Int’l Conference on Middleware, 2007
Cited by 6 (2 self)
In this paper we argue that a worker-centric scheduler design is more desirable for data-intensive applications in Grid environments. Previous research on task-centric scheduling for data-intensive applications has identified that reusing the data present in a Grid site improves performance. However, task-centric scheduling has two problems: unbalanced task assignments and premature scheduling decisions. Both problems can be avoided by using worker-centric scheduling, which leads to a simpler scheduler design and better performance. We therefore propose a series of worker-centric scheduling strategies for data-intensive applications and evaluate, with a real application (Coadd), how each strategy performs compared to a task-centric one. Our results show that worker-centric strategies improve performance in terms of makespan and bandwidth usage.
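The abstract does not list the individual strategies, so the sketch below shows only the general worker-centric idea under stated assumptions: an idle worker pulls the pending task whose input files overlap most with the files already cached at its site. The function `pick_task`, the overlap metric, and the tie-breaking by task id are all hypothetical choices for this example, not the paper's strategies.

```python
def pick_task(pending, cached_files):
    """Choose a task for an idle worker based on data already at its site.

    ASSUMPTION: "largest overlap with the local cache" is an illustrative
    selection rule. `pending` maps a task id to its input file names;
    `cached_files` is the set of files present at the worker's site.
    Returns the chosen task id, or None when no task is pending.
    """
    if not pending:
        return None
    cached = set(cached_files)
    # Iterating in sorted order makes ties resolve to the smallest task id,
    # because max() returns the first maximal element it encounters.
    return max(sorted(pending), key=lambda task: len(set(pending[task]) & cached))


if __name__ == "__main__":
    pending = {"t1": ["a", "b"], "t2": ["b", "c", "d"], "t3": ["x"]}
    print(pick_task(pending, {"c", "d"}))
```

Because the decision is made only when a worker asks for work, no task is bound to a site ahead of time, which is how this style of design sidesteps premature scheduling decisions.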
Grid Computing Workloads: Bags of Tasks, Workflows, Pilots, and Others
Cited by 3 (0 self)
In the mid-1990s, the grid computing community promised the “compute power grid,” a utility computing infrastructure for scientists and engineers. Since then, a variety of grids have been built world-wide: for academic purposes, for specific application domains, and for general production work. Understanding the workloads of grids is important for the design and tuning of future grid resource managers and applications, especially in the recent wake of commercial grids and clouds. This article presents an overview of the most important characteristics of grid workloads in the past seven years (2003-2010). Starting from the data collected by the authors in the Grid Workloads Archive, this study focuses on four main axes of characterization: system usage, user population, general application characteristics, and characteristics of grid-specific application types. Grid utilization varies widely between systems but is stable in the long term. Although grid user populations range from tens to hundreds of individuals, a few users dominate each grid’s workload, both in terms of consumed resources and of number of jobs submitted to the system. Real grid workloads include very few parallel jobs but many independent single-machine jobs (tasks) grouped into single “bags of tasks.”
GridTorrent Framework: A High-performance Data Transfer and Data Sharing Framework for Scientific Computing
Cited by 2 (0 self)
Large amounts of data, often stored in many thousands of files, are created as part of today’s geographically distributed scientific computation and collaboration environments. Managing and transferring large volumes of data presents a significant challenge and is often a bottleneck in the scientific computing community. In this paper, we introduce an architecture to manage data distributions in a collaborative fashion through a GridTorrent Framework (GTF) whose data transfer mechanism is inspired by BitTorrent. We present performance experiment data that compares our framework to parallel TCP (PTCP) and BitTorrent. Our experimental results suggest that using GridTorrent for large data sets has significant advantages over parallel TCP in both LAN and WAN settings.
Data Transfers in the Grid: Workload Analysis of Globus GridFTP
Cited by 2 (0 self)
One of the basic services in grids is the transfer of data between remote machines. Files may be transferred at the explicit request of the user or as part of delegated resource management services, such as data replication or job scheduling. GridFTP is an important tool for such data transfers since it builds on the common FTP protocol, has a large user base with multiple implementations, and uses the GSI security model that allows delegated operations. This paper presents a workload analysis of the implementation of the GridFTP protocol provided by the Globus Toolkit. We studied more than 1.5 years of traces reported from all over the world by installed Globus GridFTP components. Our study focuses on three dimensions. First, it quantifies the volume of data transferred and characterizes user behavior. Second, it attempts to show how tuning capabilities are used in practice. Finally, it quantifies the user base as recorded in the database and highlights the usage trends of this software component.
Revisiting Locality of Reference in Scientific Grid Workloads
This paper revisits a basic question in data management: whether locality of reference is an important factor for the performance of caches in grid workloads. We answer this question by experimental evaluations using more than two years of real workloads from a science collaboration. Our results show that: (1) locality of reference is significant for these particular workloads and thus it is beneficial to consider it in cache replacement algorithms; and (2) combining locality of reference with data request reordering gives better performance along multiple performance metrics than either technique alone. We support the latter conclusion by proposing and evaluating a cache management algorithm that combines locality of reference with request reordering while accounting for multi-file requests. Experimental evaluation on real traces shows that this technique leads to significantly lower costs in wide-area data transfers than previous solutions for multi-file caching, while achieving an increased hit rate.
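To make the locality question concrete, the sketch below replays a sequence of multi-file requests against a plain LRU cache and reports the file hit rate. This is only a baseline that exploits temporal locality; it is not the paper's combined algorithm, and it does no request reordering. The function name `lru_hit_rate` and the choice to measure capacity in number of files are assumptions for this example.

```python
from collections import OrderedDict


def lru_hit_rate(requests, capacity):
    """Replay multi-file requests against an LRU cache of `capacity` files.

    ASSUMPTION: every file has unit size and each file in a request is
    looked up individually; a real multi-file policy would treat the
    request as a unit. Returns hits / lookups, or 0.0 for an empty trace.
    """
    cache = OrderedDict()  # Keys are file names, ordered oldest to newest.
    hits = 0
    lookups = 0
    for files in requests:
        for name in files:
            lookups += 1
            if name in cache:
                hits += 1
                cache.move_to_end(name)  # Mark as most recently used.
            else:
                cache[name] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # Evict least recently used.
    return hits / lookups if lookups else 0.0


if __name__ == "__main__":
    trace = [["a", "b"], ["a", "c"], ["b", "a"]]
    print(lru_hit_rate(trace, capacity=3))
```

Running the same trace with different capacities, or with the requests reordered, gives a quick feel for how much of the hit rate comes from locality alone.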