A taxonomy of Data Grids for distributed data sharing, management, and processing
- ACM Computing Surveys
Cited by 61 (9 self)
Data Grids have been adopted as the platform for scientific communities that need to share, access, transport, process and manage large data collections distributed worldwide. They combine high-end computing technologies with high-performance networking and wide-area storage management techniques. In this paper, we discuss the key concepts behind Data Grids and compare them with other data sharing and distribution paradigms such as content delivery networks, peer-to-peer networks and distributed databases. We then provide comprehensive taxonomies that cover various aspects of architecture, data transportation, data replication and resource allocation and scheduling. Finally, we map the proposed taxonomy to various Data Grid systems not only to validate the taxonomy but also to identify areas for future exploration. Through this taxonomy, we aim to categorise existing systems to better understand their goals and their methodology. This would help evaluate their applicability for solving similar problems. This taxonomy also provides a “gap analysis” of this area through which researchers can potentially identify new issues for investigation. Finally, we hope that the proposed taxonomy and mapping also help to provide an easy way for new practitioners to understand this complex area of research.
File Grouping for Scientific Data Management: Lessons from Experimenting with Real Traces
Cited by 16 (2 self)
The analysis of data usage in a large set of real traces from a high-energy physics collaboration revealed the existence of an emergent grouping of files that we coined “filecules”. This paper presents the benefits of using this file grouping for prestaging data and compares it with previously proposed file grouping techniques along a range of performance metrics. Our experiments with real workloads demonstrate that filecule grouping is a reliable and useful abstraction for data management in science Grids; that preserving time locality for data prestaging is highly recommended; that job reordering with respect to data availability has significant impact on throughput; and finally, that a relatively short history of traces is a good predictor for filecule grouping. Our experimental results provide lessons for workload modeling and suggest design guidelines for data management in data-intensive resource-sharing environments.
BM: Scientific data repositories on the Web: An initial survey
- J Am Soc Inf Sci
Cited by 11 (0 self)
Science Data Repositories (SDRs) have been recognized as both critical to science, and undergoing a fundamental change. A websample study was conducted of 100 SDRs. Information on the websites and from administrators of the SDRs was reviewed to determine salient characteristics of the SDRs, which were used to classify SDRs into groups using a combination of cluster analysis and logistic regression. Characteristics of the SDRs were explored for their role in determining groupings and for their relationship to the success of SDRs. Four of these characteristics were identified as important for further investigation: whether the SDR was supported with grants and contracts, whether support comes from multiple sponsors, what the holding size of the SDR is and whether a preservation policy exists for the SDR. An inferential framework for understanding SDR composition, guided by observations, characteristic collection and refinement and subsequent analysis on elements of group membership, is discussed. The development of SDRs is further examined from a business standpoint, and in comparison to its most similar form, institutional repositories. Because this work identifies important characteristics of SDRs and which characteristics potentially impact the sustainability and success of SDRs, it is expected to be helpful to SDRs.
Access control for a replica management database
- In Proc. Workshop on Storage Security and Survivability
, 2006
Cited by 5 (3 self)
Distributed computation systems have become an important tool for scientific simulation, and a similarly distributed replica management system may be employed to increase the locality and availability of storage services. While users of such systems may have low expectations regarding the security and reliability of the computation involved, they expect that committed data sets resulting from complete jobs will be protected against storage faults, accidents and intrusion. We offer a solution to the distributed storage security problem that has no global view on user names or authentication specifics. Access control is handled by a rendition protocol, which is similar to a rendezvous protocol but is driven by the capability of the client user to effect change in the data on the underlying storage. In this paper, we discuss the benefits and liabilities of such a system.
Biomolecular path sampling enabled by processing in network storage
- In Proc. Workshop on High Performance Computational Biology
, 2007
Cited by 2 (2 self)
Computationally complex and data intensive atomic scale biomolecular simulation is enabled via Processing in Network Storage (PINS): a novel distributed system framework to overcome bandwidth, compute, storage, and security challenges inherent to the wide area computation and storage grid. High throughput data generation requirements for our scientific target are overcome through novel aggregate bandwidth capabilities. Biomolecular simulation methods are correlated with the client tools, hybrid database/file server (GEMS), computation engine (Condor), virtual file system adapter (Parrot), and local file servers (Chirp). PINS performance is reported for the path sampling of a solvated protein domain requiring over 1000 simulations with total output data generation on the order of 1 TB.
A FRAMEWORK FOR THE DYNAMIC RECONFIGURATION OF SCIENTIFIC APPLICATIONS IN GRID ENVIRONMENTS
, 2007
Inter-node Communication in Peer-to-Peer Storage Clusters
Cited by 1 (1 self)
Storage clusters try to transfer the idea of cluster computing into the storage domain and to scale capacity and performance by simply adding new cluster components. This paper presents analytical considerations on the scalability of storage clusters and presents a storage cluster architecture based on peer-to-peer computing that is able to scale up to hundreds of servers and clients. The resulting storage cluster environment has been successfully implemented and tested on a Linux based HPC-cluster. The measurement results presented in this paper demonstrate the feasibility and scalability of this architecture.