Results 1 - 10 of 80
Chimera: A Virtual Data System For Representing, Querying, and Automating Data Derivation
In Proceedings of the 14th Conference on Scientific and Statistical Database Management, 2002
Cited by 282 (28 self)
Much scientific data is not obtained from measurements, but rather derived from other data by the application of computational procedures. We hypothesize that explicit representation of these procedures can enable documentation of data provenance, discovery of available methods, and on-demand data generation (so-called "virtual data"). To explore this idea, we have developed the Chimera virtual data system, which combines a virtual data catalog, for representing data derivation procedures and derived data, with a virtual data language interpreter that translates user requests into data definition and query operations on the database. We couple the Chimera system with distributed "Data Grid" services to enable on-demand execution of computation schedules constructed from database queries. We have applied this system to two challenge problems, the reconstruction of simulated collision event data from a high-energy physics experiment and the search of digital sky survey data for galactic clusters, with promising results.
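The core idea of the abstract above, a catalog that records how each dataset is derived so it can be regenerated on demand and its provenance queried, can be sketched in a few lines. The class and method names here are illustrative assumptions for this sketch, not Chimera's actual virtual data language or API.

```python
# Minimal sketch of a virtual data catalog in the spirit of Chimera
# (names and structure are illustrative, not Chimera's actual API).

class VirtualDataCatalog:
    """Maps derived dataset names to the procedures that produce them."""

    def __init__(self):
        self.derivations = {}   # dataset name -> (function, input names)
        self.materialized = {}  # dataset name -> concrete value

    def add_raw(self, name, value):
        self.materialized[name] = value

    def add_derivation(self, name, func, inputs):
        self.derivations[name] = (func, inputs)

    def request(self, name):
        # On-demand ("virtual") data generation: derive only if not cached.
        if name in self.materialized:
            return self.materialized[name]
        func, inputs = self.derivations[name]
        args = [self.request(i) for i in inputs]  # recursively derive inputs
        self.materialized[name] = func(*args)
        return self.materialized[name]

    def provenance(self, name):
        # Explicit representation of derivations enables provenance queries.
        if name not in self.derivations:
            return [name]
        _, inputs = self.derivations[name]
        return [name] + [p for i in inputs for p in self.provenance(i)]

cat = VirtualDataCatalog()
cat.add_raw("raw_events", [1, 2, 3, 4])
cat.add_derivation("calibrated", lambda xs: [2 * x for x in xs], ["raw_events"])
cat.add_derivation("summary", lambda xs: sum(xs), ["calibrated"])
print(cat.request("summary"))      # derives calibrated, then summary -> 20
print(cat.provenance("summary"))   # ['summary', 'calibrated', 'raw_events']
```

Requesting `summary` transparently materializes its missing inputs first, which is the "virtual data" behaviour the paper describes.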
Giggle: A Framework for Constructing Scalable Replica Location Services
2002
Cited by 158 (37 self)
In wide area computing systems, it is often desirable to create remote read-only copies (replicas) of files. Replication can be used to reduce access latency, improve data locality, and/or increase robustness, scalability and performance for distributed applications. We define a replica location service (RLS) as a system that maintains and provides access to information about the physical locations of copies. An RLS typically functions as one component of a data grid architecture. This paper makes the following contributions. First, we characterize RLS requirements. Next, we describe a parameterized architectural framework, which we name Giggle (for GIGa-scale Global Location Engine), within which a wide range of RLSs can be defined. We define several concrete instantiations of this framework with different performance characteristics. Finally, we present initial performance results for an RLS prototype, demonstrating that RLS systems can be constructed that meet performance goals.
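The Giggle framework's central structure is a two-level mapping: local replica catalogs map logical file names to physical locations at one site, and a replica location index records which catalogs know about each logical name. A minimal sketch of that split, with class names and URLs that are assumptions of this sketch rather than Giggle's actual interfaces:

```python
# Illustrative two-level replica location service, loosely following the
# Giggle split between Local Replica Catalogs (per-site logical-to-physical
# mappings) and a Replica Location Index that records which catalogs hold
# each logical name. All names and URLs here are assumed for the sketch.

class LocalReplicaCatalog:
    def __init__(self, site):
        self.site = site
        self.mappings = {}  # logical file name -> set of physical URLs

    def register(self, lfn, pfn):
        self.mappings.setdefault(lfn, set()).add(pfn)

    def lookup(self, lfn):
        return self.mappings.get(lfn, set())

class ReplicaLocationIndex:
    """Index layer: records which catalogs hold each logical name."""
    def __init__(self):
        self.index = {}  # logical file name -> set of catalogs

    def update(self, lrc):
        for lfn in lrc.mappings:
            self.index.setdefault(lfn, set()).add(lrc)

    def locate(self, lfn):
        # Consult the index, then fan out only to the relevant catalogs.
        pfns = set()
        for lrc in self.index.get(lfn, set()):
            pfns |= lrc.lookup(lfn)
        return pfns

lrc_a = LocalReplicaCatalog("site-a")
lrc_a.register("lfn://survey/image42", "gsiftp://site-a/data/image42")
lrc_b = LocalReplicaCatalog("site-b")
lrc_b.register("lfn://survey/image42", "gsiftp://site-b/replicas/image42")

rli = ReplicaLocationIndex()
rli.update(lrc_a)
rli.update(lrc_b)
print(sorted(rli.locate("lfn://survey/image42")))
```

The index layer is what lets the design scale: a query touches only the catalogs that actually hold replicas of the requested file.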
The design and implementation of Grid database services in OGSA-DAI
In Concurrency - Practice and Experience, 2005
Cited by 68 (4 self)
Initially, Grid technologies were principally associated with supercomputer centres and large-scale scientific applications in physics and astronomy. They are now increasingly seen as being relevant to many areas of e-Science and e-Business. The emergence of the Open Grid Services Architecture (OGSA), to complement the ongoing activity on Web Services standards, promises to provide a service-based platform that can meet the needs of both business and scientific applications. Early Grid applications focused principally on the storage, replication and movement of file-based data. Now the need for the full integration of database technologies with Grid middleware is widely recognized. Not only do many Grid applications already use databases for managing metadata, but increasingly many are associated with large databases of domain-specific information (e.g. biological or astronomical data). This paper describes the design and implementation of OGSA-DAI, a service-based architecture for database access over the Grid. The approach involves the design of Grid Data Services that allow consumers to discover the properties of structured data stores and to access their contents. The initial focus has been on support for access to Relational and XML data, but the overall architecture has been designed to be extensible to accommodate
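The service pattern described above, discover the properties of a structured data store, then access its contents through the same front-end, can be sketched as follows. This is a hypothetical illustration of the pattern using SQLite, not OGSA-DAI's actual service interface.

```python
# Hypothetical sketch of the OGSA-DAI pattern: a Grid Data Service that
# lets a consumer first discover a data store's properties, then run
# queries against it. The interface is illustrative, not OGSA-DAI's API.

import sqlite3

class GridDataService:
    def __init__(self, conn, store_type):
        self.conn = conn
        self.store_type = store_type  # e.g. "relational" or "xml"

    def properties(self):
        # Discovery: expose the store's type and schema before any access.
        tables = [row[0] for row in self.conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")]
        return {"store_type": self.store_type, "tables": tables}

    def perform(self, query, params=()):
        # Access: deliver the results of a query on the underlying store.
        return list(self.conn.execute(query, params))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stars (name TEXT, magnitude REAL)")
conn.execute("INSERT INTO stars VALUES ('Vega', 0.03), ('Sirius', -1.46)")

gds = GridDataService(conn, "relational")
print(gds.properties())
print(gds.perform("SELECT name FROM stars WHERE magnitude < ?", (0.0,)))
```

Keeping discovery and access behind one service boundary is what makes the architecture extensible to other data models, as the abstract notes.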
Applying Chimera virtual data concepts to cluster finding in the Sloan Sky Survey
In Proceedings of Supercomputing 2002 (SC2002), 2002
Cited by 64 (13 self)
The GriPhyN project [1] is one of several major efforts [2-4] working to enable large-scale data-intensive computation as a routine scientific tool. GriPhyN focuses in particular on virtual data technologies that allow computational procedures and results to be exploited as community resources so that, for example, scientists can not only run their own computations on raw data, but also discover computational procedures
A taxonomy of Data Grids for distributed data sharing, management, and processing
In ACM Computing Surveys
Cited by 61 (9 self)
Data Grids have been adopted as the platform for scientific communities that need to share, access, transport, process and manage large data collections distributed worldwide. They combine high-end computing technologies with high-performance networking and wide-area storage management techniques. In this paper, we discuss the key concepts behind Data Grids and compare them with other data sharing and distribution paradigms such as content delivery networks, peer-to-peer networks and distributed databases. We then provide comprehensive taxonomies that cover various aspects of architecture, data transportation, data replication and resource allocation and scheduling. Finally, we map the proposed taxonomy to various Data Grid systems not only to validate the taxonomy but also to identify areas for future exploration. Through this taxonomy, we aim to categorise existing systems to better understand their goals and their methodology. This would help evaluate their applicability for solving similar problems. This taxonomy also provides a "gap analysis" of this area through which researchers can potentially identify new issues for investigation. Finally, we hope that the proposed taxonomy and mapping also helps to provide an easy way for new practitioners to understand this complex area of research.
Simulation of Dynamic Data Replication Strategies in Data Grids
In Proc. 12th Heterogeneous Computing Workshop (HCW2003)
Cited by 33 (2 self)
Data Grids provide geographically distributed resources for large-scale data-intensive applications that generate large data sets. However, efficient access to such huge and widely distributed data is hindered by the high latencies of the Internet. To address this problem, we developed GridNet, a Data Grid simulator that models a cost-driven technique for dynamically replicating data on the Grid. Replication decisions are driven by estimates of the data access gains and of the replica's creation and maintenance costs, which in turn are based on factors such as accumulated runtime read/write statistics, network latency and bandwidth, and replica size. Simulation results demonstrate that replication improves data access time in Data Grids, and that the gain increases with the size of the datasets involved.
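The cost-driven decision the abstract describes, replicate only when estimated access savings exceed the replica's creation and maintenance cost, can be made concrete with a toy model. The cost formula and every parameter name below are simplified assumptions for illustration, not GridNet's actual model.

```python
# Sketch of a cost-driven replication decision in the spirit of GridNet:
# replicate a dataset at a site only when the estimated access-time savings
# outweigh the cost of creating and maintaining the replica. The cost model
# and parameter names are simplified assumptions, not GridNet's.

def transfer_time(size_mb, latency_s, bandwidth_mbps):
    """Seconds to move size_mb over a link: latency plus serialization."""
    return latency_s + size_mb * 8.0 / bandwidth_mbps

def should_replicate(size_mb, reads, writes, latency_s, bandwidth_mbps,
                     maintenance_cost_s=0.0):
    # Gain: each remote read avoided saves one wide-area transfer.
    gain = reads * transfer_time(size_mb, latency_s, bandwidth_mbps)
    # Cost: one transfer to create the replica, one per write to keep it
    # consistent, plus any fixed maintenance overhead.
    cost = ((1 + writes) * transfer_time(size_mb, latency_s, bandwidth_mbps)
            + maintenance_cost_s)
    return gain > cost

# A read-heavy 100 MB dataset over a slow link is worth replicating ...
print(should_replicate(100, reads=50, writes=2, latency_s=0.2, bandwidth_mbps=10))
# ... but a write-heavy one is not, since each write re-incurs the transfer.
print(should_replicate(100, reads=2, writes=50, latency_s=0.2, bandwidth_mbps=10))
```

The read/write asymmetry is the essential point: accumulated read statistics push toward replication, while writes and maintenance push against it.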
Practical Heterogeneous Placeholder Scheduling In Overlay Metacomputers: Early Experiences
In Proc. 8th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), 2002
Cited by 26 (8 self)
A practical problem faced by users of high-performance computers is: How can I automatically load balance my jobs across different batch queues, which are in different administrative domains, if there is no existing grid infrastructure? It is common to have user accounts for a number of individual high-performance systems (e.g., departmental, university, regional) that are administered by different groups. Without an administration-deployed grid infrastructure, one can still create a purely user-level aggregation of individual computing systems. The Trellis Project is developing the techniques and tools to take advantage of a user-level overlay metacomputer. Because placeholder scheduling does not require superuser permissions to set up or configure, it is well-suited to overlay metacomputers. This paper contributes to the practical side of grid computing by empirically demonstrating that placeholder scheduling can work across different administrative domains, across different local schedulers (i.e., PBS and Sun Grid Engine), and across different programming models (i.e., Pthreads, MPI, and sequential). We also describe a new metaqueue system to manage jobs with explicit workflow dependencies.
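The mechanism above can be modeled in a few lines: a placeholder is an ordinary batch job submitted to each local queue, and whichever placeholder starts first pulls the next job from a user-level metaqueue, so work drains to wherever free cycles appear. The names and simulation below are assumptions for this sketch, not the Trellis Project's code.

```python
# Illustrative model of placeholder scheduling: placeholders are normal
# batch jobs in each local queue; when one starts, it pulls the next job
# from a user-level metaqueue, so load balances across administrative
# domains without grid infrastructure. Names here are assumptions.

from collections import deque

class Metaqueue:
    def __init__(self, jobs):
        self.pending = deque(jobs)

    def next_job(self):
        # A placeholder asks the metaqueue for work when it gets CPU time.
        return self.pending.popleft() if self.pending else None

def placeholder_runs(metaqueue, site, executed):
    """Simulate one placeholder activation on `site`."""
    job = metaqueue.next_job()
    if job is not None:
        executed.append((site, job))  # the job runs where the slot opened

mq = Metaqueue(["job1", "job2", "job3"])
executed = []
# Placeholders in different administrative domains happen to start in
# this order, e.g. under different local schedulers (PBS, Sun Grid Engine):
for site in ["dept-pbs", "univ-sge", "dept-pbs"]:
    placeholder_runs(mq, site, executed)
print(executed)
```

Because binding of job to machine happens only at placeholder start time, no local scheduler needs to know about the others, which is why no superuser setup is required.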
Making a Case for Distributed File Systems at Exascale
Invited Paper, ACM Workshop on Large-scale System and Application Performance (LSAP), 2011
Cited by 24 (13 self)
Exascale computers will enable the unraveling of significant scientific mysteries. Predictions are that 2019 will be the year of exascale, with millions of compute nodes and billions of threads of execution. The current architecture of high-end computing systems is decades-old and has persisted as we scaled from gigascales to petascales. In this architecture, storage is completely segregated from the compute resources, and the two are connected via a network interconnect. This approach will not scale several orders of magnitude in terms of concurrency and throughput, and will thus prevent the move from petascale to exascale. At exascale, basic functionality at high concurrency levels will suffer poor performance, and combined with system mean-time-to-failure in hours, will lead to a performance collapse for large-scale heroic applications. Storage has the potential to be
User-Level Remote Data Access in Overlay Metacomputers
In Proceedings of the 4th IEEE International Conference on Cluster Computing, 2002
Cited by 15 (5 self)
A practical problem faced by users of metacomputers and computational grids is: If my computation can move from one system to another, how can I ensure that my data will still be available to my computation? Depending on the level of software, technical, and administrative support available, a data grid or a distributed file system would be reasonable solutions. However, it is not always possible (or practical) to have a diverse group of systems administrators agree to adopt a common infrastructure to support remote data access. Yet, having transparent access to any remote data is an important, practical capability. We have developed the Trellis File System (Trellis FS) to allow programs to access data files on any file system and on any host on a network that can be named by a Secure Copy Locator (SCL) or a Uniform Resource Locator (URL). Without requiring any new protocols or infrastructure, Trellis can be used on practically any POSIX-based system on the Internet. Read access, write access, sparse access, local caching of data, prefetching, and authentication are supported. Trellis is implemented as a user-level C library, which mimics the standard stream I/O functions, and is highly portable. Trellis is not a replacement for traditional file systems or data grids; it provides new capabilities by overlaying on top of other file systems, including grid-based file systems. And, by building upon an already-existing infrastructure (i.e., Secure Shell and Secure Copy), Trellis can be used in situations where a suitable data grid or distributed file system does not yet exist.
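The Trellis idea, stream-style I/O over names that may be local paths or remote locators, with a local cache in front of Secure Copy, can be sketched as below. Trellis itself is a user-level C library; this Python analogue, including the cache layout and the `scp` invocation, is an illustrative assumption, not Trellis's implementation.

```python
# Sketch of the Trellis FS idea: an open() that accepts local paths or
# remote locators (SCL/URL), fetching remote files once into a local
# cache. The fetch command and cache layout are assumptions; Trellis
# itself is a C library mimicking standard stream I/O over Secure Copy.

import os
import subprocess
import tempfile
import urllib.parse

CACHE_DIR = tempfile.mkdtemp(prefix="trellis-cache-")

def trellis_open(name, mode="r"):
    parsed = urllib.parse.urlparse(name)
    if parsed.scheme in ("", "file"):
        return open(parsed.path or name, mode)  # plain local access
    # Remote locator: fetch once into the cache, then serve locally.
    cached = os.path.join(
        CACHE_DIR, parsed.netloc + parsed.path.replace("/", "_"))
    if not os.path.exists(cached):
        if parsed.scheme == "scp":
            # e.g. scp user@host:/path -> cached copy (needs ssh keys)
            subprocess.run(
                ["scp", f"{parsed.netloc}:{parsed.path}", cached],
                check=True)
        else:
            raise ValueError(f"unsupported scheme: {parsed.scheme}")
    return open(cached, mode)

# Local use works like ordinary stream I/O:
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("hello grid")
    path = f.name
with trellis_open(path) as f:
    print(f.read())
```

Caching the remote copy once and serving subsequent opens locally is what makes the overlay usable on high-latency links while building only on Secure Shell and Secure Copy.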