R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi. Passion: Optimized I/O for parallel applications. IEEE Computer, 29(6):70--78, June 1996.
....the boundaries of the requested section align exactly with disk array chunk boundaries. Otherwise, the request is nonaligned (Figure 1(b)). Aligned accesses are handled in a similar fashion to operations on entire arrays. Nonaligned write requests are implemented via the strategy used in PASSION [15]. An entire disk array chunk is always written. When the requested section covers only parts of the disk array chunks, each I/O node brings the entire disk array chunk into its I/O buffers, overwrites the buffer with the overlapping sections, and then writes the entire disk array chunk back to ....
R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi. Passion: Optimized I/O for parallel applications. IEEE Computer, 29(6):70--78, June 1996.
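The nonaligned write strategy quoted above (read the whole chunk, overwrite the overlapping part, write the whole chunk back) can be sketched as follows. This is a minimal illustration only: the function name `write_section`, the 8-byte chunk size, and the in-memory `bytearray` standing in for a disk are all assumptions made for the example, not the API of PASSION or of the citing system.

```python
# Hypothetical sketch of a chunk-granular read-modify-write.
# All names and sizes here are illustrative assumptions.
CHUNK = 8  # assumed chunk size in bytes


def write_section(disk: bytearray, offset: int, data: bytes, chunk: int = CHUNK) -> int:
    """Write `data` at `offset`, always transferring whole chunks.

    Assumes len(disk) is a multiple of `chunk`.
    Returns how many chunks had to be read first (the partially covered ones).
    """
    end = offset + len(data)
    if not data:
        return 0
    first = offset // chunk
    last = (end - 1) // chunk
    reads = 0
    for c in range(first, last + 1):
        lo, hi = c * chunk, (c + 1) * chunk
        s, e = max(lo, offset), min(hi, end)  # overlap of the request with this chunk
        if s == lo and e == hi:
            # Aligned: the chunk is fully overwritten, so no read is needed.
            buf = bytearray(chunk)
        else:
            # Nonaligned: bring the entire chunk into the buffer first.
            buf = bytearray(disk[lo:hi])
            reads += 1
        buf[s - lo:e - lo] = data[s - offset:e - offset]
        disk[lo:hi] = buf  # the entire chunk is always written back
    return reads


disk = bytearray(b"a" * 24)
partial_reads = write_section(disk, 4, b"X" * 12)
```

In this example the request covers the second half of chunk 0 and all of chunk 1, so only chunk 0 is read before being written back.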
....[Figure: Accessing a Portion of a Strided Region] ....ing what would have been multiple small messages to be combined into a single, larger request. Other parallel file systems and interfaces supporting accesses to noncontiguous regions with single requests include Vesta [7], Panda [8], MPI-IO [9], and PASSION [10]. Our I/O daemon accepts strided requests in order to take advantage of this typical access pattern. Each read or write request consists of a set of six parameters: request location (rl), the location of the start of the request; first size (fs), the size of the starting partial block; group size (gs) ....
R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi, "Passion: Optimized I/O for parallel applications," IEEE Computer, vol. 29, pp. 70--78, June 1996.
....across different applications and different number of processors, and look into approaches to estimate such changes to make our cost models more accurate. 5 Related Work Several runtime support libraries and file systems have been developed to support efficient I/O in a parallel environment [2, 8, 12, 13, 19, 21, 24, 25]. These systems mainly focus on supporting regular strided access to uniformly distributed datasets, such as images, maps, and dense multidimensional arrays. ADR differs from these systems in several ways. First, ADR is able to carry out range queries directed at irregular spatially indexed ....
R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi. Passion: Optimized I/O for parallel applications. IEEE Computer, 29(6):70--78, June 1996.
....codes (processors execute the same code on different data). The programmer chooses the appropriate data distribution and I/O operations are implicitly collective. A variable number of coalescing processes aggregate the requests to perform them efficiently under the chosen distribution. PASSION [4, 21, 19] (Parallel And Scalable Software for Input-Output) is another runtime library targeted for SPMD applications with routines to efficiently perform out-of-core operations. The Panda [16, 3] library utilizes server-directed I/O, a variation of disk-directed I/O, and a high-level collective interface ....
THAKUR, R., CHOUDHARY, A., BORDAWEKAR, R., MORE, S., AND KUDITIPUDI, S. Passion: Optimized I/O for Parallel Applications. IEEE Computer 29, 6 (June 1996), 70--78.
....imbalance incurred during the local reduction phase, while for FRA and SRA it is due to constant overheads in the initialization and global reduction phases. 5 Related Work Several runtime support libraries and file systems have been developed to support efficient I/O in a parallel environment [4, 9, 16, 18, 24, 28, 34, 35]. These systems mainly focus on supporting regular strided access to uniformly distributed datasets, such as images, maps, and dense multidimensional arrays. They also usually provide a collective I/O interface, in which all processing nodes cooperate to make a single large I/O request. ADR ....
R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi. Passion: Optimized I/O for parallel applications. IEEE Computer, 29(6):70--78, June 1996.
....retrieval and processing for a wide variety of applications and from the ability to maintain and jointly process multiple datasets with different underlying attribute spaces. Several runtime support libraries and file systems have been developed to support efficient I/O in a parallel environment [7, 11, 14, 18]. These systems are analogous to T2 in that (1) they plan data movements in advance to minimize disk access and communication overheads, and (2) in some cases, they attempt to optimize I/O performance by masking I/O latency with computation and with interprocessor communication. Also, T2 ....
R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi. Passion: Optimized I/O for parallel applications. IEEE Computer, 29(6):70--78, June 1996.
....data dimensions can be spatial coordinates, time, or varying experimental conditions such as temperature, velocity or magnetic field. The increasing importance of such applications has been widely recognized. Runtime systems like the Active Data Repository [5, 6] and the Passion runtime library [18, 19] allow high performance on data-intensive applications, but do not address the need for programming with high-level abstractions. We target two high-level programming models for this important class of computations: 1. Object-Oriented (Java-Based): Object-oriented features like encapsulation and ....
R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi. Passion: Optimized I/O for parallel applications. IEEE Computer, 29(6):70--78, June 1996.
....also unique in considering persistent storage, complex distributions of data on processors and disks, and the use of a sophisticated run-time system for optimizing resources. Several run-time support libraries and file systems have been developed to support efficient I/O in a parallel environment [7, 8, 13, 19]. They also usually provide a collective I/O interface, in which all processing nodes cooperate to make a single large I/O request. Our work is different in two important ways. First, we are supporting a much higher level of programming by involving a compiler. Second, our target run-time system, ....
R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi. Passion: Optimized I/O for parallel applications. IEEE Computer, 29(6):70--78, June 1996.
....all Cyclic schemes achieve higher levels of declustering than the other schemes. A Cyclic scheme based upon exhaustive search always gave the best performance. The use of parallel I/O for improving the performance of parallel programs managing multidimensional arrays has also been investigated in [14, 13]. In [14] it is assumed that the array is divided among the processors using HPF-like BLOCK and CYCLIC statements. Data for a processor may be local or stored globally across all processors. Performance improvements are made through collective I/O, prefetching, and sieving. The allocation of data ....
....achieve higher levels of declustering than the other schemes. A Cyclic scheme based upon exhaustive search always gave the best performance. The use of parallel I/O for improving the performance of parallel programs managing multidimensional arrays has also been investigated in [14, 13]. In [14], it is assumed that the array is divided among the processors using HPF-like BLOCK and CYCLIC statements. Data for a processor may be local or stored globally across all processors. Performance improvements are made through collective I/O, prefetching, and sieving. The allocation of data to disks ....
R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi. PASSION: Optimized I/O for parallel applications. IEEE Computer, 29(6):70--78, June 1996.
....requests, it is often the case that in the aggregate the whole array is being written to or read from the file. The application can use this knowledge to significantly improve its I/O performance. The technique of collective I/O has been developed to better utilize the parallel I/O subsystem [10, 26, 27, 4, 17, 23, 5, 8]. In this approach, the processors exchange information about their individual I/O requests to develop a picture of the aggregate I/O request. Based on this global knowledge, I/O requests are combined and submitted in their proper order, making a much more efficient use of the I/O subsystem. ....
....about their individual I/O requests to develop a picture of the aggregate I/O request. Based on this global knowledge, I/O requests are combined and submitted in their proper order, making a much more efficient use of the I/O subsystem. There are three approaches to collective I/O: two-phase I/O [10, 26, 27], disk-directed I/O [17, 19], and server-directed I/O [7, 23]. The primary distinction between these approaches is the level at which the optimal I/O strategy is derived and carried out. In disk-directed I/O, the collective I/O request is sent to the disk controllers which collectively determine ....
[Article contains additional citation context not shown here]
Thakur, R., Choudhary, A., Bordawekar, R., More, S., and Kuditipudi, S. Passion: Optimized I/O for Parallel Applications. IEEE Computer, 29(6):70--78, June 1996.
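The request-combining step described in the context above (processors share their individual requests, which are then combined and submitted in order) can be illustrated with a small sketch. It covers only the aggregation of `(offset, length)` requests into sorted, coalesced ranges, and does not model the data redistribution of two-phase I/O or the disk-directed and server-directed variants. The function name `aggregate_requests` and the request representation are assumptions made for the example.

```python
# Hypothetical sketch: combine per-process requests into ordered, merged ranges.
# The (offset, length) representation is an assumption for illustration.
def aggregate_requests(per_process):
    """per_process: one list of (offset, length) pairs per process.

    Returns a sorted list of (offset, length) ranges in which overlapping
    or touching requests have been merged into a single larger request.
    """
    all_reqs = sorted(r for reqs in per_process for r in reqs if r[1] > 0)
    merged = []
    for off, n in all_reqs:
        if merged and off <= merged[-1][0] + merged[-1][1]:
            # This request starts inside or right at the end of the previous range.
            last_off, last_n = merged[-1]
            merged[-1] = (last_off, max(last_n, off + n - last_off))
        else:
            merged.append((off, n))
    return merged


# Three processes each ask for small, interleaved pieces of one file.
per_process = [[(0, 4), (16, 4)], [(4, 4), (20, 4)], [(8, 8)]]
combined = aggregate_requests(per_process)
```

Here five small requests collapse into a single 24-byte request, which is the effect the quoted passage attributes to having a global picture of the aggregate request.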
....of high I/O latency, it is critical to make as few requests to the file system as possible. When a process makes an independent request for noncontiguous data, ROMIO, therefore, does not access each contiguous portion of the data separately. Instead, it uses an optimization called data sieving [12]. The basic idea is illustrated in Figure 1. Assume that the user has made a single read request for five noncontiguous pieces of data. Instead of reading each piece separately, ROMIO reads a single contiguous chunk of data starting from the first requested byte up to the last requested byte into ....
....even for noncontiguous access patterns. We have described two optimizations our MPI-IO implementation performs that enable it to deliver high performance even if the user's request consists of many small, noncontiguous accesses. Our implementation of these optimizations generalizes the work in [11, 12] to handle any noncontiguous access pattern, not just sections of arrays. For the applications we considered, collective I/O performed significantly better than both data sieving and [Table 5. Read performance of UNSTRUC, bandwidth in Mbytes/s] ....
R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi. Passion: Optimized I/O for Parallel Applications. Computer, 29(6):70--78, June 1996.
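The data sieving read described in the first context above (one contiguous read from the first requested byte to the last, followed by extraction of the requested pieces) can be sketched as follows. This is a simplified illustration and not ROMIO's or PASSION's implementation: the function name `sieve_read` and the list of `(offset, length)` pairs are assumptions, and a real implementation would also bound the buffer size.

```python
import io


# Hypothetical sketch of a data-sieving read over any seekable binary file.
def sieve_read(f, requests):
    """requests: list of (offset, length) pairs for noncontiguous pieces.

    Issues a single contiguous read spanning all pieces, then copies each
    requested piece out of the in-memory buffer.
    """
    if not requests:
        return []
    start = min(off for off, _ in requests)
    end = max(off + n for off, n in requests)
    f.seek(start)
    buf = f.read(end - start)  # one large request instead of many small ones
    return [buf[off - start:off - start + n] for off, n in requests]


data = bytes(range(100))
pieces = sieve_read(io.BytesIO(data), [(10, 3), (20, 2), (50, 1)])
```

The trade-off is that the bytes between the requested pieces are read and discarded, which pays off when per-request latency dominates transfer time.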
....that can improve performance significantly. These extensions allow users to perform bulk (array) I/O operations with a single method call. We have implemented these extensions and validated their performance benefits. 2 1. 3 Related Work Other than the large body of work related to parallel I/O [4, 8, 9, 13, 23, 27, 28, 32, 33], the work most closely related to ours is the Jaguar project [36, 37] which aims to improve Java I/O performance as one of its goals. Jaguar allows the Java runtime system to be extended with new primitive operations that enable efficient access to hardware resources. These primitives are ....
Thakur, R., Choudhary, A., Bordawekar, R., More, S., and Kuditipudi, S. Passion: Optimized I/O for Parallel Applications. IEEE Computer, 29(6):70-78, June 1996.
....patterns exhibited by many parallel scientific applications [1, 5]. In particular, each processor tends to make a large number of small I/O requests, incurring the high cost of I/O on each such request. The technique of collective I/O has been developed to better utilize the parallel I/O subsystem [2, 7, 8]. In this approach, the processors exchange information about their individual I/O requests to develop a picture of the aggregate I/O request. Based on this global knowledge, I/O requests are combined and submitted in their proper order, making a much more efficient use of the I/O subsystem. Two ....
Thakur, R., Choudhary, A., Bordawekar, R., More, S., and Kuditipudi, S. Passion: Optimized I/O for parallel applications. IEEE Computer, 29(6):70-78, June 1996.
....undoubtedly benefit from the proposed methods. Finally, we note that the proposed methods are not just useful for I/O, but also for interprocess communication, and would therefore benefit networking applications as well. 7. RELATED WORK Other than the large body of work related to parallel I/O [1, 4, 5, 8, 14, 16, 17, 20, 21], the work most closely related to ours is the Jaguar project [23, 24] which aims to improve Java I/O performance as one of its goals. Jaguar allows the Java runtime system to be extended with new primitive operations that enable efficient access to hardware resources. These primitives are ....
Thakur, R., Choudhary, A., Bordawekar, R., More, S., and Kuditipudi, S. Passion: Optimized I/O for Parallel Applications. IEEE Computer, 29(6):70-78, June 1996.
....level (as done in this study) implementing them as part of the language should provide much better performance. It is also worth noting that such methods would also be very beneficial to networking applications. 7 Related Work Other than the large body of work related to parallel I/O [1, 4, 5, 8, 14, 16, 17, 19, 20], the work most closely related to ours is the Jaguar project [22, 23] which has as one of its goals improvement in the performance of Java I/O. Jaguar allows the Java runtime system to be extended with new primitive operations that enable efficient access to hardware resources. These primitives ....
Thakur, R., Choudhary, A., Bordawekar, R., More, S., and Kuditipudi, S. Passion: Optimized I/O for Parallel Applications. IEEE Computer, 29(6):70-78, June 1996.
....as a mechanism for implementing MPI-IO, as illustrated in Figure 1. A similar abstract device interface is used in MPICH [6] for implementing MPI portably. 3 Data Sieving To reduce the effect of high I/O latency, it is critical to make as few requests to the file system as possible. Data sieving [14] is a technique that enables an implementation to make a few large, contiguous requests to the file system even if the user's request consists of several small, noncontiguous accesses. Figure 2 illustrates the basic idea of data sieving. Assume that the user has made a single read request for five ....
R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi. Passion: Optimized I/O for Parallel Applications. Computer, 29(6):70--78, June 1996.
....shown that collective I/O can improve performance significantly [5, 27, 16, 24]. However, collective I/O cannot be done with the Unix API. Over the past few years, many research parallel file systems and I/O libraries have been developed that perform various optimizations, including collective I/O [28, 11, 20, 17, 3, 10, 25, 9, 19]. Each of these, however, has a different API with varying degrees of portability and generality. The only standard, portable API that has been available on all machines is the Unix API. Therefore, most users write applications for the Unix API and get bad performance for reasons explained above. ....
R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi. Passion: Optimized I/O for Parallel Applications. Computer, 29(6):70--78, June 1996.
....used the right way, can result in very poor performance. The paper also explains why performance improves when users use MPI-IO the right way: the MPI-IO implementation can then perform optimizations, such as data sieving and collective I/O. Although these optimizations have been proposed earlier [7, 15, 28, 33], this is the only paper that discusses in detail the practical issues involved in implementing these optimizations, in the context of a standard, portable API, on real state-of-the-art parallel machines and file systems. We also present performance results that confirm that these optimizations ....
....to make as few requests to the file system as possible. Data sieving is a technique that enables an implementation to make a few large, contiguous requests to the file system even if the user's request consists of several small, noncontiguous accesses. Data sieving was first used in PASSION [31, 33] in the context of accessing sections of out-of-core arrays. ROMIO's data sieving implementation, on the other hand, is for any general noncontiguous access pattern (as can be described by an MPI .... [Figure labels: user's request for noncontiguous data; read a contiguous chunk into memory; copy into user's buffer] ....
R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi. Passion: Optimized I/O for Parallel Applications. Computer, 29(6):70--78, June 1996.
....case for the need of I/O optimizations and analysis of its I/O phase. For some application inputs discussed in this paper, up to 1.9 GB of data are read/written per application phase and this warrants optimizations to the I/O phase to efficiently utilize the existing I/O system. We use PASSION [17, 7, 8, 13], a parallel and scalable I/O library, to implement the application's I/O. Furthermore, we modify the application to use PASSION's optimizations such as prefetching. We classify the factors that affect the I/O performance of the application into two categories, namely, ....
....x 2 GB partition on original Maxtor RAID 3 level disks and a 16 I/O node x 4 GB partition on individual Seagate disks. In both the partitions, the stripe factor is equal to the number of I/O nodes. The default striping unit size of both the I/O partitions is 64 KB. 3.2 PASSION Library PASSION [17, 7, 8, 13, 3] is a parallel run-time system which offers a high-level interface to the underlying parallel I/O subsystem of a parallel computer. PASSION calls can be used from both in-core and out-of-core programs for I/O support. In addition, it offers several optimizations such as data ....
R. Thakur, A. Choudhary, R. Bordawekar, S. More and S. Kuditipudi, PASSION: optimized I/O for parallel applications, In Computer, IEEE Computer Society, June 1996.
....transfers, sometimes involving the memories of multiple processing nodes. These interfaces, possibly integrated into parallel programming toolkits, preserve the programmer abstraction of explicitly requesting data transfer [12, 13, 14, 62, 65]. Array-oriented (or type-oriented) interfaces [15, 69] define compiler-recognized data types (typically arrays) and operations on these datatypes. Out-of-core computation is directly specified and no explicit I/O transfers are managed by programmers. Array-oriented systems are effective for scientific computations that make regular strides through ....
Thakur, R., Choudhary, A., Bordawekar, R., More, S., and Kuditipudi, S. Passion: Optimized I/O for parallel applications. Computer (June 1996), 70--78.
....file domains appropriately and possibly using a different algorithm for interprocessor communication. The best way to use the extended two-phase method is to implement it as a library routine that can be called from an application program. We have implemented it in the PASSION runtime library [15], which is available on the World Wide Web at http://www.cat.syr.edu/passion.html. ....
R. Thakur, A. Choudhary, R. Bordawekar, S. More, and S. Kuditipudi. Passion: Optimized I/O for Parallel Applications. IEEE Computer, June 1996.
....transfers, sometimes involving the memories of multiple processing nodes. These interfaces, possibly integrated into parallel programming toolkits, preserve the programmer abstraction of explicitly requesting data transfer [12, 13, 14, 62, 65]. Array-oriented (or type-oriented) interfaces [15, 69] define compiler-recognized data types (typically arrays) and operations on these datatypes. Out-of-core computation is directly specified and no explicit I/O transfers are managed by programmers. Array-oriented systems are effective for scientific computations that make regular strides through ....
Thakur, R., Choudhary, A., Bordawekar, R., More, S., and Kuditipudi, S. Passion: Optimized I/O for parallel applications. Computer (June 1996), 70--78.