Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, David A. Patterson, "High-Performance Sorting on Networks of Workstations", ACM SIGMOD '97, May, 1997.
....key. The input records are in random order and output records must be in ascending order. The elapsed time includes the time to create the sort process, to read the input from disk, and to create the output file and write the output to disk. The 1998 world record was 2.41 seconds by NOW Sort [4] with a 32-node cluster of UltraSPARCs, and the 1999 world record was 1.18 seconds by Millennium Sort [5] with a 16-node cluster of Pentium II PCs. 1 Some of the disks (Cheetah 18XL and 36LP) support Ultra160 SCSI but they function as Ultra2 because the SCSI controllers do not support ....
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, David A. Patterson, "High-Performance Sorting on Networks of Workstations", ACM SIGMOD '97, May, 1997.
....of sorting. Sorting is a memory-hungry computational task. As many data records as possible should reside in main memory during the algorithm's execution. It has been reported that the maximum number of records that can be sorted on a 256 Mbyte Unix workstation is around 1.5 million [2], in a database indexed by a single integer key of 10 bytes and a pointer to the record on disk. Memory usage considers the operating system (OS) memory overhead, around 38 17 in practical cases. The exact overhead depends on the total memory, measured with 64 256Mbytes. Around 20 of the time to ....
....as big as 20, memory by as much as 16 and bandwidth by as much as 4. This is to be compared with the cost of current systems used to break the barriers of sorting runtime, Symmetric Multiprocessors (SMP) and Networks of Workstations (NOWs). In [2], the authors investigate the sorting of lots of data and compare the cost-benefit figures of several systems, costing around US$1 million and sorting 6 Gbytes in one minute. We are comparing our approach to these and other results in the literature. The long term goal of our work in this field is ....
[Article contains additional citation context not shown here]
A. C. Arpaci-Dusseau et alii. High-Performance Sorting on Networks of Workstations. In: ACM SIGMOD Conference on the Management of Data, Tucson, AZ, May, 1997.
....time. Thus, if V is the volume of data, there are P processors and D disks, G gigabytes of memory are actually used across all processors, and sorting takes T minutes, then we will be measuring sorting efficiencies given by the formulas V/(PT), V/(DT), V/((P+D)T), and V/(GT). Arpaci-Dusseau et al. [2] used an alternative measure of large-scale sorting: the volume of data that can be sorted in one minute. This measure does not account for resource usage. Nevertheless, their achievement of sorting 6 gigabytes in just under one minute in 1997 is impressive. Their NOWSort algorithm did so on a ....
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, and David A. Patterson. High-performance sorting on networks of workstations. In SIGMOD, 1997.
....set of two elements chosen from a system of n elements [7, 4]. The operation changes a pair of elements independently of the remaining elements and is called for this reason interaction between the two elements. Examples of all-pairs computations include n-body simulation [4, 3], bubble sort [2], Gaussian elimination [1] and Householder reduction [4]. An all-pairs sequential computation over a large number of elements may become prohibitively complex. Interactions between pairs of elements happen in the order specified by a precedence graph [4]; fortunately, some of the interactions ....
Arpaci-Dusseau A., R. Arpaci-Dusseau, D. Culler, J. Hellerstein, D. Patterson. High-Performance Sorting on Networks of Workstations, Proc. 1997 ACM SIGMOD Conference, ACM Press, 1997, 243-254.
....each node contains a disk. Thus OOC algorithms might map more effectively to this type of system. One application that can take advantage of parallel I/O systems and out-of-core algorithms is sorting. Sorting requires large amounts of I/O and has proven well suited to networks of workstations [4], which exhibit many of the characteristics of Piles of PCs. This paper presents two out-of-core sorting algorithms and their performance on a Beowulf machine running a parallel file system. The focus of this study will be on the behavior of these algorithms with problem sizes that approach and ....
....multiple lists on a concurrent-read exclusive-write parallel random access machine (CREW PRAM) [12] which has similarities with the mergesort presented here. The Network of Workstations (NOW) project at UC Berkeley has provided the best non-commercial disk-to-disk sorting performance to date [4]. Using a network of 95 Sun workstations and Myrinet network, the NOW Sort group won the Indy MinuteSort award for largest sort in one minute. The NOW Sort uses a simple bucket sort algorithm and assumes a uniform distribution. Given P workstations, the problem is partitioned into P buckets, based ....
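The bucket partitioning mentioned in the excerpt above can be sketched as follows. This is a minimal single-process illustration, not the NOW Sort implementation: the excerpt is cut off before it says what the buckets are based on, so the choice here (the leading four bytes of each key, scaled to P equal-width ranges) is an assumption, as are the function names. The 100-byte record and 10-byte key layout comes from the benchmark description quoted elsewhere on this page.

```python
def bucket_for_key(key: bytes, num_buckets: int, prefix_bytes: int = 4) -> int:
    """Map a key to a bucket index in [0, num_buckets).

    Assumption: keys are uniformly distributed, so equal-width ranges of the
    leading key bytes give roughly equal-sized buckets.
    """
    prefix = int.from_bytes(key[:prefix_bytes].ljust(prefix_bytes, b"\x00"), "big")
    # Scale the prefix from [0, 2**(8*prefix_bytes)) down to [0, num_buckets).
    return (prefix * num_buckets) >> (8 * prefix_bytes)


def partition(records, num_buckets, key_len=10):
    """Distribute records into num_buckets lists (one per hypothetical workstation)."""
    buckets = [[] for _ in range(num_buckets)]
    for rec in records:
        buckets[bucket_for_key(rec[:key_len], num_buckets)].append(rec)
    return buckets


def bucket_sort(records, num_buckets, key_len=10):
    """Sort each bucket locally; concatenating buckets in order is globally sorted."""
    out = []
    for bucket in partition(records, num_buckets, key_len):
        out.extend(sorted(bucket, key=lambda r: r[:key_len]))
    return out
```

Because the bucket index is monotone in the key prefix, every key in bucket i precedes (or ties with) every key in bucket i+1, so each bucket can be sorted independently. With skewed keys the buckets would be unbalanced, which is why the excerpt stresses the uniform-distribution assumption.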
[Article contains additional citation context not shown here]
A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, D. E. Culler, J. M. Hellerstein, and D. A. Patterson, "High-Performance Sorting on Networks of Workstations," in Proceedings of the 1997 ACM SIGMOD Conference, pp. 243-254, 1997.
....Our algorithms do not rely on any specific topology and leave the control of the physical connections to the run-time system, e.g. PVM in our implementations. Others have developed solutions for individual problems that we have dealt with in this paper, such as n-body simulation [6], sorting [5], and Gaussian elimination [4]. A unique feature of our proposed algorithms for the same problems is that they stem from the same generic cluster computing structure and differ only in their sequential parts. We support the algorithm development process by a higher level specification and ....
Arpaci-Dusseau A., R. Arpaci-Dusseau, D. Culler, J. Hellerstein, D. Patterson. High-Performance Sorting on Networks of Workstations, Proc. 1997 ACM SIGMOD Conference, ACM Press, 1997, 243-254.
....bits and, 2 times for smaller data sets on 16 processors of the SGI O2000. To our knowledge, Load Balanced Radix was the fastest parallel in-memory version of Radix sort proposed up to now. In the field of parallel sorting, efforts have been addressed to solve the problem for disk-resident data [1, 3, 4, 10] and for memory-resident data [5, 6, 12]. Usually, research on disk-resident data focuses on minimizing the traffic between the disk and the memory. Thus, those works increase the locality of computations by working on data subsets that fit in the memories of the processing nodes. In some of those ....
....memory-resident data [5, 6, 12]. Usually, research on disk-resident data focuses on minimizing the traffic between the disk and the memory. Thus, those works increase the locality of computations by working on data subsets that fit in the memories of the processing nodes. In some of those papers [1, 3, 10] and one of the memory-resident papers [6], some memory hierarchy locality issues are addressed. However, to our knowledge, the important issue of optimizing communication has not been addressed explicitly except in [6], where a technique called Personalized Communication is used to perform an optimal ....
A. Arpaci-Dusseau, R. Arpaci-Dusseau, D. Culler, J. Hellerstein and D. Patterson, High-Performance Sorting on Networks of Workstations, SIGMOD'97 Conference, pp. 243-254.
....into attribute lists, as is in the sequential algorithm. The numerical attribute lists are globally sorted so that SMP node 0 has the first N/P entries of the sorted list, SMP node 1 has the second N/P entries and so on. There are several scalable sort algorithms proposed in the literature [3, 7] that can be used to parallelize the sorting phase. Note that sorting of numerical attributes is performed only once before the tree growth phase. In this paper, we address the parallelization of the steps executed in the tree growth phase. Therefore, we implemented a sequential out-of-core sorting ....
....be used to parallelize the sorting phase. Note that sorting of numerical attributes is performed only once before the tree growth phase. In this paper, we address the parallelization of the steps executed in the tree growth phase. Therefore, we implemented a sequential out-of-core sorting algorithm [3] for the sorting phase. In the current implementation, each numerical attribute list is sorted sequentially. Sorted lists are then partitioned among the SMP nodes. Attribute lists are stored into separate files on the local disks attached to each SMP node. Figure 2 illustrates the partitioning of ....
A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, D. E. Culler, J. M. Hellerstein, and D. A. Patterson. High-performance sorting on networks of workstations. In The ACM SIGMOD International Conference on Management of Data, pages 243--254, Tucson, AZ, USA, May 1997.
....database algorithms. One such example is Berkeley's NOWSort, a shared-nothing sort implemented on a cluster of UltraSparc workstations, which held the Minute Sort record for sorting 8.41 GB in a minute, and the Datamation Sort record for sorting one million 100 B records in 2.41 seconds [7]. Berkeley's Millennium Sort, the current holder of the Datamation Sort record (1.18 seconds), runs on a cluster of Intel-based two-processor SMPs [20]. A final strength is that clusters today leverage the hardware development investment in the desktop, thus using processors, memory, and disks ....
A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, D. E. Culler, J. M. Hellerstein, and D. A. Patterson. "High-performance sorting on networks of workstations," Proc. of the ACM-SIGMOD Conference on Management of Data (SIGMOD '97), pages 243-254, May 1997.
....operating system files, not raw disk partitions. The first published time for this benchmark was an hour [12]. With constant improvements in computer hardware and sort algorithms, this time diminished to just a few seconds [7]. At that point, variations on the basic theme evolved [6]. MinuteSort [3, 8] measures how much can be sorted in one minute and PennySort [5] measures how much can be sorted for one cent, assuming a particular depreciation period. Recently, several groups reported sorting one terabyte of data [8, 9, 10]. SPsort improves substantially upon the best of these results. ....
Arpaci-Dusseau, A.C., Arpaci-Dusseau, R.H., Culler, D.E., Hellerstein, J.M., and Patterson, D.A. "High-Performance Sorting on Networks of Workstations." ACM SIGMOD '97, Tucson, Arizona, May, 1997. Available at http://now.cs.berkeley.edu/NowSort/nowSort.ps.
....measurements to similar architectures. The MinuteSort is an evolution of the Datamation sort, and was introduced in 1994 by the creators of AlphaSort [21]. It was defined as a best-effort benchmark where price is no object. The previous MinuteSort record holder was the Berkeley NOWSort [11, 12], which sorted 8.41 GB on a cluster of 95 UltraSPARC systems connected by 0.64 Gbps Myrinet. The NOWSort code was our starting point; the architectural similarities of both clusters allowed us to concentrate on tuning for input/output throughput and communication. There is no magic in this ....
....of this application, the maximum achievable throughput from a single box needs to be determined. Given the rules defined for disk-to-disk sorting, all records need to be read and written to disk. Therefore, it is expected to spend most of the time doing input/output transactions. Dusseau et al. [11] showed that in-core sorting does not take more than 10% of the total time and the communication can be mostly hidden with input/output; therefore performance depends on the maximum input/output throughput that can be obtained from every box of the cluster. The data path for an input/output ....
[Article contains additional citation context not shown here]
Andrea C. Dusseau, Remzi H. Arpaci, David E. Culler, Joseph M. Hellerstein, and David A. Patterson. High-performance sorting on networks of workstations. In SIGMOD '97, Tucson, Arizona, pages 243-254, May 1997.
....key. The input records are in random order and output records must be in ascending order. The elapsed time includes the time to create the sort process, to read the input from disk, and to create the output file and write the output to disk. The 1998 world record was 2.41 seconds by NOW Sort [4] with a 32-node cluster of UltraSPARCs, and the 1999 world record was 1.18 seconds by Millennium Sort [5] with a 16-node cluster of Pentium II PCs. 1 Some of the disks (Cheetah 18XL) support Ultra160 SCSI but they function as Ultra2 because the SCSI controllers do not support Ultra160. 2 We ....
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, David A. Patterson, "High-Performance Sorting on Networks of Workstations", ACM SIGMOD '97, May, 1997.
....and introduced in 1994 by the creators of AlphaSort [14] as a constant time benchmark, because it offered the opportunity to evaluate our work with similar works and environments for the same benchmark. At the time we began, the current holder of the MinuteSort record was the Berkeley NOWSort [7, 8], which sorted 8.41 GB on a cluster of 95 UltraSPARC systems connected by 0.64 Gbps Myrinet. Our starting point was the NOWSort, which we ported and adapted for the HPVM software, the Windows NT operating system and to our hardware configuration. The porting involved rewriting and redesign of the ....
....model the performance of this application, we first need to study the maximum achievable throughput that can be obtained from a single box of our cluster. A disk-to-disk sort application means data starts on disk and ends on disk, which implies at least one read and one write phase. Dusseau et al. [7] showed that in-core sorting does not take more than 10% of the total time and the communication can be mostly hidden with I/O; therefore the performance depends on the I/O throughput we can obtain from each machine of our cluster. The data path for an I/O operation includes all components shown in ....
[Article contains additional citation context not shown here]
Andrea C. Dusseau, Remzi H. Arpaci, David E. Culler, Joseph M. Hellerstein, and David A. Patterson. High-performance sorting on networks of workstations. In SIGMOD '97, Tucson, Arizona, pages 243-254, May 1997.
....use of a different client-to-server network, or hosts with different relative speeds would significantly change the observed trends and trade-off points. 6. Application: External Sort. External sort has a long history of research in the database community and has resulted in many fast algorithms [3, 5]. The application starts with a large unsorted data file that is partitioned across multiple nodes, and the output is a new partitioned data file that contains the same data sorted on a key field. The sample data file is based on a standard sorting benchmark that specifies 100-byte tuples, with ....
....standard sorting benchmark that specifies 100-byte tuples, with the first 10 bytes being the sort key. The distribution of the key values is assumed to be uniform, both in terms of the unsorted file as a whole and for each partition. A recent record holder for the fastest external sort is NowSort [5], and we use the pipelined version of their two-pass parallel sort for our basic algorithm. The algorithm proceeds in two phases. The first phase generates temporary sorted runs on each node, and the second phase produces the output sorted partition on each node. During the first phase, a reader ....
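The two phases named in the excerpt above (sorted runs first, then a merge into the final output) can be illustrated on a single node. This is only a sketch under stated assumptions: it keeps everything in memory instead of writing runs to disk, it is not pipelined, and the names make_runs, merge_runs, two_pass_sort and the run_size parameter are hypothetical. Only the 100-byte tuple with a 10-byte leading key comes from the excerpt.

```python
import heapq

KEY_LEN = 10  # first 10 bytes of each 100-byte tuple are the sort key


def sort_key(record: bytes) -> bytes:
    return record[:KEY_LEN]


def make_runs(records, run_size):
    """Phase 1: cut the input into chunks of run_size records and sort each chunk.

    In a real external sort each run would be written to a temporary file.
    """
    return [
        sorted(records[start:start + run_size], key=sort_key)
        for start in range(0, len(records), run_size)
    ]


def merge_runs(runs):
    """Phase 2: k-way merge of the sorted runs into one sorted output."""
    return list(heapq.merge(*runs, key=sort_key))


def two_pass_sort(records, run_size):
    return merge_runs(make_runs(records, run_size))
```

The run size stands in for the amount of memory available per node: each record is read and written once per phase, which is why disk-to-disk sorts of this shape tend to be limited by I/O throughput.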
A. Arpaci-Dusseau, R. Arpaci-Dusseau, D. Culler, J. Hellerstein, and D. Patterson. High-performance sorting on networks of workstations. In Proceedings of 1997 ACM SIGMOD Conference, Tucson, AZ, 1997.
....use of a different client-to-server network, or hosts with different relative speeds would significantly change the observed trends and trade-off points. 6. Application: External Sort. External sort has a long history of research in the database community and has resulted in many fast algorithms [3, 5]. The application starts with a large unsorted data file that is partitioned across multiple nodes, and the output is a new partitioned data file that contains the same data sorted on a key field. The sample data file is based on a standard sorting benchmark that specifies 100-byte tuples, with ....
....[Figure 10, panel (c) 5x server load: Execution time of queries under varying server load. .... means the server computation is delayed to double the execution time of a filter on the server, etc. (subsampling factor is 8)] .... for the fastest external sort is NowSort [5], and we use the pipelined version of their two-pass parallel sort for our basic algorithm. The algorithm proceeds in two phases. The first phase generates temporary sorted runs on each node, and the second phase produces the output sorted partition on each node. During the first phase, a reader ....
A. Arpaci-Dusseau, R. Arpaci-Dusseau, D. Culler, J. Hellerstein, and D. Patterson. High-performance sorting on networks of workstations. In Proceedings of 1997 ACM SIGMOD Conference, Tucson, AZ, 1997.
....phase to the end of the algorithm. The same Sort and Merge phases must still be done locally, either at the source drive before the records are sent or at the destination drive before they are written back to the disk. This is essentially the algorithm proposed by the NowSort group at Berkeley [Arpaci-Dusseau97] and will work well if the sample taken closely matches the final data distribution [footnote 1]. Whatever mismatch occurs will require a final Fixup phase where drives exchange overflow records with each other to rebalance the data. To get an idea of the applicability of these optimizations, Table 4-1 ....
....from the TPC-D benchmark [TPC98] and a data set from the Datamation sort benchmark [Gray97a]. We see significant savings for the sorts within the database queries, and a factor of exactly ten for the Datamation sort, which specifies 100-byte records and 10-byte keys. [Footnote 1: the algorithm described in [Arpaci-Dusseau97] actually assumes a uniform key distribution, but they mention the need for a Sampling phase if the data is known to be non-uniform.] [Figure residue: plots including "Key Only Sort", x-axis "Number of Disks" (32 to 128)] ....
[Article contains additional citation context not shown here]
Arpaci-Dusseau, A.C., Arpaci-Dusseau, R.H., Culler, D.E., Hellerstein, J.M. and Patterson, D.A. "High-Performance Sorting on Networks of Workstations" SIGMOD, May 1997.
....a load-balanced parallel radix sort algorithm which can sort 0.5G integers in 20 seconds on a 64-processor IBM SP2 WN. Dusseau et al. and Rivera et al. studied the performance of disk-to-disk sorting on clusters of workstations and broke the Minute Sorting record using these parallel machines [2, 9]. Helman [7] and Li [13] studied sample sort with regular sampling on various message-passing and vector computers and found it to have good portability to outperform other sampling algorithms for sample sorting. Recently, a new type of platform has begun to dominate tightly coupled ....
A. C. Dusseau, R. H. Dusseau, et al. High-performance sorting on networks of workstations. In SIGMOD '97, AZ, USA, 1997.
....that these machines ran a standard full-function operating system similar to Solaris. Acharya et al. [1] report that, on the average, the kernel on a 128 MB Solaris machine has a memory footprint of 24 MB (including the paging free list but not including the file cache). Arpaci-Dusseau et al. [7] report a similar result. Accordingly, we assumed that only 104 MB on these hosts is available to user processes. Shared memory multiprocessors (SMPs): For the SMP configurations, we followed the guidelines for configuring decision support servers (as quoted in [18]): 1) put as many processors ....
.... each processor issues up to four 256 KB asynchronous requests (each request transferring a 64 KB chunk from each of four disks). However, for sort and join, which shuffle their entire dataset and write it back to disk, we partitioned the disks into separate read and write groups (as in NOW sort [7]). Since all processors can address all disks, we did not a priori partition the input datasets to processors. Instead, we maintained two shared queues (read/write) of fixed-size blocks in the order they appear on disk. When idle, each processor locks the queue and grabs the next block off the ....
[Article contains additional citation context not shown here]
A. Arpaci-Dusseau, R. Arpaci-Dusseau, D. Culler, J. Hellerstein, and D. Patterson. High-performance sorting on networks of workstations. In Proceedings of 1997 ACM SIGMOD International Conference on Management of Data, pages 243--54, Tucson, AZ, 1997.
....I/O behavior of our spatial join algorithm. Second, it allows for a simple and fast implementation, by leveraging the performance of the highly tuned sorting routines offered by many database systems. There has been considerable work on optimized database sorts in recent years (see, e.g., [Aga96, DDC 97, NBC 94]) and it appears wise to try to draw on these results. 6 Fast Plane Sweeping Methods. As mentioned already, the overall efficiency of many spatial join algorithms is greatly influenced by the internal-memory join algorithm used as a subroutine. In this section we describe and compare ....
A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, D. E. Culler, J. M. Hellerstein, and D. A. Patterson. High-performance sorting on networks of workstations. In Proc. SIGMOD Intl. Conf. on Management of Data, pages 243-254, 1997.
....With g 0 striping, data is striped in blocks of 64 KB to each disk in the system. The table shows that g 0 striping is not effective with disks of different speeds, achieving only 77% of peak bandwidth. For g 1 striping, we gauge the relative performance of the disks via a simple off-line tool [3]; we measure that we can achieve 8.0 MB/s when writing simultaneously to the two Hawk disks, and 12.1 MB/s to the two Barracuda disks. This peak performance measured in isolation determines the proper ratio of stripe sizes: we write two blocks of data to each of the slower disks and three to each ....
A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, D. E. Culler, J. M. Hellerstein, and D. A. Patterson. High-Performance Sorting on Networks of Workstations. In Proceedings of the
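The step from measured bandwidths to a stripe ratio in the excerpt above can be worked through in a few lines. This is a sketch of one way to get a small whole-number ratio, not the off-line tool the excerpt cites: the function name stripe_ratio and the max_blocks cap are assumptions introduced here. Only the 8.0 MB/s and 12.1 MB/s measurements come from the excerpt.

```python
from fractions import Fraction


def stripe_ratio(slow_mb_s: str, fast_mb_s: str, max_blocks: int = 4):
    """Approximate fast/slow bandwidth by a fraction with a small denominator.

    Returns (blocks_per_slow_disk, blocks_per_fast_disk). Bandwidths are passed
    as decimal strings so the fraction is exact before rounding.
    """
    ratio = Fraction(fast_mb_s) / Fraction(slow_mb_s)
    # Closest fraction whose denominator (blocks on the slow disk) is <= max_blocks.
    approx = ratio.limit_denominator(max_blocks)
    return approx.denominator, approx.numerator


# Measurements quoted in the excerpt: 8.0 MB/s (slower pair), 12.1 MB/s (faster pair).
slow_blocks, fast_blocks = stripe_ratio("8.0", "12.1")
```

With these inputs the exact ratio is 121/80 = 1.5125, and the nearest fraction with a denominator of at most 4 is 3/2, so stripe_ratio("8.0", "12.1") returns (2, 3): two blocks to a slower disk for every three to a faster one, matching the ratio stated in the excerpt.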
....a database benchmark has diminished, we believe the benchmark remains relevant for three reasons. First, Datamation stresses the importance of start-up time. Start-up time has been noted as one of three factors limiting performance of parallel systems [7] and has been a problem for both clusters [2] as well as SMPs [8]. Second, Datamation is an example of an interactive parallel application. For parallelism to become commonplace, interactive applications must become a reality. Third, interactive parallel jobs are particularly sensitive to performance fluctuations that occur in large-scale ....
....performance for startup, disk I/O, and network I/O are in Section 4. The integration of these efforts into a record-breaking sort occurs in Section 5. Our conclusions are in Section 6. 2 Background. This section describes the sort algorithm we used. The implementation is derived from NOW Sort [2], which assumes that the input records are evenly distributed across P nodes, numbered 0 through P-1. The sorted output file is range-partitioned across the nodes such that the lowest-valued keys are on node 0 and highest-valued keys on node P-1; however, the number of records per node will ....
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, and David A. Patterson. High-Performance Sorting on Networks of Workstations. In Proceedings of the
....faults often will run at the rate of the slowest component in the system, losing much or all performance advantage gained through the use of multiple machines. 1.1 Motivation: A Case Study. To better motivate the problem of performance faults, we perform a simple experiment with NOW Sort [9], a high-performance parallel external sorting implementation for clusters. In developing the sorting application, many months were spent tuning the program to the various components of the system: extracting peak bandwidth from the disk drives, optimizing the in-memory sort so as to take ....
....of the interfering workload. Such analysis is beyond the scope of this article. [36], page 250, paragraph 2. NOW Sort: In our own experience with parallel external sorting in a network of workstations (NOW Sort), we found that the cluster environment was surprisingly performance-heterogeneous [9, 10]. As noted in that text: The performance of NOW Sort is quite sensitive to various disturbances and requires a dedicated system to achieve peak results. In this section, we discuss how we solved a set of particular run-time performance problems where a foreign agent (such as a competing ....
[Article contains additional citation context not shown here]
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, and David A. Patterson. High-Performance Sorting on Networks of Workstations. In SIGMOD '97, May 1997.
No context found.
A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, D. E. Culler, J. M. Hellerstein, and D. A. Patterson. High-Performance Sorting on Networks of Workstations. In Proc. ACM-SIGMOD International Conference on Management of Data, Tucson, May 1997.
No context found.
A.C.Arpaci-Dusseau et al. High-Performance Sorting on Networks of Workstations. SIGMOD'97
No context found.
A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, D. E. Culler, J. M. Hellerstein, and D. A. Patterson. High-performance sorting on networks of workstations. In Proc. ACM SIGMOD International Conf. on Management of Data, 1997.