Results 1 - 10 of 26
Revisiting Sorting for GPGPU Stream Architectures
- Proc. Int’l Conf. Parallel Architectures and Compilation Techniques (PACT), 2010
Cited by 26 (0 self)
This report presents efficient strategies for sorting large sequences of fixed-length keys (and values) using GPGPU stream processors. Our radix sorting methods demonstrate sorting rates of 482 million key-value pairs per second, and 550 million keys per second (32-bit). Compared to the state-of-the-art, our implementations exhibit speedup of at least 2x for all fully-programmable generations of NVIDIA GPUs, and up to 3.7x for current GT200-based models. For this domain of sorting problems, we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture. We obtain our sorting performance by using a parallel scan stream primitive that has been generalized in two ways: (1) with local interfaces for producer/consumer operations (visiting logic), and (2) with interfaces for performing multiple related, concurrent prefix scans (multiscan). These abstractions allow us to improve the overall utilization of memory and computational resources while maintaining the flexibility of a reusable component. We require 38% fewer bytes to be moved through the global memory subsystem and 64% fewer thread-cycles for computation. As part of this work, we demonstrate a method for encoding multiple compaction problems into a single, composite parallel scan. This technique provides our local sorting strategies with a 2.5x speedup over bitonic sorting networks for small problem instances, i.e., sequences that can be entirely sorted within the shared memory local to a single GPU core.
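The scan-based structure the abstract describes can be illustrated with a minimal serial sketch: one radix pass histograms the current digit, prefix-scans the bins, and scatters keys stably. The 4-bit digit width, names, and serial loops are illustrative assumptions, not the paper's fused GPU multiscan kernels.

```python
# Serial sketch of radix sort as repeated histogram + prefix scan +
# stable scatter. Illustrative only; digit width is an assumption.

RADIX_BITS = 4                      # 4-bit digits -> 16 bins per pass
RADIX = 1 << RADIX_BITS

def radix_pass(keys, shift):
    """Stable scatter of keys by the digit at `shift`, using prefix sums."""
    counts = [0] * RADIX
    for k in keys:                  # histogram of digit occurrences
        counts[(k >> shift) & (RADIX - 1)] += 1
    offsets, total = [], 0
    for c in counts:                # exclusive prefix scan over the bins
        offsets.append(total)
        total += c
    out = [None] * len(keys)
    for k in keys:                  # stable scatter using scanned offsets
        d = (k >> shift) & (RADIX - 1)
        out[offsets[d]] = k
        offsets[d] += 1
    return out

def radix_sort(keys, key_bits=32):
    for shift in range(0, key_bits, RADIX_BITS):
        keys = radix_pass(keys, shift)
    return keys
```

The compaction-by-scan step (histogram bins turned into write offsets) is the piece the paper generalizes into a single composite multiscan.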
High-performance online spatial and temporal aggregations on multi-core CPUs and many-core GPUs
- Proceedings of the ACM 15th International Workshop on Data Warehousing and OLAP, 2012
Cited by 19 (14 self)
With the increasing availability of locating and navigation technologies on portable wireless devices, huge amounts of location data are being captured at ever growing rates. Spatial and temporal aggregations in an Online Analytical Processing (OLAP) setting for large-scale ubiquitous urban sensing data play an important role in understanding urban dynamics and facilitating decision making. Unfortunately, existing spatial, temporal and spatiotemporal OLAP techniques are mostly based on traditional computing frameworks, i.e., disk-resident systems on uniprocessors running serial algorithms, which makes them incapable of handling large-scale data on the parallel hardware architectures that now come standard in commodity computers. In this study, we report our designs, implementations and experiments in developing a data management platform and a set of parallel techniques to support high-performance online spatial and temporal aggregations on multi-core CPUs and many-core Graphics Processing Units (GPUs). Our experimental results show that we are able to spatially associate nearly 170 million taxi pickup location points with their nearest street segments among 147,011 candidates in about 5-25 s on both an Nvidia Quadro 6000 GPU device and dual Intel Xeon E5405 quad-core CPUs when their Vector Processing Units (VPUs) are utilized for compute-intensive tasks. After spatially associating points with road segments, spatial, temporal and spatiotemporal aggregations are reduced to relational aggregations and can be processed in a fraction of a second on both GPUs and multi-core CPUs. In addition to demonstrating the feasibility of building a high-performance OLAP system for processing large-scale taxi trip data for real-time, interactive data exploration, our work also opens paths to achieving even higher OLAP query efficiency for large-scale applications by integrating domain-specific data management platforms, novel parallel data structures and algorithm designs, and hardware-architecture-friendly implementations.
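The core spatial step described above — associating a point with its nearest street segment — reduces to a point-to-segment distance test. A brute-force serial sketch follows; the function names and exhaustive search are assumptions, since the paper uses indexed, parallel techniques at far larger scale.

```python
# Illustrative sketch: nearest street segment by point-to-segment
# distance, brute force over all candidates. Names are invented.
import math

def point_segment_dist(px, py, ax, ay, bx, by):
    """Euclidean distance from point (px, py) to segment (a, b)."""
    dx, dy = bx - ax, by - ay
    L2 = dx * dx + dy * dy
    if L2 == 0:                          # degenerate segment: a point
        return math.hypot(px - ax, py - ay)
    # project the point onto the segment, clamped to its endpoints
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / L2))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def nearest_segment(point, segments):
    """Index of the segment (ax, ay, bx, by) closest to `point`."""
    px, py = point
    return min(range(len(segments)),
               key=lambda i: point_segment_dist(px, py, *segments[i]))
```

Once every point carries a segment id, the aggregations become ordinary relational group-bys, which is the reduction the abstract relies on.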
Average case analysis of Java 7’s dual pivot Quicksort
- In Proc. of 20th European Symposium on Algorithms (ESA), 2012
Cited by 6 (3 self)
Recently, a new Quicksort variant due to Yaroslavskiy was chosen as the standard sorting method for Oracle’s Java 7 runtime library. The decision for the change was based on empirical studies showing that, on average, the new algorithm is faster than the formerly used classic Quicksort. Surprisingly, the improvement was achieved by a dual-pivot approach, an idea considered unpromising by several theoretical studies in the past. In this paper, we identify the reason for this unexpected success. Moreover, we present the first precise average case analysis of the new algorithm, showing e.g. that a random permutation of length n is sorted using 1.9 n ln n − 2.46 n + O(ln n) key comparisons and 0.6 n ln n + 0.08 n + O(ln n) swaps.
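The dual-pivot scheme analyzed above can be sketched compactly: two pivots p ≤ q split the array into three parts (< p, between, > q), which are sorted recursively. This is an illustrative rendering of Yaroslavskiy's partitioning, not Oracle's tuned Java code (which adds insertion-sort cutoffs and careful pivot selection).

```python
# Minimal recursive sketch of dual-pivot Quicksort (Yaroslavskiy-style
# three-way partition around two pivots). Illustrative only.

def dual_pivot_quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return a
    if a[lo] > a[hi]:
        a[lo], a[hi] = a[hi], a[lo]
    p, q = a[lo], a[hi]                # two pivots, p <= q
    lt, gt, i = lo + 1, hi - 1, lo + 1
    while i <= gt:
        if a[i] < p:                   # belongs to the left part
            a[i], a[lt] = a[lt], a[i]
            lt += 1
        elif a[i] > q:                 # belongs to the right part
            while a[gt] > q and i < gt:
                gt -= 1
            a[i], a[gt] = a[gt], a[i]
            gt -= 1
            if a[i] < p:               # swapped-in element may go left
                a[i], a[lt] = a[lt], a[i]
                lt += 1
        i += 1
    lt -= 1
    gt += 1
    a[lo], a[lt] = a[lt], a[lo]        # pivots into final positions
    a[hi], a[gt] = a[gt], a[hi]
    dual_pivot_quicksort(a, lo, lt - 1)
    dual_pivot_quicksort(a, lt + 1, gt - 1)
    dual_pivot_quicksort(a, gt + 1, hi)
    return a
```

The three-way split is what changes the comparison and swap counts relative to classic single-pivot Quicksort, which is the quantity the paper's average case analysis pins down.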
Efficient Parallel Merge Sort for Fixed and Variable Length Keys
Cited by 5 (0 self)
We design a high-performance parallel merge sort for highly parallel systems. Our merge sort is designed to use register communication rather than shared memory, and does not suffer from the over-segmentation of previous comparison-based sorts. Using these techniques we achieve a sorting rate of 250 MKeys/sec, about 2.5 times faster than Thrust merge sort and 70% faster than non-stable state-of-the-art GPU merge sorts. Building on this sorting algorithm, we develop a scheme for sorting variable-length key/value pairs, with a special emphasis on string keys. Sorting non-uniform, unaligned data such as strings is a fundamental step in a variety of algorithms, yet it has received comparatively little attention. To our knowledge, our system is the first published description of an efficient string sort for GPUs. We sort strings at a rate of 70 MStrings/s on one dataset and up to 1.25 GB/s on another using a GTX 580.
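A standard building block for parallel merge sorts of this kind (not necessarily the exact scheme used here) is the "merge path" partition: each worker binary-searches along a diagonal for the point where its slice of the output begins, so the two sorted inputs can be merged in independent pieces. A serial sketch:

```python
# Serial sketch of merge-path partitioning for a parallel-style merge.
# Worker count and names are illustrative assumptions.

def merge_path_partition(a, b, diag):
    """Return (i, j) with i + j == diag such that a[:i] and b[:j]
    together hold exactly the `diag` smallest elements of the merge."""
    lo, hi = max(0, diag - len(b)), min(diag, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        if a[i] < b[diag - i - 1]:     # partition point lies further right
            lo = i + 1
        else:
            hi = i
    return lo, diag - lo

def parallel_style_merge(a, b, num_workers=4):
    """Merge sorted a and b in independent slices, as workers would."""
    n = len(a) + len(b)
    bounds = [merge_path_partition(a, b, n * w // num_workers)
              for w in range(num_workers + 1)]
    out = []
    for (i0, j0), (i1, j1) in zip(bounds, bounds[1:]):
        ai, bj = i0, j0                # each slice merges independently
        while ai < i1 and bj < j1:
            if a[ai] <= b[bj]:
                out.append(a[ai]); ai += 1
            else:
                out.append(b[bj]); bj += 1
        out.extend(a[ai:i1])
        out.extend(b[bj:j1])
    return out
```

Because every slice is computed from the partition alone, no worker depends on another's progress — the property that lets GPU merge sorts keep all cores busy without over-segmentation.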
Programming the Memory Hierarchy Revisited: Supporting Irregular Parallelism in Sequoia
Cited by 5 (4 self)
We describe two novel constructs for programming parallel machines with multi-level memory hierarchies: call-up, which allows a child task to invoke computation on its parent, and spawn, which spawns a dynamically determined number of parallel children until some termination condition in the parent is met. Together, we show that these constructs allow applications with irregular parallelism to be programmed in a straightforward manner; furthermore, they complement and can be combined with constructs for expressing regular parallelism. We have implemented spawn and call-up in Sequoia, and we present an experimental evaluation on a number of irregular applications.
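The two constructs can be mimicked in a toy serial form: `spawn` keeps creating children until the parent's termination test holds, and `call_up` lets a child run a function in its parent's context. The class and method names below are invented for illustration; Sequoia's real constructs are language-level and parallel.

```python
# Toy sketch of spawn/call-up semantics in plain Python. All names
# are assumptions, not Sequoia syntax.

class Task:
    def __init__(self, parent=None):
        self.parent = parent
        self.results = []

    def call_up(self, fn, *args):
        """Child-initiated computation executed on the parent task."""
        return fn(self.parent, *args)

    def spawn(self, make_child, done):
        """Create children one by one until `done(parent)` is met."""
        while not done(self):
            make_child(Task(parent=self))

# Children push partial results up via call-up until the parent has 3.
root = Task()
root.spawn(
    make_child=lambda c: c.call_up(lambda p, v: p.results.append(v),
                                   len(root.results)),
    done=lambda p: len(p.results) >= 3,
)
```

The point of the pairing is visible even in this sketch: the parent does not know in advance how many children it needs (spawn), and children can feed results back into the parent's state while it is still deciding (call-up).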
A New Data Layout For Set Intersection on GPUs
Cited by 4 (1 self)
Set intersection is the core of a variety of problems, e.g. frequent itemset mining and sparse Boolean matrix multiplication. It is well known that large speed gains can, for some computational problems, be obtained by using a graphics processing unit (GPU) as a massively parallel computing device. However, GPUs require highly regular control flow and memory access patterns, and for this reason previous GPU methods for intersecting sets have used a simple bitmap representation. This representation requires excessive space on sparse data sets. In this paper we present a novel data layout, BATMAP, that is particularly well suited for parallel processing and is compact even for sparse data. Frequent itemset mining is one of the most important applications of set intersection. As a case study on the potential of BATMAPs we focus on frequent pair mining, a core special case of frequent itemset mining. The main finding is that our method is able to achieve speedups over both Apriori and FP-growth when the number of distinct items is large and the density of the problem instance is above 1%. Previous implementations of frequent itemset mining on GPUs have not been able to show speedups over the best single-threaded implementations.
Keywords: set intersection; frequent itemset mining; sparse Boolean matrix multiplication; data layout; GPU
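The simple bitmap representation the abstract contrasts BATMAP with can be sketched in a few lines: each set over a universe of n items becomes an n-bit mask, and intersection is a bitwise AND. Function names are illustrative; BATMAP itself is the paper's more compact layout for sparse data.

```python
# Dense bitmap baseline for set intersection. Illustrative only.

def to_bitmap(items):
    """Encode a set of non-negative item ids as an integer bitmask."""
    mask = 0
    for i in items:
        mask |= 1 << i
    return mask

def intersect_size(mask_a, mask_b):
    """Size of the intersection via bitwise AND plus a popcount."""
    return bin(mask_a & mask_b).count("1")
```

The weakness motivating BATMAP is visible here: the mask costs one bit per item in the universe regardless of how few items the set actually contains, which is wasteful exactly when data are sparse.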
Deterministic Sample Sort For GPUs
- Parallel Processing Letters, © World Scientific Publishing Company, 2011
Cited by 2 (0 self)
We demonstrate that parallel deterministic sample sort for many-core GPUs (GPU Bucket Sort) is not only considerably faster than the best comparison-based sorting algorithm for GPUs (Thrust Merge [Satish et al., Proc. IPDPS 2009]) but also as fast as randomized sample sort for GPUs (GPU Sample Sort [Leischner et al., Proc. IPDPS 2010]). However, deterministic sample sort has the advantage that bucket sizes are guaranteed, and therefore its running time does not exhibit the input-data-dependent fluctuations that can occur with randomized sample sort.
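Deterministic sample sort can be sketched serially: take an evenly spaced (rather than random) sample, pick splitters from its sorted order, scatter keys into buckets, and sort each bucket. Parameter choices and names below are illustrative assumptions; the paper's GPU version works per thread block with guaranteed bucket-size bounds.

```python
# Serial sketch of deterministic sample sort. Parameters are assumed.
import bisect

def sample_sort(keys, num_buckets=4, oversample=8):
    if len(keys) <= num_buckets:
        return sorted(keys)
    # deterministic, evenly spaced sample -- no randomness involved
    step = max(1, len(keys) // (num_buckets * oversample))
    sample = sorted(keys[::step])
    splitters = [sample[len(sample) * i // num_buckets]
                 for i in range(1, num_buckets)]
    # scatter each key into the bucket its splitter range selects
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        buckets[bisect.bisect_right(splitters, k)].append(k)
    out = []
    for b in buckets:                  # sort buckets independently
        out.extend(sorted(b))
    return out
```

Because the sample is fixed by the input layout rather than drawn at random, the splitters (and hence the bucket sizes) are reproducible — the property the abstract credits for the algorithm's stable running time.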
GPUMemSort: A High Performance Graphic Co-processors Sorting Algorithm for Large Scale In-Memory Data
Cited by 2 (0 self)
In this paper, we present a GPU-based sorting algorithm, GPUMemSort, which achieves high performance in sorting large-scale in-memory data by exploiting highly parallel GPU processors. It consists of two algorithms: an in-core algorithm, which is responsible for sorting data in GPU global memory efficiently, and an out-of-core algorithm, which is responsible for dividing large-scale data into multiple chunks that fit in GPU global memory. GPUMemSort is implemented on the NVIDIA CUDA framework, and several critical, detailed optimization methods are also presented. Tests of different algorithms were run on multiple data sets. The experimental results show that our in-core sort outperforms other comparison-based algorithms and that GPUMemSort is highly effective in sorting large-scale in-memory data.
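The two-level structure described above can be sketched serially: an "out-of-core" driver splits the input into chunks bounded by a capacity limit (standing in for GPU global memory), an "in-core" sort handles each chunk, and the sorted runs are merged. All names and the capacity constant are illustrative assumptions.

```python
# Serial sketch of an out-of-core driver over an in-core sort.
import heapq

CHUNK_CAPACITY = 1 << 10   # stand-in for GPU global memory capacity

def in_core_sort(chunk):
    """Placeholder for the device-side sorting algorithm."""
    return sorted(chunk)

def out_of_core_sort(data):
    """Split into device-sized chunks, sort each, merge the runs."""
    runs = [in_core_sort(data[i:i + CHUNK_CAPACITY])
            for i in range(0, len(data), CHUNK_CAPACITY)]
    return list(heapq.merge(*runs))
```

The driver never needs more than one chunk resident at a time, which is what lets an in-memory dataset larger than device memory be sorted by a device-resident algorithm.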
Applicability of GPU Computing for Efficient Merge in In-Memory Databases
Cited by 1 (0 self)
Column-oriented in-memory databases typically use dictionary compression to reduce overall storage space and allow fast lookup and comparison. However, updates carry a high performance cost, since the dictionary used for compression has to be recreated each time records are created, updated or deleted. This has to be taken into account for TPC-C-like workloads, in which around 45% of all queries are transactional modifications. A technique called differential updates can be used to allow faster modifications: in addition to the main storage, the database maintains a delta storage to accommodate modifying queries. During the merge process, the modifications in the delta are merged with the main storage in parallel to the normal operation of the database. Current hardware and software trends suggest that this problem can be tackled by massively parallelizing the merge process. One approach to massive parallelism is the GPU, which offers orders of magnitude more cores than modern CPUs. We therefore analyze the feasibility of a parallel GPU merge implementation and its potential speedup. We find that the maximum potential merge speedup is limited, since only two of its four stages are likely to benefit from parallelization. We present a parallel dictionary slice merge algorithm as well as an alternative parallel merge algorithm for GPUs that achieves up to 40% more throughput than its CPU implementation. In addition, we propose a parallel duplicate removal algorithm that achieves up to 27 times the throughput of the CPU implementation.
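One stage of the merge described above — combining a sorted main dictionary with a sorted delta while removing duplicate values — can be sketched serially; this is an illustration of the stage, not the paper's parallel GPU algorithm.

```python
# Serial sketch of merging a sorted main dictionary with a sorted
# delta dictionary, dropping duplicate values. Illustrative only.

def merge_dictionaries(main, delta):
    """Two-pointer merge of two sorted value lists without duplicates."""
    out, i, j = [], 0, 0
    while i < len(main) and j < len(delta):
        if main[i] < delta[j]:
            out.append(main[i]); i += 1
        elif delta[j] < main[i]:
            out.append(delta[j]); j += 1
        else:                          # duplicate value: keep one copy
            out.append(main[i]); i += 1; j += 1
    out.extend(main[i:])               # drain whichever side remains
    out.extend(delta[j:])
    return out
```

The sequential two-pointer scan is exactly the dependency that makes this stage hard to parallelize naively, which is why the paper introduces slice-based and duplicate-removal variants for the GPU.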
Can GPUs Sort Strings Efficiently?
Cited by 1 (1 self)
String sorting, or variable-length key sorting, has lagged in performance on the GPU even as fixed-length key sorting has improved dramatically; radix sorting is the fastest approach on GPUs. In this paper, we present a fast and efficient string sort for the GPU that is built on the available radix sort. Our method sorts strings from left to right in steps, moving only indexes and small prefixes for efficiency. We reduce the number of sort steps by adaptively consuming the maximum number of string bytes based on the number of segments in each step. Performance is improved by using Thrust primitives for most steps and by removing singleton segments from consideration. Over 70% of the string sort time is spent in Thrust primitives, which provides high performance along with high adaptability to future GPUs. We achieve speedups of up to 10x over current GPU methods, especially on large datasets, and we scale to much larger input sizes. We present results on easy and difficult strings, defined using their after-sort tie lengths.
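The left-to-right, prefix-and-index strategy can be sketched serially: sort string indices by a fixed-width prefix, then in later passes refine only the segments that are still tied, going one prefix chunk deeper each time. The chunk width and names are illustrative assumptions; the paper does this with GPU radix sort over key prefixes and segment ids.

```python
# Serial sketch of left-to-right string sorting that moves only
# indices and small fixed-width prefixes. Illustrative only.

CHUNK = 4  # bytes of each string compared per pass (assumed width)

def sort_by_prefix_chunks(strings):
    """Return indices of `strings` in sorted order, refining ties."""
    order = sorted(range(len(strings)),
                   key=lambda i: strings[i][:CHUNK])
    depth = CHUNK
    while True:
        refined, tied, i = [], False, 0
        while i < len(order):
            j = i
            # find a run whose prefixes up to `depth` are all equal
            while (j + 1 < len(order) and
                   strings[order[j + 1]][:depth] == strings[order[i]][:depth]):
                j += 1
            run = order[i:j + 1]
            if len(run) > 1 and any(len(strings[k]) > depth for k in run):
                tied = True            # still-tied segment: go deeper
                run.sort(key=lambda k: strings[k][:depth + CHUNK])
            refined.extend(run)
            i = j + 1
        order = refined
        if not tied:                   # no segment needs more bytes
            return order
        depth += CHUNK
```

Singleton segments (runs of length one) are never re-sorted, mirroring the abstract's optimization of removing them from consideration, and only small prefixes plus indices ever move.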