Results 1  10
of
662
A survey of generalpurpose computation on graphics hardware
, 2007
"... The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, have made graphics hardware acompelling platform for computationally demanding tasks in awide variety of application domains. In this report, we describe, summarize, and analyze the l ..."
Abstract

Cited by 554 (15 self)
 Add to MetaCart
The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, have made graphics hardware acompelling platform for computationally demanding tasks in awide variety of application domains. In this report, we describe, summarize, and analyze the latest research in mapping generalpurpose computation to graphics hardware. We begin with the technical motivations that underlie generalpurpose computation on graphics processors (GPGPU) and describe the hardware and software developments that have led to the recent interest in this field. We then aim the main body of this report at two separate audiences. First, we describe the techniques used in mapping generalpurpose computation to graphics hardware. We believe these techniques will be generally useful for researchers who plan to develop the next generation of GPGPU algorithms and techniques. Second, we survey and categorize the latest developments in generalpurpose application development on graphics hardware.
Performance analysis of kary ncube interconnection networks
 IEEE Transactions on Computers
, 1990
"... AbstmctVLSI communication networks are wirelimited. The cost of a network is not a function of the number of switches required, but rather a function of the wiring density required to construct the network. This paper analyzes communication networks of varying dimension under the assumption of co ..."
Abstract

Cited by 357 (18 self)
 Add to MetaCart
(Show Context)
AbstmctVLSI communication networks are wirelimited. The cost of a network is not a function of the number of switches required, but rather a function of the wiring density required to construct the network. This paper analyzes communication networks of varying dimension under the assumption of constant wire bisection. Expressions for the latency, average case throughput, and hotspot throughput of kary ncube networks with constant bisection are derived that agree closely with experimental measurements. It is shown that lowdimensional networks (e.g., tori) have lower latency and higher hotspot throughput than highdimensional networks (e.g., binary ncubes) with the same bisection width. Index Terms Communication networks, concurrent computing, interconnection networks, messagepassing multiprocessors, parallel processing, VLSI. I.
Software Protection and Simulation on Oblivious RAMs
, 1993
"... Software protection is one of the most important issues concerning computer practice. There exist many heuristics and adhoc methods for protection, but the problem as a whole has not received the theoretical treatment it deserves. In this paper we provide theoretical treatment of software protectio ..."
Abstract

Cited by 312 (15 self)
 Add to MetaCart
Software protection is one of the most important issues concerning computer practice. There exist many heuristics and adhoc methods for protection, but the problem as a whole has not received the theoretical treatment it deserves. In this paper we provide theoretical treatment of software protection. We reduce the problem of software protection to the problem of efficient simulation on oblivious RAM. A machine is oblivious if the sequence in which it accesses memory locations is equivalent for any two inputs with the same running time. For example, an oblivious Turing Machine is one for which the movement of the heads on the tapes is identical for each computation. (Thus, it is independent of the actual input.) What is the slowdown in the running time of any machine, if it is required to be oblivious? In 1979 Pippenger and Fischer showed how a twotape oblivious Turing Machine can simulate, online, a onetape Turing Machine, with a logarithmic slowdown in the running time. We s...
Parallel merge sort
 SIAM Journal of Computing
, 1988
"... Abstract. We give a parallel implementation of merge sort on a CREW PRAM that uses n processors and O(logn) time; the constant in the running time is small. We also give a more complex version of the algorithm for the EREW PRAM; it also uses n processors and O(logn) time. The constant in the runnin ..."
Abstract

Cited by 311 (3 self)
 Add to MetaCart
Abstract. We give a parallel implementation of merge sort on a CREW PRAM that uses n processors and O(logn) time; the constant in the running time is small. We also give a more complex version of the algorithm for the EREW PRAM; it also uses n processors and O(logn) time. The constant in the running time is still moderate, though not as small. 1.
High Speed Switch Scheduling for Local Area Networks
 ACM Transactions on Computer Systems
, 1993
"... Current technology trends make it possible to build communication networks that can support high performance distributed computing. This paper describes issues in the design of a prototype switch for an arbitrary topology pointtopoint network with link speeds of up to one gigabit per second. The s ..."
Abstract

Cited by 246 (3 self)
 Add to MetaCart
Current technology trends make it possible to build communication networks that can support high performance distributed computing. This paper describes issues in the design of a prototype switch for an arbitrary topology pointtopoint network with link speeds of up to one gigabit per second. The switch deals in fixedlength ATMstyle cells, which it can process at a rate of 37 million cells per second. It provides high bandwidth and low latency for datagram traffic. In addition, it supports realtime traffic by providing bandwidth reservations with guaranteed latency bounds. The key to the switch's operation is a technique called parallel iterative matching, which can quickly identify a set of conflictfree cells for transmission in a time slot. Bandwidth reservations are accommodated in the switch by building a fixed schedule for transporting cells from reserved flows across the switch; parallel iterative matching can fill unused slots with datagram traffic. Finally, we note that pa...
Scans as primitive parallel operations
 IEEE Trans. Computers
, 1989
"... AbstmctIn most parallel random access machine (PRAM) models, memory references are assumed to take unit time. In practice, and in theory, certain scan operations, also known as prefix computations, can execute in no more time than these parallel memory references. This paper outlines an extensive ..."
Abstract

Cited by 187 (13 self)
 Add to MetaCart
(Show Context)
AbstmctIn most parallel random access machine (PRAM) models, memory references are assumed to take unit time. In practice, and in theory, certain scan operations, also known as prefix computations, can execute in no more time than these parallel memory references. This paper outlines an extensive study of the effect of including, in the PRAM models, such scan operations as unittime primitives. The study concludes that the primitives improve the asymptotic running time of many algorithms by an O(log n) factor greatly simplify the description of many algorithms, and are significantly easier to implement than memory references. We therefore argue that the algorithm designer should feel free to use these operations as if they were as cheap as a memory reference. This paper describes five algorithms that clearly illustrate how the scan primitives can be used in algorithm design: a mdixsort algorithm, a quicksort algorithm, a minimumspanningtrae algorithm, a linedmwing algorithm, and a merging algorithm. These all run on an EREW PRAM with the addition of two scan primitives, and are either simpler or more efficient than their pure PRAM counterparts. The scan primitives have been implemented in microcode on the Connection Machine System, are available in PARIS (the parallel instruction set of the machine), and are used in a large number of applications. All five algorithms have been tested, and the radix sort is the currently supported sorting algorithm for the Connection Machine.
Translating pseudoboolean constraints into SAT
 Journal on Satisfiability, Boolean Modeling and Computation
, 2006
"... In this paper, we describe and evaluate three different techniques for translating pseudoboolean constraints (linear constraints over boolean variables) into clauses that can be handled by a standard SATsolver. We show that by applying a proper mix of translation techniques, a SATsolver can perfor ..."
Abstract

Cited by 185 (2 self)
 Add to MetaCart
(Show Context)
In this paper, we describe and evaluate three different techniques for translating pseudoboolean constraints (linear constraints over boolean variables) into clauses that can be handled by a standard SATsolver. We show that by applying a proper mix of translation techniques, a SATsolver can perform on a par with the best existing native pseudoboolean solvers. This is particularly valuable in those cases where the constraint problem of interest is naturally expressed as a SAT problem, except for a handful of constraints. Translating those constraints to get a pure clausal problem will take full advantage of the latest improvements in SAT research. A particularly interesting result of this work is the efficiency of sorting networks to express pseudoboolean constraints. Although tangential to this presentation, the result gives a suggestion as to how synthesis tools may be modified to produce arithmetic circuits more suitable for SAT based reasoning. Keywords: pseudoBoolean, SATsolver, SAT translation, integer linear programming
GPUTeraSort: High Performance Graphics Coprocessor Sorting for Large Database Management
 SIGMOD
, 2006
"... We present a new algorithm, GPUTeraSort, to sort billionrecord widekey databases using a graphics processing unit (GPU) Our algorithm uses the data and task parallelism on the GPU to perform memoryintensive and computeintensive tasks while the CPU is used to perform I/O and resource management. ..."
Abstract

Cited by 153 (8 self)
 Add to MetaCart
(Show Context)
We present a new algorithm, GPUTeraSort, to sort billionrecord widekey databases using a graphics processing unit (GPU) Our algorithm uses the data and task parallelism on the GPU to perform memoryintensive and computeintensive tasks while the CPU is used to perform I/O and resource management. We therefore exploit both the highbandwidth GPU memory interface and the lowerbandwidth CPU main memory interface and achieve higher memory bandwidth than purely CPUbased algorithms. GPUTeraSort is a twophase task pipeline: (1) read disk, build keys, sort using the GPU, generate runs, write disk, and (2) read, merge, write. It also pipelines disk transfers and achieves nearpeak I/O performance. We have tested the performance of GPUTeraSort on billionrecord files using the standard Sort benchmark. In practice, a 3 GHz Pentium IV PC with $265 NVIDIA 7800 GT GPU is significantly faster than optimized CPUbased algorithms on much faster processors, sorting 60GB for a penny; the best reported PennySort priceperformance. These results suggest that a GPU coprocessor can significantly improve performance on large data processing tasks.
Photon Mapping on Programmable Graphics Hardware
 GRAPHICS HARDWARE
, 2003
"... We present a modified photon mapping algorithm capable of running entirely on GPUs. Our implementation uses breadthfirst photon tracing to distribute photons using the GPU. The photons are stored in a gridbased photon map that is constructed directly on the graphics hardware using one of two met ..."
Abstract

Cited by 151 (4 self)
 Add to MetaCart
We present a modified photon mapping algorithm capable of running entirely on GPUs. Our implementation uses breadthfirst photon tracing to distribute photons using the GPU. The photons are stored in a gridbased photon map that is constructed directly on the graphics hardware using one of two methods: the first method is a multipass technique that uses fragment programs to directly sort the photons into a compact grid. The second method uses a single rendering pass combining a vertex program and the stencil buffer to route photons to their respective grid cells, producing an approximate photon map. We also present an efficient method for locating the nearest photons in the grid, which makes it possible to compute an estimate of the radiance at any surface location in the scene. Finally, we describe a breadthfirst stochastic ray tracer that uses the photon map to simulate full global illumination directly on the graphics hardware. Our implementation demonstrates that current graphics hardware is capable of fully simulating global illumination with progressive, interactive feedback to the user.