NESL: A nested data-parallel language (version 2.6)
, 1993
"... The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Wright Laboratory or the U. S. Government. Keywords: Data-parallel, parallel algorithms, supe ..."
Abstract
-
Cited by 112 (8 self)
- Add to MetaCart
This report describes Nesl, a strongly-typed, applicative, data-parallel language. Nesl is intended to be used as a portable interface for programming a variety of parallel and vector computers, and as a basis for teaching parallel algorithms. Parallelism is supplied through a simple set of data-parallel constructs based on sequences, including a mechanism for applying any function over the elements of a sequence in parallel and a rich set of parallel functions that manipulate sequences. Nesl fully supports nested sequences and nested parallelism—the ability to take a parallel function and apply it over multiple instances in parallel. Nested parallelism is important for implementing algorithms with irregular nested loops (where the inner loop lengths depend on the outer iteration) and for divide-and-conquer algorithms. Nesl also provides a performance model for calculating the asymptotic performance of a program on ...
Keywords: data-parallel, parallel algorithms, supercomputers, nested parallelism.
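To make the nested-parallelism idea concrete, here is a minimal sketch in Python (not NESL) of the kind of irregular computation the abstract mentions: a sparse matrix-vector product where inner loop lengths vary by row. The names spmv, rows, and x are invented for this illustration; plain comprehensions stand in for NESL's parallel apply-to-each, which would evaluate both the outer and inner levels in parallel.

```python
# Illustrative sketch only: NESL-style nested data parallelism emulated
# with ordinary Python comprehensions. A sparse matrix is a nested
# sequence of (column, value) pairs per row; row lengths vary, which is
# exactly the irregular case nested parallelism handles. Each row's dot
# product is independent, so a data-parallel runtime could evaluate the
# outer and inner comprehensions in parallel.

def spmv(rows, x):
    """y[i] = sum of v * x[j] over the (j, v) pairs in rows[i]."""
    return [sum(v * x[j] for (j, v) in row) for row in rows]

rows = [[(0, 2.0), (3, 1.0)],             # row 0: two nonzeros
        [(1, 5.0)],                       # row 1: one nonzero
        [(0, 1.0), (1, 1.0), (2, 1.0)]]   # row 2: three nonzeros
x = [1.0, 2.0, 3.0, 4.0]
print(spmv(rows, x))   # [6.0, 10.0, 6.0]
```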
Compiling Collection-Oriented Languages onto Massively Parallel Computers
- Journal of Parallel and Distributed Computing
, 1990
"... : This paper introduces techniques for compiling the nested parallelism of collectionoriented languages onto existing parallel hardware. Programmers of parallel machines encounter nested parallelism whenever they write a routine that performs parallel operations, and then want to call that routine ..."
Abstract
-
Cited by 102 (11 self)
- Add to MetaCart
(Show Context)
This paper introduces techniques for compiling the nested parallelism of collection-oriented languages onto existing parallel hardware. Programmers of parallel machines encounter nested parallelism whenever they write a routine that performs parallel operations, and then want to call that routine itself in parallel. This occurs naturally in many applications. Most parallel systems, however, do not permit the expression of nested parallelism. This forces the programmer to exploit only one level of parallelism or to implement nested parallelism themselves. Both of these alternatives tend to produce code that is harder to maintain and less modular than code described at a higher level with nested parallel constructs. Not permitting the expression of nested parallelism is analogous to not permitting nested loops in serial languages. This paper describes issues and techniques for taking high-level descriptions of parallelism in the form of operations on nested collections and automatically ...
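A minimal sketch of the flattening idea behind such compilers, assuming the standard representation of a nested sequence as one flat value array plus a segment descriptor of lengths (flatten and segmented_sum are invented names, not the paper's primitives):

```python
# Sketch: nested parallelism compiled down to flat, vectorizable
# operations. The nested structure lives in the segment descriptor;
# one flat pass plus a prefix sum (scan) replaces the nested loop.
from itertools import accumulate

def flatten(nested):
    values = [v for seg in nested for v in seg]
    lengths = [len(seg) for seg in nested]
    return values, lengths

def segmented_sum(values, lengths):
    # Segment offsets come from a scan of the lengths -- the same
    # primitive the flattened code would run on parallel hardware.
    offsets = [0] + list(accumulate(lengths))
    return [sum(values[offsets[i]:offsets[i + 1]])
            for i in range(len(lengths))]

values, lengths = flatten([[1, 2], [], [3, 4, 5]])
print(segmented_sum(values, lengths))   # [3, 0, 12]
```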
Provably efficient scheduling for languages with fine-grained parallelism
- In Proc. Symposium on Parallel Algorithms and Architectures
, 1995
"... Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A ..."
Abstract
-
Cited by 95 (28 self)
- Add to MetaCart
(Show Context)
Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A common concern in executing such programs is to schedule tasks to processors dynamically so as to minimize not only the execution time, but also the amount of space (memory) needed. Without careful scheduling, the parallel execution on p processors can use a factor of p or more times the space of a sequential implementation of the same program. This paper first identifies a class of parallel schedules that are provably efficient in both time and space. For any ...
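To see where the factor-of-p space blowup comes from, consider a balanced binary divide-and-conquer task tree (an invented example, not the paper's model): a depth-first execution keeps only one root-to-leaf path of tasks live, while an unthrottled breadth-first expansion keeps an entire level live at once.

```python
# Illustrative only: peak number of simultaneously live tasks on a
# balanced binary task tree of the given depth, under two schedules.

def dfs_peak_live(depth):
    return depth + 1        # one root-to-leaf path stays live

def bfs_peak_live(depth):
    return 2 ** depth       # the full widest level stays live

for d in (4, 10, 20):
    print(f"depth {d:2d}: depth-first {dfs_peak_live(d):>9,}, "
          f"breadth-first {bfs_peak_live(d):>9,}")
```

Scheduling policies that stay close to the sequential execution order while still keeping p processors busy avoid this blowup, which is the gap such provably space-efficient schedules target.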
SMART: A Scan-Based Movement-Assisted Sensor Deployment Method in Wireless Sensor Networks
- In Proc. of IEEE INFOCOM
, 2005
"... Abstract—The efficiency of sensor networks depends on the coverage of the monitoring area. Although, in general, a sufficient number of sensors are used to ensure a certain degree of redundancy in coverage, a good sensor deployment is still necessary to balance the workload of sensors. In a sensor n ..."
Abstract
-
Cited by 68 (4 self)
- Add to MetaCart
(Show Context)
The efficiency of sensor networks depends on the coverage of the monitoring area. Although, in general, a sufficient number of sensors are used to ensure a certain degree of redundancy in coverage, a good sensor deployment is still necessary to balance the workload of sensors. In a sensor network with locomotion facilities, sensors can move around to self-deploy. Movement-assisted sensor deployment deals with moving sensors from an initial unbalanced state to a balanced state, so various optimization problems can be defined to minimize different parameters, including total moving distance, total number of moves, communication/computation cost, and convergence rate. In this paper, we first propose a Hungarian-algorithm-based optimal solution, which is centralized. Then we propose a localized Scan-based Movement-Assisted sensoR deploymenT method (SMART), and several variations of it, that use scan and dimension exchange to achieve a balanced state. An extended SMART is developed to address a unique problem called communication holes in sensor networks. Extensive simulations have been done to verify the effectiveness of the proposed scheme.
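As a rough illustration of the scan component, here is a one-dimensional sketch assuming a single row of equal-capacity cells, each knowing only its own sensor count: a prefix sum of the counts determines the net number of sensors to shift across each cell boundary to reach the balanced state. boundary_flows is an invented name, and this is a generic scan balancer, not SMART itself.

```python
# Illustrative scan-based balancing in 1-D. prefix[i] is how many
# sensors sit in cells 0..i; total*(i+1)//n is how many should after
# balancing; the difference is the net flow across boundary i
# (positive = move right, negative = move left).
from itertools import accumulate

def boundary_flows(loads):
    n, total = len(loads), sum(loads)
    prefix = list(accumulate(loads))
    return [prefix[i] - total * (i + 1) // n for i in range(n - 1)]

loads = [7, 1, 0, 4]                 # 12 sensors over 4 cells -> 3 each
print(boundary_flows(loads))         # [4, 2, -1]
```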
Scalable GPU graph traversal
- In 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP’12
, 2012
"... Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrate ..."
Abstract
-
Cited by 64 (1 self)
- Add to MetaCart
Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sums that achieves an asymptotically optimal O(|V|+|E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
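The core kernel shape the abstract alludes to can be sketched sequentially: each BFS level gathers the frontier's neighbors, with a prefix sum over per-vertex degrees assigning every vertex a disjoint write range, so the total work stays O(|V|+|E|). Plain Python stands in for the GPU kernels here, and bfs and adj are illustrative names.

```python
# Illustrative prefix-sum frontier expansion (sequential stand-in for
# the data-parallel GPU version).
from itertools import accumulate

def bfs(adj, src):
    dist = {src: 0}
    frontier, level = [src], 0
    while frontier:
        level += 1
        degrees = [len(adj[v]) for v in frontier]
        offsets = [0] + list(accumulate(degrees))   # scan -> write slots
        gathered = [None] * offsets[-1]
        for i, v in enumerate(frontier):            # independent per vertex
            for k, w in enumerate(adj[v]):
                gathered[offsets[i] + k] = w
        frontier = []
        for w in gathered:                          # visited filter
            if w not in dist:
                dist[w] = level
                frontier.append(w)
    return dist

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(bfs(adj, 0))   # {0: 0, 1: 1, 2: 1, 3: 2}
```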
A cost calculus for parallel functional programming
- Journal of Parallel and Distributed Computing
, 1995
"... Building a cost calculus for a parallel program development environment is difficult because of the many degrees of freedom available in parallel implementations, and because of difficulties with compositionality. We present a strategy for building cost calculi for skeleton-based programming languag ..."
Abstract
-
Cited by 61 (6 self)
- Add to MetaCart
Building a cost calculus for a parallel program development environment is difficult because of the many degrees of freedom available in parallel implementations, and because of difficulties with compositionality. We present a strategy for building cost calculi for skeleton-based programming languages which can be used for derivational software development and which deals in a pragmatic way with the difficulties of composition. The approach is illustrated for the Bird-Meertens theory of lists, a parallel functional language with an associated equational transformation system.
Keywords: functional programming, parallel programming, program transformation, cost calculus, equational theories, architecture independence, Bird-Meertens formalism.
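As a tiny example of the kind of equation such a calculus must price, the Bird-Meertens law map f . map g = map (f . g) preserves meaning while halving the number of bulk traversals, so a cost calculus lets a derivation pick the cheaper side. Python maps stand in for the formalism's combinators here:

```python
# Illustrative map-fusion rewrite: same result, one traversal instead
# of two (on a parallel machine, one bulk step instead of two).
f = lambda x: x + 1
g = lambda x: 2 * x

xs = [1, 2, 3]
unfused = list(map(f, map(g, xs)))          # two traversals
fused = list(map(lambda x: f(g(x)), xs))    # one fused traversal
assert unfused == fused == [3, 5, 7]
```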
Active Disks - Remote Execution for Network-Attached Storage
, 1997
"... The principal trend in the design of computer systems is the expectation of much greater computational power in future generations of microprocessors. This trend applies to embedded systems as well as host processors. As a result, devices such as storage controllers have excess capacity and growing ..."
Abstract
-
Cited by 59 (1 self)
- Add to MetaCart
(Show Context)
The principal trend in the design of computer systems is the expectation of much greater computational power in future generations of microprocessors. This trend applies to embedded systems as well as host processors. As a result, devices such as storage controllers have excess capacity and growing computational capabilities. Storage system designers are exploiting this trend with higher-level interfaces to storage and increased intelligence inside storage devices. One development in this direction is Network-Attached Secure Disks (NASD), which attaches storage devices directly to the network and raises the storage interface above the simple (fixed-size block) memory abstraction of SCSI. This allows devices more freedom to provide efficient operations; promises more scalable subsystems by offloading file system and storage management functionality from dedicated servers; and reduces latency by executing common-case requests directly at storage devices. In this paper, we push this increa...
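A toy sketch of the remote-execution idea: rather than shipping every block to the host and filtering there, the host ships a small filter to the drive and only matching records cross the network. The ActiveDisk class and its execute method are invented for illustration; they are not the NASD interface.

```python
# Illustrative mock of filtering at the storage device.
class ActiveDisk:
    def __init__(self, records):
        self.records = records          # data resident on the drive

    def execute(self, predicate):
        # Runs "at the drive": only matches return over the network.
        return [r for r in self.records if predicate(r)]

disk = ActiveDisk([{"id": i, "size": i * 100} for i in range(1000)])
big = disk.execute(lambda r: r["size"] > 99_000)
print(len(big))   # 9 records shipped instead of 1000
```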
Radix Sort For Vector Multiprocessors
- In Proceedings Supercomputing '91
, 1991
"... We have designed a radix sort algorithm for vector multiprocessors and have implemented the algorithm on the CRAY Y-MP. On one processor of the Y-MP, our sort is over 5 times faster on large sorting problems than the optimized library sort provided by CRAY Research. On eight processors we achieve a ..."
Abstract
-
Cited by 51 (6 self)
- Add to MetaCart
(Show Context)
We have designed a radix sort algorithm for vector multiprocessors and have implemented the algorithm on the CRAY Y-MP. On one processor of the Y-MP, our sort is over 5 times faster on large sorting problems than the optimized library sort provided by CRAY Research. On eight processors we achieve an additional speedup of almost 5, yielding a routine over 25 times faster than the library sort. Using this multiprocessor version, we can sort at a rate of 15 million 64-bit keys per second. Our sorting algorithm is adapted from a data-parallel algorithm previously designed for a highly parallel Single Instruction Multiple Data (SIMD) computer, the Connection Machine CM-2. To develop our version we introduce three general techniques for mapping data-parallel algorithms onto vector multiprocessors. These techniques allow us to fully vectorize and parallelize the algorithm. The paper also derives equations that model the performance of our algorithm on the Y-MP. These equations are then used to ...
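The structure being vectorized can be sketched as the classic counting pass per digit: histogram the current digit, take an exclusive scan of the histogram to get bucket offsets, then scatter stably. This generic form (radix_sort below is an invented helper) is the data-parallel skeleton such sorts share, not the CRAY implementation itself.

```python
# Illustrative digit-at-a-time radix sort: histogram + exclusive scan +
# stable scatter per pass. On a vector machine the histogram and
# scatter loops are the vectorized steps.
from itertools import accumulate

def radix_sort(keys, bits=16, radix_bits=4):
    buckets = 1 << radix_bits
    for shift in range(0, bits, radix_bits):
        counts = [0] * buckets
        for k in keys:                                 # histogram
            counts[(k >> shift) & (buckets - 1)] += 1
        offsets = [0] + list(accumulate(counts[:-1]))  # exclusive scan
        out = [0] * len(keys)
        for k in keys:                                 # stable scatter
            d = (k >> shift) & (buckets - 1)
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys

print(radix_sort([51966, 48879, 4660, 257]))   # [257, 4660, 48879, 51966]
```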
The Bird-Meertens Formalism as a Parallel Model
- Software for Parallel Computation, volume 106 of NATO ASI Series F
, 1993
"... The expense of developing and maintaining software is the major obstacle to the routine use of parallel computation. Architecture independent programming offers a way of avoiding the problem, but the requirements for a model of parallel computation that will permit it are demanding. The BirdMeertens ..."
Abstract
-
Cited by 46 (0 self)
- Add to MetaCart
(Show Context)
The expense of developing and maintaining software is the major obstacle to the routine use of parallel computation. Architecture-independent programming offers a way of avoiding the problem, but the requirements for a model of parallel computation that will permit it are demanding. The Bird-Meertens formalism is an approach to developing and executing data-parallel programs; it encourages software development by equational transformation; it can be implemented efficiently across a wide range of architecture families; and it can be equipped with a realistic cost calculus, so that trade-offs in software design can be explored before implementation. It makes an ideal model of parallel computation.
Keywords: general-purpose parallel computing, models of parallel computation, architecture-independent programming, categorical data types, program transformation, code generation.
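The property that makes such programs portable across architectures can be shown in one line: a reduction with an associative operator splits at any point, computes independently on each piece, and recombines, so a compiler may distribute the list however the target machine prefers. The two-way split below is an invented stand-in for distributing across processors.

```python
# Illustrative homomorphism property: a reduce over an associative
# operator equals the combination of reduces over the pieces.
from functools import reduce
import operator

xs = list(range(1, 101))
mid = len(xs) // 2

whole = reduce(operator.add, xs)
left = reduce(operator.add, xs[:mid])     # "processor 0"
right = reduce(operator.add, xs[mid:])    # "processor 1"
assert whole == left + right == 5050
```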