Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors
, 1999
Cited by 142 (4 self)
Devices]: Modes of Computation---Parallelism and concurrency
General Terms: Algorithms, Design, Performance, Theory
Additional Key Words and Phrases: Automatic parallelization, DAG, multiprocessors, parallel processing, software tools, static scheduling, task graphs
This research was supported by the Hong Kong Research Grants Council under contract numbers HKUST 734/96E, HKUST 6076/97E, and HKU 7124/99E.
Authors' addresses: Y.-K. Kwok, Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong; email: ykwok@eee.hku.hk; I. Ahmad, Department of Computer Science, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong.
ACM Computing Surveys, Vol. 31, No. 4, December 1999.
Efficient Scheduling of Arbitrary Task Graphs to Multiprocessors using A Parallel Genetic Algorithm
- JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
, 1997
Cited by 27 (5 self)
Given a parallel program represented by a task graph, the objective of a scheduling algorithm is to minimize the overall execution time of the program by properly assigning the nodes of the graph to the processors. This multiprocessor scheduling problem is NP-complete even with simplifying assumptions, and becomes more complex under relaxed assumptions such as arbitrary precedence constraints, and arbitrary task execution and communication times. The present literature on this topic is a large repertoire of heuristics that produce good solutions in a reasonable amount of time. These heuristics, however, have restricted applicability in a practical environment because they have a number of fundamental problems including high time complexity, lack of scalability, and no performance guarantee with respect to optimal solutions. Recently, genetic algorithms (GAs) have been widely reckoned as a useful vehicle for obtaining high quality or even optimal solutions for a broad range of combinato...
Partitioning and Scheduling Using Graph Decomposition
- In Twenty-eighth annual ACM symposium on theory of computing
, 1993
Cited by 5 (4 self)
Automated parallelization of source code is a goal on which many researchers in parallel computing have focused. The increasing availability of parallel computers, the difficulty of creating good parallel programs, and the vast amount of existing serial source code all contribute to the need for automated means of parallelization. This paper centers on the issues of partitioning and scheduling within automatic parallelization, or the creation of appropriately sized tasks and their assignments to processors. An algorithm is introduced which uses the program dependence graph (PDG) representation of serial programs, and relies on a prior graph decomposition, or parse, for identification of parallelism. The algorithm uses local heuristics to determine the cost effectiveness of each opportunity for parallelization, and creates and schedules tasks accordingly.
Keywords: partitioning, scheduling, automatic parallelization, graph decomposition, program dependence graph
* This research is s...
On task scheduling accuracy: Evaluation methodology and results
- Journal of Supercomputing
, 2004
Cited by 4 (2 self)
Many heuristics based on the directed acyclic graph (DAG) have been proposed for the static scheduling problem. Most of these algorithms apply a simple model of the target system that assumes fully connected processors, a dedicated communication sub-system and no contention for the communication resources. Only a few algorithms consider the network topology and the contention for the communication resources. This article evaluates the accuracy of task scheduling algorithms and thus the appropriateness of the applied models. An evaluation methodology is proposed and applied to a representative set of scheduling algorithms. The obtained results show a significant inaccuracy of the produced schedules. Analyzing these results is important for the development of more appropriate models and more accurate scheduling algorithms.
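The simple target-system model that this abstract describes (fully connected processors, a dedicated communication sub-system, no contention) can be illustrated with a minimal sketch. This is not the article's evaluation methodology or any of the algorithms it evaluates; it is a generic greedy list scheduler written only to show how the model is usually applied. The function name `list_schedule`, the input layout, and the earliest-start placement rule are all assumptions made for illustration.

```python
from collections import defaultdict


def list_schedule(weights, edges, num_procs):
    """Greedy list scheduling under the contention-free model (illustrative).

    weights: {task: computation cost}
    edges:   {(pred, succ): communication cost}, paid only when pred and
             succ run on different processors (assumed model).
    Returns ({task: (processor, start, finish)}, makespan).
    """
    preds = defaultdict(list)
    succs = defaultdict(list)
    for (u, v), cost in edges.items():
        preds[v].append((u, cost))
        succs[u].append(v)

    # Topological order via Kahn's algorithm.
    indeg = {t: len(preds[t]) for t in weights}
    ready = sorted(t for t in weights if indeg[t] == 0)
    order = []
    while ready:
        t = ready.pop(0)
        order.append(t)
        for s in succs[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)

    proc_free = [0] * num_procs  # time at which each processor becomes idle
    placement = {}
    for t in order:
        best = None
        for p in range(num_procs):
            # Data from a predecessor on the same processor arrives at no
            # cost; otherwise the edge's communication cost is added.
            data_ready = max(
                (placement[u][2] + (0 if placement[u][0] == p else cost)
                 for u, cost in preds[t]),
                default=0,
            )
            start = max(proc_free[p], data_ready)
            if best is None or start < best[1]:
                best = (p, start)
        p, start = best
        finish = start + weights[t]
        placement[t] = (p, start, finish)
        proc_free[p] = finish

    makespan = max((f for _, _, f in placement.values()), default=0)
    return placement, makespan


if __name__ == "__main__":
    # Hypothetical four-task diamond graph with unit communication costs.
    weights = {"a": 2, "b": 3, "c": 3, "d": 1}
    edges = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 1, ("c", "d"): 1}
    print(list_schedule(weights, edges, 2))
```

Because communication is charged as a fixed per-edge delay regardless of how many messages are in flight, a schedule produced this way can look better than it would on a real network with contention, which is the kind of inaccuracy the article sets out to measure.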
High-throughput Bayesian computing machine with reconfigurable hardware
- in FPGA ’10: Proceedings of the 18th annual ACM/SIGDA international
Cited by 4 (4 self)
We use reconfigurable hardware to construct a high throughput Bayesian computing machine (BCM) capable of evaluating probabilistic networks with arbitrary DAG (directed acyclic graph) topology. Our BCM achieves high throughput by exploiting the FPGA’s distributed memories and abundant hardware structures (such as long carry-chains and registers), which enables us to 1) develop an innovative memory allocation scheme based on a maximal matching algorithm that completely avoids memory stalls, 2) optimize and deeply pipeline the logic design of each processing node, and 3) schedule them optimally. The BCM architecture not only can be applied to many important algorithms in artificial intelligence, signal processing, and digital communications, but also has high reusability, i.e., a new application need not change a BCM’s hardware design; only new task graph processing and code compilation are necessary. Moreover, the throughput of a BCM scales almost linearly with the size of the FPGA on which it is implemented. A Bayesian computing machine with 16 processing nodes was implemented with a Virtex-5 FPGA (XCV5LX155T-2) on a BEE3 (Berkeley Emulation Engine) platform. For a wide variety of sample Bayesian problems, compared with running the same network evaluation algorithm on a 2.4 GHz Intel Core 2 Duo processor and on a GeForce 9400m using the CUDA software package, the BCM demonstrates 80x and 15x speedups, respectively, with a peak throughput of 20.4
A modular genetic algorithm for scheduling task graphs
, 2003
Cited by 3 (0 self)
Several genetic algorithms have been designed for the problem of scheduling task graphs onto multiprocessors, the primary distinction among most of them being the chromosomal representation used for a schedule. However, these existing approaches are monolithic: they attempt to scan the entire solution space without considering techniques that can reduce the complexity of the optimization. In this paper, a genetic algorithm based on a bi-chromosomal representation and capable of being incorporated into a cluster/merging optimization framework is proposed, and it is experimentally shown to outperform a leading genetic algorithm for scheduling.
Runtime Data Flow Scheduling of Matrix Computations
, 2009
Cited by 3 (3 self)
We investigate the scheduling of matrix computations expressed as directed acyclic graphs for shared-memory parallelism. Because of the data granularity in this problem domain, even slight variations in load balance or data locality can greatly affect performance. Well-known scheduling algorithms such as work stealing have proven time and space bounds, but these bounds do not provide a discernible indicator of performance between different scheduling algorithms and heuristics. We provide a flexible framework for scheduling matrix computations, which we use to empirically quantify different scheduling algorithms. By building software solutions based on hardware techniques through leveraging a cache coherence protocol, we develop a scheduling algorithm that addresses both load balance and data locality simultaneously and show its performance benefits.
Static Scheduling of instructions on Micronet-based Asynchronous Processors
- In Proc. 2nd Int. Symp. on Advanced Research on Asynchronous Circuits and Systems ASYNC'96
, 1996
Cited by 2 (0 self)
This paper investigates issues which impinge on the design of static instruction schedulers for micronet-based asynchronous processor (MAP) architectures. The micronet model exposes both temporal and spatial concurrency within a processor. A list scheduling algorithm is described which has been optimised with MAP-specific heuristics. Their performance on some program graphs is presented and conclusions are drawn on the suitability of MAPs as targets for ILP compilers.
Energy-Efficient Embedded Software Implementation on Multiprocessor System-on-Chip with Multiple Voltages
Cited by 2 (0 self)
This paper develops energy-driven completion ratio guaranteed scheduling techniques for the implementation of embedded software on multiprocessor systems with multiple supply voltages. We leverage the application’s performance requirements, uncertainties in execution time, and tolerance for reasonable execution failures to scale each processor’s supply voltage at run-time to reduce the multiprocessor system’s total energy consumption. Specifically, we study how to trade the difference between the system’s highest achievable completion ratio Qmax and the required completion ratio Q0 for energy saving. First, we propose a best-effort energy minimization algorithm (BEEM1) that achieves Qmax with the provably minimum energy consumption. We then relax its unrealistic assumption on the application’s real execution time and develop algorithm BEEM2 that only requires the application’s best- and worst-case execution times. Finally, we propose a hybrid offline/online completion ratio guaranteed energy minimization algorithm (QGEM) that provides the required Q0 with further energy reduction based on the probabilistic distribution of the application’s execution time. We implement the proposed algorithms and verify their energy efficiency on real-life DSP applications and the TGFF random benchmark suite. BEEM1, BEEM2, and QGEM all provide the required completion ratio with average energy reductions of 28.7%, 26.4%, and 35.8%, respectively.
A Comparison of Clustering and Scheduling Techniques for Embedded Multiprocessor Systems
, 2003
Cited by 2 (1 self)
In this paper we extensively explore and illustrate the effectiveness of the two-phase decomposition of scheduling --- into clustering and cluster-scheduling or merging --- and mapping task graphs onto embedded multiprocessor systems. We describe efficient and novel partitioning (clustering) and scheduling techniques that aggressively streamline interprocessor communication and can be tuned to exploit the significantly longer compilation time that is available to embedded system designers.

