A compiler controlled Threaded Abstract Machine
- Journal of Parallel and Distributed Computing
, 1993
Comparison of Two Storage Models in Data-Driven Multithreaded Architectures
- In Eighth IEEE Symposium on Parallel and Distributed Processing (SPDP)
, 1996
Cited by 11 (2 self)
Multithreaded execution models attempt to combine some aspects of dataflow-like execution with von Neumann model execution, with the objective of masking the latency of inter-processor communications and remote memory accesses in multiprocessors. An important issue in the analysis and evaluation of multithreaded execution is the design and performance of the storage hierarchy. Because of the sequential execution of threads, the locality of access within an executing thread can be exploited using registers and cache. At the inter-thread level, however, the locality of accesses to memory and its effect on the cache are not yet well understood. Two storage hierarchy models that attempt to capture and exploit this locality are described and evaluated in this paper.
1 Introduction
The foremost benefits of multithreading are the increased processor utilization realized by dynamically switching among ready threads at run-time and thereby its ability to mask memory and communication latenci...
Code Generations, Evaluations, and Optimizations in Multithreaded Executions
, 1995
Cited by 10 (0 self)
Efficient large-scale parallel processing can result only from proper handling of latency. Latency arises either from remote memory accesses or synchronizations. Multithreading is an execution model that can effectively deal with latency by switching among a set of ready threads. This model has been proposed in a variety of forms: a unit of storage can be based on either a collection of threads or a single thread, threads can be either blocking or non-blocking, and synchronization can be either implicit or explicit. This dissertation describes research in the evaluation and optimization of various issues in multithreading. Issues of particular interest are the development of a multithreaded execution model to be used as a test-bed and a hybrid code generation scheme where threads are generated in a top-down manner and then optimized in a bottom-up fashion. Various forms of locality are also ide...
A Performance Analysis of Transposition-Table-Driven Scheduling in Distributed Search
- IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 2002
Cited by 10 (6 self)
This paper discusses a new work-scheduling algorithm for parallel search of single-agent state spaces, called Transposition-Table-Driven Work Scheduling, that places the transposition table at the heart of the parallel work scheduling. The scheme results in less synchronization overhead, less processor idle time, and less redundant search effort. Measurements on a 128-processor parallel machine show that the scheme achieves close-to-linear speedups; for large problems the speedups are even superlinear due to better memory usage. On the same machine, the algorithm is 1.6 to 12.9 times faster than traditional work-stealing-based schemes.
Transposition Table Driven Work Scheduling in Distributed Search
- IN 16TH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI'99)
, 1999
Cited by 8 (0 self)
This paper introduces a new scheduling algorithm for parallel single-agent search, transposition table driven work scheduling, that places the transposition table at the heart of the parallel work scheduling. The scheme results in less synchronization overhead, less processor idle time, and less redundant search effort. Measurements on a 128-processor parallel machine show that the scheme achieves nearly optimal performance and scales well. The algorithm performs 2.0 to 13.7 times better than traditional work-stealing-based schemes.
Lazy Threads: Compiler and Runtime Structures for Fine-Grained Parallel Programming
, 1997
Cited by 7 (0 self)
Many modern parallel languages support dynamic creation of threads or require multithreading in their implementations. The threads describe the logical parallelism in the program. For ease of expression and better resource utilization, the logical parallelism in a program often exceeds the physical parallelism of the machine and leads to applications with many fine-grained threads. In practice, however, most logical threads need not be independent threads. Instead, they could be run as sequential calls, which are inherently cheaper than independent threads. The challenge is that one cannot generally predict which logical threads can be implemented as sequential calls. In lazy multithreading systems each logical thread begins execution sequentially (with the attendant effic...
Hardware-modulated parallelism in chip multiprocessors
- SIGARCH Comput. Archit. News
, 2005
Cited by 6 (2 self)
Chip multi-processors (CMPs) already have widespread commercial availability, and technology roadmaps project enough on-chip transistors to replicate tens or hundreds of current processor cores. How will we express parallelism, partition applications, and schedule/place/migrate threads on these highly parallel CMPs? This paper presents and evaluates a new approach to highly parallel CMPs, advocating a new hardware-software contract. The software layer is encouraged to expose large amounts of multi-granular, heterogeneous parallelism. The hardware, meanwhile, is designed to offer low-overhead, low-area support for orchestrating and modulating this parallelism on CMPs at runtime. Specifically, our proposed CMP architecture consists of architectural and ISA support targeting thread creation, scheduling
Analysis of Communications and Overhead Reduction in Multithreaded Execution
- In Proc. Int. Conf. on Parallel Architectures and Compilation Techniques
, 1995
Cited by 5 (2 self)
In a multithreaded execution, each thread can be thought of as running on its own virtual processor, with several virtual processors multiplexed onto a single physical processor. At any given time, some of these virtual processors are either sending or waiting for messages. When the degree of multithreading is high, there is a high potential load on the interconnection network. It is important to understand the aggregate behavior of these messages, both for the design of the memory hierarchy and network structure and to understand more concretely the behavior of multithreaded execution. In this paper, we study several issues related to communication patterns in a non-blocking, framelet-based multithreaded model. These issues include the sources of message generation and the locality of these messages. The results indicate that roughly a third of all tokens are memory-related, another third are involved in parallelism management, and the final third are involved in intra-function and intra-loop ...
An Evaluation of Medium-Grain Dataflow Code
- Int. J. of Parallel Programming
, 1994
Cited by 2 (1 self)
In this paper, we study several issues related to the medium grain dataflow model of execution. We present bottom-up compilation of medium grain clusters from a fine grain dataflow graph. We compare the basic block and the dependence sets algorithms that partition dataflow graphs into clusters. For an extensive set of benchmarks we assess the average number of instructions in a cluster and the reduction in matching operations compared with fine grain dataflow execution. We study the performance of medium grain dataflow when several architectural parameters, such as the number of processors, matching cost, and network latency, are varied. The results indicate that medium grain execution offers a good speedup over the fine grain model, that it is scalable, and tolerates network latency and high matching costs well. Medium grain execution can benefit from a higher output bandwidth of a processor and finally, a simple superscalar processor with an issue rate of two is sufficient to exploit...

