Results 1–7 of 7
A Memory Access Model for Highly-Threaded Many-Core Architectures
 ICPADS, 2012
Abstract
Cited by 10 (1 self)
Many-core architectures are excellent at hiding memory-access latency by low-overhead context switching among a large number of threads. The speedup of algorithms carried out on these machines depends on how well the latency is hidden. If the number of threads were infinite, these machines should theoretically provide the performance predicted by the PRAM analysis of the programs. However, the number of allowable threads per processor is not infinite. In this paper, we introduce the Threaded Many-core Memory (TMM) model, which is meant to capture the important characteristics of these highly-threaded, many-core machines. Since we model some important machine parameters of these machines, we expect analysis under this model to give more fine-grained performance predictions than the PRAM analysis. We analyze four algorithms for the classic all-pairs shortest paths problem under this model. We find that even when two algorithms have the same PRAM performance, our model predicts different performance for some settings of machine parameters. For example, for dense graphs, the Floyd-Warshall algorithm and Johnson's algorithm have the same performance in the PRAM model. However, our model predicts different performance for large enough memory-access latency and validates the intuition that the Floyd-Warshall algorithm performs better on these machines.
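The abstract above contrasts the Floyd-Warshall algorithm with Johnson's algorithm for all-pairs shortest paths. As a point of reference, here is a minimal sequential sketch of Floyd-Warshall (our own illustration, not code from the paper):

```python
def floyd_warshall(w):
    """All-pairs shortest paths. w is an n x n matrix of edge weights,
    with float("inf") where no edge exists and 0 on the diagonal.
    Returns the matrix of shortest-path distances."""
    n = len(w)
    d = [row[:] for row in w]
    for k in range(n):          # the k rounds are inherently sequential
        for i in range(n):      # but for a fixed k, every (i, j) update
            for j in range(n):  # is independent: one thread per pair on a GPU
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```

The regular, dense access pattern of the two inner loops is the kind of structure the paper's intuition refers to: massive per-round parallelism with predictable memory accesses, which latency-hiding architectures exploit well.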
A many-core machine model for designing algorithms with minimum parallelism overheads
 CoRR
Abstract
Cited by 5 (4 self)
We present a model of multithreaded computation with an emphasis on estimating parallelism overheads of programs written for modern many-core architectures. We establish a Graham-Brent theorem for this model so as to estimate the execution time of programs running on a given number of streaming multiprocessors. We evaluate the benefits of our model with fundamental algorithms from scientific computing. For two case studies, our model is used to minimize parallelism overheads by determining an appropriate value range for a given program parameter. For the others, our model is used to compare different algorithms solving the same problem. In each case, the studied algorithms were implemented, and the results of their experimental comparison are consistent with the theoretical analysis based on our model.
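The Graham-Brent theorem mentioned above refines the classical work/span bound on parallel running time. As a sketch of the underlying idea only (the paper's refinement adds per-kernel parallelism overheads, which are not modeled here):

```python
def brent_bound(work, span, p):
    """Classical Brent-type upper bound on parallel time:
    T_p <= work / p + span for p processors, where 'work' is the total
    number of operations and 'span' is the critical-path length.
    This is the baseline estimate that a Graham-Brent-style theorem
    extends with overhead terms for many-core machines."""
    return work / p + span

# Example: 10**6 operations of total work, a critical path of 10**3
# operations, executed on 32 streaming multiprocessors.
estimate = brent_bound(10**6, 10**3, 32)
```

The names `brent_bound`, `work`, `span`, and `p` are our own illustrative choices, not notation taken from the paper.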
Analysis of Classic Algorithms on GPUs
Abstract
Cited by 3 (0 self)
Abstract—The recently developed Threaded Many-core Memory (TMM) model provides a framework for analyzing algorithms for highly-threaded many-core machines such as GPUs. In particular, it tries to capture the fact that these machines hide memory latencies via the use of a large number of threads and large memory bandwidth. The TMM model analysis contains two components: computational complexity and memory complexity. A model is only useful if it can explain and predict empirical data. In this work, we investigate the effectiveness of the TMM model. We analyze algorithms for five classic problems — suffix tree/array for string matching, fast Fourier transform, merge sort, list ranking, and all-pairs shortest paths — under this model, and compare the results of the analysis with the experimental findings of ours and other researchers who have implemented and measured the performance of these algorithms on a spectrum of diverse GPUs. We find that the TMM model is able to predict important and sometimes previously unexplained trends and artifacts in the experimental data. Keywords—Threaded Many-core Memory (TMM) Model
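Merge sort is one of the five classic algorithms the paper analyzes under the TMM model. A plain sequential sketch for reference (our own illustration; the GPU variants studied in the paper merge many runs in parallel, which is where the model's memory-complexity component comes into play):

```python
def merge_sort(a):
    """Top-down merge sort returning a new sorted list.
    Work is O(n log n); under a TMM-style analysis the interesting
    quantity is also how the merges touch memory, not just the
    operation count."""
    if len(a) <= 1:
        return a[:]
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):   # merge the two sorted runs
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]         # append the leftover tail
```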
Hardware Acceleration Technologies in Computer Algebra: Challenges and Impact (Thesis format: Monograph)
Abstract
Cited by 2 (1 self)
The objective of high performance computing (HPC) is to ensure that the computational power of hardware resources is well utilized to solve a problem. Various techniques are usually employed to achieve this goal. Improving algorithms to reduce the number of arithmetic operations, modifying data accesses or rearranging data in order to reduce memory traffic, optimizing code at all levels, and designing parallel algorithms with smaller span or reduced overhead are some of the areas that HPC researchers work on. In this thesis, we investigate HPC techniques for the implementation of basic routines in computer algebra targeting hardware acceleration technologies. We start with a sorting algorithm and its application to sparse matrix-vector multiplication, for which we focus on cache complexity issues. Since basic routines in computer algebra often provide a lot of fine-grain parallelism, we then turn our attention to many-core architectures, on which we consider dense polynomial and matrix operations ranging from plain to fast arithmetic. Most of these operations are combined within a bivariate system solver running entirely on a graphics processing unit (GPU).
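The sparse matrix-vector multiplication mentioned in the abstract is a standard kernel whose cache behaviour is dominated by its access pattern. A minimal sketch using the common CSR (compressed sparse row) layout (our own illustration, not the thesis's code):

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product y = A @ x, with A stored in CSR form:
    values[k] is the k-th nonzero, col_idx[k] its column, and
    row_ptr[i]..row_ptr[i+1] delimits row i's nonzeros.
    values and col_idx are streamed sequentially (cache-friendly),
    while x is accessed irregularly through col_idx — the source of
    the cache complexity issues the thesis studies."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y
```

Example: the 2x3 matrix [[1, 0, 2], [0, 3, 0]] is stored as `values=[1.0, 2.0, 3.0]`, `col_idx=[0, 2, 1]`, `row_ptr=[0, 2, 3]`.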
A Many-Core Machine Model for Designing Algorithms with Minimum Parallelism Overheads
, 2015
Abstract
Abstract. We present a model of multithreaded computation with an emphasis on estimating parallelism overheads of programs written for modern many-core architectures. We establish a Graham-Brent theorem so as to estimate the execution time of programs running on a given number of streaming multiprocessors. We evaluate the benefits of our model with fundamental algorithms from scientific computing. For two case studies, our model is used to minimize parallelism overheads by determining an appropriate value range for a given program parameter. For the others, our model is used to compare different algorithms solving the same problem. In each case, the studied algorithms were implemented, and the results of their experimental comparison are consistent with the theoretical analysis based on our model.
Graphics Processing Unit Bloom Filters: Classical and Probabilistic
, 2014
Abstract
This thesis is brought to you for free and open access by the Graduate School at Trace: Tennessee Research and Creative Exchange. It has been ...