An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness
, 2009
Cited by 134 (5 self)
GPU architectures are increasingly important in the multicore era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploiting the design space exhaustively without fully understanding the performance characteristics of their applications. To provide insights into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (we call this the memory warp parallelism) by considering the number of running threads and memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests, thereby estimating the overall execution time of a program. Comparisons between the outcome of the model and the actual execution time on several GPUs show that the geometric mean of absolute error of our model on microbenchmarks is 5.4% and on GPU computing applications is 13.3%. All the applications are written in the CUDA programming language.
Sniper: Exploring the level of abstraction for scalable and accurate parallel multicore simulations
 in International Conference for High Performance Computing, Networking, Storage and Analysis (SC
, 2011
Cited by 42 (9 self)
Two major trends in high-performance computing, namely, larger numbers of cores and the growing size of on-chip cache memory, are creating significant challenges for evaluating the design space of future processor architectures. Fast and scalable simulations are therefore needed to allow for sufficient exploration of large multicore systems within a limited simulation time budget. By bringing together accurate high-abstraction analytical models with fast parallel simulation, architects can trade off accuracy with simulation speed to allow for longer application runs, covering a larger portion of the hardware design space. Interval simulation provides this balance between detailed cycle-accurate simulation and one-IPC simulation, allowing long-running simulations to be modeled much faster than with detailed cycle-accurate simulation, while still providing the detail necessary to observe core-uncore interactions across the entire system. Validations against real hardware show average absolute errors within 25% for a variety of multithreaded workloads; more than twice as accurate on average as one-IPC simulation. Further, we demonstrate scalable simulation speed of up to 2.0 MIPS when simulating a 16-core system on an 8-core SMP machine.
Control flow modeling in statistical simulation for accurate and efficient processor design studies
 In ISCA
, 2004
Cited by 40 (15 self)
Designing a new microprocessor is extremely time-consuming. One of the contributing reasons is that computer designers rely heavily on detailed architectural simulations, which are very time-consuming. Recent work has focused on statistical simulation to address this issue. The basic idea of statistical simulation is to measure characteristics during program execution, generate a synthetic trace with those characteristics and then simulate the synthetic trace. The statistically generated synthetic trace is orders of magnitude smaller than the original program sequence and hence results in significantly faster simulation. This paper makes the following contributions to the statistical simulation methodology. First, we propose the use of a statistical flow graph to characterize the control flow of a program execution. Second, we model delayed update of branch predictors while profiling program execution characteristics. Experimental results show that statistical simulation using this improved control flow modeling attains significantly better accuracy than the previously proposed HLS system. We evaluate both the absolute and the relative accuracy of our approach for power/performance modeling of superscalar microarchitectures. The results show that our statistical simulation framework can be used to efficiently explore processor design spaces.
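As a rough illustration of the statistical-simulation idea in this abstract (profile an execution, then generate a shorter synthetic trace with the same characteristics), the sketch below records transition counts between basic blocks and samples a synthetic trace from them. The first-order transition graph, the block labels, and the helper names `build_flow_graph` and `synthesize_trace` are assumptions made for illustration; they are not the paper's statistical flow graph or its implementation.

```python
import random
from collections import Counter, defaultdict


def build_flow_graph(trace):
    """Count transitions between consecutive basic blocks in a profiled trace."""
    graph = defaultdict(Counter)
    for current, following in zip(trace, trace[1:]):
        graph[current][following] += 1
    return graph


def synthesize_trace(graph, start, length, seed=0):
    """Random-walk the graph, picking successors in proportion to their counts."""
    rng = random.Random(seed)
    synthetic = [start]
    while len(synthetic) < length:
        successors = graph.get(synthetic[-1])
        if not successors:  # block had no observed successor; stop early
            break
        blocks = list(successors)
        weights = [successors[b] for b in blocks]
        synthetic.append(rng.choices(blocks, weights=weights)[0])
    return synthetic


# Hypothetical profiled trace of basic-block labels (illustrative only).
profiled = ["A", "B", "A", "B", "A", "C", "A", "B", "A", "C"]
graph = build_flow_graph(profiled)
synthetic = synthesize_trace(graph, start="A", length=5)
```

Every transition in the synthetic trace was observed in the profiled trace, and frequent transitions are sampled more often, which is the property a synthetic trace needs in order to stand in for the original at much smaller size.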
Rapid development of a flexible validated processor model
 In Proceedings of the 2005 Workshop on Modeling, Benchmarking, and Simulation
, 2005
Cited by 24 (13 self)
For a variety of reasons, most architectural evaluations use simulation models. An accurate baseline model validated against existing hardware provides confidence in the results of these evaluations. Meanwhile, a meaningful exploration of the design space requires a wide range of quickly obtainable variations of the baseline. Unfortunately, these two goals are generally considered to be at odds; the set of validated models is considered exclusive of the set of easily malleable models. Vachharajani et al. challenge this belief and propose a modeling methodology they claim allows rapid construction of flexible validated models. Unfortunately, they only present anecdotal and secondary evidence to support their claims. In this paper, we present our experience using this methodology to construct a validated flexible model of Intel’s Itanium 2 processor. Our practical experience lends support to the above claims. Our initial model was constructed by a single researcher in only 11 weeks and predicts processor cycles-per-instruction (CPI) to within 7.9% on average for the entire SPEC CINT2000 benchmark suite. Our experience with this model showed us that aggregate accuracy for a metric like CPI is not sufficient. Aggregate measures like CPI may conceal remaining internal “offsetting errors” which can adversely affect conclusions drawn from the model. Using this as our motivation, we explore the flexibility of the model by modifying it to target specific error constituents, such as front-end stall errors. In 2½ person-weeks, average CPI error was reduced to 5.4%. The targeted error constituents were reduced more dramatically; front-end stall errors were reduced from 5.6% to 1.6%. The swift implementation of significant new architectural features on this model further demonstrated its flexibility.
An analytical model of the working-set sizes in decision-support systems
 In SIGMETRICS
, 2000
Cited by 15 (0 self)
This paper presents an analytical model to study how working sets scale with database size and other application parameters in decision-support systems (DSS). The model uses application parameters that are measured on down-scaled database executions to predict cache miss ratios for executions of large databases. By applying the model to two database engines and typical DSS queries, we find that, even for large databases, the most performance-critical working set is small and is caused by the instructions and private data that are required to access a single tuple. Consequently, its size is not affected by the database size. Surprisingly, database data may also exhibit temporal locality, but the size of its working set critically depends on the structure of the query, the method of scanning, and the size and the content of the database.
Interval simulation: Raising the level of abstraction in architectural simulation
 In High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on
, 2010
Cited by 14 (4 self)
Detailed architectural simulators suffer from a long development cycle and extremely long evaluation times. This longstanding problem is further exacerbated in the multicore processor era. Existing solutions address the simulation problem by either sampling the simulated instruction stream or by mapping the simulation models on FPGAs; these approaches achieve substantial simulation speedups while simulating performance in a cycle-accurate manner. This paper proposes interval simulation which takes a completely different approach: interval simulation raises the level of abstraction and replaces the core-level cycle-accurate simulation model by a mechanistic analytical model. The analytical model estimates core-level performance by analyzing intervals, or the timing between two miss events (branch mispredictions and TLB/cache misses); the miss events are determined through simulation of the memory hierarchy, cache coherence protocol, interconnection network and branch predictor. By raising the level of abstraction, interval simulation reduces both development time and evaluation time. Our experimental results using the SPEC CPU2000 and PARSEC benchmark suites and the M5 multicore simulator show good accuracy up to eight cores (average error of 4.6% and max error of 11% for the multithreaded full-system workloads), while achieving a one-order-of-magnitude simulation speedup compared to cycle-accurate simulation. Moreover, interval simulation is easy to implement: our implementation of the mechanistic analytical model incurs only one thousand lines of code. Its high accuracy, fast simulation speed and ease-of-use make interval simulation a useful complement to the architect’s toolbox for exploring system-level and high-level microarchitecture tradeoffs.
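To make the notion of an interval concrete, here is a minimal sketch that estimates execution time by summing, for each interval, the cycles needed to dispatch its instructions plus a penalty for the miss event that ends it. The fixed per-event penalties, the steady dispatch rate, and the function name `estimate_cycles` are simplifying assumptions for illustration; the abstract does not specify the model at this level, and it states that miss events come from simulating the memory hierarchy and branch predictor.

```python
import math


def estimate_cycles(intervals, dispatch_width, penalties):
    """Sum per-interval dispatch time plus the penalty of the miss event ending it.

    intervals: list of (instruction_count, miss_event) pairs, where miss_event
               is a key in `penalties`, or None for the final interval.
    """
    total = 0
    for instruction_count, miss_event in intervals:
        # Assumption: instructions between miss events dispatch at full width.
        total += math.ceil(instruction_count / dispatch_width)
        if miss_event is not None:
            total += penalties[miss_event]
    return total


# Hypothetical penalties in cycles (illustrative values, not from the paper).
penalties = {"branch_mispredict": 15, "l2_miss": 200}
intervals = [(400, "branch_mispredict"), (120, "l2_miss"), (80, None)]
cycles = estimate_cycles(intervals, dispatch_width=4, penalties=penalties)
print(cycles)  # 365
```

The point of the sketch is the shape of the computation: no per-cycle pipeline state is tracked, only interval lengths and miss-event costs, which is why such a model can be both small and fast.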
Microarchitecture Modeling for Design-Space Exploration
, 2004
Cited by 10 (2 self)
To identify the best processor designs, designers explore a vast design space. To assess the quality of candidate designs, designers construct and use simulators. Unfortunately, simulator construction is a bottleneck in this design-space exploration because existing simulator construction methodologies lead to long simulator development times. This bottleneck limits exploration to a small set of designs, potentially diminishing the quality of the final design.
Performance Prediction for Random Write Reductions: A Case Study in Modeling Shared Memory Programs
 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 117-128, 2002
, 2002
Cited by 9 (1 self)
In this paper, we revisit the problem of performance prediction on shared memory parallel machines, motivated by the need for selecting a parallelization strategy for random write reductions. Such reductions frequently arise in data mining algorithms.
AMVA Techniques for High Service Time Variability
 In Proc. ACM SIGMETRICS
, 2000
Cited by 9 (1 self)
Motivated by experience gained during the validation of a recent Approximate Mean Value Analysis (AMVA) model of modern shared memory architectures, this paper reexamines the "standard" AMVA approximation for non-exponential FCFS queues. We find that this approximation is often inaccurate for FCFS queues with high service time variability. For such queues, we propose and evaluate: (1) AMVA estimates of the mean residual service time at an arrival instant that are much more accurate than the standard AMVA estimate, (2) a new AMVA technique that provides a much more accurate estimate of mean center residence time than the standard AMVA estimate, and (3) a new AMVA technique for computing the mean residence time at a "downstream" queue which has a more bursty arrival process than is assumed in the standard AMVA equations. Together, these new techniques increase the range of applications to which AMVA may be fruitfully applied, so that, for example, the memory system architecture of shared memory systems with complex modern processors can be analyzed with these computationally efficient methods.
Beyond Amdahl’s law: An objective function that links multiprocessor performance gains to delay and energy
 42nd Hawaii International Conference
, 2009
Cited by 7 (0 self)
Beginning with Amdahl’s law, we derive a general objective function that links parallel processing performance gains at the system level to energy and delay in the subsystem microarchitecture structures. The objective function employs parameterized models of computation and communication to represent the characteristics of processors, memories, and communications networks. The interaction of the latter microarchitectural elements defines global system performance in terms of energy-delay cost. Following the derivation, we demonstrate its utility by applying it to the problem of Chip MultiProcessor (CMP) architecture exploration. Given a set of application and architectural parameters, we solve for the optimal CMP architecture for six different architectural optimization examples. We find the parameters that minimize the total system cost, defined by the objective function under the area constraint of a single die. The analytical formulation presented in this paper is general and offers the foundation for the quantitative and rapid evaluation of computer architectures under different constraints including that of single die area.
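For readers who want the starting point of that derivation in concrete form, the sketch below computes the classic Amdahl's-law speedup, 1 / ((1 - p) + p / n), for a parallelizable fraction p on n processors. It covers only this well-known baseline; the paper's energy-delay objective function is not reproduced here, and the function name `amdahl_speedup` is an illustrative choice.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Classic Amdahl's-law speedup for a workload on `processors` processors.

    parallel_fraction: share of the single-processor run time that can be
                       parallelized, between 0.0 and 1.0 inclusive.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# The serial fraction bounds the gain: with 90% parallel work, speedup
# approaches but never reaches 10 as processors are added.
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

The diminishing returns visible in the loop are what motivate extending the law with cost terms such as energy and delay when choosing how many cores to place on a die.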