Accurately approximating superscalar processor performance from traces
- Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009
Cited by 3 (2 self)
Trace-driven simulation of superscalar processors is particularly complicated. The dynamic nature of superscalar processors combined with the static nature of traces can lead to large inaccuracies in the results, especially when traces contain only a subset of executed instructions for trace reduction. The main problem in filtered-trace simulation is that the trace does not contain enough information to predict the actual penalty of a cache miss. In this paper, we discuss and evaluate three strategies to quantify the impact of a long-latency memory access in a superscalar processor when traces contain only L1 cache misses. The strategies are based on models of how a cache miss is treated with respect to other cache misses: (1) isolated cache miss model, (2) independent cache miss model, and (3) pairwise dependent cache miss model. Our experimental results demonstrate that the pairwise dependent cache miss model produces reasonably accurate results (4.8% RMS error) under perfect branch prediction. Our work forms a basis for fast, accurate, and configurable multicore processor simulation using a pre-determined processor core design.
Memory Hierarchy Performance Prediction for Blocked Sparse Algorithms
- Parallel Processing Letters, 1999
Cited by 2 (0 self)
Nowadays the performance gap between processors and main memory makes efficient use of the memory hierarchy necessary for good program performance. Several techniques have been proposed for this purpose. Nevertheless, most of them consider only regular access patterns, while many scientific and numerical applications give rise to irregular patterns. A typical case is that of indirect accesses due to the use of compressed storage formats for sparse matrices. This paper describes an analytic approach to model both regular and irregular access patterns. The application modeled is an optimized sparse matrix-dense matrix product algorithm with several levels of blocking. Our model can be directly applied to any memory hierarchy consisting of K-way associative caches. Results are shown for several current microprocessor architectures. Keywords: Sparse matrix, irregular computation, cache performance, memory hierarchy, probabilistic analytical model.
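To make the "indirect accesses" mentioned in this abstract concrete, here is a minimal sketch of a sparse matrix-dense matrix product over compressed sparse row (CSR) storage. It is an unblocked illustration only, not the optimized, multi-level blocked algorithm the paper models; the function name `csr_times_dense` and the sample data are assumptions made for this example.

```python
def csr_times_dense(values, col_idx, row_ptr, dense):
    """Multiply a CSR sparse matrix by a dense matrix (a list of rows).

    Illustrative and unblocked: the paper's algorithm adds several
    levels of blocking that are not reproduced here.
    """
    n_rows = len(row_ptr) - 1
    n_cols = len(dense[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Nonzeros of sparse row i live in values[row_ptr[i]:row_ptr[i + 1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]      # indirect access: the dense row read depends on stored data
            a = values[k]
            row = dense[j]
            for c in range(n_cols):
                out[i][c] += a * row[c]
    return out


# Hypothetical 2x3 sparse matrix [[1, 0, 2], [0, 3, 0]] in CSR form.
values = [1.0, 2.0, 3.0]
col_idx = [0, 2, 1]
row_ptr = [0, 2, 3]
dense = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = csr_times_dense(values, col_idx, row_ptr, dense)
```

The rows of `dense` are visited in the order given by `col_idx`, which is only known at run time; this is the kind of irregular pattern that models restricted to regular accesses cannot describe.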
Fine-grained application source code profiling for ASIP design
- In DAC ’05: Proceedings of the 42nd Annual Conference on Design Automation, 2005
Cited by 2 (1 self)
Current Application Specific Instruction set Processor (ASIP) design methodologies are mostly based on iterative architecture exploration that uses Architecture Description Languages (ADLs) and retargetable software development tools. However, for improved design efficiency, additional pre-architecture exploration tools are required to help narrow down the huge design space and make coarse-grained Instruction Set Architecture (ISA) decisions before detailed ADL modeling. Extensive application code profiling is the key in such early design stages. Based on a novel code instrumentation technology, we present a microprofiling approach that fills the current gap between source-level and instruction-level profilers and combines their advantages w.r.t. speed and accuracy. We show how the microprofiler is embedded into an advanced ASIP design flow and justify its use in a case study to design an MP3 decoder ASIP.
Communication architecture simulation on the virtual synchronization framework
- In Proc. Int. Workshop Syst., Archit., Model. Simul., 2007
Cited by 1 (1 self)
As multi-processor system-on-chip (MPSoC) has become an effective solution to the ever-increasing design complexity of modern embedded systems, fast and accurate HW/SW cosimulation of such systems becomes more important for exploring the wide design space of communication architectures. Recently we have proposed the trace-driven virtual synchronization technique to boost cosimulation speed while almost preserving accuracy, where simulation of communication architectures is separated from simulation of the processing components. This paper proposes two methods of simulation modeling of communication architectures in the trace-driven virtual synchronization framework: SystemC modeling and C modeling. SystemC modeling gives better extensibility and accuracy but lower performance than C modeling, as confirmed by experimental results. Fast reconfiguration of the communication architecture is available in both methods to enable efficient design space exploration.
In-N-Out: Reproducing Out-of-Order Superscalar Processor Behavior from Reduced In-Order Traces
Cited by 1 (0 self)
Trace-driven simulation is a widely practiced simulation method. However, its use has typically been limited to modeling in-order processors because of accuracy issues. In this work, we propose and explore In-N-Out, a fast approximate simulation method to reproduce the behavior of an out-of-order superscalar processor with a reduced in-order trace. During trace generation, we use a functional cache simulator to capture interesting processor events such as uncore accesses in program order. We also collect key information about the executed program. The prepared in-order trace then drives a novel simulation algorithm that models an out-of-order processor. Our experimental results demonstrate that In-N-Out produces reasonably accurate absolute performance (7% difference on average) and fast simulation speeds (115× on average), compared with detailed execution-driven simulation. Moreover, In-N-Out was shown to preserve a processor’s dynamic uncore access patterns and predict the relative performance change when the processor’s core- or uncore-level parameters are changed. Keywords: Superscalar out-of-order processor, performance modeling, trace-driven simulation.
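The trace-reduction step described in this abstract can be pictured with a small sketch: a functional cache model filters an address stream so that only misses survive, each tagged with its position in program order. The abstract does not give the trace format or the cache organization, so the direct-mapped cache, the `(sequence, address)` record layout, and the name `reduce_trace` are all assumptions made for illustration.

```python
def reduce_trace(addresses, num_lines=4, line_size=64):
    """Keep only cache misses, each tagged with its program-order index.

    Assumed model: a direct-mapped functional cache with `num_lines`
    lines of `line_size` bytes. It tracks contents only, not timing.
    """
    tags = [None] * num_lines           # one resident block number per cache line
    reduced = []
    for seq, addr in enumerate(addresses):
        block = addr // line_size
        index = block % num_lines
        if tags[index] != block:        # miss: record the event and fill the line
            tags[index] = block
            reduced.append((seq, addr))
    return reduced


# Hypothetical address stream; hits are dropped from the reduced trace.
trace = [0, 8, 64, 0, 256, 0]
reduced = reduce_trace(trace)
```

Keeping the sequence index alongside each miss preserves how much in-order work separates consecutive events, which a later timing model would need in some form.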
Northeastern University,
We use a novel virtualization-based approach for computer architecture performance analysis. We present a case study analyzing a hypothetical hybrid main memory, which consists of a first-level DRAM augmented by a 10-100x slower second-level memory. This architecture is motivated by the recent emergence of lower-cost, higher-density, and lower-power alternative memory technologies. To model such a system, we customize a virtual machine monitor (VMM) with delay-simulation and instrumentation code. Benchmarks representing server, technical computing, and desktop productivity workloads are evaluated in virtual machines (VMs). Relative to baseline all-DRAM systems, these workloads experience widely varying performance degradation when run on hybrid main memory systems which have significant amounts of second-level memory.
Scalable Multi-Cache Simulation Using GPUs
Software simulation is the primary tool used for evaluation of processor design. Simulation offers better accuracy than analytical models and is an important evaluation step before actually fabricating a chip. Unfortunately, simulator speeds are slow: a conventional cycle-accurate simulator will be unable to keep up with increasing core counts in modern processor design. Parallel simulation is one method for improving simulation speeds. Two major areas of parallel simulation research are multithreaded simulators and FPGAs as simulation accelerators. Multithreaded simulators can only extract coarse-grained parallelism and must sacrifice accuracy in order to scale well. FPGA-based simulators can extract fine-grained parallelism, but are expensive and difficult to program. We propose using GPUs for architectural simulation, which can take advantage of a high degree of fine-grained parallelism. In addition, they are inexpensive and easier to program compared to FPGAs. To demonstrate our ideas, we implement a trace-driven many-cache simulator using NVIDIA's CUDA toolkit. GPU-accelerated cache simulation displays remarkable scaling with number of simulated caches when compared to serial CPU-only simulation.
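As a point of reference for what a trace-driven many-cache simulation computes, the sketch below is a serial, CPU-only version in Python: one address trace is replayed against several independent set-associative LRU caches. It is not the paper's CUDA implementation; the names `count_misses` and `simulate_many` and the `(num_sets, ways)` configuration tuples are assumptions made for this example.

```python
from collections import OrderedDict


def count_misses(addresses, num_sets, ways, line_size=64):
    """Replay a trace against one set-associative cache with LRU replacement."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // line_size
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)        # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)   # set is full: evict the least recently used block
            lines[block] = True
    return misses


def simulate_many(addresses, configs):
    """Serial baseline: simulate each (num_sets, ways) cache in turn."""
    return {cfg: count_misses(addresses, *cfg) for cfg in configs}


# Hypothetical trace touching three blocks twice, replayed on three caches.
trace = [0, 64, 128, 0, 64, 128]
results = simulate_many(trace, [(1, 2), (1, 4), (2, 1)])
```

Each configuration is simulated without reference to the others, so the loop in `simulate_many` has no cross-iteration dependencies; independent work of this kind is what a parallel implementation can distribute.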

