Results 1 - 10
of
12
Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulations
- in International Conference for High Performance Computing, Networking, Storage and Analysis (SC
, 2011
"... Two major trends in high-performance computing, namely, larger numbers of cores and the growing size of on-chip cache memory, are creating significant challenges for evaluating the design space of future processor architectures. Fast and scalable simulations are therefore needed to allow for suffici ..."
Abstract
-
Cited by 42 (9 self)
- Add to MetaCart
(Show Context)
Two major trends in high-performance computing, namely, larger numbers of cores and the growing size of on-chip cache memory, are creating significant challenges for evaluating the design space of future processor architectures. Fast and scalable simulations are therefore needed to allow for sufficient exploration of large multi-core systems within a limited simulation time budget. By bringing together accurate highabstraction analytical models with fast parallel simulation, architects can trade off accuracy with simulation speed to allow for longer application runs, covering a larger portion of the hardware design space. Interval simulation provides this balance between detailed cycle-accurate simulation and one-IPC simulation, allowing long-running simulations to be modeled much faster than with detailed cycle-accurate simulation, while still providing the detail necessary to observe core-uncore interactions across the entire system. Validations against real hardware show average absolute errors within 25 % for a variety of multi-threaded workloads; more than twice as accurate on average as one-IPC simulation. Further, we demonstrate scalable simulation speed of up to 2.0 MIPS when simulating a 16-core system on an 8-core SMP machine.
An evaluation of high-level mechanistic core models
- ACM Trans. Archit. Code Optim
, 2014
"... Large core counts and complex cache hierarchies are increasing the burden placed on commonly used simulation and modeling techniques. Although analytical models provide fast results, they do not apply to complex, many-core shared-memory systems. In contrast, detailed cycle-level simulation can be ac ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
Large core counts and complex cache hierarchies are increasing the burden placed on commonly used simulation and modeling techniques. Although analytical models provide fast results, they do not apply to complex, many-core shared-memory systems. In contrast, detailed cycle-level simulation can be accurate but also tends to be slow, which limits the number of configurations that can be evaluated. A middle ground is needed that provides for fast simulation of complex many-core processors while still providing accurate results. In this article, we explore, analyze, and compare the accuracy and simulation speed of high-abstraction core models as a potential solution to slow cycle-level simulation. We describe a number of enhancements to interval simulation to improve its accuracy while maintaining simulation speed. In addition, we introduce the instruction-window centric (IW-centric) core model, a new mechanistic core model that bridges the gap between interval simulation and cycle-accurate simulation by enabling high-speed simulations with higher levels of detail. We also show that using accurate core models like these are important for memory subsystem studies, and that simple, naivemodels, like a one-IPC coremodel, can lead tomisleading and incorrect results and conclusions in practical design studies. Validation against real hardware shows good accuracy, with an average single-core error of 11.1 % and a maximum of 18.8 % for the IW-centric model with a 1.5 × slowdown
Fast and Cycle-Accurate Modeling of a Multicore Processor
"... Abstract—An ideal simulator allows an architect to swiftly explore design alternatives and accurately determine their impact on performance. Design exploration requires simulators to be easily modifiable, and accurate performance estimates require detailed models. Unfortunately, detailed modeling no ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
(Show Context)
Abstract—An ideal simulator allows an architect to swiftly explore design alternatives and accurately determine their impact on performance. Design exploration requires simulators to be easily modifiable, and accurate performance estimates require detailed models. Unfortunately, detailed modeling not only impacts the ease with which a simulator can be modified, but also the speed at which it can be executed, resulting in fidelity being traded for simulation speed. Although FPGA-based simulators have dramatically higher speed than software simulators, sacrificing fidelity is still common. In this paper we present Arete, an FPGA-based processor simulator, which offers high performance along with accuracy and modifiability. We begin with a cycle-level specification of a multicore architecture which includes realistic in-order cores and detailed models of shared, coherent memory and on-chip network. We then describe how this specification is implemented faithfully and efficiently on FPGAs. Arete delivers a performance of up to 11 MIPS per core. We run a subset of the PARSEC benchmark suite on top of off-the-shelf SMP Linux, and achieve an average performance of 55 MIPS for an 8-core model. We also describe two significant architectural explorations: one involving three different branch predictors and the other requiring major modifications to the cache-coherence protocol. I.
Heracles: A Tool for Fast RTL-Based Design Space Exploration of Multicore Processors,”
- in Proceedings of the 21st ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA),
, 2013
"... ABSTRACT This paper presents Heracles, an open-source, functional, parameterized, synthesizable multicore system toolkit. Such a multi/many-core design platform is a powerful and versatile research and teaching tool for architectural exploration and hardware-software co-design. The Heracles toolkit ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
ABSTRACT This paper presents Heracles, an open-source, functional, parameterized, synthesizable multicore system toolkit. Such a multi/many-core design platform is a powerful and versatile research and teaching tool for architectural exploration and hardware-software co-design. The Heracles toolkit comprises the soft hardware (HDL) modules, application compiler, and graphical user interface. It is designed with a high degree of modularity to support fast exploration of future multicore processors of different topologies, routing schemes, processing elements (cores), and memory system organizations. It is a component-based framework with parameterized interfaces and strong emphasis on module reusability. The compiler toolchain is used to map C or C++ based applications onto the processing units. The GUI allows the user to quickly configure and launch a system instance for easy factorial development and evaluation. Hardware modules are implemented in synthesizable Verilog and are FPGA platform independent. The Heracles tool is freely available under the open-source MIT license at: http://projects.csail.mit.edu/heracles.
Fast Scalable FPGA-Based Network-on-Chip Simulation Models
"... Abstract—This paper presents a set of two FPGA-based Network-on-Chip (NoC) simulation engines that composed the winning design of the 2011 MEMOCODE Design Contest in the absolute performance class. Both simulation engines were developed in Bluespec System Verilog (BSV) and were implemented on a Xili ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Abstract—This paper presents a set of two FPGA-based Network-on-Chip (NoC) simulation engines that composed the winning design of the 2011 MEMOCODE Design Contest in the absolute performance class. Both simulation engines were developed in Bluespec System Verilog (BSV) and were implemented on a Xilinx ML605 FPGA development board. For smaller networks and simpler router configurations a directmapped approach was employed, where the network to be simulated was directly implemented on the FPGA. For larger networks, where a direct-mapped approach is not feasible due to FPGA resource limitations, a virtualized time-multiplexed approach was used. Compared to the provided software reference implementation, our direct-mapped approach achieves three orders of magnitude speedup, while our virtualized timemultiplexed approach achieves one to two orders of magnitude speedup, depending on the network and router configuration.
Power-Aware Multi-Core Simulation for Early Design Stage Hardware/Software Co-Optimization
"... Stringent performance targets and power constraints push designers towards building specialized workload-optimized systems across a broad spectrum of the computing arena, including supercomputing applications as exemplified by the IBM BlueGene and Intel MIC architectures. In this paper, we make the ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
Stringent performance targets and power constraints push designers towards building specialized workload-optimized systems across a broad spectrum of the computing arena, including supercomputing applications as exemplified by the IBM BlueGene and Intel MIC architectures. In this paper, we make the case for hardware/software co-design during early design stages of processors for scientific computing applications. Considering an important scientific kernel, namely stencil computation, we demonstrate that performance and energy-efficiency can be improved by a factor of 1.66 × and 1.25×, respectively, by co-optimizing hardware and software. To enable hardware/software co-design in early stages of the design cycle, we propose a novel simulation infrastructure by combining high-abstraction performance simulation using Sniper with power modeling using McPAT and custom DRAM power models. Sniper/McPAT is fast — simulation speed is around 2 MIPS on an 8-core host machine — because it uses analytical modeling to abstract away core performance during multi-core simulation. We demonstrate Sniper/McPAT’s accuracy through validation against real hardware; we report average performance and power prediction errors of 22.1 % and 8.3%, respectively, for a set of SPEComp benchmarks.
A Unified Model for Hardware/Software Codesign
, 2011
"... Embedded systems are almost always built with parts implemented in both hardware and software. Market forces encourage such systems to be developed with different hardware-software decompositions to meet different points on the priceperformance-power curve. Current design methodologies make the expl ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Embedded systems are almost always built with parts implemented in both hardware and software. Market forces encourage such systems to be developed with different hardware-software decompositions to meet different points on the priceperformance-power curve. Current design methodologies make the exploration of different hardware-software decompositions difficult because such exploration is both expensive and introduces significant delays in time-to-market. This thesis addresses this problem by introducing, Bluespec Codesign Language (BCL), a unified language model based on guarded atomic actions for hardware-software codesign. The model provides an easy way of specifying which parts of the design should be implemented in hardware and which in software without obscuring important design decisions. In addition to describing BCL’s operational semantics, we formalize the equivalence of BCL programs and use this to mechanically verify design refinements. We describe the partitioning of a BCL program via computational domains and the compilation
Hard Real-time Applications
, 2013
"... Hybrid control systems are a growing domain of application. They are pervasive and their complexity is increasing rapidly. Distributed control systems for future "Intelligent Grid" and renewable energy generation systems are demanding high-performance, hard real-time computation, and more ..."
Abstract
- Add to MetaCart
(Show Context)
Hybrid control systems are a growing domain of application. They are pervasive and their complexity is increasing rapidly. Distributed control systems for future "Intelligent Grid" and renewable energy generation systems are demanding high-performance, hard real-time computation, and more programmability. General-purpose computer systems are primarily designed to process data and not to interact with physical processes as required by these systems. Generic general-purpose architectures even with the use of real-time operating systems fail to meet the hard real-time constraints of hybrid system dynamics. ASIC, FPGA, or traditional embedded design approaches to these systems often result in expensive, complicated systems that are hard to program, reuse, or maintain. In this thesis, we propose a domain-specific architecture template targeting hybrid con-trol system applications. Using power electronics control applications, we present new mod-eling techniques, synthesis methodologies, and a parameterizable computer architecture for these large distributed control systems.
Method and System for Automated Unit-Level Testing of FPGA-based Cycle-Accurate Microprocessor Simulator Using Software Simulator as a Golden Model
"... Abstract For new hard ware/software co-designed CPU architectures there is a need for fast and flexib le performance simu lation to perform extensive design space exp loration in both software and hardware do mains. To confirm reliability of design decisions, the simu lator should also be accurate, ..."
Abstract
- Add to MetaCart
(Show Context)
Abstract For new hard ware/software co-designed CPU architectures there is a need for fast and flexib le performance simu lation to perform extensive design space exp loration in both software and hardware do mains. To confirm reliability of design decisions, the simu lator should also be accurate, which is usually achieved at the cost of reduced simulat ion speed. Although FPGA-accelerated simulators have dramat ically higher speed than software simulators, such models require much higher development effort. An FPGA -based model generally assumes a top-down design flo w, where the system can be tested only after all units have been developed. The paper describes a method and system for bottom-up development and automated unit-level testing of FPGA-based cycle-accurate simulator leveraging functionality of existing software simu lator.
Ultra-Fast and Accurate Simulation for Large-Scale Many-Core Processors
"... The proposed methods enable ultra-fast and accurate emulations of large-scale NoC architectures with up to thousands of nodes using only on-chip resources of a single FPGA. The evaluation results show that more than 5,000× simulation speedup over BookSim, one of the most widely used software-based ..."
Abstract
- Add to MetaCart
(Show Context)
The proposed methods enable ultra-fast and accurate emulations of large-scale NoC architectures with up to thousands of nodes using only on-chip resources of a single FPGA. The evaluation results show that more than 5,000× simulation speedup over BookSim, one of the most widely used software-based NoC simulators, is achieved while the simulation accuracy is maintained. ii