Results 1 - 10
of
73
Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulations
- in International Conference for High Performance Computing, Networking, Storage and Analysis (SC
, 2011
"... Two major trends in high-performance computing, namely, larger numbers of cores and the growing size of on-chip cache memory, are creating significant challenges for evaluating the design space of future processor architectures. Fast and scalable simulations are therefore needed to allow for suffici ..."
Abstract
-
Cited by 42 (9 self)
- Add to MetaCart
(Show Context)
Two major trends in high-performance computing, namely, larger numbers of cores and the growing size of on-chip cache memory, are creating significant challenges for evaluating the design space of future processor architectures. Fast and scalable simulations are therefore needed to allow for sufficient exploration of large multi-core systems within a limited simulation time budget. By bringing together accurate highabstraction analytical models with fast parallel simulation, architects can trade off accuracy with simulation speed to allow for longer application runs, covering a larger portion of the hardware design space. Interval simulation provides this balance between detailed cycle-accurate simulation and one-IPC simulation, allowing long-running simulations to be modeled much faster than with detailed cycle-accurate simulation, while still providing the detail necessary to observe core-uncore interactions across the entire system. Validations against real hardware show average absolute errors within 25 % for a variety of multi-threaded workloads; more than twice as accurate on average as one-IPC simulation. Further, we demonstrate scalable simulation speed of up to 2.0 MIPS when simulating a 16-core system on an 8-core SMP machine.
Atac: a 1000-core cache-coherent processor with on-chip optical network
- in Proceedings of the 19th international conference on Parallel architectures and compilation techniques, PACT ’10
, 2010
"... Based on current trends, multicore processors will have 1000 cores or more within the next decade. However, their promise of increased performance will only be realized if their inherent scaling and programming challenges are overcome. Fortunately, recent advances in nanophotonic device manufacturin ..."
Abstract
-
Cited by 22 (3 self)
- Add to MetaCart
(Show Context)
Based on current trends, multicore processors will have 1000 cores or more within the next decade. However, their promise of increased performance will only be realized if their inherent scaling and programming challenges are overcome. Fortunately, recent advances in nanophotonic device manufacturing are making CMOS-integrated optics a reality—interconnect technology which can provide significantly more bandwidth at lower power than conventional electrical signaling. Optical interconnect has the potential to enable massive scaling and preserve familiar programming models in future multicore chips. This paper presents ATAC, a new multicore architecture with integrated optics, and ACKwise, a novel cache coherence protocol designed to leverage ATAC’s strengths. ATAC uses nanophotonic technology to implement a fast, efficient global broadcast network which helps address a number of the challenges that future multicores will face. ACKwise is a new directory-based cache coherence protocol that uses this broadcast mechanism to provide high performance and scalability. Based on 64-core and 1024-core simulations with Splash2, Parsec, and synthetic benchmarks, we show that ATAC with ACKwise out-performs a chip with conventional interconnect and cache coherence protocols. On 1024-core evaluations, ACKwise protocol on ATAC outperforms the best conventional cache coherence protocol on an electrical mesh network by 2.5x with Splash2 benchmarks and by 61 % with synthetic benchmarks.
Generalized just-in-time trace compilation using a parallel task farm in a dynamic binary translator
- in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’11). ACM
, 2011
"... Dynamic Binary Translation (DBT) is the key technology behind cross-platform virtualization and allows software compiled for one Instruction Set Architecture (ISA) to be executed on a processor supporting a different ISA. Under the hood, DBT is typically implemented using Just-In-Time (JIT) compilat ..."
Abstract
-
Cited by 13 (3 self)
- Add to MetaCart
(Show Context)
Dynamic Binary Translation (DBT) is the key technology behind cross-platform virtualization and allows software compiled for one Instruction Set Architecture (ISA) to be executed on a processor supporting a different ISA. Under the hood, DBT is typically implemented using Just-In-Time (JIT) compilation of frequently executed program regions, also called traces. The main challenge is translating frequently executed program regions as fast as possible into highly efficient native code. As time for JIT compilation adds to the overall execution time, the JIT compiler is often decoupled and operates in a separate thread independent from the main simulation loop to reduce the overhead of JIT compilation. In this paper we present two innovative contributions. The first contribution is a generalized trace compilation approach that considers all frequently executed paths in a program for JIT compilation,
Hasim: Fpga-based high-detail multicore simulation using time-division multiplexing
- In HPCA
, 2011
"... Abstract—In this paper we present the HAsim FPGAaccelerated simulator. HAsim is able to model a sharedmemory multicore system including detailed core pipelines, cache hierarchy, and on-chip network, using a single FPGA. We describe the scaling techniques that make this possible, including novel uses ..."
Abstract
-
Cited by 12 (2 self)
- Add to MetaCart
(Show Context)
Abstract—In this paper we present the HAsim FPGAaccelerated simulator. HAsim is able to model a sharedmemory multicore system including detailed core pipelines, cache hierarchy, and on-chip network, using a single FPGA. We describe the scaling techniques that make this possible, including novel uses of time-multiplexing in the core pipeline and on-chip network. We compare our timemultiplexed approach to a direct implementation, and present a case study that motivates why high-detail simulations should continue to play a role in the architectural exploration process.
ESESC: A Fast Multicore Simulator Using Time-Based Sampling
- in International Symposium on High Performance Computer Architecture
, 2013
"... Architects rely on simulation in their exploration of the design space. However, slow simulation speed caps their productivity and limits the depth of their exploration. Sampling has been a commonly used remedy. While sampling is shown to be an effective technique for single core processors, its app ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
(Show Context)
Architects rely on simulation in their exploration of the design space. However, slow simulation speed caps their productivity and limits the depth of their exploration. Sampling has been a commonly used remedy. While sampling is shown to be an effective technique for single core processors, its application has been limited to simulation of multiprogram, throughput applications only. This work presents Time-Based Sampling (TBS), a framework that is the first to enable sampling in simulation of multicore processors with virtually no limitation in terms of application type (multiprogrammed or multithreaded), number of cores, homogeneity or heterogeneity of the simulated configuration (4.99 % error averaged across all the evaluated configurations). TBS also is the first to enable integrated power and temperature evaluation in statistically sampled simulation of multicore systems (with 5.5 % and 2.4 % error on average, respectively). We implement an architectural simulator based on TBS, called ESESC, that provides a holistic set of tools for a fair evaluation of different architectures. 1.
Using Cycle Stacks to Understand Scaling Bottlenecks in Multi-Threaded Workloads
"... This paper proposes a methodology for analyzing parallel performance by building cycle stacks. A cycle stack quantifies where the cycles have gone, and provides hints towards optimization opportunities. We make the case that this is particularly interesting for analyzing parallel performance: unders ..."
Abstract
-
Cited by 8 (2 self)
- Add to MetaCart
(Show Context)
This paper proposes a methodology for analyzing parallel performance by building cycle stacks. A cycle stack quantifies where the cycles have gone, and provides hints towards optimization opportunities. We make the case that this is particularly interesting for analyzing parallel performance: understanding how cycle components scale with increasing core counts and/or input data set sizes leads to insight with respect to scaling bottlenecks due to synchronization, load imbalance, poor memory performance, etc. We present several case studies illustrating the use of cycle stacks. As a subsequent step, we further extend the methodology to analyze sets of parallel workloads using statistical data analysis, and perform a workload characterization to understand behavioral differences across benchmark suites. We analyze the SPLASH-2, PARSEC and Rodinia benchmark suites and conclude that the three benchmark suites cover similar areas in the workload space. However, scaling behavior of these benchmarks towards larger input sets and/or higher core counts is highly dependent on the benchmark, the way in which the inputs have been scaled, and on the machine configuration. 1
EFFEX: An Embedded Processor for Computer Vision-Based Feature Extraction
- In Design Automation Conference
, 2011
"... The deployment of computer vision algorithms in mobile applications is growing at a rapid pace. A primary component of the computer vision software pipeline is feature extraction, which identifies and encodes relevant image features. We present an embedded heterogeneous multicore design named EFFEX ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
(Show Context)
The deployment of computer vision algorithms in mobile applications is growing at a rapid pace. A primary component of the computer vision software pipeline is feature extraction, which identifies and encodes relevant image features. We present an embedded heterogeneous multicore design named EFFEX that incorporates novel functional units and memory architecture support, making it capable of increasing mobile vision performance while balancing power and area. We demonstrate this architecture running three common feature extraction algorithms, and show that it is capable of providing significant speedups at low cost. Our simulations show a speedup of as much as 14 × for feature extraction with a decrease in energy of 40x for memory accesses.
COREMU: A Scalable and Portable Parallel Full-system Emulator
, 2010
"... NOTES: This report has been submitted for early dissemination of its contents. It will thus be subjective to change without prior notice. It will also be probabaly copyrighted if accepted for publication in a referred conference of journal. Parallel Processing Institute makes no gurantee on the cons ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
(Show Context)
NOTES: This report has been submitted for early dissemination of its contents. It will thus be subjective to change without prior notice. It will also be probabaly copyrighted if accepted for publication in a referred conference of journal. Parallel Processing Institute makes no gurantee on the consequences of using the viewpoints and results in the technical report. It
The Locality-Aware Adaptive Cache Coherence Protocol
"... Next generation multicore applications will process massive amounts of data with significant sharing. Data movement and management impacts memory access latency and consumes power. Therefore, harnessing data locality is of fundamental importance in future processors. We propose a scalable, efficient ..."
Abstract
-
Cited by 4 (2 self)
- Add to MetaCart
(Show Context)
Next generation multicore applications will process massive amounts of data with significant sharing. Data movement and management impacts memory access latency and consumes power. Therefore, harnessing data locality is of fundamental importance in future processors. We propose a scalable, efficient shared memory cache coherence protocol that enables seamless adaptation between private and logically shared caching of on-chip data at the fine granularity of cache lines. Our data-centric approach relies on inhardware yet low-overhead runtime profiling of the locality of each cache line and only allows private caching for data blocks with high spatio-temporal locality. This allows us to better exploit the private caches and enable low-latency, low-energy memory access, while retaining the convenience of shared memory. On a set of parallel benchmarks, our lowoverhead locality-aware mechanisms reduce the overall energy by 25 % and completion time by 15 % in an NoC-based multicore with the Reactive-NUCA on-chip cache organization and the ACKwise limited directory-based coherence protocol. Categories and Subject Descriptors C.1.2.g [Processor Architectures]: [Parallel processors]; B.3.2.g [Memory Structures]: [Shared memory]
DIRECTORYLESS SHARED MEMORY COHERENCE USING EXECUTION MIGRATION
"... We introduce the concept of deadlock-free migration-based coherent shared memory to the NUCA family of architectures. Migration-based architectures move threads among cores to guarantee sequential semantics in large multicores. Using a execution migration (EM) architecture, we achieve performance co ..."
Abstract
-
Cited by 4 (2 self)
- Add to MetaCart
(Show Context)
We introduce the concept of deadlock-free migration-based coherent shared memory to the NUCA family of architectures. Migration-based architectures move threads among cores to guarantee sequential semantics in large multicores. Using a execution migration (EM) architecture, we achieve performance comparable to directory-based architectures without using directories: avoiding automatic data replication significantly reduces cache miss rates, while a fast network-level thread migration scheme takes advantage of shared data locality to reduce remote cache accesses that limit traditional NUCA performance. EM area and energy consumption are very competitive, and, on the average, it outperforms a directory-based MOESI baseline by 1.3 × and a traditional S-NUCA design by 1.2×. We argue that with EM scaling performance has much lower cost and design complexity than in directorybased coherence and traditional NUCA architectures: by merely scaling network bandwidth from 256 to 512 bit flits, the performance of our architecture improves by an additional 13%, while the baselines show negligible improvement. 1