Last-level cache (LLC) performance of data-mining workloads on a CMP—A case study of parallel bioinformatics workloads (2006)

by A Jaleel, M Mattina, B Jacob
Venue: In IEEE HPCA

Results 1 - 10 of 20

The PARSEC benchmark suite: Characterization and architectural implications

by Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, Kai Li - In Princeton University, 2008
Abstract - Cited by 150 (1 self)
This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previously available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.

Cooperative caching for chip multiprocessors

by Jichuan Chang - In Proceedings of the 33rd Annual International Symposium on Computer Architecture, 2006
Abstract - Cited by 87 (1 self)
Chip multiprocessor (CMP) systems have made the on-chip caches a critical resource shared among co-scheduled threads. Limited off-chip bandwidth, increasing on-chip wire delay, destructive inter-thread interference, and diverse workload characteristics pose key design challenges. To address these challenges, we propose CMP cooperative caching (CC), a unified framework to efficiently organize and manage on-chip cache resources. By forming a globally managed, shared cache using cooperative private caches, CC can effectively support two important caching applications: (1) reduction of average memory access latency and (2) isolation of destructive inter-thread interference. CC reduces the average memory access latency by balancing between cache latency and capacity optimizations. Based on private caches, CC naturally exploits their access latency benefits. To improve the effective cache capacity, CC forms a “shared” cache using replication control and LRU-based global replacement policies. Via cooperation throttling, CC provides a spectrum of caching behaviors between the two extremes of private and shared caches, thus enabling dynamic adaptation to suit workload requirements. We show that CC can achieve a robust performance advantage over private and shared cache schemes across different processor, cache and memory configurations, and a wide selection of multithreaded and multiprogrammed

ASR: Adaptive selective replication for CMP caches

by Bradford M. Beckmann - In Proceedings of MICRO-39, 2006
Abstract - Cited by 32 (3 self)
The large working sets of commercial and scientific workloads stress the L2 caches of Chip Multiprocessors (CMPs). Some CMPs use a shared L2 cache to maximize the on-chip cache capacity and minimize off-chip misses. Others use private L2 caches, replicating data to limit the delay due to global wires and minimize cache access time. Recent hybrid proposals use selective replication to balance latency and capacity, but their static replication rules result in performance degradation for some combinations of workloads and system configurations. This paper proposes Adaptive Selective Replication (ASR), a mechanism that dynamically monitors workload behavior to control replication. ASR replicates cache blocks only when it estimates the benefit of replication (lower L2 hit latency) exceeds the cost (more L2 misses). Full-system simulations of 8-processor CMPs show that ASR provides robust performance: improving performance by as much as 29% versus shared caches, 19% versus private caches, and 12% versus CMP-NuRapid [9] and Victim Replication [41]. Furthermore, while ASR does not improve the performance of all workloads, it provides performance stability by always performing at least comparably to the best alternative including Cooperative Caching [8].

Rodinia: A Benchmark Suite for Heterogeneous Computing

by Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-ha Lee, Kevin Skadron
Abstract - Cited by 27 (4 self)
This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels which target multi-core CPU and GPU platforms. The choice of applications is inspired by Berkeley’s dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

Fully-buffered DIMM memory architectures: Understanding mechanisms, overheads and scaling

by Brinda Ganesh, Aamer Jaleel, David Wang, Bruce Jacob - In Proceedings of the 13th International Symposium on High Performance Computer Architecture, 2007
Abstract - Cited by 8 (0 self)
Performance gains in memory have traditionally been obtained by increasing memory bus widths and speeds. The diminishing returns of such techniques have led to the proposal of an alternate architecture, the Fully-Buffered DIMM. This new standard replaces the conventional memory bus with a narrow, high-speed interface between the memory controller and the DIMMs. This paper examines how traditional DDRx-based memory controller policies for scheduling and row buffer management perform on a Fully-Buffered DIMM memory architecture. The split-bus architecture used by FBDIMM systems results in an average improvement of 7% in latency and 10% in bandwidth at higher utilizations. On the other hand, at lower utilizations, the increased cost of serialization resulted in a degradation in latency and bandwidth of 25% and 10% respectively. The split-bus architecture also makes the system performance sensitive to the ratio of read and write traffic in the workload. In larger configurations, we found that the FBDIMM system performance was more sensitive to usage of the FBDIMM links than to DRAM bank availability. In general, FBDIMM performance is similar to that of DDRx systems, and provides better performance characteristics at higher utilization, making it a relatively inexpensive mechanism for scaling capacity at higher bandwidth requirements. The mechanism is also largely insensitive to scheduling policies, provided certain ground rules are obeyed.

Performance characterization of data mining applications using MineBench

by Joseph Zambreno, Berkin Özışıkyılmaz, Gokhan Memik, Alok Choudhary - In 9th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW), 2006
Abstract - Cited by 8 (4 self)
Data mining is the process of finding useful patterns in large sets of data. These algorithms and techniques have become vital to researchers making discoveries in diverse fields, and to businesses looking to gain a competitive advantage. In recent years, there has been a tremendous increase in both the size of the data being collected and the complexity of the data mining algorithms themselves. This rate of growth has been exceeding the rate of improvements in computing systems, thus widening the performance gap between data mining systems and algorithms. The first step in closing this gap is to analyze these algorithms and understand their bottlenecks. In this paper, we present a set of representative data mining applications we call MineBench. We evaluate the MineBench applications on an 8-way shared memory machine and analyze some important performance characteristics. We believe that this information can aid the designers of future systems with regard to data mining applications.

Managing Wire Delay in Chip Multiprocessor Caches

by Bradford M. Beckmann, 2006
Abstract - Cited by 4 (1 self)
Increasing on-chip wire delay and growing off-chip miss latency present two key challenges in designing large Level-2 (L2) CMP caches. Currently, some CMPs use a shared L2 cache to maximize cache capacity and minimize off-chip misses. Others use private L2 caches, replicating data to limit the delay from slow on-chip wires and minimize cache access time. Ideally, to improve performance for a wide variety of workloads, CMPs prefer both the capacity of a shared cache and the access latency of private caches. In this thesis, we propose three techniques that combine the benefits of shared and private caches. In particular, to reduce access latency in a shared cache, we investigate cache block migration and on-chip transmission lines. Migration reduces access latency by moving frequently used blocks towards the lower-latency banks. We show migration successfully reduces latency to blocks requested by only one processor, but doesn’t reduce the latency to shared blocks. In contrast, transmission lines can reduce on-chip wire delay by an order of magnitude versus conventional wires and provide low latency to all shared cache banks. We demonstrate on-chip transmission lines consistently improve performance versus a baseline shared cache, but bandwidth contention can limit them from reaching their full potential. To improve the effective capacity of private caches, we propose Adaptive Selective Replication (ASR). ASR dynamically monitors workload behavior and replicates cache blocks only when it estimates the benefit of replication (lower L2 hit latency) exceeds the cost (more L2 misses). When ASR detects replication is less beneficial, processors coordinate writebacks with remote on-chip caches to conserve cache storage. ASR provides a robust CMP cache hierarchy: improving performance versus both shared and private caches.
Additionally, ASR can leverage the fast remote cache access latency provided by transmission lines and reduce off-chip misses versus a design using conventional wires. We demonstrate the combination of transmission lines and ASR outperforms either isolated technique and performs similarly to a shared cache using four times the transmission line bandwidth.

A case for an over-provisioned multicore system: Energy efficient processing of multithreaded programs

by Koushik Chakraborty, 2007
Abstract - Cited by 4 (2 self)
Technology scaling has provided system designers with an exploding transistor budget, far more than what was available when the core principles behind many existing commodity microprocessors were envisioned. With this tremendous growth, however, comes a whole new set of engineering challenges involving power density, thermal efficiency, programmability and so on. In this paper, we study another important trend in high performance microprocessors: the reduction in the Simultaneously Active Fraction (SAF), the fraction of the entire chip resources that can be active simultaneously, given a target power envelope. As the improvement in the energy efficiency of individual transistor devices is lagging behind the growth in their integration capacity, we find that the SAF is monotonically decreasing for each successive technology generation. Given this increasing constraint on the SAF, we examine the utility of temporarily suspending computation on a core as a means for reducing the SAF, and hence remaining within the confines of cost-effective cooling and power delivery. We investigate a SAF-aware over-provisioned multicore system (OPMS), where only a subset of the available cores is employed to perform active computation at any given time, by allowing the individual cores to transition between active and inactive states. Though there are several possible directions for utilizing such an over-provisioned system, this paper focuses on energy efficient dynamic task redistribution. In particular, this paper examines the use of Computation Spreading, a recently proposed technique for runtime specialization of homogeneous multicores, in an OPMS. We show several benefits for such an OPMS design, including reductions in energy, runtime, and superior thermal characteristics. Overall, our technique improves the energy-delay product of the commercial workloads we examine by 5–20%.

Thread Fusion

by José González, Qiong Cai, Pedro Chaparro, Grigorios Magklis, Ryan Rakvic, Antonio González
Abstract - Cited by 2 (1 self)
This work proposes Thread Fusion as an effective way of reducing power consumption when a Simultaneous Multi-Threaded (SMT) core is executing two threads from a homogeneous parallel application. Two dynamic instances of the same static instruction, each from a different thread, are merged (fused) into a single instruction, consuming half of the resources of front-end pipeline stages. When the fused instruction is executed, it is cloned and it proceeds at full bandwidth. Our simulation results show an average energy reduction of 10% with less than 1% impact on performance.

A Characterization of the Rodinia Benchmark Suite with Comparison to Contemporary CMP Workloads

by Shuai Che, Jeremy W. Sheaffer, Michael Boyer, Lukasz G. Szafaryn, Liang Wang, Kevin Skadron
Abstract - Cited by 2 (0 self)
The recently released Rodinia benchmark suite enables users to evaluate heterogeneous systems including both accelerators, such as GPUs, and multicore CPUs. As Rodinia sees higher levels of acceptance, it becomes important that researchers understand this new set of benchmarks, especially in how they differ from previous work. In this paper, we present recent extensions to Rodinia and conduct a detailed characterization of the Rodinia benchmarks (including performance results on an NVIDIA GeForce GTX480, the first product released based on the Fermi architecture). We also compare and contrast Rodinia with Parsec to gain insights into the similarities and differences of the two benchmark collections; we apply principal component analysis to analyze the application space coverage of the two suites. Our analysis shows that many of the workloads in Rodinia and Parsec are complementary, capturing different aspects of certain performance metrics.