Results 1 - 10 of 53
3D-Stacked Memory Architectures for Multi-Core Processors
- In International Symposium on Computer Architecture
Abstract - Cited by 132 (7 self)
Three-dimensional integration enables stacking memory directly on top of a microprocessor, thereby significantly reducing wire delay between the two. Previous studies have examined the performance benefits of such an approach, but all of these works only consider commodity 2D DRAM organizations. In this work, we explore more aggressive 3D DRAM organizations that make better use of the additional die-to-die bandwidth provided by 3D stacking, as well as the additional transistor count. Our simulation results show that with a few simple changes to the 3D-DRAM organization, we can achieve a 1.75× speedup over previously proposed 3D-DRAM approaches on our memory-intensive multi-programmed workloads on a quad-core processor. The significant increase in memory system performance makes the L2 miss handling architecture (MHA) a new bottleneck, which we address by combining a novel data structure called the Vector Bloom Filter with dynamic MSHR capacity tuning. Our scalable L2 MHA yields an additional 17.8% performance improvement over our 3D-stacked memory architecture.
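The abstract does not spell out the Vector Bloom Filter's internals, but the general idea of screening miss-handling lookups with a Bloom filter can be sketched. The class below is a hypothetical, simplified membership filter for outstanding miss addresses, not the paper's actual design; the bit-array size, hash count, and method names are illustrative assumptions.

```python
import hashlib

class BloomMSHRFilter:
    """Toy Bloom-filter front-end for MSHR lookups (illustrative only;
    the paper's Vector Bloom Filter is a more sophisticated structure)."""

    def __init__(self, bits=256, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.array = [0] * bits

    def _positions(self, block_addr):
        # Derive several bit positions from the block address.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{block_addr}:{i}".encode()).digest()
            yield int.from_bytes(h[:4], "little") % self.bits

    def insert(self, block_addr):
        # Called when a miss allocates an MSHR entry.
        for p in self._positions(block_addr):
            self.array[p] = 1

    def maybe_present(self, block_addr):
        # False means definitely no matching MSHR entry (skip the search);
        # True means a full MSHR lookup is still required (false positives
        # are possible).
        return all(self.array[p] for p in self._positions(block_addr))
```

A negative answer lets the controller skip the associative MSHR search entirely, which is where the latency and energy savings would come from.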
MineBench: A Benchmark Suite for Data Mining Workloads
- 2006 IEEE International Symposium on Workload Characterization
, 2006
Abstract - Cited by 78 (10 self)
Data mining constitutes an important class of scientific and commercial applications. Recent advances in data extraction techniques have created vast data sets, which require increasingly complex data mining algorithms to sift through them to generate meaningful information. The disproportionately slower rate of growth of computer systems has led to a sizeable performance gap between data mining systems and algorithms. The first step in closing this gap is to analyze these algorithms and understand their bottlenecks. With this knowledge, current computer architectures can be optimized for data mining applications. In this paper, we present MineBench, a publicly available benchmark suite containing fifteen representative data mining applications belonging to various categories such as clustering, classification, and association rule mining. We believe that MineBench will be of use to those looking to characterize and accelerate data mining workloads.
Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs
- In Proceedings of the 40th Intl. Symp. on Microarchitecture
, 2007
Abstract - Cited by 60 (1 self)
DRAMs require periodic refresh for preserving the data stored in them. The refresh interval for DRAMs depends on the vendor and the design technology they use. For each refresh of a DRAM row, the stored information in each cell is read out and then written back to itself, as each DRAM bit read is self-destructive. The refresh process is unavoidable for maintaining data correctness, but it comes at the expense of power and bandwidth overhead. The future trend to integrate layers of 3D die-stacked DRAMs on top of a processor further exacerbates the situation, as accesses to these DRAMs will be more frequent and hiding refresh cycles in the available slack becomes increasingly difficult. Moreover, due to the implied temperature increase, the refresh interval of 3D die-stacked DRAMs will become shorter than that of conventional ones. This paper proposes an innovative scheme to reduce the energy consumed in DRAMs. By employing a time-out counter for each memory row of a DRAM module, all unnecessary periodic refresh operations can be eliminated. The basic concept behind our scheme is that a DRAM row that was recently read or written by the processor (or other devices that share the same DRAM) does not need to be refreshed again by the periodic refresh operation, thereby eliminating excessive refreshes and the energy they dissipate. Based on this concept, we propose a low-cost technique in the memory controller for DRAM power reduction. The simulation results show that our technique can reduce up to 86% of all refresh operations, and 59.3% on average, for a 2GB DRAM. This in turn results in a 52.6% energy saving for refresh operations. The overall energy saving in the DRAM is up to 25.7%, with an average of 12.13%, for the SPLASH-2, SPECint2000, and BioBench benchmark programs simulated on a 2GB DRAM. For a 64MB 3D DRAM, the energy saving is up to 21%, and 9.37% on average, with a 64 ms refresh interval. For a faster 32 ms refresh interval, the maximum and average savings are 12% and 6.8%, respectively.
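The per-row time-out idea lends itself to a compact sketch. The Python class below models counters that are reset by ordinary reads and writes, so a periodic refresh is issued only for rows whose counter has expired; the row count, period, and tick granularity are illustrative choices, not parameters from the paper.

```python
class SmartRefreshController:
    """Sketch of a Smart-Refresh-style per-row time-out counter.
    All sizes and timings here are illustrative assumptions."""

    def __init__(self, num_rows, refresh_period_ticks):
        self.period = refresh_period_ticks
        # Counter 0 means the row is due for an explicit refresh now.
        self.counters = [0] * num_rows

    def on_access(self, row):
        # A read or write restores the row's charge as a side effect,
        # so its refresh deadline is pushed back a full period.
        self.counters[row] = self.period

    def tick(self):
        """Advance one time unit; return rows needing an explicit refresh."""
        refreshed = []
        for row in range(len(self.counters)):
            if self.counters[row] == 0:
                refreshed.append(row)          # would issue a refresh here
                self.counters[row] = self.period
            else:
                self.counters[row] -= 1
        return refreshed
```

Rows that the workload touches often are refreshed "for free" by their own accesses, which is exactly the class of refresh operations the paper's controller eliminates.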
Last-Level Cache (LLC) Performance of Data Mining Workloads on a CMP - A Case Study of Parallel Bioinformatics Workloads
- In Proceedings of the 12th International Symposium on High-Performance Computer Architecture (HPCA-12)
, 2006
Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors
, 2007
Abstract - Cited by 52 (5 self)
3D integration technology greatly increases transistor density while providing faster on-chip communication. 3D implementations of processors can simultaneously provide both latency and power benefits due to reductions in critical wires. However, 3D stacking of active devices can potentially exacerbate existing thermal problems. In this work, we propose a family of Thermal Herding techniques that (1) reduce 3D power density and (2) locate a majority of the power on the top die, closest to the heat sink. Our 3D/thermal-aware microarchitecture contributions include a significance-partitioned datapath that places the frequently switching 16 bits on the top die, a 3D-aware instruction scheduler allocation scheme, an address memoization approach for the load and store queues, a partial value encoding for the L1 data cache, and a branch target buffer that exploits a form of frequent partial value locality in target addresses. Compared to a conventional planar processor, our 3D processor achieves a 47.9% frequency increase, which results in a 47.0% performance improvement (min 7%, max 77% on individual benchmarks), while simultaneously reducing total power by 20% (min 15%, max 30%). Without our Thermal Herding techniques, the worst-case 3D temperature increases by 17 degrees. With our Thermal Herding techniques, the temperature increase is only 12 degrees (a 29% reduction in the 3D worst-case temperature increase).
BioPerf: A benchmark suite to evaluate high-performance computer architecture on bioinformatics applications
- In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC
Abstract - Cited by 44 (5 self)
The exponential growth in the amount of genomic data has spurred growing interest in large-scale analysis of genetic information. Bioinformatics applications, which explore computational methods to allow researchers to sift through the massive biological data and extract useful information, are becoming increasingly important computer workloads. This paper presents BioPerf, a benchmark suite of representative bioinformatics applications to facilitate the design and evaluation of high-performance computer architectures for these emerging workloads. Currently, the BioPerf suite contains codes from 10 highly popular bioinformatics packages and covers the major fields of study in computational biology, such as sequence comparison, phylogenetic reconstruction, protein structure prediction, and sequence homology & gene finding. We demonstrate the use of BioPerf by providing simulation points of pre-compiled Alpha binaries and with a performance study on IBM Power using IBM Mambo simulations cross-compared with Apple G5 executions. The BioPerf suite (available from www.bioperf.org) includes benchmark source code, input datasets of various sizes, and information for compiling and using the benchmarks. Our benchmark suite includes parallel codes where available.
Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement
Abstract - Cited by 34 (6 self)
Power consumption and DRAM latencies are serious concerns in modern chip-multiprocessor (CMP or multi-core) based compute systems. The management of the DRAM row buffer can significantly impact both power consumption and latency. Modern DRAM systems read data from cell arrays and populate a row buffer as large as 8 KB on a memory request. But only a small fraction of these bits are ever returned to the CPU. This ends up wasting energy and time to read (and subsequently write back) bits which are used rarely. Traditionally, an open-page policy has been used for uni-processor systems, and it has worked well because of spatial and temporal locality in the access stream. In future multi-core processors, the possibly independent access streams of each core are interleaved, thus destroying the available locality and significantly under-utilizing the contents of the row buffer. In this work, we attempt to improve row-buffer utilization for future multi-core systems. The schemes presented here are motivated by our observations that a large number of accesses within heavily accessed OS pages are to small, contiguous “chunks” of cache blocks. Thus, the co-location of chunks (from different OS pages) in a row buffer will improve the overall utilization of the row buffer contents, and consequently reduce memory energy consumption and access time. Such co-location can be achieved in many ways, notably involving a reduction in OS page size and software- or hardware-assisted migration of data within DRAM. We explore these mechanisms and discuss the trade-offs involved, along with the energy and performance improvements from each scheme. On average, for applications with room for improvement, our best performing scheme increases performance by 9% (max. 18%) and reduces memory energy consumption by 15% (max. 70%).
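The chunk co-location intuition can be illustrated with a toy open-page model. In the sketch below, hot 1 KB chunks from far-apart OS pages are remapped into a single reserved DRAM row, turning what would be row-buffer conflicts into hits; the sizes, the reserved-row scheme, and the access trace are illustrative assumptions, not the paper's actual mechanisms.

```python
# Toy model: an 8 KB row buffer normally holds data from adjacent addresses,
# but with micro-pages several hot 1 KB chunks from different OS pages can
# be packed into one row. All sizes here are illustrative.

ROW_BYTES = 8192
CHUNK_BYTES = 1024

def row_buffer_hits(trace, addr_to_row):
    """Count hits under an open-page policy with a single row buffer."""
    open_row, hits = None, 0
    for addr in trace:
        row = addr_to_row(addr)
        if row == open_row:
            hits += 1
        else:
            open_row = row  # row conflict: close old row, open new one
    return hits

def baseline_row(addr):
    # Conventional mapping: contiguous addresses share a row.
    return addr // ROW_BYTES

def make_micropage_row(hot_chunks):
    """Map each hot 1 KB chunk to a shared reserved row; everything else
    keeps its baseline mapping."""
    RESERVED_ROW = -1  # hypothetical row set aside for hot chunks
    hot = set(hot_chunks)
    def addr_to_row(addr):
        chunk = addr // CHUNK_BYTES
        return RESERVED_ROW if chunk in hot else baseline_row(addr)
    return addr_to_row
```

With an interleaved trace that alternates between hot chunks of distant pages, the baseline mapping thrashes the row buffer while the micro-page mapping keeps one row open throughout, which is the effect the paper exploits to cut energy and access time.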
Adaptive Caches: Effective Shaping of Cache Behavior to Workloads
- In MICRO
, 2006
Brief Announcement: The Problem Based Benchmark Suite
Abstract - Cited by 16 (6 self)
This announcement describes the problem based benchmark suite (PBBS). PBBS is a set of benchmarks designed for comparing parallel algorithmic approaches, parallel programming language styles, and machine architectures across a broad set of problems. Each benchmark is defined concretely in terms of a problem specification and a set of input distributions. No requirements are made in terms of algorithmic approach, programming language, or machine architecture. The goal of the benchmarks is not only to compare runtimes, but also to be able to compare code and other aspects of an implementation (e.g., portability, robustness, determinism, and generality). As such, the code for an implementation of a benchmark is as important as its runtime, and the public PBBS repository will include both code and performance results. The benchmarks are designed to make it easy for others to try their own implementations, or to add new benchmark problems. Each benchmark problem includes the problem specification, the specification of input and output file formats, default input generators, test codes that check the correctness of the output for a given input, driver code that can be linked with implementations, a baseline sequential implementation, a baseline multicore implementation, and scripts for running timings (and checks) and outputting the results in a standard format. The current suite includes the following problems: integer sort, comparison sort, remove duplicates, dictionary, breadth-first search, spanning forest, minimum spanning forest, maximal independent set, maximal matching, k-nearest neighbors, Delaunay triangulation, convex hull, suffix arrays, n-body, and ray casting. For each problem, we report the performance of our baseline multicore implementation on a 40-core machine.
Performance characterization of data mining applications using MineBench
- In 9th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW
, 2006
Abstract - Cited by 14 (8 self)
Data mining is the process of finding useful patterns in large sets of data. These algorithms and techniques have become vital to researchers making discoveries in diverse fields, and to businesses looking to gain a competitive advantage. In recent years, there has been a tremendous increase in both the size of the data being collected and the complexity of the data mining algorithms themselves. This rate of growth has been exceeding the rate of improvement in computing systems, thus widening the performance gap between data mining systems and algorithms. The first step in closing this gap is to analyze these algorithms and understand their bottlenecks. In this paper, we present a set of representative data mining applications we call MineBench. We evaluate the MineBench applications on an 8-way shared-memory machine and analyze some important performance characteristics. We believe that this information can aid the designers of future systems with regard to data mining applications.