A. Saulsbury, F. Pong, and A. Nowatzyk, "Missing the Memory Wall: The Case for Processor/Memory Integration," Proc. 23rd Ann. Int'l Symp. Computer Architecture, pp. 90-101, 1996.
....to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICS '02, June 22-26, 2002, New York, New York, USA. Copyright 2002 ACM 1-58113-483-5/02/0006...$5.00. ....(PIM) chips to address the well-known performance gap between processor and memory speeds [2, 7, 8, 12, 15, 16, 17, 20, 21, 22, 25, 27, 28, 30]. Many previous architectural solutions to the processor-memory gap, such as multithreading, prefetching, and speculation, seek to reduce or tolerate memory latency at the expense of increased memory bandwidth requirements [3]. PIMs instead dramatically improve memory bandwidth, by 10-100X over ....
....benefits of PIM technology and a state-of-the-art host, yielding high performance for mixed workloads. Since PIM processors are usually less sophisticated due to on-chip space constraints, systems using PIMs alone in a multiprocessor may sacrifice performance on uniprocessor computations [12, 16, 25, 27], while system-on-a-chip solutions (e.g., the IRAM [22] and the Mitsubishi M32R/D [20]) limit the application domain. DIVA's support for a broad range of familiar parallel programming paradigms, including task parallelism for irregular computations, distinguishes it from systems with restricted ....
A. Saulsbury, F. Pong, and A. Nowatzyk. Missing the memory wall: The case for processor/memory integration. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.
....designs that integrate microarchitectures and main memory. The idea is to put a large amount of memory on the same die as the processor. Initial designs were targeted at embedded applications and used very simple processing cores [Deering 94]. Later designs used general-purpose microarchitectures [Saulsbury 96, Patterson 97]. By putting the memory close to the processor, much of the memory latency is removed. However, challenges remain for these designs [Patterson 97]. The memory fabrication technology is different from the process used for general-purpose processors. Integrating these two technologies ....
A. Saulsbury, F. Pong, and A. Nowatzyk. Missing the Memory Wall: The Case for Processor/Memory Integration. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 90--101, May 1996.
....performance gap. As this difference grows, designers are forced to develop techniques to leverage the negative effects of this gap. One attractive solution for this problem is to distribute intelligence to storage structures such as memories and secondary storage devices (e.g., Active Memories [4, 19, 28, 32] and Active Disks [2, 17, 23, 30]). The common goal of these techniques is to offload parts of the processing to the off-chip structures such that the amount of communication between the storage and processing elements is minimized. As VLSI technology advances, the gap between the performance ....
....effect on the overall power consumption, because each additional execution core increases the bus, the DFE, and the level-2 cache activity linearly; hence the ratios for different systems did not change significantly. V. RELATED WORK Active or smart memories have been extensively studied [4, 19, 28, 32]. However, such techniques concentrate on off-chip active memory, in contrast to our methods, which improve the performance of on-chip caches. Therefore, the fine-grain offloading may not be feasible for such systems. Streaming data, i.e., data accessed with a known, fixed displacement between ....
Saulsbury, A., F. Pong, and A. Nowatzyk. Missing the Memory Wall: The Case for Processor/Memory Integration. In Proceedings of the International Symposium on Computer Architecture, pp. 90-101, 1996.
....the memory bandwidth can be achieved by integrating the cache and the main memory into the same chip, or merged DRAM/logic LSI. Eliminating the chip boundary between the cache and the main memory solves the I/O pin bottleneck problem, thereby dramatically improving the memory bandwidth [42], [48], [53]. 4. Low Energy Memory Access Techniques: From the energy definition explained in Section 2, it can be understood that there are at least three approaches to reducing the average memory access energy (AMAE), as follows: Reducing the cache access energy (ECache) and maintaining the cache miss ....
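The excerpt above is cut off before the AMAE definition from its Section 2 appears. As a hedged illustration only, assume the common form AMAE = E_Cache + CMR x E_MainMemory, where CMR is the cache miss rate, by analogy with average memory access time. Under that assumption, the three approaches correspond to three levers: lower E_Cache at a fixed miss rate, lower CMR, or lower E_MainMemory. The function name and all numbers below are hypothetical and are not data from the cited work.

```python
def average_memory_access_energy(e_cache: float, miss_rate: float, e_main: float) -> float:
    """Energy per access under the ASSUMED model AMAE = E_cache + miss_rate * E_main.

    e_cache:   energy of one cache access (hypothetical units, e.g. nJ)
    miss_rate: cache miss rate, in [0, 1]
    e_main:    energy of one main-memory access on a miss (same units)
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must lie in [0, 1]")
    return e_cache + miss_rate * e_main


if __name__ == "__main__":
    # Hypothetical baseline values, not taken from the cited paper.
    print("baseline:", average_memory_access_energy(1.0, 0.05, 40.0))
    # Each lever applied alone: halve one term while keeping the others fixed.
    print("lower E_cache:", average_memory_access_energy(0.5, 0.05, 40.0))
    print("lower miss rate:", average_memory_access_energy(1.0, 0.025, 40.0))
    print("lower E_main:", average_memory_access_energy(1.0, 0.05, 20.0))
```

With these toy numbers, the main-memory term dominates. That explains why cutting either the miss rate or the off-chip energy pays off more than trimming cache energy alone. It is also the argument for merging DRAM and logic on one chip.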
A. Saulsbury, F. Pong, and A. Nowatzyk, "Missing the Memory Wall: The Case for Processor/Memory Integration," Proc. of the 23rd Annual International Symposium on Computer Architecture, pp. 90-101, May 1996.
....schemes that select which banks operate in page mode and which do not. Section 5 investigates the impact of the SRAM cache in DRAM devices and evaluates several design alternatives. Finally, in Section 6 we conclude and place this study in the perspective of integrated processor-memory systems [3, 14]. 2 Page-Mode DRAM and Cached DRAM Operation: In this section, we review very briefly the operation of page-mode DRAMs and cached DRAMs. 2.1 Page-Mode DRAM: An access in DRAM devices usually consists of a row access followed by a column access (see Figure 1). A read request consists first of a row ....
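The excerpt describes a DRAM access as a row access followed by a column access. Page mode keeps the row open so that a later access to the same row needs only the column access. Below is a minimal sketch of that behavior for a single bank. The row size and the two timing values are hypothetical placeholders, and precharge and refresh are deliberately ignored.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class DramBank:
    """Single DRAM bank; all sizes and timings are hypothetical placeholders (ns)."""
    row_bytes: int = 2048          # bytes per DRAM row (assumed)
    t_row: int = 30                # row access time (assumed)
    t_col: int = 15                # column access time (assumed)
    open_row: Optional[int] = None

    def access(self, addr: int) -> int:
        """Return the latency of one read and update the open-row state."""
        row = addr // self.row_bytes
        if row == self.open_row:
            return self.t_col              # page hit: column access only
        self.open_row = row                # page miss: open the new row
        return self.t_row + self.t_col     # row access followed by column access


def total_latency(addrs: Iterable[int], page_mode: bool = True) -> int:
    """Sum per-access latencies; without page mode the row closes after each access."""
    bank = DramBank()
    total = 0
    for addr in addrs:
        total += bank.access(addr)
        if not page_mode:
            bank.open_row = None
    return total


if __name__ == "__main__":
    trace = [0, 64, 128, 4096]             # three accesses to row 0, one to row 2
    print("page mode:", total_latency(trace, page_mode=True))
    print("no page mode:", total_latency(trace, page_mode=False))
```

The gain from page mode depends entirely on row locality in the access stream. That is why the excerpt evaluates schemes that decide which banks should operate in page mode.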
....is a cache miss. In this case, the memory latencies depend on the memory subsystem being modelled, as explained below. We have performed our experiments with two extreme cache sizes: 8 KB and 256 KB. The small 8 KB capacity corresponds to low-end machines, e.g., the microSPARC, as indicated in [14]. The small-capacity cache can also be seen as a way to model the behavior of systems with larger caches running applications whose working sets are larger than those we are using. The larger 256 KB cache corresponds to higher-end systems. Although larger caches, in the megabyte ....
Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk. Missing the memory wall: The case for processor/memory integration. In Proc. of 23rd Int. Symp. on Computer Architecture, pages 90-101, 1996.
....causes the thread to be handed over to the Execute Processor. Threads are blocking, and additional context switches may be incurred during the execution of a thread, since when a thread is blocked, it will be context switched out of execution. [Footnote: We acknowledge that there are alternative techniques for overcoming memory latency (e.g., RAMBUS [Rambus 96, Cuppu 99], IRAM [Saulsbury 96]).] Three different types of context switches exist in this system: Implicit Context Switch, resulting during instruction decode as described above (memory-access vs. non-memory-access instructions); Switch on ....
A. Saulsbury, F. Pong, and A. Nowatzyk. "Missing the memory wall: The case for processor/memory integration", Proc. of the Intl. Symposium on Computer Architecture (ISCA-23), pp. 90-101, May 1996.
....preloading of contexts, thus increasing run lengths of non-blocking threads. While we advocated multithreading as the solution to memory latency, other researchers have been exploring different solutions, including DataScalar [Berger 97], Multiscalar [Sohi 95], processor-memory integration [Saulsbury 96], and aggressive prefetching techniques [Baer 91, Dahlgren 93, Farkas 95]. 6. ....
Saulsbury, A., Pong, F., and Nowatzyk, A., "Missing the memory wall: The case for processor/memory integration," Proceedings of the 23rd International Symposium on Computer Architecture, May 1996, pp. 90-101.
....the results are post-stored from its registers into memory. The instruction set implements a dataflow computational model, while the execution engine relies on control-flow-like scheduling of instructions.[2] [Footnote 2: We acknowledge that there are alternative techniques for overcoming memory latency (e.g., RAMBUS [Rambus 96, Cuppu 99], IRAM [Saulsbury 96]).] We have completed the definition of the instruction set and developed an instruction-level simulator. We have also developed a backend to an existing Sisal compiler [Bohm 91] and used MIDC as an intermediate language [Shankar 95, 96] to produce code for our ....
A. Saulsbury, F. Pong, and A. Nowatzyk. "Missing the memory wall: The case for processor/memory integration," Proceedings of the 23rd International Symposium on Computer Architecture, May 1996, pp. 90-101.
.... load-multiple-type instructions to efficiently preload a thread context). While we advocated multithreading as the solution to memory latency, other researchers have been exploring different solutions, including DataScalar [Berger 97], Multiscalar [Sohi 95], processor-memory integration [Saulsbury 96], and aggressive prefetching techniques [Baer 91]. Acknowledgement. This research is supported in part by the following grants from NSF: MIPS 9622593, MIP9622836, CDA 9529561. 6. ....
Saulsbury, A., Pong, F., and Nowatzyk, A., "Missing the memory wall: The case for processor/memory integration," Proceedings of the 23rd International Symposium on Computer Architecture, May 1996, pp. 90-101.
....Trends and Future Technology. There are converging trends in the design of processors and memories that point to the future existence of chips that include both processors and memory. Examples include Processors In Memory (PIM) [Ko 94], Computational RAM [El et al. 92], and IRAM [Pa 95, Sh et al. 96, Sa 96]. Similar ideas were proposed as early as 1970 in [St 70]. The driving argument for these approaches is the fact that the integration of CPU and memory on the same chip brings benefits of lower latency and higher bandwidth in accessing memory that outweigh possible reductions in the complexity of the processor [Sa 96]. Since memory access latency is becoming a limiting performance factor [Jo 95, Wi 95, WuMc 95], it is reasonable to expect that future generations of commercial chips will increasingly follow this trend. In fact, there are already some examples of such chips [Sa 96, Sh et al. 96, AD 93]. Two ....
A. Saulsbury, F. Pong, and A. Nowatzyk, "Missing the Memory Wall: The Case for Processor/Memory Integration," Proc. 23rd Int. Computer Architecture Symp., May 1996, pp. 90-101.
....necessitating best-case conditions for each. The best case for memory-side prefetching would integrate the prefetch engine with memory and address translation hardware. There have been significant recent advances in processor-in-memory (PIM) systems, which integrate the processor with memory [12, 15, 20, 22, 24, 37]. For the large class of applications where a single PIM chip does not provide sufficient memory, systems based on multiple PIM chips have been proposed (e.g., IBM's Blue Gene [15] and Execube [22]). Wallach predicts high-performance systems of 2009 will be built solely from multiple PIM chips ....
....While this generates excellent prefetching characteristics, the large amount of redundant computation makes inefficient use of the hardware. Many other PIM systems have been proposed, including CRAM [10], Execube [22], the Terasys Massively Parallel PIM Array [11], Saulsbury et al.'s system [37], Vector IRAM [24], Active Pages [31], DIVA [12], and FlexRAM [20]. These systems can be split into two types: (1) the main processor integrated with memory, and (2) coprocessors integrated into the system DRAM for data-parallel operations. None of these systems includes specific support to ....
A. Saulsbury, F. Pong, and A. Nowatzyk. Missing the Memory Wall: The Case for Processor/Memory Integration. In Proc. of the 23rd Intl. Symp. on Comp. Arch., 1996.
No context found.
Saulsbury, A., Pong, F., Nowatzyk, A.: Missing the Memory Wall: The Case for Processor/Memory Integration. In Proceedings of the 23rd International Symposium on Computer Architecture, pages 90--101, May 1996.
No context found.
A. Saulsbury, F. Pong, and A. Nowatzyk. Missing the memory wall: the case for processor/memory integration. In the Proceedings of the 23rd Intl. Symposium on Computer Architecture, pages 90--101, Philadelphia, PA, May 1996.
....since P.Arrays do not have caches, each bank has several row buffers. Based on an analysis of the applications, a good design includes three 2-Kbyte row buffers per bank [13]. We use random row buffer replacement. These row buffers, although costly, are useful to capture important program localities [35]. A P.Array access to memory should take 10 ns on a row buffer hit and 20 ns on a miss. With so many processing units on chip, contention for memory may occur. Specifically, a DRAM bank may be accessed by the P.Host, the local P.Mem, or a remote P.Mem through the global on-chip bus. ....
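The row-buffer design in the excerpt can be sketched as a small model: a bank holds several open rows, replaces one at random on a miss, and charges the hit or miss latency. This sketch reads the excerpt's "3 2 Kbyte" as three 2-Kbyte buffers. It uses the quoted 10 ns and 20 ns figures and does not model contention among the P.Host and P.Mem requesters. The class name and the fill-empty-buffers-first policy are assumptions for illustration.

```python
import random
from typing import List


class RowBufferedBank:
    """Hypothetical DRAM bank with several row buffers and random replacement.

    Defaults follow the excerpt as read here (3 buffers of 2 KB, 10 ns hit,
    20 ns miss); memory contention between requesters is not modeled.
    """

    def __init__(self, num_buffers: int = 3, row_bytes: int = 2048,
                 hit_ns: int = 10, miss_ns: int = 20, seed: int = 0) -> None:
        self.num_buffers = num_buffers
        self.row_bytes = row_bytes
        self.hit_ns = hit_ns
        self.miss_ns = miss_ns
        self.buffers: List[int] = []           # DRAM rows currently buffered
        self.rng = random.Random(seed)         # seeded for reproducible victims

    def access(self, addr: int) -> int:
        """Return the latency (ns) of one access and update the buffers."""
        row = addr // self.row_bytes
        if row in self.buffers:
            return self.hit_ns                 # row-buffer hit
        if len(self.buffers) < self.num_buffers:
            self.buffers.append(row)           # fill an empty buffer first
        else:
            victim = self.rng.randrange(self.num_buffers)
            self.buffers[victim] = row         # random replacement
        return self.miss_ns                    # row-buffer miss


if __name__ == "__main__":
    bank = RowBufferedBank()
    trace = [0, 2048, 4096, 0, 2048, 4096]     # rows 0, 1, 2, then revisit them
    print(sum(bank.access(a) for a in trace))
```

Random replacement needs no per-buffer recency state, which keeps the hardware simple. The cost is that a useful row can be evicted even when an LRU policy would have kept it.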
A. Saulsbury, F. Pong, and A. Nowatzyk. Missing the Memory Wall: The Case for Processor/Memory Integration. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 90-101, May 1996.
....problems in today's systems, uni- and multiprocessor alike. Recent developments, such as RAMBUS DRAM and its bus protocol [1], point to a solution for the problem of DRAM and bus bandwidth, but more remains to be done to support higher ILP. The so-called intelligent or active memory RAM architectures [2, 3, 4, 5] have taken a different approach to the problem by avoiding off-chip access and exploiting on-chip access parallelism. Latency, however, remains a problem in both of these cases, since they do not change the DRAM core organization. Thus both uni- and multiprocessor systems continue to rely on a memory hierarchy to solve the memory access problem. However, the cost of cache misses is becoming prohibitive. RAMBUS and synchronous DRAMs utilize a form of on-chip caching, and even more aggressive approaches have been proposed [3]. Latency hiding techniques, primarily prefetching [6, 7, 8], have also been utilized to help solve the problem. Projected advances in VLSI and packaging technology are expected to make the problem much worse in the near future. The number of gates on a chip is projected to reach 20M in 10 years, ....
Saulsbury, A., Pong, F., and Nowatzyk, A. Missing the Memory Wall: The Case for Processor/Memory Integration. Proc. 23rd Int. Symp. on Computer Architecture, 1996.
No context found.
A. Saulsbury, F. Pong, and A. Nowatzyk, "Missing the Memory Wall: The Case for Processor/Memory Integration," Proc. 23rd Ann. Int'l Symp. Computer Architecture, pp. 90-101, 1996.
No context found.
A. Saulsbury, F. Pong, A. Nowatzyk, "Missing the Memory Wall: The Case for Processor/Memory Integration," 23rd ISCA, May 1996, pp. 90--101.
No context found.
Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk. Missing the memory wall: The case for processor/memory integration. In 23rd Annual International Symposium on Computer Architecture (23rd ISCA'96), pages 90--101. May 1996.
No context found.
A. Saulsbury, F. Pong, and A. Nowatzyk, "Missing the Memory Wall: The Case for Processor/Memory Integration," Proc. 23rd Ann. Int'l Symp. Computer Architecture, pp. 90-101, May 1996.
No context found.
Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk. Missing the memory wall: the case for processor/memory integration. In Proc. 23rd annual Int. Symp. on Computer architecture, pages 90--101, Philadelphia, PA, 1996.
No context found.
Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk. Missing the Memory Wall: The Case for Processor/Memory Integration. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 90--101, May 1996.
No context found.
A. Saulsbury, F. Pong, and A. Nowatzyk. Missing the memory wall: the case for processor/memory integration. ISCA'96: The 23rd Annual International Symposium on Computer Architecture, Philadelphia, PA, USA, 22-24 May 1996, pp. 90-101.
No context found.
A. Saulsbury, F. Pong, and A. Nowatzyk, "Missing the Memory Wall: The Case for Processor/Memory Integration," Proc. of 23rd Int. Symp. on Computer Architecture, pages 90-101, May 1996.
No context found.
Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk, "Missing the Memory Wall: The Case for Processor/Memory Integration", ISCA 96, pp. 90-101.
No context found.
Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk, Missing the Memory Wall: The Case for Processor/Memory Integration, Intl. Symposium on Computer Architecture, pp. 90-101, 1996.
CiteSeer.IST - Copyright Penn State and NEC