A. Saulsbury, F. Pong, and A. Nowatzyk. Missing the Memory Wall: The Case for Processor/Memory Integration. In Proceedings of the 23rd International Symposium on Computer Architecture, pages 90--101, May 1996.
....future. The next section describes how active memory systems improve the performance of single machines, which lends strong arguments for their inclusion in COTS components. 3.1. Active Memory Systems One of the biggest challenges facing modern computer architects is overcoming the memory wall [37]. Technology trends dictate that the gap between processor and memory performance is widening. Even though good cache behavior mitigates this problem to some extent, memory latency remains a critical performance bottleneck in modern high performance processors. Heavily pipelined clocked ....
....have improved memory bandwidth, but this does nothing to address memory latency or reduce the number of cache misses incurred by the processor. One approach to reducing the gap between processor and memory performance is to move processing into the memory system by using active memories [5,11,12,30,34,35,37]. Schemes vary, but either parts of a program that have poor cache behavior are executed in the memory system, thereby reducing cache misses and memory bandwidth requirements, or address remapping techniques are used to restructure data (like linked lists or non-unit-stride accesses) so that the ....
A. Saulsbury, F. Pong, and A. Nowatzyk. Missing the Memory Wall: The Case for Processor/Memory Integration. In Proceedings of the 23rd International Symposium on Computer Architecture, pages 90--101, May 1996.
.... [Figure 1: Memory Path Architectures, contrasting discrete LSIs with merged DRAM/logic LSIs] ....on cache misses [7][8][11]. This approach tends to increase the cache line size if we attempt to improve the attainable memory bandwidth. In general, large cache lines can benefit some application programs with much spatial locality of references, because they provide the effect of prefetching. Larger cache lines, ....
....Consequently, in merged DRAM logic LSIs, the designer can take advantage of the spatial locality inherent in programs. For example, since instruction streams have much spatial locality in almost all programs, increasing the cache line size is very effective for instruction caches [8]. 2.2 Disadvantages of High Bandwidth In Section 2.1, we mentioned that a great advantage of high bandwidth is the ability to increase the cache line size at a constant miss penalty. Unfortunately, since conventional caches employ a single cache line size, increasing the cache line size is the ....
[Article contains additional citation context not shown here]
Saulsbury, A., Pong, F., and Nowatzyk, A., "Missing the Memory Wall: The Case for Processor /Memory Integration," Proc. of the 23rd Annual International Symposium on Computer Architecture, pp.90--101, May 1996.
....of DRAM [1][3][4]. That work considered the memory as a black box, accepting current DRAM structures without modification. Newer studies have investigated the effects of retaining entire lines of DRAM within the memory array using the sense amplifiers; those projects looked at performance [7]. This study approaches the idea from a power efficiency standpoint. In this paper, we discuss the ongoing simulation of the CIM architecture and the results derived from preliminary simulations. By closely coupling memory energy and time simulations and microprocessor pipeline emulators, it is ....
.... single device [11][12]. Studies have shown how much performance and energy gain can be obtained from creating systems on a single chip while leaving the memory arrays virtually unchanged [3][4]. And there have been some performance estimates of more interesting ways to utilize the proximity [7]. Memory devices are constructed of smaller arrays. The reasons involve the ability to drive and sense signal changes across limited distances due to capacitance in the wiring, as well as providing convenient redundancy. If the whole system is on the single die, could there be an arrangement of ....
[Article contains additional citation context not shown here]
Saulsbury, Ashley, ET AL., Missing the Memory Wall: The Case for Processor /Memory Integration, In Proceedings of the 23rd Annual International Symposium on Computer Architecture (Philadelphia, Pennsylvania, May, 1996), pp. 90-101
....are useful. About 30% of the cycles are wasted on average on memory stalls, while about 37% are wasted on data hazards. Even if we consider an ideal memory system (Figure 4), the situation does not get much better. The architecture modeled is close to one with on-chip memory as suggested by [SPN96]. The figure shows that an average of only half of the P1 cycles are useful. The rest .... [Figure residue: axis ticks and bar labels; legend entry "useful"] ....
A. Saulsbury, F. Pong, and A. Nowatzyk. Missing the memory wall: The case for processor /memory integration. In 23rd International Symposium on Computer Architecture, 1996.
....using DRAM optimized for access time rather than area. They supplement the DRAM with a 4KB fully associative SRAM line buffer with 128B blocks and true least recently used (LRU) replacement policy. It is interesting to note that they claim hit rates as high as 80% on this buffer. In their paper [25], Saulsbury et al. use a cache with large (512 byte) blocks but with 2-way associativity. They also noted that once the cache is integrated with the DRAM array, a cache with wide blocks is very effective. However, the low associativity of the cache in their design limits its usefulness. Also, the ....
....lower local miss rate than a 4MB direct mapped L3 cache for many scientific codes. The increase in hit rates can lead to an improvement in CPI by a factor of 1.5 to 2.0 for some programs. Wide but shallow caches could be put closer to the CPU, provided memory is either integrated with the CPU [25] or if circuit technologies could be developed to provide sufficient bandwidth making it possible to quickly transfer large blocks of data from memory to the CPU. While our work explores only a portion of the design space, we demonstrated that the proposed architecture can dramatically reduce average ....
Ashley Saulsbury, Fong Pong, Andreas Nowatzyk, "Missing the Memory Wall: The Case for Processor/Memory Integration", Proc. 23rd Annual Symposium on Computer Architecture, pp. 90-101, May, 1996, http://playground.sun.com/pub/S3.mp/papers.html.
....cache is about 27%, compared to a conventional direct mapped cache with fixed 32 byte lines. 1 Introduction For merged DRAM logic LSIs with a memory hierarchy including cache memory, we can exploit high on-chip memory bandwidth by means of replacing a whole cache line at a time on cache misses [5][10][11]. This approach tends to increase the cache line size if we attempt to improve the attainable memory bandwidth. In general, large cache lines can benefit some application as the effect of prefetching. Larger cache lines, however, might worsen the system performance if programs do not have ....
....application as the effect of prefetching. Larger cache lines, however, might worsen the system performance if programs do not have enough spatial locality and cache misses frequently take place. This kind of cache misses (i.e. conflict misses) could be reduced by increasing the cache associativity [10][11]. But, this approach usually makes the cache access time longer. To resolve the above mentioned dilemma, we have proposed a concept of variable line size cache (VLS cache) [5]. The VLS cache can alleviate the negative effects of larger cache line size by partitioning the large cache line ....
Saulsbury, A., Pong, F., and Nowatzyk, A., "Missing the Memory Wall: The Case for Processor/Memory Integration, " Proc. of the 23rd Annual International Symposium on Computer Architecture, pp.90--101, May 1996.
....instructions lead to much lower protocol engine latency and occupancy. 2.5.2 Directory Storage The Piranha design supports directory data with virtually no memory space overhead by computing ECC at a coarser granularity and utilizing the unused bits for storing the directory information [31,38]. ECC is computed across 256-bit boundaries (typical is 64-bit) leaving us with 44 bits for directory storage per 64 byte line. Compared to having a dedicated external storage and datapath for directories, this approach leads to lower cost by requiring fewer components and pins, and provides ....
A. Saulsbury, F. Pong, and A. Nowatzyk. Missing the Memory Wall: The Case for Processor/Memory Integration. In 23rd Annual International Symposium on Computer Architecture. May 1996.
....and incur the same or fewer number of slow off chip accesses. Finally, we examine the effect that manufacturing technology may have on improving the PMI, by integrating more of the system (DRAM) onto the processor, which includes eventually combining all memory and logic onto a single substrate [13, 92, 100]. If the processes permit, merging the DRAM and logic on one die may allow the memory hierarchy to be flatter, bringing it closer to the ideal and thus reducing the need for distributing it. We present some simulation results that indicate that, with current processors and workloads, full ....
Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk. Missing the Memory Wall: The Case for Processor/Memory Integration. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 90--101, May 1996.
....the performance of the memory hierarchy the key determinant of overall system performance. The growing interest in IRAM chips which combine processor and physical memory on a single die reflects the growing importance of the memory hierarchy in system design. IRAM chips have been proposed [2, 10, 13] as a cost-effective way to improve memory bandwidth and reduce memory latency, as opposed to the current conventional approach of multiple levels of expensive caches and high-performance inter-chip buses. The complete integration of processors and main memory, if it happens, could take one of two ....
Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk. Missing the Memory Wall: The Case for Processor/Memory Integration. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 90--101, May 1996.
.... While microprocessors follow an explosive growth in performance, DRAM based memory systems fall behind, creating the memory wall [3,18]. Integration of main memory on the same module (MCM) or even on the same die (IRAM) with the processor promises a high performance yet inexpensive memory system [12,13,15]. Through the elimination of the pin interface, such integration is expected to deliver: A substantial increase in memory bandwidth (hundredfold increase over the current workstation memory bandwidth) due to the vastly improved ability to interconnect the processing core to multiple DRAM row ....
....reflecting the substantially improved ability to interconnect read/write ports to multiple banks on a single integrated device. We do, however, model contention on the memory banks (bank conflicts). We assumed 128 memory banks (4096 bits wide). This design is between the 32 banks chosen in [15] and the 512 banks proposed in [14]. We chose this number to match the architectural vector length and the number of outstanding memory operations. To cover a wide spectrum of processor memory integration implementations we explored five different on-chip access latencies: 2, 4, 8, 16 and 32 ....
[Article contains additional citation context not shown here]
Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk, "Missing the Memory Wall: The Case for Processor/ Memory Integration." In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.
....memory coherency make the situation worse to the point where the speedup possible from future parallel supercomputers will be severely limited. This well known memory wall will kill significant future increases in performance, regardless of the amount of cache or processor speedup techniques [19]. The high performance computing community has faced similar problems once before in building high performance systems out of networked combinations of processors. There, the latency to access data in another processor is through a network, and is some huge multiple of the processor's native ....
....the next command is received. Processing In Memory (PIM, also called Intelligent RAM or IRAM, Merged Logic and Memory, or Embedded RAM) promises to change that. PIM is an increasingly viable VLSI technology where significant amounts of logic are placed on a high density memory part [9][17][19]. Such combinations offer literally orders of magnitude more memory bandwidth at greatly reduced local latencies. This paper considers a potentially new model of memory semantics, using PIM technology to implement what we call microservers, that radically expands the types of functions that may be ....
Saulsbury, A. et al., "Missing the Memory Wall: The Case for Processor/Memory Integration," ISCA-96, Philadelphia, PA, May 1996.
.... Mitsubishi M32R/D, the IRAM is another system-on-a-chip embedded DRAM device with vector processing logic, designed for streaming computations [Patterson97]. Other approaches use PIM devices as the only processors in a multiprocessor architecture: a cache coherent distributed shared memory system [Saulsbury96], and a large scale distributed memory system [Kogge96]. The Active Pages project, which is the most closely related to DIVA, associates configurable logic with each memory page to accelerate performance of an external host [Oskin98]. There are also several other architecture approaches, not ....
A. Saulsbury, F. Pong and A. Nowatzyk, "Missing the Memory Wall: The Case for Processor/Memory Integration," Proc. of the International Symposium on Computer Architecture, May, 1996.
....since P.Arrays do not have caches, each bank has several row buffers. Based on an analysis of the applications, a good design includes 3 2 Kbyte row buffers per bank [13]. We use random row buffer replacement. These row buffers, although costly, are useful to capture important program localities [35]. A P.Array access to memory should take 10 and 20 ns in a row buffer hit and miss respectively. With so many processing units on chip, contention for memory may occur. Specifically, a DRAM bank may be accessed by the P.Host, the local P.Mem, or a remote P.Mem through the global on-chip bus. It can ....
A. Saulsbury, F. Pong, and A. Nowatzyk. Missing the Memory Wall: The Case for Processor/Memory Integration. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 90-101, May 1996.
....simple processor cycling at around 200 MHz. However, in a not far off future, a chip will likely fit the memory that now comes with a high end server and faster processors. Architectures that integrate processor and memory in the same chip are called processor in memory (PIM) or intelligent memory [8, 12, 14] architectures. An important challenge for these architectures is how to ensure that the processor exploits the large bandwidth that the on-chip memory can potentially provide. Even a wide issue superscalar can only sustain so many memory accesses per cycle. There is the danger that a PIM chip ....
A. Saulsbury, F. Pong, and A. Nowatzyk. Missing the Memory Wall: The Case for Processor/Memory Integration. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 90-101, May 1996.
....benefits of simple controller design, average-case performance, and support for non-uniform memory access times. The latter benefit is the key to transparent support for active memories. 1. Motivation The gap between microprocessor performance and memory performance (often called the memory wall [1]) has been steadily increasing. A cache miss on modern microprocessors that issue multiple instructions per cycle can result in hundreds of lost instruction issue slots. Even though good cache behavior mitigates this problem to some extent, memory latency remains a critical performance bottleneck ....
....that issue multiple instructions per cycle can result in hundreds of lost instruction issue slots. Even though good cache behavior mitigates this problem to some extent, memory latency remains a critical performance bottleneck for realistic workloads in modern high performance processors [1,2,3]. In an attempt to close the gap between processor and memory performance, modern high-performance memory systems have moved from unpipelined DRAM implementations to high speed, deeply pipelined clocked implementations. However, these improvements come at the cost of significant complexity at the ....
[Article contains additional citation context not shown here]
A. Saulsbury, F. Pong, and A. Nowatzyk. Missing the Memory Wall: The Case for Processor/Memory Integration. In Proceedings of the 23rd International Symposium on Computer Architecture, pages 90-101, May 1996.
....computer systems are becoming increasingly limited by memory performance. While processor performance increases at a rate of 60% per year, the bandwidth of a memory chip increases by only 10% per year, making it costly to provide the memory bandwidth required to match the processor performance [14][17]. The memory bandwidth bottleneck is even more acute for media processors with streaming memory reference patterns that do not cache well. Without an effective cache to reduce the bandwidth demands on main memory, these media processors are more often limited by memory system bandwidth than other ....
SAULSBURY, ASHLEY, PONG, FONG, AND NOWATZYK, ANDREAS, Missing the Memory Wall: The Case for Processor/Memory Integration. In Proceedings of the International Symposium on Computer Architecture (May 1996), pp. 90-101.
....idealized versions of the traditional COMA and NUMA organizations. 1 Introduction With vast increases in the on chip transistor count expected in the near future, a major trend in microprocessor design is the progressive integration of processor and memory to relieve the memory access bottleneck [4, 5, 6]. Traditional node organizations, where the processor is connected to the memory via a slow memory bus are expected to be replaced by more integrated organizations. Eventually, processor and main memory are likely to share the same chip. Such an organization promises to provide low latency and ....
A. Saulsbury, F. Pong, and A. Nowatzyk. Missing the Memory Wall: The Case for Processor/Memory Integration. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 90--101, May 1996.
....transistors that can be integrated on a VLSI chip are fueling the trend toward integration of processor and memory on a chip. It is widely expected that off-the-shelf microprocessor designs will exploit this trend to provide low latency and high bandwidth communication between processor and memory [1, 4, 6, 9, 12, 16, 19]. Since directory based, cache coherent Distributed Shared-Memory (DSM) multiprocessors are typically built around the latest off-the-shelf microprocessors, they will be affected by the trend of progressive processor memory integration. Currently, the nodes in DSM systems are typically orga This ....
A. Saulsbury, F. Pong, and A. Nowatzyk. Missing the Memory Wall: The Case for Processor/Memory Integration. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 90-101, May 1996.
....schemes that select which banks operate in page mode and which do not. Section 5 investigates the impact of the SRAM cache in DRAM devices and evaluates several design alternatives. Finally, in Section 6 we conclude and place this study in the perspective of integrated processor memory systems [3, 14]. 2 Page mode DRAM and Cached DRAM Operation In this section, we review very briefly the operation of page mode DRAMs and cached DRAMs. 2.1 Page mode DRAM [Figure residue: DRAM array, page buffer, row address, column address, row access (30-40 ns), column access (30-40 ns)] Figure 1: DRAM ....
....of 32 KB page buffer (8 KB per bank). The memory controller determines the mapping and the operating mode of each bank. We have performed our experiments with two extreme cache sizes: 8 KB and 256 KB. The small 8KB capacity corresponds to low end machines, e.g. the MicroSparc as indicated in [14]. The small capacity cache can also be seen as a way to model the behavior of systems with larger caches running applications whose working set sizes are larger than those we are using. The larger capacity 256 KB cache corresponds to higher end systems. Although larger caches, in the megabyte ....
[Article contains additional citation context not shown here]
Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk. Missing the memory wall: The case for processor/memory integration. In Proc. of 23rd Int. Symp. on Computer Architecture, pages 90--101, 1996.
....cycle time improves and processors use enhanced instruction level parallelism. There is a maximum logical memory latency beyond which an application sees no performance advantage. In one form or another, this notion of a memory bound to processor performance has been referred to as the memory wall [10, 14]. It has been described as a latency or a bandwidth limitation of memory. Using large cache sizes to avoid the memory wall is limited. For large caches at least, in order to reduce the hit rate, one is naturally inclined towards using large line sizes. Large line sizes increase the bandwidth ....
A. Saulsbury, F. Pong, A. Nowatzyk. Missing the memory wall: The case for processor/memory integration. In Proceedings of ISCA'96, pages 90--101.
....on a chip or a large out-of-order execution processor with more functional units and deeper pipelines. Another approach is to integrate the processor with main memory. Currently anticipated possibilities are Intelligent Memory [Patterson96] and processors with DRAM on a chip [Saulsbury96, Shimizu96]. We will discuss some of these systems in the following section. 3 Present and Future Memory Systems 3.1 Modern DRAM Technologies: Rambus RDRAM Popular DRAM technologies for general purpose computers are Enhanced Data Out (EDORAM), Synchronous DRAM (SDRAM) [Przybyl96] and Rambus 1 (RDRAM) ....
A. Saulsbury, F. Pong, A. Nowatzyk, Missing the Memory Wall: The Case for Processor/Memory Integration, International Symposium on Computer Architecture, Philadelphia, May 1996.
....of heat. DRAM packages have few pins, low cost, and are suited to the low power characteristics of DRAM circuits. However, these differences are diminishing. DRAM fabrication processes have become better suited for processor implementations, with two or three levels of metal, and better logic speed [Sau 96]. The drive toward integrating logic into the DRAM is driven partly by necessity and partly by opportunity. The immense increase in capacity (4x every three years) has required that the inter.... [Figure 12 3: Fraction of Transistors on Microprocessors Devoted to Caches] Since caches migrated on chip in ....
....generation, the incremental cost of the processor is modest, perhaps 20%. From the processor designer's viewpoint, the advantage of DRAM over SRAM is that it has more than an order of magnitude better density. However, the access time is greater, access is more restrictive, and refresh is required [Sau 96]. Somewhat more subtle threshold phenomena further increase the attractiveness of PAM. The capacity of DRAM has been growing faster than the demand for storage in most applications. The rapid increase in capacity has been beneficial at the high end, because it became possible to run very large ....
[Article contains additional citation context not shown here]
Ashley Saulsbury and Fong Pong and Andreas Nowatzyk, "Missing the Memory Wall: The Case for Processor/Memory Integration", Proceedings of the 23rd Annual International Symposium on Computer Architecture, pp. 90-101, May 1996.
....coherence controller in the access path of these cases. Second, the coherence controller can use part of the memory as its directory store. Furthermore, there are techniques, such as computing ECC across a larger number of bits and utilizing the unused bits for storing the directory information [14, 19], that provide the option of supporting the directory data with virtually no memory space overhead. Given the trend towards larger main memories, dedicated directory storage can become a significant cost factor. The above techniques lead to lower costs by requiring fewer components and pins, and ....
....that outperform their contemporary high performance counterparts for memory intensive applications that benefit from the lower memory latencies [19]. [Figure 9: Configuration with integrated L2 and memory controller.] In this paper, we consider high performance processor designs where latency reduction is the primary motivation for the integration. For uniprocessor systems, in addition to providing faster access to the local memory, integrating the memory controllers can also lead to higher bandwidth to ....
[Article contains additional citation context not shown here]
A. Saulsbury, F. Pong, and A. Nowatzyk. Missing the Memory Wall: The Case for Processor/Memory Integration. In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.
No context found.
Ashley Saulsbury, Fong Pong and Andreas Nowatzyk. "Missing the Memory Wall: the Case for Processor/Memory Integration". Computer Architecture News, Vol. 24, No. 2, pp.90-101, May, 1996.
No context found.
A. Saulsbury, F. Pong, and A. Nowatzyk, "Missing the Memory Wall: The Case for Processor/Memory Integration," Int'l Symp. Computer Architecture, IEEE CS Press, 1996, pp. 90-101.