Burger, D., Kaxiras, S., and Goodman, J. R. (1997). DataScalar Architectures. In Proceedings of the Twenty-Fourth International Symposium on Computer Architecture.
....speed of the processor. Even if memory density is not affected, the size of an emerging contemporary memory chip (1 gigabit) may not be large enough to serve as the only memory in the system. Either traditional off-chip memory systems or more radical multi-chip architectures will be needed [Burger 97b]. It seems that applications will still need to go off chip to access large data sets. An alternative to caching or placing memory close to the processor is to increase the performance of main memory devices. Constructing main memory from faster static random access memory (SRAM) is possible ....
D. C. Burger, S. Kaxiras, and J. R. Goodman. DataScalar Architectures. In Architecture , pages 338--349, June 1997
....branch prediction techniques can be used to generate greedy preloading of contexts, thus increasing run lengths of non-blocking threads. While we advocated multithreading as the solution to memory latency, other researchers have been exploring different solutions, including Data scalar [Berger 97], Multiscalar [Sohi 95], processor memory integration [Saulsbury 96], and aggressive prefetching techniques ([Baer 91], [Dahlgren 93], [Farkas 95]). 6. ....
Berger, D., Kaxiras, S., and Goodman, J., "Datascalar architecture," Proceedings of the 24th International Symposium on Computer Architecture, June 1997.
.... data placement (e.g. blocking the variables of a thread would permit load-multiple type instructions to efficiently preload a thread context). While we advocated multithreading as the solution to memory latency, other researchers have been exploring different solutions, including Data scalar [Berger 97], Multiscalar [Sohi 95], processor memory integration [Saulsbury 96], and aggressive prefetching techniques [Baer 91]. Acknowledgement. This research is supported in part by the following grants from NSF, MIPS 9622593, MIP9622836, CDA 9529561. 6. ....
Berger, D., Kaxiras, S., and Goodman, J., "Datascalar architecture," Proceedings of the 24th June 1997.
....consists of datapath, registers, SRAM cache, and DRAM main memory. High on-chip memory bandwidth would be exploited between the cache and the main memory on cache replacement. Examples are Kyushu University's PPRAM R [5], Mitsubishi's M32R/D [8], Sun's work [7], U. W. Madison's DataScalar [2], and so on. 2) DRM architecture (Figure 1(b)): DRM (Datapath Register Main memory) borrows the concept from the vector architectures with vector registers such as Cray-1. As shown in Figure 1(b), the on-chip memory path consists of datapath, registers, and DRAM main memory. High on chip ....
Burger, D., Kaxiras, S., and Goodman, J. R., "DataScalar Architectures," Proceedings of 24th International Symposium on Computer Architecture, Jun. 1997.
....significantly since multiple prefetches may be issued for each line accessed. Finally, these schemes do not attempt to get far enough ahead of the processor to hide all of the memory latency. There has also been a lot of work on PIMs. Burger et al. developed the DataScalar architecture [3], a multiprocessor PIM system for running uniprocessor applications. The PIMs all run the full application on the entire input data set, but act as intelligent prefetch engines for one another by sending local data to the other processors as it is needed. While this generates excellent prefetching ....
D. Burger, S. Kaxiras, and J. R. Goodman. DataScalar Architectures. In Proc. of the 24th Intl. Symp. on Comp. Arch., 1997.
....taking advantage of the multiple processors in the system. With IRAM type chips of large enough DRAM capacity (the Berkeley group suggests that with one billion transistors the chips will have 96MB of DRAM) the cost of the processor logic will be relatively small compared to the cost of the DRAM [2]. The only proposed system to exploit a network of IRAMs for running sequential programs that I am aware of is the DataScalar architecture [2] which has superscalar processors on each of the IRAM chips running the same parts of the same sequential program. All memory accesses are therefore local ....
.... with one billion transistors the chips will have 96MB of DRAM) the cost of the processor logic will be relatively small compared to the cost of the DRAM [2]. The only proposed system to exploit a network of IRAMs for running sequential programs that I am aware of is the DataScalar architecture [2], which has superscalar processors on each of the IRAM chips running the same parts of the same sequential program. All memory accesses are therefore local to some processor and the local processor can forward values to the others, potentially hiding much of the communication cost that would ....
Doug Burger, Stefanos Kaxiras, and James R. Goodman. DataScalar Architectures. In Proceedings of the 24th Annual International Symposium on Computer Architecture, 1997.
....sequences. One core of a trace processor executes the current trace while the other cores execute future traces speculatively. The Simultaneous Speculation Scheduling technique can be applied to trace processors in a similar style like for multiscalar processors. A datascalar processor [3] runs the same sequential program redundantly across multiple processors using distributed data sets. Loads and stores are performed only locally on the processor memory that owns the data, but a local load broadcasts the loaded value to all other processors. Speculation is an optimisation option, ....
D. Burger, S. Kaxiras, and J. R. Goodman. DataScalar Architectures. In Proceedings of the 24th International Symposium on Computer Architecture, pages 338--349, Boulder, CO, June 1997.
.... which is the most closely related to DIVA, associates configurable logic with each memory page to accelerate performance of an external host [Oskin98]. There are also several other architecture approaches, not based on PIM technology, designed to improve processor-memory bandwidth [Carter99][Burger97][Rixner98]. Impulse augments the memory system to perform application-specified scatter-gather operations on irregular data in the memory controller, so that contiguous data is brought into the cache [Carter99]. Imagine is a system-on-a-chip streaming architecture designed for media applications, ....
.... is a system-on-a-chip streaming architecture designed for media applications, which uses a stream programming model [Rixner98]. The DataScalar architecture is a multiprocessor system where each processor asynchronously executes the same code and broadcasts any local data to the other processors [Burger97]. DIVA is distinguished from these approaches as it supports a wide variety of parallel programming models; DIVA PIMs, with the appropriate interconnect, can be used in a scalable system with an unlimited number of chips, not just single-chip solutions. The DIVA architecture and the material ....
Doug Burger, Stefanos Kaxiras, and James R. Goodman, "DataScalar Architectures", In Proc. of the 24th International Symposium on Computer Architecture (ISCA), June, 1997.
....transistors that can be integrated on a VLSI chip are fueling the trend toward integration of processor and memory on a chip. It is widely expected that off-the-shelf microprocessor designs will exploit this trend to provide low-latency and high-bandwidth communication between processor and memory [1, 4, 6, 9, 12, 16, 19]. Since directory-based, cache-coherent Distributed Shared Memory (DSM) multiprocessors are typically built around the latest off-the-shelf microprocessors, they will be affected by the trend of progressive processor-memory integration. Currently, the nodes in DSM systems are typically orga This ....
D. Burger, S. Kaxiras, and J. Goodman. DataScalar Architectures. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 338-349, June 1997.
....execution can provide significant benefits. In the dense case, cascaded execution provides speedups of around 4 for both systems. Speedups are even more impressive for the more memory-intensive sparse case: 16 for the Pentium Pro and close to 14 for the R10000. 4. Related work. Numerous hardware [4, 6, 18] and software [7, 8, 14] techniques have been proposed to tolerate memory latency in sequential programs. The approaches most relevant to our work are prefetching and multithreading. In software-controlled prefetching [7, 14] the compiler analyzes the program and inserts prefetch instructions ....
D. Burger, S. Kaxiras, and J. R. Goodman. Datascalar architectures. In Proceedings of the International Symposium on Computer Architecture, pages 338--349, Denver, CO, June 1997.
....implicit multithreaded architectures that spawn and execute multiple threads implicitly, not visible to the compiler, can also take advantage of a modified version of S 3 . Examples of such implicit multithreaded architectures are the multiscalar [5, 25, 24], the trace [23], and the datascalar [3] processor approaches. Simultaneous multithreaded (SMT) architectures [26] combine the multithreading technique with a wide-issue processor such that the full issue bandwidth is utilized by issuing instructions from different threads simultaneously. Separate architectural register sets are ....
D. Burger, S. Kaxiras, and J. R. Goodman. DataScalar Architectures. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 338--349, Boulder, CO, June 1997.
....case, the off-chip memory could be managed as secondary storage with pages swapped between on-chip and off-chip memory. Alternatively, multiple IRAMs could be interconnected with a high-speed network to form a parallel computer. Ways to achieve this have already been proposed in the literature [9] [10] [11]. However, historical trends indicate that the end-user demand for memory scales at a much lower rate than the available capacity per chip. So, over time a single IRAM will be sufficient for increasingly larger systems, from portable and low-end PCs to workstations and servers. Finally, a ....
D.C.Burger, S.Kaxiras, J.R.Goodman, "DataScalar Architectures", 24th Annual International Symposium on Computer Architecture, Denver, CO, 2-4 June 1997.
....memory banks that include 16Mb DRAM, two 4096b wide data buffers and a 4096b wide instruction buffer. Wide instruction data buffers will make full use of high on-chip memory bandwidth [13]. Burger et al. propose DataScalar architecture that connects merged DRAM logic LSIs via interconnection network [2]. In addition to the above, Computational RAM [3] and PIP RAM [1] are proposed as SIMD (single instruction, multiple data) approach. As MIMD (multiple instruction, multiple data) approach such as PPRAM R , Execube is proposed [5]. With all not merged DRAM logic, M Machine [4] and Hydra [10] are ....
Burger, D., Kaxiras, S., and Goodman, J. R., "DataScalar Architecture," Proceedings of the 24th Annual International Symposium on Computer Architecture, pp.338--349, June, 1997.
....to the multiscalar approach [8], [18] and to the trace processor [17] provided that the thread handling instructions are included in the respective instruction sets. Speculative execution of alternative program paths can also be applied as an optimization technique for the datascalar processor [3]. III. Scheduling with Speculative Execution of Alternative Program Paths. A. Scheduling Techniques and Speculative Execution. Instruction scheduling techniques [14] are of great importance to expose ILP contained in a program to a wide-issue processor. The instructions of a given program are ....
D. Burger, S. Kaxiras, and J. R. Goodman. DataScalar Architectures. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 338--349, Boulder, CO, June 1997.
....in performance. One of the issues often ignored by multithreaded systems is the performance degradation of single-threaded applications, due to increased hardware data paths. Recently, numerous alternate approaches to tolerating memory latencies have been proposed, including DataScalar [Berger97], Multiscalar [Sohi95], and preload prefetch techniques ([Baer91], [Dahlgren93], [Farkas95]). There has been a proposal for moving the processor onto DRAM chips, to reduce the latency [Saulsbury96]. It is our belief that multithreaded model of execution should be combined with some of these approaches ....
Berger, D., Kaxiras, S., and Goodman, J., "Datascalar architecture," Proceedings of the 24th International Symposium on Computer Architecture, June 1997.
....are considered [3, 13]. Currently, microprocessor performance improves at a rate of 60% each year while DRAM speed only improves at less than 10% per year. Recent research has pointed out that to reduce this gap the best approach is to merge on the same chip memory and logic. The IRAM [10] and PIM [4] proposals both aim at this direction. Parallelism inside a program: Parallelism inside a program is understood to be the number of independent tasks (instructions) that can be executed in parallel while preserving the original program semantics. If many independent operations are available, a ....
D. Burger, S. Kaxiras, and J. R. Goodman. DataScalar Architectures. In 24th Annual International Symposium on Computer Architecture, Denver, Colorado, June 2--4, 1997.
....loaded value to all other processors. Speculation is an optimization option, but is limited to data value speculation. However, every DataScalar machine is a de facto multiprocessor. When codes contain coarse-grain parallelism, the DataScalar machine can also run like a traditional multiprocessor [3]. This ability can also be used to run threads of control speculatively. The idea of a Trace processor [19] centers around the Trace cache [16]. A trace cache is a special instruction cache that captures dynamic instruction sequences in contrast to the instruction cache that contains static ....
D. Burger, S. Kaxiras, and J. R. Goodman. DataScalar Architectures. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 338--349, Boulder, CO, June 1997.
....is entered, the hardware can check the input operands with the values used in a previous execution of the tree. If they all match, the previous result can be reused instead of having to recompute all of the instructions in the tree. The same compiler analysis can benefit DataScalar architectures [BKG97]. These architectures run uniprocessor binaries across multiple processors, each of which is tightly coupled with a fraction of the program's physical memory. Each processor runs the same program, performing redundant computation, and broadcasts needed local operands to all other processors ....
Doug Burger, Stefanos Kaxiras, and James R. Goodman. Datascalar architectures. In Proceedings of the 24th International Symposium on Computer Architecture, May 1997.
....memory-centric architectures, which distribute the PMI to the physical memory. These architectures execute unmodified serial binaries, and they reduce inter-processor traffic significantly. We propose two such architectures: DataScalar and Dynamic Data Threaded (DDT). The DataScalar architecture [15] uses redundant computation to reduce memory latencies and traffic. DataScalar architectures completely eliminate all request and writeback traffic, at the expense of some extra read traffic. DDT architectures perform a partial dynamic parallelization (in hardware) of the serial program, thus ....
....but gives little useful data other than number of instructions traced and a few other statistics. We have characterized the attempt to evaluate future microprocessors with software simulation as simulating the processors of tomorrow on the machines of today with the benchmarks of yesterday [15]. Even using yesterday's benchmarks (such as SPEC95) with small data sets, a four-order-of-magnitude slowdown is prohibitively large. For example, simulating the longest running SPEC95 benchmark with our timing simulator would require approximately 100 days. There are a number of possibilities for ....
Doug Burger, Stefanos Kaxiras, and James R. Goodman. DataScalar Architectures. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 338--349, June 1997.
....from a design perspective, the perennial problem of how to program this architecture looms. Traditional parallelization techniques will work for many codes, but there are many others that parallelize either poorly or not at all. A good candidate for such codes is the DataScalar architecture [3]. Each participating processor runs the same program, performing redundant computation. In a CMP-based DataScalar system (or multi-CMP-based), when a processor loads an operand from its local bank, it broadcasts the operand to the rest of the participating processors. All communication is one-way, ....
Doug Burger, Stefanos Kaxiras, and James R. Goodman. DataScalar Architectures. In Proceedings of the 24th Annual International Symposium on Computer Architecture, May 1997.
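The execution model described in the excerpts above (every processor runs the same program redundantly, and the processor that owns an operand broadcasts it to the others) can be illustrated with a small sketch. This is not the authors' design or simulator: the `Node` class, the `run_datascalar` function, the address-to-value memory layout, and the "sum a list of loads" stand-in program are all assumptions made for illustration, and the nodes are stepped in lockstep for simplicity, whereas the excerpts describe asynchronous execution.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical processor-memory node owning a slice of the address space."""
    node_id: int
    local_memory: dict                          # address -> value, owned by this node
    inbox: dict = field(default_factory=dict)   # operands received via broadcast
    broadcasts_sent: int = 0


def run_datascalar(nodes, program):
    """Run the same load sequence on every node and return each node's result.

    `program` is a list of addresses to load and sum, a stand-in for a serial
    program. No node ever sends a request: the owner of an address pushes the
    value to everyone else (one-way communication).
    """
    totals = [0] * len(nodes)
    for addr in program:
        # Exactly one node owns each address in this sketch.
        owner = next(n for n in nodes if addr in n.local_memory)
        value = owner.local_memory[addr]

        # The owner broadcasts the loaded operand to all other nodes.
        owner.broadcasts_sent += 1
        for n in nodes:
            if n is not owner:
                n.inbox[addr] = value

        # Every node redundantly performs the same computation.
        for i, n in enumerate(nodes):
            operand = n.local_memory[addr] if n is owner else n.inbox[addr]
            totals[i] += operand
    return totals


if __name__ == "__main__":
    nodes = [Node(0, {0: 10, 1: 20}), Node(1, {2: 30, 3: 40})]
    print(run_datascalar(nodes, [0, 2, 1, 3]))  # every node computes the same sum
```

In this sketch each load costs one broadcast and no request message, which mirrors the trade-off named in the excerpts: request traffic is eliminated in exchange for extra read (broadcast) traffic.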
....we have no access to a SV1 compiler we were unable to directly compare the two approaches. As of yet, no technical papers exist to describe SV1 performance. DIVA is also closely related to the DataScalar architecture. The DataScalar architecture (developed by Burger, Goodman and the author) [2] is an architecture that uses multiple nodes to execute serial programs that do not fit in the memory of one of the nodes. Like DIVA, DataScalar is based on the ESP execution model of the Massive Memory Machine [6]. In this model, all nodes execute all the instructions of an application. Each ....
....the number of broadcasts) and out-of-order execution that allows nodes to run asynchronously. Result communication in DataScalar allows code to be executed only by a subset of the nodes [2]. (Footnote 1: Throughout this paper we will use the term IRAM to refer to highly integrated processor-memory nodes.) Using result communication, a DataScalar system (based on superscalar nodes) can emulate a DIVA system at the cost of performance. We evaluate DIVA using simulation and we show that for our target workload, the NAS benchmarks [1], DIVA actually generates less external traffic than other ....
Doug Burger, Stefanos Kaxiras, and James R. Goodman, "DataScalar Architectures," In Proceedings of the 24th International Symposium on Computer Architecture, June 1997.
No context found.
D. Burger, S. Kaxiras, and J. Goodman. Datascalar architectures. In Proceedings of the 24th Annual International Symposium on Computer Architecture, 1997.
No context found.
D.C. Burger, S. Kaxiras, and J.R. Goodman, "Data-Scalar Architectures," Proc. 24th Ann. Int'l Symp. Computer Architecture, ACM Press, May 1997, pp. 338--349.