H. S. Stone. High Performance Computer Architecture. Addison-Wesley 1993 (3rd ed.). ISBN: 0-201-52688-3.
....Figure 4) The typical cache size in a wireless terminal is not large. When the cache is full, some cached objects must be removed to accommodate new objects. We consider the least recently used (LRU) replacement policy. This policy is often utilized to manage cache memory in computer architecture [20], virtual memory in operating systems [19] and location tracking in mobile phone networks [10]. LRU uses the recent past as an approximation of the near future, and replaces the cached object that has not been used for the longest period of time. LRU associates with each cached object the time of ....
....application server are potentially accessed by a wireless terminal. Although the objects to be accessed vary from time to time, the number N is not significantly larger than the cache size of the wireless terminal. That is, the data access pattern of a wireless terminal exhibits temporal locality [20], which is the tendency for a wireless terminal to access in the near future those data objects referenced in the recent past. Temporal locality may not be observed in wireline Internet access because the desktop users typically navigate through several web sites at the same time. On the other ....
Stone, H. High-Performance Computer Architecture. Addison-Wesley, Reading, Massachusetts, 1990.
....boundaries is not determined by boundary type. Finally, section 5 summarizes the discussion and outlines future work. 2 The Cache Model of Attentional State A cache is an easily accessible temporary location used for storing information that is currently being used by a computational procedure [Stone, 1987] . The fundamental idea of the cache model is that the functioning of the cache when processing discourse is analogous to that of a cache when executing a program on a computer. Just as discourses may be structured into goals and subgoals which contribute to achieving the purpose of the discourse, ....
Harold S. Stone. High Performance Computer Architecture. Addison Wesley, 1987.
....future work. The cache model is an extension of the AWM model in [Walker, 1993; Jordan and Walker, 1996] 3 2 The Cache Model of Attentional State A cache is an easily accessible temporary location used for storing information that is currently being used by a computational procedure [Stone, 1987] . The fundamental idea of the cache model is that the functioning of the cache when processing discourse is analogous to that of a cache when executing a program on a computer. Just as discourses may be structured into goals and subgoals which contribute to achieving the purpose of the discourse, ....
Harold S. Stone. High Performance Computer Architecture. Addison Wesley, 1987.
....tectural simulator. There are both hardware and software methods for capturing the reference traces. Once obtained, they can be fed into an architectural simulator of the cache. The greatest drawback to this approach is that to be effective, traces have to be millions of references long [13]. Another problem, especially relevant to scientific numerical codes, is that the identity of the program components and structure that generated the address trace is lost, and therefore it is difficult to make conclusions about how to modify the source code to improve performance. Static ....
....for a Loop 2.1 Architectural model and parameters A MIMD, distributed memory architecture is assumed, with message passing communications between the processors. Each processor contains an execution unit and a memory hierarchy that includes at least one level of cache memory. As defined in [13], a cache is the first level of memory closest to the processor. It generally has access times that are commensurate with the instruction cycle time of the processor, and is therefore several times faster than main memory access time. Cache is an associatively addressed memory which at any ....
H. S. Stone. High-Performance Computer Architecture. Addison-Wesley, 1990.
....and reliability availability from computer systems. Higher levels of integration and newer techniques in VLSI design have made hardware with high performance and reliability, relatively inexpensive. Software, on the other hand, is becoming a major component in the overall cost of these systems [27]. Often, though, the software poses performance and reliability bottlenecks which should be discovered and eliminated. Improvements in software assessment methods for the design phase of the software life cycle are required to minimize costly redesigns and changes due to unanticipated performance ....
Stone, H. S. High-Performance Computer Architecture. Addison-Wesley, Reading, Massachusetts, 1987.
....concurrently with the cache upon a write. This guarantees that the main memory reflects the last write performed in the system. However, a cached value can be out of date. With write back, a cache line is only written back to main memory when it is replaced and only if it has been modified [10, 8]. Write back and write through may be accompanied with two variations or optimizations: write allocate and write once. Write allocate means that a line is read into cache if an attempted write misses the cache. Otherwise, if write allocate is not used, the cache is bypassed and the copy is ....
Stone HS. High-Performance Computer Architecture Addison-Wesley Publishing Company, 1990.
....In such systems blocking is the only reasonable alternative. Recently there is much interest in wait-free primitives, which are more suitable to truly parallel systems [68]. Examples include various read-modify-write instructions such as test-and-set [69], fetch-and-add [15] and compare-and-swap [70]. ParC provides a repertoire of low-level primitives that cover different synchronization behaviors, both blocking (semaphores and barrier) and wait-free (fetch-and-add). High-level primitives may be added if programming experience indicates that certain constructs are especially useful. ....
H. S. Stone, High-Performance Computer Architecture. Addison-Wesley, 2nd ed. (1990).
....benchmarks have similar distributions regardless of the number of memory chips and physical page placement policy. Related studies trying to statistically characterize cache misses have not been successful because the distributions that best characterize the behavior do not have finite variance [8]. For our purposes it is sufficient to approximate the gap distribution. Thus, Figure 2 also plots the exponential distribution with the same mean gap size. [Table 2: Chi Square Test Results (Benchmark, Result): Compress95, Pass; Go, Pass; Netscape, Pass; Acroread, Fail; PowerPoint, Fail; Winword, Fail.] ....
Harold S. Stone. High-Performance Computer Architecture, chapter Memory System Design, pages 76--84. Addison Wesley, 1993.
....(a) the size and distribution of the input matrices A and M are not known a priori and (b) residual effects such as the contents of the cache and synchronization delays from previous kernel calls are important. The previous cache contents can and do greatly change the cost of memory accesses [Sto90] because of the change in cache hit ratios. For example, the vector copy low level kernel has the ratio t limit =t small = 8:1. The AET simulator assumes the use of a least recently used cache replacement policy with a correction constant for other policies such as the random replacement policy ....
Stone H. S. (1990) High Performance Computer Architecture. Addison-Wesley, Reading, MA, second edition.
....While modern processors can issue multiple instructions per cycle, they lack the features required to address fundamental issues in multiprocessing systems: latency, bandwidth and synchronization overheads. A well designed parallel system must balance the trade off between a fine task granularity [9] and the impact of communication latencies on performance. Coarse grain parallel systems can tolerate long ....
....systems do not fully exploit the parallelism existing in irregular parallelism. Fine-grain parallelism, on the other hand, enables further parallelization of many applications, but has proved to be difficult to support due to the higher relative cost of communication and synchronization latencies [9]. EARTH (Efficient Architecture for Running THreads) [5, 10] is a multi-threaded architecture and program execution model that supports fine-grain, non-preemptive, lightweight threads, or fibers. EARTH is designed to allow the implementation of a multi-threaded execution model with off-the-shelf ....
Harold S. Stone. High-Performance Computer Architecture. Addison-Wesley Publishing Company, 3rd edition, 1993.
....performance by reducing the amount of concurrent processing. Scalability is improved by reducing the occurrence of communication bottlenecks caused by coordination, through design that evenly distributes the required communication. Thorough treatments of the topic of scalability are given in [56,125,153,154]. 1.2.1 Improving Availability and Scalability The following three techniques in distributed systems research can be applied to the design of distributed coordination protocols to improve availability and scalability. Fault tolerant coordination guarantees correct coordination even if some ....
Stone, H. S., High Performance Computer Architecture, Third Edition, Addison-Wesley, New York, 1993, 512 pages.
.... is measured by the previously defined processor power ratio (PPR). • The heterogeneity of the application is measured by its intrinsic serial fraction, F_s, which is obtained through the same procedure used in calculating T_norm [13]. • The intrinsic communication processing ratio [15], CPR, which is obtained by dividing the overall communication demand of the application by the overall computation demand. 4.3 Evaluation Techniques We use an analytical method to obtain the overall execution time for a parallel application submitted to a certain scheduling policy. This analytical ....
Stone, Harold S., High-Performance Computer Architecture, Addison-Wesley Publishing Company, 1987.
....proposed here. Chapter 6 presents the results of this performance analysis. It is important to note that reduction networks have been proposed and constructed in practice, to support global computations such as barrier synchronization, summation, determining maxima and parallel prefix computation [Ston90, Hosh89, CrKn28, Blel89]. One such network is the control network in the CM-5 [Ponn93]. It is used to perform nonlocal data distribution operations such as broadcasting, combining (reduction and parallel prefix), bitwise operations and barrier synchronizations, very rapidly. For bitwise logical OR operations, it can ....
Stone, H., "High performance computer architecture", Addison-Wesley, Reading, MA, 1990.
....distribution of its t ij k cycle activities is another issue. The whole path, as shown in Figure 6, is similar to a pipeline system with N stages in which each stage requires t ij k cycles. The difference, however, with conventional pipelines is in the scheduling method. In conventional pipelining [Ston90], we define the pipeline clock period to be equal to the slowest stage delay and then schedule the activities accordingly. In our problem, we don't want to devise too many registers in the interface between cores to pile up all data packets. Instead, we have to implement an innovative mechanism by ....
....to packetize (serial to parallel or parallel to serial) the test data. For example, 16-bit test data would be disassembled into four packets (of 4 bits each) in 4 cycles to transfer through Core 1. Three bypass scheduling choices are shown. We used a space-time table similar to the reservation table [Ston90] in pipelining. Each row corresponds to a core and each column corresponds to a time step. An entry (C1, C2 or C3) in the table shows that the corresponding core is bypassing a packet of data in that cycle. For example, in all three schedules shown in this figure Core 1 bypasses a packet of 4 bit ....
H. Stone, High Performance Computer Architecture, Addison Wesley, 1990.
....cycles. However, the distribution of its t ij k cycle activities is another issue. The whole path is similar to a pipeline system with N stages in which each stage requires t ij k cycles. The difference, however, with conventional pipelines is in the scheduling method. In conventional pipelining [Ston90], we define the pipeline clock period to be equal to the slowest stage delay and then schedule the activities accordingly. In our problem, we don't want to devise too many registers in the interface between cores to pile up all data packets. Instead, we have to implement an innovative mechanism by ....
....to packetize (serial to parallel or parallel to serial) the test data. For example, 16-bit test data would be disassembled into four packets (of 4 bits each) in 4 cycles to transfer through Core 1. Three bypass scheduling choices are shown. We used a space-time table similar to the reservation table [Ston90] in pipelining. Each row corresponds to a core and each column corresponds to a time step. An entry (C1, C2 or C3) in the table shows that the corresponding core is bypassing a packet of data in that cycle. For example, in all three schedules shown in this figure Core 1 bypasses a packet of 4 bit ....
H. Stone, High Performance Computer Architecture, Addison Wesley, 1990.
....pipelining. This section presents brief results for six design examples presented in the literature. This is the main parallelizable formula in our algorithm. There are many parallel architectures (such as mesh, hypercube and connection machine) by which this formula can be computed in parallel [14]. The following tables give a summary of the design results produced by our method. In Table 1, we compare our method with some methods found in the literature. We only consider the results for the fifth order elliptical filter in a non pipelined system. It contains 26 additions and 8 ....
Harold S. Stone, High Performance Computer Architecture,
....memory [31, 12] distributed memory units are multiply accessed through software. This mechanism may be seen as an extension to the classical virtual memory mechanism used in modern operating systems. 1.3 Synchronism The various technological choices led to a great diversity in parallel machines [1, 2, 3, 18, 25, 26, 51]. Flynn introduced a classification [20] still authoritative if slightly dated. The classification is based on only two criteria: the type of instruction flow and the type of data flow treated by elementary processors. The flows are either simple or multiple. It is difficult to classify some ....
H. S. Stone. High-Performance Computer Architecture. Addison-Wesley, 1987.
.... cold start bias were compared: COLD: each sample starts with a cold cache; HALF: initialize the cache during the first half of each sample and only collect data from the second half; PRIME: only simulate the cache on accesses to sets that are initialized by earlier references in the sample [37][70]; STITCH: reuse the cache state from the end of the previous sample [1]; and INITMR: estimate the fraction of cold start misses that would have missed even if the cache state were known [80]. The results showed that, for the given traces, INITMR was the most effective at reducing cold start ....
H. S. Stone, High-Performance Computer Architecture, second ed., Reading, MA, Addison-Wesley, 1990.
No context found.
H. S. Stone. High Performance Computer Architecture. Addison-Wesley 1993 (3rd ed.). ISBN: 0-201-52688-3.
No context found.
H. S. Stone, High-Performance Computer Architecture, Second Edition, Addison Wesley, Reading, MA (1990).
No context found.
H.S. Stone. High-performance Computer Architecture. Addison-Wesley, Reading, MA, 1990.
No context found.
H. S. Stone. High-Performance Computer Architecture. Addison-Wesley, 1993.
No context found.
Harold S. Stone. High Performance Computer Architecture. Addison-Wesley Publishing Company, 3rd edition, 1993.
No context found.
Stone, H. High-performance Computer Architecture. Reading, Massachusetts, Addison-Wesley, 1993.
No context found.
H.S. Stone, High-Performance Computer Architecture, Addison-Wesley Publishing Company, 1987.