Daniel F. Zucker, Architecture and Arithmetic for Multimedia Enhanced Processors, Ph.D. thesis, Stanford University, June 1997.
....enhanced with a vector unit and show that this architecture can achieve good performance benefits. Zucker et al. study MPEG video decode applications and show the benefits from I/O prefetching, software restructuring to use SIMD without hardware support, and profile-driven software prefetching [132, 133]. However, the studies assume a simplistic processor model with blocking loads and do not study the effect of media ISA extensions. Bilas et al. develop two parallel versions of the MPEG decoder and present results for multiprocessor speedup, memory requirements, load balance, synchronization, and ....
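"Software restructuring to use SIMD without hardware support" is commonly done with SIMD-within-a-register (SWAR) arithmetic. The sketch below is one standard SWAR trick, not necessarily the restructuring used in the cited thesis: it adds four unsigned 8-bit lanes packed into a 32-bit word using only ordinary integer operations. The names `swar_add_u8x4` and `pack` are hypothetical.

```python
# Hypothetical SWAR illustration: four unsigned 8-bit lanes packed into one
# 32-bit word, added lane-wise without any SIMD hardware.
LOW7 = 0x7F7F7F7F  # low 7 bits of every byte lane
HIGH = 0x80808080  # top bit of every byte lane


def swar_add_u8x4(x: int, y: int) -> int:
    """Lane-wise (x + y) mod 256 for four bytes packed in 32-bit words."""
    # Adding only the low 7 bits of each lane means no carry can cross into
    # the next lane; a carry out of bit 6 lands in the cleared bit 7.
    partial = (x & LOW7) + (y & LOW7)
    # Fix up bit 7 of each lane: sum bit 7 = carry ^ x7 ^ y7.
    return (partial ^ ((x ^ y) & HIGH)) & 0xFFFFFFFF


def pack(lanes):
    """Pack four byte values (lane 0 = least significant) into one word."""
    word = 0
    for i, v in enumerate(lanes):
        word |= (v & 0xFF) << (8 * i)
    return word


if __name__ == "__main__":
    a = pack([0x80, 0x7F, 0xFF, 0x01])
    b = pack([0x01, 0x01, 0x01, 0x01])
    print(hex(swar_add_u8x4(a, b)))  # 0x2008081 (lanes wrap mod 256)
```

One 32-bit add thus does the work of four byte adds, which is why such restructuring can pay off for pixel data even on a processor without media ISA extensions.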
Daniel F. Zucker. Architecture and Arithmetic for Multimedia Enhanced Processors. PhD thesis, Department of Electrical Engineering, Stanford University, June 1997.
....on video application traces [5] have found that hybrid memory architectures with stream buffers provide better performance than cache-only memory hierarchies. However, these studies are based on trace-driven simulations assuming perfect branch prediction and memory disambiguation. Zucker et al. [6][7] also examined the impact of streaming memory structures such as stream buffers, stride prediction tables, and stream caches, and also found considerable benefit with these structures in multimedia applications. Consequently, we expect some hybrid of cache and prefetching will provide the best ....
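To make "stride prediction table" concrete, here is a minimal sketch of the general idea: a table indexed by the load instruction's address (PC) remembers the last address and stride, and issues a prefetch once the same stride repeats. It assumes an unbounded table with no replacement policy and a two-access confirmation rule; these details, and the names `StridePredictionTable` and `access`, are illustrative assumptions rather than the design evaluated by Zucker et al.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StrideEntry:
    last_addr: int   # address of the most recent access by this load
    stride: int = 0  # last observed stride in bytes


class StridePredictionTable:
    """Toy stride prediction table keyed by load PC (unbounded, no replacement)."""

    def __init__(self) -> None:
        self.entries: Dict[int, StrideEntry] = {}

    def access(self, pc: int, addr: int) -> Optional[int]:
        """Record a load and return an address to prefetch, if any.

        A prefetch is issued only when the same non-zero stride is seen on
        two consecutive accesses from the same load instruction.
        """
        entry = self.entries.get(pc)
        if entry is None:
            self.entries[pc] = StrideEntry(last_addr=addr)
            return None
        stride = addr - entry.last_addr
        confirmed = stride != 0 and stride == entry.stride
        entry.stride = stride
        entry.last_addr = addr
        return addr + stride if confirmed else None
```

For a load at PC `0x400` touching addresses 1000, 1064, 1128, the first two calls return `None` and the third returns 1192, the predicted next address. Keying by PC lets several interleaved strided streams (for example, reference and current frame buffers) be tracked independently.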
....memory latency is the dominant memory bottleneck in media processing. 9. CONCLUSIONS. This paper explored the issues in memory hierarchy design for programmable media processors by evaluating the performance of a multi-level cache memory hierarchy. Prior work focused on the L1 cache level [5][6][10][12], so particular emphasis was placed on the L2 cache and external memory. Surprisingly, the second-level cache was found to have little impact on memory performance: L2 cache size and access latency have only nominal effects on performance. The L2 cache will be important, however, for ....
D. F. Zucker, "Architecture and Arithmetic for Multimedia Enhanced Processors," Ph.D. Thesis, Dept. of Electrical Engineering, Stanford University, June 1997.
....set to be contained on chip, reducing penalties when re-accessing previously used data. Also, multimedia applications often have predictable memory access patterns. Prefetching memory structures have been shown to be effective at reducing the memory latencies for many multimedia applications [57][58]. It is not currently known what types of memory hierarchies will be most suitable for multimedia, but providing additional silicon resources for on-chip memory hierarchies will enable innovative memory hierarchy designs that combine memory size and prefetching to most effectively reduce ....
....hierarchies [103] and typical DSPs use local memory with some form of prefetching such as DMA, the memory hierarchy for media processors remains an unresolved issue. We believe some hybrid of the two will provide the best performance. Studies by Wu and Wolf on video application traces [58][57] have examined both cache memory systems and hybrid memory architectures that combine a stream buffer or stride prediction table with cache. These studies concluded that cache memory combined with stream buffers had the best performance; however, they are based on trace-driven simulations ....
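A cache combined with a stream buffer can be sketched as follows. This is a Jouppi-style sequential stream buffer in front of a direct-mapped cache, with hypothetical parameters (`LINE_BYTES`, `num_sets`, `depth`) and an idealized timing assumption: a line found in the buffer counts as a hit, as if the prefetch had already completed. It is an illustration of the structure, not the simulator used in the cited studies.

```python
from collections import deque
from typing import Iterable

LINE_BYTES = 32  # hypothetical cache line size


def count_demand_misses(addresses: Iterable[int], num_sets: int = 64,
                        depth: int = 4, stream_buffer: bool = True) -> int:
    """Demand misses of a direct-mapped cache, optionally backed by one
    sequential stream buffer that is checked on every cache miss.
    Idealized: prefetched lines are assumed to arrive before they are used."""
    cache = [None] * num_sets  # line number held by each set
    buf: deque = deque()       # FIFO of prefetched line numbers
    misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        idx = line % num_sets
        if cache[idx] == line:
            continue  # ordinary cache hit
        if stream_buffer and buf and buf[0] == line:
            buf.popleft()  # stream-buffer hit: move the line into the cache
            buf.append(buf[-1] + 1 if buf else line + 1)  # keep prefetching ahead
            cache[idx] = line
            continue
        misses += 1  # true miss: fetch the line from memory
        cache[idx] = line
        if stream_buffer:
            # Flush the buffer and restart the stream at the next lines.
            buf = deque(range(line + 1, line + 1 + depth))
    return misses
```

On a purely sequential scan of 100 lines, the plain cache misses 100 times while the stream-buffer version misses only once; irregular access patterns would see far less benefit, which is why stride-based structures are also considered.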
D. F. Zucker, "Architecture and Arithmetic for Multimedia Enhanced Processors," Ph.D. Thesis, Dept. of Electrical Engineering, Stanford University, June 1997.
....a reduction in the CPU component because of the reduction in instructions and better scheduling when loops are unrolled for the prefetching algorithm [14] .... show the benefits from I/O prefetching, software restructuring to use SIMD without hardware support, and profile-driven software prefetching [25, 26]. However, the studies assume a simplistic processor model with blocking loads and do not study the effect of media ISA extensions. Bilas et al. develop two parallel versions of the MPEG decoder and present results for multiprocessor speedup, memory requirements, load balance, synchronization, and ....
D. F. Zucker. Architecture and Arithmetic for Multimedia Enhanced Processors. PhD thesis, Department of Electrical Engineering, Stanford University, June 1997.
....not the case. Adding prefetch instructions increases the number of instructions that must be executed in the benchmark. A trade-off is made between the number of new cycles that must be executed in prefetch instructions and the number of cycles saved from eliminating cache misses. In [14] it was determined that approximately 15% of all instructions causing useful prefetches cause over 90% of useful prefetches. It was also determined that inserting instructions to cause only 90% of useful prefetches resulted in the best trade-off between increased overhead and decreased cache ....
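The selection rule described above (insert prefetches only for the few instructions that account for most useful prefetches) can be sketched as a greedy coverage pass over a profile. The profile format (a dictionary from a hypothetical instruction label to its count of useful prefetches) and the function name are assumptions for illustration; the 0.90 default mirrors the 90% figure reported in [14].

```python
from typing import Dict, List


def select_prefetch_sites(useful_prefetches: Dict[str, int],
                          coverage: float = 0.90) -> List[str]:
    """Return the fewest instructions whose profiled useful prefetches reach
    `coverage` of the total, taking the most productive instructions first."""
    total = sum(useful_prefetches.values())
    if total == 0:
        return []
    chosen: List[str] = []
    covered = 0
    ranked = sorted(useful_prefetches.items(), key=lambda kv: kv[1], reverse=True)
    for site, count in ranked:
        if covered >= coverage * total:
            break  # target coverage reached; skip the long tail of sites
        chosen.append(site)
        covered += count
    return chosen
```

With a hypothetical profile `{"ld_a": 50, "ld_b": 30, "ld_c": 15, "ld_d": 5}`, the default returns `["ld_a", "ld_b", "ld_c"]`: three of four sites cover 95% of useful prefetches, and the last site is left without a prefetch instruction to save overhead.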
....program execution, while prefetch misses are non-blocking. For software prefetching, execution time incorporates the cost of the additional prefetch instructions executed. These assumptions are made to show the effect of prefetching on a processor unlimited by other resource constraints. In [14], the effect of modifying these assumptions is investigated. Execution time is calculated considering the cost of partially completed prefetches. Furthermore, an instruction mix in which memory operations make up only a fraction of total instructions is considered. Finally, different bus models ....
Daniel F. Zucker, Architecture and Arithmetic for Multimedia Enhanced Processors, Ph.D. thesis, Stanford University, June 1997.
....in a real system. Miss rates for these applications when run in a baseline cache with no enhancements are shown in Fig. 1. Characteristics of the movies used in the benchmark executions are summarized in Table I. Complete miss-rate data for a range of cache line sizes and associativities is given in [15]. B. Performance Metrics. Fraction of misses eliminated is the primary performance metric reported. This metric judges the performance of a given prefetch scheme independently of the particular cache implementation. A perfect prefetching scheme would eliminate all memory misses. This would have a ....
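The metric can be written as a one-line function. This sketch assumes "fraction of misses eliminated" means the baseline miss count minus the miss count with prefetching, divided by the baseline miss count; the snippet does not spell out the exact definition (for instance, how late-arriving prefetches are counted), so treat it as an assumption.

```python
def fraction_of_misses_eliminated(baseline_misses: int, prefetch_misses: int) -> float:
    """Share of baseline cache misses removed by a prefetch scheme.

    1.0 corresponds to perfect prefetching (every miss eliminated);
    0.0 means no improvement over the baseline cache.
    """
    if baseline_misses == 0:
        return 0.0  # nothing to eliminate
    return (baseline_misses - prefetch_misses) / baseline_misses
```

Because it is normalized to the baseline miss count, the same scheme can be compared across cache sizes without reference to a particular miss penalty.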
....and 2-way associative and a number of instruction-mix parameters is available in [15]. [Fig. 16. Performance data for different numbers of prefetch instructions inserted for movie hula in a direct-mapped cache: a) fraction of misses eliminated; b) relative execution time for a 128 kB main cache.] We assume that all instructions except cache misses execute in one cycle. We show data for three movies in Fig. 18. In general, the maximum benefit from prefetching grows as the memory access time increases. As the memory access time is increased, the fraction of execution time ....
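The execution-time model described in these snippets (one cycle per instruction, blocking demand misses, and the cost of inserted prefetch instructions) can be sketched as below. Charging each remaining miss exactly `miss_penalty` extra cycles and ignoring partially completed prefetches are simplifying assumptions, and the workload numbers in the usage example are hypothetical, chosen only to show why the benefit grows with memory access time.

```python
def execution_cycles(instructions: int, misses: int, miss_penalty: int,
                     prefetch_instructions: int = 0) -> int:
    """Every instruction, including inserted prefetches, takes one cycle;
    each remaining demand miss blocks for `miss_penalty` extra cycles."""
    return instructions + prefetch_instructions + misses * miss_penalty


def relative_execution_time(instructions: int, baseline_misses: int,
                            prefetch_misses: int, prefetch_instructions: int,
                            miss_penalty: int) -> float:
    """Execution time with software prefetching relative to no prefetching."""
    base = execution_cycles(instructions, baseline_misses, miss_penalty)
    with_pf = execution_cycles(instructions, prefetch_misses, miss_penalty,
                               prefetch_instructions)
    return with_pf / base


if __name__ == "__main__":
    # Hypothetical workload: 1M instructions, 20k baseline misses,
    # 5k misses left after prefetching, 10k inserted prefetch instructions.
    for penalty in (10, 50):
        ratio = relative_execution_time(1_000_000, 20_000, 5_000, 10_000, penalty)
        print(penalty, round(ratio, 3))  # 10 0.883, then 50 0.63
```

The fixed overhead of the prefetch instructions is paid once, while each eliminated miss saves `miss_penalty` cycles, so a longer memory access time tilts the trade-off further in favor of prefetching.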
D. F. Zucker, "Architecture and arithmetic for multimedia enhanced processors," Ph.D. dissertation, Stanford Univ., Stanford, CA, June 1997.
....is not the case. Adding prefetch instructions increases the number of instructions that must be executed in the benchmark. A trade-off is made between the number of new cycles that must be executed in prefetch instructions and the number of cycles saved from eliminating cache misses. In [14] it was determined that approximately 15% of all instructions causing useful prefetches cause over 90% of useful prefetches. It was also determined that inserting instructions to cause only 90% of useful prefetches resulted in the best trade-off between increased overhead and decreased cache ....
....block program execution, while prefetch misses are non-blocking. For software prefetching, execution time incorporates the cost of the additional prefetch instructions executed. These assumptions are made to show the effect of prefetching on a processor unlimited by other resource constraints. In [14], the effect of modifying these assumptions is investigated. Execution time is calculated considering the cost of partially completed prefetches. Furthermore, an instruction mix in which memory operations make up only a fraction of total instructions is considered. Finally, different bus models ....
Daniel F. Zucker, Architecture and Arithmetic for Multimedia Enhanced Processors, Ph.D. thesis, Stanford University, June 1997.
No context found.
D. F. Zucker, "Architecture and arithmetic for multimedia enhanced processors," Ph.D. Thesis, Dept. of Electrical Engineering, Stanford University, Jun.
CiteSeer.IST - Copyright Penn State and NEC