| T. Austin, G. Sohi, "High-Bandwidth Address Translation for Multiple-Issue Processors," 23rd ISCA, May 1996, pp. 158--167. |
....[31, 15, 23] have looked at the organization of hardware TLB structures and their impact on system performance in terms of capacity and/or associativity. While some of these have focused on single (monolithic) TLBs, other studies have investigated the benefits of multi-level TLBs [6, 2]. There are also implementations of multi-level TLBs in commercial processors such as the MIPS R4000, HaL's SPARC64, IBM AS/400 PowerPC, AMD K-7 and Intel Itanium. We would like to differentiate between the terms software TLB management and software TLB handling in this paper. We use the latter to denote that the miss ....
....TLB management has not been investigated extensively. With instruction-level parallelism (ILP) being exploited by most current processors, there is a need to provide multi-ported TLBs to allow several concurrent instruction streams to access the TLB. Austin and Sohi [2] show how multiple ports can impact access latencies, and argue for interleaved and multi-level designs. TLB miss handling costs need to be kept low for good performance. Commercial processors use either a hardware mechanism or a software mechanism to fill the TLB on a miss. Unlike hardware ....
T. M. Austin and G. S. Sohi. High-Bandwidth Address Translation for Multiple-Issue Processors. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, 1996.
.... caches require more area and have a longer access time [8]. Also, widening requires the same number of address translations, while replication requires more addresses translated per cycle (this affects the number of ports of the TLB, causing an increase in cycle time and die area of the TLB [1]). Register file (RF): in our proposal, widening is applied to the buses, the FPUs and the register file. Every register in the RF increases its width in bits, but they have the same number of ports per bit. Applying replication increases the number of ports per bit. Both techniques increase the ....
T.M. Austin and G.S. Sohi. High-bandwidth address translation for multiple-issue processors. In Proc. of the ISCA-23, pp 158-167. May 1996.
....and queue sizes are sufficiently large. 6.1.3 Combining Memory Requests. In addition to reordering, the cache access rate can be increased by using a technique called reference combining. Combining is a technique concurrently developed by Wilson, Olukotun and Rosenblum [62] and Austin and Sohi [65], which attempts to combine references to the same cache line into a single request. Combining focuses cache resources on areas in the design that can benefit from spatial locality and works as follows: Accessing a storage element in a conventional cache can be thought of as indexing into a two ....
Todd Austin and Gurindar Sohi, High-Bandwidth Address Translation for Multiple-Issue Processors, Proceedings of the 23rd Annual International Symposium on Computer Architecture, (1996) 158-167.
....a processor of degree 8. As can be seen, Level 0 filters most of the data traffic and adding more ports from Level 0 to Level 1 does not significantly increase performance. However, performance remains significantly worse than true multi-ported caches, unlike the same design applied to TLBs (see [AS96]). Performance is close to that of multi-banked caches. 6.3 Alternative Data Layout. Bank conflicts are one of the two performance bottlenecks of multi-banked caches. The data distribution in banks can have a strong impact on the occurrence of bank conflicts. Word distribution can be an ....
....processor-to-cache ports are more efficiently used over time. 4-entry FIFOs placed above each bank improve the baseline multi-banked configuration performance by 0.20 IPC on average (see Figure 10). These FIFOs can be used to implement further optimizations. For multi-banked TLBs, Austin et al. [AS96] have ....
[Footnotes: 8. The HP 8000 primary cache is located off chip, so chip area is less of a concern. 9. 128-byte lines are used in the MIPS R8000 floating-point cache. 10. The dual-banked floating-point cache of the MIPS R8000 is 1 Mbyte to 16 Mbyte large.]
[Figure: x-axis "Number of Banks" (1, 2, 4, 8, 16, 32)]
Todd M. Austin and Gurindar S. Sohi. High-bandwidth address translation for multiple-issue processors. In Proceedings of the 23rd ACM International Symposium on Computer Architecture, Philadelphia, May 1996.
....the TLB miss handler. Improving TLB performance requires either reducing the number of TLB misses incurred by an application, or reducing the cost of individual TLB misses. While some recent research has focused on reducing the cost of TLB misses [Chen et al. 92, Bala et al. 94, Uhlig et al. 94, Austin & Sohi 96], TLB miss paths are already highly optimized, with a single miss costing as little as 10-30 cycles [Kane & Heinrich 92, Dutton et al. 92]. The other alternative for improving TLB performance is to reduce the number of TLB misses. One approach would be to intervene at the level of the user or ....
T. M. Austin and G. S. Sohi. High-Bandwidth Address Translation for Multiple-Issue Processors. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 158--167. IEEE, May 1996.
....cache configuration of each level [SPH88], i.e. cache size, cache line size and associativity. In [JW94], a similar but more precise study is based on the use of both a cycle time model and a memory area model, but the effect of varying the number of ports was not studied. Austin et al. [AS96] have demonstrated that two-level TLB hierarchies can also be used to achieve high-bandwidth address translation. However, TLBs exploit spatial locality at a much coarser grain than caches, so techniques that perform well with TLBs may behave differently with caches. We experimented with the idea of ....
....a processor of degree 8. As can be seen, Level 0 filters most of the data traffic and adding more ports from Level 0 to Level 1 does not significantly increase performance. However, performance remains significantly worse than true multi-ported caches, unlike the same design applied to TLBs (see [AS96]). Performance is close to that of multi-banked caches. 6.3 Alternative Data Layout. Bank conflicts are one of the two performance bottlenecks of multi-banked caches. The data distribution in banks can have a strong impact on the occurrence of bank conflicts. Word distribution can be an ....
Todd M. Austin and Gurindar S. Sohi. High-bandwidth address translation for multiple-issue processors. In Proceedings of the 23rd ACM International Symposium on Computer Architecture, Philadelphia, May 1996.
....time in some workloads [24][25][26]. The increasing pressure on the TLB comes from speed requirements and lack of scalability. Being in the critical path of every instruction and data access, the TLB latency and bandwidth requirements increase with the clock rate and instruction-level parallelism [2]. Moreover, the memory system of modern processors such as dynamically scheduled [33] or VLIW processors [12] needs to satisfy multiple memory and TLB accesses in each cycle. It is more and more difficult and costly to implement a large TLB meeting these requirements. At the same time, the working ....
....mechanism. Wang et al. [29] proposed the idea of a two-level virtual-real cache hierarchy where the TLB is after the FLC. We have called this system L1 TLB. They proposed to store pointers in the two caches to solve the synonym and the writeback problems and to enforce inclusion. Austin and Sohi [2] showed the bandwidth requirement on the TLB in L0 TLB for multiple-issue processors. Instead of brute-force multi-ported TLBs, they evaluated several methods to expand TLB bandwidth, such as interleaved TLBs, multi-level TLBs, piggyback ports which send the translation to simultaneously arriving requests, ....
Todd Austin and Gurindar Sohi. "High-Bandwidth Address Translation for Multiple-Issue Processors," In Proceedings of the 23rd Annual International Symposium on Computer Architecture (ISCA), pages 158-167, 1996.
.... caches require more area and have a longer access time [JNT97]. Also, widening requires the same number of address translations, while replication requires more addresses translated per cycle (this affects the number of ports of the TLB, causing an increase in the cycle time and die area of the TLB [AS96]). Register file (RF): in our proposal, widening is applied to the buses, the FPUs and the register file. Every register in the RF increases its width in bits, but they have the same number of ports per bit. Applying replication increases the number of ports per bit. Both techniques increase the ....
T.M. Austin and G.S. Sohi. High-bandwidth address translation for multiple-issue processors. In ISCA-23, pp 158-167. May 1996.
....processors. Efficient functions must, however, be weighed against implementation complexity and the possibility of lengthening the cache access time. This renders accurate but complex selection functions highly unattractive for cache design. In our experiments, we use bit selection [8] (see Figure 2c), a simple function which uses a portion of the effective address as the bank number; data layout in the cache is thus line-interleaved. As we see shortly, the choice of a selection function may not be as critical as we thought since much of the loss of bandwidth due to same bank ....
....references access the same line in the same cache bank. This inherent spatial locality in program reference patterns can be exploited to improve multi-bank delivered bandwidth through access combining. Access combining, a technique developed concurrently by Wilson et al. [7] and Austin and Sohi [8], attempts to combine references to the same cache line into a single request. Combining devotes additional cache resources to areas in the design that can best exploit spatial locality. Combining works as follows: Accessing stored data in a conventional cache can be viewed as an indexing ....
T. M. Austin and G. S. Sohi, "High-Bandwidth Address Translation for Multiple-Issue Processors," Proceedings of ISCA-23, May 1996.
No context found.
T. Austin, G. Sohi, "High-Bandwidth Address Translation for Multiple-Issue Processors," 23rd ISCA, May 1996, pp. 158--167.
No context found.
Todd M. Austin and Gurindar S. Sohi, "High-bandwidth address translation for multiple-issue processors," In Proceedings of the 23rd ACM Int'l Symp. on Computer Architecture, pp. 158-167, May 1996.