Reducing Remote Conflict Misses in Shared-Memory Multiprocessors: NUMA with Remote Cache and COMA
Abstract:
Many future applications for scalable shared-memory multiprocessors are likely to have large working sets that overflow secondary or tertiary caches. Two possible solutions to this problem are to add a very large cache, called a remote cache, that caches remote data (NUMA-RC), or to organize the machine as a cache-only memory architecture (COMA). This paper tries to determine which solution is better. To compare the performance of the two organizations for the same amount of total memory, we introduce a model of data sharing. The model uses three data sharing patterns: replication, read-mostly migration, and read-write migration. Replication data is accessed in read-mostly mode by several processors, while migration data is accessed largely by one processor at a time. For large working sets, the weight of the migration data largely determines whether COMA outperforms NUMA-RC. Ideally, COMA only needs to fit the replication data in its extra memory; the migration data will simply be swapped between attraction memories. The remote cache of NUMA-RC, by contrast, needs to house both the replication and the migration data. However, simulations of seven Splash2 applications show that COMA does not outperform NUMA-RC. There are several reasons for this beyond the fact that COMA memory accesses are more expensive. First, the extra memory added has more associativity in NUMA-RC than in COMA and, therefore, can be utilized better by the working set. Second, simple data mastership assignment algorithms in COMA may cause what we call false replication. Finally, many of the Splash2 applications have been optimized for a cache-coherent NUMA machine. Overall, since NUMA-RC is cheaper, NUMA-RC is more cost-effective for these applications.
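The idealized capacity argument in the abstract can be illustrated with a small sketch. This is not the paper's model: the class `WorkingSet`, the function names, and the example sizes are hypothetical, chosen only to show why, in the ideal case, COMA's extra memory has to hold just the replication data while the NUMA-RC remote cache has to hold replication and migration data together.

```python
from dataclasses import dataclass


@dataclass
class WorkingSet:
    """Per-node working set, split by sharing pattern (sizes in KB; hypothetical)."""
    replication_kb: int  # data read by several processors
    migration_kb: int    # read-mostly plus read-write migration data


def extra_memory_needed_kb(ws: WorkingSet, organization: str) -> int:
    """Extra memory the idealized argument says each organization must provide."""
    if organization == "COMA":
        # Ideal case: migration data is swapped between attraction memories,
        # so only the replication data consumes extra memory.
        return ws.replication_kb
    if organization == "NUMA-RC":
        # The remote cache must house both replication and migration data.
        return ws.replication_kb + ws.migration_kb
    raise ValueError(f"unknown organization: {organization}")


def fits(ws: WorkingSet, organization: str, extra_kb: int) -> bool:
    """True if the working set fits in the given amount of extra memory."""
    return extra_memory_needed_kb(ws, organization) <= extra_kb


if __name__ == "__main__":
    ws = WorkingSet(replication_kb=256, migration_kb=768)
    for org in ("COMA", "NUMA-RC"):
        print(org, extra_memory_needed_kb(ws, org), fits(ws, org, extra_kb=512))
```

The sketch covers only the ideal case. It ignores associativity, access cost, and false replication, which are the effects the abstract gives as reasons why COMA does not outperform NUMA-RC in the simulations.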
Citations
| 149 | DDM - A Cache-Only Memory Architecture – Hagersten, Landin, et al. - 1992 |
| 133 | STiNG: A CC-NUMA Computer System for the Commercial Marketplace – Lovett, Clapp - 1996 |
| 111 | Comparative performance evaluation of cache-coherent NUMA and COMA architectures – Stenstrom, Joe, et al. - 1992 |
| 69 | An Argument for Simple COMA – Saulsbury, Wilkinson, et al. - 1995 |
| 62 | Simulation of Multiprocessors: Accuracy and Performance – Goldschmidt - 1993 |
| 50 | The S3.mp Scalable Shared Memory Multiprocessor – Nowatzyk - 1995 |
| 40 | Evaluating the Memory Overhead Required for COMA Architectures – Joe, Hennessy - 1994 |
| 16 | COMA-F: A Non-Hierarchical Cache Only Memory Architecture – Joe - 1995 |
| 4 | The SPLASH-2 Programs: Characterization and Methodological Considerations – Woo, Ohara, et al. - 1995 |

