Results 1 - 6 of 6
Memory management in NUMA multicore systems: Trapped between cache contention and interconnect overhead
- In Proceedings of ISMM’11
- Cited by 2 (2 self)
Multiprocessors based on processors with multiple cores usually include a non-uniform memory architecture (NUMA); even current 2-processor systems with 8 cores exhibit non-uniform memory access times. As the cores of a processor share a common cache, the issues of memory management and process mapping must be revisited. We find that optimizing only for data locality can counteract the benefits of cache contention avoidance and vice versa. Therefore, system software must take both data locality and cache contention into account to achieve good performance, and memory management cannot be decoupled from process scheduling. We present a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem. We describe two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and its extension, N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention. N-MASS is fine-tuned to support memory management on NUMA-multicores and improves performance by up to 32%, and by 7% on average, over the default setup in current Linux implementations.
Memory System Performance in a NUMA Multicore Multiprocessor
- Cited by 1 (1 self)
Modern multicore processors with an on-chip memory controller form the base for NUMA (non-uniform memory architecture) multiprocessors. Each processor accesses part of the physical memory directly and has access to the other parts via the memory controller of other processors. These other processors are reached via the cross-processor interconnect. As a consequence a processor’s memory controller must satisfy two kinds of requests: those that are generated by the local cores and those that arrive via the interconnect from other processors. On the other hand, a core (respectively the core’s cache) can obtain data from multiple sources: data can be supplied by the local memory controller or by a remote memory controller on another processor. In this paper we experimentally analyze the behavior of the memory controllers of a commercial multicore processor, the Intel Xeon 5520 (Nehalem). We develop a simple model to characterize the sharing of local and remote memory bandwidth. The uneven treatment of local and remote accesses has implications for mapping applications onto such a NUMA multicore multiprocessor. Maximizing data locality does not always minimize execution time; it may be more advantageous to allocate data on a remote processor (and then to fetch these data via the cross-processor interconnect) than to store the data of all processes in local memory (and consequently overloading the on-chip memory controller).
Application-to-Core Mapping Policies to Reduce Interference in On-Chip Networks
- 2011
As the industry moves toward many-core processors, Network-on-Chips (NoCs) will likely become the communication backbone of future microprocessor designs. The NoC is a critical shared resource and its effective utilization is essential for improving overall system performance and fairness. In this paper, we propose application-to-core mapping policies to reduce the contention in network-on-chip and memory controller resources and hence improve overall system performance. First, we introduce the notion of clusters: cores are grouped into clusters, and a memory controller is assigned to each cluster. The memory controller assigned for a cluster is primarily responsible for servicing the data requested by the applications assigned to that cluster. We propose and evaluate page allocation and page replacement policies that ensure that network traffic of a core is restricted to its cluster with high probability. Second, we develop algorithms that distribute applications between clusters. Our inter-cluster mapping algorithm separates interference-sensitive applications from aggressive ones by mapping them to different clusters to improve system performance, while maintaining a reasonable network load balance among different clusters. Contrary to the conventional wisdom of balancing network/memory load across clusters, we observe that it is also important to ensure that applications that are more sensitive to network latency experience little interference from applications that are network-intensive. Finally, we develop algorithms to map applications to cores within a cluster. The key idea of intra-cluster mapping is to map those applications that benefit more from being close to the memory controller, closer to the controller. We evaluate the proposed application-to-core mapping policies on a 60-core CMP with an 8x8 mesh NoC using a suite of 35 diverse applications. Averaged over 128 randomly generated multiprogrammed workloads, the final proposed policy improves system throughput by 16.7% in terms of weighted speedup over a baseline manycore processor, while also reducing system unfairness by 22.4% and interconnect power consumption by 52.3%.
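The abstract above reports throughput in terms of weighted speedup. As a rough illustration of how that metric is commonly defined (the sum, over all co-running applications, of each application's IPC when sharing the machine divided by its IPC when running alone), here is a minimal Python sketch. The function names and all IPC values are hypothetical and chosen only for illustration; they are not taken from the paper, and the paper's exact measurement methodology may differ.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-application IPC ratios (shared run vs. running alone)."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one shared and one alone IPC per application")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


def improvement_percent(ws_policy, ws_baseline):
    """Relative change in weighted speedup of a policy over a baseline, in percent."""
    return (ws_policy - ws_baseline) / ws_baseline * 100.0


# Hypothetical IPC numbers for a four-application workload (not from the paper).
ipc_alone = [2.0, 1.0, 0.5, 1.5]       # each application running by itself
ipc_baseline = [1.0, 0.5, 0.25, 0.75]  # co-running under a baseline mapping
ipc_policy = [1.2, 0.6, 0.3, 0.9]      # co-running under an alternative mapping

ws_baseline = weighted_speedup(ipc_baseline, ipc_alone)
ws_policy = weighted_speedup(ipc_policy, ipc_alone)
print(f"baseline weighted speedup: {ws_baseline:.2f}")
print(f"policy weighted speedup:   {ws_policy:.2f}")
print(f"improvement: {improvement_percent(ws_policy, ws_baseline):.1f}%")
```

Because each ratio is normalized by the application's own standalone IPC, an application with low absolute IPC counts as much as one with high IPC, which is why the metric is often preferred over raw aggregate IPC for multiprogrammed workloads.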
USIMM: the Utah SImulated Memory Module A Simulation Infrastructure for the JWAC Memory Scheduling Championship
- 2012
USIMM, the Utah SImulated Memory Module, is a DRAM main memory system simulator that is being released for use in the Memory Scheduling Championship (MSC), organized in conjunction with ISCA-39. MSC is part of the JILP Workshops on Computer Architecture Competitions (JWAC). This report describes the simulation infrastructure and how it
The Journal of Instruction Level Parallelism (JILP) organizes an annual Workshop on Computer Architecture Competitions (JWAC). The 2012 competition is a Memory Scheduling Championship (MSC) that will be held in conjunction with ISCA-39 in Portland, Oregon. The memory sub-system is an important component in all computer systems, accounting
Staged Reads: Mitigating the Impact of DRAM Writes on DRAM Reads
Main memory latencies have always been a concern for system performance. Given that reads are on the critical path for CPU progress, reads must be prioritized over writes. However, writes must be eventually processed and they often delay pending reads. In fact, a single channel in the main memory system offers almost no parallelism between reads and writes. This is because a single off-chip memory bus is shared by reads and writes and the direction of the bus has to be explicitly turned around when switching from writes to reads. This is an expensive operation and its cost is amortized by carrying out a burst of writes or reads every time the bus direction is switched. As a result, no reads can be processed while a memory channel is busy servicing writes. This paper proposes a
Improving Writeback Efficiency with Decoupled Last-Write Prediction
In modern DDRx memory systems, memory write requests compete with read requests for available memory resources, significantly increasing the average read request service time. Caches are used to mitigate long memory read latency that limits system performance. Dirty blocks in the last-level cache (LLC) that will not be written again before they are evicted will eventually be written back to memory. We refer to these blocks as last-write blocks. In this paper, we propose an LLC writeback technique that improves DRAM efficiency by scheduling predicted last-write blocks early. We propose a low overhead last-write predictor for the LLC. The predicted last-write blocks are made available to the memory controller for scheduling. This technique effectively re-distributes the memory requests and expands writes scheduling opportunities, allowing writes to be serviced efficiently by DRAM. The technique is flexible enough to be applied to any LLC replacement policy. Our evaluation with multi-programmed workloads shows that the technique significantly improves performance by 6.5%-11.4% on average over the traditional writeback technique in an eight-core processor with various DRAM configurations running memory intensive benchmarks.

