Results 1 - 5 of 5
NoCMsg: Scalable NoC-based Message Passing
- in International Symposium on Cluster, Cloud and Grid Computing
, 2014
"... Abstract—Current processor design with ever more cores may ensure that theoretical compute performance still follows past increases (resting from Moore’s law), but they also increas-ingly present a challenge to hardware and software alike. As the core count increases, the network-on-chip (NoC) topol ..."
Abstract - Cited by 4 (0 self)
Current processor designs with ever more cores may ensure that theoretical compute performance continues its historical growth (resting on Moore’s law), but they also increasingly present a challenge to hardware and software alike. As the core count increases, the network-on-chip (NoC) topology has changed from buses over rings and fully connected meshes to 2D meshes, the most scalable design to date. The question is which programming paradigm provides the scalability needed to keep performance close to the theoretical peak. This work contributes NoCMsg, a low-level message-passing abstraction over NoCs. NoCMsg is specifically designed for large core counts in 2D meshes. Its design ensures deadlock-free messaging for wormhole Manhattan-path routing over the NoC. Experimental results on the TilePro hardware platform show that NoCMsg can significantly reduce communication times by up to 86% for single-packet messages and up to 40% for larger messages compared to other NoC-based messaging approaches. Results further demonstrate the potential of NoC messaging to outperform shared-memory abstractions by up to 93% as core counts and inter-process communication increase; i.e., we observe that shared memory scales up to about 16 cores while message passing performs well beyond that threshold on this platform. To the best of our knowledge, this is the first head-on comparison of shared memory and advanced message passing specifically designed for NoCs on an actual hardware platform with larger core counts on a single socket.
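As a rough illustration of the Manhattan-path (dimension-ordered, X-then-Y) routing that the deadlock-freedom argument relies on, the C sketch below computes a packet's next hop on a 2D mesh. The tile_t type and next_hop function are hypothetical and for illustration only; they are not code from the paper or the TilePro SDK.

#include <stdio.h>

/* Hypothetical tile coordinate; TilePro exposes similar (x, y) tile ids. */
typedef struct { int x, y; } tile_t;

/* Dimension-ordered (X-then-Y, "Manhattan") next-hop selection.
 * Routing every packet first along X and only then along Y rules out the
 * cyclic channel dependencies that cause wormhole deadlock. */
static tile_t next_hop(tile_t cur, tile_t dst)
{
    tile_t nxt = cur;
    if (cur.x != dst.x)          /* still off in the X dimension */
        nxt.x += (dst.x > cur.x) ? 1 : -1;
    else if (cur.y != dst.y)     /* X resolved, now walk Y */
        nxt.y += (dst.y > cur.y) ? 1 : -1;
    return nxt;                  /* cur == dst: already delivered */
}

int main(void)
{
    tile_t cur = {0, 0}, dst = {3, 2};
    while (cur.x != dst.x || cur.y != dst.y) {
        cur = next_hop(cur, dst);
        printf("hop -> (%d,%d)\n", cur.x, cur.y);
    }
    return 0;
}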
Distributed Load Balancing for Parallel Agent-based Simulations
"... Abstract—We focus on agent-based simulations where a large number of agents move in the space, obeying to some simple rules. Since such kind of simulations are computational intensive, it is challenging, for such a contest, to let the number of agents to grow and to increase the quality of the simul ..."
Abstract - Cited by 1 (0 self)
We focus on agent-based simulations in which a large number of agents move through space, obeying simple rules. Since such simulations are computationally intensive, it is challenging in this context to increase the number of agents and the quality of the simulation. A fascinating way to answer this need is to exploit parallel architectures. In this paper, we present a novel distributed load-balancing scheme for a parallel implementation of such simulations, whose purpose is to achieve high scalability. Our approach to load balancing is designed to be lightweight and fully distributed: the balancing calculations take place at each computational step and influence the successive step. To the best of our knowledge, our approach is the first distributed load-balancing scheme in this context. We present both the design and the implementation, which allowed us to perform a number of experiments with up to 1,000,000 agents. Tests show that, even though the load-balancing algorithm is local, the workload distribution remains balanced while the communication overhead is negligible.
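To make the "balance locally at every step" idea concrete, here is a minimal diffusion-style sketch in C of the kind of per-step decision each worker could take. The balance_step function, the neighbour counts, and the one-quarter transfer factor are assumptions for illustration, not the scheme proposed in the paper.

#include <stdio.h>

/* Hypothetical per-worker view: own agent count plus the counts reported
 * by up to 4 mesh neighbours at the previous simulation step. */
#define MAX_NEIGHBOURS 4

/* One local balancing decision, taken independently by every worker after
 * each step: move a fraction of the surplus towards any neighbour that is
 * less loaded.  Generic diffusion-style sketch, not the paper's algorithm. */
static void balance_step(long my_load, const long neighbour_load[],
                         int n_neighbours, long to_send[])
{
    for (int i = 0; i < n_neighbours; i++) {
        long diff = my_load - neighbour_load[i];
        /* send at most a quarter of the imbalance to keep each move cheap */
        to_send[i] = (diff > 0) ? diff / 4 : 0;
    }
}

int main(void)
{
    long neighbours[MAX_NEIGHBOURS] = {800, 1200, 950, 400};
    long to_send[MAX_NEIGHBOURS] = {0};
    balance_step(1000, neighbours, MAX_NEIGHBOURS, to_send);
    for (int i = 0; i < MAX_NEIGHBOURS; i++)
        printf("to neighbour %d: %ld agents\n", i, to_send[i]);
    return 0;
}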
Accelerating text mining workloads in a MapReduce-based distributed GPU environment
- J. Parallel Distrib. Comput
, 2013
"... Scientific computations have been using GPU-enabled computers success-fully, often relying on distributed nodes to overcome the limitations of device memory. Only a handful of text mining applications benefit from such infras-tructure. Since the initial steps of text mining are typically data-intens ..."
Abstract - Cited by 1 (0 self)
Scientific computations have been using GPU-enabled computers successfully, often relying on distributed nodes to overcome the limitations of device memory. Only a handful of text mining applications benefit from such infrastructure. Since the initial steps of text mining are typically data-intensive, and the ease of deploying algorithms is an important factor in developing advanced applications, we introduce a flexible, distributed, MapReduce-based text mining workflow that performs I/O-bound operations on CPUs with industry-standard tools and then runs compute-bound operations on GPUs, where they are optimized to ensure coalesced memory access and effective use of shared memory. We have performed extensive tests of our algorithms on a cluster of eight nodes with two NVIDIA Tesla M2050 GPUs attached to each, and we achieve considerable speedups for random projection and self-organizing maps.
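The CPU/GPU split described above can be viewed as a two-stage pipeline. The toy C sketch below keeps the I/O-bound term counting on the CPU and marks where the compute-bound random projection would become a GPU kernel; all names, sizes, and the CPU stand-in are illustrative assumptions, not the authors' code.

#include <stdio.h>
#include <stdlib.h>

#define VOCAB 8      /* toy vocabulary size  */
#define K     3      /* projected dimensions */

/* Stage 1 (CPU, I/O-bound): turn a "document" into term frequencies. */
static void count_terms(const int *token_ids, int n, float tf[VOCAB])
{
    for (int i = 0; i < VOCAB; i++) tf[i] = 0.0f;
    for (int i = 0; i < n; i++) tf[token_ids[i] % VOCAB] += 1.0f;
}

/* Stage 2 (compute-bound): random projection.  On the real system this
 * loop nest would be a GPU kernel written for coalesced memory access;
 * here it runs on the CPU purely as a stand-in. */
static void random_projection(const float tf[VOCAB],
                              float proj[K][VOCAB], float out[K])
{
    for (int k = 0; k < K; k++) {
        out[k] = 0.0f;
        for (int v = 0; v < VOCAB; v++)
            out[k] += proj[k][v] * tf[v];
    }
}

int main(void)
{
    int doc[] = {1, 4, 4, 7, 2, 1, 1};
    float tf[VOCAB], out[K], proj[K][VOCAB];

    srand(42);
    for (int k = 0; k < K; k++)                 /* random +/-1 matrix */
        for (int v = 0; v < VOCAB; v++)
            proj[k][v] = (rand() & 1) ? 1.0f : -1.0f;

    count_terms(doc, (int)(sizeof doc / sizeof doc[0]), tf);
    random_projection(tf, proj, out);

    for (int k = 0; k < K; k++)
        printf("out[%d] = %.1f\n", k, out[k]);
    return 0;
}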
ZIMMER, CHRISTOPHER J. Bringing Efficiency and Predictability to Massive Multi-core
"... Massive multi-core network-on-chip (NoC) processors represent the next stage in both embedded and general purpose computing. These novel architecture designs with abundant processing resources and increased scalability address the frequency limits of modern processors, power/leakage constraints, and ..."
Abstract
Massive multi-core network-on-chip (NoC) processors represent the next stage in both embedded and general-purpose computing. These novel architecture designs, with abundant processing resources and increased scalability, address the frequency limits of modern processors, power/leakage constraints, and the scalability limits of system-bus interconnects. NoC architectures are particularly interesting in both the real-time embedded and high-performance computing domains. Abundant processing resources have the potential to simplify scheduling and represent a shift away from single-core utilization concerns, e.g., within the model of the “dark silicon” abstraction that promotes a 1-to-1 task-to-core mapping with frequent core activations/deactivations. Additionally, due to silicon constraints, massive multi-core processors often contain simplified processor pipelines that improve predictability analysis, which is beneficial for real-time systems. Simplified processor pipelines coupled with high-performance interconnects also often result in low power utilization, which is beneficial in high-performance systems. While suitable in many ways, these architectures are not without their own challenges. Reliance on shared memory and the strain that massive multi-core processors can put on memory controllers represent a significant challenge to predictability and performance. Resilience is …
NoCMsg: A Scalable Message Passing Abstraction for Network-on-Chips
"... The number of cores of contemporary processors is constantly increasing and thus continues to deliver ever higher peak performance (following Moore’s transistor law). Yet, high core counts present a challenge to hardware and software alike. Following this trend, the network-on-chip (NoC) topology ha ..."
Abstract
The number of cores in contemporary processors is constantly increasing and thus continues to deliver ever higher peak performance (following Moore’s transistor law). Yet, high core counts present a challenge to hardware and software alike. Following this trend, the network-on-chip (NoC) topology has changed from buses over rings and fully connected meshes to 2D meshes. This work contributes NoCMsg, a low-level message-passing abstraction over NoCs, which is specifically designed for large core counts in 2D meshes. NoCMsg ensures deadlock-free messaging for wormhole Manhattan-path routing over the NoC via a polling-based message abstraction and non-flow-controlled communication for selective communication patterns. Experimental results on the TilePro hardware platform show that NoCMsg can significantly reduce communication times by up to 86% for single-packet messages and up to 40% for larger messages compared to other NoC-based messaging approaches. On the TilePro platform, NoCMsg outperforms shared-memory abstractions by up to 93% as core counts and inter-process communication increase. Results for fully pipelined double-precision numerical codes show speedups of up to 64% for message passing over shared memory at 32 cores. Overall, we observe that shared memory scales up to about 16 cores on this platform while message passing performs well beyond that threshold. These results generalize to similar NoC-based platforms.
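As a minimal sketch of what a polling-based, single-packet message interface of this kind might look like, the C fragment below spins on a flag instead of taking interrupts. The msg_slot_t layout and the shared-memory transport are assumptions for illustration and are not the NoCMsg API; on TilePro the payload would travel over the on-chip network rather than through memory.

#include <stdint.h>
#include <string.h>

#define MSG_WORDS 4                      /* single-packet payload size */

/* Hypothetical one-message mailbox between a sender and a receiver. */
typedef struct {
    volatile uint32_t full;              /* 0 = slot free, 1 = message present */
    uint32_t payload[MSG_WORDS];
} msg_slot_t;

/* Sender: poll until the receiver has drained the slot, then fill it.
 * Polling instead of interrupts keeps the fast path short and predictable. */
static void noc_send(msg_slot_t *slot, const uint32_t payload[MSG_WORDS])
{
    while (slot->full)                   /* poll until the slot is free     */
        ;
    memcpy(slot->payload, payload, sizeof slot->payload);
    __sync_synchronize();                /* publish payload before the flag */
    slot->full = 1;
}

/* Receiver: poll until a message arrives, copy it out, release the slot. */
static void noc_recv(msg_slot_t *slot, uint32_t payload[MSG_WORDS])
{
    while (!slot->full)                  /* poll until a message is present */
        ;
    __sync_synchronize();
    memcpy(payload, slot->payload, sizeof slot->payload);
    slot->full = 0;
}

int main(void)
{
    msg_slot_t slot = {0};
    uint32_t out[MSG_WORDS] = {1, 2, 3, 4}, in[MSG_WORDS];
    noc_send(&slot, out);                /* would normally run on another tile */
    noc_recv(&slot, in);
    return in[3] == 4 ? 0 : 1;
}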