Results 1 - 10 of 15
A multilevel parallelization framework for high-order stencil computations, in Euro-Par, 2009
Cited by 11 (1 self)
Abstract. Stencil-based computation on structured grids is a common kernel in broad scientific applications. The order of stencils increases with the required precision, and it is a challenge to optimize such high-order stencils on multicore architectures. Here, we propose a multilevel parallelization framework that combines: (1) inter-node parallelism by spatial decomposition; (2) intra-chip parallelism through multithreading; and (3) data-level parallelism via single-instruction multiple-data (SIMD) techniques. The framework is applied to a 6th-order stencil-based seismic wave propagation code on a suite of multicore architectures. Strong-scaling scalability tests exhibit superlinear speedup due to increasing cache capacity on Intel Harpertown and AMD Barcelona based clusters, whereas weak-scaling parallel efficiency is 0.92 on 65,536 BlueGene/P processors. Multithreading+SIMD optimizations achieve a 7.85-fold speedup on a dual quad-core Intel Clovertown, and the data-level parallel efficiency is found to depend on the stencil order.
Parallel Lattice Boltzmann Flow Simulation on a Low-cost PlayStation 3 Cluster, International Journal of Computer Science, 2008
Cited by 9 (0 self)
Abstract. A parallel Lattice Boltzmann Method (pLBM), based on hierarchical spatial decomposition, is designed to perform large-scale flow simulations. The algorithm uses a critical-section-free, dual representation in order to expose maximal concurrency and data locality. The performance of emerging multi-core platforms, the PlayStation 3 (Cell Broadband Engine) and Compute Unified Device Architecture (CUDA), is tested using pLBM, which is implemented with multi-thread and message-passing programming. The results show that pLBM achieves good performance improvement: 11.02-fold for Cell over a traditional Xeon cluster and 8.76-fold for a CUDA graphics processing unit (GPU) over a Sempron central processing unit (CPU). The results provide some insights into application design on future many-core platforms.
Parallel Multidimensional Scaling Performance on Multicore Systems, in Workshop on Advances in High-Performance E-Science Middleware and Applications, Proceedings of eScience 2008, Indianapolis, IN, December 7-12, 2008. http://grids.ucs.indiana.edu/ptliupages/publications/eScience 2008_bae3.pdf
Cited by 6 (2 self)
Multidimensional scaling constructs a configuration of points in the target low-dimensional space such that the interpoint distances approximate the corresponding known dissimilarity values as closely as possible. The SMACOF algorithm is an elegant gradient-descent approach to solving the multidimensional scaling problem. We design a parallel SMACOF program using parallel matrix multiplication to run on a multicore machine. We also propose a block decomposition algorithm based on the number of threads for the purpose of keeping good load balance. The proposed block decomposition algorithm works very well if the number of row blocks is at least half the number of threads. In this paper, we investigate performance results of the implemented parallel SMACOF in terms of block size, data size, and the number of threads. The speedup factor is almost 7.7 with 2048-point data over 8 running threads. In addition, a performance comparison between the jagged array and two-dimensional array in the C# language is carried out. The jagged array data structure performs at least 40% better than the two-dimensional array structure.
FPGA vs. Multi-Core CPUs vs. GPUs: Hands-on Experience with a Sorting Application
Cited by 5 (0 self)
Abstract. Currently there are several interesting alternatives for low-cost high-performance computing. We report here our experiences with an N-gram extraction and sorting problem that originated in the design of a real-time network intrusion detection system. We considered FPGAs, multi-core CPUs in symmetric multi-CPU machines, and GPUs, and created implementations for each of these platforms. After carefully comparing the advantages and disadvantages of each, we decided to go forward with the implementation written for multi-core CPUs. Arguments for and against each platform are presented, corresponding to our hands-on experience, which we intend to be useful in helping with the selection of hardware acceleration solutions for new projects.
Making sense of performance counter measurements on supercomputing applications, 2010
Cited by 4 (2 self)
The computation nodes of modern supercomputers consist of multiple multicore chips. Many scientific and engineering application codes have been migrated to these systems with little or no optimization for multicore architectures, effectively using only a fraction of the cores on each chip or achieving suboptimal performance from the cores they do utilize. Performance optimization on these systems requires both different measurements and different optimization techniques than those for single-core chips. This paper describes the primary performance bottlenecks unique to multicore chips, sketching the roles that several commonly used measurement tools can most effectively play in performance optimization. The HOMME benchmark code from NCAR is used as a representative case study on several multicore-based supercomputers to formulate and interpret measurements and derive characterizations relevant to modern multicore performance bottlenecks. Finally, we describe common pitfalls in performance measurements on multicore chips and how they may be avoided, along with a novel high-level multicore optimization technique that increased performance by up to 35%.
Hybrid Message-Passing and Shared-Memory Programming in a Molecular Dynamics Application on Multicore Clusters
Cited by 3 (0 self)
Hybrid programming, whereby shared-memory and message-passing programming techniques are combined within a single parallel application, has often been discussed as a method for increasing code performance on clusters of symmetric multiprocessors (SMPs). This paper examines whether the hybrid model brings any performance benefits for clusters based on multicore processors. A molecular dynamics application has been parallelized using both MPI and hybrid MPI/OpenMP programming models. The performance of this application has been examined on two high-end multicore clusters using both InfiniBand and Gigabit Ethernet interconnects. The hybrid model has been found to perform well on the higher-latency Gigabit Ethernet connection, but offers no performance benefit on low-latency InfiniBand interconnects. The changes in performance are attributed to the differing communication profiles of the hybrid and MPI codes. Key words: message passing, shared memory, multicore, clusters, hybrid programming
Balancing Locality and Parallelism on Shared-cache Multi-core Systems
Cited by 1 (0 self)
Abstract—The emergence of multi-core systems opens new opportunities for thread-level parallelism and dramatically increases the performance potential of applications running on these systems. However, the state of the art in performance-enhancing software is far from adequate with regard to the exploitation of hardware features on this complex new architecture. As a result, much of the performance capability of multi-core systems is yet to be realized. This research addresses one facet of this problem by exploring the relationship between data locality and parallelism in the context of multi-core architectures where one or more levels of cache are shared among the different cores. A model is presented for determining a profitable synchronization interval for concurrent threads that interact in a producer-consumer relationship. Experimental results suggest that consideration of the synchronization window, or the amount of work individual threads can be allowed to do between synchronizations, allows for parallelism- and locality-aware performance optimizations. The optimum synchronization window is a function of the number of threads, data reuse patterns within the workload, and the size and configuration of the last level of cache that is shared among processing units. By considering these factors, the calculation of the optimum synchronization window incorporates parallelism and data locality issues for maximum performance. Index Terms—shared-cache, parallelism, performance tuning, memory hierarchy optimization