The impact of multicore on computational science software. CTWatch Quarterly (2007)

by J Dongarra, D Gannon, G Fox, K Kennedy
Results 1 - 10 of 15

A multilevel parallelization framework for high-order stencil computations, in Euro-Par

by Hikmet Dursun, Ken-ichi Nomura, Liu Peng, Richard Seymour, Weiqiang Wang, Rajiv K. Kalia, Aiichiro Nakano, Priya Vashishta , 2009
"... Abstract. Stencil based computation on structured grids is a common kernel to broad scientific applications. The order of stencils increases with the required precision, and it is a challenge to optimize such high-order stencils on multicore architectures. Here, we propose a multilevel parallelizati ..."
Cited by 11 (1 self)
Abstract. Stencil-based computation on structured grids is a common kernel in a broad range of scientific applications. The order of stencils increases with the required precision, and it is a challenge to optimize such high-order stencils on multicore architectures. Here, we propose a multilevel parallelization framework that combines: (1) inter-node parallelism by spatial decomposition; (2) intra-chip parallelism through multithreading; and (3) data-level parallelism via single-instruction multiple-data (SIMD) techniques. The framework is applied to a 6th-order stencil-based seismic wave propagation code on a suite of multicore architectures. Strong-scaling tests exhibit superlinear speedup due to increasing cache capacity on Intel Harpertown and AMD Barcelona based clusters, whereas weak-scaling parallel efficiency is 0.92 on 65,536 BlueGene/P processors. Multithreading+SIMD optimizations achieve a 7.85-fold speedup on a dual quad-core Intel Clovertown, and the data-level parallel efficiency is found to depend on the stencil order.

Citation Context

...ls increases with the required precision, and it is a challenge to optimize such high-order stencils on multicore architectures. Here, we propose a multilevel parallelization framework that combines: (1) inter-node parallelism by spatial decomposition; (2) intra-chip parallelism through multithreading; and (3) data-level parallelism via single-instruction multiple-data (SIMD) techniques. The framework...
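
The framework described above combines three levels of parallelism. As a rough, hypothetical illustration (not the authors' code), the following C sketch shows the two intra-node levels, OpenMP multithreading over the outer grid loop and compiler SIMD vectorization of the inner loop, for a 6th-order stencil sweep over a single subdomain; in the full framework each MPI rank would apply such a sweep to its own spatial subdomain. Grid sizes and coefficients are placeholders.

/* Minimal sketch of the intra-node levels of a multilevel stencil scheme:
 * thread-level parallelism over rows, SIMD-level parallelism over columns. */
#include <stdio.h>
#include <stdlib.h>

#define NX 512
#define NY 512
#define R  3                         /* 6th-order stencil: 3-point halo per side */
#define IDX(j, i) ((j) * (NX + 2 * R) + (i))

/* 6th-order central-difference coefficients for the second derivative */
static const double c[R + 1] = { -49.0/18.0, 3.0/2.0, -3.0/20.0, 1.0/90.0 };

/* One stencil sweep over the local subdomain owned by one MPI rank.
 * Thread level: OpenMP distributes the j-loop over cores.
 * SIMD level:   the i-loop is left to the compiler's vector units.    */
static void sweep(const double *restrict in, double *restrict out)
{
    #pragma omp parallel for schedule(static)
    for (int j = R; j < NY + R; ++j) {
        #pragma omp simd
        for (int i = R; i < NX + R; ++i) {
            double acc = 2.0 * c[0] * in[IDX(j, i)];
            for (int r = 1; r <= R; ++r)
                acc += c[r] * (in[IDX(j, i - r)] + in[IDX(j, i + r)]
                             + in[IDX(j - r, i)] + in[IDX(j + r, i)]);
            out[IDX(j, i)] = acc;
        }
    }
}

int main(void)
{
    size_t n = (size_t)(NX + 2 * R) * (NY + 2 * R);
    double *u = calloc(n, sizeof *u), *v = calloc(n, sizeof *v);
    u[IDX(NY / 2 + R, NX / 2 + R)] = 1.0;          /* point source in the middle */
    sweep(u, v);                                    /* one sweep over the subdomain */
    printf("value at source after one sweep: %g\n", v[IDX(NY / 2 + R, NX / 2 + R)]);
    free(u); free(v);
    return 0;
}

Compiled with an OpenMP-capable compiler (for example, gcc -O3 -fopenmp), the j-loop is split across cores while the i-loop maps onto vector units.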

Parallel Lattice Boltzmann Flow Simulation on a Low-cost PlayStation 3 Cluster

by Liu Peng, Ken-ichi Nomura, Takehiro Oyakawa, Rajiv K. Kalia, Aiichiro Nakano, Priya Vashishta - International Journal of Computer Science, 2008
"... Abstract. A parallel Lattice Boltzmann Method (pLBM), which is based on hierarchical spatial decomposition, is designed to perform large-scale flow simulations. The algorithm uses critical section-free, dual representation in order to expose maximal concurrency and data locality. Performances of eme ..."
Cited by 9 (0 self)
Abstract. A parallel Lattice Boltzmann Method (pLBM), based on hierarchical spatial decomposition, is designed to perform large-scale flow simulations. The algorithm uses a critical section-free, dual representation in order to expose maximal concurrency and data locality. The performance of emerging multi-core platforms, the PlayStation 3 (Cell Broadband Engine) and the Compute Unified Device Architecture (CUDA), is tested using pLBM, which is implemented with multi-thread and message-passing programming. The results show that pLBM achieves good performance improvements: a factor of 11.02 for the Cell over a traditional Xeon cluster and 8.76 for a CUDA graphics processing unit (GPU) over a Sempron central processing unit (CPU). The results provide some insights into application design on future many-core platforms.

Citation Context

...e free-ride era (i.e., legacy software will run faster on newer chips), resulting in a dichotomy: subsiding speed-up of conventional software and exponential speed-up of scalable parallel applications [5]. Recent progress in high-performance technical computing has identified key technologies for parallel computing with portable scalability. An example is an Embedded Divide-and-Conquer (EDC) algorith...
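
As a generic illustration of what a critical-section-free update can look like (this is not necessarily the paper's dual-representation scheme), the C sketch below performs the streaming step of a D2Q9 lattice Boltzmann update with two distribution arrays: every thread reads from the source array and writes only its own cells of the destination array, so no locks or critical sections are needed. The lattice size and the single-step test are arbitrary.

/* Double-buffered "pull" streaming step for a D2Q9 lattice. */
#include <stdio.h>
#include <stdlib.h>

#define NX 64
#define NY 64
#define Q  9                                     /* D2Q9 lattice directions */

static const int cx[Q] = { 0, 1, 0, -1, 0, 1, -1, -1, 1 };
static const int cy[Q] = { 0, 0, 1, 0, -1, 1, 1, -1, -1 };

#define F(f, x, y, k) f[((y) * NX + (x)) * Q + (k)]

/* Each thread writes only its own destination cells, so the update is
 * free of critical sections by construction.                           */
static void stream(const double *restrict src, double *restrict dst)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (int y = 0; y < NY; ++y)
        for (int x = 0; x < NX; ++x)
            for (int k = 0; k < Q; ++k) {
                int xs = (x - cx[k] + NX) % NX;   /* periodic boundaries */
                int ys = (y - cy[k] + NY) % NY;
                F(dst, x, y, k) = F(src, xs, ys, k);
            }
}

int main(void)
{
    size_t n = (size_t)NX * NY * Q;
    double *f0 = calloc(n, sizeof *f0), *f1 = calloc(n, sizeof *f1);
    F(f0, NX / 2, NY / 2, 1) = 1.0;               /* one east-moving packet */
    stream(f0, f1);                               /* packet moves one cell  */
    printf("after streaming: %g at x=%d\n", F(f1, NX / 2 + 1, NY / 2, 1), NX / 2 + 1);
    free(f0); free(f1);
    return 0;
}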

Pomelo II: finding differentially expressed genes (doi:10.1093/nar/gkp366)

by Edward R. Morrissey, Ramón Diaz-Uriarte, 2009
"... Pomelo II ..."
Cited by 6 (0 self)
Abstract not found

Parallel Multidimensional Scaling Performance on Multicore Systems

by Seung-hee Bae - Workshop on Advances in High-Performance E-Science Middleware and Applications, in Proceedings of eScience 2008, Indianapolis, IN, December 7-12, 2008. http://grids.ucs.indiana.edu/ptliupages/publications/eScience 2008_bae3.pdf
"... Multidimensional scaling constructs a configuration points into the target low-dimensional space, while the interpoint distances are approximated to corresponding known dissimilarity value as much as possible. SMA-COF algorithm is an elegant gradient descent approach to solve Multidimensional scalin ..."
Cited by 6 (2 self)
Multidimensional scaling constructs a configuration of points in the target low-dimensional space such that the interpoint distances approximate the corresponding known dissimilarity values as closely as possible. The SMACOF algorithm is an elegant gradient-descent approach to solving the multidimensional scaling problem. We design a parallel SMACOF program using parallel matrix multiplication to run on a multicore machine. We also propose a block decomposition algorithm based on the number of threads for the purpose of keeping good load balance. The proposed block decomposition algorithm works very well if the number of row blocks is at least half the number of threads. In this paper, we investigate performance results of the implemented parallel SMACOF in terms of the block size, data size, and the number of threads. The speedup factor is almost 7.7 with 2048-point data over 8 running threads. In addition, a performance comparison between the jagged array and two-dimensional array data structures in the C# language is carried out. The jagged array data structure performs at least 40% better than the two-dimensional array structure.

Citation Context

...re. 1. Introduction Since multicore architectures were invented, they have become increasingly important in software development, affecting client, server, and supercomputing systems [2, 8, 10, 22]. As [22] mentioned, parallelism has become a critical issue in developing software for the purpose of getting maximum performance gains out of multicore machines. Intel proposed that the Recognition, ...
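
To make the block decomposition idea concrete, here is a small, hypothetical C/OpenMP sketch (not the paper's C# implementation) in which the rows of the output matrix are split into row blocks that threads pick up dynamically; having more blocks than threads is what keeps the load balanced, in the spirit of the rule of thumb quoted above. The matrix order and the blocks-per-thread choice are placeholders.

/* Row-block decomposed matrix multiply: one block per loop iteration. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 512                                   /* matrix order (hypothetical) */

static void block_matmul(const double *A, const double *B, double *C,
                         int n, int nblocks)
{
    int bs = (n + nblocks - 1) / nblocks;       /* rows per block */
    /* Dynamic scheduling lets idle threads pick up remaining row blocks,
     * which is what provides the load balance.                           */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int b = 0; b < nblocks; ++b) {
        int lo = b * bs;
        int hi = lo + bs < n ? lo + bs : n;
        for (int i = lo; i < hi; ++i)
            for (int k = 0; k < n; ++k) {
                double aik = A[i * n + k];
                for (int j = 0; j < n; ++j)
                    C[i * n + j] += aik * B[k * n + j];
            }
    }
}

int main(void)
{
    int n = N, nblocks = 2 * omp_get_max_threads();   /* e.g. two blocks per thread */
    double *A = malloc(sizeof(double) * n * n);
    double *B = malloc(sizeof(double) * n * n);
    double *C = calloc((size_t)n * n, sizeof(double));
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 1.0; }
    double t = omp_get_wtime();
    block_matmul(A, B, C, n, nblocks);
    printf("C[0][0] = %.0f (expect %d), %.3f s with %d threads\n",
           C[0], n, omp_get_wtime() - t, omp_get_max_threads());
    free(A); free(B); free(C);
    return 0;
}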

FPGA vs. Multi-Core CPUs vs. GPUs: Hands-on Experience with a Sorting Application

by Cristian Grozea, Zorana Bankovic, Pavel Laskov
"... Abstract. Currently there are several interesting alternatives for lowcost high-performance computing. We report here our experiences with an N-gram extraction and sorting problem, originated in the design of a real-time network intrusion detection system. We have considered FPGAs, multi-core CPUs i ..."
Cited by 5 (0 self)
Abstract. Currently there are several interesting alternatives for low-cost high-performance computing. We report here our experiences with an N-gram extraction and sorting problem, which originated in the design of a real-time network intrusion detection system. We have considered FPGAs, multi-core CPUs in symmetric multi-CPU machines, and GPUs, and have created implementations for each of these platforms. After carefully comparing the advantages and disadvantages of each, we have decided to go forward with the implementation written for multi-core CPUs. Arguments for and against each platform are presented, corresponding to our hands-on experience, that we intend to be useful in helping with the selection of hardware acceleration solutions for new projects.

Citation Context

... devices and various boards using them (ranging from prototyping boards to FPGA based accelerators, sometimes general purpose, sometimes specialized) [29]; multi-core CPUs, with 2, 3, 4 cores or more [10]; many-core GPUs, with 64, 128, 240 or even more cores [25]; the Cell processor [13, 32]. The former three solutions were considered and implemented in the course of the current study. Let us now descr...

Making sense of performance counter measurements on supercomputing applications

by Jeff Diamond, John D. McCalpin, Martin Burtscher, Stephen W. Keckler, James C. Browne , 2010
"... The computation nodes of modern supercomputers consist of multiple multicore chips. Many scientific and engineering application codes have been migrated to these systems with little or no optimization for multicore architectures, effectively using only a fraction of the number of cores on each chip ..."
Cited by 4 (2 self)
The computation nodes of modern supercomputers consist of multiple multicore chips. Many scientific and engineering application codes have been migrated to these systems with little or no optimization for multicore architectures, effectively using only a fraction of the cores on each chip or achieving suboptimal performance from the cores they do utilize. Performance optimization on these systems requires both different measurements and different optimization techniques than those for single-core chips. This paper describes primary performance bottlenecks unique to multicore chips, sketching the roles that several commonly used measurement tools can most effectively play in performance optimization. The HOMME benchmark code from NCAR is used as a representative case study on several multicore-based supercomputers to formulate and interpret measurements and derive characterizations relevant to modern multicore performance bottlenecks. Finally, we describe common pitfalls in performance measurements on multicore chips and how they may be avoided, along with a novel high-level multicore optimization technique that increased performance by up to 35%.

Citation Context

...compilers optimize code for multicore processors. Jack Dongarra wrote a series on the impact of multicore processors on various programming areas, one of which focused on scientific applications [35]. In this paper, he emphasized, as we do, that multiple cores on a chip cannot be treated as a traditional SMP due to shared on-chip resources, and as a result, new scientific code would become much mor...

HYBRID MESSAGE-PASSING AND SHARED-MEMORY PROGRAMMING IN A MOLECULAR DYNAMICS APPLICATION ON MULTICORE CLUSTERS

by Martin J. Chorley, David W. Walker, Martyn F. Guest
"... Hybrid programming, whereby shared-memory and mes-sage-passing programming techniques are combined within a single parallel application, has often been dis-cussed as a method for increasing code performance on clusters of symmetric multiprocessors (SMPs). This paper examines whether the hybrid model ..."
Cited by 3 (0 self)
Hybrid programming, whereby shared-memory and message-passing programming techniques are combined within a single parallel application, has often been discussed as a method for increasing code performance on clusters of symmetric multiprocessors (SMPs). This paper examines whether the hybrid model brings any performance benefits for clusters based on multicore processors. A molecular dynamics application has been parallelized using both MPI and hybrid MPI/OpenMP programming models. The performance of this application has been examined on two high-end multicore clusters using both Infiniband and Gigabit Ethernet interconnects. The hybrid model has been found to perform well on the higher-latency Gigabit Ethernet connection, but offers no performance benefit on low-latency Infiniband interconnects. The changes in performance are attributed to the differing communication profiles of the hybrid and MPI codes. Key words: message passing, shared memory, multicore, clusters, hybrid programming

Citation Context

... The use of multicore processors in modern HPC systems has created issues (Dongarra et al., 2007) that must be considered when looking at application performance. Among these are the issues of memory bandwidth (the ability to obtain data from memory to the processing cores), and the effects of c...
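
The hybrid model the paper evaluates layers OpenMP threading inside each MPI process. The minimal skeleton below is an illustrative sketch rather than the authors' molecular dynamics code: each MPI rank reduces a node-local quantity with an OpenMP reduction, and the ranks then combine their results with MPI_Reduce; the per-rank loop is a stand-in for a real force or energy computation, and the problem size is a placeholder.

/* Hybrid MPI+OpenMP skeleton: message passing between nodes,
 * shared-memory threading within a node.                      */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    /* FUNNELED: only the master thread of each rank makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const long n_local = 1000000;            /* work items per rank (made up) */
    double local_energy = 0.0;

    /* Shared-memory level: threads on one multicore node sum their own
     * contributions; the reduction avoids any explicit locking.         */
    #pragma omp parallel for reduction(+:local_energy)
    for (long i = 0; i < n_local; ++i)
        local_energy += 1.0 / (double)(i + 1);   /* stand-in for a pair sum */

    /* Message-passing level: combine per-node results across the cluster. */
    double total_energy = 0.0;
    MPI_Reduce(&local_energy, &total_energy, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("total over %d ranks x %d threads: %f\n",
               nranks, omp_get_max_threads(), total_energy);

    MPI_Finalize();
    return 0;
}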

Balancing Locality and Parallelism on Shared-cache Multi-core Systems

by Michael Jason Cade, Apan Qasem
"... Abstract—The emergence of multi-core systems opens new opportunities for thread-level parallelism and dramatically in-creases the performance potential of applications running on these systems. However, the state of the art in performance enhancing software is far from adequate in regards to the ex- ..."
Cited by 1 (0 self)
Abstract. The emergence of multi-core systems opens new opportunities for thread-level parallelism and dramatically increases the performance potential of applications running on these systems. However, the state of the art in performance-enhancing software is far from adequate with regard to the exploitation of hardware features on this complex new architecture. As a result, much of the performance capability of multi-core systems is yet to be realized. This research addresses one facet of this problem by exploring the relationship between data locality and parallelism in the context of multi-core architectures where one or more levels of cache are shared among the different cores. A model is presented for determining a profitable synchronization interval for concurrent threads that interact in a producer-consumer relationship. Experimental results suggest that consideration of the synchronization window, or the amount of work individual threads can be allowed to do between synchronizations, allows for parallelism- and locality-aware performance optimizations. The optimum synchronization window is a function of the number of threads, data reuse patterns within the workload, and the size and configuration of the last level of cache that is shared among processing units. By considering these factors, the calculation of the optimum synchronization window incorporates parallelism and data locality issues for maximum performance. Index Terms: shared-cache, parallelism, performance tuning, memory hierarchy optimization

Citation Context

...ttained by hardware alone. In order to realize the full potential of CMP systems much of the responsibility to find and exploit opportunities for parallelism is now placed on software and programmers [6]. In many cases, the state of the art in performance enhancing tools lacks the sophistication required to make use of the full throughput and energy savings potential in modern CMP systems. The proble...
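
As a concrete, hypothetical sketch of the producer-consumer pattern analyzed above (the paper's model for choosing the window is not reproduced here), the following pthreads program lets the producer fill a whole window of WINDOW items before handing them to the consumer, so the two threads synchronize once per window rather than once per item. WINDOW = 64 is an arbitrary placeholder value that the paper's model would instead compute from thread count, reuse, and shared-cache size.

/* Producer-consumer handoff once per synchronization window. */
#include <pthread.h>
#include <stdio.h>

#define TOTAL   1024          /* items to pass through the buffer            */
#define WINDOW  64            /* synchronization window (tunable placeholder) */

static double buffer[WINDOW];
static int    count = 0;                  /* items currently in the buffer */
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  empty = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    (void)arg;
    for (int done = 0; done < TOTAL; done += WINDOW) {
        pthread_mutex_lock(&lock);
        while (count != 0)                     /* wait until consumer drained */
            pthread_cond_wait(&empty, &lock);
        for (int i = 0; i < WINDOW; ++i)       /* a whole window of work      */
            buffer[i] = done + i;              /* before the next handoff     */
        count = WINDOW;
        pthread_cond_signal(&full);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *consumer(void *arg)
{
    double sum = 0.0;
    for (int done = 0; done < TOTAL; done += WINDOW) {
        pthread_mutex_lock(&lock);
        while (count == 0)
            pthread_cond_wait(&full, &lock);
        for (int i = 0; i < WINDOW; ++i)
            sum += buffer[i];                  /* consume the whole window    */
        count = 0;
        pthread_cond_signal(&empty);
        pthread_mutex_unlock(&lock);
    }
    *(double *)arg = sum;
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    double sum = 0.0;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, &sum);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    printf("sum = %.0f (expect %.0f)\n", sum, (double)TOTAL * (TOTAL - 1) / 2.0);
    return 0;
}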

Media Informatics

by Florian Stabel, Technische Universität Wien; supervised by Ao. Univ. Prof. Dipl.-Ing. Dr. techn. Eduard Gröller, with the assistance of Privatdoz. Dipl.-Geograf Dr. Annett Bartsch
"... by ..."
Abstract - Add to MetaCart

Citation Context

...gn related restrictions, however, have brought an end to this condition. These so-called walls stopped the further pursuit of the previously successful practice of raising single-thread performance [15, 52]. • Clock frequencies cannot be increased anymore without an unjustifiable rise in chip heat, power dissipation, and leakage voltage. • Exploitation of instruction-level parallelism has reached its limits. ...

SCALABLE HIGH PERFORMANCE MULTIDIMENSIONAL SCALING

by Seung-hee Bae , 2012
"... ..."
Abstract - Add to MetaCart
Abstract not found