Results 1 - 10 of 24
Understanding PARSEC performance on contemporary CMPs
In Proceedings of the 2009 International Symposium on Workload Characterization, 2009
Cited by 25 (0 self)
PARSEC is a reference application suite used in industry and academia to assess new Chip Multiprocessor (CMP) designs. No investigation to date has profiled PARSEC on real hardware to better understand scaling properties and bottlenecks. This understanding is crucial in guiding future CMP designs for these kinds of emerging workloads. We use hardware performance counters, taking a systems-level approach and varying common architectural parameters: number of out-of-order cores, memory hierarchy configurations, number of multiple simultaneous threads, number of memory channels, and processor frequencies. We find these programs to be largely compute-bound, and thus limited by number of cores, micro-architectural resources, and cache-to-cache transfers, rather than by off-chip memory or system bus bandwidth. Half the suite fails to scale linearly with increasing number of threads, and some applications saturate performance at a few threads on all platforms tested. Exploiting thread-level parallelism delivers greater payoffs than exploiting instruction-level parallelism. To reduce power and improve performance, we recommend increasing the number of arithmetic units per core, increasing support for TLP, and reducing support for ILP.
Characterizing task usage shapes in Google compute clusters
In Proc. 5th International Workshop on Large Scale Distributed Systems and Middleware, 2011
Cited by 13 (1 self)
The increase in scale and complexity of large compute clusters motivates a need for representative workload benchmarks to evaluate the performance impact of system changes, so as to assist in designing better scheduling algorithms and in carrying out management activities. To achieve this goal, it is necessary to construct workload characterizations from which realistic performance benchmarks can be created. In this paper, we focus on characterizing run-time task resource usage for CPU, memory and disk. The goal is to find an accurate characterization that can faithfully reproduce the performance of historical workload traces in terms of key performance metrics, such as task wait time and machine resource utilization. Through experiments using workload traces from Google production clusters, we find that simply using the mean of task usage can generate synthetic workload traces that accurately reproduce resource utilizations and task wait time. This seemingly surprising result can be justified by the fact that resource usage for CPU, memory and disk is relatively stable over time for the majority of the tasks. Our work not only presents a simple technique for constructing realistic workload benchmarks, but also provides insights into understanding workload performance in production compute clusters.
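The mean-based synthesis the abstract describes can be illustrated with a toy sketch (not the paper's code; the trace values and task names are invented): replace each task's time-varying usage with its constant mean, and the aggregate utilization of the trace is preserved.

```python
from statistics import mean

# Hypothetical trace: per-task CPU usage samples over four time steps.
trace = {
    "task-a": [0.48, 0.52, 0.50, 0.50],
    "task-b": [0.10, 0.12, 0.09, 0.09],
}

def synthesize(trace):
    """Build a synthetic trace where every task uses its mean constantly."""
    return {t: [mean(u)] * len(u) for t, u in trace.items()}

synthetic = synthesize(trace)

# Aggregate machine utilization per time step, real vs. synthetic.
real_util = [sum(u[i] for u in trace.values()) for i in range(4)]
syn_util = [sum(u[i] for u in synthetic.values()) for i in range(4)]
```

The approximation is good exactly when per-task usage is stable over time, which is the paper's empirical observation about the Google traces.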
Enabling and scaling biomolecular simulations of 100 million atoms on petascale machines with a multicore-optimized message-driven runtime
In Proc. of the 2011 ACM/IEEE Conference on Supercomputing, 2011
Cited by 12 (7 self)
A 100-million-atom biomolecular simulation with NAMD is one of the three benchmarks for the NSF-funded sustainable petascale machine. Simulating this large molecular system on a petascale machine presents great challenges, including handling I/O, a large memory footprint, and getting good strong-scaling results. In this paper, we present parallel I/O techniques to enable the simulation. A new SMP model is designed to efficiently utilize ubiquitous wide multicore clusters by extending the CHARM++ asynchronous message-driven runtime. We exploit node-aware techniques to optimize both the application and the underlying SMP runtime. Hierarchical load balancing is further exploited to scale NAMD to the full Jaguar PF Cray XT5 (224,076 cores) at Oak Ridge National Laboratory, both with and without PME full electrostatics, achieving 93% parallel efficiency (vs 6720 cores) at 9 ms per step for a simple cutoff calculation. Excellent scaling is also obtained on 65,536 cores of the Intrepid Blue Gene/P at Argonne National Laboratory.
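The "93% parallel efficiency (vs 6720 cores)" figure refers to strong-scaling efficiency relative to a non-unit baseline core count, which is the standard metric when a run cannot fit on a single core. A short sketch of the computation (the timings below are made up for illustration, not taken from the paper):

```python
def parallel_efficiency(t_base, n_base, t, n):
    """Efficiency of running on n cores relative to an n_base-core baseline:
    measured speedup divided by the ideal speedup n / n_base."""
    speedup = t_base / t
    ideal = n / n_base
    return speedup / ideal

# Hypothetical: 0.30 s/step on 6720 cores, 9.7 ms/step on 224,076 cores.
eff = parallel_efficiency(0.30, 6720, 0.0097, 224076)
```

With these invented numbers the efficiency comes out a little above 0.92, i.e. the 33x increase in cores yields about a 31x reduction in time per step.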
Impact of NUMA Effects on High-Speed Networking with Multi-Opteron Machines
In: The 19th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS 2007), 2007
Cited by 10 (2 self)
The ever-growing level of parallelism within the multi-core and multi-processor nodes in clusters leads to the generalization of distributed memory banks and buses with non-uniform access costs. These NUMA effects have been mostly studied in the context of thread scheduling and are known to have an influence on high-performance networking in clusters. We present an evaluation of their impact on communication performance in multi-Opteron machines. NUMA effects exhibit a strong and asymmetric impact on high-bandwidth communications, while the impact on latency remains low. We then describe the implementation of an automatic NUMA-aware placement strategy which achieves communication performance as good as a careful manual placement, and thus ensures performance portability by gathering hardware topology information and placing communicating tasks accordingly.
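The core of such a placement strategy can be sketched in a few lines (this is a toy illustration under invented names, not the paper's implementation): gather the topology as a NUMA-node-to-cores map, then assign each heavily communicating task pair to cores of the same node.

```python
# Hypothetical two-node topology and communicating task pairs.
topology = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}  # NUMA node -> core ids
comm_pairs = [("net_poller", "consumer"), ("logger", "flusher")]

def place(topology, comm_pairs):
    """Co-locate each communicating pair on one NUMA node,
    spreading pairs toward the node with the most free cores."""
    placement = {}
    free = {n: list(cores) for n, cores in topology.items()}
    for a, b in comm_pairs:
        node = max(free, key=lambda n: len(free[n]))
        placement[a] = free[node].pop(0)
        placement[b] = free[node].pop(0)
    return placement

placement = place(topology, comm_pairs)
```

A real implementation would obtain the topology from the OS or a library such as hwloc and would bind the tasks with affinity syscalls; the point here is only the pair-to-node co-location logic.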
Minimizing Startup Costs for Performance-Critical Threading
In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2009
Cited by 5 (3 self)
Using the well-known ATLAS and LAPACK dense linear algebra libraries, we demonstrate that the parallel management overhead (PMO) can grow with problem size on even statically scheduled parallel programs with minimal task interaction. Therefore, the widely held view that these thread management issues can be ignored in such computationally intensive libraries is wrong, and leads to substantial slowdown on today's machines. We survey several methods for reducing this overhead, the best of which we have not seen in the literature. Finally, we demonstrate that by applying these techniques at the kernel level, performance in applications such as LU and QR factorizations can be improved by almost 40% for small problems, and as much as 15% for large O(N^3) computations. These techniques are completely general, and should yield significant speedup in almost any performance-critical operation. We then show that the lion's share of the remaining parallel inefficiency comes from bus contention, and, in the future work section, outline some promising avenues for further improvement.
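One standard way to reduce this kind of startup overhead, sketched below in Python rather than the paper's C (so this is an illustration of the general persistent-pool idea, not ATLAS's mechanism), is to create worker threads once and feed repeated small parallel calls through a queue instead of spawning threads per call.

```python
import threading
import queue

tasks, results = queue.Queue(), queue.Queue()

def worker():
    # Each worker blocks on the shared task queue until shutdown.
    while True:
        item = tasks.get()
        if item is None:          # shutdown sentinel
            break
        fn, arg = item
        results.put(fn(arg))

pool = [threading.Thread(target=worker) for _ in range(4)]
for t in pool:
    t.start()

# Many small parallel calls reuse the same threads: no per-call
# thread creation cost.
for i in range(8):
    tasks.put((lambda x: x * x, i))
out = sorted(results.get() for _ in range(8))

for _ in pool:                    # drain and stop the pool
    tasks.put(None)
for t in pool:
    t.join()
```

For small problem sizes, where thread creation is comparable to the useful work, this structure is what makes the overhead amortizable; the paper's kernel-level techniques pursue the same goal with lower-level mechanisms.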
Performance Consistency on Multi-socket AMD Opteron Systems
2008
Cited by 4 (2 self)
Compute nodes with multiple sockets, each of which has multiple cores, are starting to dominate in the area of scientific computing clusters. Performance inconsistencies from one execution to the next make any performance debugging or tuning difficult. The resulting performance inconsistencies are bigger for memory-bound applications, but still noticeable for all but the most compute-intensive applications. Memory and thread placement across sockets has a significant impact on the performance of these systems. We test overall performance and performance consistency for a number of OpenMP and pthread benchmarks, including Stream, pChase, the NAS Parallel Benchmarks and SPEC OMP. The tests are run on a variety of multi-socket quad-core AMD Opteron systems. We examine the benefits of explicitly pinning each thread to a different core before any data initialization, thus improving performance and reducing its variability due to data-to-thread co-location. Execution time variability falls to less than 2%, and for one memory-bound application peak performance increases over 40%. For applications running on hundreds or thousands of nodes, reducing variability will improve load balance and total application performance. Careful memory and thread placement is critical for the successful performance tuning of nodes on a modern scientific compute cluster.
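The pin-before-initialize ordering can be shown with a Linux-only sketch using the scheduler wrappers in Python's standard library (the paper's benchmarks are OpenMP/pthread codes; this only illustrates the ordering, and the 1024-element buffer is an arbitrary stand-in for real data):

```python
import os
import threading

def pinned_worker(cpu, chunks):
    # Pin this thread first (pid 0 = calling thread on Linux), so that
    # the first touch of the data happens on the core, and hence the
    # NUMA node, that will keep using it.
    os.sched_setaffinity(0, {cpu})
    chunks.append([0.0] * 1024)   # first-touch allocation after pinning

cpus = sorted(os.sched_getaffinity(0))
chunks = []
threads = [threading.Thread(target=pinned_worker, args=(c, chunks))
           for c in cpus[:2]]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Doing the initialization before pinning would let the OS first-touch the pages wherever the thread happened to be running, which is the source of the run-to-run variability the paper measures.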
Impact of Network Sharing in Multi-core Architectures
2008
Cited by 3 (0 self)
As commodity components continue to dominate the realm of high-end computing, two hardware trends have emerged as major contributors: high-speed networking technologies and multi-core architectures. Communication middleware such as the Message Passing Interface (MPI) uses the network technology for communicating between processes that reside on different physical nodes, while using shared memory for communicating between processes on different cores within the same node. Thus, two conflicting possibilities arise: (i) with the advent of multi-core architectures, the number of processes that reside on the same physical node and hence share the same physical network can potentially increase significantly, resulting in increased network usage, and (ii) given the increase in intra-node shared-memory communication for processes residing on the same node, the network usage can potentially decrease significantly. In this paper, we address these two conflicting possibilities and study the behavior of network usage in multi-core environments with sample scientific applications. Specifically, we analyze trends that result in an increase or decrease of network usage, and we derive insights into application performance based on these. We also study the sharing of different resources in the system in multi-core environments and identify the contribution of the network in this mix. In addition, we study different process allocation strategies and analyze their impact on such network sharing.
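The two competing effects can be made concrete with a back-of-the-envelope sketch (the ring exchange pattern and rank-to-node mappings below are invented, not from the paper): packing more ranks per node converts some messages to shared-memory transfers, while the remaining inter-node messages share one network interface.

```python
def traffic(pairs, node_of):
    """Split message pairs into intra-node (shared memory) and
    inter-node (network) counts for a given rank-to-node mapping."""
    intra = sum(1 for a, b in pairs if node_of[a] == node_of[b])
    return intra, len(pairs) - intra

pairs = [(0, 1), (1, 2), (2, 3), (3, 0)]     # a 4-rank ring exchange
one_per_node = {r: r for r in range(4)}       # one rank per node
packed = {r: r // 2 for r in range(4)}        # two ranks per node

intra_spread, inter_spread = traffic(pairs, one_per_node)
intra_packed, inter_packed = traffic(pairs, packed)
```

For this ring pattern, packing halves the network message count but concentrates the surviving inter-node traffic onto half as many NICs, which is exactly the tension the paper studies with real applications.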
Collaborative Scheduling of DAG Structured Computations on Multicore Processors
Cited by 2 (1 self)
Many computational solutions can be expressed as directed acyclic graphs (DAGs), in which the nodes represent tasks to be executed and edges represent precedence constraints among the tasks. A fundamental challenge in parallel computing is to schedule such DAGs onto multicore processors while preserving the precedence constraints. In this paper, we propose a lightweight scheduling method for DAG structured computations on multicore processors. We distribute the scheduling activities across the cores and let the schedulers collaborate with each other to balance the workload. In addition, we develop a lock-free local task list for the scheduler to reduce the scheduling overhead. We experimentally evaluated the proposed method by comparing it with various baseline methods on state-of-the-art multicore processors. For a representative set of DAG structured computations from both synthetic and real problems, the proposed scheduler with lock-free local task lists achieved a 15.12× average speedup on a platform with four quad-core processors, compared to 8.77× achieved by lock-based baseline methods. The observed overhead of the proposed scheduler was less than 1% of the overall execution time.
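The precedence bookkeeping behind such a scheduler can be sketched sequentially (the paper's version is multithreaded with lock-free lists and work balancing; this toy DAG and round-robin dealing are invented for illustration): each "core" holds a local list of ready tasks, and completing a task decrements its successors' pending-predecessor counts.

```python
from collections import deque

# Hypothetical diamond DAG: edges map each task to its dependents.
edges = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
indegree = {t: 0 for t in edges}
for deps in edges.values():
    for d in deps:
        indegree[d] += 1

# Each "core" keeps its own task list; initially ready tasks are
# dealt out round-robin.
lists = [deque(), deque()]
for i, t in enumerate(t for t, d in indegree.items() if d == 0):
    lists[i % 2].append(t)

order = []
while any(lists):
    for lst in lists:
        if not lst:
            continue            # an idle core would steal work here
        t = lst.popleft()
        order.append(t)
        for d in edges[t]:      # finishing t may make successors ready
            indegree[d] -= 1
            if indegree[d] == 0:
                lst.append(d)   # newly ready tasks go on the local list
```

The paper's contribution is making the local lists lock-free and letting the per-core schedulers collaborate to balance load; the invariant shown here, that a task runs only after all its predecessors, is what any such scheme must preserve.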
RCRTool: Design Document Version 0.1
2010
Cited by 2 (2 self)
RCRTool, the Resource Centric Reflection Tool, will allow application programmers to better understand resource contention between multiple threads of a single application, or between simultaneously active applications sharing varying levels of hardware. The improved knowledge of how the entire system is performing will be available to applications and runtimes for dynamic performance tuning. This document provides some of the motivation and the initial design of the entire system, including access to hardware and OS performance counters, system modeling with that data, APIs that allow access to the data by runtimes and applications, and a data logging facility for post-run analysis. The design attempts to allow the same tool to be used both with a future single shared-address-space node (with tens of cores) and with a distributed memory system with tens of thousands of nodes and hundreds of thousands of cores. The difference between these systems should be contained by differences in which parts of the system are watched for potential bottlenecks and in the granularity of available dynamic feedback. At the center of RCRTool will be the RCRdaemon. It will have several jobs, including watching the hardware and OS for performance bottlenecks using performance models. RCRTool will supply some models, but mechanisms for the user to add their own will exist. RCRdaemon will also be responsible for transmitting the current state of the system to applications and the OS for dynamic tuning. A third function of the daemon will be logging the information for post-execution analysis.
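The daemon's three jobs (watch counters through a model, expose state, log for later analysis) can be sketched as a single sampling step; everything here is invented for illustration (the counter source, the threshold model, and the log format are not from the design document):

```python
import time

def rcr_daemon_step(read_counter, model, log):
    """One sampling period: read a counter, evaluate the bottleneck
    model, record the sample, and report current state."""
    value = read_counter()
    state = "bottleneck" if model(value) else "ok"
    log.append((time.time(), value, state))
    return state

# Hypothetical counter: memory-bandwidth utilization as a fraction,
# with a user-supplied threshold model.
samples = iter([0.35, 0.52, 0.91])
log = []
states = [rcr_daemon_step(lambda: next(samples), lambda v: v > 0.8, log)
          for _ in range(3)]
```

A real RCRdaemon would read hardware and OS counters and run the pluggable models the document describes; the separation of reader, model, and log shown here mirrors that structure.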