CiteSeerX

PHANTOM: predicting performance of parallel applications on large-scale parallel machines using a single node (2010)

by J. Zhai, W. Chen, W. Zheng
Venue: In ACM SIGPLAN Notices

Results 1 - 10 of 32

ScalaExtrap: Trace-based communication extrapolation for SPMD programs

by Xing Wu, Frank Mueller - In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming , 2011
Abstract - Cited by 18 (9 self)
Performance modeling for scientific applications is important for assessing potential application performance and systems procurement in high-performance computing (HPC). Recent progress on communication tracing opens up novel opportunities for communication modeling due to its lossless yet scalable trace collection. Estimating the impact of scaling on communication efficiency still remains non-trivial due to execution-time variations and exposure to hardware and software artifacts. This work contributes a fundamentally novel modeling scheme. We synthetically generate the application trace for large numbers of nodes by extrapolation from a set of smaller traces. We devise an innovative approach for topology extrapolation of single program, multiple data (SPMD) codes with stencil or mesh communication. The extrapolated trace can subsequently be (a) replayed to assess communication requirements before porting an application, (b) transformed to auto-generate communication benchmarks for various target platforms, and (c) analyzed to detect communication inefficiencies and scalability limitations. To the best of our knowledge, rapidly obtaining the communication behavior of parallel applications at arbitrary scale with the availability of timed replay, yet without actual execution of the application at this scale, is without precedent and has the potential to enable otherwise infeasible system simulation at the exascale level.
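The topology-extrapolation idea in the abstract above can be illustrated with a toy sketch. Everything below (the non-periodic 1-D stencil assumption, the function name, the rank counts) is invented for illustration; ScalaExtrap itself handles multi-dimensional topologies, collectives, and timed events.

```python
# Toy illustration of the structural idea behind trace extrapolation
# for a 1-D stencil SPMD code: the communication pattern observed at a
# small rank count is regular, so the partner list for any larger rank
# count can be synthesized without running the application.

def stencil_neighbors(nranks: int) -> dict[int, list[int]]:
    """Left/right partners of each rank in a non-periodic 1-D stencil."""
    return {r: [n for n in (r - 1, r + 1) if 0 <= n < nranks]
            for r in range(nranks)}

# The pattern captured from, say, a 4-rank trace generalizes directly:
small = stencil_neighbors(4)       # observed at trace scale
large = stencil_neighbors(1024)    # synthesized at target scale
```

The same regularity argument is what makes it possible to replay or analyze the synthesized trace at scales where the application was never actually run.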

Citation Context

...tion tracing. Based on the FACT framework, Zhai et al. employ a deterministic replay technique to predict the sequential computation time of one process in a parallel application on a target platform [25]. The main idea is to use the information recorded in the trace to simulate the execution result of MPI calls when there is actually only one MPI process, and utilize the deterministic data replay to ...

Single Node On-Line Simulation of MPI Applications with SMPI

by Pierre-nicolas Clauss, Mark Stillwell, Stéphane Genaud, Frédéric Suter, Henri Casanova, Martin Quinson - In 25th IEEE International Parallel and Distributed Processing Symposium (IPDPS’11), Anchorage (Alaska) USA, May 16-20 2011. Accurate and Fast Simulations of LSDC Systems
Abstract - Cited by 13 (7 self)
Simulation is a popular approach for predicting the performance of MPI applications for platforms that are not at one's disposal. It is also a way to teach the principles of parallel programming and high-performance computing to students without access to a parallel computer. In this work we present SMPI, a simulator for MPI applications that uses on-line simulation, i.e., the application is executed but part of the execution takes place within a simulation component. SMPI simulations account for network contention in a fast and scalable manner. SMPI also implements an original and validated piecewise linear model for data transfer times between cluster nodes. Finally, SMPI simulations of large-scale applications on large-scale platforms can be executed on a single node thanks to techniques to reduce the simulation's compute time and memory footprint. These contributions are validated via a large set of experiments in which SMPI is compared to popular MPI implementations with a view to assessing its accuracy, scalability, and speed. Index Terms: Message Passing Interface; On-line simulation; Performance prediction.
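The piecewise linear transfer-time model the abstract mentions can be sketched as follows. The segment boundaries and (latency, per-byte cost) coefficients below are invented illustration values, not SMPI's calibrated parameters.

```python
# Sketch of a piecewise linear point-to-point transfer-time model, in
# the spirit of the one SMPI's abstract describes: different message
# size ranges get their own latency and per-byte cost, capturing
# protocol switches in real MPI implementations.

# Each segment: (upper size bound in bytes, latency in s, seconds per byte)
SEGMENTS = [
    (1_024,        2e-6, 1.0e-9),  # small messages: latency-dominated
    (65_536,       5e-6, 0.7e-9),  # medium: eager/rendezvous switch
    (float("inf"), 2e-5, 0.5e-9),  # large: bandwidth-dominated
]

def transfer_time(size_bytes: float) -> float:
    """Predicted time (s) to send a message of `size_bytes`."""
    for bound, latency, per_byte in SEGMENTS:
        if size_bytes <= bound:
            return latency + per_byte * size_bytes
    raise ValueError("unreachable: last bound is infinite")
```

A real simulator would fit the segment coefficients from micro-benchmark measurements on the target cluster rather than hard-coding them.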

Assessing the Performance of MPI Applications Through Time-Independent Trace Replay

by Frédéric Desprez, George S. Markomanolis , Martin Quinson , Frédéric Suter , 2010
Abstract - Cited by 6 (4 self)
Abstract not found

Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes

by Alexandru Calotoiu, Torsten Hoefler, Marius Poke, Felix Wolf
Abstract - Cited by 6 (4 self)
Many parallel applications suffer from latent performance limitations that may prevent them from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only when an attempt to scale the code is actually being made—a point where remediation can be difficult. However, creating analytical performance models that would allow such issues to be pinpointed earlier is so laborious that application developers attempt it at most for a few selected kernels, running the risk of missing harmful bottlenecks. In this paper, we show how both coverage and speed of this scalability analysis can be substantially improved. Generating an empirical performance model automatically for each part of a parallel program, we can easily identify those parts that will reduce performance at larger core counts. Using a climate simulation as an example, we demonstrate that scalability bugs are not confined to those routines usually chosen as kernels.
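A minimal sketch of what automated empirical modeling of this kind can look like: fit measured runtime against a few candidate growth terms in the core count p and keep the best fit. The candidate set and plain least-squares selection below are a simplified stand-in for the paper's actual model search, and all names are illustrative.

```python
# Fit t ~= c0 + c1 * f(p) for several candidate growth terms f and
# report which term explains the measurements best (smallest SSE).
# A term like p^2 or p*log(p) winning at small scale is an early
# warning that the code part will not scale to larger core counts.
import math

CANDIDATES = {
    "constant": lambda p: 1.0,
    "log p":    lambda p: math.log2(p),
    "sqrt p":   lambda p: math.sqrt(p),
    "p":        lambda p: float(p),
    "p log p":  lambda p: p * math.log2(p),
    "p^2":      lambda p: float(p * p),
}

def fit_best(ps, times):
    """Least-squares fit per candidate; return (best name, its SSE)."""
    best = None
    for name, f in CANDIDATES.items():
        xs = [f(p) for p in ps]
        n = len(ps)
        mx, my = sum(xs) / n, sum(times) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        c1 = (sum((x - mx) * (t - my) for x, t in zip(xs, times)) / sxx
              if sxx else 0.0)
        c0 = my - c1 * mx
        sse = sum((c0 + c1 * x - t) ** 2 for x, t in zip(xs, times))
        if best is None or sse < best[1]:
            best = (name, sse)
    return best
```

For example, runtimes that grow like p·log p at 2 to 32 cores would be flagged by this loop long before a run at thousands of cores exposes the problem.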

Citation Context

... et al. compare a set of different schemes for automated machine-based performance learning and prediction [22]. Zhai, Chen, and Zheng extrapolate single-node performance to complex parallel machines [40]. Wu and Müller [37] extrapolate traces to larger process counts and can thus predict communication operations. Their extrapolation relies on a trace compression scheme that assumes regular communicat...

Kismet: Parallel Speedup Estimates for Serial Programs

by Donghwan Jeon, Saturnino Garcia, Chris Louie, Michael Bedford Taylor
Abstract - Cited by 6 (0 self)
Software engineers now face the difficult task of refactoring serial programs for parallel execution on multicore processors. Currently, they are offered little guidance as to how much benefit may come from this task, or how close they are to the best possible parallelization. This paper presents Kismet, a tool that creates parallel speedup estimates for unparallelized serial programs. Kismet differs from previous approaches in that it does not require any manual analysis or modification of the program. This difference allows quick analysis of many programs, avoiding wasted engineering effort on those that are fundamentally limited. To accomplish this task, Kismet builds upon the hierarchical critical path analysis (HCPA) technique, a recently developed dynamic analysis that localizes parallelism to each of the potentially nested regions in the target program. It then uses a parallel execution time model to compute an approximate upper bound for performance, modeling constraints that stem from both hardware parameters and internal program structure. Our evaluation applies Kismet to eight high-parallelism NAS Parallel Benchmarks running on a 32-core AMD multicore system, five low-parallelism SpecInt benchmarks, and six medium-parallelism benchmarks running on the fine-grained MIT Raw processor. The results are compelling. Kismet is able to significantly improve the accuracy of parallel speedup estimates relative to prior work based on critical path analysis.
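Kismet's upper bounds build on critical-path analysis, and the classic work/span argument behind such bounds can be stated in a few lines. The function below is a textbook simplification, not Kismet's model, and the numbers in the usage note are invented.

```python
# The work/span bound that critical-path analyses rest on: with total
# work W, critical path (span) S, and n cores, running time is at
# least max(W / n, S), so speedup W / time is capped by both W / S
# (the dependence structure) and n (the hardware).

def speedup_upper_bound(work: float, span: float, cores: int) -> float:
    """Upper bound on achievable speedup for a given work/span/core count."""
    return min(float(cores), work / span)
```

For example, a program with 100 units of total work and a critical path of 10 units can never exceed a 10x speedup, no matter how many cores are added; on 4 cores the binding limit is the core count instead.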

Citation Context

... 30] but often look only at abstract models of execution and do not provide realistic estimates on the speedup of the refactored program. Yet other tools examine the scalability of a parallel program [10, 53] but look only at an existing implementation and therefore do not provide insight into the fundamental scalability of a program. Furthermore, a vast majority of existing tools assume that there is alr...

Scalable performance predictions of distributed peer-to-peer applications

by Bogdan Florin Cornea, Julien Bourgeois, Tung Nguyen, Didier El Baz - Proc. 14th IEEE International Conference on High Performance Computing and Communication , 2012
Abstract - Cited by 2 (0 self)
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Citation Context

...han to actually run the analyzed code. Most performance prediction tools have a slowdown greater than one, e.g., [22, 23]. This places our implementation in the group of rapid prediction tools, such as [14, 24]. C. Reducing the slowdown: Knowing that an important metric for classifying performance prediction tools is the slowdown, we previously proposed an optimized block benchmarking technique presented in...

Estima: Extrapolating scalability of in-memory applications.

by Georgios Chatzopoulos, Aleksandar Dragojević - In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '16), 2016
Abstract - Cited by 1 (0 self)
This paper presents ESTIMA, an easy-to-use tool for extrapolating the scalability of in-memory applications. ESTIMA is designed to perform a simple, yet important task: given the performance of an application on a small machine with a handful of cores, ESTIMA extrapolates its scalability to a larger machine with more cores, while requiring minimum input from the user. The key idea underlying ESTIMA is the use of stalled cycles (i.e., cycles that the processor spends waiting for various events, such as cache misses or waiting on a lock). ESTIMA measures stalled cycles on a few cores and extrapolates them to more cores, estimating the amount of waiting in the system. ESTIMA can be effectively used to predict the scalability of in-memory applications. For instance, using measurements of memcached and SQLite on a desktop machine, we obtain accurate predictions of their scalability on a server. Our extensive evaluation on a large number of in-memory benchmarks shows that ESTIMA has generally low prediction errors.
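The stalled-cycle extrapolation idea can be sketched as follows. The linear stall-growth model and the throughput proxy are simplifying assumptions for illustration, not ESTIMA's actual fitting procedure, and all names are hypothetical.

```python
# Sketch of the stalled-cycle idea from the abstract above: measure
# the fraction of stalled cycles at a few small core counts, fit a
# simple trend, and extrapolate relative throughput (useful cycles per
# unit time) to a larger core count.

def extrapolate_throughput(core_counts, stall_fractions, target_cores):
    """Linear-fit stall fraction vs. core count, then predict relative
    throughput at target_cores as cores * (1 - predicted stall)."""
    n = len(core_counts)
    mx = sum(core_counts) / n
    my = sum(stall_fractions) / n
    sxx = sum((c - mx) ** 2 for c in core_counts)
    slope = sum((c - mx) * (s - my)
                for c, s in zip(core_counts, stall_fractions)) / sxx
    intercept = my - slope * mx
    # Clamp: a stall fraction outside [0, 1] is physically meaningless.
    stall = min(1.0, max(0.0, intercept + slope * target_cores))
    return target_cores * (1.0 - stall)
```

The qualitative behavior matches the abstract's argument: throughput keeps rising while added cores still contribute useful cycles, then flattens or falls once stalls dominate.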

Citation Context

...applications in a simple way, without having to understand in detail the internals of the application or the machine it will run on. ESTIMA enables developers to visualize the scalability of their applications, as well as to discover bottlenecks that might not be evident during initial performance benchmarking. ESTIMA can be applied with little effort to any parallel in-memory application, in contrast to other approaches that heavily rely on application-specific information [4, 6, 22, 24, 25, 30, 44]. Instead, ESTIMA leverages stalled cycles to extrapolate the scalability of an application. These are cycles the application spends on non-useful work, such as waiting for a cache line to be fetched from memory or waiting on a busy lock. Contention for shared resources typically increases with the number of cores used by an application, resulting in an increase in stalled cycles that directly impact the application’s scalability. The application’s performance keeps improving as long as adding more cores increases the number of useful cycles. As soon as adding more cores mostly results in stal...

SWAPP: A Framework for Performance Projections of HPC Applications Using Benchmarks

by Sameh Sharkawi, Don Desota, Raj P, Stephen Stevens, Valerie Taylor, Xingfu Wu
Abstract - Cited by 1 (0 self)
Surrogate-based Workload Application Performance Projection (SWAPP) is a framework for performance projections of High Performance Computing (HPC) applications using benchmark data. Performance projections of HPC applications onto various hardware platforms are important for hardware vendors and HPC users. The projections aid hardware vendors in the design of future systems and help HPC users with system procurement. SWAPP assumes that one has access to a base system and only benchmark data for a target system; the target system is not available for running the HPC application. Projections are developed using the performance profiles of the benchmarks and application on the base system and the benchmark data for the target system. SWAPP projects the performance of the compute and communication components separately, then combines the two projections to obtain the full application projection. In this paper SWAPP was used to project the performance of three NAS Multi-Zone benchmarks onto three systems (an IBM POWER6 575 cluster and an IBM Intel Westmere x5670 cluster, both using an Infiniband interconnect, and an IBM BlueGene/P with 3D Torus and Collective Tree interconnects); the base system is an IBM POWER5+ 575 cluster. The projected performance of the three benchmarks was within an average error magnitude of 11.44% and a standard deviation of 2.64% for the three systems.

Citation Context

...5%. In contrast, with respect to the target architecture our approach utilizes the published execution times of the SPEC CPU and IMB benchmarks, resulting in an error of at most 15%. The PHANTOM tool [15] uses deterministic replay techniques to execute any process of a parallel application on a single node of the target system at real speed, hence measuring computation performance. This assumes that a...

Employing Checkpoint to Improve Job Scheduling in Large-Scale Systems

by Shuangcheng Niu, Jidong Zhai, Xiaosong Ma, Mingliang Liu, Yan Zhai, Wenguang Chen, Weimin Zheng
Abstract - Cited by 1 (0 self)
The FCFS-based backfill algorithm is widely used in scheduling high-performance computer systems. The algorithm relies on job runtime estimates provided by users. However, statistics show that the accuracy of user-provided estimates is poor. Users are very likely to provide a much longer runtime estimate than the job's real execution time. In this paper, we propose an aggressive backfilling approach with checkpoint-based preemption to address the inaccuracy in user-provided runtime estimates. The approach is evaluated with real workload traces. The results show that compared with the FCFS-based backfill algorithm, our scheme improves job scheduling performance in waiting time, slowdown and mean queue length by up to 40%. Meanwhile, only 4% of the jobs need to perform checkpoints.

Citation Context

...ed runtime. Chiang et al. suggested a test run before the actual run to acquire a more accurate runtime estimate [13]. Zhai et al. developed a performance prediction tool to assist with accurate runtime estimation [14]. Tang et al. extracted usable information from historical data [15]. In order to improve estimate accuracy, Cynthia Bailey Lee et al. gave a detailed survey. However, performance prediction is a...

Exploiting GPU hardware saturation for fast compiler optimization. GPGPU-7

by Alberto Magni, Christophe Dubach , 2014
Abstract - Cited by 1 (1 self)
Graphics Processing Units (GPUs) are efficient devices capable of delivering high performance for general purpose computation. Realizing their full performance potential often requires extensive compiler tuning. This process is particularly expensive since it has to be repeated for each target program and platform. In this paper we study the utilization of GPU hardware resources across multiple input sizes and compiler options. In this context we introduce the notion of hardware saturation. Saturation is reached when an application is executed with a number of threads large enough to fully utilize the available hardware resources. We give experimental evidence of hardware saturation and describe its properties using 16 OpenCL kernels on 3 GPUs from Nvidia and AMD. We show that input sizes that saturate the GPU show performance stability across compiler transformations. Using the thread-coarsening transformation as an example, we show that compiler settings maintain their relative performance across input sizes within the saturation region. Leveraging these hardware and software properties we propose a technique to identify the input size at the lower bound of the saturation zone, which we call the Minimum Saturation Point (MSP). By performing iterative compilation on the MSP input size we obtain results effectively applicable to much larger input problems, reducing the overhead of tuning by an order of magnitude on average.
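A minimal sketch of detecting a saturation point from measurements, assuming saturation shows up as throughput stabilizing within a tolerance as the input grows. The detection rule, tolerance, and data are invented for illustration and are not the paper's MSP procedure.

```python
# Find the smallest input size after which measured throughput stays
# within a relative tolerance of its value there, i.e. the point where
# growing the input no longer changes performance much. Tuning at that
# size is then assumed to transfer to larger inputs.

def minimum_saturation_point(sizes, throughputs, tol=0.05):
    """Return the smallest size whose throughput all later sizes match
    to within `tol` (relative); fall back to the largest size."""
    for i in range(1, len(sizes)):
        base = throughputs[i - 1]
        if all(abs(t - base) <= tol * base for t in throughputs[i:]):
            return sizes[i - 1]
    return sizes[-1]
```

With throughputs of 10, 30, 50, 52, 51 at sizes 1, 2, 4, 8, 16, this rule picks size 4: everything beyond it stays within 5% of the value measured there.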

Citation Context

... consider the problem of finding the hardware saturation point. Other researchers have looked at using techniques such as deterministic replay coupled with clustering to select representative replays [11]. A trace of the application is extracted from a host machine and is then replayed locally on a single node. Interestingly, it is possible to use this technique to execute multiple replays on the same...


Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University