Results 1–10 of 14
Reducing power with performance constraints for parallel sparse applications
In Proceedings of IPDPS 2005, the 19th IEEE International Parallel and Distributed Processing Symposium, 8 pp., Apr. 2005. inria-00584944, version 1, 11 Apr 2011
Abstract

Cited by 28 (1 self)
Sparse and irregular computations constitute a large fraction of applications in the data-intensive scientific domain. While every effort is made to balance the computational workload in such computations across parallel processors, achieving sustained near machine-peak performance with a close-to-ideal load-balanced computation-to-processor mapping is inherently difficult. As a result, most of the time, the loads assigned to parallel processors can exhibit significant variations. While there have been numerous past efforts that study this imbalance from the performance viewpoint, to our knowledge, no prior study has considered exploiting the imbalance for reducing power consumption during execution. Power consumption in large-scale clusters of workstations is becoming a critical issue, as noted by several recent research papers from both industry and academia. Focusing on sparse matrix computations in which the underlying parallel computations and data dependencies can be represented by trees, this paper proposes and evaluates different schemes that save power through voltage/frequency scaling. Our goal is to reduce overall energy consumption by scaling the voltages/frequencies of those processors that are not on the critical path; i.e., our approach is oriented towards saving power without incurring performance penalties. The experiments with matrices extracted from real applications as well as with model matrices indicate that the proposed strategies are very effective in saving power, and the savings achieved come close to the optimal limits. Our results also show that the proposed approach can be used to study power-performance trade-offs in environments where a certain performance degradation is tolerable.
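The paper's actual scaling schemes are not reproduced here, but the core idea can be sketched for the simplest case: processors whose workloads join at a barrier. Everything in this sketch (the `branch_times` representation, `f_max`, and the cubic power model mentioned in the docstring) is an assumption of the illustration, not a detail taken from the paper.

```python
def scaled_frequencies(branch_times, f_max=1.0):
    """Slack-based frequency scaling for processors that join at a barrier.

    branch_times: per-processor work, in time units at frequency f_max.
    Processors off the critical path are slowed just enough to finish when
    the longest-running (critical) processor does; since dynamic power
    grows roughly like f^3, running at f < f_max for longer saves energy
    without delaying the barrier, i.e. without a performance penalty.
    """
    t_crit = max(branch_times)
    return [f_max * t / t_crit for t in branch_times]
```

For example, with workloads [2, 4, 1] only the second processor runs at full speed; the others are scaled to finish exactly at the barrier.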
Amesos: A Set of General Interfaces to Sparse Direct Solver Libraries
In Proceedings of the PARA’06 Conference
, 2006
Abstract

Cited by 5 (0 self)
Abstract. We present the Amesos project, which aims to define a set of general, flexible, consistent, reusable and efficient interfaces to direct solution software libraries for systems of linear equations on both serial and distributed memory architectures. Amesos is composed of a collection of pure virtual classes, as well as several concrete implementations in the C++ language. These classes allow access to the linear system matrix and vector elements and their distribution, and control the solution of the linear system. We report numerical results that show that the overhead induced by the object-oriented design is negligible under typical conditions of usage. We include examples of applications, and we comment on the advantages and limitations of the approach.
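The "pure virtual classes plus concrete implementations" design can be sketched in miniature (in Python rather than the project's C++, and with hypothetical names: `DirectSolver`, `ToyDenseSolver`, and the three-phase method split are illustrations, not the Amesos API):

```python
from abc import ABC, abstractmethod


class DirectSolver(ABC):
    """Hypothetical interface in the spirit of Amesos: concrete subclasses
    wrap specific direct solver libraries behind one common API."""

    @abstractmethod
    def symbolic_factorization(self):
        """Analyze the nonzero pattern (ordering, elimination tree)."""

    @abstractmethod
    def numeric_factorization(self):
        """Compute the numerical factors, e.g. A = LU."""

    @abstractmethod
    def solve(self, b):
        """Return x with A x = b using the computed factors."""


class ToyDenseSolver(DirectSolver):
    """Illustrative concrete implementation: dense Gaussian elimination
    without pivoting. A real backend would delegate to a sparse library."""

    def __init__(self, A):
        self.A = [row[:] for row in A]

    def symbolic_factorization(self):
        pass  # nothing to analyze for a dense toy matrix

    def numeric_factorization(self):
        pass  # factorization is folded into solve() in this toy

    def solve(self, b):
        A = [row[:] for row in self.A]
        x = list(b)
        n = len(A)
        for k in range(n):                      # forward elimination
            for i in range(k + 1, n):
                m = A[i][k] / A[k][k]
                for j in range(k, n):
                    A[i][j] -= m * A[k][j]
                x[i] -= m * x[k]
        for i in range(n - 1, -1, -1):          # back substitution
            s = sum(A[i][j] * x[j] for j in range(i + 1, n))
            x[i] = (x[i] - s) / A[i][i]
        return x
```

Application code programs only against the abstract interface, which is what lets a different solver library be swapped in without changing the calling code.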
A PARALLEL SWEEPING PRECONDITIONER FOR HETEROGENEOUS 3D HELMHOLTZ EQUATIONS
Abstract

Cited by 4 (3 self)
Abstract. A parallelization of a sweeping preconditioner for 3D Helmholtz equations without internal resonance is introduced and benchmarked for several challenging velocity models. The setup and application costs of the sequential preconditioner are shown to be O(γ²N^{4/3}) and O(γN log N), where γ(ω) denotes the modestly frequency-dependent number of grid points per Perfectly Matched Layer. Several computational and memory improvements are introduced relative to using black-box sparse-direct solvers for the auxiliary problems, and competitive runtimes and iteration counts are reported for high-frequency problems distributed over thousands of cores. Two open-source packages are released along with this paper: Parallel Sweeping Preconditioner (PSP) and the underlying distributed multifrontal solver, Clique.
Sandia Report
, 1997
Abstract

Cited by 2 (0 self)
This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, nor any of the contractors, subcontractors, or their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government, any agency thereof or any of their contractors or subcontractors. The views and opinions expressed herein do not necessarily state or reflect those of the United States Government, any agency thereof, or any of their contractors or subcontractors.

SAND97-8234, Unlimited Release, Printed February 1997, UC-406

Mixed-Convective, Conjugate Heat Transfer during Molten Salt Quenching of Small Parts

Sandia National Laboratories

SUMMARY: It is common in free quenching immersion heat treatment calculations to locally apply constant or surface-averaged heat-transfer coefficients obtained from either free or forced steady convection over simple shapes with small temperature differences from the ambient fluid. This procedure avoids the solution of highly transient, non-Boussinesq conjugate heat transfer problems which often involve mixed convection, but it leaves great uncertainty about the general adequacy of the results. In this paper we demonstrate for small parts (dimensions of the order of inches rather than feet) quenched...
Time-Memory Trade-Offs Using Sparse Matrix Methods for Large-Scale Eigenvalue Problems
Abstract

Cited by 1 (0 self)
Abstract. Iterative methods such as Lanczos and Jacobi-Davidson are typically used to compute a small number of eigenvalues and eigenvectors of a sparse matrix. However, these methods are not effective in certain large-scale applications, for example, “global tight binding molecular dynamics.” Such applications require all the eigenvectors of a large sparse matrix; the eigenvectors can be computed a few at a time and discarded after a simple update step in the modeling process. We show that by using sparse matrix methods, a direct-iterative hybrid scheme can significantly reduce memory requirements while requiring less computational time than a banded direct scheme. Our method also allows a more scalable parallel formulation for eigenvector computation through spectrum slicing. We discuss our method and provide empirical results for a wide variety of sparse matrix test problems.
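The spectrum-slicing idea (though not the paper's hybrid scheme itself) can be sketched: eigenpairs are produced one spectral interval at a time, so all eigenvectors of a large matrix never need to coexist in memory. The dense `numpy.linalg.eigh` below is a stand-in of this illustration; a real slicer would compute each interval independently via shifted sparse factorizations.

```python
import numpy as np


def eigenpairs_by_slice(A, slices):
    """Yield (eigenvalues, eigenvectors) of symmetric A restricted to each
    half-open interval [lo, hi). Because each slice is produced on its own,
    a caller can update its model with one slice's eigenvectors and discard
    them before the next slice is computed."""
    w, V = np.linalg.eigh(A)  # dense stand-in; eigenvalues come out sorted
    for lo, hi in slices:
        mask = (w >= lo) & (w < hi)
        yield w[mask], V[:, mask]
```

Independent slices are also what makes the formulation parallel: different processors can own different intervals of the spectrum.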
HIGH-PERFORMANCE DIRECT SOLUTION OF FINITE ELEMENT PROBLEMS ON MULTICORE PROCESSORS
, 2010
Improving multifrontal methods by means of block low-rank representations
RT/APO/12/6 (Submitted)
, 2012
ETH Zurich and
Abstract
PyTrilinos is a collection of Python modules that are useful for serial and parallel scientific computing. This collection contains modules that cover serial and parallel dense linear algebra, serial and parallel sparse linear algebra, direct and iterative linear solution techniques, domain decomposition and multilevel preconditioners, nonlinear solvers and continuation algorithms. Also included are a variety of related utility functions and classes, including distributed I/O, coloring algorithms and matrix generation. PyTrilinos vector objects are integrated with the popular NumPy Python module, gathering together a variety of high-level distributed computing operations with serial vector operations. PyTrilinos is a set of interfaces to existing, compiled libraries. This hybrid framework uses Python as the front-end and efficient precompiled libraries for all computationally expensive tasks. Thus, we take advantage of both the flexibility and ease of use of Python and the efficiency of the underlying C++, C and FORTRAN numerical kernels. The presented numerical results show that, for many important problem classes, the overhead required by the Python interpreter is negligible. To run in parallel, PyTrilinos simply requires a standard Python interpreter. The fundamental MPI calls are encapsulated under an abstract layer that manages all interprocessor communications. This makes serial and parallel scripts using PyTrilinos virtually identical.
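The negligible-overhead claim rests on a general pattern that can be illustrated without PyTrilinos itself (whose API is not reproduced here): per-element work dispatched by the interpreter versus a single call into a precompiled kernel. NumPy stands in for the compiled C++/C/FORTRAN layer in this sketch.

```python
import numpy as np


def dot_interpreted(x, y):
    """Inner product with one interpreter dispatch per element: the
    Python-side cost grows linearly with the vector length."""
    s = 0.0
    for a, b in zip(x, y):
        s += a * b
    return s


def dot_compiled(x, y):
    """Same result via a single call into NumPy's precompiled kernel;
    the interpreter's share of the cost is one function call regardless
    of vector length, which is why it becomes negligible for large data."""
    return float(np.dot(x, y))
```

Frameworks of this kind keep scripts in Python for flexibility while ensuring every expensive operation crosses into compiled code exactly once.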