Results 1 – 10 of 50
NAMD2: Greater Scalability for Parallel Molecular Dynamics
 JOURNAL OF COMPUTATIONAL PHYSICS
, 1998
Abstract

Cited by 322 (45 self)
Molecular dynamics programs simulate the behavior of biomolecular systems, leading to insights and understanding of their functions. However, the computational complexity of such simulations is enormous. Parallel machines provide the potential to meet this computational challenge. To harness this potential, it is necessary to develop a scalable program. It is also necessary that the program be easily modified by application-domain programmers.
A Framework for Optimizing Parallel I/O
, 1994
Abstract

Cited by 65 (9 self)
There has been a great deal of recent interest in parallel I/O. This paper discusses issues in the design and implementation of a portable I/O library designed to optimize the performance of multiprocessor architectures that include multiple disks or disk arrays. The major emphasis of the paper is on optimizations that are made possible by the use of collective I/O, so that I/O requests for multiple processors can be combined to improve performance. Performance measurements from benchmarking our implementation of an I/O library that currently performs collective local optimizations, called Jovian, on three application templates are also presented.
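The basic win behind collective I/O, as this abstract describes it, is combining many small interleaved requests from different processors into a few large contiguous accesses. A minimal sketch of that request coalescing (an illustration only; this is not the Jovian API, and the function name is invented):

```python
def coalesce_requests(requests):
    """Merge (offset, length) requests into contiguous runs, so that one
    large disk access can replace many small interleaved ones."""
    merged = []
    for off, length in sorted(requests):
        last = merged[-1] if merged else None
        # Overlapping or adjacent to the previous run: extend that run.
        if last is not None and off <= last[0] + last[1]:
            merged[-1] = (last[0], max(last[1], off + length - last[0]))
        else:
            merged.append((off, length))
    return merged
```

For example, four 4-byte requests at offsets 0, 4, 16, and 8 collapse into two accesses: `coalesce_requests([(0, 4), (4, 4), (16, 4), (8, 4)])` returns `[(0, 12), (16, 4)]`.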
Practical Parallel Algorithms for Dynamic Data Redistribution, Median Finding, and Selection (Extended Abstract)
, 1996
Abstract

Cited by 27 (10 self)
David A. Bader, Joseph JáJá. Institute for Advanced Computer Studies, and Department of Electrical Engineering, University of Maryland, College Park, MD 20742. Email: {dbader, joseph}@umiacs.umd.edu. Abstract: A common statistical problem is that of finding the median element in a set of data. This paper presents a fast and portable parallel algorithm for finding the median given a set of elements distributed across a parallel machine. In fact, our algorithm solves the general selection problem that requires the determination of the element of rank i, for an arbitrarily given integer i. Practical algorithms needed by our selection algorithm for the dynamic redistribution of data are also discussed. Our general framework is a distributed-memory programming model enhanced by a set of communication primitives. We use efficient techniques for distributing, coalescing, and load balancing data as well as efficient combinations of task and data parallelism. The algorithms have been coded in SPLIT-C and run on a variety of platforms, including the Thinking Machines CM-5, IBM SP-1 and SP-2, Cray Research T3D, Meiko Scientific CS-2, Intel Paragon, and workstation clusters. Our experimental results illustrate the scalability and efficiency of our algorithms across different platforms and improve upon all the related experimental results known to the authors.
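The selection idea the abstract describes can be sketched in a few lines as a sequential simulation (this is not the authors' SPLIT-C code): each "processor" holds a block of elements, local medians supply a global pivot, and the half of the partition that cannot contain the element of rank i is discarded each round.

```python
def parallel_select(blocks, i):
    """Return the element of rank i (0-based) across all blocks,
    using a median-of-local-medians pivot each round."""
    blocks = [sorted(b) for b in blocks if b]     # each block stays sorted
    while True:
        total = sum(len(b) for b in blocks)
        if total <= 32:                           # small case: solve directly
            return sorted(x for b in blocks for x in b)[i]
        medians = [b[len(b) // 2] for b in blocks]
        pivot = sorted(medians)[len(medians) // 2]
        low = [[x for x in b if x < pivot] for b in blocks]
        high = [[x for x in b if x > pivot] for b in blocks]
        n_low = sum(len(b) for b in low)
        n_eq = total - n_low - sum(len(b) for b in high)
        if i < n_low:                             # rank lies below the pivot
            blocks = [b for b in low if b]
        elif i < n_low + n_eq:                    # rank hits the pivot value
            return pivot
        else:                                     # rank lies above the pivot
            i -= n_low + n_eq
            blocks = [b for b in high if b]
```

In the real algorithm each block lives on a different processor and the counts n_low and n_eq are computed with reduction primitives; the control flow is the same.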
A scalable FPGA-based multiprocessor
 In Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines
, 2006
Abstract

Cited by 22 (4 self)
It has been shown that a small number of FPGAs can significantly accelerate certain computing tasks by up to two or three orders of magnitude. However, particularly intensive large-scale computing applications, such as molecular dynamics simulations of biological systems, underscore the need for even greater speedups to address relevant length and time scales. In this work, we propose an architecture for a scalable computing machine built entirely using FPGA computing nodes. The machine enables designers to implement large-scale computing applications using a heterogeneous combination of hardware accelerators and embedded microprocessors spread across many FPGAs, all interconnected by a flexible communication network. Parallelism at multiple levels of granularity within an application can be exploited to obtain the maximum computational throughput. By focusing on applications that exhibit a high computation-to-communication ratio, we narrow the extent of this investigation to the development of a suitable communication infrastructure for our machine, as well as an appropriate programming model and design flow for implementing applications. By providing a simple, abstracted communication interface with the objective of being able to scale to thousands of FPGA nodes, the proposed architecture appears to the programmer as a unified, extensible FPGA fabric. A programming model based on the MPI message-passing standard is also presented as a means for partitioning an application into independent computing tasks that can be implemented on our architecture. Finally, we demonstrate the first use of our design flow by developing a simple molecular dynamics simulation application for the proposed machine, which runs on a small platform of development boards.
A Parallel Software Infrastructure for Dynamic Block-Irregular Scientific Calculations
, 1995
ProtoMol: A molecular dynamics framework with incremental parallelization
 In Proc. of the Tenth SIAM Conf. on Parallel Processing for Scientific Computing (PP01), Proceedings in Applied Mathematics
, 2001
Abstract

Cited by 18 (10 self)
Molecular dynamics (MD) for a classical unconstrained simulation of biomolecular systems requires the solution of Newton’s equations of motion. At each step, one evaluates the contribution of interacting forces, and these are applied to the system using a numerical integrator. The most computationally expensive part is the force evaluation among atoms.
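The step structure this abstract describes, evaluate the forces and then advance the system with a numerical integrator, is captured by the standard velocity-Verlet scheme, sketched here for point particles with an arbitrary force function (an illustration, not ProtoMol code):

```python
def velocity_verlet_step(x, v, force, m, dt):
    """Advance positions x and velocities v by one time step dt.
    force(x) returns the force on each particle; m is the mass."""
    a = [f / m for f in force(x)]                 # forces at the current step
    x = [xi + vi * dt + 0.5 * ai * dt * dt
         for xi, vi, ai in zip(x, v, a)]          # position update
    a_new = [f / m for f in force(x)]             # forces at the new positions
    v = [vi + 0.5 * (ai + an) * dt
         for vi, ai, an in zip(v, a, a_new)]      # velocity update
    return x, v
```

For a harmonic force `force = lambda pos: [-k * xi for xi in pos]` this reproduces simple oscillation with good energy conservation; the expensive part in MD is exactly the two `force(x)` calls per step, as the abstract notes.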
Interprocedural Compilation of Irregular Applications for Distributed Memory Machines
 IN PROCEEDINGS OF SUPERCOMPUTING '95
, 1995
Abstract

Cited by 17 (6 self)
Data parallel languages like High Performance Fortran (HPF) are emerging as the architecture-independent mode of programming distributed memory parallel machines. In this paper, we present the interprocedural optimizations required for compiling applications having irregular data access patterns, when coded in such data parallel languages. We have developed an Interprocedural Partial Redundancy Elimination (IPRE) algorithm for optimized placement of runtime preprocessing routines and collective communication routines inserted for managing communication in such codes. We also present three new interprocedural optimizations: placement of scatter routines, deletion of data structures, and use of coalescing and incremental routines. We then describe how program slicing can be used for further applying IPRE in more complex scenarios. We have done a preliminary implementation of the schemes presented here using the Fortran D compilation system as the necessary infrastructure. We prese...
MSA: Multiphase specifically shared arrays
 In Proceedings of the 17th International Workshop on Languages and Compilers for Parallel Computing
, 2004
Abstract

Cited by 16 (8 self)
Abstract. Shared address space (SAS) parallel programming models have faced difficulty scaling to large numbers of processors. Further, although in some cases SAS programs are easier to develop, in other cases they face difficulties due to a large number of race conditions. We contend that a multiparadigm programming model comprising a distributed-memory model with a disciplined form of shared-memory programming may constitute a “complete” and powerful parallel programming system. Optimized coherence mechanisms based on the specific access pattern of a shared variable show significant performance benefits over general DSM coherence protocols. We present MSA, a system that supports such specifically shared arrays that can be shared in read-only, write-many, and accumulate modes. These simple modes scale well and are general enough to capture the majority of shared memory access patterns. MSA does not support a general read-write access mode, but a single array can be shared in read-only mode in one phase and write-many in another. MSA coexists with the message-passing paradigm (MPI) and the processor virtualization-based message-driven paradigm (Charm++). We present the model, its implementation, programming examples and preliminary performance results.
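The phase discipline this abstract describes can be mimicked by a toy sequential class (this is not the MSA API; the class and method names are invented for illustration): an array is used in exactly one mode per phase, and ending the phase flushes accumulated contributions and returns the array to read-only mode.

```python
class PhasedArray:
    """Toy model of a specifically shared array: one access mode per phase."""

    def __init__(self, n):
        self._data = [0.0] * n
        self._pending = {}                 # buffered accumulate contributions
        self.mode = "read-only"

    def set_mode(self, mode):
        assert mode in ("read-only", "write-many", "accumulate")
        self.mode = mode

    def read(self, i):
        assert self.mode == "read-only", "reads allowed only in read-only phase"
        return self._data[i]

    def write(self, i, value):
        assert self.mode == "write-many", "writes allowed only in write-many phase"
        self._data[i] = value

    def accumulate(self, i, delta):
        assert self.mode == "accumulate", "adds allowed only in accumulate phase"
        self._pending[i] = self._pending.get(i, 0.0) + delta

    def sync(self):
        """End the phase: apply buffered contributions, back to read-only."""
        for i, d in self._pending.items():
            self._data[i] += d
        self._pending.clear()
        self.mode = "read-only"
```

Because contributions in accumulate mode commute, a real implementation can buffer them per processor and combine them at `sync()` without any coherence traffic during the phase, which is the optimization the abstract is pointing at.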
Exchange of Messages of Different Sizes
 In IRREGULAR '98
Abstract

Cited by 13 (5 self)
In this paper, we study the exchange of messages among a set of processors linked through an interconnection network. We focus on general, nonuniform versions of all-to-all (or complete) exchange problems in asynchronous systems with a linear cost model and messages of arbitrary sizes. We extend previous complexity results to show that the general asynchronous problems are NP-complete. We present several approximation algorithms and determine which heuristics are best suited to several parallel systems. We conclude with experimental results that show that our algorithms outperform the native all-to-all exchange algorithm on an IBM SP-2 when the number of processors is odd.
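One simple heuristic in the spirit of this problem (an invented illustration, not one of the paper's algorithms) schedules messages into communication rounds in which each processor sends and receives at most once; under a linear cost model a round costs roughly its largest message, so larger messages are placed first:

```python
def schedule_exchange(messages):
    """Greedy round scheduling for a nonuniform all-to-all exchange.
    messages is a list of (src, dst, size); each returned round is a
    partial matching: no processor appears as an endpoint twice."""
    rounds = []                  # each round: (busy endpoint set, message list)
    for src, dst, size in sorted(messages, key=lambda m: -m[2]):
        for busy, msgs in rounds:
            if src not in busy and dst not in busy:   # earliest free round
                busy.update((src, dst))
                msgs.append((src, dst, size))
                break
        else:                                         # no round fits: open one
            rounds.append(({src, dst}, [(src, dst, size)]))
    return [msgs for _, msgs in rounds]
```

Placing large messages first keeps each round's dominant cost from being inflated by a late large message; finding a truly optimal schedule is what the paper shows to be NP-complete.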
Overcoming instabilities in Verlet-I/r-RESPA with the mollified impulse method
Abstract

Cited by 13 (7 self)
The primary objective of this paper is to explain the derivation of symplectic mollified Verlet-I/r-RESPA (MOLLY) methods that overcome linear and nonlinear instabilities that arise as numerical artifacts in Verlet-I/r-RESPA. These methods allow for lengthening of the longest time step used in molecular dynamics (MD). We provide evidence that MOLLY methods can take a longest time step that is 50% greater than that of Verlet-I/r-RESPA, for a given drift, including no drift. A 350% increase in the time step is possible using MOLLY with mild Langevin damping while still computing dynamic properties accurately. Furthermore, longer time steps also enhance the scalability of multiple time stepping integrators that use the popular Particle Mesh Ewald method for computing full electrostatics, since the parallel bottleneck of the fast Fourier transform associated with PME is invoked less often. An additional objective of this paper is to give sufficient implementation details for these mollified integrators, so that interested users may implement them into their MD codes, or use the program ProtoMol in which we have implemented these methods.
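The impulse idea that MOLLY builds on can be sketched as a generic multiple-time-stepping step (this is the plain Verlet-I/r-RESPA pattern, not the mollified version and not ProtoMol code): the expensive slow force is applied as half-kicks at the outer time step, while cheap fast forces are integrated with k inner velocity-Verlet steps of size dt.

```python
def mts_step(x, v, fast, slow, m, dt, k):
    """One outer step of size k*dt. fast(x) and slow(x) return the cheap
    and expensive force contributions at positions x; m is the mass."""
    v = [vi + 0.5 * (k * dt) * fi / m
         for vi, fi in zip(v, slow(x))]            # slow-force half-kick
    for _ in range(k):                             # k inner Verlet steps
        v = [vi + 0.5 * dt * fi / m for vi, fi in zip(v, fast(x))]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        v = [vi + 0.5 * dt * fi / m for vi, fi in zip(v, fast(x))]
    v = [vi + 0.5 * (k * dt) * fi / m
         for vi, fi in zip(v, slow(x))]            # slow-force half-kick
    return x, v
```

The slow force (in MD, the full electrostatics such as PME) is evaluated only once per outer step, which is where the savings come from; MOLLY replaces the slow-force kicks with mollified versions to suppress the resonance instabilities the abstract describes.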