Stephen E. DeBolt and Peter A. Kollman. AMBERCUBE MD, parallelization of AMBER's molecular dynamics module for distributed-memory hypercube computers. Journal of Computational Chemistry, 14(3):312--329, 1993.
....floating-point operations are accounted for on a per (non-vector) element basis. The data in Table 1 show branch instruction percentages ranging from a low of 4.2 for (case 1, opt. 0x) and (case 2, opt. 0x) to a high of 9.9 for (case 2, opt. 2x). These data also show roughly a factor-of-two increase in the number of branches going from case 1 (one pairlist update) to case 2 (ten pairlist updates). For example, (case 1, opt. 2x) executed 0.15 × 10^9 .... (Footnote 3: The parallelization of SHAKE can be problematic for parallel algorithms due to the interprocessor dependences that can arise [4, 7, 13].)
Stephen E. DeBolt and Peter A. Kollman. AMBERCUBE MD, parallelization of AMBER's molecular dynamics module for distributed-memory hypercube computers. Journal of Computational Chemistry, 14(3):312--329, 1993.
.... architectures [18, 26, 32], languages and compilers [7, 13, 17, 25], and software systems [10]. Efforts to improve molecular dynamics performance include sequential algorithms addressing the pairlist calculation [1, 22, 30] and numerous vectorization [3, 14, 15, 29] and parallelization efforts [4, 5, 6, 11, 12, 23]. A common application of a benchmark uses total time to ascertain the efficacy of an algorithm or computing system. The details about the benchmark that should be considered will vary according to the goals of the study and what is being measured. To compare two parallel molecular dynamics ....
....is used by hpm for the MFLOP figures in Table 1. (Footnote 4: Arrived at for (case 0, opt. 0x) by dividing the number of floating-point operations by the number of instructions.) (Footnote 5: The parallelization of SHAKE can be problematic for parallel algorithms due to the interprocessor dependences that can arise [7, 11, 21].) 6 Force Pairlist Shake Integration SQRT RTOI npl=1, 0 74 13 2 1 9 shake 0x 84 7 2 1 4 solute 2 69 10 6 2 10 2x 54 7 19 7 3 Table 2: Percent time for program sections comprising 90 or more of total execution time for myoglobin simulations with GROMOS. The data were obtained using the ....
Stephen E. DeBolt and Peter A. Kollman. AMBERCUBE MD, parallelization of AMBER's molecular dynamics module for distributed-memory hypercube computers. Journal of Computational Chemistry, 14(3):312--329, 1993.
....of this F matrix to different processors. 3. Replicated Data Method. The most commonly used technique for parallelizing MD simulations of molecular systems is known as the replicated data (RD) method [24]. Numerous parallel algorithms and simulations have been developed based on this approach [5, 8, 9, 16, 17, 18, 22, 25]. Typically, each processor is assigned a subset of atoms and updates their positions and velocities for the duration of the simulation, regardless of where they move in the physical domain. To explain the method, we first define x and f as vectors of length N which store the position and total ....
....faster on the Paragon, the two sets of RD timings (filled and open squares) are similar. Both curves show a marked roll-off in parallel efficiency above 64-128 processors due to the poor scaling of the expand and fold operations. This is typical of the results reported in references [8, 9, 16, 17, 18, 22, 25] for RD implementations of other macromolecular codes such as CHARMM, AMBER, and GROMOS on a variety of parallel machines. Parallel efficiencies as low as 10-15% on a few dozen to hundreds of processors are reported, and in some cases the overall speedup is even reduced as more processors are ....
S. E. DeBolt and P. A. Kollman. AMBERCUBE MD, parallelization of AMBER's molecular dynamics module for distributed-memory hypercube computers. J. Comp. Chem., 14:312--329, 1993.
....[13, 33] and is beyond the scope of this paper. The most commonly used technique for parallelizing short-range MD simulations of molecular systems is known as the replicated data (RD) method [31]. Numerous parallel algorithms and simulations have been developed based on this approach [7, 11, 12, 18, 21, 23, 29, 32]. Typically, each processor stores a copy of all the atom positions in the simulation. It uses this vector of information to compute non-bonded forces for the subset of atoms assigned to it. The bonded force computation can be simply parallelized in this scheme, since each processor can compute ....
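The replicated-data idea in the context above can be illustrated with a small serial sketch. This is not code from any of the cited papers: the 1D toy force law, the cutoff value, the round-robin atom assignment, and the function names pair_force, rd_forces, and serial_forces are all assumptions made for illustration, and the "processors" are simulated by a loop rather than by real message passing.

```python
def pair_force(xi, xj, cutoff=2.5):
    """Toy 1D short-range force on atom i due to atom j (assumed force law)."""
    d = xj - xi
    return d if abs(d) < cutoff else 0.0


def serial_forces(x):
    """Reference result: every atom's total force computed in one place."""
    n = len(x)
    return [sum(pair_force(x[i], x[j]) for j in range(n) if j != i)
            for i in range(n)]


def rd_forces(x, n_procs):
    """Replicated-data sketch: each simulated rank holds all positions,
    computes forces only for its own atoms, then results are summed."""
    n = len(x)
    partial = []
    for rank in range(n_procs):
        local_x = list(x)              # full copy of positions on every rank
        f = [0.0] * n                  # full-length force vector, mostly zeros
        for i in range(rank, n, n_procs):   # round-robin assignment (assumed)
            f[i] = sum(pair_force(local_x[i], local_x[j])
                       for j in range(n) if j != i)
        partial.append(f)
    # Stand-in for the global sum: every rank would end up with this vector.
    return [sum(p[i] for p in partial) for i in range(n)]


if __name__ == "__main__":
    positions = [0.0, 1.0, 1.5, 4.0, 6.0]
    print(rd_forces(positions, 3) == serial_forces(positions))
```

The sketch only shows the data layout: every rank needs the whole position vector, which is why the communication step that rebuilds it grows with N regardless of how many processors share the work.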
....is called a fold operation [4, 16, 25] and scales optimally as N, the volume of data in the force vector f. We note that the fold operation is less costly than a global sum operation, where each processor ends up with the total force on all N atoms, as is done in the RD algorithms discussed in [12, 21, 23, 32]. A global sum operation typically scales as N log2(P) and so on 256 processors is 8 times more expensive than a fold. The N/P forces resulting from the fold are used to update atom positions and velocities in step (4). Finally, in step (5) the new atom positions in x_z are shared among all P ....
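The factor of 8 quoted in the context above follows directly from the two scaling laws it states: a fold costs on the order of N, a global sum on the order of N log2(P), so their ratio is log2(P). A minimal check, with hypothetical function names and all constant factors ignored:

```python
import math


def fold_cost(n_atoms):
    """Fold cost model from the quoted text: proportional to N."""
    return float(n_atoms)


def global_sum_cost(n_atoms, n_procs):
    """Global sum cost model from the quoted text: proportional to N log2(P)."""
    return n_atoms * math.log2(n_procs)


def cost_ratio(n_atoms, n_procs):
    """How many times more expensive the global sum is than the fold."""
    return global_sum_cost(n_atoms, n_procs) / fold_cost(n_atoms)


if __name__ == "__main__":
    # The ratio is independent of N; 10000 atoms is an arbitrary choice.
    print(cost_ratio(10000, 256))
```

Because the ratio depends only on P, the gap between the two operations widens slowly but steadily as processors are added.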
[Article contains additional citation context not shown here]
S. E. DeBolt and P. A. Kollman. AMBERCUBE MD, parallelization of AMBER's molecular dynamics module for distributed-memory hypercube computers. J. Comp. Chem., 14:312--329, 1993.
....must communicate with many surrounding processors to acquire needed information. The extra communication results in lower parallel efficiencies. For these reasons, particle decomposition methods have been the method of choice in organic MD simulation codes that have been parallelized to date [5, 7, 12]. They have the additional advantage that the extra 2-, 3-, and 4-body forces that must be computed in organic simulations within the topology of the molecules are easily divided among the processors in a load-balanced fashion because each processor knows the positions of all atoms. Recently, ....
S. E. DeBolt and P. A. Kollman. AMBERCUBE MD, parallelization of AMBER's molecular dynamics module for distributed-memory hypercube computers. J. Comp. Chem., 14:312--329, 1993.
....with cycles) that induce data dependencies. The original SHAKE algorithm is intrinsically sequential. While we experimented with several graph-based parallelizations of SHAKE from the MD literature, the algorithm giving the best convergence and concurrency results, the Reciprocal Exchange algorithm [9], still has limited concurrency and produces speedup that saturates around 16 nodes for our data set. 3.4. Summary. The major phases of IC CEDAR all exhibit highly irregular work distribution and data access patterns. While good performance is the final objective, convenient and high-level ....
S. E. DeBolt and P. A. Kollman. AMBERCUBE MD, parallelization of AMBER's molecular dynamics module for distributed-memory hypercube computers. J. Comp. Chem., 14(3):312--329, 1993.
....The conventional algorithm used to resolve the constraints, a variant of the nonlinear Gauss-Seidel algorithm, appears at first to be essentially sequential. However, the (nonlinear) constraint matrix is frequently (e.g. for biological systems) tightly banded, and this allows efficient pipelining [9] in the resolution of the constraints. Alternatively, the nonlinear Gauss-Seidel algorithm can be modified by adding Jacobi-like iterations for certain constraints [5]. In many cases, one can show that this degrades convergence only slightly [29]. 3. Explicit predictor-corrector methods. The ....
S.E. DeBolt & P.A. Kollman, AMBERCUBE MD, parallelization of AMBER's molecular dynamics module for distributed-memory hypercube computers, J. Comp. Chem. 14 (1993), 312-329.