18 citations found.
P. Liu and S. N. Bhatt, Experiences with parallel N-body simulation, in "Proc. 6th ACM Symposium on Parallel Algorithms and Architectures, ACM SIGACT and SIGARCH, Cape May, NJ, 1994," pp. 122--131.

This paper is cited in the following contexts:
A Comparison of Three Programming Models for Adaptive.. - Shan, Singh, Oliker.. (2000)   (9 citations)

....a cost-balanced local tree on each processor. A communication step is finally required to appropriately distribute the particle and cell information, thus allowing each processor to build its locally essential tree. We also implemented the ORB version in a manner similar to that reported in [5, 12], and found no significant performance differences with our costzones approach. Instead, using costzones allowed us to make easier comparisons with the CC-SAS implementation of the N-body problem. The CC-SAS version of the N-body simulation is obtained from the SPLASH-2 suite [21] and further ....

P. Liu and S.N. Bhatt, "Experiences with parallel N-body simulation," 6th ACM Symposium on Parallel Algorithms and Architectures, 1994, 122--131.
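The costzones approach mentioned in this context gives each processor a contiguous run of bodies, taken in tree-traversal order, whose total cost is about 1/P of the whole. A minimal Python sketch, assuming the bodies are already listed in that order and each carries a cost measured in the previous time step; the function name `costzones` and the example costs are hypothetical, not taken from either paper.

```python
from typing import List


def costzones(costs: List[float], nprocs: int) -> List[int]:
    """Hypothetical costzones assignment.

    `costs` lists per-body work in tree-traversal order. The cumulative cost
    axis [0, total) is cut into `nprocs` equal zones, and each body goes to
    the zone that contains the midpoint of its own slice of that axis.
    """
    total = sum(costs)
    if nprocs < 1 or total <= 0:
        raise ValueError("need nprocs >= 1 and a positive total cost")
    owners: List[int] = []
    prefix = 0.0
    for c in costs:
        midpoint = prefix + c / 2.0
        zone = int(midpoint * nprocs / total)
        owners.append(min(zone, nprocs - 1))  # clamp against rounding at the top end
        prefix += c
    return owners


if __name__ == "__main__":
    # Eight bodies with uneven costs; each of the two zones ends up with cost 8.
    print(costzones([1, 1, 1, 1, 2, 2, 4, 4], 2))  # [0, 0, 0, 0, 0, 0, 1, 1]
```

Because each zone is contiguous in traversal order, a processor's bodies tend to come from neighboring subtrees, which is what lets costzones serve as an alternative to ORB.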


Rapid Simulation of Wireless Systems - Felipe Perrone David (1998)   (1 citation)

....transmitter tuned to c and also all others tuned to adjacent channels (assumed to be no greater than 2). Given an even distribution of transmitters to channels, the per-mobile cost is O(NC); considering all mobiles, the total cost is O(N^2 C). This cost might be reduced using N-body algorithms [10]. In our experiments we limit the cost by restricting the subdomain considered to affect a cell; this is for computational expediency and exploration of interval jumping rather than knowledge that it is numerically acceptable. Observing the numerical algorithm embodied in Equation 2, we see that ....

Pangfeng Liu and Sandeep N. Bhatt. Experiences with parallel N-body simulation. In 6th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '94), pages 122--131, 1994.


Implementing N-body Algorithms Efficiently in Data-Parallel.. - Hu, Johnsson

....used different algorithms, different problem sizes and parameters controlling the accuracies. The Barnes-Hut O(N log N) algorithm has been implemented using the message-passing programming paradigm by Salmon and Warren [14, 15, 16] on the Intel Touchstone Delta and by Liu and Bhatt [17, 18] on the CM-5. Both groups used assembly language for time-critical kernels and achieved efficiencies in the range 24-28% and 30%, respectively. Zhao and Johnsson [19] developed a data-parallel implementation of Zhao's method on the CM-2, and achieved an efficiency of 12% for expansions in ....

....are evaluated hierarchically.
Table 1. Efficiencies of various parallel implementations of hierarchical N-body methods
Author | Programming model | Efficiency (% of peak) | Machine
Salmon; Warren and Salmon [14, 15, 16] | F77, message passing | 24-28 | 512-node Intel Delta
Liu and Bhatt [17, 18] | C, message passing, assembly | 30 | 256-node CM-5
Leathrum and Board [20, 21] | F77 | 20 | 32-node KSR-1
Elliott and Board [22] | F77 | 14 | 32-node KSR-1
Zhao and Johnsson [19] | Lisp, assembly | 12 | 256-node (8k) CM-2
Hu and Johnsson [this article] | CMF | 27-35 | 256-node CM-5/5E
The hierarchy of computational elements ....

P. Liu and S. N. Bhatt, "Experiences with parallel N-body simulation," in Proc. 6th Annual ACM Symposium on Parallel Algorithms and Architecture, Cape May, NJ, June 1994.


CS 395T Programming Parallel Algorithms Spring 1996 - Section Professor

....for an implementation project. The Barnes-Hut N-body algorithm. The Barnes-Hut algorithm simulates the interactions among N objects, such as stars. The algorithm uses a hierarchical spatial decomposition in order to compute the interactions in O(N lg N) time instead of the naive O(N^2) time [3, 2, 16]. A parallel implementation using Cilk is probably a two-person project. A branch-and-bound algorithm to solve the traveling salesman problem (TSP). Branch and bound is a heuristic technique used to solve combinatorial problems such as the traveling salesman problem [1, 13]. The space of ....

Pangfeng Liu and Sandeep N. Bhatt. Experiences with parallel N-body simulation. In Proceedings of the Sixth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 122--131, Cape May, New Jersey, June 1994.
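The course description summarizes the Barnes-Hut idea: group bodies into a spatial tree and treat each sufficiently distant cell as a single point mass. A minimal 2D Python sketch, assuming a quadtree, monopole (center-of-mass) approximations only, a softening length `eps`, and the common acceptance test `cell_size / distance < theta`; every name here is illustrative, and this is not the Liu-Bhatt CM-5 code.

```python
import math
from typing import List, Tuple

Vec = Tuple[float, float]


def pull(p: Vec, q: Vec, m: float, eps: float) -> Vec:
    """Softened acceleration at p due to mass m at q (gravitational constant = 1)."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    r2 = dx * dx + dy * dy + eps * eps
    f = m / (r2 * math.sqrt(r2))
    return f * dx, f * dy


class Cell:
    """Square quadtree cell storing the total mass and center of mass of its bodies."""

    def __init__(self, idx, pos, mass, cx, cy, half, depth=0, max_depth=32):
        self.cx, self.cy, self.half = cx, cy, half
        self.mass = sum(mass[i] for i in idx)
        self.com = (sum(mass[i] * pos[i][0] for i in idx) / self.mass,
                    sum(mass[i] * pos[i][1] for i in idx) / self.mass)
        self.bodies: List[int] = []
        self.children: List["Cell"] = []
        if len(idx) <= 1 or depth >= max_depth:
            self.bodies = list(idx)  # leaf: keep bodies for direct summation
            return
        quads: List[List[int]] = [[], [], [], []]
        for i in idx:  # quadrant bit 0 = right half, bit 1 = upper half
            quads[int(pos[i][0] >= cx) + 2 * int(pos[i][1] >= cy)].append(i)
        h = half / 2
        for q, sub in enumerate(quads):
            if sub:
                ox = cx + (h if q & 1 else -h)
                oy = cy + (h if q & 2 else -h)
                self.children.append(Cell(sub, pos, mass, ox, oy, h, depth + 1, max_depth))


def build(pos: List[Vec], mass: List[float]) -> Cell:
    """Root cell: the smallest square (slightly padded) enclosing all bodies."""
    xs, ys = [p[0] for p in pos], [p[1] for p in pos]
    half = max(max(xs) - min(xs), max(ys) - min(ys)) / 2 + 1e-9
    return Cell(list(range(len(pos))), pos, mass,
                (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2, half)


def accel(cell: Cell, i: int, pos, mass, theta: float, eps: float) -> Vec:
    """Acceleration on body i; cells failing the size/distance test are opened."""
    p = pos[i]
    if not cell.children:  # leaf: exact pairwise sum, skipping self-interaction
        ax = ay = 0.0
        for j in cell.bodies:
            if j != i:
                fx, fy = pull(p, pos[j], mass[j], eps)
                ax, ay = ax + fx, ay + fy
        return ax, ay
    # A cell containing body i is always opened, so i never feels its own mass.
    inside = abs(p[0] - cell.cx) <= cell.half and abs(p[1] - cell.cy) <= cell.half
    d = math.hypot(cell.com[0] - p[0], cell.com[1] - p[1])
    if not inside and d > 0 and 2 * cell.half / d < theta:
        return pull(p, cell.com, cell.mass, eps)  # distant cell: one monopole term
    ax = ay = 0.0
    for child in cell.children:
        fx, fy = accel(child, i, pos, mass, theta, eps)
        ax, ay = ax + fx, ay + fy
    return ax, ay
```

With `theta = 0` no cell is ever accepted, so the traversal reduces to the O(N^2) direct sum; larger `theta` trades accuracy for fewer interactions, which is where the O(N log N) behavior comes from.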


A Parallel Adaptive Fast Multipole Algorithm For N-Body Problems - Sanjeev Krishnan (1995)   (2 citations)

....high message latencies. Thus it is not possible to get good performance with fine-grained, receiver-initiated communication, as is possible for shared-memory machines [11]. We have extensively used a sender-initiated advance send protocol to reduce communication overhead in our implementation. In [4, 9] a form of advance send is used in the context of the Barnes-Hut method by sending particle data to remote nodes instead of requesting for their particles. For the AFMA, the basic problem to be solved before using advance send is for each processor to determine which other processors will be ....

P. Liu and S. Bhatt. Experiences with parallel n-body simulation. In 6th Annual ACM Symposium on Parallel Algorithms and Architectures, 1994.


Efficient Data Parallel Implementations of Highly Irregular Problems - Hu (1997)

....programming models, as summarized in Table 3.4. Barnes and Hut's O(N log N) method has been implemented using the message-passing programming paradigm by Salmon et al. [Sal90, WS92, WS93] on the Intel Touchstone Delta and by Liu and Bhatt [Liu94, LB94] on the CM-5. Salmon et al. achieved efficiencies in the range 24-28%, while Liu, using assembly language for time-critical kernels, achieved 30% efficiency. Zhao and Johnsson developed a SIMD implementation of Zhao's method on the CM-2, and achieved an efficiency of 12% for expansions ....

....O(N log N) methods:
Salmon [Sal90] | BH, quadrupole | MP | Ncube
Warren and Salmon [WS92] | BH, quadrupole | MP | 8.78M | 26 | 180K | 512 | Intel Delta
Warren and Salmon [WS93] | BH, ε_1 = 10^-3 | MP | 8.78M | 28 | 266K | 512 | Intel Delta
Warren and Salmon [WS95] | BH, ε_1 = 10^-2 | MP | 2M | 111K | 256 | CM-5E
Liu and Bhatt [LB94] | BH, quadrupole | MP | 10M | 30 | 97K | 256 | CM-5
Singh et al. [SHG92] | BH | SM | DASH, KSR-1
Non-adaptive O(N) methods:
Leathrum and Board [Lea92] | GR, p=8 | 100K | 65 | 250K | 1 | RS/6000 360
  | GR, p=8 | SM | 1M | 20 | 32 | KSR-1
Elliott and Board [EB94] | GR, FFT, p=8 | 100K | 73 | 200K | 1 | RS/6000 360
  | GR, FFT, p=8 | SM | 1M | 14 | 32 | KSR-1 ....

Pangfeng Liu and Sandeep N. Bhatt. Experiences with parallel N-body simulation. In 6th Annual ACM Symposium on Parallel Algorithms and Architecture, pages 122--131. ACM, 1994.


Automating Runtime Optimizations For Parallel Object-Oriented.. - Krishnan

....high message latencies. Thus it is not possible to get good performance with fine-grained, receiver-initiated communication, as is possible for shared-memory machines [89]. We have extensively used a sender-initiated advance send protocol to reduce communication overhead in our implementation. In [88, 90] a form of advance send is used in the context of the Barnes-Hut method by sending particle data to remote nodes instead of requesting for their particles. For the AFMA, the basic problem to be solved before using advance send is for each processor to determine which other processors will be ....

P. Liu and S. Bhatt. Experiences with parallel n-body simulation. In 6th Annual ACM Symposium on Parallel Algorithms and Architectures, 1994.


Highly Portable and Efficient Implementations of Parallel.. - Blackston, Suel (1997)   (4 citations)

....(e.g. see [4, 6, 17, 22, 27, 29]) and of the O(N log N) adaptive Barnes-Hut algorithm (e.g. see [20, 4, 24, 28, 26]), there are only very few parallel implementations of the O(N) adaptive methods. In this paper, we focus on the efficient parallel implementation of O(N) adaptive tree codes, and in particular on the adaptive Fast Multipole Method [7] and the closely related adaptive version of Anderson's Method ....

....and discuss their relation to our work. Due to the large amount of previous work, we can only discuss the most closely related work. 3.1 Barnes-Hut Algorithm. Over the last decade, there have been a large number of parallel implementations of the Barnes-Hut algorithm; a few examples are [20, 4, 24, 28, 26]. In [28] Warren and Salmon describe a fast message-passing implementation of the Barnes-Hut algorithm. They propose the use of locally essential trees to obtain a purely sender-driven protocol for replicating nodes that are accessed by several processors during the force computation phase. This ....

[Article contains additional citation context not shown here]

Pangfeng Liu and Sandeep N. Bhatt. Experiences with parallel N-body simulations. In 6th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA'94), pages 122--131, 1994.


BSPlib: The BSP Programming Library - Hill, McColl, Stefanescu.. (1998)   (58 citations)

....inserting all bodies that are located in its computational domain and computing the centres of mass of that tree. Then appropriate subtrees, called locally essential trees, are exchanged between the processors, using a replication scheme similar to those of Warren and Salmon [41] and Liu and Bhatt [29]. Afterwards, every processor has a local octree that contains all the data needed to perform the tree traversal on its bodies, and whose structure is consistent with that of the global octree constructed by the sequential algorithm. The BSPlib implementation was obtained by porting a code ....

P. Liu and S. N. Bhatt. Experiences with parallel N-body simulations. In Sixth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 122--131, 1994.
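The locally essential tree exchange described here requires each processor to decide which parts of its local tree another processor could need. A minimal 2D Python sketch of that selection, in the spirit of the replication scheme credited here to Warren and Salmon and to Liu and Bhatt, assuming the receiver's domain is an axis-aligned box and that the receiver will apply the same `size / distance < theta` acceptance test to each cell's center of mass; the `LCell` type and `essential_cells` function are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax) of the receiver's domain


@dataclass
class LCell:
    """Hypothetical local-tree cell: side length, center of mass, total mass, children."""
    size: float
    com: Tuple[float, float]
    mass: float
    children: List["LCell"] = field(default_factory=list)  # empty for a leaf


def min_dist(p: Tuple[float, float], box: Box) -> float:
    """Distance from point p to the nearest point of an axis-aligned box (0 if inside)."""
    xmin, ymin, xmax, ymax = box
    dx = max(xmin - p[0], 0.0, p[0] - xmax)
    dy = max(ymin - p[1], 0.0, p[1] - ymax)
    return math.hypot(dx, dy)


def essential_cells(cell: LCell, box: Box, theta: float) -> List[LCell]:
    """Cells to send to the owner of `box` so it can compute forces on its bodies.

    If even the closest point of `box` would accept `cell` (size / distance < theta),
    every body there would, so the cell is sent as a summary and its subtree is
    pruned. Otherwise the children are examined; a leaf that fails the test is
    sent whole, standing in for its bodies.
    """
    d = min_dist(cell.com, box)
    if (d > 0 and cell.size / d < theta) or not cell.children:
        return [cell]
    selected: List[LCell] = []
    for child in cell.children:
        selected.extend(essential_cells(child, box, theta))
    return selected
```

Using the nearest point of the receiver's box makes the test conservative: a cell pruned here would also be accepted by every body the receiver owns, so the receiver never needs data it was not sent.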


Parallel Progressive Radiosity with Adaptive Meshing - Yu, Ibarra, Yang (1997)   (1 citation)

....all processors. Since the amount of computation differs for different patches, it is not sufficient to only make each processor have an equal number of patches. The problem of partitioning irregular data space has been studied in other scientific application areas, for example N-body simulations [7]. Usually the quadtree (octree) partitioning method is used and neighboring leafs are directly mapped to the same processor. This mapping may not be effective for our problem, which will be justified in our performance analysis. Dynamic load balancing that migrates patches between processors looks ....

Liu, P., Bhatt, S.: Experiences with parallel N-body simulation. Proceedings of 6th Annual ACM Symposium on Parallel Algorithms and Architectures (1994) 122-131.


A Data-Parallel Implementation of O(N) Hierarchical N-body.. - Hu, Johnsson

....particle-adaptive O(N log N) methods:
Salmon [27] | BH, quadrupole | MP | Ncube
Warren and Salmon [33] | BH, quadrupole | MP | 8.78M | 26 | 180K | 512 | Intel Delta
Warren and Salmon [34] | BH, ε_1 = 10^-3 | MP | 8.78M | 28 | 266K | 512 | Intel Delta
Warren and Salmon [35] | BH, ε_1 = 10^-2 | MP | 2M | 111K | 256 | CM-5E
Liu and Bhatt [24] | BH, quadrupole | MP | 10M | 30 | 97K | 256 | CM-5
Singh et al. [29] | BH | SM | DASH, KSR-1
Non-adaptive O(N) methods:
Leathrum and Board [23] | GR, p=8 | 100K | 65 | 250K | 1 | RS/6000 360
  | GR, p=8 | SM | 1M | 20 | 32 | KSR-1
Elliott and Board [9] | GR, FFT, p=8 | 100K | 73 | 200K | 1 | RS/6000 360
  | GR, FFT, p=8 | SM | 1M | 14 | 32 | KSR-1
Schmidt and Lee ....

....not distinguish nodal architecture, e.g. superscalar architectures can perform multiple operations per cycle. Barnes and Hut's O(N log N) method has been implemented using the message-passing programming paradigm by Salmon and Warren [27, 33, 34] on the Intel Touchstone Delta and by Liu and Bhatt [24] on the CM-5. Both groups used assembly language for time-critical kernels. Salmon and Warren achieved efficiencies in the range 24-28%, while Liu and Bhatt achieved 30% efficiency. Recently, Warren and Salmon [35] extended their code to incorporate multipole and local expansions and made it ....

P. Liu and S. N. Bhatt. Experiences with parallel N-body simulation. In 6th Annual ACM Symposium on Parallel Algorithms and Architecture, pages 122--131. ACM, 1994.


Portable and Efficient Parallel Computing Using the BSP .. - Goudreau, Lang, Rao.. (1998)   (7 citations)

....Barnes-Hut algorithm [4] which uses an irregular octree structure, called BH tree, to hierarchically group bodies into clusters according to their distribution in three-dimensional space. The basic structure of our implementation is similar to those of Warren and Salmon [61] and Liu and Bhatt [45]. In particular, we use the ORB partitioning scheme to partition the bodies among the processors. Instead of repartitioning the bodies after each iteration as in [61], we only do so if the load imbalance reaches a certain threshold, as suggested in [45]. The positions of the bodies are updated in ....

....of Warren and Salmon [61] and Liu and Bhatt [45]. In particular, we use the ORB partitioning scheme to partition the bodies among the processors. Instead of repartitioning the bodies after each iteration as in [61], we only do so if the load imbalance reaches a certain threshold, as suggested in [45]. The positions of the bodies are updated in discrete time steps. In each step, the BH tree is first constructed locally inside each processor. Then appropriate subtrees, called locally essential trees, are exchanged between every pair of processors, such that afterwards every processor has a ....

P. Liu and S. N. Bhatt, "Experiences with parallel N-body simulations," in 6th Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 122--131, 1994.
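The Goudreau et al. excerpt describes repartitioning only when the load imbalance reaches a threshold, rather than after every iteration. A minimal Python sketch of that trigger, assuming imbalance is measured as the most loaded processor's work divided by the mean and that the loads are positive measurements from the last time step; the 1.10 default and the function names are hypothetical, not values taken from either cited implementation.

```python
from typing import Sequence


def load_imbalance(loads: Sequence[float]) -> float:
    """Max per-processor load divided by the mean load (1.0 means perfectly balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


def should_repartition(loads: Sequence[float], threshold: float = 1.10) -> bool:
    """Rebuild the partition only once the imbalance reaches `threshold`."""
    return load_imbalance(loads) >= threshold


if __name__ == "__main__":
    history = [[10, 10, 10, 10], [10, 10.5, 10, 9.5], [9, 10, 10, 15]]
    for step, loads in enumerate(history):
        print(step, round(load_imbalance(loads), 3), should_repartition(loads))
```

Only the last step in this example crosses the threshold, so the partition would be rebuilt once instead of three times; the threshold trades partitioning cost against time lost to the slowest processor.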


Towards Efficiency and Portability: Programming with .. - Goudreau, Lang.. (1996)   (43 citations)

....is based on the Barnes-Hut algorithm [3] which uses an irregular octree structure, called BH tree, to hierarchically group bodies into clusters according to their distribution in three-dimensional space. Our parallel implementation is similar to those of Warren and Salmon [37] and Liu and Bhatt [23]. In particular, we use the ORB partitioning scheme to partition the bodies among the processors. Instead of repartitioning the bodies after each iteration as in [37], we only do so if the load imbalance reaches ....

....(Figure 3.2: Algorithmic and model summaries for large problem size on 16-processor SGI system.) a certain threshold, as suggested in [23]. The positions of the bodies are updated in discrete time steps. In each step, the BH tree is first constructed locally inside each processor. Then appropriate subtrees, called essential trees, are exchanged between every pair of processors, such that afterwards every processor has a local BH ....

P. Liu and S. Bhatt. Experiences with parallel N-body simulation. Proc. 6th ACM Symp. on Parallel Algorithms and Architectures, pages 122--131, June, 1994.


Highly Portable and Efficient Implementations of Parallel.. - David Blackston (1997)   (4 citations)

....tuned for high performance. As a result, while there have been many parallel implementations of non-adaptive O(N) methods (e.g. see [4, 6, 17, 22, 27, 29]) and of the O(N log N) adaptive Barnes-Hut algorithm (e.g. see [20, 4, 24, 28, 26]), there are only very few parallel implementations of the O(N) adaptive methods. In this paper, we focus on the efficient parallel implementation of O(N) adaptive tree codes, and in particular on the adaptive Fast Multipole Method [7] and the closely related adaptive version of Anderson's Method ....

....and discuss their relation to our work. Due to the large amount of previous work, we can only discuss the most closely related work. 3.1 Barnes-Hut Algorithm. Over the last decade, there have been a large number of parallel implementations of the Barnes-Hut algorithm; a few examples are [20, 4, 24, 28, 26]. In [28] Warren and Salmon describe a fast message-passing implementation of the Barnes-Hut algorithm. They propose the use of locally essential trees to obtain a purely sender-driven protocol for replicating nodes that are accessed by several processors during the force computation phase. This ....

[Article contains additional citation context not shown here]

Pangfeng Liu and Sandeep N. Bhatt. Experiences with parallel N-body simulations. In 6th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA'94), pages 122--131, 1994.


Fast Simulation of Wireless Systems - Perrone

....assigned to different processors. Each processor will run a simulation with the complete set of channels and the computation of noise will require that all processors cooperate exchanging transmitter power values. Fortunately, there has been extensive research on parallelization of N-body methods [34, 43, 66, 70, 62, 61, 69] and we expect to draw from this pool of knowledge, if necessary. Our first step towards a parallel implementation will be the development of a sequential simulator. This program should be used as a basis of comparison for performance analysis of its parallel counterpart (speedup) and also as an ....

Pangfeng Liu and Sandeep N. Bhatt. Experiences with parallel N-body simulation. In SPAA 94, pages 122--131, 1994.


A Framework for Parallel Tree-Based Scientific Simulations - Liu, Wu (1997)   (2 citations)  Self-citation (Liu)

....is to develop a general N-body framework that eases the difficult task of writing efficient parallel N-body codes. The framework was developed based on our previous CM-5 implementations [3, 4, 10] in which we developed sound techniques that have been carefully studied both experimentally [9] and mathematically [8]. We expect that these proven techniques will guide us towards the ultimate goal of writing efficient parallel N-body programs with ease. The remainder of this paper is organized as follows. Section 2 explains the Barnes and Hut's algorithm. Section 3 briefly describes our ....

....desired number of time steps. (Footnote 1: In practice it is more efficient to truncate each branch when the number of particles in its subtree decreases below a certain fixed bound.) 3 Parallel Implementation. In the following subsections, we point out the differences between our parallel implementations [3, 4, 9, 10] and the generic sequential Barnes-Hut algorithm. 3.1 Data partitioning. The default strategy that we use to distribute bodies among processors is orthogonal recursive bisection (ORB). The space bounding all the bodies is recursively partitioned into as many boxes as there are processors, and ....

[Article contains additional citation context not shown here]

P. Liu and S. Bhatt. Experiences with parallel n-body simulation. In 6th Annual ACM Symposium on Parallel Algorithms and Architecture, 1994.
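The data-partitioning excerpt in this entry describes ORB as recursively splitting the space bounding all bodies into as many boxes as there are processors. A minimal 2D Python sketch, assuming a power-of-two processor count, per-body weights (for example, each body's work in the previous step), and a cut along the longer side of the current group's bounding box at the weighted median; the axis rule and all names are illustrative assumptions, since the excerpt does not say how the cut axis is chosen (alternating axes is another common choice).

```python
from typing import List, Tuple

Point = Tuple[float, float]


def orb(points: List[Point], weights: List[float], nprocs: int) -> List[int]:
    """Orthogonal recursive bisection: return the owning processor of each point.

    Each level cuts the current group along the longer side of its bounding
    box at the weighted median, so both halves carry roughly equal weight.
    """
    if nprocs < 1 or nprocs & (nprocs - 1):
        raise ValueError("nprocs must be a power of two")
    owner = [0] * len(points)

    def split(idx: List[int], first: int, count: int) -> None:
        if count == 1 or not idx:
            for i in idx:
                owner[i] = first
            return
        xs = [points[i][0] for i in idx]
        ys = [points[i][1] for i in idx]
        axis = 0 if max(xs) - min(xs) >= max(ys) - min(ys) else 1
        idx = sorted(idx, key=lambda i: points[i][axis])
        half = sum(weights[i] for i in idx) / 2.0
        acc, cut = 0.0, 0
        while cut < len(idx) and acc < half:  # shortest prefix holding half the weight
            acc += weights[idx[cut]]
            cut += 1
        split(idx[:cut], first, count // 2)             # lower box
        split(idx[cut:], first + count // 2, count // 2)  # upper box

    split(list(range(len(points))), 0, nprocs)
    return owner
```

Because each cut is placed by weight rather than by count, a processor whose bodies are expensive receives fewer of them; on a uniform 4x4 grid with equal weights and four processors, each processor ends up owning one 2x2 quadrant.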


Tree codes for vortex dynamics: Application of a.. - Bhatt, Liu.. (1995)   (1 citation)  Self-citation (Liu Bhatt)

....[34] reported experiments on the 512-node Intel Touchstone Delta, and later developed hashed implementations of a global tree structure which they report in [35, 18]. They have used their codes for astrophysical simulations and also for vortex dynamics. This paper builds on our CM-5 implementation [25] of the Barnes-Hut algorithm for astrophysical simulations and contrasts our approach and conclusions with the aforementioned efforts. This abstract is organized as follows. Section 2 describes the application problem in some detail, and outlines the Barnes-Hut fast summation algorithm. Section 3 ....

....successor particles along a filament, and also for building the global BH tree. Furthermore, ORB preserves data locality reasonably well and permits simple load balancing. While it is expensive to recompute the ORB at each time step [28], the cost of incremental load balancing is negligible [25]. The ORB decomposition is incrementally updated in parallel as follows. At the end of an iteration, each ORB tree node is assigned a weight equal to the total number of operations performed in updating the states of particles in each of the processors which is a descendant of the node. A node is ....

P. Liu and S. Bhatt. Experiences with parallel n-body simulation. In 6th Annual ACM Symposium on Parallel Algorithms and Architecture, 1994.


A Comparison of Three Programming Models for Adaptive.. - Shan, Singh, Oliker.. (2002)   (9 citations)

No context found.

P. Liu and S. N. Bhatt, Experiences with parallel N-body simulation, in "Proc. 6th ACM Symposium on Parallel Algorithms and Architectures, ACM SIGACT and SIGARCH, Cape May, NJ, 1994," pp. 122--131.

CiteSeer.IST - Copyright Penn State and NEC