G.E. Blelloch, P.B. Gibbons, Y. Matias, and M. Zagha. Accounting for memory bank contention and delay in high-bandwidth multiprocessors. IEEE Trans. Par. Dist. Sys., 8(9):943--958, 1997.
....analysis. In recent years, a number of BSP variants have been formulated in the literature, whose definitions incorporate additional provisions aimed at improving the model's effectiveness relative to actual platforms without affecting its usability and portability significantly (see e.g. [BGMZ95, BDM95, JW96b, DK96]). Among these variants, the E-BSP (Extended BSP) by [JW96b] and the D-BSP (Decomposable BSP) by [DK96] are particularly relevant for this paper. E-BSP aims at predicting more accurately the cost of supersteps with unbalanced communication patterns, where the average number h_ave ....
G.E. Blelloch, P.B. Gibbons, Y. Matias, and M. Zagha. Accounting for memory bank contention and delay in high-bandwidth multiprocessors. In Proc. of the 7th ACM Symp. on Parallel Algorithms and Architectures, pages 84--94, Santa Barbara, CA, July 1995.
....tree-based barrier and a heavyweight barrier and placing in our libraries architecture-specific information that can replace the heavy barrier with the lightweight one whenever the architecture permits it. 3.2 Complexity model for shared memory. Various cost models have been proposed for SMPs [1, 2, 3, 4, 6, 16, 18, 38, 45]; we chose the Helman and JaJa model [18] because it gave us the best match between our analyses and our experimental results. Since the number of processors used in our experiments is relatively small (not exceeding 64), contention at the memory location is negligible compared to the contention ....
G. E. Blelloch, P. B. Gibbons, Y. Matias, and M. Zagha. Accounting for Memory Bank Contention and Delay in High-Bandwidth Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 8(9):943--958, 1997.
....can be equipped with cache memories which have about the same cycle time as the processors or can be partitioned into multibanks. Since the cost of the cache memory is high and its size is limited, the multibank partition has mostly been adopted, especially in shared memory multiprocessors [3]. However, the effectiveness of such a memory partition can be limited by memory conflicts, that occur when there are many references to the same memory bank while accessing the same memory pattern. To exploit to the fullest extent the performance of the multibank partition, mapping schemes can be ....
G.E. Blelloch, P.B. Gibbons, Y. Matias and M. Zagha, "Accounting for Memory Bank Contention and Delay in High-Bandwidth Multiprocessors", IEEE Trans. on Parallel and Distrib. Systems, Vol. 8, 1997, pp. 943-958.
....only sequentially since they are all stored in the same memory bank. Thus, access to different templates needs different mappings, which makes it hard to solve the problem in general by a unified scheme. The relevance of the memory bank contention problem is such that recently Blelloch et al. [3] have extended Valiant's Bulk Synchronous Parallel (BSP) model with two additional parameters: the bank delay and the expansion factor. The bank delay (d) is the throughput at a memory bank and the expansion factor (x) is the ratio of the number of memory banks to the number of processors. This new ....
....the number of memory banks to the number of processors. This new model, called the (d,x) BSP, can be used to predict the performance of parallel machines with fairly good accuracy in the presence of a large bank delay or a large number of processors (i.e., high contention). Experimental results in [3] show that if the expansion factor is sufficiently large and the memory access pattern is irregular, a random mapping of the memory locations to banks is enough to balance the memory references across all the banks, and therefore to limit contention. Nevertheless, in all cases where memory is ....
G.E. Blelloch, P.B. Gibbons, Y. Matias and M. Zagha, "Accounting for Memory Bank Contention and Delay in High-Bandwidth Multiprocessors", IEEE Trans. on Parallel and Distrib. Systems, Sept 1997, pp. 943-958.
....simpler programming style is perhaps to be preferred. Other bridging models have been proposed. Candidate Type Architecture (CTA) [AGL98, Sny86] is an early two-parameter model (communication cost L and number of processors p) that was the result of a multidisciplinary effort. Blelloch et al. [BGM95] propose the (d,x) BSP model as a refinement for BSP 14 that provides more detailed modeling of memory bank contention and delay. LogP HMM [LMR95] extends the LogP model with a hierarchical memory model characterizing each processor. Each of the models discussed above are distributed memory ....
Guy Blelloch, Phil Gibbons, Yossi Matias, and Marco Zagha. "Accounting for Memory Bank Contention and Delay in High-Bandwidth Multiprocessors." In Seventh ACM Symposium on Parallel Algorithms and Architectures, pp. 84--94, June 1995.
....for the development of portable software (see e.g. [9]). Although the original BSP is meant to model message-passing architectures, two BSP variants specifically tailored to shared memory systems have been recently developed, namely, the Queuing Shared Memory (QSM) [8] and the (d,x) BSP [4], which both embody some aspects of memory contention. In particular, QSM's cost function includes a parameter that accounts for the maximum number of concurrent accesses to the same memory location, while (d,x) BSP's cost function accounts for memory bank contention (parameters d and x ....
G.E. Blelloch, P.B. Gibbons, Y. Matias, and M. Zagha. Accounting for memory bank contention and delay in high-bandwidth multiprocessors. IEEE Trans. Par. Dist. Sys., 8(9):943--958, 1997.
....Although the original BSP is meant to model distributed memory architectures, where communication is realized via message passing, two BSP variants specifically tailored to shared memory systems have been recently developed, namely, the Queuing Shared Memory (QSM) [GMR99] and the (d,x) BSP [BGMZ97], which both embody some aspects of memory contention. In particular, QSM's cost function includes a parameter that accounts for the maximum number of concurrent accesses to the same memory location, while (d,x) BSP's cost function accounts for memory bank contention (parameters d and x ....
G.E. Blelloch, P.B. Gibbons, Y. Matias, and M. Zagha. Accounting for memory bank contention and delay in high-bandwidth multiprocessors. IEEE Trans. on Parallel and Distributed Systems, 8(9):943--958, 1997.
....symmetric multiprocessors. All the models mentioned so far focus on the relative cost of accessing different levels of memory. On the other hand, a number of shared memory models have focused instead on the contention caused by multiple processors competing to access main memory. Blelloch et al. [6] proposed the (d,x) BSP model, an extension to the Bulk Synchronous Parallel model, in which main memory is partitioned amongst px banks. In this model, the time required for execution is modeled by five variables, which together describe the amount of time required for computation, the maximum ....
G.E. Blelloch, P.B. Gibbons, Y. Matias, and M. Zagha. Accounting for Memory Bank Contention and Delay in High-Bandwidth Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 8(9):943--958, 1997.
....that particular mode. A complete description of F, including detailed validation results, can be found in [2]. We present here a synopsis of the results. The function F was compared with three BSP-like cost functions based, respectively, on the Queuing Shared Memory (QSM) [15] and the (d,x) BSP [8], which both embody some aspects of memory contention, and the Extended BSP (EBSP) model [19], which extends the BSP to account for unbalanced communication. Since the BSP-like functions do not account for the memory hierarchy, we determined an optimistic (min) version and a pessimistic (max) ....
G.E. Blelloch, P.B. Gibbons, Y. Matias, and M. Zagha. Accounting for memory bank contention and delay in high-bandwidth multiprocessors. IEEE Trans. Par. Dist. Sys., 8(9):943--958, 1997.
....for the development of portable software (see e.g. [9]). Although the original BSP is meant to model message-passing architectures, two BSP variants specifically tailored to shared memory systems have been recently developed. Namely, the Queuing Shared Memory (QSM) [8] and the (d,x) BSP [4], which both embody some aspects of memory contention. In particular, QSM's cost function includes a parameter that accounts for the maximum number of concurrent accesses to the same memory location, while (d,x) BSP's cost function accounts for memory bank contention (parameters d and x ....
G.E. Blelloch, P.B. Gibbons, Y. Matias, and M. Zagha. Accounting for memory bank contention and delay in high-bandwidth multiprocessors. IEEE Trans. Par. Dist. Sys., 8(9):943--958, 1997.
....memory in unit time. However, in practice, the available per-processor bandwidth to shared memory can be quite small. Access to shared memories is slowed by such factors as long message send overheads [CKP 96], contention at memory banks, the fact that memory banks are much slower than processors [BGMZ95], and bandwidth limitations of the network connecting processors to memory banks. Similar difficulties exist in distributed memory parallel machines. The parameter m of the PRAM(m) model focuses attention on this bottleneck, by enforcing the condition that the shared memory can service only m ....
G. Blelloch, P. Gibbons, Y. Matias and M. Zagha. Accounting for Memory Bank Contention and Delay in High-Bandwidth Multiprocessors. In Proc. 7th ACM Symposium on Parallel Algorithms and Architectures, pp. 84-94, 1995.
....model that balances these conflicting requirements has proved a difficult task, a fact amply demonstrated by the proliferation of models in the literature over the years. The BSP [1] and the LogP [3, 2] models have been proposed in this context and have attracted considerable attention (see [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] for BSP and [16, 17, 18, 10, 19] for LogP). In both models the communication capabilities of the machine are summarized by a few parameters that broadly capture bandwidth and latency properties. In BSP, the fundamental primitives are global barrier synchronization and the routing of arbitrary ....
G.E. Blelloch, P.B. Gibbons, Y. Matias, and M. Zagha. Accounting for memory bank contention and delay in high-bandwidth multiprocessors. IEEE Trans. on Parallel and Distributed Systems, 8(9):943--958, September 1997.
....symmetric multiprocessors. All the models mentioned so far focus on the relative cost of accessing different levels of memory. On the other hand, a number of shared memory models have focused instead on the contention caused by multiple processors competing to access main memory. Blelloch et al. [6] proposed the (d,x) BSP model, an extension to the Bulk Synchronous Parallel model, in which main memory is partitioned amongst px banks. In this model, the time required for execution is modeled by five variables, which together describe the amount of time required for computation, the maximum ....
G.E. Blelloch, P.B. Gibbons, Y. Matias, and M. Zagha. Accounting for Memory Bank Contention and Delay in High-Bandwidth Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 8(9):943--958, 1997.
....are unnecessarily complicated to describe the behavior of existing symmetric multiprocessors. Other models have been proposed which focus instead on the contention caused by multiple processors competing to access the same location in main memory, including the (d,x) BSP model of Blelloch et al. [4] and the Queuing Shared Memory (QSM) of Gibbons et al. [5]. The difficulty with these models is that while they address an issue which has an important impact on performance, the contention they describe depends on specific implementation details such as the memory map which may be entirely ....
....cache. Finally, the [S] benchmark is designed so that where possible the successor is always a constant stride away. Even though we chose the stride to be 1001, so that each step of the sublist traversal should involve accessing a non-contiguous location
Number of threads        [1]                 [2]                 [4]                 [8]                 [16]
Benchmark step      [R]   [S]   [O]     [R]   [S]   [O]     [R]   [S]   [O]     [R]   [S]   [O]     [R]   [S]   [O]
(1)-(3)             0.59  0.87  0.66    0.34  0.40  0.34    0.18  0.21  0.18    0.10  0.12  0.10    0.08  0.08  0.08
(4)                 6.69  1.86  2.33    3.40  1.08  1.17    1.75  0.57  0.59    0.96  0.31  0.30    0.74  0.22  0.18
(5)                 0.01  0.12  0.01    0.01  0.04  0.01    0.01  0.05  0.01    0.01  0.06  ....
G.E. Blelloch, P.B. Gibbons, Y. Matias, and M. Zagha. Accounting for Memory Bank Contention and Delay in High-Bandwidth Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 8(9):943--958, 1997.
....its suitability for the development of portable software (see e.g. [GHM 96]). However, as has been often observed, the simple BSP cost model offers only a coarse level of predictivity, since it disregards architectural features of real machines which may have a dramatic impact on performance [BGMZ97, JW96]. Similar in spirit to BSP, but based on different programming paradigms, are the CG Model [DFRC93], which focuses on algorithmic rather than architectural issues, and the LOGP model [CKP 96], a fully asynchronous model whose performance predictions tend to be more accurate than those ....
....are the CG Model [DFRC93], which focuses on algorithmic rather than architectural issues, and the LOGP model [CKP 96], a fully asynchronous model whose performance predictions tend to be more accurate than those provided by BSP but are still subject to the same limitations [BHP 96, BGMZ97]. The BSP opened the way to a rich line of research which resulted in the definition of a number of variants that maintain the basic bulk synchronous approach to parallel programming but try to enhance the predictive quality of the associated cost model. Among these, we mention the BSP [BDM95] ....
[Article contains additional citation context not shown here]
G.E. Blelloch, P.B. Gibbons, Y. Matias, and M. Zagha. Accounting for memory bank contention and delay in high-bandwidth multiprocessors. IEEE Trans. on Parallel and Distributed Systems, 8(9):943--958, 1997.
....parallel performance models neglect the effects of shared memory contention and synchronization on application performance. For several decades, many papers have presented analytic or queuing models for single-tier shared memory system performance under a variety of assumptions (e.g. [105, 17, 79, 122, 139, 46, 84, 31, 99, 24, 114, 26]). Most of this work relies on assumptions regarding typical application behavior, and it is not clear how to apply these models to obtain performance predictions for a given application. Additionally, this dissertation has demonstrated that interactions between shared memory activity and ....
G. E. Blelloch, P. B. Gibbons, Y. Matias, and M. Zagha. Accounting for memory bank contention and delay in high-bandwidth multiprocessors. In Proceedings of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 1--12, Santa Barbara, CA, July 1995.
....h-relation. In particular, the BSP model will assign a higher cost to small h-relations in which a processor exchanges packets with many other processors, and a lower cost to small h-relations where all h packets are sent to the same destination. The dx-BSP ("deluxe BSP") model proposed in [12] attempts to model the performance of high-bandwidth shared memory machines such as the Cray C90 in which the memory banks are significantly slower than the processors. Finally, the Extended BSP (E-BSP) model [38] provides a more accurate cost function for unbalanced communication in networks ....
G. Blelloch, P. Gibbons, Y. Matias, and M. Zagha, "Accounting for memory bank contention and delay in high-bandwidth multiprocessors," in Seventh ACM Symposium on Parallel Algorithms and Architectures, pp. 84--94, June 1995.
....(as is needed for prams). Efficient bulk synchronization is an option on these machines, but is not imposed. An extensive study of algorithms and results for the qrqw pram can be found in [GMR96b, GMR96a]. In addition, experimental results for the qrqw pram on the Cray C90 and J90 can be found in [BGMZ95]. The model we study in this paper, the qrqw asynchronous pram model, permits more asynchronous behavior than the bulk synchrony imposed by the qrqw pram. Thus it can be used to design and analyze algorithms for machines such as the MTA in contexts in which bulk synchrony is not employed. ....
G. E. Blelloch, P. B. Gibbons, Y. Matias, and M. Zagha. Accounting for memory bank contention and delay in high-bandwidth multiprocessors. In Proc. 7th ACM Symp. on Parallel Algorithms and Architectures, pages 84--94, July 1995.
....7.6 µs, 6.5 µs, and o = 1.8 µs. On the Meiko CS-2, g = 13.6 µs, 7.5 µs, and o = 1.7 µs. Since the local instruction rate at a processor is tens of nanoseconds per instruction or faster, the normalized values for these parameters are in the hundreds to a few thousand. In contrast, Blelloch et al. [17] considered two shared memory vector multiprocessors, reporting (normalized) gap parameter values of g = 1.2 for the Cray C90 and g = 1.8 for the Cray J90. 3.2 Comparison to bsp. In this section we compare the qsm and the bsp in terms of their effectiveness as a bridging model for parallel ....
G. E. Blelloch, P. B. Gibbons, Y. Matias, and M. Zagha. Accounting for memory bank contention and delay in high-bandwidth multiprocessors. IEEE Trans. on Parallel and Distributed Systems, 8(9):943--958, 1997.
....analysis (e.g. decision support, data mining, OLAP, data warehousing) implies that the most important modeling issue going forward concerns how best to model disk I/O. This position paper summarizes a number of results appearing in several previous papers and manuscripts by the author, including [12, 27, 28, 29, 30]. Co-authors on one or more of these papers are Guy Blelloch, Yossi Matias, Vijaya Ramachandran, and Marco Zagha. The positions expressed in this paper are those of the author, although they arose from discussions with these co-authors and are shared to varying extents by them. This paper ....
....barriers, with an explicit charge (s or L) for each [Figure 1 plot omitted: y-axis "Time per element (clock periods)"; x-axis "Permutation size"; series "Optimized EREW Sort" and "QRQW Dart Throwing"] Figure 1: The utility of the QSM contention metric. This figure is from [12]. Shown are measured times on an 8-processor Cray J90 comparing two algorithms for generating a random permutation: a queue-read queue-write (QRQW) dart throwing algorithm and an exclusive-read exclusive-write (EREW) sorting-based algorithm. The QRQW algorithm performs better over a wider range of ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. B. Gibbons, Y. Matias, and M. Zagha. Accounting for memory bank contention and delay in high-bandwidth multiprocessors. In Proc. 7th ACM Symp. on Parallel Algorithms and Architectures, pages 84--94, July 1995.
....(as is needed for prams). Efficient bulk synchronization is an option on these machines, but is not imposed. An extensive study of algorithms and results for the qrqw pram can be found in [GMR96b, GMR96a]. In addition, experimental results for the qrqw pram on the Cray C90 and J90 can be found in [BGMZ95]. The model we study in this paper, the qrqw asynchronous pram model, permits more asynchronous behavior than the bulk synchrony imposed by the qrqw pram. Thus it can be used to design and analyze algorithms for machines such as the MTA in contexts in which bulk synchrony is not employed. Indeed, ....
G. E. Blelloch, P. B. Gibbons, Y. Matias, and M. Zagha. Accounting for memory bank contention and delay in high-bandwidth multiprocessors. In Proc. 7th ACM Symp. on Parallel Algorithms and Architectures, pages 84--94, July 1995.
....in the emulation are small, then the high-level model becomes an attractive general-purpose bridging model. We substantiate the ability of the qsm to serve as a bridging model by providing a simple work-preserving emulation of the qsm on both the bsp, and on a related model, the (d,x) bsp [16], and arguing for the practicality of this emulation. Thus the qsm can be effectively realized on machines that can effectively realize the bsp, as well as on machines that are better modeled by the (d,x) bsp. We also describe scenarios in which the high-level qsm is more suited for analyzing ....
....in Section 4 provides some intuition for this rather surprising result. The particular instance of the Queuing Shared Memory model in which the gap parameter, g, equals 1 is essentially the Queue Read Queue Write (qrqw) pram model defined by the authors [34]. Previous work on the qrqw pram [34, 32, 16] has been focused primarily on contention issues, unlike this paper, which is primarily concerned with bridging models and bandwidth issues. 3.1 Model comparison. Table 1 compares the qsm model to a number of other models in the literature. The first column of the table gives the name of the ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. B. Gibbons, Y. Matias, and M. Zagha. Accounting for memory bank contention and delay in high-bandwidth multiprocessors. In Proc. 7th ACM Symp. on Parallel Algorithms and Architectures, pages 84--94, July 1995.
....in the emulation are small, then the high-level model becomes an attractive general-purpose bridging model. We substantiate the ability of the qsm to serve as a bridging model by providing a simple work-preserving emulation of the qsm on both the bsp, and on a related model, the (d,x) bsp [9], and arguing for the practicality of this emulation. Thus the qsm can be effectively realized on machines that can effectively realize the bsp, as well as on machines that are better modeled by the (d,x) bsp. We also describe scenarios in which the high-level qsm is more suited for analyzing ....
....in Section 3 provides some intuition for this rather surprising result. The particular instance of the Queuing Shared Memory model in which the gap parameter, g, equals 1 is essentially the Queue Read Queue Write (qrqw) pram model defined by the authors [20]. Previous work on the qrqw pram [20, 17, 9] has been focused primarily on contention issues, unlike this paper, which is primarily concerned with bridging models and bandwidth issues. 2.1 Model comparison. Table 1 compares the qsm model to a number of other models in the literature. The first column of the table gives the name of the ....
[Article contains additional citation context not shown here]
G. E. Blelloch, P. B. Gibbons, Y. Matias, and M. Zagha. Accounting for memory bank contention and delay in high-bandwidth multiprocessors. In Proc. 7th ACM Symp. on Parallel Algorithms and Architectures, pages 84--94, July 1995.
No context found.
G.E. Blelloch, P.B. Gibbons, Y. Matias, and M. Zagha. Accounting for memory bank contention and delay in high-bandwidth multiprocessors. IEEE Trans. Par. Dist. Sys., 8(9):943--958, 1997.
No context found.
G. E. Blelloch, P. B. Gibbons, Y. Matias, and M. Zagha. Accounting for memory bank contention and delay in high-bandwidth multiprocessors. In Proc. 7th ACM Symp. on Parallel Algorithms and Architectures, pages 84--94, July 1995.