| R.W. Hockney, Performance parameters and benchmarking of supercomputers, in: J.J. Dongarra and W. Gentzsch, eds., Computer Benchmarks (Elsevier Science Publishers, Holland, 1993) 41-63. |
....of the software development it is possible to make initial predictions of the performance by identifying the principal operations. An example of principal operation characterisation in performance studies of scientific applications is that based on floating point operations (or flops) [12, 16, 19]. The number of flops to be credited for different types of floating point operations can be based on the scheme suggested by McMahon [25]. Using Table 1, one can use flops as the RUV for each node of the software execution graph. Then by using the flop rate (Mflop/s) of a specific system ....
R.W. Hockney, Performance parameters and benchmarking of supercomputers, in: J.J. Dongarra and W. Gentzsch, eds., Computer Benchmarks (Elsevier Science Publishers, Holland, 1993) 41-63.
.... Data) programming model, in which every parallel process executes the same program text; further details are given in [10]. To compute the predicted BSP cost on a given platform we determined the value of the BSP parameters for the Convex SPP1000 and the IBM SP2 using the notions introduced in [5], according to Miller's refinement of the BSP cost model [11]. The paper is organized as follows. In Section 2 we recall the main characteristics of the BSP model. In Section 3 we report the cost analysis of the different algorithms obtained combining the two basic methods for solving triangular ....
R. W. Hockney. Performance Parameters and benchmarking of supercomputers. Parallel Computing, 17:1111-1130, 1991.
....of today's interconnection networks, for which startup times represent a large part of the total communication cost when short messages are transmitted. On the contrary, as the message size increases this effect becomes negligible. Miller [13] refines the BSP cost model using Hockney's model [8] to include the effect of message granularity on communication cost. In the refined model g is defined as a function g(k), where k is the message size, and the constant depending on the communication hardware is $g_\infty$, so termed because it represents the asymptotic communication cost for very ....
R. W. Hockney. Performance Parameters and benchmarking of supercomputers. Parallel Computing, 17:1111-1130, 1991.
....which can be your own wrist watch. The time interval is started and ended by a keyboard input. We suggest you time an interval of 2 minutes. The remaining synthetic code fragments can then be run to measure performance parameters related to vectorization, communication and synchronisation [8]. These can be useful in interpreting the results obtained from the more complex benchmarks. 2. Run sequential FORTRAN77 versions of kernels. Measure single node performance over a range of problem sizes up to the maximum possible on a single node. Also investigate the effect of compiler options and ....
R.W. Hockney. Performance parameters and benchmarking of supercomputers. Parallel Computing, 17(10 - 11):1111--1130, 1991.
....$t_{\mathrm{trans}}\, n$ (1), where $t_0$ is the message startup time, $t_{\mathrm{trans}}$ is the transmission time per byte, and $n$ is the message length in bytes. Perhaps the simplest measure of message passing performance is the ping-pong or echo benchmark between a pair of nodes, which has been described by several authors [9, 5, 2]. In this test one node sends a message to the other, which in turn receives the message and sends it back immediately to the first node. Half the time for this test is recorded as the time to send a message of the given length. The message passing performance can be characterised by extending ....
....as the time to send a message of the given length. The message passing performance can be characterised by extending Hockney's performance description for vector pipelined processors over the above model using the asymptotic performance, $r_\infty$, and the half performance length, $n_{1/2}$, parameters [5]: $t_{\mathrm{com}} = (n^{c}_{1/2} + n) / r^{c}_{\infty}$ (2). Although the communication subsystems of tightly coupled parallel computers have been improving very quickly over the last few years, this same simple model is also valid for multistage interconnection networks based on high speed communication switches ....
R. Hockney. Performance Parameters and Benchmarking of Supercomputers. Parallel Computing, 17(10/11):1111--1130, 1991.
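The two descriptions quoted in the excerpts above (a startup time plus a per-byte transfer time, and Hockney's asymptotic rate with a half-performance length) are algebraically the same straight line. The following Python sketch illustrates the conversion, assuming the linear model $t = t_0 + t_{\mathrm{trans}} n$ holds exactly; the function names and the numerical values are hypothetical and do not come from any of the cited papers.

```python
def comm_time_linear(n_bytes, t0, t_trans):
    """Startup-plus-transfer model: t = t0 + t_trans * n."""
    return t0 + t_trans * n_bytes


def hockney_params(t0, t_trans):
    """Convert (t0, t_trans) into Hockney's (r_inf, n_half).

    r_inf  = 1 / t_trans   asymptotic rate in bytes per second
    n_half = t0 * r_inf    message length giving half of r_inf
    """
    r_inf = 1.0 / t_trans
    n_half = t0 * r_inf
    return r_inf, n_half


def comm_time_hockney(n_bytes, r_inf, n_half):
    """Hockney form of the same line: t = (n_half + n) / r_inf."""
    return (n_half + n_bytes) / r_inf


# Hypothetical machine: 50 microsecond startup, 25 ns per byte.
T0 = 50e-6
T_TRANS = 25e-9
R_INF, N_HALF = hockney_params(T0, T_TRANS)
```

At a message length equal to `N_HALF` the achieved rate `n / t` is half of `R_INF`, which is the defining property of the half-performance length.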
....of the numerical algorithms, the flow of control is rather simple; that makes it much easier to predict their run time. It is also well known that the run time of many numerical algorithms largely depends on the following parameters: 1. the Hockney parameters of the interconnection network [23], 2. the support for broadcast and global accumulation operations, and 3. the performance of the processors. It is fairly easy to measure the network parameters 1) and 2) but measuring and modeling the performance of the processors is the hard part. The processor performance depends on the ....
R. Hockney. Performance parameters and benchmarking of supercomputers. Parallel Computing, 17:1111--1130, 1991.
....between two communicating parties on the size of the message. For example, various hardware and software overheads in a parallel environment that are modeled by a fixed component, independent of the message size, and by a variable component, proportional to the message size, are identified in [1, 3, 4, 8, 9]. However, such models (with constant coefficients) cannot accommodate contention in a general fashion. Schemes for partially avoiding contention in routing architectures (e.g. a hypercube in [15]) and for obtaining probabilistic guarantees for propagation times are proposed, but the problem of ....
Hockney, R. W. Performance Parameters and Benchmarking of Supercomputers. Parallel Computing, 17 (1991), 1111-1130.
.... a given source and destination, the communication time can be parameterised by Hockney's ($n_{1/2}$, $r_\infty$) parameters [3], in terms of which the communication time t for a message of n bytes is: $t = (n + n_{1/2}) / r_\infty$. These parameters can be measured using Hockney's comms1 or ping pong benchmark [4], in which the time is measured for a message to be sent to its destination and received back again, the receiver returning the message as soon as it arrives. The communication time t is measured for a range of message lengths n. A plot of t versus n should form a straight line, from whose ....
R.W. Hockney. `Performance parameters and benchmarking of supercomputers'. Parallel Computing 17 (10-11) 1111-1130, 1991.
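The excerpt above describes measuring t over a range of message lengths and reading the parameters off a straight line. A minimal sketch of that step is an ordinary least-squares fit, shown here in Python. It assumes the relation $t = (n + n_{1/2}) / r_\infty$, so the slope is $1/r_\infty$ and the intercept is $n_{1/2}/r_\infty$; the function name `fit_hockney` and the synthetic data are hypothetical, and this is not the comms1 benchmark code itself.

```python
def fit_hockney(sizes, times):
    """Least-squares line through (n, t) points.

    Returns (r_inf, n_half), assuming t = (n + n_half) / r_inf,
    i.e. slope = 1 / r_inf and intercept = n_half / r_inf.
    """
    count = len(sizes)
    mean_n = sum(sizes) / count
    mean_t = sum(times) / count
    sxx = sum((n - mean_n) ** 2 for n in sizes)
    sxy = sum((n - mean_n) * (t - mean_t) for n, t in zip(sizes, times))
    slope = sxy / sxx
    intercept = mean_t - slope * mean_n
    r_inf = 1.0 / slope
    n_half = intercept / slope
    return r_inf, n_half


# Synthetic one-way times (half of each round trip) generated from
# hypothetical parameters r_inf = 4e7 bytes/s and n_half = 2000 bytes.
SIZES = [0, 1000, 2000, 4000, 8000]
TIMES = [(2000 + n) / 4.0e7 for n in SIZES]
R_INF_FIT, N_HALF_FIT = fit_hockney(SIZES, TIMES)
```

With real measurements the points will scatter around the line, and, as some of the other excerpts note, packetization or protocol changes may require fitting separate regions of message length.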
.... Another meaningful and elegant way of characterizing the communication performance is to extend Hockney's performance description for vector pipelined processors over the above model using the asymptotic bandwidth, $r_\infty$, and the half performance message length, $N_{\mathrm{half}}$ ($n_{1/2}$), parameters [7]: $t_{\mathrm{com}} = (n_{1/2} + n) / r_\infty$ (2). The power of this simple formula is the fact that it provides a single parameter for both the computer system and the application software. In a communication pipeline the maximum (asymptotic) bandwidth occurs for infinitely long messages and is designated by $r_\infty$ ....
.... Figure 1: The ping pong test. Of interest is the average time required for sequential, point to point passing of messages. Perhaps the simplest measure of message passing performance is the ping pong or echo benchmark between a pair of processes, which has been described by several authors [13, 7, 4]. In this test (see Figure 1) one process sends a single fixed length buffer to the other, which in turn receives the message and sends it back immediately to the first process. This buffer contains an array of primitive data types. The buffer size is graduated across a broad range of sizes so ....
R. Hockney. Performance Parameters and Benchmarking of Supercomputers. Parallel Computing, 17(10-11):1111--1130, 1991.
.... • VMMP, A practical tool for the development of portable and efficient programs for multiprocessors, Gabber [66]. Further references for performance metrics: • A case study using ParaGraph, Roger Hockney [67] • Performance Parameters and Benchmarking of Supercomputers, Roger Hockney [68] • Parameterization of Computer Performance, Roger Hockney [69] • $f_{1/2}$: A parameter to characterize memory and communication bottlenecks, Roger Hockney and I. Curington [70] • A framework for benchmark performance analysis, Roger Hockney [71] • Parallel Computers 2: ....
Roger Hockney, "Performance parameters and benchmarking of supercomputers", Parallel Computing, vol. 17, pp. 1111--1130, 1991.
....overheads, then the cost of sending every single message may become significant and it will no longer be permissible, for instance, to treat n messages of length 1 as one message of length n. To account for such situations Miller [11] refined the standard BSP cost model using Hockney's model [12] to include the effect of message granularity. His approach defined g as a function of the message size x: $g(x) = (N_{1/2}/x + 1)\, g_\infty$, where $g_\infty$ is the asymptotic communication cost for very large messages and $N_{1/2}$ is the size of message that produces half the optimal bandwidth of the machine, ....
R. W. Hockney, Performance parameters and benchmarking of supercomputers, Parallel Computing, 17:1111-1130 (1991)
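The granularity effect described in the excerpt above can be made concrete with a small Python sketch. It takes the refined cost per data unit to be $g(x) = (N_{1/2}/x + 1)\, g_\infty$ and, as an assumption of this sketch rather than a statement from the cited papers, charges a communication of h units split into messages of size x a total of $h \cdot g(x)$. The function names and parameter values are hypothetical.

```python
def g_refined(message_size, g_inf, n_half):
    """Refined BSP cost per data unit: g(x) = (n_half / x + 1) * g_inf."""
    return (n_half / message_size + 1.0) * g_inf


def comm_cost(num_messages, message_size, g_inf, n_half):
    """Cost of sending num_messages messages of message_size units each.

    Assumption of this sketch: total volume h = num_messages * message_size
    is charged h * g(message_size).
    """
    volume = num_messages * message_size
    return volume * g_refined(message_size, g_inf, n_half)


# Hypothetical parameters: g_inf = 1 time unit per data unit, n_half = 100.
G_INF = 1.0
N_HALF = 100.0
many_small = comm_cost(1000, 1, G_INF, N_HALF)   # 1000 messages of size 1
one_large = comm_cost(1, 1000, G_INF, N_HALF)    # 1 message of size 1000
```

Both calls move the same volume, and the standard model would charge both the same $h g$, but under the refined model the many small messages cost far more. At `message_size == N_HALF` the function returns `2 * G_INF`, which matches the description of $N_{1/2}$ as the message size that yields half the optimal bandwidth.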
....overheads, then the cost of sending every single message may become significant and it will no longer be permissible, for instance, to treat n messages of length 1 as one message of length n. To account for such situations Miller [9] refined the standard BSP cost model using Hockney's model [10] to include the effect of message granularity. His approach defined g as a function of the message size x: $g(x) = (N_{1/2}/x + 1)\, g_\infty$, where $g_\infty$ is the asymptotic communication cost for very large messages and $N_{1/2}$ is the size of message that produces half the optimal bandwidth of the machine, ....
Hockney R.W., "Performance parameters and benchmarking of supercomputers," Parallel Computing, 17:1111-1130, 1991.
....designers that parallel machines offer the best opportunity for improvement of supercomputers (CRAY T3E, IBM SP). Programs developed on such heterogeneous architectures need to be benchmarked just as sequential programs and machines are. Architecture performance parameters have been proposed (see [17] and references therein). Users need speedup upper bounds that can be ideally reached when several processors are used. These upper bounds yield a consistent definition of efficiency. Speedup and efficiency are synthetic performance enhancement measures of an algorithm on a parallel architecture ....
R. Hockney, Performance parameters and benchmarking of supercomputers, Parallel Computing, 17 (1991), pp. 1111--1130.
....abstraction of a parallel computer. Early parallel models, beginning with [7], avoided the issue of communication and its impact on performance with the assumption of such perfect communication. Algorithms based on such models may appear to be highly performant, but more realistic assumptions [4, 9, 21] about the underlying communication system may reveal significant degradation of their behavior. Our approach is also motivated by factors showing the increasing importance of communication in the area of parallel distributed computing: • The need for improving evaluation of complexity and ....
....model for the performance of vector computers as a function of the vector length, it can be used to characterize any process that has a linear timing relation with respect to some variable. Similar schemes were derived for modeling memory access overhead [10] or synchronization overhead [9]. This approach also provides a general framework for benchmarking supercomputers. The bulk synchronous parallel model (BSP) [21] is a model bridging software and hardware aspects of general purpose parallel computation. It is an abstraction of a parallel machine on which algorithms written in ....
Hockney, R. W. Performance Parameters and Benchmarking of Supercomputers. Parallel Computing, 17 (1991), 1111-1130.
....programs. For experts, we want to know the value of $p_{1/2}$, the time it takes to produce a correct program that achieves at least half the performance of the machine. (This is analogous with Hockney's $n_{1/2}$, which is the vector length on which a pipeline delivers half its peak performance [18].) There are a number of considerations that must be taken into account in the design of the experiment to obtain fair results: 1) The subjects used in an experiment must be chosen with care. The novices should have no previous experience in parallel computing. To eliminate the cumulative effects ....
R. Hockney. Performance Parameters and Benchmarking of Supercomputers. Parallel Computing, Vol. 17, 1991.
....single message of size h. Although both communications have the same BSP cost, as they both realise an h-relation with cost hg, it seems extremely difficult to believe that their communication time will be the same on a real parallel machine. Miller [13] refines the BSP cost model by using Hockney's [9] model to include the effect of message granularity on communication cost. In the refined model, g is defined as a function g(x), where x is the message size and $g_\infty$ is the asymptotic communication cost for very large messages (g reported in Table 1 is $g_\infty$): $g(x) = (N_{1/2}/x + 1)\, g_\infty$ ....
R. W. Hockney. Performance parameters and benchmarking of supercomputers. Parallel Computing, 17:1111--1130, 1991.
....Hypercube and the vector machine VP100 are analysed in another paper (see [3]). Our explanations are sometimes within 0.5% and almost always within 5% of the measured run times. 1.1 Introduction. Benchmark programs come in four flavours: (i) small loops designed to measure machine parameters [4], (ii) inner loops of algorithms [6], (iii) whole applications [11] and (iv) synthetic benchmarks [10]. Benchmark suites may include modules of several flavours [9, 11]. In such cases one sometimes wishes to explain results of higher order modules by results of lower order modules [9]. Such an ....
....analyse their own algorithms and applications and check if the reasons for a machine's performance on the benchmark apply to their own program. Unfortunately, we have been unable to find any such explanation whatsoever in the literature, and it has been stated that often it cannot be given at all [4]. This is obviously true, because any such explanation would have to involve some kind of analysis of algorithms, which is not feasible if the flow of control depends too heavily on the input data (e.g. Livermore loop 15 [6]) and in general, of course, because of the unsolvability of the halting ....
R. Hockney. Performance parameters and benchmarking of supercomputers. Parallel Computing, 17:1111--1130, 1991.
....time slice. The (small) effect of multithreading overhead is automatically accounted for in the coefficient, since during the calibration the kernel is run as a thread. 4.3 Communication Model. In conventional static techniques a linear delay model is often used to predict the transfer delay (e.g. [3, 13]) in terms of startup, hop count and bandwidth. Although precise for isolated transfers, these models do not account for additional queuing delay induced by concurrent traffic contending for the intermediate (sending, forwarding, receiving) link and node services. This is especially true in ....
R.W. Hockney, "Performance parameters and benchmarking of supercomputers," Parallel Computing, 17, 1991, pp. 1111--1130.
....and includes specific codes for problems like scheduling, load balancing, communications and synchronisation. For instance, Genesis [9] includes an implementation of the ping pong method generally used to evaluate communications in MIMD (Multiple Instructions Multiple Data) parallel machines [10]. However, the ping pong application executes only one communication at a time. This application is very useful to evaluate parameters described in paragraph 2.1, but it does not evaluate communications in the case of contention or of global communications. For these two cases, specific benchmarks ....
....configured as a grid with 32 nodes (a 4x8 grid). The grid topology is chosen because it has the maximal possible degree and it is easy to extend. 4.1 Ping Pong Experiment. In this experiment only the two tasks of the tagged ping pong were executed. A description of this application can be found in [10]. Figure 1 shows the average communication time as a function of the size of the message for different distances. The hops are due to the packetization. In Figures 2 and 3, average communication time is a function of the two parameters, size and distance. A change in the curve appears clearly when ....
R. Hockney. Performance Parameters and Benchmarking of Supercomputers. Parallel Computing, 17(10-11):1111--1130, December 1991.
....process sending h messages of size one or a single message of size h. However, on a real parallel machine there is a startup latency associated with every message, so the actual communication cost is dependent on message size. Miller [7] refined the standard BSP cost model using Hockney's model [5] to accurately include the effect of message granularity in the communication cost. In the refined BSP model, g is defined as a function of the message size x: $g(x) = (N_{1/2}/x + 1)\, g_\infty$ (3), where $g_\infty$ is the asymptotic communication cost for very large messages (g reported in Table 1 is $g_\infty$ ....
R. W. Hockney. Performance parameters and benchmarking of supercomputers. Parallel Computing, 17:1111--1130, 1991.
....what they do. For novices, we are interested in measuring how quickly they can learn the system and produce correct programs. For experts, we want to know the value of $p_{1/2}$, the time it takes to produce a correct program that achieves at least half the peak performance of the machine [9]. 3. THE SAMPLE PROBLEM AND ITS SOLUTION. The problem chosen for the experiment was the computation of a transitive closure. This problem was used in a graduate course in parallel computing at the University of Alberta during the previous year, but all students used NMP. Although consideration was ....
R. Hockney. Performance Parameters and Benchmarking of Supercomputers. Parallel Computing, vol. 17, pp. 1111-1130, 1991.
....elimination code and give the corresponding real and estimated execution times in order to show the accuracy of the estimated performance figures. Related Work. There are numerous articles in the literature about benchmarking different aspects of recent parallel architectures or supercomputers [3, 4, 11, 12, 13, 14, 16]. There are also several benchmark suites specially developed to provide a common ground to test the performance of different high performance computers [1, 2, 10, 15]. Some of them investigate the use of real application programs, while others employ short kernel codes to evaluate the performance, ....
R.W. Hockney. Performance parameters and benchmarking of supercomputers. Parallel Computing, 17 (10-11):1111--1130 (1991).
....One can notice that there are discontinuities in the communication time at periodic sizes of the message length due to packetization. The discontinuities are at $217 + 232i$ bytes, where $i \in \{0, 1, 2, \ldots\}$. The point to point communication time can often be expressed by a few simple parameters [13]; $r_\infty$ is the asymptotic transfer rate in MBytes/sec, and $t_0$ is the asymptotic zero message length latency in microseconds. The time (in microseconds) for point to point communication as a function of n, $T_{pp}(n)$, is given by the following equation: $T_{pp}(n) = t_0 + n / r_\infty$. For small messages, the ....
R. Hockney, "Performance parameters and benchmarking of supercomputers," Parallel Computing, vol. 17, pp. 1111--1130, Dec. 1991.
....various multi stage interconnection networks. Even Valiant's Bulk Synchronous Parallel approach [97, 125] to modelling parallel computation explicitly excludes the possibility of this sort of contention from its analysis. Hockney's approach to modelling the performance of vector processing units [66] should be included under this heading. There is a correspondence between the costs of starting up a vector processor pipeline and initiating a communications connection. This approach is also interesting as the analysis is based on the behaviour of the actual hardware component. One novel ....
.... function constant communications delay [96] constant communications delay single algorithm model [31] algebraic: abstract characterisation topologically dependent overheads [39] Solution Rate algebraic: complexity function pipeline costs pipeline start up [66] Response Time algebraic: stochastic complexity function synchronisation costs fork join data dependency fixed configurations [29] Throughput Power algebraic: stochastic characterisation stochastic arrivals open system finite processors [77] Table 2.2: Basic taxonomy of models ....
Roger Hockney. Performance parameters and benchmarking of supercomputers. Parallel Computing, 17:1111--1130, 1991.
.... produces a slower code on VAX, then the VAX MIPS rating of the other machine gets inflated (Weicker 1991). In short, there are no accepted industry standards for measuring the performance of different computers (Serlin 1986). Some vendors publicize what is known as sustained performance ratio (Hockney 1991). The sustained performance is the number of floating point operations performed per second during the execution of a standard problem, expressed in MFLOPS. In order for this term to be used in a meaningful fashion, the problem on which the performance was achieved must be given (Hockney 1991) ....
....(Hockney 1991). The sustained performance is the number of floating point operations performed per second during the execution of a standard problem, expressed in MFLOPS. In order for this term to be used in a meaningful fashion, the problem on which the performance was achieved must be given (Hockney 1991). Some researchers have suggested that the term sustained performance should be dropped and replaced by the name of a well defined benchmark whose performance is being quoted (Hockney 1991). Therefore, to give a clear meaning to the MIPS and MFLOPS numbers they should be obtained by executing a ....
Hockney, Roger. 1991. Performance parameters and benchmarking of supercomputers. Parallel Computing 17 (December): 1111--30.
....used to gather the timing data, and develop a performance model for each pattern. The point to point communication time can often be expressed by a few simple parameters as $T(n) = t_0 + n / r_\infty$, where $r_\infty$ is the asymptotic transfer rate in MB/s and $t_0$ is the asymptotic zero length message latency in µs [7]. For small messages, $t_0$ is dominant, while for large messages the transfer time is dominant. Since n is measured in bytes, and is varied from 8 bytes to 1 MB, the model for each library and each basic communication type was split into several regions where each region has its own $t_0$ and $r_\infty$. ....
R. Hockney. Performance parameters and benchmarking of supercomputers. Parallel Computing, 17(10):1111--1130, Dec. 1991.
....the actual execution time measured (within a few percent, as shown later on). Note that for this example conventional static analysis may already yield an error up to 100%. 4.3 Communication Model. In conventional static techniques a delay model is used to predict the transfer delay (e.g. [5, 8, 21]). Although precise for isolated transfers, these models generally do not account for additional queueing delay induced by concurrent traffic on the intermediate (sending, forwarding, receiving) links and nodes. In the following we develop a transfer contention model which provides a first order ....
R.W. Hockney, "Performance parameters and benchmarking of supercomputers," Parallel Computing, vol. 17, 1991, pp. 1111--1130.
....machine VP100. Our explanations are almost always within 15% of the measured run times. We also point out several striking anomalies in the run time behaviour of the above architectures. 2.2 Introduction. Benchmark programs come in four flavours: (i) small loops designed to measure machine parameters [Hoc91], (ii) inner loops of algorithms [McM88], (iii) whole applications [Wei91] and (iv) synthetic benchmarks [Wei84]. Benchmark suites may include modules of several flavours [VdS91, Wei91]. In such cases one sometimes wishes to explain results of later modules by results of earlier modules ....
....of the inner loop, respectively. If no confusion is possible we will drop arguments g, j or L in order to simplify notation. If routine r(g, j) consists of a fixed number of operations, then for large n we expect to observe run times of the form $t(L, n) = a_0 + a_1 n$ $\bigl(= (1/r_\infty)(n_{1/2} + n)$, see [Hoc91, HJ88]$\bigr)$ and $t(L, n)/n = a_0/n + a_1$ (see figure 2.1, Type A curve or scalar curve). Because we are interested in large problem sizes it would seem natural to take measurements for large values of n only. This did not suffice for the following two reasons: parallelization of a given ....
R. Hockney. Performance parameters and benchmarking of supercomputers. Parallel Computing, 17:1111--1130, 1991.
....both communications have an h-relation cost of hg. However, a superstep in which very little total communication occurs may still deviate from the cost model because of the effects of startup costs for message transmission. Miller refined the standard cost model [29] using a technique of Hockney [20] to model the effect of message granularity on communication cost. In the refined model, g is defined as a function of the message size x: $g(x) = (n_{1/2}/x + 1)\, g_\infty$ (1), where $g_\infty$ is the asymptotic communication cost for very large messages (g reported in Table 2 is $g_\infty$) and $n_{1/2}$ is the ....
R.W. Hockney. Performance parameters and benchmarking of supercomputers. Parallel Computing, 17:1111--1130, 1991.
....implies that the first packet has 216 data flits and 39 overhead flits, and the subsequent packets have 232 data flits and 23 overhead flits. The effective HPS bandwidth is thus $40 \times 232/255 = 36.4$ MB/sec. The point to point communication time can often be expressed by a few simple parameters [14]; $r_\infty$ is the asymptotic transfer rate in MB/sec, and $t_0$ is the asymptotic zero message length latency in µsec. The time (in µsec) for point to point communication as a function of n, $T_{pp}(n)$, is given by the following equation: $T_{pp}(n) = t_0 + n / r_\infty$. For small messages, the setup time $t_0$ is ....
R. Hockney. Performance parameters and benchmarking of supercomputers. Parallel Computing, 17(10):1111--1130, Dec. 1991.
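The packet arithmetic in the excerpt above can be checked with a short Python sketch. It uses the flit counts quoted there (216 data flits in the first packet, 232 in later ones, 255 flits per packet in total) and the 40 MB/sec figure from the quoted calculation. That one data flit carries one byte is an assumption of this sketch, and the function names are hypothetical.

```python
FIRST_PACKET_DATA_FLITS = 216   # first packet: 216 data + 39 overhead flits
NEXT_PACKET_DATA_FLITS = 232    # later packets: 232 data + 23 overhead flits
FLITS_PER_PACKET = 255          # 216 + 39 == 232 + 23 == 255
RAW_MB_PER_SEC = 40.0           # raw rate used in the quoted calculation


def effective_bandwidth():
    """Asymptotic payload rate: the data fraction of a full later packet."""
    return RAW_MB_PER_SEC * NEXT_PACKET_DATA_FLITS / FLITS_PER_PACKET


def packets_needed(n_bytes):
    """Packets for an n-byte message, assuming one byte per data flit."""
    if n_bytes <= FIRST_PACKET_DATA_FLITS:
        return 1
    remaining = n_bytes - FIRST_PACKET_DATA_FLITS
    # Integer ceiling division without floating point.
    return 1 + -(-remaining // NEXT_PACKET_DATA_FLITS)
```

Under the one-byte-per-flit assumption, `packets_needed` first jumps at 217 bytes and then every 232 bytes after that, and `effective_bandwidth()` rounds to the 36.4 MB/sec stated in the excerpt.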
....In [7] a formalization is given of the program model construction process. 4.3 Communication Model. Especially in implementations which involve a high level of concurrency (threads, latency hiding), a traditional approach to modelling a message passing system (startup time and bandwidth parameters [2, 9]) seriously underestimates communication delay in the presence of other communications due to contention for resources. We present a model which accounts for the contention for routing forwarding services at all processors along the virtual circuit. Intended as a first order approximation for the ....
R.W. Hockney, "Performance parameters and benchmarking of supercomputers, " Parallel Computing, vol. 17, 1991, pp. 1111--1130.
....attention has been paid to the quality of the associated performance models. The models used are either based on training sets [15, 33] or reflect the state of the art in static performance prediction, in which (for example) communication costs are modeled using linear startup bandwidth models [13, 18, 35]. However, it is shown that even under moderate communication densities these models easily underestimate communication delay by orders of magnitude due to the fact that network contention is not accounted for [5]. A new approach has been introduced to the performance prediction of parallel ....
R.W. Hockney, "Performance parameters and benchmarking of supercomputers," Parallel Computing, vol. 17, 1991, pp. 1111--1130.
....performance models for communication as a function of simple performance parameters. One such estimation approach has been applied to a wide variety of parallel machines with good results. Simple calibration loops were developed to measure latency and bandwidth for internode communication [20]. Let r be the asymptotic transfer rate of a communication interconnect in units of megabytes/second, n the message length in bytes, and $t_0$ the (asymptotic) zero message length latency in microseconds. This suggests the following model of communication latency: (28) As shown in Section ....
R. Hockney, "Performance Parameters and Benchmarking of Supercomputers," Parallel Computing, Vol. 17, No. 10 & 11, December, 1991, pp. 1111-1130.
....Netsim is geared towards optimized libraries which do not make use of source buffering for large messages. As a result, netsim does not capture the sender behavior as well as it does on the SP 2. 4 Related Work Many analytical models have been developed for analyzing network interconnects [1, 3, 9, 16]. Even though it is more convenient to use analytical models for the analysis, it is very difficult to obtain accurate models for complex systems. Analytical models are usually complemented with simulation for the missing accuracy. We resort to simulation in our work, because modeling a full ....
R.W. Hockney. Performance Parameters and Benchmarking of Supercomputers. Parallel Computing, 17:1111--1130, 1991.
....performance achieved by an average programmer as a function of time on each of a range of problems. This would lead to graphs of the kind shown in Figure 3, where the left graph shows a particular result and the right graph a simplified abstraction of it. By analogy with Hockney's $n_{1/2}$ measure [20], which is the length of vector on which a pipelined architecture achieves half its theoretical peak performance, we could in principle find the value of $p_{1/2}$, the programming time required to achieve half of a machine's peak performance, for a particular combination of programming system ....
Roger Hockney. Performance Parameters and Benchmarking of Supercomputers. Parallel Computing, 17:1111--30, 1991.
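To make the $n_{1/2}$ definition in the snippet above concrete, here is a minimal sketch. It assumes the usual two-parameter pipeline characterisation, in which the achieved rate on a vector of length n is r_inf / (1 + n_half / n); the function name and the numbers are illustrative and not taken from the cited papers.

```python
def pipeline_rate(n: float, r_inf: float, n_half: float) -> float:
    """Achieved rate on a vector of length n.

    Assumed model: time(n) = (n + n_half) / r_inf, so
    rate(n) = n / time(n) = r_inf / (1 + n_half / n).
    """
    if n <= 0:
        raise ValueError("vector length must be positive")
    return r_inf / (1.0 + n_half / n)


if __name__ == "__main__":
    r_inf, n_half = 100.0, 20.0  # hypothetical peak rate and half-performance length
    for n in (5, 20, 80, 320):
        print(n, pipeline_rate(n, r_inf, n_half))
```

At n equal to n_half the model returns exactly half of r_inf, which is the property the definition describes.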
....correct programs. For experts, we want to know the value of $p_{1/2}$, the time it takes to produce a correct program that achieves at least half the performance of the machine. This is analogous to Hockney's $n_{1/2}$, which is the vector length on which a pipeline delivers half its peak performance [Hoc91]. There are a number of considerations that must be taken into account in the design of the experiment to obtain fair results: 1) The subjects used in an experiment must be chosen with care. The novices should have no previous experience in parallel computing. To eliminate the cumulative effects ....
R. Hockney. Performance Parameters and Benchmarking of Supercomputers. Parallel Computing, Vol. 17, 1991.
....While the measurements clearly show the accuracy of T (within percents) the results support the conclusions of the above simulation study with respect to T l . In this case study we introduce a new communication model of the message-passing machine that extends traditional static models [5, 8, 23] through its capability to predict communication delay in the presence of simultaneous communications due to network contention. It is shown that this innovation is crucial to the prediction accuracy even under moderate network load. A significant part of the experiments involve actual machine ....
R.W. Hockney, "Performance parameters and benchmarking of supercomputers," Parallel Computing, vol. 17, 1991, pp. 1111--1130.
....size h; both communications realise an h-relation cost of hg. However, a superstep in which very little total communication occurs may still deviate from the cost model because of the effects of startup costs for message transmission. Miller refined the cost model [26] using a technique of Hockney [17] to model the effect of message granularity on communication cost. In the refined model, g is defined as a function of the message size x: $g(x) = \left(\frac{n_{1/2}}{x} + 1\right) g_\infty$ (1) where $g_\infty$ is the asymptotic communication cost for very large messages (g reported in Table 2 is $g_\infty$) and $n_{1/2}$ is the ....
R.W. Hockney. Performance parameters and benchmarking of supercomputers. Parallel Computing, 17:1111--1130, 1991.
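The refined cost function quoted in the snippet above can be evaluated directly. The sketch below assumes the form g(x) = (n_half / x + 1) * g_inf, where g_inf stands for the asymptotic cost for very large messages; the parameter values are hypothetical and are not the Table 2 figures from the citing paper.

```python
def refined_g(x: float, n_half: float, g_inf: float) -> float:
    """Per-unit communication cost for messages of size x.

    Assumed form: g(x) = (n_half / x + 1) * g_inf.
    As x grows, g(x) approaches g_inf; at x == n_half it equals 2 * g_inf.
    """
    if x <= 0:
        raise ValueError("message size must be positive")
    return (n_half / x + 1.0) * g_inf


if __name__ == "__main__":
    n_half, g_inf = 64.0, 0.5  # hypothetical machine parameters
    for x in (8, 64, 128, 4096):
        print(x, refined_g(x, n_half, g_inf))
```

Small messages are charged noticeably more than g_inf per unit, which is the startup effect the refinement is meant to capture.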
....metric of interest and is equal to (α + (h − 1)γ)/β [10]. As with any metric that is a ratio, any notion of goodness or optimality of $n_{1/2}$ should only be considered in the context of the underlying metrics α, β, γ, and h. For a more complete discussion of these parameters see [9, 8]. There are a number of factors that can affect message passing performance. The number of times the message has to be copied or touched (e.g. checksums) is probably most influential and obviously a function of message size. The vendor may provide hints as to how to reduce message copies, for ....
Roger Hockney. Performance parameters and benchmarking of supercomputers. Parallel Computing, 17:1111--1130, 1991.
....to the analytic prediction quality the results support the above conclusions. • We introduce a new communication bandwidth model of the message passing machine which accounts for resource contention within the network. The technique extends traditional static (point-to-point) modeling techniques [4, 7, 13] through its capability to predict communication delay in the presence of simultaneous communications. The case study measurements show that this innovation is crucial to the prediction accuracy even under moderate network load. • We introduce a compile-time metric which measures the degree ....
....time slice. In the coefficient the (small) effect of multithreading overhead is automatically accounted for since during the calibration the above code is run as a thread. 4.3 Communication Model In conventional static techniques a linear delay model is used to predict the transfer delay (e.g. [4, 7, 13]) in terms of startup, hopcount and bandwidth. Although precise for isolated transfers, these models generally do not account for additional queuing delay induced by concurrent traffic contending for the intermediate (sending, forwarding, receiving) link and node services. This is especially the ....
R.W. Hockney, "Performance parameters and benchmarking of supercomputers," Parallel Computing, vol. 17, 1991, pp. 1111--1130.
....one can easily find that time of execution is completely dominated by message communication time. 4.2 Communications Benchmark (COMMS1) This benchmark measures the basic communication properties of a computer network by performing the pingpong experiment between a neighbouring pair of nodes [9]. A message of varying length is sent and returned after the data has become available to the receiving user program. Half the time for this pingpong exchange is recorded as the time to send a message from one node to a neighbour. Analysis of this time as a function of message length allows the ....
R.W. Hockney. Performance Parameters and Benchmarking of Supercomputers, Parallel Computing, 17(10-11) (1991) 1111-1130.
....give a clear meaning to these averages, and the value of the benchmark is more in the distribution itself. In particular, the maximum and minimum give the range of likely performance in full applications. The ratio of maximum to minimum performance has been called the instability or the speciality [18], and is a measure of how difficult it is to obtain good performance from the computer, and therefore how specialised it is. The minimum or worst performance obtained on these loops is of special value, because there is much truth in the saying that the best computer to choose is that with the ....
....SYNCH1 benchmark assesses this by measuring the number of barrier synchronisation statements that can be executed per second as a function of the number of processors taking part in the barrier. 3.3.1 Communication Benchmarks: COMMS1 and COMMS2 The purpose of the COMMS1, or Pingpong, benchmark [29, 18] is to measure the basic communication properties of a message passing computer. A message of variable length, n, is sent from a master processor to a slave processor. The slave receives the message into a Fortran data array, and immediately returns it to the master. Half the time for this message ....
R. W. Hockney. Performance Parameters and Benchmarking of Supercomputers. Parallel Computing, 17:1111--1130, 1991.
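The pingpong analysis described in the COMMS1 snippets above can be sketched as follows. Both snippets are truncated before they say how the timings are analysed, so the straight-line least-squares fit t = t0 + n / r_inf used here is an assumption, and `fit_pingpong` is a hypothetical helper rather than part of any benchmark suite. Round-trip times are halved, as the snippets describe.

```python
def fit_pingpong(lengths, round_trip_times):
    """Fit t = t0 + n / r_inf to halved pingpong times by least squares.

    Returns (t0, r_inf, n_half), where n_half = t0 * r_inf is the message
    length at which half of the asymptotic rate is achieved.
    """
    if len(lengths) != len(round_trip_times) or len(lengths) < 2:
        raise ValueError("need at least two (length, time) pairs")
    # Half the round-trip time is taken as the one-way message time.
    times = [t / 2.0 for t in round_trip_times]
    count = len(lengths)
    mean_n = sum(lengths) / count
    mean_t = sum(times) / count
    s_nn = sum((n - mean_n) ** 2 for n in lengths)
    if s_nn == 0:
        raise ValueError("message lengths must not all be equal")
    s_nt = sum((n - mean_n) * (t - mean_t) for n, t in zip(lengths, times))
    slope = s_nt / s_nn              # time per unit of message length
    if slope <= 0:
        raise ValueError("fitted slope is not positive; model does not apply")
    t0 = mean_t - slope * mean_n     # fitted startup time at zero length
    r_inf = 1.0 / slope              # fitted asymptotic rate
    return t0, r_inf, t0 * r_inf


if __name__ == "__main__":
    # Synthetic data generated from t0 = 50 and r_inf = 100; not measurements.
    lengths = [0, 1000, 2000, 4000]
    round_trips = [100.0, 120.0, 140.0, 180.0]
    print(fit_pingpong(lengths, round_trips))
```

A plain linear fit is the simplest choice; real measurements often show protocol switch points where a single line no longer fits, in which case a piecewise fit would be more appropriate.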
No context found.
R. W. Hockney. Performance Parameters and benchmarking of supercomputers. Parallel Computing, 17:1111-1130, 1991.
No context found.
R.W. Hockney. `Performance parameters and benchmarking of supercomputers'. Parallel Computing 17 (10-11), 1991.