Borkar, Shekhar, et al. iWarp: An Integrated Solution to High-Speed Parallel Computing. In Proceedings Supercomputing '88, pages 330-339, Orlando, Florida, November 1988. IEEE Computer Society and ACM SIGARCH.
.... Execution of this procedure in a four-node array topology creates the following set of processes, with the lines representing the port elements passed as arguments: the cell's interface to the outside world. These port elements can be used to establish connections to other cells. P[0] P[1] P[2] P[3] For brevity, we shall sometimes use the following more compact representation of cell. This represents the same cell as the preceding figure, and indicates that the port is to be used for input. P op 5.2 Ring Pipeline Example We use an example to illustrate how PCN programs are ....
....Oil; 11 i over O. nodes( i : op (I[i] O[i] S[ i l) nodes( S[i] node(i) The process structure created by this procedure can be drawn as follows, with the solid lines indicating the port connections to the outside world and the dotted lines representing internal streams. I[0] O[0] O[1] O[2] O[3] The following procedures implement simple input and output cells. The procedure load reads values from a file and sends them to successive elements of the port array P; the procedure store writes to a file values received on successive elements of port array . Both use the sequential ....
Borkar, S., et al., iWarp: An integrated solution to high-speed parallel computing, Proc. Supercomputing Conf., 330-339, 1988.
....the packages to about 90 GB/sec for a pin out of 1800 usable MCM pins. The bisection bandwidth would be in the range of 10 TB/sec. MORPH's flexible architecture subsumes both the processor-in-memory (PIM) and scalable shared-memory approaches. Based on the experience of several PIM-like systems [21, 20, 55, 7, 6], there is evidence that PIM organizations represent significant programming challenges, particularly for irregular applications. We believe that the use of more traditional processor-memory structures will yield a machine with more accessible performance ....
Borkar, S., Cohn, R., Cox, G., Gleason, S., Gross, T., Kung, H. T., Lam, M., Moore, B., Peterson, C., Pieper, J., Rankin, L., Tseng, P. S., Sutton, J., Urbanski, J., and Webb, J. iWarp: An integrated solution to high-speed parallel computing. In Proceedings of Supercomputing '88 (1988), IEEE Press, pp. 330--341. Orlando, Florida.
....to use global knowledge of the program for layout and transformations at compile time while Stream C interprets each basic block at runtime and performs local optimizations such as stream register allocation in order to map the current set of stream computations onto Imagine. The iWarp system [3] is a scalable multiprocessor with configurable communication between nodes. In iWarp, one can set up FIFO channels for communicating between non-neighboring tiles. However, reconfiguring the communication channels is more coarse-grained and has a higher cost than on Raw, where each cycle can be ....
S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H. T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P. S. Tseng, J. Sutton, J. Urbanski, and J. Webb. iWarp: An integrated solution to high-speed parallel computing. In Supercomputing, pages 330--339, 1988.
....applications here that require large memory or for which an appropriate Toolkit configuration would be bigger than a few boards. Simulation of fluid flow is one such example. Other examples can be found in [8]. Efforts with similar goals include the NuMesh effort at MIT, and the iWARP work at CMU [7]. There are other promising strategies for parallel computation, represented by machines such as the MIT Monsoon Dataflow machine, the Connection Machine, the Multiflow computer, and many others. These are general-purpose machines. Our idea differs in that we intend to statically configure both ....
S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H.T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P.S. Tseng, J. Sutton, J. Urbanski, and J. Webb, "iWarp: An Integrated Solution to High-speed Parallel Computing," Supercomputing '88, Kissimmee, Florida, Nov., 1988.
....a memory port (26 pins), six network ports (15 pins each), and a diagnostic port (3 pins) space. The same IDs (virtual addresses) are used to reference local (on the same node) and remote (on a different node) objects. Like the INMOS transputer [6], the Caltech MOSAIC [20], and the Intel iWARP [7], the MDP is a single-chip processing element integrating a processor, memory, and a communication unit. The MDP is unique in that it extends these previous efforts with efficient primitive mechanisms for communication, synchronization, and naming [15]. The MDP uses a direct communication network ....
....provides orders of magnitude lower communication and synchronization overhead than is possible with multicomputers built from off-the-shelf microprocessors. Communication and synchronization performance is competitive with processing nodes specialized to a single model of computation such as iWARP [7] (systolic) or the Transputer [6] (CSP). Computers built from fine-grain processing nodes, such as the MDP, consisting of a small but powerful processor and a small amount of memory, are more cost-effective than those built from fewer coarse-grain nodes. Fine-grain nodes have a larger fraction of ....
Shekhar Borkar et al. iWARP: An Integrated Solution to High-Speed Parallel Computing. In Proceedings of the Supercomputing Conference, pages 330-338. IEEE, November 1988.
....group at CMU, Bellcore, the Pittsburgh Supercomputer Center (PSC), and Bell Atlantic. The goal of the testbed is to build a gigabit Metropolitan Area Network (MAN) and to demonstrate its value to applications. The testbed consists of twenty-five DEC Alpha workstations, an iWarp parallel array [3], and a Paragon [12] on the CMU campus, and a Cray C-90, Cray T3D [1], CM-2, and Alpha cluster at the Pittsburgh Supercomputer Center (PSC). The Alpha workstations and iWarp use network interfaces that provide architectural support for copy avoidance to optimize throughput [24, 25, 15]. CMU ....
Shekhar Borkar, Robert Cohn, George Cox, Sha Gleason, Thomas Gross, H. T. Kung, Monica Lam, Brian Moore, Craig Peterson, John Pieper, Linda Rankin, P.S. Tseng, Jim Sutton, John Urbanski, and Jon Webb. iWarp: An Integrated Solution to High-Speed Parallel Computing. In Proceedings of Supercomputing '88, pages 330-339, Orlando, Florida, November 1988. IEEE Computer Society and ACM SIGARCH.
....to use global knowledge of the program for layout and transformations at compile time while Stream C interprets each basic block at runtime and performs local optimizations such as stream register allocation in order to map the current set of stream computations onto Imagine. The iWarp system [3] is a scalable multiprocessor with configurable communication between nodes. In iWarp, one can set up FIFO channels for communicating between non-neighboring tiles. However, reconfiguring the communication channels is more coarse-grained and has a higher cost than on Raw, where each cycle can be ....
S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H. T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P. S. Tseng, J. Sutton, J. Urbanski, and J. Webb. iWarp: An integrated solution to high-speed parallel computing. In Supercomputing, pages 330-339, 1988.
....section 5 concludes this paper summarizing our results and discussing possible future directions. 2 Background High performance routing networks, the subject of significant study over the last ten years, are currently in widespread use in machines such as the Intel Paragon [11], Intel iWARP [5, 25], NCUBE 2 [23], and the MIT J machine [15, 16]. All of these multicomputer systems use direct networks, meaning that the computing nodes are embedded in the network topology, and as a result, some nodes are closer than others. In addition to use in multicomputers, direct networks are gaining ....
S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H. T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P. S. Tseng, J. Sutton, J. Urbanski, and J. Webb. iWARP: An Integrated Solution to High-Speed Parallel Computing. In Proceedings of Supercomputing '88, pages 330--341. IEEE Press, 1988. Orlando, Florida.
....this chapter, we use the network model described below. Network Topology We basically study n-dimensional mesh networks. We focus on low (two, three, or four) dimensional mesh networks because they are currently in widespread use in many machines such as the Intel Paragon [14], Intel iWARP [8, 43], the MIT J machine [21, 24], Stanford DASH [35], and Tera Computer's TERA machine [5]. Unless stated otherwise, the two and four dimensional networks used contain 256 nodes. The size was determined for reasonable amounts of simulation time. For simplicity, we study networks with uniform radix, ....
S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H. T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P. S. Tseng, J. Sutton, J. Urbanski, and J. Webb. iWARP: An Integrated Solution to High-Speed Parallel Computing. In Proceedings of Supercomputing '88, pages 330--341. IEEE Press, 1988. Orlando, Florida.
....rate of 800 Mbit/second or 1.6 Gbit/second. In addition to HIPPI, there are a number of high-speed network standards in various stages of development by standards bodies. These include ATM (Asynchronous Transfer Mode) [16] and Fibre Channel [26]. Meanwhile, distributed-memory computer systems [6, 23, 27, 32, 44] are becoming the architecture of choice for many supercomputer applications. The reason is that they are inherently scalable, and provide relatively inexpensive computing cycles compared with traditional uniprocessor or shared-memory multiprocessor supercomputers. However, while traditional ....
....programs sending data to the Cray C90 at the Pittsburgh Supercomputer Center as part of a heterogeneous distributed computing application. Some of the applications that use the HIPPI interface are described in Section 7. 3 iWarp overview iWarp is a distributed-memory parallel computing system [6]. An iWarp cell consists of a single-chip iWarp processor and a local memory. The iWarp processor integrates both a high-speed computation and communication agent in a single component. The communication agent connects the iWarp cell to four neighbors through 40 MByte/second buses; the cells in ....
Shekhar Borkar, Robert Cohn, George Cox, Sha Gleason, Thomas Gross, H. T. Kung, Monica Lam, Brian Moore, Craig Peterson, John Pieper, Linda Rankin, P. S. Tseng, Jim Sutton, John Urbanski, and Jon Webb. iWarp: An integrated solution to high-speed parallel computing. In Proceedings of the
....adaptive routing against these significant costs in deciding whether or not to include adaptivity. 2 Background High performance routing networks, the subject of significant study over the last fifteen years, are currently in widespread use in machines such as the Intel Paragon [9], Intel iWarp [7, 30], Ncube 2 [26], and the MIT J Machine [11, 13]. All of these multicomputer systems use direct networks, meaning that the computing nodes are embedded in the network topology, and as a result, some nodes are closer than others. In addition to use in multicomputers, direct networks are gaining ....
S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H. T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P. S. Tseng, J. Sutton, J. Urbanski, and J. Webb. iWarp: An integrated solution to high-speed parallel computing. In Proceedings of Supercomputing '88, pages 330--341. IEEE Press, 1988. Orlando, Florida.
....used. Not surprisingly, the computing students showed a strong preference for PCN, while the science students preferred Fortran M. 5 Related Work Several parallel languages and programming environments have been developed to support the modular construction of parallel programs. Borkhar et al. [4] propose that parallel programs be constructed by plugging together cells, in a manner analogous to VLSI. They use this technique to generate efficient programs for the iWarp systolic processor. occam [24] has been used for similar purposes. The target hardware limits the programs that can be ....
S. Borkar et al., "iWarp: An integrated solution to high-speed parallel computing," in Proc. Supercomputing Conf., 1988, pp. 330--339.
....of thousands of processing elements, there should be many processing elements on each chip; otherwise, the physical space used by the coprocessor would be excessively large. Typical independently programmed systolic processing elements, such as iWarp, only have one processing element per chip [2]. Additionally, when large numbers of MIMD processors (hundreds or thousands) are used for systolic algorithms, the programs stored in each processor tend to be similar: by definition, systolic algorithms require only a small number of cell programs [1]. Thus, the ....
....In a programmable systolic array, data cannot be moved automatically since different applications require different types of data movement; data movement must be explicitly specified. Several machines provide this ability with special instructions and special queues or registers for data movement [2, 3]. In an SSR architecture, however, data moves through the array as a natural result of the user's program. Figure 1 shows a segment of a linear SSR array. Each functional unit (F_i) is adjacent to two register banks. Each functional unit can access data values from the register banks directly to ....
S. Borkar et al., "iWarp: An integrated solution to high-speed parallel computing," in Proc. Supercomputing '88, pp. 330--339, IEEE, Nov. 1988.
....the communication computation bandwidth ratio, the node degree (nr. of links per node) requirement of fixed connections, the average message size, burst communication etc. Recently many communication processor proposals and designs are made, supporting message passing in massive parallel systems [3,4,5,6,7,8,9,10,11]. In [2] we explored the design space for building such processors. It turned out that many designs cover a very restricted area into this design space, and are therefore only suitable for very limited application areas. This paper presents a scalable and flexible communication processor for ....
S. Borkar et al. iWarp: An integrated solution to high-speed parallel computing. In Proceedings of Supercomputing '88, 1988.
....transfers, and is implemented using CMAM xfer function which splits up the transfer into a sequence of hardware packets at the source, and CMAM handle left xfer function which reassembles the packets at the destination. 1 While this is not the most efficient type of network interface [13, 8, 4], it has the significant virtue that no changes to the processor are required. Many researchers believe that this type of interface is basically representative of future network interfaces. 2 The CM 5 NI also supports an interrupt driven interface for reception; however, the cost is very high ....
....exploring what impact advanced network features (adaptive routing, virtual channels) have on network interface complexity and software overhead. Our work addresses some of these issues. Research on network interfaces has focused primarily on reducing message injection (and reception) overhead [13, 8, 19, 4] or offloading the communication onto a coprocessor [14, 16, 3] Such efforts are complementary to our goal of software protocol overhead reduction. Improvements in network interface can reduce the basic communication cost in our studies. While reducing the basic cost is important, as can be seen ....
S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H. T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P. S. Tseng, J. Sutton, J. Urbanski, and J. Webb. iWarp: An integrated solution to high-speed parallel computing. In Proceedings of Supercomputing '88, pages 330--341. IEEE Press, 1988. Orlando, Florida.
.... and its differences from the classical SIMD model are described in [25]. This is the architectural model encountered when programming parallel machines intended to accelerate numerical algorithms, such as SPLASH [10], ArMen [33], or the iWarp array [6]. Figure 5: Computing the Levenshtein distance on a linear architecture. On such an architecture, the computation of the Levenshtein distance is performed ....
....of a systolic computation and the successive results of compiling it in ReLaCS for SIMD, MIMD, and sequential machines. The current state of development around ReLaCS includes a working version of the compiler on the iPSC/2 [16], ArMen [33], and iWarp [6] machines. Although not efficient, porting ReLaCS to the distributed-memory iPSC/2 machine allowed us to experiment with the compilation process for MIMD architectures. We subsequently proposed a more efficient hardware communication mechanism. This mechanism was ....
S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H.T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P. Tseng, J. Sutton, J. Urbanski, and J. Webb. iWarp: An Integrated Solution to High-Speed Parallel Computing. In ICS, 1988.
....and output channels. This paper describes hardware support for multicast communication in multicomputers that possess multiple pairs of internal channels at each node, resulting in a k port communication architecture. Examples of such multicomputers include the nCUBE 2 [1] and the Intel CMU iWARP [19]. Currently, many commercial multicomputers support only a single pair of internal channels, which may become a bottleneck for packets entering and leaving the direct network. In addition, most existing systems offer hardware support for only unicast communication. Multicast communication based on ....
S. B. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H. T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P. S. Tseng, J. Sutton, J. Urbanski, and J. Webb, "iWarp: An integrated solution to high-speed parallel computing," in Proceedings of Supercomputing'88, pp. 330--339, Nov. 1988.
....The input/output bandwidth is very high, 80 megabytes per second, to allow data to circulate between the processors. 6.2 The iWARP machine (Carnegie Mellon University and Intel) The iWARP, developed jointly by Intel and Carnegie Mellon University [Bork88] and funded by DARPA for four years, is an integrated version of the processing element of the WARP machine. The iWARP uses the VLIW (Very Long Instruction Word) approach, that is, long instructions explicitly specify the set of operations executing in ....
S. Borkar, R. Cohn, G. Cox, T. Gross, H.T. Kung, M. Lam, M. Moore, C. Peterson, J. Pieper, J. Rankin, P.S. Tseng, J. Sutton, J. Urbanski, and J. Webb. iWarp: An Integrated Solution to High-Speed Parallel Computing. In Proceedings of Supercomputing '88, pages 330--339, Orlando FL (USA), November 1988.
....parallel algorithm which requires only local communication on a linear array. The partitioning scheme is suitable for systolic implementation on a multiprocessor machine. Our experiments have been conducted on an iWarp [32]. This machine supports the register-to-register communication model that is required to efficiently execute systolic programs [33]. The machine we use has 8 processors connected in a ring. To support efficient systolic experiments, we use the parallel language ReLaCS [34] which embodies both the ....
S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P. Tseng, J. Sutton, J. Urbanski, and J. Webb, "iWarp: An Integrated Solution to High-Speed Parallel Computing," in ICS, 1988.
....regular data flow through a network of identical cells with local memory. This characteristic is supported by an inter processor communication mechanism which avoids the overflow of the local memory [7] There exists one commercially available programmable machine based on this concept, the iWarp [2]. However this machine is much more expensive than a Transputer machine and its processor is not available as a component chip. One could expect to use the Transputer links and the services of a distributed system for systolic communications. And this is the straightforward approach with which we ....
S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H.T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P. Tseng, J. Sutton, J. Urbanski, and J. Webb. iWarp: An Integrated Solution to High-Speed Parallel Computing. In International Conference on Supercomputing, 1988.
.... also supports user-level message passing, but places more burden on application programs by requiring them to construct their own message headers [15]. Some previous machines have worked to streamline the hardware/software interface by mapping network interface FIFOs into processor registers [14, 24, 37]. Such approaches go against SHRIMP's goal of using commodity CPUs. A slightly less integrated approach, mapping FIFOs to memory rather than registers, was employed in the CM-5 [42]. CM-5 implementation restrictions limited the degree of multiprogramming, however, and applications were still ....
S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H. T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P.S. Tseng, J. Sutton, J. Urbanski, and J. Webb. iWarp: An Integrated Solution to High-Speed Parallel Computing. In Proceedings of Supercomputing '88, pages 330--339, 1988.
....can be easily made to increase the degree of automatic protection against errors at the expense of increased bandwidth overhead or buffer memory size. It is shown in [5] that while enjoying automatic protection against errors, the N23 Scheme is equivalent to the additive credit updating methods [6, 7, 12], as far as the effect of flow control on buffer management is concerned. A major motivation for using the per-VC link-by-link flow control (LLFC) approach is to maximize ATM network performance. We have been performing simulations to validate the approach based on these credit-based flow control ....
....particular, the new credit count will not be relative to the old credit count. This is in contrast with relative or additive updating used in some previously proposed credit-like flow control schemes [6, 7, 12], where the new credit count is equal to the old credit count plus the newly received credit value. The absolute credit updating allows a robust flow control scheme in the sense that any effect of a corrupted credit can be recovered automatically by the arrival of the next successfully transmitted ....
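The contrast between additive and absolute credit updating drawn in the two contexts above can be illustrated with a small sketch. This is a simplified model under stated assumptions, not the N23 scheme itself: the function names, the use of `None` for a lost or corrupted credit message, and the choice to let an absolute credit message carry the receiver's cumulative count of forwarded cells are all hypothetical.

```python
def additive_balance(buffer_size, cells_sent, credit_msgs):
    """Sender's credit balance when each credit message carries a delta.

    credit_msgs is a list of integers (newly freed buffer slots);
    None marks a credit message that was lost or corrupted (assumption).
    """
    balance = buffer_size - cells_sent
    for delta in credit_msgs:
        if delta is None:
            continue  # the delta is gone for good; the balance never recovers
        balance += delta  # new count = old count + received credit value
    return balance


def absolute_balance(buffer_size, cells_sent, credit_msgs):
    """Sender's credit balance when each credit message carries a total.

    Here each message is assumed to carry the cumulative number of cells
    the receiver has forwarded so far; None again marks a lost message.
    """
    forwarded = 0
    for total in credit_msgs:
        if total is None:
            continue  # skipped, but the next good message supersedes it
        forwarded = total  # overwrite: not relative to the old value
    return buffer_size - (cells_sent - forwarded)


# The receiver frees 2 slots three times; the second credit message is lost.
print(additive_balance(10, 6, [2, None, 2]))   # additive: 2 credits stay lost
print(absolute_balance(10, 6, [2, None, 6]))   # absolute: fully recovered
```

In this model the additive sender ends with a balance of 8 instead of 10, while the absolute sender is corrected by the next message that arrives intact, which is the recovery property the quoted text describes.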
S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H. T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P. S. Tseng, J. Sutton, J. Urbanski, and J. A. Webb, "iWarp: An Integrated Solution to High-Speed Parallel Computing," Proceedings of Supercomputing '88 Conference, Orlando, Florida, November 1988, pp. 330-339.
.... (defun split (v f s) (let ((notf (not f)) (tmp (scan (btoi f) s)) (down (scan (btoi notf) s)) (sum (reduce notf s)) (up (+ (dist sum s) tmp)) (ind (select f up down))) (permute v ind s))) v = [7 8 4 2 9 0 7] f = [T T F F T F T] s = (3 4) notf = [F F T T F T F] tmp = [0 1 2 0 0 1 1] down = [0 0 0 0 1 1 2] sum = [1 2] up = [1 2 3 2 2 3 3] ind = [1 2 0 0 2 1 3] result = [4 7 8 2 0 9 7] Figure 3: Definition and action of the split operation. Vectors are enclosed in square brackets, while segment descriptors are enclosed in parentheses. The operation btoi converts a boolean vector to an integer vector. 2.2.1 Representations of ....
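The split operation quoted in this context can be sketched in plain Python to make the role of each intermediate vector concrete. This is a sequential illustration, assuming that the scans are exclusive plus-scans restarted at each segment boundary (which is what the worked values for tmp and down suggest); the helper names `exclusive_scan` and `segment_bounds` are hypothetical and not taken from the cited paper.

```python
def exclusive_scan(xs):
    """Exclusive plus-scan: out[i] is the sum of xs[0..i-1]."""
    out, running = [], 0
    for x in xs:
        out.append(running)
        running += x
    return out


def segment_bounds(s):
    """Yield (start, end) index pairs for a segment descriptor of lengths."""
    start = 0
    for length in s:
        yield start, start + length
        start += length


def split(v, f, s):
    """Within each segment, move the elements whose flag is False to the
    front and the elements whose flag is True to the back, keeping the
    relative order inside each group."""
    result = [None] * len(v)
    for lo, hi in segment_bounds(s):
        flags = f[lo:hi]
        tmp = exclusive_scan([int(b) for b in flags])        # rank among True
        down = exclusive_scan([int(not b) for b in flags])   # rank among False
        total_false = sum(1 for b in flags if not b)         # "sum" per segment
        for i, flag in enumerate(flags):
            # "up" is total_false + tmp[i]; select picks up or down by flag
            ind = total_false + tmp[i] if flag else down[i]
            result[lo + ind] = v[lo + i]                      # permute
    return result


T, F = True, False
print(split([7, 8, 4, 2, 9, 0, 7], [T, T, F, F, T, F, T], [3, 4]))
# -> [4, 7, 8, 2, 0, 9, 7], matching the "result" vector in the figure
```

A data-parallel implementation would compute tmp, down, and the permutation with vector primitives instead of loops; the loops here only serve to show the index arithmetic.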
....loop parallelism is a good execution model on most parallel machines, other models may be better suited to some machines. For instance, a macro-actor model [59] may be suitable for a machine such as the J machine [29] and functional pipelines [44, 18] may be well suited to a machine such as iWarp [12]. We chose not to pursue such alternative execution models as loop parallelism is effective over a wider range of parallel machines. 3 Size Inference Each computational operation in VCODE can be implemented as a normalized loop (i.e. a loop where the lower bound is zero and the loop ....
BORKAR, S., COHN, R., COX, G., GLEASON, S., GROSS, T., KUNG, H. T., LAM, M., MOORE, B., PETERSON, C., PIEPER, J., RANKIN, L., TSENG, P. S., SUTTON, J., URBANSKI, J., AND WEBB, J. iWarp: An integrated solution to high-speed parallel computing. In Proceedings of Supercomputing '88 (Orlando, FL, Nov. 1988), pp. 330--339.
.... block line word consumer yes yes no [18] deliver line block line word producer yes yes yes [17] reader copy block block consumer no yes no [16, 19] writer copy block block word producer no yes no [16, 19] message block word producer no yes no [5] message eager block word producer no yes yes [6, 4] Table 1: Summary of mechanisms. Weaker models of consistency have been proposed, however, which do allow store concurrency [2, 12] These models recognize that a series of writes may proceed in any order, if they do not affect the global view of the computation, as long as they all complete ....
....issues an instruction which transmits the entire message to a stream in the consumer's cache. Finally, we consider the message eager mechanism, which is message with eager transmission. In this model, data is written to the network as soon as it is produced, as in the J Machine [7]. Intel's iWarp [4] architecture has a similar mechanism, in which logical channels can be set up between processes in advance. The message, message eager, and writer copy mechanisms are producer initiated; the reader copy mechanism is consumer initiated. 4 Evaluating the Mechanisms Now that we have enumerated the ....
Shekhar Borkar et al. iWarp: An integrated solution to high-speed parallel computing. In Supercomputing '88, November 1988.
....deal of attention. This is primarily due to the number of parallel computer systems that use multidimensional meshes for interconnection. Some examples of existing or proposed machines that make use of direct networks are: Caltech Cosmic Cube [4], Caltech Mosaic [5], CMU Intel iWarp [6], [7], Connection Machine [8], HORIZON [9], Intel iPSC and Paragon, MIT Alewife [10], MIT J machine [11], MuNet [12], Stanford DASH Multiprocessor [13], Thinking Machines CM2 [8], and Cray T3E [14]. Fig. ....
S. Borkar et al., "iWarp: An integrated solution to high-speed parallel computing," in Proc. Supercomputing '88, Nov. 1988.
.... is effective for three reasons: 1) Efficient systolic algorithms exist for parallelizing block operations such as matrix multiplication [10], [13], [14]. 2) Fine-grain distributed-memory parallel machines capable of efficient execution of systolic algorithms have become available, such as iWarp [2], [3]. iWarp is commercially available from Intel. 3) Libraries written using block routines for linear algebra computations have been developed, such as LAPACK [6], [8]. Thus our approach takes advantage of advances in several areas, including parallel algorithm design, parallel architectures, ....
S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H. T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P. S. Tseng, J. Sutton, J. Urbanski, and J. Webb. iWarp: An integrated solution to high-speed parallel computing. In Proceedings of the Supercomputing Conference, pages 330--339, November 1988.
....and communication locality. Remember that massive parallelism results in small tasks having a high communication to processing bandwidth ratio. Flexible and fast communication requires hardware communication support. Many designs for communication processors (CPs) are proposed and made recently [1,2,3,4,5,6,7,8,9]. The design tradeoffs made resulted in very different processors. E.g., [8] describes a very high performance but rather restrictive processor. On the other hand, [4] gives a description of a CP able to support deadlock and livelock free communication for arbitrary (although degree limited) network ....
....routing tables. Virtual links: If multiple connections are needed using the same physical link, the network can support virtual links. This has the additional advantage of (1) higher link utilization, because of less message blocking, and (2) the possibility of intermediate node routing. [7] and [10] describe virtual link implementations. A rather different approach to establishing virtual connections is taken in [2], where messages are divided into packets, which contain a virtual connection number. Virtual connections are not supported within the network, and therefore it lacks ....
S. Borkar et al. iWarp: An integrated solution to high-speed parallel computing. In Proceedings of Supercomputing '88, 1988.
....networks are favored because they scale better than high dimensional networks, they are modular, and they are easy to implement. Examples of machine designs that use such networks are the MuNet [12], Ametek 2010 [26], the Caltech Mosaic [3], the MIT J machine [9], and the CMU Intel iWarp [4]. Some recent distributed shared memory designs are also planning to use low dimensional direct networks, e.g. HORIZON [18], the Stanford DASH Multiprocessor [20], and the MIT Alewife machine [2, 6]. The choice of the optimal network for a multiprocessor is highly sensitive to the assumptions ....
Shekhar Borkar et al. iWarp: An Integrated Solution to High-Speed Parallel Computing. In Proceedings of Supercomputing '88, November 1988.
....the phase rotation FFT. We present a new set of recipes for generating the twiddles and shuffle indices directly in terms of the parallel pipeline. Finally, we describe mapping strategies for the phase rotation FFT on the iWarp, a parallel computer system developed by Intel and Carnegie Mellon [1, 2]. We describe a fine-grained approach for an N-point radix-2 phase rotation FFT that balances computation and communication to run at the full 40 Mbytes/sec rate of the iWarp physical links, regardless of the size of the input data sets. Section 2 introduces the phase rotation concept. Section 3 ....
....for the radix-2 FFT on the iWarp system. The main result is a scalable implementation of the pipelined phase rotation FFT that runs at the full 40 Mbytes/second rate of the iWarp physical links. 5.1. iWarp The iWarp is a private-memory multicomputer developed jointly by Intel and Carnegie Mellon [1, 2]. iWarp systems are 2-dimensional tori of iWarp nodes, ranging in size from 4 to 1024 nodes. Each node consists of an iWarp component, up to 16 Mbytes of off-chip local memory, and a set of 8 unidirectional communication links that physically connect the node to four neighboring nodes. The iWarp ....
BORKAR, S., COHN, R., COX, G., GLEASON, S., GROSS, T., KUNG, H. T., LAM, M., MOORE, B., PETERSON, C., PIEPER, J., RANKIN, L., TSENG, P. S., SUTTON, J., URBANSKI, J., AND WEBB, J. iWarp: An integrated solution to high-speed parallel computing. In Supercomputing '88 (Nov. 1988), pp. 330--339.
....feedback to the user. Our system meets all of these goals, and provides considerable flexibility for future expansion. In the current implementation it supports four synchronized cameras sampling 512x480 8-bit grayscale images at 30 Hz. The foundation of this system is an iWarp parallel computer [2][3] which manages the overall data flow. Video input to the iWarp is performed by locally developed hardware. The video data is stored in the iWarp's local memory, and simultaneously sent via a High Performance Parallel Interface (HiPPI) network to a frame buffer, where all four images are ....
Borkar, S., R. Cohn, et al. (1988). iWarp: An Integrated Solution to High-Speed Parallel Computing. Proceedings of Supercomputing `88, Orlando, Florida, 330-339.
....consists of the physical camera setup described earlier in this section, the video interface board, and the 8 8 matrix of iWarp cells (Fig. 4) Each iWarp component contains a 20 MFLOPS computation engine and low latency (100 150 ns) communication engine for interfacing with other iWarp cells [3]. The existing iWarp system is an 8 8 torus of iWarp cells, half of which have 16 MB DRAMS per cell. The video interface, which is described in detail elsewhere [17] is connected directly to the iWarp cell through the memory interface; the digitized video data is routed and distributed at video ....
Borkar, S., et al. iWarp: An Integrated Solution to High-Speed Parallel Computing. in Proceedings of Supercomputing '88. 1988. Orlando, Florida.: p. 330-339.
....connection is critical to achieve good performance from the coprocessor. Thus, the objective of minimizing the number of accesses over the limited bandwidth connection is considered in the mapping process. This is in contrast to other approaches of building general purpose systolic computers [24, 25, 26, 27, 7]. Thus, Chapter 5 discusses design methods under constraints of fixed bandwidth and area, and objectives of yield (clock frequency) or speedup, and number of accesses. The mapping process incorporates the General Parameter Method (Chapters 2 and 4) to map partitioned dependence graphs of the given ....
....can be extended to processor arrays of arbitrary dimensions. We choose to study linear arrays because they are easier to build and program than arrays of higher dimension. Hence, several linear arrays have been implemented for specific applications as well as for general purpose computing [34, 35, 24, 25]. The organization of this chapter is as follows. Section 1.2 describes the model of algorithms targeted in this thesis, followed by a discussion of previous and related work in Section 2.1. Section 2.2 presents the definitions of parameters, followed by the constraint equations for valid systolic ....
S. Borkar, "iWarp : An integrated solution to high-speed parallel computing," Proceedings Supercomputing, pp. 330--339, IEEE Computer Society Press, Nov. 1988.
....consists of the physical camera setup described earlier in this section, the video interface board, and the 8x8 matrix of iWarp cells (Fig. 2). Each iWarp component contains a 20 MFLOPS computation engine and low-latency (100-150 ns) communication engine for interfacing with other iWarp cells [3]. The existing iWarp system is an 8x8 torus of iWarp cells, half of which have 16 MB DRAMs per cell. The video interface, which is described in detail elsewhere [18], is connected directly to the iWarp cell through the memory interface; the digitized video data is routed and distributed at video ....
Borkar, S., et al. iWarp: An Integrated Solution to High-Speed Parallel Computing. In Proc. of Supercomputing '88. 1988. Orlando, Florida: p. 330-339.
.... also supports user-level message passing, but places more burden on application programs by requiring them to construct their own message headers [15]. Some previous machines have worked to streamline the hardware/software interface by mapping network interface FIFOs into processor registers [14, 25, 38]. Such approaches go against SHRIMP's goal of using commodity CPUs. A slightly less integrated approach (mapping FIFOs to memory rather than registers) was employed in the CM-5 [43]. CM-5 implementation restrictions limited the degree of multiprogramming, however, and applications were still ....
S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H. T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P.S. Tseng, J. Sutton, J. Urbanski, and J. Webb. iWarp: An Integrated Solution to High-Speed Parallel Computing. In Proceedings of Supercomputing '88, pages 330--339, 1988.
....freeware provided for UNIX workstations. 1 1 Introduction C-stolic was first designed for programming a linear systolic array: the MicMacs machine [3]. Then the C-stolic compiler evolved to support simulation of systolic algorithms on parallel machines [6] such as the iPSC/2 [2], the iWarp [9], the ArMen machine [4], or the MasPar [1]. In addition, a simulator running on UNIX workstations has been developed for testing and debugging purposes. Only the simulator is provided within the freeware. It allows programmers to easily design, experiment, and validate systolic algorithms. ....
Borkar S., Cohn R., Cox G., Gleason S., Gross T., Kung H.T., Lam M., Moore B., Peterson C., Pieper J., Rankin L., Tseng P. S., Sutton J., Urbanski J., and Webb J. iWarp: An integrated solution to high-speed parallel computing. In Proceedings of Supercomputing '88, pages 330--339. IEEE Computer Society and ACM SIGARCH, Nov. 1988.
....the optimal value for n is two or three [2, 8, 10]. Many existing and emerging multiprocessor systems use such low-dimensional direct networks to interconnect the processors, including the Intel Paragon, Cray T3D, Stanford Dash [14], M.I.T. Alewife [1], M.I.T. J-Machine [16], and CMU/Intel iWarp [5]. In this paper, we develop performance models to study k-ary n-cube networks with wormhole routing, with either single-flit or infinite network buffers. Our model for the single-flit buffer case includes the deadlock-free routing algorithm of Dally and Seitz [9]. In contrast to previous analyses ....
S. Borkar, iWarp: An Integrated Solution to High-Speed Parallel Computing, Proceedings of Supercomputing '88, November 1988.
....in this category include, for example, Linda [14], Split-C [16], and CSP [29]. 3.2 Scheduled Routing Architectures There are a number of existing scheduled routing architectures discussed in the literature. Two of them are presented here, iWarp and GF11. 3.2.1 iWarp The iWarp architecture [54, 10, 11] is CMU and Intel's follow-on project to the Warp architecture. iWarp integrates the processing and routing units on a single chip, targeted to DSP, scientific, and image processing. The interface between the processor and router is an interface register file used for systolic communication, ....
Shekhar Borkar et al. iWarp: An integrated solution to high-speed parallel computing. In Proceedings of Supercomputing '88, November 1988.
....is sometimes done for a parallel file system. Among the possible network choices, HIPPI is currently the most popular one, and most manufacturers of distributed-memory parallel systems either provide or have announced a HIPPI connection (e.g., CM-2 [20], CM-5 [7], iPSC/860 [12], NCube2 [14], iWarp [4], Paragon XP/S [11], Maspar [2, 15]). As far as an application on the parallel system is concerned, the exact characteristics of the external links do not matter, and the I/O node provides an appropriate abstraction. We can think of the I/O nodes as establishing the periphery of the parallel ....
....toms. Each node is connected to its four neighbors by 2 unidirectional links with a peak bandwidth of 40 MBytes/sec. Activity on all links can proceed in parallel, so each node has a peak communication bandwidth of 320 MBytes/sec. The local memory bandwidth of each node is 160 MBytes/sec [4]. 4.1 Measurements In this section, we present some measurements to illustrate how the effective bandwidth varies as a function of the message size and how multiple links can be used to provide an effective bandwidth that is higher than the individual link bandwidth. We concentrate on the ....
Shekhar Borkar, Robert Cohn, George Cox, Sha Gleason, Thomas Gross, H. T. Kung, Monica Lam, Brian Moore, Craig Peterson, John Pieper, Linda Rankin, P.S. Tseng, Jim Sutton, John Urbanski, and Jon Webb. iWarp: An integrated solution to high-speed parallel computing. In Proceedings of the 1988.
No context found.
Borkar, Shekhar, et al. iWarp: An Integrated Solution to High-Speed Parallel Computing. In Proceedings Supercomputing '88, pages 330-339, Orlando, Florida, November 1988. IEEE Computer Society and ACM SIGARCH.
No context found.
Shekhar Borkar, Robert Cohn, George Cox, Sha Gleason, Thomas Gross, H. T. Kung, Monica Lam, Brian Moore, Craig Peterson, John Pieper, Linda Rankin, P. S. Tseng, Jim Sutton, John Urbanski, and Jon Webb. "iWarp: An Integrated Solution to High-Speed Parallel Computing". In Supercomputing '88, pages 330--339, 1988.
No context found.
Shekhar Borkar, Robert Cohn, George Cox, Sha Gleason, Thomas Gross, H. T. Kung, Monica Lam, Brian Moore, Craig Peterson, John Pieper, Linda Rankin, P. S. Tseng, Jim Sutton, John Urbanski, and Jon Webb. iWarp: An Integrated Solution to High-Speed Parallel Computing. Proceedings of the
No context found.
S. Borkar et al., "iWarp: An Integrated Solution to High-Speed Parallel Computing", Proceedings of Supercomputing 1988.
No context found.
S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H. T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P. S. Tseng, J. Sutton, J. Urbanski, and J. Webb. iWarp: An integrated solution to high-speed parallel computing. In Proc. of Supercomputing '88, pages 330--339, November 1988.
No context found.
S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H. T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P. S. Tseng, J. Sutton, J. Urbanski, and J. Webb. iWarp: An integrated solution to high-speed parallel computing. In Supercomputing, 1988.
No context found.
S. Borkar et al., "iWarp: An integrated solution to high-speed parallel computing," in Proc. Supercomputing 1988.
No context found.
S. Borkar, R. Cohn, G. Cox, S. Gleason, T. Gross, H.T. Kung, M. Lam, B. Moore, C. Peterson, J. Pieper, L. Rankin, P.S. Tseng, J. Sutton, J. Urbanski, and J. Webb, "iWarp: An Integrated Solution to High-speed Parallel Computing," Supercomputing '88, Kissimmee, Florida, Nov., 1988.
No context found.
Shekhar Borkar, Robert Cohn, George Cox, Sha Gleason, Thomas Gross, H. T. Kung, Monica Lam, Brian Moore, Craig Peterson, John Pieper, Linda Rankin, P.S. Tseng, Jim Sutton, John Urbanski, and Jon Webb. iWarp: An Integrated Solution to High-Speed Parallel Computing. Proceedings of Supercomputing '88, IEEE Computer Society and ACM SIGARCH, Orlando, Florida, November, 1988, pp. 330-339.
No context found.
Shekhar Borkar et al., iWarp: An integrated solution to high-speed parallel computing, in: Proc. of Supercomputing '88 (1988) 330-339.