Intel, "Paragon XP/S Product Overview," Supercomputer Systems Division, Intel Corporation, Beaverton, OR 97006, 1991.
....be the most efficient and scalable parallel architecture in constructing the high-performance parallel systems. Many existing prototypes and commercial k-D mesh systems have been built, such as the 2-D Intel Sandia ASCI System [17], the 2-D Intel Touchstone system [11], the 2-D Intel Paragon XP/S [12], the 3-D Tera Computer System [1], the 3-D Mosaic C [19], etc., for executing supercomputing applications (which usually require large subsystem sizes). Among those massively mesh-connected multicomputers, the Intel products [11, 12, 17] support a multi-user environment (executing various ....
....the 2-D Intel Touchstone system [11], the 2-D Intel Paragon XP/S [12], the 3-D Tera Computer System [1], the 3-D Mosaic C [19], etc., for executing supercomputing applications (which usually require large subsystem sizes). Among those massively mesh-connected multicomputers, the Intel products [11, 12, 17] support a multi-user environment (executing various independent applications in parallel) in which an incoming task (or job) with a particular size is allocated on a submesh by the operating system, residing in the host. In order to allocate those supercomputing applications (or jobs) in efficient ....
Intel, "Paragon XP/S Product Overview," Supercomputer Systems Division, Intel Corporation, Beaverton, OR 97006, 1991.
....programmer to a host of program development and maintenance problems. Existing parallel machines present two fundamentally different programming models: the shared memory model [Hagersten et al. 1992; Kendall Square Research Corporation 1992; Lenoski et al. 1992] and the message passing model [Intel Supercomputer Systems Division 1991; Thinking Machines Corporation 1991]. Even machines that support the same basic model of computation may present interfaces with significantly different functionality and performance characteristics. Developing the same computation on different machines may therefore lead to radically different ....
Intel Supercomputer Systems Division. 1991. Paragon XP/S Product Overview. Intel Supercomputer Systems Division.
....Index Terms: Interconnection networks, Mesh, Wormhole Routing, Deterministic Routing, Virtual Channels, Traffic pattern, Performance. 1 Introduction The 2-dimensional mesh (or mesh for short) has been one of the most common interconnection networks. It has been used in the Intel Paragon [12], Intel Touchstone Delta [13], Symult 2010 [20], and Stanford DASH [14]. The adoption of the mesh in recent practical systems has been mainly influenced by Dally's study [4]; his results have shown that for an equal implementation cost in VLSI the low-dimensional, high-diameter torus (or mesh) has ....
....order. In other networks, such as tori, in addition to dimension-ordered routing between dimensions, two virtual channels per physical channel are required to prevent deadlock within a dimension as a result of the wrap-around connections. Deterministic routing has been widely used in practice [12,13,14,15,19] as a result of its simplicity and minimal requirement for virtual channels. Analytical models of deterministic routing in wormhole-routed k-ary n-cubes, e.g. hypercubes and tori, have been widely reported in the literature [1,4,7,9,10]. More recently, a similar model for the 2-dimensional mesh ....
Intel Corporation, Paragon XP/S Product Overview, 1991.
....computing environment. Much recent research has therefore focussed on extending parallel processing solutions to such networks of commodity workstations (NOWs). Traditionally, parallel processing machines have been built with processing nodes interconnected in regular topologies such as a mesh [12], torus [5], hypercube [11], multistage interconnection network (MIN) [22, 40], etc. Such regular topologies have important mathematical properties that make message communication easier/better by making message routing simpler, lowering the average distance per communication, and/or increasing ....
Intel Corporation. Paragon XP/S Product Overview, 1991.
....of multicast (multicast to all nodes in the system), we will consider multicast for the remainder of this paper. However, it must be noted that all the developed algorithms and theories in this paper apply to broadcast as well. Current generation parallel systems like IBM SP2 [41], Intel Paragon [16], Cray T3E [35], nCube 3 [12], J-Machine [28], and Stanford FLASH use the cut-through switching technique due to its inherent advantages like low-latency communication and reduced communication hardware overhead [27]. These systems provide very small buffer space at each hop, which results in ....
....due to its inherent advantages like low-latency communication and reduced communication hardware overhead [27]. These systems provide very small buffer space at each hop, which results in links getting held up by blocked worms. Also, these systems use regular network topologies (such as meshes [16], tori [35], hypercubes [3, 8], multistage interconnection networks [41], etc.) with various deadlock-free routing schemes. Such regular topologies have important mathematical properties that make message communication easier by making message routing simpler, lowering the average distance per ....
Intel Corporation. Paragon XP/S Product Overview, 1991.
....computing. Much recent research has therefore focussed on extending parallel processing solutions to such networks of commodity workstations (NOWs). Traditionally, parallel processing machines have been built with processing nodes interconnected in regular topologies such as a mesh [14], torus [7], hypercube [13], multistage interconnection network (MIN) [26, 47], etc. Such regular topologies have important mathematical properties that make message communication easier/better by making message routing simpler, lowering the average distance per communication, and/or increasing ....
Intel Corporation. Paragon XP/S Product Overview, 1991.
....multiprocessor nodes are sometimes used. I/O devices are often connected directly to certain processing nodes. Some machines have a special front-end processor that is used for system administration. 22 Typical interconnection networks are exhibited by Intel's DELTA [Intel 91b] and Paragon [Intel 91a] systems. In these systems, the nodes are connected in a two-dimensional grid topology. This topology offers a high bisection bandwidth and presents fewer packaging problems than other organizations [Dally 87]. Messages are routed using a non-adaptive, oblivious algorithm: a message first moves ....
Intel Supercomputer Systems Division. Paragon XP/S Product Overview, 1991.
....will continue to be one of the primary potential bottlenecks in high-performance parallel computing. To improve I/O subsystem performance, many parallel systems employ a set of disks in parallel. The Intel Paragon XP/S, for example, supports multiple redundant arrays of inexpensive disks (RAIDs) [24, 14]. Systems such as this have an impressive peak I/O throughput, equal to the product of the throughput of each device and the number of devices. However, at the present time, their effective throughput in a parallel, multi-user scientific computing environment is not well understood. 1 Recently, ....
....can also exercise some control over the data distribution. Related work also includes a number of commercial parallel file systems: the CM-5 Scalable Parallel File System [19, 18], the Intel Concurrent File System [10] for the iPSC/2 and iPSC/860, and the Intel Paragon's Parallel File System [14]. These provide data striping and a small set of parallel file access modes. In many cases, these access modes do not allow the application enough control to extract good performance from the input/output system. Distributed file systems, such as Zebra [12] and Swift [3], stripe data over ....
Intel Supercomputer Systems Division. Paragon XP/S Product Overview. Beaverton, OR, Nov. 1991.
....Unfortunately, the cost of communication may limit the performance of parallel computers. To fully realize the advantages of parallel processing, we need to design efficient communication mechanisms. Existing communication architectures span a spectrum ranging from message passing [Arlauskas 88, Intel 91a, Dally 90, TMC 91b] to remote memory access [Crowther et al. 85, Cray 93], shared memory [Sequent 87, Lenoski et al. 92, Agarwal et al. 91], and cache-only architectures [Hagersten 92a, KSR 92]. These communication architectures are often used directly by the programmer, a fact that has ....
....but both designs provide the same functionality. Using off-the-shelf processors reduces overall design time for the machine and results in faster time to market. In fact, most existing parallel computers, such as Intel's series of message passing machines [Arlauskas 88, Bokhari 90, Intel 91b, Intel 91a], the Thinking Machines CM-5 [TMC 91b], or the Cray T3D [Cray 93] (a NUMA machine), use processing nodes built around a commercial off-the-shelf microprocessor. Instead of requiring special-purpose processor instructions, these machines control communication through hardware external to the ....
Intel Supercomputer Systems Division. Paragon XP/S Product Overview, 1991.
....virtual channels) have on network interface complexity and software overhead. Our work addresses some of these issues. Research on network interfaces has focused primarily on reducing message injection (and reception) overhead [13, 8, 19, 4] or offloading the communication onto a coprocessor [14, 16, 3]. Such efforts are complementary to our goal of software protocol overhead reduction. Improvements in network interface can reduce the basic communication cost in our studies. While reducing the basic cost is important, as can be seen from our studies, reducing the software protocol overhead is ....
Intel Corporation. Paragon XP/S Product Overview, 1991.
....leading to the development of parallel systems using such processor clusters as building blocks instead of single processors, thus allowing a modular and hierarchical approach to building large systems. Prominent examples of processor-cluster-based systems are the Stanford DASH [8], Intel Paragon [11], and Cray T3D [5]. Typically, the interconnections connecting the processor clusters of these systems are scalable meshes, tori, or multistage networks. This is referred to as the inter-cluster network. The interconnection inside a processor cluster is referred to as the intra-cluster topology. ....
....based systems Computing nodes having more than one processor on a single multi-chip module or processor board are becoming increasingly available. In the recent past many parallel multiprocessor systems have been developed using such processor clusters, e.g. the CRAY T3D [5], Intel Paragon [11], and the Stanford DASH [8]. Most of these systems are two-level architectures as shown in Figure 1. The processor clusters are interconnected through a scalable inter-cluster network, e.g. 3D torus, 2D mesh. The cluster configuration can vary from a simple star connection as in the T3D to a bus ....
Intel Corporation. Paragon XP/S Product Overview, 1991.
....packets, latency drops to 25 µs, and for larger packets, bandwidth rises to 19.6 MB/s. This delivered bandwidth is greater than OC-3 ATM's physical link bandwidth of 19.4 MB/s. FM's performance exceeds the messaging performance of commercial messaging layers on numerous massively parallel machines [21, 29, 11]. A good characterization of a messaging layer's usable bandwidth (bandwidth for short messages) is n_1/2, the packet size to achieve half of the peak bandwidth (r_∞/2). FM achieves an n_1/2 of 54 bytes. In comparison, Myricom's commercial API requires messages of over 3,873 bytes to ....
....some penalty in latency; 512-byte packets deliver 19.6 MB/s, greater than OC-3 ATM, and competitive with commercial massively parallel machines. For example, while FM's latencies are larger than Active Messages on the CM-5, the bandwidth is much higher. FM also compares favorably to recent MPPs [20, 21] in both bandwidth and latency. While there may appear to be many design tradeoffs involving performance for short or long messages (latency versus bandwidth), the design of FM is a counterexample. Despite consistently favoring low latency, the delivered peak bandwidth is within a few MB/s of the ....
Intel Corporation. Paragon XP/S Product Overview, 1991.
....demonstrates the impact of varying packaging and demand parameters. Finally, concluding remarks and future work are presented. 2 Designing Systems with k-ary n-cube cluster-c Organization 2.1 k-ary n-cube cluster-c Organization Many current parallel systems like the CRAY T3D [8], Intel Paragon [14], and the Stanford DASH [11] are taking a two-level clustering approach. Recently, we have introduced a new k-ary n-cube cluster-c organization [4, 5, 19] to capture this upcoming trend in building scalable parallel systems. In this organization, the lower level consists of k^n processor ....
....wires have higher capacitance [2], leading to elongation of the channel cycle time. However, it has been shown [25] that this problem can be alleviated by applying pipelining techniques over long wires. Such techniques are being commonly employed in recently developed systems like the Intel Paragon [14]. In this study we assume such pipelining techniques are being used to limit the channel cycle time in the inter-cluster system. 5 Parameterizing Packaging Technologies The characteristics and limitations of each level of packaging have an extreme impact on the set of achievable or feasible ....
Intel Corporation. Paragon XP/S Product Overview, 1991.
....is nothing but a DMA channel. Since current technology supports multiple (2-4) DMA channels on a single chip, the above requirement translates to incorporation of 1-2 additional DMA chips at each processor-router interface, which is a very low-cost solution. It is to be noted that Intel Paragon [17] uses up to two physical consumption channels for each router while matching the aggregate bandwidth of two consumption channels with that of the processor-memory bus. As the technology is moving towards 64-bit processors with 64/128-bit wide processor-memory bus and 16/32-bit wide communication ....
Intel Corporation. Paragon XP/S Product Overview, 1991.
....local versus remote transfers, and register versus memory access costs. Remote transfers cost more than corresponding local transfers due to routing and network interface delays. Local memory transfers are more 3 The CM-5 [10] provides a register-to-register primitive, while the Paragon [11] provides a memory-to-memory transfer operation. Situation Version Description local lsend register register destination lsend mem memory memory remote rsend register register destination rsend mem memory memory local cost cost memory access register access lsend mem ....
Intel Corporation, Paragon XP/S Product Overview, 1991.
....system calls required) and a high bandwidth communication network. In particular, we make no assumptions about special mechanisms for local synchronization, global synchronization, or a global address space. This model is compatible with a broad range of existing and announced multicomputers [39, 13, 12]. In such machines, the fundamental performance issues are balancing the level of concurrency exploited against the cost of scheduling and context switching, exploiting data locality within and between nodes and achieving a reasonably balanced work distribution throughout the machine. In the ....
Intel Corporation. Paragon XP/S product overview. Product Overview, 1991.
....and low latency. Similarly, the architectural trend is shifting from hypercube topologies to k-ary n-cube systems with lower dimension (n) and larger size (k) to gain advantages from constant bisection bandwidth constraints [6]. Representative systems falling into this category are Intel Paragon [11], Stanford DASH [14], and KSR [13]. As this new generation of systems is becoming available, it is becoming increasingly important to develop efficient broadcasting schemes for these machines to support high-performance concurrent computing. In wormhole routing, the latency of a message with ....
Intel Corporation. Paragon XP/S Product Overview, 1991.
....disadvantage of traditional network interface is that message passing costs are usually thousands of CPU cycles. One solution to the problem of software overhead is to add a separate processor on every node just for message passing [12, 8]. Recent examples of this approach are the Intel Paragon [9] and Meiko CS-2 [7]. The basic idea is for the compute processor to communicate with the message processor through either mailboxes in shared memory or closely-coupled datapaths. The compute and message processors can then work in parallel, to overlap communication and computation. In addition, ....
Intel Corporation. Paragon XP/S Product Overview, 1991.
....number of application-accessible interfaces that do provide protection: 6.4.1 SHRIMP The SHRIMP II memory-mapped network interface [Blumrich94] was developed at Princeton as part of the SHRIMP project to build a multicomputer. The interface connects Pentium PCs to an Intel Paragon interconnect [Intel91]. Communication between two hosts is connection-oriented, in that two processes on different hosts must first agree to create a mapping between an area of memory they each own. The memory areas do not need to be at the same physical or virtual addresses on the two hosts, but must be of the same ....
Intel Corporation. Paragon XP/S Product Overview, 1991. (p 114)
....processors. Therefore, several recent MIMD architectures explicitly designed to support data-parallel languages (HPF and C*, usually) have been equipped with specific hardware mechanisms to optimize this function. This is in particular the case with the CM-5 [33] of Thinking Machines and the Paragon [11, 21] of Intel. MIMD architectures with an optimized global synchronization hardware facility are sometimes (somewhat improperly, see below) called SPMD (Single Program Multiple Data), as they are specifically designed to execute data-parallel languages efficiently. A better nickname would probably be ....
Intel Corporation, Beaverton, OR. Paragon XP/S Product Overview, 1991.
....it requires all shared, non-home pages to use a write-through caching strategy, while the HLRC protocol can employ a write-back caching strategy for all pages. 2 The SHRIMP System The SHRIMP multicomputer system [6] consists of 16 Pentium PC nodes connected by an Intel Paragon routing network [27, 17]. Each PC uses an Intel Pentium Xpress motherboard [18] that holds a 66 MHz Pentium CPU, 256 Kbytes of L2 cache, and 40 Mbytes of DRAM memory. Peripherals are connected to the system through the EISA expansion bus [2]. Main memory data can be cached by the CPU as write-through or write-back on a ....
Intel Corporation. Paragon XP/S Product Overview, 1991.
....restricted to network interfaces for multicomputers. An increasingly common multicomputer approach to the problem of user-level transfer initiation is the addition of a separate processor to every node for message passing [16, 10]. Recent examples are the Stanford FLASH [14], Intel Paragon [11], and Meiko CS-2 [9]. The basic idea is for the compute processor to communicate with the message processor through either mailboxes in shared memory or closely coupled datapaths. The compute and message processors can then work in parallel, to overlap communication and computation. In ....
Intel Corporation. Paragon XP/S Product Overview, 1991.
....necessary here. Specific details of the architecture and implementation will be described more thoroughly throughout this paper. 2.1 Architecture The SHRIMP system consists of sixteen PC nodes connected by an Intel routing backplane, which is the same as that used for the Paragon multicomputer [27]. The backplane is organized as a two-dimensional mesh, and supports oblivious, wormhole routing with a maximum link bandwidth of 200 Mbytes/second [43]. The right-hand photograph in Figure 1 shows the basic interconnection between the nodes and the backplane. The backplane is actually relatively ....
Intel Corporation. Paragon XP/S Product Overview, 1991.
....costs are usually thousands of CPU cycles, with the best implementation [32] still requiring over 100 CPU cycles. One solution to the problem of software overhead is to add a separate processor on every node just for message passing [22, 13]. Recent examples of this approach are the Intel Paragon [14] and Meiko CS-2 [12]. The basic idea is for the compute processor to communicate with the message processor through either mailboxes in shared memory or closely coupled datapaths. The compute and message processors can then work in parallel, to overlap communication and computation. In ....
Intel Corporation. Paragon XP/S Product Overview, 1991.
No context found.
Intel Corporation, Paragon XP/S Product Overview, 1991.