G. Bell. Ultracomputers: A teraflop before its time. Comm. of the ACM, 35(8):26--47, August 1992.
....packet, as long as the multicast set is physically contiguous. For a multicast packet to be successfully delivered, a positive acknowledgment must be received from all the recipients of the multicast group. The Elite switches combine the acknowledgments, as pioneered by the NYU Ultracomputer [4] [24], returning a single one to the source. Acknowledgments are combined so that the worst ack wins (a network error wins over an unsuccessful transaction, which in turn wins over a successful one), returning a positive ack only when all the partners in the collective communication ....
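The "worst ack wins" rule quoted above can be illustrated with a minimal Python sketch. The names Ack and combine_acks are hypothetical and are not taken from the Elite switch or the cited paper; the sketch assumes only the three outcomes and their ordering as described in the excerpt.

```python
from enum import IntEnum


class Ack(IntEnum):
    """Acknowledgment outcomes, ordered from best to worst (assumed encoding)."""
    OK = 0             # successful transaction
    FAILED = 1         # unsuccessful transaction
    NETWORK_ERROR = 2  # network error, the worst outcome


def combine_acks(acks):
    """Combine per-recipient acks into one: the worst ack wins."""
    acks = list(acks)
    if not acks:
        raise ValueError("a multicast group needs at least one recipient")
    # max() picks the highest-valued member, i.e. the worst outcome.
    return max(acks)


if __name__ == "__main__":
    print(combine_acks([Ack.OK, Ack.OK, Ack.OK]).name)
    print(combine_acks([Ack.OK, Ack.FAILED, Ack.NETWORK_ERROR]).name)
```

Under this encoding the source sees a positive ack only when every recipient reported success, which matches the behaviour the excerpt describes.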
G. Bell. Ultracomputers: a Teraflop before its time. Communications of the ACM, 35(8):27--47, 1992.
....mid 1970s was capable of 130 Mflops (million floating point operations per second). The eight-processor Cray Y-MP 8 of the late 1980s was already capable of 2.8 Gflops peak performance [58, 37]. And now the race is on to increase both memories and speeds of computers to the teraflop (Tflops) level [12]. This is not likely to be solved without highly parallel computers. This development is driven not least by the wish to solve very large scientific and engineering problems in order to enable technological progress. Some of these grand challenges of the 1990s are, for instance, more precise ....
G. Bell. Ultracomputers -- a teraflop before its time. Comm. ACM, 35:27--47, 1992.
....the scalability of the algorithm on the target machine. The practicality of using computational models has led to the development and use of many different models for algorithm design. PRAM models have been used extensively to simulate shared memory machines [6, 9]. Many people, including Bell [2] and Dongarra, Duff, Sorensen, and van der Vorst [5], have noted that the trend in parallel computing is to build distributed memory machines which emulate shared memory operation. Although these machines may look like shared memory machines to the user, Cypher and Sanz [4] note that they do not ....
Bell, G. Ultracomputers: a teraflop before its time. Communications of the ACM 35, 8 (August 1992), 27--47.
....want to operate on an object simultaneously. The solution to consistency problems must be chosen very carefully for highly concurrent objects where the number of such processes can be on the order of a thousand or more. With major computer manufacturers soon to deliver massively parallel computers [Cahur92] [Bell92], such questions concerning concurrent objects become very important. The outline of the paper is this: in section 2 I will discuss the conventional approach to prevent concurrent object operations from mutually interfering and the drawbacks of the approach. In section 3 I will outline a ....
G. Bell, "Ultracomputers: a Teraflop Before its Time," Communications of the ACM, Volume 35, Number 8, August 1992, pp. 26 - 47.
....magnitude and diversity would require a general, cost effective, scalable, yet powerful computing model which will be able to efficiently support its varied computational and communication requirements. It is this realization that has spurred intense research in heterogeneous computing environments [2, 3, 4, 5, 6, 1, 7, 8]. We believe that the future of parallel computing lies in the integration of the plethora of specialized architectures into a single Heterogeneous High Performance Computing (HHPC) environment that allows them to cooperate in solving complex problems (Figure 1). The HHPC environment will ....
....Software development in any Parallel Distributed environment is a non trivial process and requires a thorough understanding of the application and the architecture. This is apparent from the fact that applications are currently able to achieve only a fraction of peak available performance [7, 1]. The percentage of the peak performance achieved by standard parallel benchmarks on current parallel distributed systems is ....
Gordon Bell, "Ultracomputers: A Teraflop Before Its Time", Communications of the ACM, vol. 35, pp. 27--47, Aug. 1992.
....pins in the processor. However, cache bus connected designs are difficult to implement and offer very limited support for message passing primitives. As a result, very few network interfaces connect through the cache bus. One notable example is the architecture of the Kendall Square Research KSR 1 [33]. The design is based on a principle called COMA, meaning Cache Only Memory Access (also known as ALLCACHE(TM)). In essence, all memory is cache and there is no main memory per se. Memory bus connected network interfaces. In most scalable parallel computers the network interface is located at ....
....memory model is popular for scientific computing systems. There are many examples of multicomputers that use this communication model, including Cray Research T3D and T3E [19, 20, 21], Fujitsu AP1000 [48, 49], Stanford DASH [45], Tera MTA 1 [46], and Kendall Square Research KSR 1 and KSR 2 [33]. Under the remote memory communication model (also known as remote load store, put get, shared memory or non uniform memory access), processors access remote memory locations directly using load and store operations. Remote memory is actually two communication models: a remote load model and a ....
Gordon Bell. Ultracomputers: a Teraflop Before Its Time. Communications of the ACM 35(8), August 1992, pp. 26-47.
....performance and a familiar programming environment. Unfortunately, it is difficult to achieve both of these objectives simultaneously. Shared memory machines with a single global memory, often known as Uniform Memory Access (UMA) multiprocessors, have failed to scale beyond a few dozen processors [1]. Newer machines, such as the KSR1, have a global logical memory space, but the memory modules are physically distributed around the system. As a result, the time to access the memory [footnote: This material is based in part upon work supported by the Texas Advanced Research Program under Grant No. ....]
....[footnotes: supported by the NSF Grand Challenge Grant No. ASC 9217374. ALLCACHE is a trademark of Kendall Square Research Corporation.] depends on the position of the memory module and the accessing processor, creating what is called a Non-Uniform Memory Access (NUMA) multiprocessor [2]. In a 1992 article [1], Gordon Bell stated: KSR machine is most likely the blueprint for future scalable massively parallel computers. The logical gap between the decentralized hardware and the shared memory programming paradigm is solved through the ALLCACHE(TM) memory scheme which dynamically binds ....
G. Bell, "Ultracomputers: a Teraflop Before Its Time," Communications of the ACM, Vol. 35, No. 8, August 1992.
....Distributed Programming Paradigms. A distributed system [38] [15] is a group of computers cooperating with each other to achieve some goal. These computers are autonomous, in that each computer has an independent flow of control. We assume there is no physical sharing of memory among computers [21] [3]. Processes running on different computers have distinct address spaces. They communicate by sending and receiving data encapsulated as messages. Message passing primitives are then used by applications to communicate with cooperating computers. One necessary characteristic of cooperation is some ....
Gordon Bell. Ultracomputers: A teraflop before its time. Communications of the ACM, 35(8):27--47, August 1992.
....of the local caches. In the context of parallel multiprocessors, this setting is referred to as Cache Only Memory Architecture (COMA) [HALH91], and it has been employed in a number of recently developed parallel machines such as the Data Diffusion Machine [HALH91] and in the new KSR1 machine [Bel92], where it is referred to as AllCache Engine. This setting also corresponds to that of a homogeneous distributed file server, comprised of a collection of diskless workstations. Previous results. Some systems work in a related setting has been reported in [LH86]. In the theory community, this ....
Gordon Bell. Ultracomputers: A Teraflop before its time. Comm. of the ACM, 35(8), 1992.
.... concepts of scalable computer and computing continuum are introduced and they are illustrated with the diagram of Encore Multimax and Encore Computing continuum as defined by [9]. A good tutorial providing further readings about high performance computer systems and scalable machines is provided in [10]. Chapter 3 Process and resource representation. 3.1 Processes and resources. In order to organize the collection of programs of a computer system as a software system, we first review the hardware system from the viewpoint of the services it provides to ....
....of the active process. An example is provided by the simulation of the context switch operation for the Mix machine. This simulation is expressed by the C function called SwitchMixContext which in turn uses the generic function SwitchContext. SwitchMixContext() { int i; struct MixPCB RPCB, PCB[10], ToPCB; struct MixProcess MixPDS; struct MixProcess MixHead, MixTail; MixPDS = MakeList(MixHead, MixTail); for (i = 0; i < 10; i++) /* Create 10 MixPCBs */ { ToPCB = MakeProc(PCB[i]); Append(MixPDS, ToPCB); } RPCB = init(MixPDS); /* Initialize RPCB */ for (i = 0; i < 100; i++) ....
C. Gordon Bell. Ultracomputers: a teraflop before its time. Communications of the ACM, 35(8):27--47, August 1992.
.... massively parallel processors, machine performance has gone from megaflops (millions of floating point operations per second) to gigaflops (billions of floating point operations per second) and is heading towards the ultracomputing goal of teraflops (trillions of floating point operations per second) [7]. Supercomputers have major differences in architecture. However, each compiler uses some variant of Fortran 90 [71, 72] so that many code optimizations are portable from one machine to the next. Vectorization can be viewed as a basic form of parallelism implemented by pipelining and so shares ....
....thrust in the future will be implementation on a wide range of architectures to maintain portability and avoidance of over reliance on machines currently under development that will not survive the high performance computing environment. Getting access to the current generation of ultracomputers [7], such as the Cray C90, CM 5 and Intel Paragon, is essential for solving large scale computing problems. The largest problem that we have computed is 6 states with 16 nodes per state, using about 60MW double precision memory with a total of 1M nodes (i.e. one million discrete states). A dedicated ....
G. Bell, "Ultracomputers: A Teraflop Before Its Time," Communications of the ACM 35 (8), pp. 26-47 (1992).
....Section 3 describes data locality and why it is an important factor that cannot be neglected in loop scheduling algorithms. In particular, we argue that data locality is important even in systems with hardware-based cache coherence and in cache only memory architectures (COMA) such as the KSR [1]. The locality based dynamic scheduling (LDS) algorithm we propose in this paper is described in Section 4, and compared against the affinity scheduling algorithm developed at the University of Rochester, the only other loop scheduling algorithm we are aware of that also takes memory access ....
Gordon Bell. Ultracomputers: A teraflop before its time. CACM, 35(8):27--47, August 1992.
....communicate by passing messages explicitly. For this reason, private memory multiprocessors are commonly known as message passing architectures. Some authors also reserve the name multiprocessor for shared address space MIMD computers, and use the term multicomputer for private memory computers [9]. It is worth noting that shared memory and message passing are architectural features because, following the definition by Lorin [10], they define a machine's programming model and rules for program correctness. Shared memory and message passing can also be organizational features, that is, ....
....and obtain the address trace. We then run the address trace through the network simulation and obtain the following results: Latency[0] = 0; Latency[1] = 0; Latency[2] = 0; Latency[3] = 0; Latency[4] = 0; Latency[5] = 0; Latency[6] = 0; Latency[7] = 100215; Latency[8] = 103453; Latency[9] = 70088; Latency[10] = 41828; Latency[11] = 23233; Latency[12] = 11318; Latency[13] = 5380; Latency[14] = 2374; Latency[15] = 906; Latency[16] = 396; Latency[17] = 86; Latency[18] = 46; Latency[19] = 3673; Total = 362996; Average = 8.718440. We note that the ....
G. Bell, "Ultracomputers: A teraflop before its time," Communications of the ACM, vol. 35, August 1992.
....complexity (such as a low node degree, thus low cost and ease of implementation), relatively small diameter for such a large number of PNs, a high degree of scalability and expandability and, most importantly, efficient support for both local and remote communications. Recent studies [9, 10] have shown that efficient implementation of local communications (spatial locality) is a fundamental requirement for interconnection networks since PNs engage in data transfers more frequently with nearby neighbors than with more distant PNs. We should note that the diameter of a network remains ....
....considered. System performance is analyzed in Chapter 8 in terms of scalability, message delay and throughput, node complexity, OPB and BER. Chapter 9 discusses dynamic channel allocation (DCA) and the conclusions are presented in Chapter 10. CHAPTER 3 GENERAL DESCRIPTION. It has been shown [3, 9, 10, 28] that a PN engages in data transfer more frequently with nearby neighbors (local communication) than with more distant nodes (remote communications). In many applications, nearby neighbors are the only destinations for interprocessor communications. In image processing, for example, ....
G. Bell, "Ultracomputers: A Teraflop Before Its Time," Communications of the ACM, vol. 35, pp. 27--47, August 1992.
....routing, diameter, link complexity, fault tolerance, and an example of a multiple access protocol. 2.1 Definition of HORN. It has been shown that a PE engages in data transfer more frequently with nearby neighbors (local communication) than with more distant nodes (remote communications) [2, 18]. Therefore, the interconnection topology must be designed so that it can efficiently support local data transfers (spatial locality). This emphasis has led us to consider a hierarchical interconnection network topology in which the lower level network supports local communications very ....
Gordon Bell, "Ultracomputers: A teraflop before its time." Communications of the ACM, v. 35, n. 8, pp. 27-47, August 1992.
.... per second) supercomputers combined with the launching of the High Performance Computing and Communication (HPCC) initiative is putting major emphasis on exploiting massive parallelism with greater than one thousand processing elements networked to form massively parallel computers (Ultracomputers) [1, 2]. A key element, and deciding factor in terms of performance and cost of these computers, is the interconnection network [3]. The interconnection network for massively parallel computers must not only be adequate in terms of communication bandwidth, latency, and connectivity but it must also be ....
....network for massively parallel computers must not only be adequate in terms of communication bandwidth, latency, and connectivity but it must also be modular and scalable. Scalability of a network consists of two aspects: size scalability and generation scalability (or time scalability) [2]. Size scalability refers to the property that the size of the network (e.g. the number of communicating nodes) can be increased with nominal change in the existing configuration. Also, the increase in system size is expected to result in an increase in performance comparable to the increasing ....
G. Bell, "Ultracomputers: A Teraflop Before Its Time," Communications of the ACM, vol. 35, pp. 27--47, Aug 1992.
....mesh network also suffers from a major limitation which is its large diameter (N^{1/2} for an N node network) along with its limited connectivity. The recent quest for massively parallel computing systems is placing a major emphasis on scalable networks with small diameters and bounded node degrees [14]. As an alternative to the hypercube and the mesh topologies, the de Bruijn topology [15, 16] has recently been receiving much attention. Its properties and applications have been studied by several researchers [2, 17, 18, 19, 20]. Its topological properties show that the de Bruijn network is a good ....
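As a rough illustration of the diameter argument in the excerpt above, the following Python sketch compares a square 2-D mesh without wraparound links (diameter 2(sqrt(N) - 1), which grows as N^{1/2}) with a binary de Bruijn network of N = 2^k nodes (diameter k = log2 N). The function names are hypothetical, and the choice of a square mesh and a binary de Bruijn graph is an assumption made for the example, not something stated in the cited paper.

```python
import math


def mesh_diameter(n_nodes):
    """Diameter of a square 2-D mesh (no wraparound) with n_nodes nodes."""
    side = math.isqrt(n_nodes)
    if side * side != n_nodes:
        raise ValueError("n_nodes must be a perfect square")
    # Longest shortest path runs corner to corner: (side-1) hops per dimension.
    return 2 * (side - 1)


def de_bruijn_diameter(n_nodes):
    """Diameter of a binary de Bruijn network with n_nodes = 2**k nodes."""
    k = n_nodes.bit_length() - 1
    if n_nodes < 2 or (1 << k) != n_nodes:
        raise ValueError("n_nodes must be a power of two, at least 2")
    # Any k-bit label can be shifted into any other in at most k steps.
    return k


if __name__ == "__main__":
    for n in (64, 1024, 4096):
        print(n, mesh_diameter(n), de_bruijn_diameter(n))
```

For 1024 nodes this gives a mesh diameter of 62 hops against 10 for the de Bruijn network, while both keep a bounded node degree.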
G. Bell, "Ultracomputers: A Teraflop Before Its Time," Communications of the ACM, vol. 35, pp. 27--47, Aug 1992.
....amount of data at each processor (and thus a negligible memory hierarchy at each processor), such applications are probably closer to the exception than the rule. The coming generation of tera computers can be expected to consist of thousands of processors, each with its own multi gigabyte storage [6]. Thus, an extension of sequential multi level storage models to the parallel domain would seem to be well motivated. In fact, Vitter and Shriver [20], Nodine and Vitter [17], and Vitter and Nodine [19] have proposed just such a series of extensions, and have examined the complexity of sorting in ....
G. Bell. Ultracomputers: A teraflop before its time. Communications of the ACM, 35(8):26--47, 1992.
.... but channel reuse [7, 2, 13, 14] across spatially separated structures creates a system in which many times that limit is possible, permitting optically connected parallel processing systems with perhaps thousands of processing nodes [1]. This approach is also supported by the conclusion by Bell [15], Dandamudi [16] and Goodman [17] that PNs engage in data transfers more frequently with nearby neighbors than with more distant ones. One potential hierarchy presented previously in the literature, Hierarchical Optical Ring Interconnection Network (HORN) [1], uses a ring of rings topology in ....
G. Bell, "Ultracomputers: A teraflop before its time," Communications of the ACM, vol. 35, pp. 27--47, August 1992.
....dynamic self-scheduling by a factor of up to 2.3 on 30 processors. 1 Introduction. Scalable shared memory multiprocessors (SSMMs) have become increasingly viable as platforms for high-performance computing by efficiently supporting coherent shared memory in hardware for large numbers of processors [2]. Examples include the Convex SPP1000 [3], the Stanford FLASH [5], and the University of Toronto NUMAchine [12]. Although the memory in SSMMs is logically shared by all processors, it is physically distributed to provide scalability, as shown in Figure 1. As a result, memory accesses are ....
G. Bell. Ultracomputers: A teraflop before its time. Comm. of the ACM, 35(8):26--47, August 1992.
....speed of their processors make clusters of these machines an attractive alternative to vector or parallel computers for high performance processing. Workstations with computing speeds greater than 10 Mflops are commonplace, and workstation speeds could reach 500 Mflops within the next few years [2]. However, workstation clusters pose additional challenges that must be met before network parallel processing becomes practical. The communication costs associated with typical interconnection networks and protocols can negate a substantial portion of the peak theoretical combined speed of the ....
....each physical processor its minimum and maximum i, j, k values. By minimum, we mean the least i associated with the least j that is associated with the least k (i.e. [5][8][12] is less than [1][2][13] and [1][9][12]). Maximum is defined similarly. [Figure 6: An example of Contiguous Point decomposition with three heterogeneous processors (labels: P0, P2, P1).] Again, binary search may be used, at a time cost of O(log p). Because of our assumption that each processor is allocated at least n 2 data points, there will be at most two list entries with the same k values. Further discrimination between two ....
Gordon Bell. Ultracomputers: A teraflop before its time. Communications of the ACM, 35(8):26--47, August 1992.
....rating for this particular problem. At small problem sizes, efficiency is low, while large sizes exhibit close to 100% efficiency (80 Mflops). It has been speculated that workstation networks will become an important parallel processing resource in business and industry in the next few years [2]. Parallel programming on a network of workstations is not necessarily an academic exercise. Processor speeds of workstations could reach 500 Mflops within the next few years and improvements in the speed of the interconnection networks, such as the 100 Mbps FDDI and 100 MBps HIPPI (HIgh ....
.... of workstations could reach 500 Mflops within the next few years and improvements in the speed of the interconnection networks, such as the 100 Mbps FDDI and 100 MBps HIPPI (HIgh Performance Parallel Interface) could make the workstation model of parallel computation even more attractive [2]. Acknowledgments. This work was supported by IBM and the Oregon Advanced Computing Institute. ....
Gordon Bell. Ultracomputers: A teraflop before its time. Communications of the ACM, 35(8):27--47, August 1992.
....make clusters of these machines a natural choice for many parallel processing tasks. It has been predicted that, as the speed of micro processor based workstations increases, the cost per floating point operation could far outstrip the cost effectiveness of machines in the supercomputer class [1]. Unfortunately, the computation speed offered by workstations may be negated by the additional time needed for communication between the processors. Data must usually be exchanged during computation and results must be consolidated and returned at the end of each iteration. The speed of LAN or ....
G. Bell. Ultracomputers: A teraflop before its time. Communications of the ACM, 35(8):26--47, August 1992.
....collection of two or more workstations, possibly with different computational speeds, connected by a LAN. With workstations exhibiting sustained speeds of more than 10 Mflops, and with future performance predicted at 500 Mflops, using these machines for high performance processing has great appeal [3]. Parallel processing on a workstation network has much in common with the loosely coupled multicomputer model of computation. Both systems contain multiple processors that must communicate by message passing. This method of interprocessor communication imposes overhead costs that are system ....
Gordon Bell. Ultracomputers: A teraflop before its time. Communications of the ACM, 35(8):26--47, August 1992.
....of around 100 Mbps. Even faster connections such as HiPPI (high performance parallel interface) are more rare, but yield bandwidths in the gigabit range. Workstation computing speeds upwards of 10 Mflops are common, and it has been projected that speeds may reach 500 Mflops in the next few years [2]. While the workstation network may be viewed as a loosely coupled multicomputer suitable for MIMD (multiple instruction, multiple data) processing, the problems we consider in this paper are data parallel in nature. Data parallelism is a model of computation that achieves parallelism through ....
Gordon Bell. Ultracomputers: A teraflop before its time. Communications of the ACM, 35(8):26--47, August 1992.
....to the analytical result, CSVM reduces the message traffic of a DSM system with a global address space very effectively. Keywords: Shared virtual memory, Distributed shared memory, Memory management, Parallel computer architecture. 1 Introduction. Building a teraflop supercomputer is an ongoing challenge [1]. With technological advances pushed to physical limits, architecture and parallelism are the most promising alternatives to pursue. A large part of most programs consists of subprocesses which either do the same work over different data or are mutually independent, so that concurrent execution ....
G. Bell, "Ultracomputers: A Teraflop Before Its Time," Commun. ACM, 35(8), pp. 27-47, 1992.
....natural proof theoretic techniques [Owi75, OG76]. Many experimental and commercial processors provide direct support for this abstraction: indeed, Gordon Bell has predicted that . the mainline, general purpose computer is almost certain to be the shared memory multiprocessor after 1995 [Bel92]. Increasing attention is being paid to implementing shared memory systems either in hardware or in software [Bel92, CG89, LH89, TKB92]. This paper investigates fault tolerance in shared memory systems, with an emphasis on benign, or constrained, fault models. Benign fault models are easier to ....
....support for this abstraction: indeed, Gordon Bell has predicted that . the mainline, general purpose computer is almost certain to be the shared memory multiprocessor after 1995 [Bel92]. Increasing attention is being paid to implementing shared memory systems either in hardware or in software [Bel92, CG89, LH89, TKB92]. This paper investigates fault tolerance in shared memory systems, with an emphasis on benign, or constrained, fault models. Benign fault models are easier to program than are more malicious models. Just as shared memory is itself an abstraction of much more complex, timing dependent ....
G. Bell. Ultracomputers: A teraflop before its time. Communications of the ACM, 35(8):27--47, August 1992.
..... 162 FIGURE 97 Sum of Attenuated Emittances Parallel Calculation . 162 FIGURE 98 Compositing Methods . viii TABLE 1 Cost Performance Data for Peak Performance [ZORP92][CYBE92][BELL92] . 11 TABLE 2 2D Interpolation error and resolution error for separable interpolation functions (Reproduced from [PRAT78]) . 41 TABLE 3 Sequential algorithm alternatives . ....
....But, the cost performance improvement is for tuned applications. Because of overhead (synchronization, communication, and scheduling) the peak performance is very difficult to attain, and 1 to 25% of peak is typical [CYBE92]. Today's cost performance ratio is in flux. FIGURE 3 and TABLE 1 [BELL92][CYBE92][ZORP92] show that the cost to peak performance ratio is not monotonic. [FIGURE 3: Cost Performance Comparison (axis: Mega Flops; series: Industry Data)] In FIGURE 4 the cost versus peak performance is shown for all of the systems in ....
G. Bell, "Ultracomputers: A Teraflop Before Its Time", Communications of the ACM, Vol 35, No. 8, Aug 1992, pp. 27-47.
.... in today's networks may lie idle for as much as 95% of the time [95]. It is therefore not surprising that workstation clusters with high speed interconnections are among the candidate computer architectures for achieving the teracomputing goal required to solve grand challenge applications [15]. This report represents a survey of emerging software technologies and philosophies related to workstation clusters. Many stable software products are already available to assist with improving the utilization of networked computer resources, both in environments where workstation ....
G. Bell. Ultracomputers: A Teraflop Before its Time. Communications of the ACM, 35(8):27--47, 1992.
....and multiprocessors operating out of coherent cache memory. Chapter 5 Organization Alternatives. 5.1 Scalability. Scalability refers to the potential for implementations of an architecture to yield improved performance with greater hardware resources or refined implementation technologies [Bell92]. Size scalability is rarely an important issue in a single chip processor because integration limits usually dominate the potential complexity. Generation scalability, however, is often more relevant, but difficult to predict in advance. After over a decade of experience with RISC architectures ....
G. Bell, "Ultracomputers: a teraflop before its time," Commun. ACM , vol. 35, no. 8, pp. 27--47, Aug. 1992.
.... to memory management is that it adapts quite naturally to a hybrid multiprocessor in which parts of the address space are shared among subsets of processors, as, for example, in a system containing multiple shared memory multiprocessors connected by a message passing or broadcast local network [4]. In this kind of system each shared memory multiprocessor would naturally be a candidate for constituting a team with its own memory address space, and the various teams would then be spread over the different multiprocessors and communicated by the message passing or broadcast local network. ....
Gordon Bell. Ultracomputers: a Teraflop Before its Time. Communications of the ACM, 35(8):26--47, 1992.
.... at different levels of granularity from fine grain (SIMD) to coarse grain (SPMD). Although many of the early parallel computers were designed to exploit fine grain data parallelism, the focus of more recent parallel computers has shifted to effectively harnessing coarse grain data parallelism [7]. Although parallel computers that can exploit massive amounts of data parallelism have been built, being able to rapidly develop portable and efficient parallel applications for such computers remains a significant impediment. Existing systems for data parallel programming can be broadly divided ....
G. Bell. Ultracomputers: A teraflop before its time. CACM, 35(8):27--47, August 1992.
....types of networks. 4 Conclusions. 4.1 Related Work. Besides Hector, several architectures based on slotted rings have been proposed, including the CDC CYBERPLUS [14], the Express Ring [5], the IEEE SCI (Scalable Coherent Interface) standard [17, 22], and the KSR 1 from Kendall Square Research [8, 6]. There have been performance studies of single slotted rings, but not of ring based hierarchies. Previous studies of other shared memory architectures with system models at the level of detail of our simulator have tended to only examine small systems (100 processors or less) [31, 9]. The ....
G. Bell. Ultracomputers: a teraflop before its time. Communications of the ACM, 35(8):27--47, August 1992.
....of platforms and configurations. 1.3 Workstation Clusters for Parallel Computing. A workstation cluster is a group of interconnected workstations that can be used as a single computing resource [48]. Several advantages of workstation clusters as a parallel processing platform are often cited [7, 11, 48]. Availability. If a node in a cluster fails, the work that it was doing can potentially be reassigned to the other nodes. Superior price performance ratio. Clusters generally consist of machines that have very good performance for their price. The price performance ratio of workstations ....
G. Bell. Ultracomputers: A Teraflop Before its Time. Communications of the ACM 35 (8), 26--47 (August 1992).
....of the local caches. In the context of parallel multiprocessors, this setting is referred to as Cache Only Memory Architecture (COMA) [HALH], and it has been employed in a number of recently developed parallel machines such as the Data Diffusion Machine [HALH] and in the new KSR1 machine [Bel], where it is referred to as AllCache Engine. This setting also corresponds to that of a homogeneous distributed file server, comprised of a collection of diskless workstations. Previous results. In the theory community, this problem was first tackled in [BFR], who defined this problem (the ....
....not in $V(p, t)$. These files were dumped into the processor $p$ by the global strategy. The real set may also contain some empty file slots, in place of missing virtual files. A file copy is an ordered pair $(F, p)$, where $F \in \mathcal{F}$ denotes the file and $p \in P$ denotes the processor. We associate labels, hot, cold, and dumped, with file copies; a file copy must have exactly one of these labels. The sets of hot, cold, and dumped files in processor $p$ at time $t$ are denoted $H(p, t)$, $C(p, t)$, and $D(p, t)$, respectively. $R(p, t) = H(p, t) \cup C(p, t) \cup D(p, t)$. A file is called non single if more than ....
Gordon Bell. Ultracomputers: A Teraflop before its time. Comm. of the ACM, 35(8), 1992.
....in processors technology, low overhead switches, faster communication channels, and rich interconnection network topologies. With the advent of the High Performance Computing and Communication initiative and the quest for achieving a teraflop computing speed, this trend is likely to continue [12]. However, the advances in software technology for parallel computers have been outpaced by the spectacular progress in hardware. Consequently, programming of parallel computers still remains a tedious task. Parallel computing is not likely to be of common use until considerable efforts are ....
G. Bell, "Ultracomputers - A Teraflop Before its Time," Communications of the ACM, pp. 27-47, Vol. 35, No. 8, August 1992.
....of Megaflops of computing power. A cluster of 1024 DEC Alpha workstations would provide a combined computing power of 150 Gigaflops while the same size configuration (1024 nodes, control computer, I/O computers) of the CM5 from Thinking Machines Inc. has a peak rating of only 128 Gigaflops [1]. A recent report from the IBM European Center for Scientific and Engineering Computing [2] stated that a cluster of 8 RX 6000 Model 560 workstations connected with IBM serial optical channel converter, achieved a performance of 0.52 Gigaflops for the Dongarra benchmarks for massively parallel ....
Gordon Bell, "Ultracomputers: A Teraflop Before Its Time", Communications of the ACM, vol. 35, pp. 27--47, Aug. 1992.
....The ATOMIC local area network [8] is an early example of an embedded system application of Mosaic components. 1 Multicomputer Scaling Tracks Multicomputers have already established themselves as a highly scalable, high performance form of multiple instruction multiple data (MIMD) architecture [1, 3, 4, 15, 16, 17]. As shown in Figure 1, a multicomputer consists of an ensemble of computing nodes connected by a message passing network. Each node of a multicomputer is a computer, including read write memory, read only memory for initialization and bootstrap, one or more instruction-interpreting processors, and ....
Gordon Bell. Ultracomputers: A Teraflop Before Its Time. CACM, 35(8): 26--47, August 1992.
....showing high parallel efficiency and low overhead for parallelization, have been obtained on a 24 processor shared memory multiprocessor. 6.1 INTRODUCTION The Single Program Multiple Data (SPMD) model of parallel computation has recently received a lot of attention (see e.g. the article by Bell [16]). The model is characterized by the feature of each parallel process running the same program but with different data. The attraction of this model is that it does not require a dynamic network of parallel processes: this facilitates efficient implementation and makes the parallel ....
G. Bell, Ultracomputers: a Teraflop before its time, in Communications of the ACM, Vol. 35, No. 8, 1992.
....of the ACM. Copyright may be transferred without further notice and the publisher may then post the accepted version. A version of this article appears at http: research.microsoft.com pubs 2 High Performance Computing: Crays, Clusters, and Centers. What Next Gordon Bell and Jim Gray GBell, Gray Microsoft.com Bay Area Research Center Microsoft Research August 2001 Abstract : After 50 years of building high performance scientific computers, two major architectures exist: 1) clusters of Cray style vector supercomputers; 2) clusters of scalar uni and multi processors. ....
Bell, G., "Ultracomputers: A Teraflop Before Its Time", Communications of the ACM, Vol. 35, No. 8, August 1992, pp 27-45.
....system functions and hardware remove from the user the burden of coordinating the accesses to the shared data. One such system has been proposed, implemented and analyzed in [LH89]. The KSR, DASH, and Alewife machines are a few examples of machines that use such distributed shared memory [Bel92]. Another, more software oriented, approach is implemented in the Linda system [CG89]. In Linda, an abstract tuple space is shared (instead of cache lines or pages) and operations are available to insert and delete tuples. Obviously, the choice of one of these methods of implementing sharing ....
G. Bell. Ultracomputers: A teraflop before its time. CACM, 35(8):27--47, August 1992.
No context found.
G. Bell. Ultracomputers: A teraflop before its time. Comm. of the ACM, 35(8):26--47, August 1992.
No context found.
G. Bell. Ultracomputers: A teraflop before its time. Communications of the ACM 35(8): 27--47, 1992
No context found.
G. Bell, `Ultracomputers, a teraflop before its time', Communications of the ACM, 35, (8), 27--47 (1992).
No context found.
G. Bell. Ultracomputers: A teraflop before its time. Communications of the ACM, 35(8):26--47, 1992.
No context found.
G. Bell, "Ultracomputers : A teraflop before its time," Communications of the ACM, vol. 35, no. 8, pp. 26--47, 1992.
No context found.
G. Bell, "Ultracomputers: A Teraflop Before Its Time." Comm. of the ACM, Vol. 35, No. 8, pages 27-47, Aug. 1992.
No context found.
G. Bell. Ultracomputers: A teraflop before its time. Communications of the ACM, 35(8):26--47, 1992.
No context found.
Gordon Bell. Ultracomputers: A Teraflop Before Its Time. In Communications of the ACM, 35(8): 26-47, August 1992.
No context found.
G. Bell. Ultracomputers: A teraflop before its time. Communications of the ACM, 35(8):26--47, 1992.