Synchronization and Communication in the T3E Multiprocessor (1996)

by S. L. Scott
Venue: Proc. of ASPLOS-VII

Results 1 - 10 of 145

The Landscape of Parallel Computing Research: A View from Berkeley

by Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, Katherine A. Yelick - TECHNICAL REPORT, UC BERKELEY , 2006
Abstract - Cited by 487 (25 self)
Abstract not found

Titanium: A High-Performance Java Dialect

by Kathy Yelick, Luigi Semenzato, Geoff Pike, Carleton Miyamoto, Ben Liblit, Arvind Krishnamurthy, Paul Hilfinger, Susan Graham, David Gay, Phil Colella, Alex Aiken - In ACM , 1998
Abstract - Cited by 268 (30 self)
Titanium is a language and system for high-performance parallel scientific computing. Titanium uses Java as its base, thereby leveraging the advantages of that language and allowing us to focus …

Citation Context

... MPI, however, has a lower raw performance than a global address space. On a Cray T3E, MPI achieves a bandwidth of about 120 MB/sec, while a global address space achieves a bandwidth of about 330 MB/sec [10]. Furthermore, with a global address space the compiler can optimize remote accesses with the same techniques used for the local memory hierarchy. FIDIL. The multidimensional array support in Titanium...
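
The bandwidth gap in the excerpt above comes down to one-sided remote memory access versus two-sided message passing. As a rough illustration of the programming-model difference (OpenSHMEM-style names, with an arbitrary buffer size and peer choice; this is background, not code from the cited papers):

    /* One-sided put into a peer's symmetric buffer: no matching receive and no
       rendezvous or extra copy on the remote side, which is the property that
       lets a global address space sustain higher effective bandwidth. */
    #include <shmem.h>

    #define N 1024
    static long buf[N];                        /* symmetric: remotely addressable */

    int main(void) {
        shmem_init();
        int me   = shmem_my_pe();
        int peer = (me + 1) % shmem_n_pes();

        long local[N];
        for (int i = 0; i < N; i++) local[i] = me;

        shmem_long_put(buf, local, N, peer);   /* write directly into peer's buf    */
        shmem_barrier_all();                   /* complete puts before anyone reads */

        shmem_finalize();
        return 0;
    }

An MPI version of the same transfer needs a matched MPI_Send/MPI_Recv pair, so both processors spend CPU time on per-message protocol work, which is consistent with the 120 vs. 330 MB/sec figures quoted above.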

The Cray T3E Network: Adaptive Routing in a High Performance 3D Torus

by Steven L. Scott, et al. , 1996
Abstract - Cited by 149 (7 self)
This paper describes the interconnection network used in the Cray T3E multiprocessor. The network is a bidirectional 3D torus with fully adaptive routing, optimized virtual channel assignments, integrated barrier synchronization support and considerable fault tolerance. The routers are built with LSI’s 500K ASIC technology with custom transmitters/receivers driving low-voltage differential signals at 375 MHz, for a link data payload capacity of approximately 500 MB/s.

Effects of communication latency, overhead, and bandwidth in a cluster architecture

by Richard P. Martin, Amin M. Vahdat, David E. Culler, Thomas E. Anderson - In Proceedings of the 24th Annual International Symposium on Computer Architecture , 1997
Abstract - Cited by 108 (6 self)
This work provides a systematic study of the impact of communication performance on parallel applications in a high performance network of workstations. We develop an experimental system in which the communication latency, overhead, and bandwidth can be independently varied to observe the effects on a wide range of applications. Our results indicate that current efforts to improve cluster communication performance to that of tightly integrated parallel machines result in significantly improved application performance. We show that applications demonstrate strong sensitivity to overhead, slowing down by a factor of 60 on 32 processors when overhead is increased from 3 to 103 µs. Applications in this study are also sensitive to per-message bandwidth, but are surprisingly tolerant of increased latency and lower per-byte bandwidth. Finally, most applications demonstrate a highly linear dependence on both overhead and per-message bandwidth, indicating that further improvements in communication performance will continue to improve application performance.
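
A back-of-the-envelope LogP-style accounting makes the overhead sensitivity plausible (the message count below is an illustrative assumption, not a figure from the paper). If each message charges both the sending and the receiving processor o seconds of CPU time, then raising overhead by Δo adds roughly

    T_{\text{added}} \approx 2\, m\, \Delta o

per processor for a program that sends and receives m messages. With m = 50,000 and Δo = 100 µs, that is about 10 s of pure overhead per processor, easily enough to swamp a short computation phase and produce slowdowns of the magnitude reported here.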

Citation Context

...focused on improving various aspects of communication performance. These investigations cover a vast spectrum of alternatives, ranging from integrating message transactions into the memory controller [5, 10, 29, 41] or the cache controller [1, 20, 32], to incorporating messaging deep into the processor [9, 11, 12, 17, 22, 23, 36, 40], integrating the network interface on the memory bus [7, 31], providing dedicat...

Vector Microprocessors.

by K. Asanović, 1998
Abstract - Cited by 88 (7 self)
Abstract not found

Citation Context

...ining high throughput. To address this problem, the Cray T3E MPP system, which maintains a global uncached shared memory, adds an external memory-mapped vector fetch engine to a scalar microprocessor [Sco96]. Vector memory instructions offer several advantages in dealing with accesses with little temporal locality. Vector memory instructions describe multiple independent memory requests and have weak int...

Implicit Coscheduling: Coordinated Scheduling with Implicit Information in Distributed Systems

by Andrea Carol Arpaci-Dusseau - ACM TRANSACTIONS ON COMPUTER SYSTEMS , 1998
Abstract - Cited by 54 (2 self)
In this thesis, we formalize the concept of an implicitly-controlled system, also referred to as an implicit system. In an implicit system, cooperating components do not explicitly contact other components for control or state information; instead, components infer remote state by observing naturally-occurring local events and their corresponding implicit information, i.e., information available outside of a defined interface. Many systems, particularly in distributed and networked environments, have leveraged implicit control to simplify the implementation of services with autonomous components. To concretely demonstrate the advantages of implicit control, we propose and implement implicit coscheduling, an algorithm for dynamically coordinating the time...

Citation Context

...large DRAM memory. In recent systems, these machines may be commodity workstations [4, 11, 19, 37, 75], commodity PCs [143, 154, 175], or commodity processors with special-purpose communication support [98, 90, 147]. Our current analysis of the performance of implicit coscheduling makes two assumptions about the machine architecture that are not necessarily true for all clusters. First, we assume a single proces...

LoPC: Modeling Contention in Parallel Algorithms

by Matthew I. Frank, Anant Agarwal, Mary K. Vernon , 1997
Abstract - Cited by 52 (8 self)
Parallel algorithm designers need computational models that take first-order system costs into account, but are also simple enough to use in practice. This paper introduces the LoPC model, which is inspired by the LogP model but accounts for contention for message processing resources in parallel algorithms on a multiprocessor or network of workstations. LoPC takes the L, o, and P parameters directly from the LogP model and uses them to predict the cost of contention, C.
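
For reference, the LogP parameters this abstract alludes to are L (network latency), o (per-message processor overhead), g (minimum gap between consecutive messages), and P (number of processors); this is standard LogP background rather than text from the paper. Under that model a single point-to-point message costs roughly

    t_{\text{msg}} \approx o_{\text{send}} + L + o_{\text{recv}} = 2o + L

and LoPC's addition is to derive, from L, o, and P, the extra waiting time C that messages incur when they contend for the same message-processing resource.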

Adaptive History-Based Memory Schedulers

by Ibrahim Hur, Calvin Lin
Abstract - Cited by 50 (2 self)
As memory performance becomes increasingly important to overall system performance, the need to carefully schedule memory operations also increases. This paper presents a new approach to memory scheduling that considers the history of recently scheduled operations. This history-based approach provides two conceptual advantages: (1) it allows the scheduler to better reason about the delays associated with its scheduling decisions, and (2) it allows the scheduler to select operations so that they match the program's mixture of Reads and Writes, thereby avoiding certain bottlenecks within the memory controller. We evaluate our solution using a cycle-accurate simulator for the recently announced IBM Power5. When compared with an in-order scheduler, our solution achieves IPC improvements of 10.9% on the NAS benchmarks and 63% on the data-intensive Stream benchmarks. Using microbenchmarks, we illustrate the growing importance of memory scheduling in the context of CMPs, hardware-controlled prefetching, and faster CPU speeds.
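
One way to picture the "match the program's mixture of Reads and Writes" idea is the toy selection rule below. It is a hedged sketch of the general concept, not the authors' scheduler; the history length, target ratio, and all names are invented for illustration.

    /* Toy history-based selection: track the read/write mix of the last
       HISTORY scheduled DRAM commands and pick the pending command whose
       type keeps that mix closest to the program's observed ratio. */
    #include <stdbool.h>
    #include <stddef.h>

    #define HISTORY 16

    typedef struct { bool is_read; /* bank, row, data ... omitted */ } cmd_t;

    static bool   recent[HISTORY];             /* circular history, true = read */
    static size_t head, count;
    static double target_read_frac = 0.67;     /* e.g. two reads per write      */

    static double read_frac_if(bool is_read) { /* mix if this command is issued */
        size_t reads = is_read ? 1 : 0;
        for (size_t i = 0; i < count; i++) reads += recent[i];
        return (double)reads / (double)(count + 1);
    }

    size_t pick_next(const cmd_t *cand, size_t n) {   /* index of best candidate */
        size_t best = 0;
        double best_err = 2.0;
        for (size_t i = 0; i < n; i++) {
            double err = read_frac_if(cand[i].is_read) - target_read_frac;
            if (err < 0) err = -err;
            if (err < best_err) { best_err = err; best = i; }
        }
        return best;
    }

    void record_issue(bool is_read) {          /* call after a command is scheduled */
        recent[head] = is_read;
        head = (head + 1) % HISTORY;
        if (count < HISTORY) count++;
    }

A real controller would combine a rule like this with bank-conflict and timing checks; the point is only that the history gives the arbiter something to reason about beyond the oldest pending request.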

Rigel: An architecture and scalable programming interface for a 1000-core accelerator

by John H. Kelm, Daniel R. Johnson, Matthew R. Johnson, Neal C. Crago, William Tuohy, Aqeel Mahesri, Steven S. Lumetta, Matthew I. Frank, Sanjay J. Patel - In ISCA ’09
Abstract - Cited by 44 (10 self)
This paper considers Rigel, a programmable accelerator architecture for a broad class of data- and task-parallel computation. Rigel comprises 1000+ hierarchically-organized cores that use a fine-grained, dynamically scheduled single-program, multiple-data (SPMD) execution model. Rigel’s low-level programming interface adopts a single global address space model where parallel work is expressed in a task-centric, bulk-synchronized manner using minimal hardware support. Compared to existing accelerators, which contain domain-specific hardware, specialized memories, and restrictive programming models, Rigel is more flexible and provides a straightforward target for a broader set of applications. We perform a design analysis of Rigel to quantify the compute density and power efficiency of our initial design. We find that Rigel can achieve a density of over 8 single-precision GFLOPS/mm² in 45nm, which is comparable to high-end GPUs scaled to 45nm. We perform experimental analysis on several applications ported to the Rigel low-level programming interface. We examine scalability issues related to work distribution, synchronization, and load-balancing for 1000-core accelerators using software techniques and minimal specialized hardware support. We find that while it is important to support fast task distribution and barrier operations, these operations can be implemented without specialized hardware using flexible hardware primitives.

Citation Context

...the generality of reduction-based computations. The implementation of barriers in particular has been accomplished with cache coherence mechanisms [18], explicit hardware support such as the Cray T3E [24], and more recently, a combination of the two on chip multiprocessors [23]. Using message passing networks to accelerate interprocess communication and synchronization was evaluated on the CM-5 [15]. ...
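
As a concrete example of what such a barrier looks like when done purely in software over a shared address space, here is a textbook sense-reversing centralized barrier sketched with C11 atomics; this is general background, not the Rigel or T3E mechanism.

    /* Sense-reversing centralized barrier (illustrative sketch).
       Each thread keeps a local sense flag, initially false, and passes it in. */
    #include <stdatomic.h>
    #include <stdbool.h>

    #define NTHREADS 64

    static atomic_int  waiting = NTHREADS;    /* arrivals still outstanding        */
    static atomic_bool sense   = false;       /* flips each time the barrier opens */

    void barrier_wait(bool *local_sense) {
        *local_sense = !*local_sense;                 /* sense for this episode     */
        if (atomic_fetch_sub(&waiting, 1) == 1) {     /* last thread to arrive      */
            atomic_store(&waiting, NTHREADS);         /* reset for the next episode */
            atomic_store(&sense, *local_sense);       /* release the spinners       */
        } else {
            while (atomic_load(&sense) != *local_sense)
                ;                                     /* spin (no backoff shown)    */
        }
    }

The appeal of hardware barrier support like the T3E's, or of the hybrid schemes mentioned in the excerpt, is precisely that it removes this serialized counter update and the coherence traffic generated by the spin loop.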

Express Cube Topologies for On-Chip Interconnects

by Boris Grot, Joel Hestness, Stephen W. Keckler, Onur Mutlu - APPEARS IN THE PROCEEDINGS OF THE 15TH INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE, 2009
Abstract - Cited by 38 (9 self)
Driven by continuing scaling of Moore’s law, chip multiprocessors and systems-on-a-chip are expected to grow the core count from dozens today to hundreds in the near future. Scalability of on-chip interconnect topologies is critical to meeting these demands. In this work, we seek to develop a better understanding of how network topologies scale with regard to cost, performance, and energy considering the advantages and limitations afforded on a die. Our contributions are three-fold. First, we propose a new topology, called Multidrop Express Channels (MECS), that uses a one-to-many communication model enabling a high degree of connectivity in a bandwidth-efficient manner. In a 64-terminal network, MECS enjoys a 9% latency advantage over other topologies at low network loads, which extends to over 20% in a 256-terminal network. Second, we demonstrate that partitioning the available wires among multiple networks and channels enables new opportunities for trading off performance, area, and energy-efficiency that depend on the partitioning scheme. Third, we introduce Generalized Express Cubes – a framework for expressing the space of on-chip interconnects – and demonstrate how existing and proposed topologies can be mapped to it.

Citation Context

... result of small hop count and low crossbar complexity. 3.4 Multicast and Broadcast Parallel computing systems often provide hardware support for collective operations such as broadcast and multicast [18, 6]. MECS can easily be augmented to support these collective operations with little additional cost because of the multipoint connectivity. A full broadcast can be implemented in two network hops by fir...
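
Some standard mesh arithmetic (assumed background, not figures from the paper) gives intuition for the hop-count and broadcast claims above. Under uniform random traffic the average distance in a k × k mesh is about

    \bar{h}_{\text{mesh}} \approx \frac{2\,(k^{2}-1)}{3k} \approx \frac{2k}{3}

or roughly 5.3 router-to-router hops for a 64-terminal (8 × 8) network, whereas a topology whose express channels connect each router to every other router in its row and column can reach any destination in at most two hops (one row hop, one column hop); the same one-to-many connectivity is what lets a full broadcast complete in two network hops.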
