Results 1 - 10 of 145
The Landscape of Parallel Computing Research: A View from Berkeley
- Technical Report, UC Berkeley, 2006
Titanium: A High-Performance Java Dialect
- In ACM, 1998
Cited by 268 (30 self)
Abstract: Titanium is a language and system for high-performance parallel scientific computing. Titanium uses Java as its base, thereby leveraging the advantages of that language and allowing us to focus ...
The Cray T3E Network: Adaptive Routing in a High Performance 3D Torus
- 1996
Cited by 149 (7 self)
Abstract: This paper describes the interconnection network used in the Cray T3E multiprocessor. The network is a bidirectional 3D torus with fully adaptive routing, optimized virtual channel assignments, integrated barrier synchronization support, and considerable fault tolerance. The routers are built with LSI’s 500K ASIC technology, with custom transmitters/receivers driving low-voltage differential signals at 375 MHz, for a link data payload capacity of approximately 500 MB/s.
Effects of communication latency, overhead, and bandwidth in a cluster architecture
- In Proceedings of the 24th Annual International Symposium on Computer Architecture, 1997
Cited by 108 (6 self)
Abstract: This work provides a systematic study of the impact of communication performance on parallel applications in a high-performance network of workstations. We develop an experimental system in which the communication latency, overhead, and bandwidth can be independently varied to observe the effects on a wide range of applications. Our results indicate that current efforts to improve cluster communication performance to that of tightly integrated parallel machines result in significantly improved application performance. We show that applications demonstrate strong sensitivity to overhead, slowing down by a factor of 60 on 32 processors when overhead is increased from 3 to 103 µs. Applications in this study are also sensitive to per-message bandwidth, but are surprisingly tolerant of increased latency and lower per-byte bandwidth. Finally, most applications demonstrate a highly linear dependence on both overhead and per-message bandwidth, indicating that further improvements in communication performance will continue to improve application performance.
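The abstract's headline finding — near-linear slowdown in per-message overhead — can be illustrated with a toy linear cost model. This is a sketch under assumed parameters; the function name and all numbers below are illustrative, not taken from the paper.

```python
def toy_runtime(compute_time, n_messages, overhead_us):
    """Toy linear cost model: each message charges `overhead_us` microseconds
    of processor time on both the send and the receive side, and that time
    cannot be overlapped with computation."""
    return compute_time + n_messages * 2 * overhead_us

# Illustrative numbers only: a communication-intensive run.
base = toy_runtime(compute_time=1_000_000, n_messages=100_000, overhead_us=3)
slow = toy_runtime(compute_time=1_000_000, n_messages=100_000, overhead_us=103)
print(slow / base)  # 13.5 under these made-up parameters
```

Because runtime is linear in overhead, any reduction in per-message overhead translates directly into application speedup in this model, which matches the paper's observation.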
Implicit Coscheduling: Coordinated Scheduling with Implicit Information in Distributed Systems
- ACM Transactions on Computer Systems, 1998
Cited by 54 (2 self)
Abstract: In this thesis, we formalize the concept of an implicitly-controlled system, also referred to as an implicit system. In an implicit system, cooperating components do not explicitly contact other components for control or state information; instead, components infer remote state by observing naturally occurring local events and their corresponding implicit information, i.e., information available outside of a defined interface. Many systems, particularly in distributed and networked environments, have leveraged implicit control to simplify the implementation of services with autonomous components. To concretely demonstrate the advantages of implicit control, we propose and implement implicit coscheduling, an algorithm for dynamically coordinating the time...
LoPC: Modeling Contention in Parallel Algorithms
- 1997
Cited by 52 (8 self)
Abstract: Parallel algorithm designers need computational models that take first-order system costs into account, but are also simple enough to use in practice. This paper introduces the LoPC model, which is inspired by the LogP model but accounts for contention for message processing resources in parallel algorithms on a multiprocessor or network of workstations. LoPC takes the L, o, and P parameters directly from the LogP model and uses them to predict the cost of contention, C.
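For readers unfamiliar with the base model that LoPC extends: LogP charges each message a network latency L, a per-message processor overhead o at each end, and a gap g (the reciprocal of per-processor message bandwidth). A minimal sketch of those costs follows; the function names and parameter values are illustrative assumptions, and LoPC's derived contention cost C is not modeled here.

```python
def one_way_time(L, o):
    """LogP cost of a single one-way message: send overhead, then network
    latency, then receive overhead."""
    return o + L + o

def pipelined_time(k, L, o, g):
    """LogP cost of k pipelined messages: the sender can inject a new message
    every max(o, g) time units, and the last message still pays the full
    one-way cost."""
    return (k - 1) * max(o, g) + one_way_time(L, o)

print(one_way_time(L=10, o=3))            # 16
print(pipelined_time(4, L=10, o=3, g=5))  # (4-1)*5 + 16 = 31
```

LoPC's contribution, per the abstract, is predicting the additional queueing cost C that arises when messages contend for these processing resources — something the base LogP arithmetic above does not capture.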
Adaptive History-Based Memory Schedulers
"... As memory performance becomes increasingly important to overall system performance, the need to carefully schedule memory operations also increases. This paper presents a new approach to memory scheduling that considers the history of recently scheduled operations. This history-based approach provid ..."
Abstract
-
Cited by 50 (2 self)
- Add to MetaCart
As memory performance becomes increasingly important to overall system performance, the need to carefully schedule memory operations also increases. This paper presents a new approach to memory scheduling that considers the history of recently scheduled operations. This history-based approach provides two conceptual advantages: (1) it allows the scheduler to better reason about the delays associated with its scheduling decisions, and (2) it allows the scheduler to select operations so that they match the program's mixture of Reads and Writes, thereby avoiding certain bottlenecks within the memory controller. We evaluate our solution using a cycle-accurate simulator for the recently announced IBM Power5. When compared with an in-order scheduler, our solution achieves IPC improvements of 10.9% on the NAS benchmarks and 63% on the data-intensive Stream benchmarks. Using microbenchmarks, we illustrate the growing importance of memory scheduling in the context of CMP's, hardware controlled prefetching, and faster CPU speeds.
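The idea of steering selection by the recent mix of Reads and Writes can be sketched as follows. This is a toy illustration, not the paper's actual Power5 scheduler; the function name, history window, and target ratio are all assumptions.

```python
def pick_next(pending, history, target_read_frac=0.67):
    """Toy history-based pick: choose a Read ('R') or Write ('W') from the
    pending queue so that the fraction of Reads among recently scheduled
    operations tracks an assumed program mix of two Reads per Write."""
    recent = history[-8:]  # bounded window of recently scheduled operations
    frac = recent.count("R") / len(recent) if recent else 0.0
    want = "R" if frac < target_read_frac else "W"
    # Fall back to the other type when the preferred one is unavailable.
    choice = want if want in pending else ("W" if want == "R" else "R")
    pending.remove(choice)
    history.append(choice)
    return choice

pending = ["W", "W", "R", "R", "R"]
history = []
order = [pick_next(pending, history) for _ in range(5)]
print(order)  # ['R', 'W', 'R', 'R', 'W']
```

Note how the Writes are interleaved rather than clustered, which is the kind of mix-matching the abstract credits with avoiding read/write turnaround bottlenecks in the controller.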
Rigel: An architecture and scalable programming interface for a 1000-core accelerator
- In ISCA ’09
"... This paper considers Rigel, a programmable accelerator architecture for a broad class of data- and task-parallel computation. Rigel comprises 1000+ hierarchically-organized cores that use a fine-grained, dynamically scheduled singleprogram, multiple-data (SPMD) execution model. Rigel’s low-level pro ..."
Abstract
-
Cited by 44 (10 self)
- Add to MetaCart
(Show Context)
This paper considers Rigel, a programmable accelerator architecture for a broad class of data- and task-parallel computation. Rigel comprises 1000+ hierarchically-organized cores that use a fine-grained, dynamically scheduled singleprogram, multiple-data (SPMD) execution model. Rigel’s low-level programming interface adopts a single global address space model where parallel work is expressed in a taskcentric, bulk-synchronized manner using minimal hardware support. Compared to existing accelerators, which contain domain-specific hardware, specialized memories, and restrictive programming models, Rigel is more flexible and provides a straightforward target for a broader set of applications. We perform a design analysis of Rigel to quantify the compute density and power efficiency of our initial design. We find that Rigel can achieve a density of over 8 single-precision GF LOP S mm2 in 45nm, which is comparable to high-end GPUs scaled to 45nm. We perform experimental analysis on several applications ported to the Rigel low-level programming interface. We examine scalability issues related to work distribution, synchronization, and load-balancing for 1000-core accelerators using software techniques and minimal specialized hardware support. We find that while it is important to support fast task distribution and barrier operations, these operations can be implemented without specialized hardware using flexible hardware primitives.
Express Cube Topologies for On-Chip Interconnects
- In Proceedings of the 15th International Symposium on High-Performance Computer Architecture, 2009
Cited by 38 (9 self)
Abstract: Driven by continuing scaling of Moore’s law, chip multiprocessors and systems-on-a-chip are expected to grow the core count from dozens today to hundreds in the near future. Scalability of on-chip interconnect topologies is critical to meeting these demands. In this work, we seek to develop a better understanding of how network topologies scale with regard to cost, performance, and energy, considering the advantages and limitations afforded on a die. Our contributions are three-fold. First, we propose a new topology, called Multidrop Express Channels (MECS), that uses a one-to-many communication model enabling a high degree of connectivity in a bandwidth-efficient manner. In a 64-terminal network, MECS enjoys a 9% latency advantage over other topologies at low network loads, which extends to over 20% in a 256-terminal network. Second, we demonstrate that partitioning the available wires among multiple networks and channels enables new opportunities for trading off performance, area, and energy-efficiency that depend on the partitioning scheme. Third, we introduce Generalized Express Cubes – a framework for expressing the space of on-chip interconnects – and demonstrate how existing and proposed topologies can be mapped to it.