User-Space Communication: A Quantitative Study
, 1998
Cited by 31 (5 self)
Powerful commodity systems and networks offer a promising direction for high-performance computing because they are inexpensive and they closely track technology progress. However, high raw hardware performance is rarely delivered to the end user. Previous work has shown that the bottleneck in these architectures is the overhead imposed by the software communication layer. To reduce this overhead, researchers have proposed a number of user-space communication models. The common feature of these models is that applications have direct access to the network, bypassing the operating system in the common case and thus avoiding the cost of send/receive system calls. In this paper we examine five user-space communication layers that represent different points in the configuration space: Generic AM, BIP-0.92, FM-2.02, PM-1.2, and VMMC-2. Although these systems support different communication paradigms and employ a variety of different implementation tradeoffs, we are able to quantitatively...
UTLB: A mechanism for address translation on network interfaces
- In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)
, 1998
Cited by 20 (7 self)
An important aspect of a high-speed network system is the ability to transfer data directly between the network interface and application buffers. Such a direct data path requires the network interface to “know” the virtual-to-physical address translation of a user buffer, i.e., the physical memory location of the buffer. This paper presents an efficient address translation architecture, User-managed TLB (UTLB), which eliminates system calls and device interrupts from the common communication path. UTLB also supports application-specific policies to pin and unpin application memory. We report micro-benchmark results for an implementation on Myrinet PC clusters. A trace-driven analysis is used to compare the UTLB approach with the interrupt-based approach. It is also used to study the effects of UTLB cache size, associativity, and prefetching. Our results show that the UTLB approach delivers robust performance with relatively small translation cache sizes.
The Impact of Data Transfer and Buffering Alternatives on Network Interface Design
, 1998
Cited by 11 (1 self)
The explosive growth in the performance of microprocessors and networks has created a new opportunity to reduce the latency of fine-grain communication. Microprocessor clock speeds are now approaching the gigahertz range. Network switch latencies have dropped to tens of nanoseconds. Unfortunately, this explosive growth also exposes processor accesses to the network interface (NI) as a critical bottleneck for fine-grain communication. Researchers have proposed several techniques, such as using block loads and stores, User-Level DMA, and Coherent Network Interfaces, to alleviate this NI access bottleneck. This paper is the first to systematically identify, examine, and evaluate the key parameters that underlie these design alternatives. We classify these parameters into two categories: data transfer and buffering parameters. The data transfer parameters capture how messages are transferred between internal memory structures (e.g. processor caches, main memory) of a computer and a memory ...
The Optimistic Direct Access File System: Design and Network Interface Support
Cited by 5 (1 self)
The emergence of commercially-available network interface controllers (NICs) with remote direct memory access (RDMA) capability and the prospect of their tighter integration with the host memory system motivate the design of distributed systems based on an RDMA paradigm. A recent example is the Direct Access File System (DAFS). DAFS clients communicate requests to servers using lightweight Remote Procedure Calls (RPC) based on low-overhead message passing. Data can be transmitted either in-line with the messages, or via server-initiated RDMA independent of the messages. DAFS clients do not initiate RDMA, despite the potentially lower latency of this mechanism for short transfers. With current NICs, servers that export their entire file cache would have to resort to excessive page wiring to guarantee success of client-initiated RDMA operations.
An Efficient Virtual Network Interface in the FUGU Scalable Workstation
, 1998
Cited by 2 (0 self)
A scalable workstation is one vision of a mainstream parallel computer: a machine that combines scalable, fine-grain communication facilities for parallel applications with virtual memory and preemptive multiprogramming to support general-purpose workloads. A key challenge in a scalable workstation is the Virtual Network Interface (VNI) problem. The problem is that high performance communication for parallel programming depends on a tight coupling between the application and the network while multiprogramming and virtual memory effects disrupt such coupling. This thesis
Design and Evaluation of Network Interfaces for System Area Networks
, 1998
Cited by 1 (0 self)
Much of a computer's communication performance is determined by how well it interacts with networks. Such interaction is critical for latency-sensitive applications, such as parallel programs that send frequent, short messages. Fortunately, networks have improved dramatically, especially System Area Networks (SANs). SANs provide submicrosecond latency, gigabytes per second bandwidth, and very high reliability to 10-100 hosts. Unfortunately, this dramatic improvement in network performance is seldom delivered to applications. A key bottleneck is the host network interface (NI), which connects a network to a host computer. For example, conventional NIs are usually accessed via direct memory access or uncached, memory-mapped device registers, which can incur latencies between ten and hundreds of microseconds. This thesis investigates novel techniques to improve interactions between a processor and a SAN NI. A key principle underlies these techniques: treat NI access as regular, sideeffec...
Evaluating the performance impact of dynamic handle lookup in modern network interfaces
- In Proc. of the 2nd Annual Workshop on Novel Uses of System Area Networks (SAN-2)
, 2003
Cited by 1 (1 self)
Recent work in low-latency, high-bandwidth communication systems has resulted in building user-level Network Interface Controllers (NICs) and communication abstractions that support direct access from the NIC to application virtual memory to avoid both data copies and operating system intervention. Such mechanisms require the ability to directly manipulate user-level communication buffers for delivering data and achieving protection. To provide such abilities, NICs must maintain appropriate translation data structures. Most user-level NICs manage these data structures statically, which results both in high memory requirements for the NIC and in limitations on the total size and number of communication buffers that a NIC can handle. In this paper, we categorize the types of data structures used by NICs and propose dynamic handle lookup as a mechanism to manage such data structures dynamically. We implement our approach in a modern, user-level communication system, evaluate our design with both micro-benchmarks and real applications, and study the impact of various cache parameters on system performance. In this work we focus mostly on the results of our work. We find that, with appropriate cache tuning, our approach reduces the amount of NIC memory required in our system by a factor of two for the total NIC memory and by more than 80% for the lookup data structures. For larger system configurations the gains can be even more significant. Moreover, our approach eliminates the limitations imposed by current NICs on the amount of host memory that can be used for communication buffers. Our approach increases execution time by at most 3% for all but one of the applications we examine.
Mechanisms for Efficient, Protected Messaging
Fine-grain parallelism is the key to high-performance multicomputing. By partitioning problems into small sub-tasks (grain sizes as small as 70 cycles have been found in common benchmark programs), fine-grain parallelization accelerates existing applications beyond current limits, and promises efficient exploitation of multicomputers consisting of thousands of processors. However, contemporary multiprocessor architectures are not equipped to exploit parallelism at this level, due to high communication and synchronization costs that must be amortized over a large grain size. Operating system-managed message interfaces account for most of the inefficiency in traditional systems. Conversely, in contemporary user-level network interfaces, fast hardware is defeated by software layers that are needed to provide safeguards against starvation and protection violation. This thesis addresses both the efficiency and robustness issues in the message interface. I propose a design which featu...
Architectural Support For User-Level Input/Output
, 2001
The performance of the input/output subsystem is becoming increasingly important for many applications. Commercial I/O intensive applications are a fast growing market segment and experience constantly increasing performance demands. Many of these applications exploit concurrency to overlap the latency of I/O operations to improve throughput. At the same time, semiconductor technology trends result in a growing gap between application and operating system performance. Consequently, operating system overhead increasingly limits the efficiency of latency-hiding techniques to improve throughput. This dissertation develops and evaluates a novel I/O architecture that, by providing user-level access to the I/O subsystem, minimizes I/O overhead while maintaining the level of protection and programming flexibility of conventional kernel-based architectures. Inexpensive hardware mechanisms in the I/O device and host processor implement protected user-level request initiation, user-space data transfers, and user-level notifications. Together, these mechanisms are able to reduce I/O overhead by up to two orders of magnitude. As a result, applications are able to efficiently overlap long-latency I/O operations to maximize throughput and to exploit the scalable bandwidth of next-generation distributed I/O architectures. The flexibility of the basic mechanisms facilitates library implementations of a variety of standard I/O programming models with low overhead, as the architecture does not restrict the allocation and use of I/O buffers.
miNI: Minimizing Network Interface Memory Requirements with Dynamic Handle Lookup
Recent work in low-latency, high-bandwidth communication systems has resulted in building Network Interface Controllers (NICs) and communication abstractions that support direct access from the NIC to application virtual memory to avoid both data copies and operating system intervention. Such mechanisms require the ability to directly manipulate application buffers in host memory for protection and delivering data. Most modern NICs statically maintain address translation and protection information. However, this results both in high memory requirements for the NIC and in limitations on the size of host memory. In this thesis, we categorize the types of data structures for managing communication buffers used in modern NICs, and propose mechanisms to dynamically manage such data structures to alleviate the related limitations. We implement our approach in a modern user-level communication system. The contributions of this thesis are: (i) The integrated approach for dynamic handle lookup that deals with all major lookup data structures reduces NIC memory requirements significantly and eliminates restrictions on

