System architecture directions for networked sensors
- IN ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS
, 2000
Cited by 1789 (58 self)
Technological progress in integrated, low-power, CMOS communication devices and sensors makes a rich design space of networked sensors viable. They can be deeply embedded in the physical world or spread throughout our environment. The missing elements are an overall system architecture and a methodology for systematic advance. To this end, we identify key requirements, develop a small device that is representative of the class, design a tiny event-driven operating system, and show that it provides support for efficient modularity and concurrency-intensive operation. Our operating system fits in 178 bytes of memory, propagates events in the time it takes to copy 1.25 bytes of memory, context switches in the time it takes to copy 6 bytes of memory, and supports two-level scheduling. The analysis lays a groundwork for future architectural advances.
Software Transactional Memory
, 1995
Cited by 695 (14 self)
As we learn from the literature, flexibility in choosing synchronization operations greatly simplifies the task of designing highly concurrent programs. Unfortunately, existing hardware is inflexible and is at best on the level of a Load Linked/Store Conditional operation on a single word. Building on the hardware-based transactional synchronization methodology of Herlihy and Moss, we offer software transactional memory (STM), a novel software method for supporting flexible transactional programming of synchronization operations. STM is non-blocking, and can be implemented on existing machines using only a Load Linked/Store Conditional operation. We use STM to provide a general highly concurrent method for translating sequential object implementations to lock-free ones based on implementing a k-word compare&swap STM-transaction. Empirical evidence collected on simulated multiprocessor architectures shows that our method always outperforms all the lock-free translation methods in ...
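As a rough illustration of what a k-word compare&swap offers the programmer, the sketch below emulates its semantics in Python. It is not the algorithm from the paper: a `threading.Lock` stands in for atomicity here, whereas the abstract describes a non-blocking construction built on Load Linked/Store Conditional. The names `k_word_cas` and `transfer` are hypothetical.

```python
import threading

# Assumption: a lock stands in for atomicity. The STM described in the
# abstract is non-blocking; this sketch only mirrors the k-word CAS semantics.
_atomic = threading.Lock()


def k_word_cas(memory, addresses, expected, new_values):
    """Update every address only if each one still holds its expected value."""
    with _atomic:
        if any(memory[a] != e for a, e in zip(addresses, expected)):
            return False  # some word changed, so nothing is written
        for a, v in zip(addresses, new_values):
            memory[a] = v
        return True


def transfer(memory, src, dst, amount):
    """Retry loop: read both words, compute new values, attempt the 2-word CAS."""
    while True:
        old = [memory[src], memory[dst]]
        new = [old[0] - amount, old[1] + amount]
        if k_word_cas(memory, [src, dst], old, new):
            return


memory = [100, 0, 7]
transfer(memory, 0, 1, 30)
```

The retry loop is the usual pattern for translating a sequential update into an optimistic one: the update either applies to all k words or to none, and a conflicting writer simply forces another attempt.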
Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer
- IN PROCEEDINGS OF THE 21ST ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE
, 1994
Cited by 267 (24 self)
The network interfaces of existing multicomputers require a significant amount of software overhead to provide protection and to implement message passing protocols. This paper describes the design of a low-latency, high-bandwidth, virtual memory-mapped network interface for the SHRIMP multicomputer project at Princeton University. Without sacrificing protection, the network interface achieves low latency by using virtual memory mapping and write-latency hiding techniques, and obtains high bandwidth by providing a user-level block data transfer mechanism. We have implemented several message passing primitives in an experimental environment, demonstrating that our approach can reduce the message passing overhead to a few user-level instructions.
Job Scheduling in Multiprogrammed Parallel Systems
, 1997
Cited by 176 (16 self)
Scheduling in the context of parallel systems is often thought of in terms of assigning tasks in a program to processors, so as to minimize the makespan. This formulation assumes that the processors are dedicated to the program in question. But when the parallel system is shared by a number of users, this is not necessarily the case. In the context of multiprogrammed parallel machines, scheduling refers to the execution of threads from competing programs. This is an operating system issue, involved with resource allocation, not a program development issue. Scheduling schemes for multiprogrammed parallel systems can be classified as single-level or two-level. Single-level scheduling combines the allocation of processing power with the decision of which thread will use it. Two-level scheduling decouples the two issues: first, processors are allocated to the job, and then the job's threads are scheduled using this pool of processors. The processors of a parallel system can be shared i...
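The two-level scheme described above can be sketched in a few lines of Python. This is a minimal illustration under assumed policies, not a scheme taken from the paper: level one splits the processors near-equally among jobs, and level two runs each job's threads round-robin on that job's own pool. The function names and the sample jobs are hypothetical.

```python
from collections import deque


def allocate_processors(job_names, total_processors):
    """Level 1: give each job a near-equal share of the processors (assumed policy)."""
    base, extra = divmod(total_processors, len(job_names))
    return {name: base + (1 if i < extra else 0)
            for i, name in enumerate(job_names)}


def schedule_threads(threads, processors, time_slices):
    """Level 2: round-robin one job's threads over its own processor pool."""
    queue = deque(threads)
    timeline = []
    for _ in range(time_slices):
        running = [queue.popleft() for _ in range(min(processors, len(queue)))]
        timeline.append(running)
        queue.extend(running)  # preempted threads rejoin the back of the queue
    return timeline


jobs = {"A": ["a0", "a1", "a2"], "B": ["b0", "b1"]}
shares = allocate_processors(list(jobs), total_processors=3)
plan = {name: schedule_threads(jobs[name], shares[name], time_slices=3)
        for name in jobs}
```

Keeping the two functions separate mirrors the decoupling in the text: the allocation policy can change without touching how a job orders its own threads.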
Sparcle: An Evolutionary Processor Design for Large-Scale Multiprocessors
- IEEE MICRO
, 1993
Cited by 112 (21 self)
Sparcle is a processor chip developed jointly by MIT, LSI Logic, and SUN Microsystems, by evolving an existing RISC architecture towards a processor suited for large-scale multiprocessors. Sparcle supports three multiprocessor mechanisms: fast context switching, fast, user-level message handling, and fine-grain synchronization. The Sparcle effort demonstrates that RISC architectures coupled with a communications and memory management unit do not require major architectural changes to support multiprocessing efficiently.
The M-Machine Multicomputer
, 1995
Cited by 111 (12 self)
The M-Machine is an experimental multicomputer being developed to test architectural concepts motivated by the constraints of modern semiconductor technology and the demands of programming systems. The M-Machine computing nodes are connected with a 3-D mesh network; each node is a multithreaded processor incorporating 12 function units, on-chip cache, and local memory. The multiple function units are used to exploit both instruction-level and thread-level parallelism. A user-accessible message passing system yields fast communication and synchronization between nodes. Rapid access to remote memory is provided transparently to the user with a combination of hardware and software mechanisms. This paper presents the architecture of the M-Machine and describes how its mechanisms attempt to maximize both single thread performance and overall system throughput. The architecture is complete and the MAP chip, which will serve as the M-Machine processing node, is currently being implemented.
Integrating Message-Passing and Shared-Memory: Early Experience
, 1993
Cited by 100 (15 self)
This paper discusses some of the issues involved in implementing a shared-address space programming model on large-scale, distributed-memory multiprocessors. While such a programming model can be implemented on both shared-memory and message-passing architectures, we argue that the transparent, coherent caching of global data provided by many shared-memory architectures is of crucial importance. Because message-passing mechanisms are much more efficient than shared-memory loads and stores for certain types of interprocessor communication and synchronization operations, however, we argue for building multiprocessors that efficiently support both shared-memory and message-passing mechanisms. We describe an architecture, Alewife, that integrates support for shared-memory and message-passing through a simple interface; we expect the compiler and runtime system to cooperate in using appropriate hardware mechanisms that are most efficient for specific operations. We report on both integrated and exclusively shared-memory implementations of our runtime system and two applications. The integrated runtime system drastically cuts down the cost of communication incurred by the scheduling, load balancing, and certain synchronization operations. We also present preliminary performance results comparing the two systems.
A Tightly-Coupled Processor-Network Interface
- IN PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS (ASPLOS V)
, 1992
Cited by 85 (3 self)
Careful design of the processor-network interface can dramatically reduce the software overhead of interprocessor communication. Our interface architecture reduces communication overhead fivefold in our benchmarks. Most of our performance gain comes from simple, low-cost hardware mechanisms for fast dispatching on, forwarding of, and replying to messages. The remaining improvement can be gained by implementing the network interface as part of the processor's register file. For example, using our hardware mechanisms a register-mapped interface can receive, process, and reply to a remote read request in a total of two RISC instructions. We have implemented an RTL model of an off-chip memory-mapped interface which provides our hardware mechanisms. Our industrial partner, Motorola, is implementing a similar network interface on-chip in an experimental version of the 88110 processor.
Compressionless Routing: A Framework for Adaptive and Fault-tolerant Routing
, 1997
Cited by 72 (5 self)
Compressionless Routing (CR) is a new adaptive routing framework which provides a unified framework for efficient deadlock-free adaptive routing and fault-tolerance. CR exploits the tight-coupling between wormhole routers for flow control to detect and recover from potential deadlock situations. Fault-tolerant Compressionless Routing (FCR) extends CR to support end-to-end fault-tolerant delivery. Detailed routing algorithms, implementation complexity, and performance simulation results for CR and FCR are presented. These results show that the hardware for CR and FCR networks is modest. Further, CR and FCR networks can achieve superior performance to alternatives such as dimension-order routing. Compressionless Routing has several key advantages: deadlock-free adaptive routing in toroidal networks with no virtual channels, simple router designs, order-preserving message transmission, applicability to a wide variety of network topologies, and elimination of the need for buffer allocation messages. Fault-tolerant Compressionless Routing has several additional advantages: data integrity in the presence of transient faults (nonstop fault-tolerance), tolerance of permanent faults, and elimination of the need for software buffering and retry for reliability. The advantages of CR and FCR not only simplify hardware support for adaptive routing and fault-tolerance, they also can simplify software communication layers.