Results 1–10 of 63
Software Transactional Memory
, 1995
Cited by 695 (14 self)
As we learn from the literature, flexibility in choosing synchronization operations greatly simplifies the task of designing highly concurrent programs. Unfortunately, existing hardware is inflexible and is at best on the level of a Load Linked/Store Conditional operation on a single word. Building on the hardware-based transactional synchronization methodology of Herlihy and Moss, we offer software transactional memory (STM), a novel software method for supporting flexible transactional programming of synchronization operations. STM is nonblocking, and can be implemented on existing machines using only a Load Linked/Store Conditional operation. We use STM to provide a general highly concurrent method for translating sequential object implementations to lock-free ones, based on implementing a k-word compare&swap STM transaction. Empirical evidence collected on simulated multiprocessor architectures shows that our method always outperforms all the lock-free translation methods in ...
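The semantics of the k-word compare&swap at the heart of this construction can be sketched as follows. This is a deliberately simplified, blocking stand-in: a single lock replaces the per-word ownership protocol that makes the real STM nonblocking, and the class and method names are our own.

```python
import threading

class KWordCAS:
    """Illustrative k-word compare&swap over a shared array.

    A single lock stands in for the per-word ownership records of the
    nonblocking STM construction; this sketch only shows the semantics:
    all k words are checked and, if every one matches its expected
    value, all k are updated as a single atomic step.
    """
    def __init__(self, size):
        self.mem = [0] * size
        self._lock = threading.Lock()

    def kcas(self, addrs, expected, new):
        # Atomically: if mem[a] == e for every (a, e) pair, install all
        # new values and report success; otherwise change nothing.
        with self._lock:
            if all(self.mem[a] == e for a, e in zip(addrs, expected)):
                for a, v in zip(addrs, new):
                    self.mem[a] = v
                return True
            return False

mem = KWordCAS(4)
assert mem.kcas([0, 2], [0, 0], [7, 9])      # both words matched: commit
assert not mem.kcas([0, 2], [0, 0], [1, 1])  # word 0 is now 7: abort
```

In the nonblocking version, a transaction instead acquires ownership of each word in a fixed order and helps conflicting transactions complete rather than waiting on them.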
Adding Networks
, 2001
Cited by 117 (33 self)
An adding network is a distributed data structure that supports a concurrent, lock-free, low-contention implementation of a fetch&add counter; a counting network is an instance of an adding network that supports only fetch&increment. We present a lower bound showing that adding networks have inherently high latency. Any adding network powerful enough to support addition by at least two values a and b, where a > b > 0, has sequential executions in which each token traverses Ω(n/c) switching elements, where n is the number of concurrent processes and c is a quantity we call one-shot contention; for a large class of switching networks and for conventional counting networks the one-shot contention is constant. In contrast, counting networks have O(log n) latency [4,7]. This bound is tight. We present the first concurrent, lock-free, low-contention networked data structure that supports arbitrary fetch&add operations.
A scalable lock-free stack algorithm
 In SPAA’04: Symposium on Parallelism in Algorithms and Architectures
, 2004
Cited by 80 (11 self)
The literature describes two high-performance concurrent stack algorithms based on combining funnels and elimination trees. Unfortunately, the funnels are linearizable but blocking, and the elimination trees are nonblocking but not linearizable. Neither is used in practice since they perform well only at exceptionally high loads. The literature also describes a simple lock-free linearizable stack algorithm that works at low loads but does not scale as the load increases. The question of designing a stack algorithm that is nonblocking, linearizable, and scales well throughout the concurrency range has thus remained open. This paper presents such a concurrent stack algorithm. It is based on the following simple observation: a single elimination array used as a backoff scheme for a simple lock-free stack is lock-free, linearizable, and scalable. As our empirical results show, the resulting elimination-backoff stack performs as well as the simple stack at low loads, and increasingly outperforms all other methods (lock-based and nonblocking) as concurrency increases. We believe its simplicity and scalability make it a viable practical alternative to existing constructions for implementing concurrent stacks.
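The observation the abstract describes can be illustrated with a sequential sketch: an ordinary linked-list stack plus one elimination slot in which a backing-off push can meet a pop, so the pair cancel out without ever touching the stack top. In the concurrent algorithm the list head is updated with CAS and the slot is one cell of a randomized elimination array; the class and method names below are our own.

```python
class EliminationStack:
    """Sequential sketch of the elimination-backoff stack idea."""
    def __init__(self):
        self.top = None    # linked list of (value, rest) pairs
        self.slot = None   # value parked by a push, awaiting a pop

    def push(self, value):
        # Fast path: new node becomes the head (a CAS in the real algorithm).
        self.top = (value, self.top)

    def push_eliminating(self, value):
        # Backoff path: instead of retrying a contended CAS on the top
        # pointer, park the value in the elimination slot.
        if self.slot is None:
            self.slot = value
            return True
        return False       # slot busy; the caller would retry the fast path

    def pop(self):
        # A pop first tries to meet a parked push in the slot ...
        if self.slot is not None:
            value, self.slot = self.slot, None
            return value
        # ... and otherwise takes the list head.
        if self.top is None:
            return None
        value, self.top = self.top
        return value
```

An eliminated push/pop pair is linearizable because the two operations take effect together at the moment of their exchange, which is what lets the scheme serve as a backoff path without giving up correctness.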
Cloud control with distributed rate limiting
 In SIGCOMM
, 2007
Cited by 71 (4 self)
Today’s cloud-based services integrate globally distributed resources into seamless computing platforms. Provisioning and accounting for the resource usage of these Internet-scale applications presents a challenging technical problem. This paper presents the design and implementation of distributed rate limiters, which work together to enforce a global rate limit across traffic aggregates at multiple sites, enabling the coordinated policing of a cloud-based service’s network traffic. Our abstraction not only enforces a global limit, but also ensures that congestion-responsive transport-layer flows behave as if they traversed a single, shared limiter. We present two designs—one general purpose, and one optimized for TCP—that allow service operators to explicitly trade off between communication costs and system accuracy, efficiency, and scalability. Both designs are capable of rate limiting thousands of flows with negligible overhead (less than 3% in the tested configuration). We demonstrate that our TCP-centric design is scalable to hundreds of nodes while robust to both loss and communication delay, making it practical for deployment in nationwide service providers.
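The core allocation step behind such limiters can be sketched in a few lines: periodically re-split one global limit across sites in proportion to each site's recent demand. This is a toy illustration, not the paper's protocol; the actual designs additionally handle demand estimation, gossip communication, and TCP flow dynamics, and the function name is our own.

```python
def reallocate(global_limit, demands):
    """Split one global rate limit across limiters in proportion to each
    site's recently observed demand; when all sites are idle, fall back
    to an equal split. (Hypothetical helper for illustration only.)"""
    total = sum(demands)
    if total == 0:
        share = global_limit / len(demands)
        return [share] * len(demands)
    return [global_limit * d / total for d in demands]

# Three sites, 100 Mbps global limit: allocations track demand and
# always sum to the global limit.
assert reallocate(100.0, [30, 10, 0]) == [75.0, 25.0, 0.0]
```

Each site then enforces its local share with an ordinary token bucket, so traffic in aggregate never exceeds the global limit between reallocation rounds.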
Elimination Trees and the Construction of Pools and Stacks
, 1996
Cited by 45 (13 self)
Shared pools and stacks are two coordination structures with a history of applications ranging from simple producer/consumer buffers to job schedulers and procedure stacks. This paper introduces elimination trees, a novel form of diffracting trees that offer pool and stack implementations with superior response (on average constant) under high loads, while guaranteeing logarithmic-time "deterministic" termination under sparse request patterns. A preliminary version of this paper appeared in the proceedings of the 7th Annual Symposium on Parallel Algorithms and Architectures (SPAA).
An Inherent Bottleneck in Distributed Counting
 Journal of Parallel and Distributed Computing
, 1997
Cited by 20 (5 self)
A distributed counter allows each processor in an asynchronous message-passing network to access the counter value and increment it. We study the problem of implementing a distributed counter such that no processor is a communication bottleneck. We prove a lower bound of Ω(n/log log n) on the number of messages that some processor must exchange in a sequence of n counting operations spread over n processors. We propose a counter that achieves this bound when each processor increments the counter exactly once. Hence, the lower bound is tight. Because most algorithms and data structures count in some way, the lower bound holds for many distributed computations. We feel that the proposed concept of a communication bottleneck is a relevant measure of efficiency for a distributed algorithm and data structure, because it indicates the achievable degree of distribution.
Linear lower bounds on real-world implementations of concurrent objects
 In Proceedings of the 46th Annual Symposium on Foundations of Computer Science (FOCS)
, 2005
Cited by 19 (10 self)
This paper proves Ω(n) lower bounds on the time to perform a single instance of an operation in any implementation of a large class of data structures shared by n processes. For standard data structures such as counters, stacks, and queues, the bound is tight. The implementations considered may apply any deterministic primitives to a base object. No bounds are assumed on either the number of base objects or their size. Time is measured as the number of steps a process performs on base objects and the number of stalls it incurs as a result of contention with other processes.
A Steady State Analysis of Diffracting Trees
, 1997
Cited by 18 (3 self)
Diffracting trees are an effective and highly scalable distributed-parallel technique for shared counting and load balancing. This paper presents the first steady-state combinatorial model and analysis for diffracting trees, and uses it to answer several critical algorithmic design questions. Our model is simple and sufficiently high level to overcome many implementation-specific details, and yet as we will show it is rich enough to accurately predict empirically observed behaviors. As a result of our analysis we were able to identify starvation problems in the original diffracting tree algorithm and modify it to create a more stable version. We are also able to identify the range in which the diffracting tree performs most efficiently, and the ranges in which its performance degrades. We believe our model and modeling approach open the way to steady-state analysis of other distributed-parallel structures such as counting networks and elimination trees.
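The balancer-based trees this analysis targets can be illustrated with a tiny sequential counting tree. Each balancer is a toggle bit that routes arriving tokens to alternating output wires; a diffracting tree adds a "prism" in front of each balancer so that pairs of colliding tokens can take opposite wires without touching the toggle. The classes below are our own simplification and omit the prisms and all synchronization.

```python
class Balancer:
    """A toggle balancer: consecutive tokens leave on alternating wires."""
    def __init__(self):
        self.toggle = 0

    def traverse(self):
        out = self.toggle
        self.toggle ^= 1
        return out

class CountingTree:
    """Width-4 counting tree: leaf i hands out i, i + 4, i + 8, ...
    Run one token at a time, any n traversals return exactly 0..n-1."""
    def __init__(self):
        self.root = Balancer()
        self.inner = [Balancer(), Balancer()]
        self.leaf_count = [0, 0, 0, 0]

    def next_value(self):
        b = self.root.traverse()       # choose a subtree
        w = self.inner[b].traverse()   # choose a leaf within it
        leaf = 2 * b + w
        value = leaf + 4 * self.leaf_count[leaf]
        self.leaf_count[leaf] += 1
        return value
```

The scalability question the paper models is how often tokens actually pair up in the prisms, since every unpaired token falls through to the shared toggle and contends there.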
Skiplist-based Concurrent Priority Queues
, 2000
Cited by 17 (3 self)
This paper addresses the problem of designing scalable concurrent priority queues for large-scale multiprocessors – machines with up to several hundred processors. Priority queues are fundamental in the design of modern multiprocessor algorithms, with many classical applications ranging from numerical algorithms through discrete event simulation and expert systems. While highly scalable approaches have been introduced for the special case of queues with a fixed set of priorities, the most efficient designs for the general case are based on the parallelization of the heap data structure. Though numerous intricate heap-based schemes have been suggested in the literature, their scalability seems to be limited to small machines in the range of ten to twenty processors. This paper proposes an alternative approach: to base the design of concurrent priority queues on the probabilistic skiplist data structure rather than on a heap. To this end, we show that a concurrent skiplist structure, following a simple set of modifications, provides a concurrent priority queue with a higher level of parallelism and significantly less contention than the fastest known heap-based algorithms. Our initial empirical evidence, collected on a simulated 256-node shared-memory multiprocessor architecture similar to the MIT Alewife, suggests that the new skiplist-based priority queue algorithm scales significantly better than heap-based schemes throughout most of the concurrency range. With 256 processors, it is about 3 times faster in performing deletions and up to 10 times faster in performing insertions.
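The sequential core of the approach can be sketched as a skiplist whose delete-min simply unlinks the first node of the bottom level. The paper's contribution is the set of modifications that make insertions and deletions safe to overlap; the sketch below is purely sequential, and the class names and fixed maximum height are our choices.

```python
import random

class SkipNode:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height

class SkipListPQ:
    """Sequential skiplist used as a priority queue (min at the front)."""
    MAX_HEIGHT = 8

    def __init__(self):
        # Head sentinel sits before every key at every level.
        self.head = SkipNode(float("-inf"), self.MAX_HEIGHT)

    def _random_height(self):
        # Geometric heights give the skiplist its probabilistic balance.
        h = 1
        while h < self.MAX_HEIGHT and random.random() < 0.5:
            h += 1
        return h

    def insert(self, key):
        # Find the rightmost node before `key` on every level.
        update = [self.head] * self.MAX_HEIGHT
        node = self.head
        for lvl in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[lvl] is not None and node.next[lvl].key < key:
                node = node.next[lvl]
            update[lvl] = node
        new = SkipNode(key, self._random_height())
        for lvl in range(len(new.next)):
            new.next[lvl] = update[lvl].next[lvl]
            update[lvl].next[lvl] = new

    def delete_min(self):
        # The minimum is always the first node on the bottom level, and
        # the head's pointers at that node's levels all point to it.
        first = self.head.next[0]
        if first is None:
            return None
        for lvl in range(len(first.next)):
            self.head.next[lvl] = first.next[lvl]
        return first.key
```

Because delete-min always works at the front of the list while insertions are spread across it, contention in the concurrent version is concentrated on far fewer nodes than in a heap, where every operation traverses the root.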
Randomized Priority Queues for Fast Parallel Access
 Journal of Parallel and Distributed Computing
, 1997
Cited by 16 (1 self)
Applications like parallel search or discrete event simulation often assign priority or importance to pieces of work. An effective way to exploit this for parallelization is to use a priority queue data structure for scheduling the work; but a bottleneck-free implementation of parallel priority queue access by many processors is required to make this approach scalable. We present simple and portable randomized algorithms for parallel priority queues on distributed-memory machines with fully distributed storage. Accessing O(n) out of m elements on an n-processor network with diameter d requires amortized time O(...) with high probability for many network types. On logarithmic-diameter networks, the algorithms are as fast as the best previously known EREW PRAM methods. Implementations demonstrate that the approach is already useful for medium-scale parallelism.