@MISC{Giacomoni_(a)lamport, author = {John Giacomoni and Manish Vachharajani}, title = {(a) Lamport}, year = {} }
Abstract
High-rate core-to-core communication is critical for efficient pipeline-parallel software architectures. This paper introduces FastForward, a software-only, low-overhead, high-rate queue algorithm for pipeline parallelism on multicore architectures. FastForward uses an architecturally tuned, domain-specific adaptation of concurrent lock-free queues to provide low-latency and low-overhead core-to-core communication. Enqueue and dequeue times on a 2 GHz Opteron 270 based system are as low as 36 ns, up to 4x faster than Lamport’s solution.

1 Design and Initial Results

Traditionally, improvements in processor design and fabrication technology have permitted software developers to deliver next-generation applications, including modern genomics and software-defined radios. The challenge with multicore systems is to develop a set of techniques or hardware modifications that help application-level developers continue to harness the power of systems. This work introduces a new concurrent lock-free single-producer/single-consumer queue algorithm (FastForward) that is up to 4x faster than Lamport’s queue [2] on commodity cache-coherent processors, permitting developers to achieve performance improvements with fine-grain parallelism [1].

FastForward does this by eliminating the cache-unfriendly implicit coupling in Lamport’s queue. Observe that alternating queue operations with Lamport’s algorithm (Figure 1(a)) necessitate the transfer of the head and tail indices between caches for every operation, because each side must read the index that the other side writes. FastForward, counter-intuitively, decouples the two sides by coupling control and data transfer in the storage buffer (Figure 1(b)). Decoupled operation is ensured by separating the producer and consumer in time, thus permitting each processor to operate concurrently on separate cache lines without interference. Hardware prefetching masks the cost of cache transfers for the storage array itself.
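The idea of folding control into the storage buffer can be sketched in C as follows. This is a minimal illustration, not the authors' code: it assumes that an empty slot is marked by a NULL pointer, that a cache line is 64 bytes, and that C11 atomics are an acceptable stand-in for the architecture-specific ordering the paper relies on. The names `ff_queue`, `ff_enqueue`, `ff_dequeue`, `FF_SIZE`, and `FF_NEXT` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define FF_SIZE 8 /* hypothetical capacity; a power of two so FF_NEXT can mask */
#define FF_NEXT(i) (((i) + 1) & (FF_SIZE - 1))

typedef struct {
    _Atomic(void *) buf[FF_SIZE]; /* assumption: NULL marks an empty slot */
    /* Each index is private to one side; 64-byte alignment (an assumed
       cache-line size) keeps the two indices on separate lines. */
    _Alignas(64) size_t head; /* touched only by the producer */
    _Alignas(64) size_t tail; /* touched only by the consumer */
} ff_queue;

static void ff_init(ff_queue *q) {
    for (size_t i = 0; i < FF_SIZE; i++)
        atomic_init(&q->buf[i], NULL);
    q->head = 0;
    q->tail = 0;
}

/* Producer side: returns 0 on success, -1 if the next slot is still occupied. */
static int ff_enqueue(ff_queue *q, void *data) {
    assert(data != NULL); /* NULL is reserved as the empty-slot marker */
    if (atomic_load_explicit(&q->buf[q->head], memory_order_acquire) != NULL)
        return -1; /* queue full: the consumer has not cleared this slot yet */
    atomic_store_explicit(&q->buf[q->head], data, memory_order_release);
    q->head = FF_NEXT(q->head);
    return 0;
}

/* Consumer side: returns 0 and stores the value in *out, or -1 if empty. */
static int ff_dequeue(ff_queue *q, void **out) {
    void *data = atomic_load_explicit(&q->buf[q->tail], memory_order_acquire);
    if (data == NULL)
        return -1; /* queue empty: the producer has not filled this slot yet */
    atomic_store_explicit(&q->buf[q->tail], NULL, memory_order_release);
    q->tail = FF_NEXT(q->tail);
    *out = data;
    return 0;
}
```

In this sketch neither side ever reads the other side's index. The full and empty tests look only at the slot itself, so the only cache lines shared between the two cores are those of the storage array.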
The references [1] prove that “in the program order of the consumer, the consumer dequeues values in the same order

    if (NEXT(head) == tail) {
        // Handle full queue.
    }
    buf[head] = data;
    head = NEXT(head);
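For comparison, the Lamport-style enqueue fragment above can be completed into a compilable sketch with a matching dequeue. This is an illustration under assumptions rather than the figure's exact code: it keeps the fragment's convention that the producer writes at `head`, assumes the consumer reads at `tail`, and uses C11 atomics for the shared indices. The names `lamport_queue`, `lq_enqueue`, `lq_dequeue`, `LQ_SIZE`, and `LQ_NEXT` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define LQ_SIZE 8 /* hypothetical capacity; one slot always stays unused */
static_assert((LQ_SIZE & (LQ_SIZE - 1)) == 0, "LQ_SIZE must be a power of two");
#define LQ_NEXT(i) (((i) + 1) & (LQ_SIZE - 1))

typedef struct {
    void *buf[LQ_SIZE];
    atomic_size_t head; /* next slot to write; advanced by the producer */
    atomic_size_t tail; /* next slot to read; advanced by the consumer */
} lamport_queue;

static void lq_init(lamport_queue *q) {
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Producer side: returns 0 on success, -1 if the queue is full. */
static int lq_enqueue(lamport_queue *q, void *data) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t next = LQ_NEXT(head);
    /* The producer must read the consumer's index on every call. */
    if (next == atomic_load_explicit(&q->tail, memory_order_acquire))
        return -1;
    q->buf[head] = data;
    atomic_store_explicit(&q->head, next, memory_order_release);
    return 0;
}

/* Consumer side: returns 0 and stores the value in *out, or -1 if empty. */
static int lq_dequeue(lamport_queue *q, void **out) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    /* The consumer must read the producer's index on every call. */
    if (tail == atomic_load_explicit(&q->head, memory_order_acquire))
        return -1;
    *out = q->buf[tail];
    atomic_store_explicit(&q->tail, LQ_NEXT(tail), memory_order_release);
    return 0;
}
```

Both operations compare `head` against `tail`, so each core repeatedly reads an index that the other core writes. That shared access is the index traffic between caches that the text attributes to Lamport's algorithm.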