Results 1 - 10
of
21
Delaunay Triangulation with Transactions and Barriers
- IEEE Intl. Symp. on Workload Characterization
, 2007
"... Transactional memory has been widely hailed as a simpler alternative to locks in multithreaded programs, but few nontrivial transactional programs are currently available. We describe an open-source implementation of Delaunay triangulation that uses transactions as one component of a larger parallel ..."
Abstract
-
Cited by 23 (9 self)
- Add to MetaCart
Transactional memory has been widely hailed as a simpler alternative to locks in multithreaded programs, but few nontrivial transactional programs are currently available. We describe an open-source implementation of Delaunay triangulation that uses transactions as one component of a larger parallelization strategy. The code is written in C++, for use with the RSTM software transactional memory library (also open source). It employs one of the fastest known sequential algorithms to triangulate geometrically partitioned regions in parallel; it then employs alternating, barrier-separated phases of transactional and partitioned work to stitch those regions together. Experiments on multiprocessor and multicore machines confirm excellent single-thread performance and good speedup with increasing thread count. Since execution time is dominated by geometrically partitioned computation, performance is largely insensitive to the overhead of transactions, but highly sensitive to any costs imposed on sharable data that are currently “privatized”. 1.
Software Transactional Memory for Large Scale Clusters
- PPOPP'08
, 2008
"... While there has been extensive work on the design of software transactional memory (STM) for cache coherent shared memory systems, there has been no work on the design of an STM system for very large scale platforms containing potentially thousands of nodes. In this work, we present Cluster-STM, an ..."
Abstract
-
Cited by 18 (0 self)
- Add to MetaCart
While there has been extensive work on the design of software transactional memory (STM) for cache coherent shared memory systems, there has been no work on the design of an STM system for very large scale platforms containing potentially thousands of nodes. In this work, we present Cluster-STM, an STM designed for high performance on large-scale commodity clusters. Our design addresses several novel issues posed by this domain, including aggregating communication, managing locality, and distributing transactional metadata onto the nodes. We also re-evaluate several STM design choices previously studied for cache-coherent machines and conclude that, in some cases, different choices are appropriate on clusters. Finally, we show that our design scales well up to 512 processors. This is because on a cluster, the main barrier to STM scalability is the remote communication overhead imposed by the STM operations, and our design aggregates most of that communication with the communication of the underlying data.
Flexible Decoupled Transactional Memory Support
- IN ISCA
, 2007
"... A high-concurrency transactional memory (TM) implementation needs to track concurrent accesses, buffer speculative updates, and manage conflicts. We present a system, FlexTM (FLEXible Transactional Memory), that coordinates four decoupled hardware mechanisms: read and write signatures, which summari ..."
Abstract
-
Cited by 15 (3 self)
- Add to MetaCart
A high-concurrency transactional memory (TM) implementation needs to track concurrent accesses, buffer speculative updates, and manage conflicts. We present a system, FlexTM (FLEXible Transactional Memory), that coordinates four decoupled hardware mechanisms: read and write signatures, which summarize per-thread access sets; per-thread conflict summary tables (CSTs), which identify the threads with which conflicts have occurred; Programmable Data Isolation, which maintains speculative updates in the local cache and employs a thread-private buffer (in virtual memory) in the rare event of overflow; and Alert-On-Update, which selectively notifies threads about coherence events. All mechanisms are softwareaccessible, to enable virtualization and to support transactions of arbitrary length. FlexTM allows software to determine when to manage conflicts (either eagerly or lazily), and to employ a variety of conflict management and commit protocols. We describe an STM-inspired protocol that uses CSTs to manage conflicts in a distributed manner (no global arbitration) and allows parallel commits. In experiments with a prototype on Simics/GEMS, FlexTM exhibits ∼5 × speedup over high-quality software TM, with no loss in policy flexibility. Its distributed commit protocol is also more efficient than a central hardware manager. Our results highlight the importance of flexibility in determining when to manage conflicts: lazy maximizes concurrency and helps to ensure forward progress while eager provides better overall utilization in a multi-programmed system.
Dependence-Aware Transactional Memory for Increased Concurrency
, 2008
"... Abstract—Transactional memory (TM) is a promising paradigm for helping programmers take advantage of emerging multicore platforms. Though they perform well under low contention, hardware TM systems have a reputation of not performing well under high contention, as compared to locks. This paper prese ..."
Abstract
-
Cited by 11 (3 self)
- Add to MetaCart
Abstract—Transactional memory (TM) is a promising paradigm for helping programmers take advantage of emerging multicore platforms. Though they perform well under low contention, hardware TM systems have a reputation of not performing well under high contention, as compared to locks. This paper presents a model and implementation of dependence-aware transactional memory (DATM), a novel solution to the problem of scaling under contention. Unlike many proposals to deal with write-shared data (which arise in common data structures like counters and linked lists), DATM operates transparently to the programmer. The main idea in DATM is to accept any transaction execution interleaving that is conflict serializable, including interleavings that contain simple conflicts. Current TM systems reduce useful concurrency by restarting conflicting transactions, even if the execution interleaving is conflict serializable. DATM manages dependences between uncommitted transactions, sometimes forwarding data between them to safely commit conflicting transactions. The evaluation of our prototype shows that DATM increases concurrency, for example by reducing the runtime of STAMP benchmarks by up to 39 % and reducing transaction restarts by up to 94%. I.
Abstract Capabilities and Limitations of Library-Based Software Transactional Memory in C++ ∗
"... Like many past extensions to user programming models, transactions can be added to the programming language or implemented in a library using existing language features. We describe a library-based transactional memory API for C++. Designed to address the limitations of an earlier API with similar f ..."
Abstract
-
Cited by 9 (0 self)
- Add to MetaCart
Like many past extensions to user programming models, transactions can be added to the programming language or implemented in a library using existing language features. We describe a library-based transactional memory API for C++. Designed to address the limitations of an earlier API with similar functionality, the new interface leverages macros, exceptions, multiple inheritance, generics (templates), and overloading of operators (including pointer dereference) in an attempt to minimize syntactic clutter, admit a wide variety of back-end implementations, avoid arbitrary restrictions on otherwise valid language constructs, enable privatization, catch as many programmer errors as possible, and provide semantics that “seem natural ” to C++ programmers. Having used our API to construct several small and one large application, we conclude that while the interface is a significant improvement on earlier efforts, and makes it practical for systems researchers to build nontrivial applications, it fails to realize the programming simplicity that was supposed to be the motivation for transactions in the first place. Several groups have proposed compiler support as a way to improve the performance of transactions. We conjecture that compiler—and language—support will be even more important as a way to improve the programming model. 1.
Thread-Safe Dynamic Binary Translation using Transactional Memory
"... Dynamic binary translation (DBT) is a runtime instrumentation technique commonly used to support profiling, optimization, secure execution, and bug detection tools for application binaries. However, DBT frameworks may incorrectly handle multithreaded programs due to races involving updates to the ap ..."
Abstract
-
Cited by 9 (1 self)
- Add to MetaCart
Dynamic binary translation (DBT) is a runtime instrumentation technique commonly used to support profiling, optimization, secure execution, and bug detection tools for application binaries. However, DBT frameworks may incorrectly handle multithreaded programs due to races involving updates to the application data and the corresponding metadata maintained by the DBT. Existing DBT frameworks handle this issue by serializing threads, disallowing multithreaded programs, or requiring explicit use of locks. This paper presents a practical solution for correct execution of multithreaded programs within DBT frameworks. To eliminate races involving metadata, we propose the use of transactional memory (TM). The DBT uses memory transactions to encapsulate the data and metadata accesses in a trace, within one atomic block. This approach guarantees correct execution of concurrent threads of the translated program, as TM mechanisms detect and correct races. To demonstrate this approach, we implemented a DBTbased tool for secure execution of x86 binaries using dynamic information flow tracking. This is the first such framework that correctly handles multithreaded binaries without serialization. We show that the use of software transactions in the DBT leads to a runtime overhead of 40%. We also show that software optimizations in the DBT and hardware support for transactions can reduce the runtime overhead to 6%. 1.
Intermediate Checkpointing with Conflicting Access Prediction in Transactional Memory Systems*
"... Transactional memory systems promise to reduce the burden of exposing thread-level parallelism in programs by relieving programmers from analyzing complex inter-thread dependences in detail. By encapsulating large program code blocks and executing them as atomic blocks, dependence checking is deferr ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
Transactional memory systems promise to reduce the burden of exposing thread-level parallelism in programs by relieving programmers from analyzing complex inter-thread dependences in detail. By encapsulating large program code blocks and executing them as atomic blocks, dependence checking is deferred to run-time at which point one of many conflicting transactions will be committed whereas the others will have to roll-back and reexecute. In current proposals, a checkpoint is taken at the beginning of the atomic block and all execution can be wasted even if the conflicting access happens at the end of the atomic block In this paper, we propose a novel scheme that (1) predicts when the first conflicting access occurs and (2) inserts a checkpoint before it is executed. When the prediction is correct, the only execution discarded is the one that has to be re-done. When the prediction is incorrect, the whole transaction has to be re-executed just as before. Overall, we find that our scheme manages to maintain high prediction accuracy and leads to a quite significant reduction in the number of lost cycles due to roll-backs; the geometric mean speedup across five applications is 16%. 1.
Refereeing Conflicts in Hardware Transactional Memory ∗
"... In the search for high performance, most transactional memory (TM) systems execute atomic blocks concurrently and must thus be prepared for data conflicts. The TM system must then choose a policy to decide when and how to manage the resulting contention. In this paper, we analyze the interplay betwe ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
In the search for high performance, most transactional memory (TM) systems execute atomic blocks concurrently and must thus be prepared for data conflicts. The TM system must then choose a policy to decide when and how to manage the resulting contention. In this paper, we analyze the interplay between conflict resolution time and contention management policy in the context of hardware-supported TM systems, highlighting both the implementation tradeoffs and the performance implications of the various points in the design space. We show that both policy decisions have a significant impact on the ability to exploit available parallelism and thereby affect overall performance. Our analysis corroborates previous research findings that stalling (especially prior to retrying an access rather than the entire transaction) helps sidestep conflicts and avoid wasted work. We also demonstrate that conflict resolution time has the dominant effect on performance: lazy (which delays resolution to commit time) uncovers more parallelism than eager (which resolves conflicts at access time). Furthermore, Lazy’s delayed conflict management decreases the likelihood of livelock while Eager needs sophisticated priority mechanisms. Finally, we evaluate a mixed conflict detection mode that detects write-write conflicts eagerly while detecting read-write conflicts lazily, and show that it provides a good compromise between flexibility and implementation complexity.
An efficient transactional memory algorithm for computing minimum spanning forest of sparse graphs
- In PPoPP ’09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles
"... Due to power wall, memory wall, and ILP wall, we are facing the end of ever increasing single-threaded performance. For this reason, multicore and manycore processors are arising as a new paradigm to pursue. However, to fully exploit all the cores in a chip, parallel programming is often required, a ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
Due to power wall, memory wall, and ILP wall, we are facing the end of ever increasing single-threaded performance. For this reason, multicore and manycore processors are arising as a new paradigm to pursue. However, to fully exploit all the cores in a chip, parallel programming is often required, and the complexity of parallel programming raises a significant concern. Data synchronization is a major source of this programming complexity, and Transactional Memory is proposed to reduce the difficulty caused by data synchronization requirements, while providing high scalability and low performance overhead. The previous literature on Transactional Memory mostly focuses on architectural designs. Its impact on algorithms and applications has not yet been studied thoroughly. In this paper, we investigate Transactional Memory from the algorithm designer’s perspective. This paper presents an algorithmic model to assist in the design of efficient Transactional Memory algorithms and a novel Transactional Memory algorithm for computing a minimum spanning forest of sparse graphs. We emphasize multiple Transactional Memory related design issues in presenting our algorithm. We also provide experimental results on an existing software Transactional Memory system. Our algorithm demonstrates excellent scalability in the experiments, but at the same time, the experimental results reveal the clear limitation of software Transactional Memory due to its high performance overhead. Based on our experience, we highlight the necessity of efficient hardware support for Transactional Memory to realize the potential of the technology.
Transactional Mutex Locks ∗
"... Mutual exclusion locks limit concurrency but offer low latency. Software transactional memory (STM) typically has higher latency, but scales well. In this paper we propose transactional mutex locks (TML), which attempt to achieve the best of both worlds for read-dominated workloads. TML has much low ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Mutual exclusion locks limit concurrency but offer low latency. Software transactional memory (STM) typically has higher latency, but scales well. In this paper we propose transactional mutex locks (TML), which attempt to achieve the best of both worlds for read-dominated workloads. TML has much lower latency than STM, enabling it to perform competitively with mutexes. It also scales as well as STM when critical sections rarely perform writes. In this paper, we describe the design of TML and evaluate its performance using microbenchmarks on the x86, SPARC, and POWER architectures. Our experiments show that while TML is not general enough to completely subsume both STM and locks, it offers compelling performance for the targeted workloads, without performing substantially worse than locks when writes are frequent. 1.

