Results 1 - 10 of 360
Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS, 2011
"... Geo-replicated, distributed data stores that support complex online applications, such as social networks, must provide an “always-on” experience where operations always complete with low latency. Today’s systems often sacrifice strong consistency to achieve these goals, exposing inconsistencies to ..."
Cited by 98 (6 self)
Geo-replicated, distributed data stores that support complex online applications, such as social networks, must provide an “always-on” experience where operations always complete with low latency. Today’s systems often sacrifice strong consistency to achieve these goals, exposing inconsistencies to their clients and necessitating complex application logic. In this paper, we identify and define a consistency model—causal consistency with convergent conflict handling, or causal+—that is the strongest achieved under these constraints. We present the design and implementation of COPS, a key-value store that delivers this consistency model across the wide area. A key contribution of COPS is its scalability: it can enforce causal dependencies between keys stored across an entire cluster, rather than within a single server as in previous systems. The central approach in COPS is tracking and explicitly checking whether causal dependencies between keys are satisfied in the local cluster before exposing writes. Further, in COPS-GT, we introduce get transactions to obtain a consistent view of multiple keys without locking or blocking. Our evaluation shows that COPS completes operations in less than a millisecond, provides throughput similar to previous systems when using one server per cluster, and scales well as we increase the number of servers in each cluster. It also shows that COPS-GT provides similar latency, throughput, and scaling to COPS for common workloads.
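To make the dependency-checking idea concrete, here is a minimal Python sketch (not COPS's actual implementation; the (key, version) dependency format, class, and method names are invented for illustration) of a local cluster that exposes a replicated write only once every write it causally depends on is already visible locally:

    class LocalCluster:
        def __init__(self):
            self.visible = {}   # key -> newest version made visible in this cluster
            self.pending = []   # replicated writes still waiting on their dependencies

        def deps_satisfied(self, deps):
            # A dependency (key, version) is satisfied once the local cluster has made
            # that version (or a newer one) of the key visible.
            return all(self.visible.get(k, -1) >= v for k, v in deps)

        def apply_replicated_write(self, key, version, value, deps):
            # The sketch tracks visibility only; values themselves are not stored.
            if self.deps_satisfied(deps):
                self.visible[key] = max(self.visible.get(key, -1), version)
                self.retry_pending()
            else:
                self.pending.append((key, version, value, deps))

        def retry_pending(self):
            # A newly visible write may unblock queued writes, so re-check until no progress.
            progressed = True
            while progressed:
                progressed = False
                for w in list(self.pending):
                    key, version, value, deps = w
                    if self.deps_satisfied(deps):
                        self.pending.remove(w)
                        self.visible[key] = max(self.visible.get(key, -1), version)
                        progressed = True

    cluster = LocalCluster()
    cluster.apply_replicated_write("photo", 7, "bytes...", deps=[])
    cluster.apply_replicated_write("comment", 3, "nice!", deps=[("photo", 7)])  # exposed only after photo v7
    print(cluster.visible)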
Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency. In SOSP ’11, 2011
"... Windows Azure Storage (WAS) is a cloud storage system that provides customers the ability to store seemingly limitless amounts of data for any duration of time. WAS customers have access to their data from anywhere at any time and only pay for what they use and store. In WAS, data is stored durably ..."
Cited by 48 (1 self)
Windows Azure Storage (WAS) is a cloud storage system that provides customers the ability to store seemingly limitless amounts of data for any duration of time. WAS customers have access to their data from anywhere at any time and only pay for what they use and store. In WAS, data is stored durably using both local and geographic replication to facilitate disaster recovery. Currently, WAS storage comes in the form of Blobs (files), Tables (structured storage), and Queues (message delivery). In this paper, we describe the WAS architecture, global namespace, and data model, as well as its resource provisioning, load balancing, and replication systems.
Stronger Semantics for Low-Latency Geo-Replicated Storage, 2013
"... We present the first scalable, geo-replicated storage system that guarantees low latency, offers a rich data model, and provides “stronger” semantics. Namely, all clients requests are satisfied in the local datacenter in which they arise; the system efficiently supports useful data model abstraction ..."
Cited by 39 (5 self)
We present the first scalable, geo-replicated storage system that guarantees low latency, offers a rich data model, and provides “stronger” semantics. Namely, all client requests are satisfied in the local datacenter in which they arise; the system efficiently supports useful data model abstractions such as column families and counter columns; and clients can access data in a causally-consistent fashion with read-only and write-only transactional support, even for keys spread across many servers. The primary contributions of this work are enabling scalable causal consistency for the complex column-family data model, as well as novel, non-blocking algorithms for both read-only and write-only transactions. Our evaluation shows that our system, Eiger, achieves low (single-ms) latency and has throughput competitive with eventually-consistent and non-transactional Cassandra, upon which it is built. Despite Eiger’s stronger semantics, its throughput is within 15% of Cassandra’s for a large variety of workloads and within 7% for one of Facebook’s real-world workloads.
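As a small illustration of the column-family data model and counter columns mentioned above, the following Python sketch shows only the abstraction; the class and method names are invented and are not Eiger's or Cassandra's API:

    from collections import defaultdict

    class ColumnFamilyStore:
        def __init__(self):
            # family -> row key -> column name -> value
            self.data = defaultdict(lambda: defaultdict(dict))

        def put(self, family, row, column, value):
            self.data[family][row][column] = value

        def get_row(self, family, row):
            return dict(self.data[family][row])

        def add_counter(self, family, row, column, delta):
            # Counter columns support only commutative increments, which is what
            # lets concurrent updates converge.
            current = self.data[family][row].get(column, 0)
            self.data[family][row][column] = current + delta

    db = ColumnFamilyStore()
    db.put("profile", "user:1", "name", "alice")
    db.add_counter("stats", "user:1", "photo_count", 1)
    print(db.get_row("profile", "user:1"), db.get_row("stats", "user:1"))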
Hyder - A Transactional Record Manager for Shared Flash. In CIDR, 2011
"... Hyder supports reads and writes on indexed records within classi-cal multi-step transactions. It is designed to run on a cluster of servers that have shared access to a large pool of network-addressable raw flash chips. The flash chips store the indexed records as a multiversion log-structured datab ..."
Cited by 38 (2 self)
Hyder supports reads and writes on indexed records within classical multi-step transactions. It is designed to run on a cluster of servers that have shared access to a large pool of network-addressable raw flash chips. The flash chips store the indexed records as a multiversion log-structured database. Log-structuring leverages the high random I/O rate of flash and automatically wear-levels it. Hyder uses a data-sharing architecture that scales out without partitioning the database or application. Each transaction executes on a snapshot, logs its updates in one record, and broadcasts the log record to all servers. Each server rolls forward the log against its locally cached partial copy of the last committed state, using optimistic concurrency control to determine whether each transaction commits. This paper explains the architecture of the overall system and its three main components: the log, the index, and the roll-forward algorithm. Simulations and prototype measurements are presented that show Hyder can scale out to support high transaction rates.
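A rough Python sketch of the optimistic roll-forward idea described above: every server replays a shared, totally ordered log of transaction intention records and independently decides commit or abort by checking whether any key in the record's read set was overwritten after its snapshot. The record format and field names are invented; this is not Hyder's log or index structure:

    log = []  # shared, totally ordered log of intention records

    def append_intention(snapshot_pos, read_set, writes):
        # A transaction logs the log position it read from, the keys it read, and its updates.
        log.append({"snapshot": snapshot_pos, "reads": read_set, "writes": writes})
        return len(log) - 1

    def roll_forward():
        # Replay the log against a fresh state; commit a record only if no key it read
        # was overwritten by a record that committed after its snapshot.
        state = {}             # key -> value of the last committed write
        last_commit_pos = {}   # key -> log position of the record that last wrote it
        outcomes = []
        for pos, rec in enumerate(log):
            conflict = any(last_commit_pos.get(k, -1) > rec["snapshot"] for k in rec["reads"])
            if conflict:
                outcomes.append((pos, "abort"))
                continue
            for k, v in rec["writes"].items():
                state[k] = v
                last_commit_pos[k] = pos
            outcomes.append((pos, "commit"))
        return state, outcomes

    append_intention(snapshot_pos=-1, read_set={"x"}, writes={"x": 1})
    append_intention(snapshot_pos=-1, read_set={"x"}, writes={"x": 2})  # conflicts with the first
    print(roll_forward())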
Cache Craftiness for Fast Multicore Key-Value Storage
"... We present Masstree, a fast key-value database designed for SMP machines. Masstree keeps all data in memory. Its main data structure is a trie-like concatenation of B +-trees, each of which handles a fixed-length slice of a variable-length key. This structure effectively handles arbitrary-length pos ..."
Cited by 37 (4 self)
We present Masstree, a fast key-value database designed for SMP machines. Masstree keeps all data in memory. Its main data structure is a trie-like concatenation of B+-trees, each of which handles a fixed-length slice of a variable-length key. This structure effectively handles arbitrary-length, possibly binary keys, including keys with long shared prefixes. B+-tree fanout was chosen to minimize total DRAM delay when descending the tree and prefetching each tree node. Lookups use optimistic concurrency control, a read-copy-update-like technique, and do not write shared data structures; updates lock only affected nodes. Logging and checkpointing provide consistency and durability. Though some of these ideas appear elsewhere, Masstree is the first to combine them. We discuss design variants and their consequences. On a 16-core machine, with logging enabled and queries arriving over a network, Masstree executes more than six million simple queries per second. This performance is comparable to that of memcached, a non-persistent hash table server, and higher (often much higher) than that of VoltDB, …
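The following Python sketch illustrates the layered key-slicing idea: a variable-length key is cut into fixed 8-byte slices and each layer indexes one slice (a plain dict stands in for the per-layer B+-tree). Names and layout are illustrative, not Masstree's, and the case of a shorter key ending where a longer key continues is not handled:

    SLICE = 8

    def slices(key: bytes):
        # Pad the final slice so every layer compares fixed-length values.
        for i in range(0, max(len(key), 1), SLICE):
            yield key[i:i + SLICE].ljust(SLICE, b"\x00")

    class Layer:
        def __init__(self):
            self.entries = {}   # 8-byte slice -> either a value or a deeper Layer

    def put(root, key: bytes, value):
        parts = list(slices(key))
        node = root
        for part in parts[:-1]:
            nxt = node.entries.get(part)
            if not isinstance(nxt, Layer):
                nxt = Layer()
                node.entries[part] = nxt
            node = nxt
        node.entries[parts[-1]] = value

    def get(root, key: bytes):
        node = root
        parts = list(slices(key))
        for part in parts[:-1]:
            node = node.entries.get(part)
            if not isinstance(node, Layer):
                return None
        val = node.entries.get(parts[-1])
        return None if isinstance(val, Layer) else val

    root = Layer()
    put(root, b"user:0000000001:name", b"alice")
    print(get(root, b"user:0000000001:name"))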
MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing, 2012
"... This paper presents a set of architecturally and workload-inspired algorithmic and engineering improvements to the popular Memcached system that substantially improve both its memory e ciency and throughput. These techniques— optimistic cuckoo hashing, a compact LRU-approximating eviction algorithm ..."
Cited by 32 (8 self)
This paper presents a set of architecturally and workload-inspired algorithmic and engineering improvements to the popular Memcached system that substantially improve both its memory efficiency and throughput. These techniques—optimistic cuckoo hashing, a compact LRU-approximating eviction algorithm based upon CLOCK, and comprehensive implementation of optimistic locking—enable the resulting system to use 30% less memory for small key-value pairs, and serve up to 3x as many queries per second over the network. We have implemented these modifications in a system we call MemC3—Memcached with CLOCK and Concurrent Cuckoo hashing—but believe that they also apply more generally to many of today’s read-intensive, highly concurrent networked storage and caching systems. Keywords: hashing, performance, memory efficiency. Low-latency access to data has become critical for many Internet services in recent years. This requirement has led many system designers to serve all or most of certain data sets from main memory—using the memory either as their primary store [21, 28, 23, 27] or as a cache to deflect hot or particularly latency-sensitive items [12].
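A hedged Python sketch of the two-choice cuckoo hashing referenced above: each key has two candidate buckets, lookups probe at most both, and an insert displaces an existing entry to its alternate bucket when both candidates are full. Bucket geometry, hash choices, and the absence of MemC3's optimistic versioning are simplifications for illustration:

    import random

    NBUCKETS = 64
    SLOTS = 4                       # small set-associative buckets, as in cuckoo hash tables
    table = [[] for _ in range(NBUCKETS)]

    def buckets_for(key):
        h1 = hash(key) % NBUCKETS
        h2 = hash(("alt", key)) % NBUCKETS
        return h1, h2

    def lookup(key):
        # Probe at most the two candidate buckets.
        for b in buckets_for(key):
            for k, v in table[b]:
                if k == key:
                    return v
        return None

    def insert(key, value, max_kicks=128):
        b1, b2 = buckets_for(key)
        for b in (b1, b2):
            if len(table[b]) < SLOTS:
                table[b].append((key, value))
                return True
        # Both buckets full: evict a random victim and move it to its other bucket.
        b, item = b1, (key, value)
        for _ in range(max_kicks):
            victim_idx = random.randrange(len(table[b]))
            item, table[b][victim_idx] = table[b][victim_idx], item
            vb1, vb2 = buckets_for(item[0])
            b = vb2 if b == vb1 else vb1
            if len(table[b]) < SLOTS:
                table[b].append(item)
                return True
        return False                # table too full; a real system would resize or evict

    insert("foo", 1)
    insert("bar", 2)
    print(lookup("foo"), lookup("bar"), lookup("missing"))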
Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory, 2010
"... The predicted shift to non-volatile, byte-addressable memory (e.g., Phase Change Memory and Memristor), the growth of “big data”, and the subsequent emergence of frameworks such as memcached and NoSQL systems require us to rethink the design of data stores. To derive the maximum performance from the ..."
Cited by 31 (4 self)
The predicted shift to non-volatile, byte-addressable memory (e.g., Phase Change Memory and Memristor), the growth of “big data”, and the subsequent emergence of frameworks such as memcached and NoSQL systems require us to rethink the design of data stores. To derive the maximum performance from these new memory technologies, this paper proposes the use of single-level data stores. For these systems, where no distinction is made between a volatile and a persistent copy of data, we present Consistent and Durable Data Structures (CDDSs) that, on current hardware, allow programmers to safely exploit the low-latency and non-volatile aspects of new memory technologies. CDDSs use versioning to allow atomic updates without requiring logging. The same versioning scheme also enables rollback for failure recovery. When compared to a memory-backed Berkeley DB B-Tree, our prototype-based results show that a CDDS B-Tree can increase put and get throughput by 74% and 138%. When compared to Cassandra, a two-level data store, Tembo, a CDDS B-Tree enabled distributed key-value system, increases throughput by up to 250%–286%.
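A toy Python sketch of the versioning idea: every entry carries the version range in which it is live, and a single consistent-version counter is advanced only after the new entries are written, so an interrupted update is never made visible. Flush and fence ordering on real non-volatile memory is omitted; the names are invented and are not the CDDS API:

    class VersionedMap:
        def __init__(self):
            self.entries = []            # (key, value, start_version, end_version or None)
            self.consistent_version = 0  # the single atomic "commit point"

        def get(self, key, version=None):
            v = self.consistent_version if version is None else version
            for k, val, start, end in self.entries:
                if k == key and start <= v and (end is None or v < end):
                    return val
            return None

        def put(self, key, value):
            new_v = self.consistent_version + 1
            # 1. End-date any older live entry and write the new entry at version new_v.
            for i, (k, val, start, end) in enumerate(self.entries):
                if k == key and end is None:
                    self.entries[i] = (k, val, start, new_v)
            self.entries.append((key, value, new_v, None))
            # 2. Only now publish the new consistent version; on NVM this is where
            #    cache-line flushes and a fence would go. A crash before this point
            #    simply rolls back to the previous consistent version.
            self.consistent_version = new_v

    m = VersionedMap()
    m.put("x", 10)
    m.put("x", 20)
    print(m.get("x"), m.get("x", version=1))   # 20 at the current version, 10 at version 1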
When Scalability Meets Consistency: Genuine Multiversion Update-Serializable Partial Data Replication. In Proc. of International Conference on Distributed Systems, 2012
"... Abstract—In this article we introduce GMU, a genuine partial replication protocol for transactional systems, which exploits an innovative, highly scalable, distributed multiversioning scheme. Unlike existing multiversion-based solutions, GMU does not rely on a global logical clock, which represents ..."
Cited by 29 (16 self)
In this article we introduce GMU, a genuine partial replication protocol for transactional systems, which exploits an innovative, highly scalable, distributed multiversioning scheme. Unlike existing multiversion-based solutions, GMU does not rely on a global logical clock, which represents a contention point and can limit system scalability. Also, GMU never aborts read-only transactions and spares them from distributed validation schemes. This makes GMU particularly efficient in the presence of read-intensive workloads, as is typical of a wide range of real-world applications. GMU guarantees the Extended Update Serializability (EUS) isolation level. This consistency criterion is particularly attractive as it is sufficiently strong to ensure correctness even for very demanding applications (such as TPC-C), but is also weak enough to allow efficient and scalable implementations, such as GMU. Further, unlike several relaxed consistency models proposed in the literature, EUS has simple and intuitive semantics, thus being an attractive, scalable consistency model for ordinary programmers. We integrated the GMU protocol in a popular open source in-memory transactional data grid, namely Infinispan. On the basis of a large-scale experimental study performed on heterogeneous experimental platforms and using industry standard benchmarks (namely TPC-C and YCSB), we show that GMU achieves linear scalability and that it introduces negligible overheads (less than 10%) with respect to solutions ensuring non-serializable semantics, in a wide range of workloads.
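A deliberately simplified Python sketch of reading from a multiversioned, partially replicated store using per-node versions rather than a single global clock: a read-only transaction pins each node's committed version the first time it contacts that node and reads below that pin thereafter. This only conveys the flavor of the idea and is not GMU's actual vector-clock protocol:

    class Node:
        def __init__(self, name):
            self.name = name
            self.commit_version = 0
            self.versions = {}          # key -> list of (local_version, value)

        def write(self, key, value):
            self.commit_version += 1
            self.versions.setdefault(key, []).append((self.commit_version, value))

        def read(self, key, max_version):
            # Return the newest version of key that is <= the pinned snapshot version.
            candidates = [(v, val) for v, val in self.versions.get(key, []) if v <= max_version]
            return max(candidates)[1] if candidates else None

    class ReadOnlyTxn:
        def __init__(self):
            self.snapshot = {}          # node name -> pinned commit version

        def read(self, node, key):
            if node.name not in self.snapshot:
                self.snapshot[node.name] = node.commit_version
            return node.read(key, self.snapshot[node.name])

    n1, n2 = Node("n1"), Node("n2")
    n1.write("a", 1)
    n2.write("b", 1)
    txn = ReadOnlyTxn()
    print(txn.read(n1, "a"))    # pins n1 at its current commit version
    n1.write("a", 2)            # a later write on n1 stays invisible to this transaction
    print(txn.read(n1, "a"), txn.read(n2, "b"))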
HyperDex: A Distributed, Searchable Key-Value Store
"... Distributed key-value stores are now a standard component of high-performance web services and cloud computing applications. While key-value stores offer significant performance and scalability advantages compared to traditional databases, they achieve these properties through a restricted API that ..."
Cited by 27 (0 self)
Distributed key-value stores are now a standard component of high-performance web services and cloud computing applications. While key-value stores offer significant performance and scalability advantages compared to traditional databases, they achieve these properties through a restricted API that limits object retrieval—an object can only be retrieved by the (primary and only) key under which it was inserted. This paper presents HyperDex, a novel distributed key-value store that provides a unique search primitive that enables queries on secondary attributes. The key insight behind HyperDex is the concept of hyperspace hashing in which objects with multiple attributes are mapped into a multidimensional hyperspace. This mapping leads to efficient implementations not only for retrieval by primary key, but also for partially-specified secondary attribute searches and range queries. A novel chaining protocol enables the system to achieve strong consistency, maintain availability and guarantee fault tolerance. An evaluation of the full system shows that HyperDex is 12-13× faster than Cassandra and MongoDB for finding partially specified objects. Additionally, HyperDex achieves 2-4× higher throughput for get/put operations.
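A small Python sketch of hyperspace hashing as described above: each attribute hashes to one coordinate, a complete object therefore maps to a single grid cell, and a partially specified search constrains only some dimensions, so only the cells (and hence servers) intersecting that constraint need to be visited. The grid partitioning and names are invented for the sketch:

    ATTRS = ["first", "last", "phone"]    # three dimensions in the hyperspace
    SIDE = 4                              # split each dimension into 4 stripes -> 4^3 cells

    def coord(value):
        return hash(value) % SIDE

    def point(obj):
        # A fully specified object maps to exactly one grid cell.
        return tuple(coord(obj[a]) for a in ATTRS)

    def regions_for_search(query):
        # Enumerate the grid cells a partially specified query must visit:
        # fixed attributes pin a dimension, missing attributes leave it unconstrained.
        dims = []
        for a in ATTRS:
            dims.append([coord(query[a])] if a in query else list(range(SIDE)))
        cells = [()]
        for d in dims:
            cells = [c + (x,) for c in cells for x in d]
        return cells

    obj = {"first": "ada", "last": "lovelace", "phone": "555-0100"}
    print("object stored in cell", point(obj))
    print("search {'last': ...} visits", len(regions_for_search({"last": "lovelace"})),
          "of", SIDE ** len(ATTRS), "cells")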
Scale and Concurrency of GIGA+: File System Directories with Millions of Files
"... Abstract – We examine the problem of scalable file system directories, motivated by data-intensive applications requiring millions to billions of small files to be ingested in a single directory at rates of hundreds of thousands of file creates every second. We introduce a POSIX-compliant scalable d ..."
Cited by 27 (10 self)
We examine the problem of scalable file system directories, motivated by data-intensive applications requiring millions to billions of small files to be ingested in a single directory at rates of hundreds of thousands of file creates every second. We introduce a POSIX-compliant scalable directory design, GIGA+, that distributes directory entries over a cluster of server nodes. For scalability, each server makes only local, independent decisions about migration for load balancing. GIGA+ uses two internal implementation tenets, asynchrony and eventual consistency, to: (1) partition an index among all servers without synchronization or serialization, and (2) gracefully tolerate stale index state at the clients. Applications, however, are provided traditional strong synchronous consistency semantics. We have built and demonstrated that the GIGA+ approach scales better than existing distributed directory implementations, delivers a sustained throughput of more than 98,000 file creates per second on a 32-server cluster, and balances load more efficiently than consistent hashing.
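A Python sketch of the incremental, hash-based partitioning that such a directory design relies on: the directory starts as one partition, and a partition is split in half (extendible-hashing style) whenever it grows past a threshold. The threshold, hash, and split rule here are illustrative only, not GIGA+'s exact scheme, and the client-side handling of stale partition maps is omitted:

    import hashlib

    SPLIT_THRESHOLD = 4

    def h(name):
        return int(hashlib.md5(name.encode()).hexdigest(), 16)

    class Directory:
        def __init__(self):
            # partition id -> (depth, set of filenames); partition p at depth d holds
            # names whose low d hash bits equal p.
            self.partitions = {0: (0, set())}

        def partition_of(self, name):
            # Deepest existing partition matching the low-order bits of the hash.
            hv = h(name)
            for depth in range(32, -1, -1):
                p = hv & ((1 << depth) - 1)
                if p in self.partitions and self.partitions[p][0] == depth:
                    return p
            raise KeyError(name)

        def create(self, name):
            p = self.partition_of(name)
            depth, names = self.partitions[p]
            names.add(name)
            if len(names) > SPLIT_THRESHOLD:
                self.split(p)

        def split(self, p):
            # Split partition p in half by looking at the next hash bit.
            depth, names = self.partitions[p]
            sibling = p | (1 << depth)
            keep = {n for n in names if (h(n) >> depth) & 1 == 0}
            move = names - keep
            self.partitions[p] = (depth + 1, keep)
            self.partitions[sibling] = (depth + 1, move)

    d = Directory()
    for i in range(20):
        d.create(f"file-{i}")
    print({p: (depth, len(names)) for p, (depth, names) in d.partitions.items()})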