An Architecture for Internet Data Transfer
In Proc. 3rd Symposium on Networked Systems Design and Implementation (NSDI), 2006
Cited by 42 (7 self)
This paper presents the design and implementation of DOT, a flexible architecture for data transfer. This architecture separates content negotiation from the data transfer itself. Applications determine what data they need to send and then use a new transfer service to send it. This transfer service acts as a common interface between applications and the lower-level network layers, facilitating innovation both above and below. The transfer service frees developers from re-inventing transfer mechanisms in each new application. New transfer mechanisms, in turn, can be easily deployed without modifying existing applications. We discuss the benefits that arise from separating data transfer into a service and the challenges this service must overcome. The paper then examines the implementation of DOT and its plugin framework for creating new data transfer mechanisms. A set of microbenchmarks shows that the DOT prototype performs well, and that the overhead it imposes is unnoticeable in the wide-area. End-to-end experiments using more complex configurations demonstrate DOT’s ability to implement effective, new data delivery mechanisms underneath existing services. Finally, we evaluate a production mail server modified to use DOT using trace data gathered from a live email server. Converting the mail server required only 184 lines-of-code changes to the server, and the resulting system reduces the bandwidth needed to send email by up to 20%.
Decentralized Deduplication in SAN Cluster File Systems
Cited by 9 (0 self)
File systems hosting virtual machines typically contain many duplicated blocks of data resulting in wasted storage space and increased storage array cache footprint. Deduplication addresses these problems by storing a single instance of each unique data block and sharing it between all original sources of that data. While deduplication is well understood for file systems with a centralized component, we investigate it in a decentralized cluster file system, specifically in the context of VM storage. We propose DEDE, a block-level deduplication system for live cluster file systems that does not require any central coordination, tolerates host failures, and takes advantage of the block layout policies of an existing cluster file system. In DEDE, hosts keep summaries of their own writes to the cluster file system in shared on-disk logs. Each host periodically and independently processes the summaries of its locked files, merges them with a shared index of blocks, and reclaims any duplicate blocks. DEDE manipulates metadata using general file system interfaces without knowledge of the file system implementation. We present the design, implementation, and evaluation of our techniques in the context of VMware ESX Server. Our results show an 80% reduction in space with minor performance overhead for realistic workloads.
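The merge step described in this abstract (write summaries merged into a shared index, duplicates reclaimed) can be illustrated with a minimal in-memory sketch. This is not DEDE's implementation: the names `summarize_write` and `merge_log`, the 4 KB block size, the use of SHA-1, and the dictionary-based index and block map are all assumptions made for illustration. A real system would also need on-disk logs, locking, and content verification before sharing a block.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, not taken from the abstract


def summarize_write(data: bytes) -> str:
    """Return a content hash summarizing one written block (hypothetical helper)."""
    return hashlib.sha1(data).hexdigest()


def merge_log(index, block_map, log):
    """Merge one host's write log into a shared index of blocks.

    index:     dict mapping content hash -> canonical physical block id
    block_map: dict mapping (file name, block offset) -> physical block id
    log:       iterable of (file name, block offset, physical block id, hash)

    Returns the physical block ids that became redundant and can be reclaimed.
    """
    reclaimed = []
    for file_name, offset, phys, digest in log:
        # The first block seen with a given hash becomes the canonical copy.
        canonical = index.setdefault(digest, phys)
        if canonical != phys:
            # Point the file at the shared copy and free the duplicate.
            block_map[(file_name, offset)] = canonical
            reclaimed.append(phys)
    return reclaimed
```

In this sketch, two files whose first blocks have identical content end up pointing at the same physical block, and the second copy is returned as reclaimable.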
Experiences with content addressable storage and virtual disks
In Proceedings of the Workshop on I/O Virtualization (WIOV ’08), 2008
Cited by 6 (0 self)
Efficiently managing storage is important for virtualized computing environments. Its importance is magnified by developments such as cloud computing which consolidate many thousands of virtual machines (and their associated storage). The nature of this storage is such that there is a large amount of duplication between otherwise discrete virtual machines. Building upon previous work in content addressable storage, we have built a prototype for consolidating virtual disk images using a service-oriented file system. It provides a hierarchical organization, manages historical snapshots of drive images, and takes steps to optimize encoding based on partition type and file system. In this paper we present our experiences with building this prototype and using it to store a variety of drive images for QEMU and the Linux Kernel Virtual Machine (KVM).
Supporting Practical Content-Addressable Caching with CZIP Compression
Cited by 6 (1 self)
Content-based naming (CBN) enables content sharing across similar files by breaking files into position-independent chunks and naming these chunks using hashes of their contents. While a number of research systems have recently used custom CBN approaches internally to good effect, there has not yet been any mechanism to use CBN in a general-purpose way. In this paper, we demonstrate a practical approach to applying CBN without requiring disruptive changes to end systems. We develop CZIP, a CBN compression scheme which reduces data sizes by eliminating redundant chunks, compresses chunks using existing schemes, and facilitates sharing within files, across files, and across machines by explicitly exposing CBN chunk hashes. CZIP-aware caching systems can exploit the CBN information to reduce storage space, reduce bandwidth consumption, and increase performance, while content providers and middleboxes can selectively encode their most suitable content. We show that CZIP compares well to standalone compression schemes, that a CBN cache for CZIP is easily implemented, and that a CZIP-aware CDN produces significant benefits.
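The general idea of content-based naming with chunk-level redundancy elimination can be sketched as follows. This is not the CZIP format: the `encode` and `decode` names, the SHA-1 chunk names, the zlib compression, and the dictionary used as a chunk store are all assumptions. The sketch also uses fixed-size chunks for brevity, which are not position-independent in the sense the abstract describes; a content-defined chunker would be needed for that property.

```python
import hashlib
import zlib

CHUNK_SIZE = 1024  # assumed chunk size for illustration


def encode(data: bytes, store: dict) -> list:
    """Encode data as a list of chunk names, adding unseen chunks to store.

    Each chunk is named by the hash of its contents, so identical chunks
    within a file, across files, or across machines share one stored copy.
    """
    names = []
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        name = hashlib.sha1(chunk).hexdigest()
        if name not in store:
            # Only unique chunks are stored, each compressed individually.
            store[name] = zlib.compress(chunk)
        names.append(name)
    return names


def decode(names: list, store: dict) -> bytes:
    """Rebuild the original data from its chunk names and the chunk store."""
    return b"".join(zlib.decompress(store[name]) for name in names)
```

Because the chunk names are exposed in the encoded form, a cache that already holds a named chunk does not need to fetch or store it again.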
The Effectiveness of Deduplication on Virtual Machine Disk Images
Cited by 3 (1 self)
Virtualization is becoming widely deployed in servers to efficiently provide many logically separate execution environments while reducing the need for physical servers. While this approach saves physical CPU resources, it still consumes large amounts of storage because each virtual machine (VM) instance requires its own multi-gigabyte disk image. Moreover, existing systems do not support ad hoc block sharing between disk images, instead relying on techniques such as overlays to build multiple VMs from a single “base” image. Instead, we propose the use of deduplication to both reduce the total storage required for VM disk images and increase the ability of VMs to share disk blocks. To test the effectiveness of deduplication, we conducted extensive evaluations on different sets of virtual machine disk images with different chunking strategies. Our experiments found that the amount of stored data grows very slowly after the first few virtual disk images if only the locale or software configuration is changed, with the rate of compression suffering when different versions of an operating system or different operating systems are included. We also show that fixed-length chunks work well, achieving nearly the same compression rate as variable-length chunks. Finally, we show that simply identifying zero-filled blocks, even in ready-to-use virtual machine disk images available online, can provide significant savings in storage.
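A rough way to measure the two effects mentioned above (fixed-length chunk deduplication and zero-filled block detection) is sketched below. It is not the authors' evaluation tool: the function name `dedup_stats`, the 4 KB default chunk size, and the SHA-1 fingerprints are assumptions, and disk images are represented as in-memory byte strings for simplicity.

```python
import hashlib


def dedup_stats(images, chunk_size=4096):
    """Count chunks across disk images using fixed-length chunking.

    Returns (total chunks, zero-filled chunks, unique non-zero chunks).
    Only the unique non-zero chunks would need to be stored.
    """
    zero = bytes(chunk_size)
    seen = set()
    total = 0
    zeros = 0
    for image in images:
        for start in range(0, len(image), chunk_size):
            chunk = image[start:start + chunk_size]
            total += 1
            # A zero-filled chunk needs no storage at all, only a marker.
            if chunk == zero[:len(chunk)]:
                zeros += 1
                continue
            seen.add(hashlib.sha1(chunk).digest())
    return total, zeros, len(seen)
```

For two images that differ in only one chunk, the unique count grows by one for the second image, which mirrors the slow growth in stored data that the abstract reports for closely related images.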
The Case for Content Search of VM Clouds
Cited by 2 (2 self)
The success of cloud computing can lead to large, centralized collections of virtual machine (VM) images. The ability to interactively search these VM images at a high semantic level emerges as an important capability. This paper examines the opportunities and challenges in creating such a search capability, and presents early evidence of its feasibility. Keywords: data-intensive computing; discard-based search; forensic search; provenance; Diamond; cloud computing; virtual machines; VCL; RC2; EC2; Internet
Evaluating the usefulness of content addressable storage for high-performance data intensive applications
In Proceedings of the 17th High Performance Distributed Computing (HPDC ’08), 2008
Cited by 2 (1 self)
Content Addressable Storage (CAS) is a data representation technique that operates by partitioning a given data-set into non-intersecting units called chunks and then employing techniques to efficiently recognize chunks occurring multiple times. This allows CAS to eliminate duplicate instances of such chunks, resulting in reduced storage space compared to conventional representations of data. CAS is an attractive technique for reducing the storage and network bandwidth needs of performance-sensitive, data-intensive applications in a variety of domains. These include enterprise applications, Web-based e-commerce or entertainment services and highly parallel scientific/engineering applications and simulations, to name a few. In this paper, we conduct an empirical evaluation of the benefits offered by CAS to a variety of real-world data-intensive applications. The savings offered by CAS depend crucially on (i) the nature of the data-set itself and (ii) the chunk-size that CAS employs. We investigate the impact of both these factors on disk space savings, savings in network bandwidth, and error resilience of data. We find that a chunk-size of 1 KB can provide up to 84% savings in disk space and even higher savings in network bandwidth whilst trading off error resilience and incurring 14% CAS-related overheads. Drawing upon lessons learned from our study, we provide insights on (i) the choice of the chunk-size for effective space savings and (ii) the use of selective data replication to counter the loss of error resilience caused by CAS.
Content-Addressable Data Management
2007
A direct implication of both the industry and academia proclaiming the Age of Tera-(even the Peta)-scale computing, is that applications have become more data intensive than ever. The increased data volume from applications tackling larger and larger problems has fueled the need for efficient management of this data. In this thesis, we evaluate a technique called Content Addressable Storage or CAS, for managing large volumes of data. This evaluation focuses on the benefits and demerits of using CAS for, i) improved application performance via lockless and lightweight synchronization of accesses to shared storage data; ii) improved cache performance; iii) increase in storage capacity; and, iv) increased network bandwidth. We present the design of a CAS-based file store that significantly improves the storage performance providing lightweight and lock-less user-defined consistency semantics. As a result, our file-system shows a 28% increase in read-bandwidth and a 13% increase in write bandwidth, over a popular file-system in common use. We use the same experimental file-system to analyze CAS on data from real world application benchmarks. We also estimate the potential benefits of using CAS for a virtual
Live Migration of User Environments Across Wide Area Networks
A complex challenge in mobile computing is to allow the user to migrate her highly customised environment while moving to a different location and to continue work without interruption. I motivate why this is a highly desirable capability and conduct a survey of the current approaches towards this goal and explain their limitations. I then propose a new architecture to support user mobility by live migration of a user’s operating system instance over the network. Previous work includes the Collective and Internet Suspend/Resume projects that have addressed migration of a user’s environment by suspending the running state and resuming it at a later time. In contrast to previous work, this work addresses live migration of a user’s operating system instance across wide area links. Live migration is done by performing most of the migration while the operating system is still running, achieving very little downtime and preserving all network connectivity. I developed an initial proof of concept of this solution. It relies on migrating whole operating systems using the Xen virtual machine and provides a way to perform live migration of persistent storage as well as the network connections across subnets. These
The Manna Plug-In Architecture for Content-based Search of VM Clouds
2010
findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily represent the views of the NSF, IBM, Carnegie Mellon University or the University of Toronto. Keywords: data-intensive computing, virtual machines, cloud computing, interactive search, forensic analysis, computer vision, pattern recognition,

