Results 1 - 10 of 16
A data placement strategy in scientific cloud workflows
- Future Generation Computer Systems, 2010
"... ..."
(Show Context)
SmartStore: A New Metadata Organization Paradigm with Semantic-Awareness
- FAST Work-in-Progress Report and Poster Session, 2009
"... Existing storage systems using hierarchical directory tree do not meet scalability and functionality requirements for exponentially growing datasets and increasingly complex queries in Exabyte-level systems with billions of files. This paper proposes semantic-aware organization, called SmartStore, w ..."
Cited by 19 (5 self)
Existing storage systems using hierarchical directory trees do not meet the scalability and functionality requirements of exponentially growing datasets and increasingly complex queries in Exabyte-level systems with billions of files. This paper proposes a semantic-aware organization, called SmartStore, which exploits the metadata semantics of files to judiciously aggregate correlated files into semantic-aware groups by using information retrieval tools. The decentralized design improves system scalability and reduces query latency for complex queries (range and top-k queries), which is conducive to constructing semantic-aware caching and to conventional filename-based query. SmartStore limits the search scope of a complex query to a single or a minimal number of semantically related groups and avoids or alleviates brute-force search over the entire system. Extensive experiments using real-world traces show that SmartStore improves system scalability and reduces query latency over basic database approaches by a factor of one thousand. To the best of our knowledge, this is the first study to implement complex queries in large-scale file systems.
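The grouping idea can be illustrated with a small sketch. The clustering method (plain k-means over two normalized metadata features) and the nearest-group probing in range_query are assumptions chosen for the example, not SmartStore's actual information-retrieval-based aggregation:

```python
# A minimal sketch of semantic grouping, not SmartStore's algorithm:
# cluster files by two normalized metadata features (an assumption) so that a
# range query probes only the closest group(s) instead of every file.
import random
from collections import defaultdict

def to_vector(meta):
    # Illustrative features: size and modification time, roughly scaled down.
    return (meta["size"] / 1e9, meta["mtime"] / 2e9)

def cluster(files, k=4, iters=25):
    centroids = [to_vector(f) for f in random.sample(files, k)]
    groups = defaultdict(list)
    for _ in range(iters):
        groups = defaultdict(list)
        for f in files:
            v = to_vector(f)
            i = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            groups[i].append(f)
        for i in range(k):
            if groups[i]:
                gv = [to_vector(f) for f in groups[i]]
                centroids[i] = tuple(sum(col) / len(col) for col in zip(*gv))
    return centroids, groups

def range_query(centroids, groups, lo, hi, probe=2):
    # Probe only the `probe` groups whose centroids lie nearest the centre of
    # the query box, instead of scanning the whole namespace.
    mid = tuple((a + b) / 2 for a, b in zip(to_vector(lo), to_vector(hi)))
    order = sorted(range(len(centroids)),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(mid, centroids[c])))
    return [f for c in order[:probe] for f in groups[c]
            if lo["size"] <= f["size"] <= hi["size"]
            and lo["mtime"] <= f["mtime"] <= hi["mtime"]]
```

Probing only the nearest groups is what keeps the query from degenerating into a brute-force scan, at the cost of possibly missing matches that landed in other groups.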
The Small World of File Sharing
"... Web caches, content distribution networks, peer-to-peer file sharing networks, distributed file systems, and data grids all have in common that they involve a community of users who use shared data. In each case, overall system performance can be improved significantly by first identifying and the ..."
Cited by 8 (2 self)
Web caches, content distribution networks, peer-to-peer file sharing networks, distributed file systems, and data grids all have in common that they involve a community of users who use shared data. In each case, overall system performance can be improved significantly by first identifying and then exploiting the structure of the community's data access patterns. We propose a novel perspective for analyzing data access workloads that considers the implicit relationships that form among users based on the data they access. We propose a new structure, the interest-sharing graph, that captures common user interests in data and justify its utility with studies on four data-sharing systems: a high-energy physics collaboration, the Web, the Kazaa peer-to-peer network, and a BitTorrent file-sharing community. We find small-world patterns in the interest-sharing graphs of all four communities. We investigate analytically and experimentally some of the potential causes that lead to this pattern and conclude that user preferences play a major role. The significance of small-world patterns is twofold: it provides rigorous support to intuition, and it suggests the potential to exploit these naturally emerging patterns. As a proof of concept, we design and evaluate an information dissemination system that exploits the small-world interest-sharing graphs by building an interest-aware network overlay. We show that this approach leads to improved information dissemination performance.
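As a rough illustration of the proposed structure, the sketch below builds an interest-sharing graph from (user, item) access records and estimates its clustering coefficient, one of the two quantities usually checked in a small-world test. The overlap threshold and the pure-Python representation are assumptions for the example, not details from the paper:

```python
# Build an interest-sharing graph: users become nodes, and two users are
# linked if their sets of accessed items overlap by at least `threshold`.
from collections import defaultdict
from itertools import combinations

def interest_sharing_graph(accesses, threshold=1):
    items = defaultdict(set)          # user -> set of items accessed
    for user, item in accesses:
        items[user].add(item)
    graph = defaultdict(set)
    for u, v in combinations(items, 2):
        if len(items[u] & items[v]) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph

def clustering_coefficient(graph):
    # Average over nodes of (edges among neighbours) / (possible such edges).
    # A small-world graph pairs a high value here with short path lengths.
    coeffs = []
    for node, nbrs in graph.items():
        if len(nbrs) < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        coeffs.append(2 * links / (len(nbrs) * (len(nbrs) - 1)))
    return sum(coeffs) / len(coeffs) if coeffs else 0.0
```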
Efficiently Identifying Working Sets in Block I/O Streams
"... Identifying groups of blocks that tend to be read or written togetherinagivenenvironmentisthefirststeptowardspowerful techniques for device failure isolation and power management. For example, identified groups can be placed together on a single disk, avoiding excess drive activity across an exascal ..."
Cited by 6 (2 self)
Identifying groups of blocks that tend to be read or written together in a given environment is the first step towards powerful techniques for device failure isolation and power management. For example, identified groups can be placed together on a single disk, avoiding excess drive activity across an exascale storage system. Unlike previous grouping work, we focus on identifying groupings in data that can be gathered from real, running systems with minimal impact. Using temporal, spatial, and access ordering information from an enterprise data set, we identified a set of groupings that consistently appear, indicating that these are working sets that are likely to be accessed together. We present several techniques to obtain groupings along with a discussion of what techniques best apply to particular types of real systems. We intend to use these preliminary results to inform our search for new types of workloads with a goal of identifying properties of easily separable workloads across different systems and dynamically moving groups in these workloads to reduce disk activity in large storage systems.
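One plausible reading of the temporal signal is sketched here: blocks that repeatedly co-occur within a short time window are merged into candidate working sets. The window length, the co-occurrence threshold, and the union-find merging are assumptions for the illustration rather than the paper's techniques:

```python
# Temporal grouping sketch: count how often pairs of blocks appear within the
# same time window, then merge frequently co-occurring blocks into groups.
from collections import defaultdict

def cooccurrence_counts(trace, window=1.0):
    # trace: list of (timestamp, block_id) pairs sorted by timestamp.
    counts = defaultdict(int)
    start = 0
    for i, (t, block) in enumerate(trace):
        while trace[start][0] < t - window:
            start += 1
        for _, other in trace[start:i]:
            if other != block:
                counts[frozenset((block, other))] += 1
    return counts

def working_sets(counts, min_count=3):
    # Union-find merge: blocks whose co-occurrence count clears the threshold
    # end up in the same candidate working set.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for pair, c in counts.items():
        if c >= min_count:
            a, b = tuple(pair)
            parent[find(a)] = find(b)
    groups = defaultdict(set)
    for x in parent:
        groups[find(x)].add(x)
    return [g for g in groups.values() if len(g) > 1]
```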
HANDS: A Heuristically Arranged Non-Backup In-line Deduplication System
- 2012
"... Deduplication on is rarely used on primary storage because of the disk bottleneck problem, whichresultsfromtheneed to keep an index mapping chunks of data to hash values in memory in order to detect duplicate blocks. This index grows with the number of unique data blocks, creating a scalability prob ..."
Cited by 5 (3 self)
Deduplication is rarely used on primary storage because of the disk bottleneck problem, which results from the need to keep an index mapping chunks of data to hash values in memory in order to detect duplicate blocks. This index grows with the number of unique data blocks, creating a scalability problem, and at current prices the cost of additional RAM approaches the cost of the indexed disks. Thus, previously, deduplication ratios had to be over 45% to see any cost benefit. The HANDS technique that we introduce in this paper reduces the amount of in-memory index storage required by up to 99% while still achieving between 30% and 90% of the deduplication of a full memory-resident index, making primary deduplication cost effective in workloads with a low deduplication rate. We achieve this by dynamically prefetching fingerprints from disk into a memory cache according to working sets derived from access patterns. We demonstrate the effectiveness of our approach using a simple neighborhood grouping that requires only timestamp and block number, making it suitable for a wide range of storage systems without the need to modify host file systems.
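A minimal sketch of the prefetching idea as stated in the abstract follows; the LRU cache, the fixed neighborhood radius, and the dict standing in for the on-disk index are assumptions for illustration, not HANDS internals:

```python
# Keep only a small in-memory fingerprint cache; when a block is written,
# pull the fingerprints of its spatial neighbourhood in from the full index.
from collections import OrderedDict

class FingerprintCache:
    def __init__(self, disk_index, capacity=1024, radius=8):
        self.disk_index = disk_index      # block_number -> fingerprint (full on-disk index)
        self.cache = OrderedDict()        # fingerprint -> block_number (in memory, LRU)
        self.capacity = capacity
        self.radius = radius

    def _insert(self, fingerprint, block):
        self.cache[fingerprint] = block
        self.cache.move_to_end(fingerprint)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used

    def lookup(self, block, fingerprint):
        """Return a block holding a duplicate, or None; prefetch neighbours."""
        hit = self.cache.get(fingerprint)
        if hit is not None:
            self.cache.move_to_end(fingerprint)
        # Prefetch fingerprints of nearby blocks: if this block was written,
        # its neighbourhood is likely to be written soon as well.
        for nb in range(block - self.radius, block + self.radius + 1):
            fp = self.disk_index.get(nb)
            if fp is not None:
                self._insert(fp, nb)
        return hit
```

In the paper's setting the neighborhood would come from working sets derived from timestamps and block numbers; the point of the sketch is only that fingerprints near recent activity are the ones kept in memory.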
Finding anomalous and suspicious files from directory metadata on a large corpus
- In: 3rd Intl. ICST Conference on Digital Forensics and Cyber Crime, 2011
"... Abstract: We describe a tool Dirim for automatically finding files on a drive that are anomalous or suspicious, and thus worthy of focus during digital-forensic investigation, based on solely their directory information. Anomalies are found both from comparing overall drive statistics and from compa ..."
Cited by 5 (3 self)
We describe a tool, Dirim, for automatically finding files on a drive that are anomalous or suspicious, and thus worthy of focus during digital-forensic investigation, based solely on their directory information. Anomalies are found both from comparing overall drive statistics and from comparing clusters of related files using a novel approach of "superclustering" of clusters. Suspicious-file detection looks for a set of specific clues. We discuss results of experiments we conducted on a representative corpus of 1467 drive images, where we did find interesting anomalies but not much deception (as expected given the corpus). Cluster comparison performed best at providing useful information for an investigator, but the other methods provided unique additional information, albeit with a significant number of false alarms.
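The drive-statistics comparison can be illustrated with a small sketch; the three features and the z-score threshold are assumptions for the example, and Dirim's clustering and superclustering steps are not reproduced:

```python
# Flag drives whose directory-metadata statistics sit far from the corpus
# average for any feature (illustrative features and threshold).
from statistics import mean, stdev

def drive_features(files):
    # files: list of (path, size, extension) tuples from one drive image.
    sizes = [s for _, s, _ in files]
    exe = sum(1 for _, _, ext in files if ext in {".exe", ".dll"})
    return {
        "mean_size": mean(sizes),
        "exe_fraction": exe / len(files),
        "mean_depth": mean(p.count("/") for p, _, _ in files),
    }

def anomalous_drives(drives, threshold=3.0):
    # drives: {drive_id: feature dict}; returns (drive_id, feature) pairs
    # whose z-score exceeds the threshold.
    flagged = []
    if len(drives) < 2:
        return flagged
    for key in next(iter(drives.values())):
        vals = [f[key] for f in drives.values()]
        mu, sigma = mean(vals), stdev(vals)
        if sigma == 0:
            continue
        for drive_id, f in drives.items():
            if abs(f[key] - mu) / sigma > threshold:
                flagged.append((drive_id, key))
    return flagged
```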
Experiences with 100Gbps Network Applications
- In Proceedings of the Fifth International Workshop on Data-Intensive Distributed Computing (DIDC), in conjunction with ACM High Performance Distributed Computing (HPDC), 2012
"... 100Gbps networking has finally arrived, and many research and educational institutions have begun to deploy 100Gbps routers and services. ESnet and Internet2 worked together to make 100Gbps networks available to researchers at the Supercomputing 2011 conference in Seattle Washington. In this paper, ..."
Cited by 4 (2 self)
100Gbps networking has finally arrived, and many research and educational institutions have begun to deploy 100Gbps routers and services. ESnet and Internet2 worked together to make 100Gbps networks available to researchers at the Supercomputing 2011 conference in Seattle, Washington. In this paper, we describe two of the first applications to take advantage of this network. We demonstrate a visualization application that enables remotely located scientists to gain insights from large datasets. We also demonstrate climate data movement and analysis over the 100Gbps network. We describe a number of application design issues and host tuning strategies necessary for enabling applications to scale to 100Gbps rates.
Single-snapshot file system analysis
- In Proc. of MASCOTS, 2013
"... Abstract—Metadata snapshots are a common method for gaining insight into filesystems due to their small size and relative ease of acquisition. Since they are static, most researchers have used them for relatively simple analyses such as file size distributions and age of files. We hypothesize that i ..."
Cited by 1 (1 self)
Metadata snapshots are a common method for gaining insight into filesystems due to their small size and relative ease of acquisition. Since they are static, most researchers have used them for relatively simple analyses such as file size distributions and the age of files. We hypothesize that it is possible to gain much richer insights into file system and user behavior by clustering features in metadata snapshots and comparing the entropy within clusters to the entropy within natural partitions such as directory hierarchies. We discuss several different methods for gaining deeper insights into metadata snapshots, and show a small proof of concept using data from Los Alamos National Laboratories. In our initial work, we see evidence that it is possible to identify user locality information, traditionally the purview of dynamic traces, using a single static snapshot.
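The entropy comparison hypothesized above might look like the following sketch; the attribute ("owner"), the record fields, and the source of the cluster labels are assumptions for illustration:

```python
# Compare the mean entropy of a metadata attribute within directory partitions
# versus within cluster partitions; lower entropy means more homogeneous groups.
import math
from collections import Counter, defaultdict

def entropy(values):
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def mean_partition_entropy(records, key_fn, attr):
    # records: list of dicts describing files in a metadata snapshot.
    parts = defaultdict(list)
    for r in records:
        parts[key_fn(r)].append(r[attr])
    return sum(entropy(v) for v in parts.values()) / len(parts)

# e.g. compare grouping by parent directory against a hypothetical cluster label:
# mean_partition_entropy(records, lambda r: r["parent_dir"], "owner")
# mean_partition_entropy(records, lambda r: r["cluster"], "owner")
```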
Global analysis of drive file times
"... Abstract—Global analysis is a useful supplement to local forensic analysis of the details of files in a drive image. This paper reports on experiments with global methods to find time patterns associated with disks and files. The Real Disk Corpus of over 1000 drive images from eight countries was us ..."
Cited by 1 (0 self)
Global analysis is a useful supplement to local forensic analysis of the details of files in a drive image. This paper reports on experiments with global methods to find time patterns associated with disks and files. The Real Disk Corpus of over 1000 drive images from eight countries was used as the corpus. The data was clustered into 63 subsets based on file and directory type, and times were analyzed statistically for each subset. Fourteen important subsets of the files were identified based on their times, including default times (zero, low-default, high-default, and on the hour), bursts of activity (one-time, periodic in the week, and periodic in the day), and those having particular equalities or inequalities between any two of creation, modification, and access times. Using overall statistics for each drive, fourteen kinds of drive usage were recognized, such as a business operating primarily in the evening. Additional work examined the connection between file times and registry times, and adapting these methods to sampled rather than complete data is discussed.
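A small sketch of the kind of timestamp bucketing described above is given below; the categories are a simplified illustrative subset, not the paper's 63 type-based subsets or fourteen time-based subsets:

```python
# Assign each file to coarse categories based on its creation, modification,
# and access times (timestamps assumed to be timezone-aware UTC datetimes).
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def time_categories(created, modified, accessed):
    cats = set()
    for name, t in (("created", created), ("modified", modified), ("accessed", accessed)):
        if t == EPOCH:
            cats.add(f"{name}-default-zero")
        if t.minute == 0 and t.second == 0:
            cats.add(f"{name}-on-the-hour")
    if created == modified:
        cats.add("created-equals-modified")
    if modified > accessed:
        cats.add("modified-after-access")
    return cats
```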