Results 1 - 10 of 53
The Open Provenance Model, 2008. Cited by 47 (7 self).
The Open Provenance Model (OPM) is a community-driven data model for provenance that is designed to support inter-operability of provenance technology. Underpinning OPM is a notion of directed acyclic graph, used to represent data products and processes involved in past computations, and causal dependencies between these. The Open Provenance Model was derived following two “Provenance Challenges”: international, multidisciplinary activities investigating how to exchange provenance information between multiple systems and how to query it. The OPM design was mostly driven by practical and pragmatic considerations, and is being tested in a third Provenance Challenge, which has just started. The purpose of this paper is to investigate the theoretical foundations of this data model. The formalisation consists of a set-theoretic definition of the data model, a definition of the inferences by transitive closure that are permitted, a formal description of how the model can be used to express dependencies in past computations, and finally a description of the kind of time-based inferences that are supported. A novel element that OPM introduces is the concept of an account, by which multiple descriptions of the same execution are allowed to co-exist in the same graph. Our formalisation gives a precise meaning to such accounts and the associated notions of alternate and refinement. Warning: it was decided that this paper should be released as early as possible, since it brings useful clarifications on the Open Provenance Model and can therefore benefit the Provenance Challenge 3 community. The reader should recognise that this paper is an early draft, and several sections are incomplete. Additionally, figures rely on colours that may be difficult to read when printed in black and white; it is advisable to print the paper in colour.
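The abstract's core idea, a directed acyclic graph of past computations with inferences by transitive closure, can be sketched in a few lines. This is a minimal illustration with hypothetical node and edge names, not the official OPM vocabulary or serialisation:

```python
# Sketch of an OPM-style provenance graph: nodes are artifacts or processes,
# edges point from an effect to its cause, and multi-step derivations are
# inferred by taking the transitive closure of the edge set.

def transitive_closure(edges):
    """Return all (effect, cause) pairs reachable through one or more edges."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Hypothetical past computation: out.csv was generated by a sort process,
# which used raw.csv.
edges = {("out.csv", "p_sort"), ("p_sort", "raw.csv")}
print(("out.csv", "raw.csv") in transitive_closure(edges))  # True
```

The inferred pair ("out.csv", "raw.csv") is the kind of multi-step derivation the formalisation licenses; acyclicity guarantees the closure never relates an artifact to itself.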
Task-role-based access control model. Information Systems 28(6) (2003) 533–562. Cited by 41 (3 self).
Existence of data provenance information in a system raises at least two security-related issues. One is how provenance data can be used to enhance security in the system, and the other is how to protect provenance data, which might be more sensitive than the data itself. Recent provenance-related access control literature mainly focuses on the latter issue of protecting provenance data. In this paper, we propose a novel provenance-based access control model that addresses the former objective. Using provenance data for access control to the underlying data facilitates additional capabilities beyond those available in traditional access control models. We utilize a notion of dependency as the key foundation for access control policy specification. Dependency-based policy provides simplicity and effectiveness in policy specification and access control administration. We show our model can support dynamic separation of duty, workflow control, origin-based control, and object versioning. The proposed model identifies essential components and concepts and provides a foundational base model for provenance-based access control. We further discuss possible extensions of the proposed base model for enhanced access controls.
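The dependency-based policies described above can be illustrated with the dynamic-separation-of-duty case the abstract mentions: deny an action when the requesting user appears on a relevant dependency path of the object's provenance. This is a hypothetical sketch (the dependency name and API are invented, not the paper's model):

```python
# Hypothetical provenance store: object -> list of (dependency, subject)
# pairs recorded when the object was produced.
provenance = {
    "doc1": [("wasWrittenBy", "alice")],
}

def may_approve(user, obj):
    """Dynamic separation of duty: an author may not approve their own document."""
    return all(subject != user
               for dep, subject in provenance.get(obj, [])
               if dep == "wasWrittenBy")

print(may_approve("alice", "doc1"))  # False: alice wrote doc1
print(may_approve("bob", "doc1"))    # True
```

The policy is stated purely in terms of a provenance dependency, which is what gives the model capabilities (workflow control, origin-based control) that subject-attribute models lack.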
The Case of the Fake Picasso: Preventing History Forgery with Secure Provenance. Cited by 40 (5 self).
As increasing amounts of valuable information are produced and persist digitally, the ability to determine the origin of data becomes important. In science, medicine, commerce, and government, data provenance tracking is essential for rights protection, regulatory compliance, management of intelligence and medical data, and authentication of information as it flows through workplace tasks. In this paper, we show how to provide strong integrity and confidentiality assurances for data provenance information. We describe our provenance-aware system prototype that implements provenance tracking of data writes at the application layer, which makes it extremely easy to deploy. We present empirical results that show that, for typical real-life workloads, the runtime overhead of our approach to recording provenance with confidentiality and integrity guarantees ranges from 1%–13%.
Querying data provenance. In SIGMOD, 2010. Cited by 40 (11 self).
Many advanced data management operations (e.g., incremental maintenance, ...
Preventing History Forgery with Secure Provenance. Cited by 26 (2 self).
As increasing amounts of valuable information are produced and persist digitally, the ability to determine the origin of data becomes important. In science, medicine, commerce, and government, data provenance tracking is essential for rights protection, regulatory compliance, management of intelligence and medical data, and authentication of information as it flows through workplace tasks. While significant research has been conducted in this area, the associated security and privacy issues have not been explored, leaving provenance information vulnerable to illicit alteration as it passes through untrusted environments. In this paper, we show how to provide strong integrity and confidentiality assurances for data provenance information at the kernel, file system, or application layer. We describe Sprov, our provenance-aware system prototype that implements provenance tracking of data writes at the application layer, which makes Sprov extremely easy to deploy. We present empirical results that show that, for real-life workloads, the runtime overhead of Sprov for recording provenance with confidentiality and integrity guarantees ranges from 1%–13% when all file modifications are recorded, and from 12%–16% when all file reads and modifications are tracked.
Provenance for generalized map and reduce workflows. In Proc. Conference on Innovative Data Systems Research (CIDR), 2011. Cited by 19 (3 self).
We consider a class of workflows, which we call generalized map and reduce workflows (GMRWs), where input data sets are processed by an acyclic graph of map and reduce functions to produce output results. We show how data provenance (also sometimes called lineage) can be captured for map and reduce functions transparently. The captured provenance can then be used to support backward tracing (finding the input subsets that contributed to a given output element) and forward tracing (determining which output elements were derived from a particular input element). We provide formal underpinnings for provenance in GMRWs, and we identify properties that are guaranteed to hold when provenance is applied recursively. We have built a prototype system that supports provenance capture and tracing as an extension to Hadoop. Our system uses a wrapper-based approach, requiring little if any user intervention in most cases, and retaining Hadoop’s parallel execution and fault tolerance. Performance numbers from our system are reported.
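The wrapper-based capture described above can be sketched for a single map stage: each output element records the identifiers of the inputs it came from, which is exactly the information backward tracing needs. This is a simplified illustration with invented identifiers, not the paper's Hadoop extension:

```python
# Wrap a map function so that lineage is captured transparently: the caller
# still supplies an ordinary element-to-elements function.

def traced_map(fn, inputs):
    """Apply fn to each (id, value) pair; tag every output with its input id."""
    outputs, lineage = [], {}
    for i, (in_id, value) in enumerate(inputs):
        for j, out in enumerate(fn(value)):
            out_id = f"m{i}.{j}"
            outputs.append((out_id, out))
            lineage[out_id] = {in_id}   # backward pointer: output -> inputs
    return outputs, lineage

outs, lin = traced_map(lambda v: [v * 2], [("i0", 1), ("i1", 5)])
# Backward tracing: which input contributed to the second output element?
print(lin[outs[1][0]])  # {'i1'}
```

Forward tracing is the inverse mapping, and recursive application across a workflow composes these per-stage lineage tables along the acyclic graph of functions.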
Causality-Based Versioning. In Proceedings of the 7th USENIX Conference on File and Storage Technologies, 2009. Cited by 18 (3 self).
Versioning file systems provide the ability to recover from a variety of failures, including file corruption, virus and worm infestations, and user mistakes. However, using versions to recover from data-corrupting events requires a human to determine precisely which files and versions to restore. We can create more meaningful versions and enhance the value of those versions by capturing the causal connections among files, facilitating selection and recovery of precisely the right versions after data-corrupting events. We determine when to create new versions of files automatically using the causal relationships among files. The literature on versioning file systems usually examines two extremes of possible version-creation algorithms: open-to-close versioning and versioning on every write. We evaluate causal versions of these two algorithms and introduce two additional causality-based algorithms: Cycle-Avoidance and Graph-Finesse. We show that capturing and maintaining causal relationships imposes less than 7% overhead on a versioning system, providing benefit at low cost. We then show that Cycle-Avoidance provides more meaningful versions of files created during concurrent program execution, with overhead comparable to open/close versioning. Graph-Finesse provides even greater control, frequently at comparable overhead, but sometimes at unacceptable overhead. Versioning on every write is an interesting extreme case, but is far too costly to be useful in practice.
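The two baseline policies the abstract names can be contrasted on a small event trace. This is a hypothetical sketch of the policies only; the paper's Cycle-Avoidance and Graph-Finesse algorithms add causal-graph checks on top of decisions like these:

```python
# Count how many versions each policy creates for a stream of
# ('open' | 'write' | 'close', filename) events.

def versions_created(events, policy):
    count = 0
    dirty = set()
    for op, f in events:
        if policy == "every-write" and op == "write":
            count += 1                  # one version per write
        elif policy == "open-to-close":
            if op == "write":
                dirty.add(f)
            elif op == "close" and f in dirty:
                count += 1              # one version per open/close session
                dirty.discard(f)
    return count

trace = [("open", "a"), ("write", "a"), ("write", "a"), ("close", "a")]
print(versions_created(trace, "every-write"))    # 2
print(versions_created(trace, "open-to-close"))  # 1
```

The gap between the two counts grows with write frequency, which is why versioning on every write is described as far too costly in practice.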
Efficient Provenance Storage over Nested Data Collections, 2009. Cited by 17 (3 self).
Scientific workflow systems are increasingly used to automate complex data analyses, largely due to their benefits over traditional approaches for workflow design, optimization, and provenance recording. Many workflow systems employ a simple dependency model to represent the provenance of data produced by workflow runs. Although commonly adopted, this model does not capture explicit data dependencies introduced by “provenance-aware” processes, and it can lead to inefficient storage when workflow data is complex or structured. We present a provenance model, extending the conventional approach, that supports (i) explicit data dependencies and (ii) nested data collections. Our model adopts techniques from reference-based XML versioning, adding annotations for process and data dependencies. We present strategies and reduction techniques to store immediate and transitive provenance information within our model, and examine trade-offs among update time, storage size, and query response time. We evaluate our approach on real-world and synthetic workflow execution traces, demonstrating significant reductions in storage size, while also reducing the time required to store and query provenance information.
PrIMe: A methodology for developing provenance-aware applications, 2006. Cited by 15 (7 self).
Provenance refers to the past processes that brought about a given (version of an) object, item or entity. By knowing the provenance of data, users can often better understand, trust, reproduce, and validate it. A provenance-aware application has the functionality to answer questions regarding the provenance of the data it produces, by using documentation of past processes. PrIMe is a software engineering technique for adapting application designs to enable them to interact with a provenance middleware layer, thereby making them provenance-aware. In this article, we specify the steps involved in applying PrIMe, analyse its effectiveness, and illustrate its use with two case studies, in bioinformatics and medicine.
Detecting and resolving unsound workflow views for correct provenance analysis. In SIGMOD, 2009. Cited by 15 (8 self).
Workflow views abstract groups of tasks in a workflow into high level composite tasks, in order to reuse sub-workflows and facilitate provenance analysis. However, unless a view is carefully designed, it may not preserve the dataflow between tasks in the workflow, i.e., it may not be sound. Unsound views can be misleading and cause incorrect provenance analysis. This paper studies the problem of efficiently identifying and correcting unsound workflow views with minimal changes. In particular, given a workflow view, we wish to split each unsound composite task into the minimal number of tasks, such that the resulting view is sound. We prove that this problem is NP-hard by reduction from independent set. We then propose two local optimality conditions (weak and strong), and design polynomial time algorithms for correcting unsound views to meet these conditions. Experiments show that our proposed algorithms are effective and efficient, and that the strong local optimality algorithm produces better solutions than the weak local optimality algorithm with little processing overhead.
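One common way to formalise the soundness test implied above is that a view is sound only if the graph it induces over composite tasks is still acyclic; grouping tasks carelessly can fold a dependency chain back on itself. This is an illustrative sketch under that assumption, not the paper's algorithm:

```python
# A view maps each workflow task to a composite task. The view is unsound
# when the induced graph over composites contains a cycle, i.e. it no longer
# reflects the workflow's dataflow order. Cycle check via Kahn's algorithm.

def view_is_sound(task_edges, view):
    # Induced edges between distinct composites (self-loops dropped).
    comp_edges = {(view[a], view[b]) for a, b in task_edges if view[a] != view[b]}
    nodes = set(view.values())
    indeg = {n: 0 for n in nodes}
    for _, b in comp_edges:
        indeg[b] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while queue:
        n = queue.pop()
        seen += 1
        for a, b in comp_edges:
            if a == n:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == len(nodes)   # all composites sorted <=> acyclic

# Workflow t1 -> t2 -> t3: grouping t1 and t3 together creates a cycle.
edges = [("t1", "t2"), ("t2", "t3")]
print(view_is_sound(edges, {"t1": "C1", "t2": "C2", "t3": "C1"}))  # False
print(view_is_sound(edges, {"t1": "C1", "t2": "C1", "t3": "C2"}))  # True
```

Detecting unsoundness is the easy half; the NP-hardness result above concerns the repair step, splitting each unsound composite into the minimum number of pieces.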