Results 1–10 of 32
Deriving private information from randomized data
In SIGMOD, 2005
Cited by 133 (2 self)
A Statistical Framework for Differential Privacy
Cited by 39 (4 self)
One goal of statistical privacy research is to construct a data release mechanism that protects individual privacy while preserving information content. An example is a random mechanism that takes an input database X and outputs a random database Z according to a distribution Qn(·|X). Differential privacy is a particular privacy requirement developed by computer scientists in which Qn(·|X) is required to be insensitive to changes in one data point in X. This makes it difficult to infer from Z whether a given individual is in the original database X. We consider differential privacy from a statistical perspective. We consider several data-release mechanisms that satisfy the differential privacy requirement. We show that it is useful to compare these schemes by computing the rate of convergence of distributions and densities constructed from the released data. We study a general privacy method, called the exponential mechanism, introduced by McSherry and Talwar (2007). We show that the accuracy of this method is intimately linked to the rate at which the empirical distribution concentrates in a small ball around the true distribution.
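The exponential mechanism mentioned above can be sketched as follows. This is a minimal illustration, not the paper's setup: the candidate grid, the median-closeness utility, and the assumed sensitivity of 1 are all hypothetical choices made for the example.

```python
import numpy as np

def exponential_mechanism(data, candidates, utility, sensitivity, epsilon, rng):
    """Sample one candidate z with probability proportional to
    exp(epsilon * u(data, z) / (2 * sensitivity))  (McSherry-Talwar)."""
    scores = np.array([utility(data, z) for z in candidates], dtype=float)
    logits = epsilon * scores / (2.0 * sensitivity)
    # Subtract the max before exponentiating for numerical stability.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Toy use: privately pick an approximate median from a fixed grid.
rng = np.random.default_rng(0)
data = np.array([3.1, 3.4, 3.6, 4.0, 9.9])
grid = np.linspace(0.0, 10.0, 101)
u = lambda x, z: -abs(np.median(x) - z)   # utility: closeness to the median
z = exponential_mechanism(data, grid, u, sensitivity=1.0, epsilon=1.0, rng=rng)
```

With larger epsilon the sampled z concentrates near the true median; with smaller epsilon it spreads over the grid, which is the privacy/accuracy trade-off the abstract analyzes.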
Secure Regression on Distributed Databases
In J. Computational and Graphical Statistics, 2004
Cited by 39 (18 self)
We present several methods for performing linear regression on the union of distributed databases that preserve, to varying degrees, confidentiality of those databases. Such methods can be used by federal or state statistical agencies to share information from their individual databases, or to make such information available to others. Secure data integration, which provides the lowest level of protection, actually integrates the databases, but in such a way that no database owner can determine the origin of any records other than its own. Regression, associated diagnostics, or any other analysis can then be performed on the integrated data.
Data Dissemination and Disclosure Limitation In a World . . .
In Statist. Sci., 2004
Cited by 26 (15 self)
Given the public's ever-increasing concerns about data confidentiality, in the near future statistical agencies may be unable or unwilling, or even may not be legally allowed, to release any genuine microdata: data on individual units, such as individuals or establishments. In such a world, an alternative dissemination strategy is remote-access analysis servers, to which users submit requests for output from statistical models fit using the data, but are not allowed access to the data themselves. Analysis servers, however, are not free from the risk of disclosure, especially in the face of multiple, interacting queries. We describe these risks and propose quantifiable measures of risk and data utility that can be used to specify which queries can be answered, and with what output. The risk-utility framework is illustrated for regression models.
Privacy-preserving SVM classification on vertically partitioned data
In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2006
Cited by 22 (1 self)
Classical data mining algorithms implicitly assume complete access to all data, either in centralized or federated form. However, privacy and security concerns often prevent the sharing of data, thus derailing data mining projects. Recently, there has been growing focus on finding solutions to this problem. Several algorithms have been proposed that perform distributed knowledge discovery while providing guarantees on the non-disclosure of data. Classification is an important data mining problem applicable in many diverse domains. The goal of classification is to build a model which can predict an attribute (a binary attribute in this work) based on the rest of the attributes. We propose an efficient and secure privacy-preserving algorithm for support vector machine (SVM) classification over vertically partitioned data.
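For intuition on why vertical partitioning is workable for SVMs: with a linear kernel, the gram matrix over the full data decomposes into a sum of per-party gram matrices, so only those local matrices need to be combined. The numpy sketch below checks this identity with toy sizes; it illustrates the decomposition only, not the paper's secure protocol.

```python
import numpy as np

# Two parties hold disjoint attribute blocks of the same records.
rng = np.random.default_rng(1)
n = 6
X_a = rng.normal(size=(n, 3))   # party A's attributes
X_b = rng.normal(size=(n, 2))   # party B's attributes
X = np.hstack([X_a, X_b])       # the (never materialized) full data

# For a linear kernel, the full gram matrix is the sum of the
# per-party gram matrices, each computable locally:
K = X_a @ X_a.T + X_b @ X_b.T
assert np.allclose(K, X @ X.T)
```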
Privacy-aware regression modeling of participatory sensing data
In Proc. of the 8th ACM Conference on Embedded Networked Sensor Systems (SenSys)
Cited by 18 (6 self)
Many participatory sensing applications use data collected by participants to construct a public model of a system or phenomenon. For example, a health application might compute a model relating exercise and diet to amount of weight loss. While the ultimately computed model could be public, the individual input and output data traces used to construct it may be private data of participants (e.g., their individual food intake, lifestyle choices, and resulting weight). This paper proposes and experimentally studies a technique that attempts to keep such input and output data traces private, while allowing accurate model construction. This differs significantly from perturbation-based techniques in that no noise is added. The main contribution of the paper is to show a certain data transformation at the client side that helps keep the client data private while not introducing any additional error into model construction. We particularly focus on linear regression models, which are widely used in participatory sensing applications. We use the data set from a map-based participatory sensing service to evaluate our scheme. The service in question is a green navigation service that constructs regression models from participant data to predict the fuel consumption of vehicles on road segments. We evaluate our proposed mechanism by providing empirical evidence that: (i) an individual data trace is generally hard to reconstruct with any reasonable accuracy, and (ii) the regression model constructed using the transformed traces has a much smaller error than one based on additive data-perturbation schemes.
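One way a client-side transformation can hide a raw trace without adding modeling error: multiplying X and y by any orthogonal matrix Q leaves the least-squares solution unchanged, since (QX)^T(QX) = X^T X and (QX)^T(Qy) = X^T y. This is a sketch of that general property under simplified assumptions; the paper's actual transform may differ.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# A client-side orthogonal transform Q obscures the raw trace (QX, Qy),
# yet leaves the normal equations, and hence the fit, unchanged.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
beta_raw = np.linalg.lstsq(X, y, rcond=None)[0]
beta_tf = np.linalg.lstsq(Q @ X, Q @ y, rcond=None)[0]
assert np.allclose(beta_raw, beta_tf)
```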
Secure, privacypreserving analysis of distributed databases
 Technometrics
Cited by 16 (6 self)
There is clear value, in both industrial and government settings, derived from performing statistical analyses that, in effect, integrate data in multiple, distributed databases. However, the barriers to actually integrating the data can be substantial or even insurmountable. Corporations may be unwilling to share proprietary databases such as chemical databases held by pharmaceutical manufacturers, government agencies are subject to laws protecting the confidentiality of data subjects, and even the sheer volume of the data may preclude actual data integration. In this paper, we show how tools from modern information technology, specifically secure multiparty computation and networking, can be used to perform statistically valid analyses of distributed databases. The common characteristic of the methods we describe is that the owners share sufficient statistics computed on the local databases in a way that protects each owner from the others. That is, while each owner can calculate the "complement" of its contribution to the analysis, it cannot discern which other owners contributed what to that complement. Our focus is on horizontally partitioned data: the data records, rather than the data attributes, are spread among the owners. We present protocols for secure regression, contingency tables, maximum likelihood, and Bayesian analysis. For low-risk situations, we describe a secure data integration protocol that integrates the databases but prevents owners from learning the source of data records other than their own. Finally, we outline three current research directions: a software system implementing the protocols, secure EM algorithms, and partially trusted third parties, which reduce owners' incentives to be dishonest.
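Sharing sufficient statistics of this kind typically rests on a primitive such as secure summation. A minimal ring-based sketch follows; the function name and the single-machine simulation are hypothetical conveniences, not the paper's protocol.

```python
import numpy as np

def secure_sum(local_values, rng):
    """Ring-based secure summation sketch: the initiating party masks its
    value with a large random offset, each subsequent party adds its own
    value to the masked running total, and the initiator removes the
    offset at the end.  No party observes another party's individual
    contribution, only masked partial sums."""
    offset = int(rng.integers(1, 10**9))
    running = offset + local_values[0]
    for v in local_values[1:]:
        running += v            # each party sees only the masked total
    return running - offset

rng = np.random.default_rng(3)
parts = [11, 7, 5]              # each owner's local statistic
total = secure_sum(parts, rng)
```

Summing local cross-products this way is enough to assemble, e.g., the pooled X^T X for secure regression over horizontally partitioned data.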
Privacy-preserving analysis of vertically partitioned data using secure matrix products
In J. Official Statistics, 2004
Cited by 16 (11 self)
Reluctance of statistical agencies and other data owners to share their possibly confidential or proprietary data with others who own related databases is a serious impediment to conducting mutually beneficial analyses. In this paper, we propose a protocol for securely computing matrix products on vertically partitioned data, i.e., data sets that have the same subjects but disjoint attributes. This protocol allows data owners to estimate coefficients and standard errors of linear regressions, and to examine regression model diagnostics, without disclosing the values of their attributes to each other or to third parties. The protocol can be used to perform other procedures for which sample means and covariances are sufficient statistics.
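To see why a secure matrix product suffices for regression here: the cross-product matrix X^T X splits into blocks, of which the diagonal blocks A^T A and B^T B are computable locally by each owner, and only the off-diagonal block A^T B involves both owners' attributes, so it is the only piece the secure protocol must supply. A toy numpy check of the decomposition (illustrative sizes, no cryptography):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
A = rng.normal(size=(n, 2))     # agency A's attributes
B = rng.normal(size=(n, 3))     # agency B's attributes
X = np.hstack([A, B])           # the never-shared full data matrix

# Diagonal blocks are local; only A^T B needs the secure matrix product.
top = np.hstack([A.T @ A, A.T @ B])
bot = np.hstack([B.T @ A, B.T @ B])
XtX = np.vstack([top, bot])
assert np.allclose(XtX, X.T @ X)
```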
PrivacyPreserving Ridge Regression on Hundreds of Millions of Records
Cited by 14 (1 self)
Ridge regression is an algorithm that takes as input a large number of data points and finds the best-fit linear curve through these points. The algorithm is a building block for many machine-learning operations. We present a system for privacy-preserving ridge regression. The system outputs the best-fit curve in the clear, but exposes no other information about the input data. Our approach combines both homomorphic encryption and Yao garbled circuits, where each is used in a different part of the algorithm to obtain the best performance. We implement the complete system, experiment with it on real datasets, and show that it significantly outperforms pure implementations based only on homomorphic encryption or Yao circuits.
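The quantity such a system ultimately releases is the standard ridge solution w = (X^T X + lam*I)^{-1} X^T y. As a plaintext reference point (no encryption, toy sizes, hypothetical lam), that computation is:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 4
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + 0.05 * rng.normal(size=n)

# Ridge closed form: the secure system would evaluate this same formula
# under encryption, revealing only w in the clear.
lam = 0.1
w = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

Because the heavy lifting reduces to aggregating X^T X and X^T y and then solving one small linear system, only that final solve needs the expensive cryptographic machinery.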
Compressed Regression
In NIPS, 2007
Cited by 10 (0 self)
Recent research has studied the role of sparsity in high-dimensional regression and signal reconstruction, establishing theoretical limits for recovering sparse models from sparse data. In this paper we study a variant of this problem where the original n input variables are compressed by a random linear transformation to m ≪ n examples in p dimensions, and establish conditions under which a sparse linear model can be successfully recovered from the compressed data. A primary motivation for this compression procedure is to anonymize the data and preserve privacy by revealing little information about the original data. We characterize the number of random projections that are required for ℓ1-regularized compressed regression to identify the nonzero coefficients in the true model with probability approaching one, a property called "sparsistence." In addition, we show that ℓ1-regularized compressed regression asymptotically predicts as well as an oracle linear model, a property called "persistence." Finally, we characterize the privacy properties of the compression procedure in information-theoretic terms, establishing upper bounds on the rate of information communicated between the compressed and uncompressed data that decay to zero.
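A minimal sketch of the compression step, substituting ordinary least squares for the paper's ℓ1-regularized estimator: rows are premultiplied by a random Gaussian matrix Φ, and the analyst fits on (ΦX, Φy) only, never seeing the original records. All sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, m = 500, 3, 100           # compress n records to m << n examples
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, 0.0, -0.7]) + 0.1 * rng.normal(size=n)

# Random Gaussian compression matrix; the analyst receives only the
# compressed pair (Phi X, Phi y).
Phi = rng.normal(size=(m, n)) / np.sqrt(m)
beta_comp = np.linalg.lstsq(Phi @ X, Phi @ y, rcond=None)[0]
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
```

With m well above p, the compressed estimate lands close to the full-data one, which is the "persistence" behavior the abstract formalizes for the regularized case.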