Results 1–10 of 15
Estimating identification disclosure risk using mixed membership models
Journal of the American Statistical Association, 2012
Abstract

Cited by 5 (1 self)
Statistical agencies and other organizations that disseminate data are obligated to protect data subjects' confidentiality. For example, ill-intentioned individuals might link data subjects to records in other databases by matching on common characteristics (keys). Successful links are particularly problematic for data subjects with combinations of keys that are unique in the population. Hence, as part of their assessments of disclosure risks, many data stewards estimate the probabilities that sample uniques on sets of discrete keys are also population uniques on those keys. This is typically done using log-linear modeling on the keys. However, log-linear models can yield biased estimates of cell probabilities for sparse contingency tables with many zero counts, which often occurs in databases with many keys. This bias can result in unreliable estimates of probabilities of uniqueness and, hence, misrepresentations of disclosure risks. We propose an alternative to log-linear models for datasets with sparse keys based on a Bayesian version of grade of membership (GoM) models. We present a Bayesian GoM model for multinomial variables and offer an MCMC algorithm for fitting the model. We evaluate the approach by treating data from a recent US Census Bureau public …
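The risk measure this line of work starts from can be sketched numerically. The snippet below is a minimal illustration, not the authors' GoM method: it fits the simplest main-effects-only (independence) log-linear model to a hypothetical two-key table and applies the standard Poisson approximation P(population unique | sample unique) ≈ exp(−(1 − π)λ_k). The table, sampling fraction, and variable names are all assumptions for illustration.

```python
import numpy as np

# Toy two-key contingency table of sample counts (keys and counts are
# hypothetical; real keys would form a far larger, sparser table).
sample = np.array([[3, 0, 1],
                   [1, 2, 0],
                   [0, 1, 1]])
n = sample.sum()
pi = 0.5          # assumed sampling fraction n / N
N = n / pi        # implied population size

# Main-effects-only (independence) log-linear fit: the MLE of each cell
# probability is the product of the marginal proportions.
p_hat = np.outer(sample.sum(axis=1), sample.sum(axis=0)) / n**2

# Under a Poisson model with estimated population rate lam_k = N * p_k,
# P(population unique | sample unique) ~= exp(-(1 - pi) * lam_k).
lam = N * p_hat
risk = np.where(sample == 1, np.exp(-(1 - pi) * lam), 0.0)
tau = risk.sum()  # expected number of sample uniques that are population uniques
```

The abstract's point is that this independence fit can be badly biased on sparse tables; the GoM model replaces the log-linear estimate of `p_hat`, not the risk formula itself.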
Maintaining Analytical Utility While Protecting Confidentiality of Survey and Nonresponse Data
2009
Abstract

Cited by 2 (0 self)
Consider a complete rectangular database at the micro (or unit) level from a survey (sample or census) or non-survey (administrative) source in which potential identifying variables (IVs) are suitably categorized (so that analytic utility is essentially maintained) to reduce the pre-treatment disclosure risk to the extent possible. The pre-treatment risk is due to the presence of unique records (with respect to IVs) or non-uniques (i.e., more than one record sharing a common IV profile) with similar values of at least one sensitive variable (SV). This setup covers macro (or aggregate) level data, including tabular data, because a common mean value (of 1 in the case of count data) can be assigned to all units in the aggregation or cell. Our goal is to create a public-use file with simultaneous control of disclosure risk and information loss after disclosure treatment by perturbation (i.e., substitution of IVs but not SVs) and suppression (i.e., subsampling out of records). In this paper, an alternative framework for measuring information loss and disclosure risk under a non-synthetic approach, as proposed by Singh (2002, 2006), is considered which, in contrast to the commonly used deterministic treatment, …
Estimation of Regression Parameters from Noise Multiplied Data
Journal of Privacy and Confidentiality
Statistical disclosure risk: Separating potential and harm
International Statistical Review, 2012
Abstract

Cited by 2 (0 self)
Statistical agencies are keen to devise ways to provide research access to data while protecting confidentiality. Although methods of statistical disclosure risk assessment are now well established in the statistical science literature, the integration of these methods by agencies into a general scientific basis for their practice still proves difficult. This paper seeks to review and clarify the role of statistical science in the conceptual foundations of disclosure risk assessment in an agency's decision making. Disclosure risk is broken down into disclosure potential, a measure of the ability to achieve true disclosure, and disclosure harm. It is argued that statistical science is most suited to assessing the former. A framework for this assessment is presented. The paper argues that the intruder's decision making and behaviour may be separated from this framework, provided appropriate account is taken of the nature of potential intruder attacks in the definition of disclosure potential.
Bayesian Nonparametric Disclosure Risk Estimation via Mixed Effects Log-linear Models
Abstract
Statistical agencies and other institutions collect data under the promise to protect the confidentiality of respondents. When releasing microdata samples, the risk that records can be identified must be assessed. To this aim, a widely adopted approach is to isolate the categorical variables key to identification and analyze multi-way contingency tables of such variables. Common disclosure risk measures focus on sample-unique cells in these tables and adopt parametric log-linear models as the standard statistical tools for the problem. Such models often have to deal with large and extremely sparse tables that pose a number of challenges to risk estimation. This paper proposes to overcome these problems by studying nonparametric alternatives based on Dirichlet process random effects. The main finding is that the inclusion of such random effects allows us to reduce considerably the number of fixed effects required to achieve reliable risk estimates. This is studied in applications to real data, suggesting in particular that our mixed models with main effects only produce roughly equivalent estimates compared to all-two-way-interactions models, and are effective in defusing potential shortcomings of traditional log-linear models. This paper adopts a fully Bayesian approach that accounts for all sources of uncertainty, including that about the population frequencies, and supplies unconditional (posterior) variances and credible intervals.
Estimating Frequencies of Frequencies in Finite Populations
Abstract
Given a sample from a finite population partitioned into classes, we consider estimating the distribution of the class frequencies. We propose first to estimate certain moments of this distribution, assuming Poisson sampling with unequal inclusion probabilities, and then to adapt these estimates using modelling assumptions. A simulation study illustrates the bias-robustness of the approach to departures from these assumptions.
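The target quantity can be made concrete with a short sketch: given sampled class labels, the sample-level "frequencies of frequencies" count how many classes appear exactly j times. The snippet below, with hypothetical labels, computes only this sample-level quantity; the paper's estimator then adjusts such counts to the population level using moments under Poisson sampling, which this sketch does not attempt.

```python
from collections import Counter

# Hypothetical sample of class labels drawn from a finite population.
labels = ["a", "a", "b", "c", "c", "c", "d", "e", "e"]

class_freq = Counter(labels)        # frequency of each observed class
fof = Counter(class_freq.values())  # fof[j] = number of classes seen exactly j times
```

Here two classes are sample uniques (j = 1), which is exactly the count that the disclosure risk measures in the surrounding papers start from.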
FOR THE FIRST WAVE Eurosystem Household Finance and Consumption Network
Abstract
publications feature a motif taken from the €5 banknote. NOTE: This Statistics Paper should not be reported as representing the views of the European Central Bank (ECB). The views expressed are those of the authors and do not necessarily reflect those of the ECB. Eurosystem Household Finance and Consumption Network This report has been prepared by the members of the Eurosystem Household Finance and Consumption Network (see the Annex for the list of members). You can reach us at:
A Neighborhood Regression Model for Sample Disclosure Risk Estimation
European Communities (Eurostat)
A Smoothing Model for Sample Disclosure Risk Estimation
IMS Lecture Notes–Monograph Series: Complex Datasets and Inverse Problems: Tomography, Networks and Beyond
Privacy Protection from Sampling and Perturbation in Survey Microdata
Abstract
Statistical agencies release microdata from social surveys as public-use files after applying statistical disclosure limitation (SDL) techniques. Disclosure risk is typically assessed in terms of identification risk, where it is supposed that small counts on cross-classified identifying key variables, i.e., a key, could be used to make an identification, so that confidential information may be learnt. In this paper we explore the application of definitions of privacy from the computer science literature to the same problem, with a focus on sampling and a form of perturbation which can be represented as misclassification. We consider two privacy definitions: differential privacy and probabilistic differential privacy. Chaudhuri and Mishra (2006) have shown that sampling does not guarantee differential privacy but that, under certain conditions, it may ensure probabilistic differential privacy. We discuss these definitions and conditions in the context of survey microdata. We then extend the discussion to the case of perturbation. We show that differential privacy can be ensured if and only if the perturbation employs a misclassification matrix with no zero entries. We also show that probabilistic differential privacy is a viable alternative to differential privacy when there are zeros in the misclassification matrix. We discuss some common examples of SDL methods where zeros may be prevalent in the misclassification matrix.
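The "no zero entries" condition lends itself to a small sketch. Assuming the perturbation is applied independently per record via a misclassification matrix M with M[j, i] = P(release j | true i), differential privacy in the randomized-response (local) sense for a single record's category holds exactly when every entry is positive, with the smallest ε given by the worst-case log-likelihood ratio across true categories. The matrix and function name below are illustrative, not taken from the paper.

```python
import numpy as np

def dp_epsilon(M):
    """Smallest epsilon for which the per-record perturbation gives
    epsilon-differential privacy, or None if some entry is zero
    (in which case no finite epsilon exists)."""
    if (M <= 0).any():
        return None
    # For each released category j, compare likelihoods across all
    # possible true categories i; the worst-case log-ratio is epsilon.
    ratios = M.max(axis=1) / M.min(axis=1)
    return float(np.log(ratios.max()))

# Illustrative misclassification matrix: M[j, i] = P(release j | true i).
M = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
```

A single zero entry makes some released category impossible for one true category but not another, so the likelihood ratio is unbounded, which matches the paper's "no zero entries" characterization.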