Preserving privacy by de-identifying facial images
- IEEE Transactions on Knowledge and Data Engineering
, 2005
Cited by 21 (4 self)
In the context of sharing video surveillance data, a significant threat to privacy is face recognition software, which can automatically identify known people, such as from a database of drivers’ license photos, and thereby track people regardless of suspicion. This paper introduces an algorithm to protect the privacy of individuals in video surveillance data by de-identifying faces such that many facial characteristics remain but the face cannot be reliably recognized. A trivial solution to de-identifying faces involves blacking out each face. This thwarts any possible face recognition, but because all facial details are obscured, the result is of limited use. Many ad hoc attempts, such as covering eyes or randomly perturbing image pixels, fail to thwart face recognition because of the robustness of face recognition methods. This paper presents a new privacy-enabling algorithm, named k-Same, that scientifically limits the ability of face recognition software to reliably recognize faces while maintaining facial details in the images. The algorithm determines similarity between faces based on a distance metric and creates new faces by averaging image components, which may be the original image pixels (k-Same-Pixel) or eigenvectors (k-Same-Eigen). Results are presented on a standard collection of real face images with varying k.
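The averaging idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes faces are already aligned and flattened into equal-length pixel vectors, uses Euclidean distance as the metric, and takes the first remaining face as each cluster's seed. The function name `k_same_pixel` and the seed rule are assumptions made for illustration.

```python
from math import dist


def k_same_pixel(faces, k):
    """Replace every face by the pixel-wise mean of a cluster of at least k faces.

    faces: list of equal-length pixel vectors (assumed aligned and flattened).
    """
    if k < 1 or len(faces) < k:
        raise ValueError("need 1 <= k <= number of faces")
    remaining = list(range(len(faces)))
    out = [None] * len(faces)
    while remaining:
        if len(remaining) < 2 * k:
            # Group all leftovers so that no cluster ends up smaller than k.
            cluster = remaining
        else:
            # Seed choice (first remaining face) is an assumption of this sketch.
            seed = faces[remaining[0]]
            cluster = sorted(remaining, key=lambda i: dist(faces[i], seed))[:k]
        size = len(cluster)
        average = [sum(pixels) / size for pixels in zip(*(faces[i] for i in cluster))]
        for i in cluster:
            out[i] = list(average)
        chosen = set(cluster)
        remaining = [i for i in remaining if i not in chosen]
    return out


faces = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [11.0, 11.0]]
print(k_same_pixel(faces, 2))
```

Because every output vector is shared by at least k inputs, a recognizer that matches on the released image alone cannot single out one source face among them.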
Defining Privacy for Data Mining
- in National Science Foundation Workshop on Next Generation Data Mining
, 2002
Cited by 15 (4 self)
Privacy preserving data mining -- getting valid data mining results without learning the underlying data values -- has been receiving attention in the research community and beyond. It is unclear what privacy preserving means. This paper provides a framework and metrics for discussing the meaning of privacy preserving data mining, as a foundation for further research in this field.
Privacy-Preserving Distributed k-Anonymity
, 2005
Cited by 13 (2 self)
k-anonymity provides a measure of privacy protection by preventing re-identification of data to fewer than a group of k data items. While algorithms exist for producing k-anonymous data, the model has been that of a single source wanting to publish data. This paper presents a k-anonymity protocol when the data is vertically partitioned between sites. A key contribution is a proof that the protocol preserves k-anonymity between the sites: While one site may have individually identifiable data, it learns nothing that violates k-anonymity with respect to the data at the other site. This is a fundamentally different distributed privacy definition than that of Secure Multiparty Computation, and it provides a better match with both ethical and legal views of privacy.
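As a rough illustration of the k-anonymity property itself (not of the distributed protocol described above), the following Python sketch checks whether every combination of quasi-identifier values occurs at least k times in a table. The column names and sample rows are hypothetical.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if each quasi-identifier combination is shared by at least k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


# Hypothetical, already generalized records.
rows = [
    {"zip": "130**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "130**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "148**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "148**", "age": "30-39", "diagnosis": "diabetes"},
]
print(is_k_anonymous(rows, ["zip", "age"], 2))
print(is_k_anonymous(rows, ["zip", "age"], 3))
```

In the vertically partitioned setting of the paper, no single site holds all of these columns, which is why a plain check like this one cannot be run directly and a protocol is needed.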
A secure distributed framework for achieving k-anonymity
Cited by 11 (0 self)
k-anonymity provides a measure of privacy protection by preventing re-identification of data to fewer than a group of k data items. While algorithms exist for producing k-anonymous data, the model has been that of a single source wanting to publish data. Due to privacy issues, it is common that data from different sites cannot be shared directly. Therefore, this paper presents a two-party framework along with an application that generates k-anonymous data from two vertically partitioned sources without disclosing data from one site to the other. The framework is privacy preserving in the sense that it satisfies the secure definition commonly defined in the literature of Secure Multiparty Computation.
Addressing users’ privacy concerns for improving personalization quality: Towards an integration of user studies and algorithm evaluation
- In: Intelligent Techniques in Web Personalisation. LNCS (LNAI
, 2005
Cited by 5 (1 self)
Numerous studies have demonstrated the effectiveness of personalization using quality criteria both from machine learning / data mining and from user studies. However, a site requires more than a high-performance personalization algorithm: it needs to convince its users to input the data needed by the algorithm. Today’s Web users are becoming increasingly privacy-conscious and less willing to disclose personal data. How can the advantages of personalization (and hence, of disclosure) be communicated effectively, and how can the success of such strategies be measured in terms of improved personalization quality? In this paper, we argue for a tighter integration of the HCI and computational issues involved in these questions. We first outline the problems for personalization that arise from the combination of users’ privacy concerns and sites’ current policies of dealing with privacy issues. We then describe the results of an experiment that investigated the effects of changes to a site’s interface on users’ willingness to disclose data for personalization. This is followed by an overview of studies of the sensitivity of mining algorithms to
Discrimination-aware data mining
, 2007
Cited by 4 (3 self)
In the context of civil rights law, discrimination refers to unfair or unequal treatment of people based on membership in a category or a minority, without regard to individual merit. Rules extracted from databases by data mining techniques, such as classification or association rules, when used for decision tasks such as benefit or credit approval, can be discriminatory in the above sense. In this paper, the notion of discriminatory classification rules is introduced and studied. Providing a guarantee of non-discrimination is shown to be a nontrivial task. A naïve approach, like taking away all discriminatory attributes, is shown to be insufficient when other background knowledge is available. Our approach leads to a precise formulation of the redlining problem along with a formal result relating discriminatory rules with apparently safe ones by means of background knowledge. An empirical assessment of the results on the German credit dataset is also provided.
Economies in Transition: The Beginning of Growth
- AEA Papers and Proceedings
, 2004
Cited by 4 (1 self)
Regional healthcare initiatives seek to improve the quality of healthcare by collecting, analyzing, and disseminating information about chronic diseases such as diabetes. The data required to support such initiatives comes from several organizations such as insurers, physicians, hospitals, pharmacies and labs, each of which gathers and maintains data for the purpose of healthcare delivery. Accessing data in this distributed and heterogeneous environment is difficult and has to deal with well-documented issues such as resolving semantic conflicts, multiple query languages etc. Data warehousing and mediator-based architectures are often proposed and used in these settings. In this paper, we focus on mediator-based architectures and the privacy problems that arise in the healthcare context owing to the linkage of information about patients, physicians, and diseases enabled by the mediator. Current proposals for security-conscious mediators do not address inferential disclosure resulting from record linkage. In particular, we study the problem of interval inference, a specific kind of disclosure that arises when participants are able to compute tight bounds on sensitive values of other participants, based on the aggregate information published by the mediator. We illustrate our approach with a real world example and propose an "audit and aggregate" methodology
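Interval inference, as described in this abstract, can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the paper's audit method: the mediator is assumed to publish only the sum of n participants' values, each value is assumed to lie in a publicly known range [lo, hi], and the inferring participant knows its own value. The function name `infer_interval` is hypothetical.

```python
def infer_interval(total, n, own_value, lo, hi):
    """Bounds one participant can derive on any single other participant's value.

    total: published sum over all n participants (assumed aggregate).
    lo, hi: publicly known range of every individual value (assumed).
    """
    if n < 2:
        raise ValueError("need at least two participants")
    rest = total - own_value   # sum of the other n - 1 values
    others = n - 2             # participants besides the inferrer and the target
    lower = max(lo, rest - others * hi)
    upper = min(hi, rest - others * lo)
    return lower, upper


# Three participants, values in [0, 100], published sum 250, own value 60.
print(infer_interval(250, 3, 60, 0, 100))
```

With these numbers the target's value is confined to the interval from 90 to 100 even though no individual value was published, which is the kind of tight bound the abstract calls a disclosure.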
Data Mining for Discrimination Discovery
Cited by 3 (3 self)
In the context of civil rights law, discrimination refers to unfair or unequal treatment of people based on membership in a category or a minority, without regard to individual merit. Discrimination in credit, mortgage, insurance, labor market, and education has been investigated by researchers in economics and human sciences. With the advent of automatic decision support systems, such as credit scoring systems, the ease of data collection opens several challenges to data analysts for the fight against discrimination. In this paper, we introduce the problem of discovering discrimination through data mining in a dataset of historical decision records, taken by humans or by automatic systems. We formalize the processes of direct and indirect discrimination discovery by modelling protected-by-law groups and contexts where discrimination occurs in a classification rule based syntax. Basically, classification rules extracted from the dataset allow for unveiling contexts of unlawful discrimination, where the degree of burden over protected-by-law groups is formalized by an extension of the lift measure of a classification rule. In direct discrimination, the extracted rules can be directly mined in search of discriminatory contexts. In indirect discrimination, the mining process needs some background knowledge as a further input, e.g., census data, that combined with the extracted rules might allow for unveiling contexts of discriminatory decisions. A strategy adopted for combining extracted classification rules with background knowledge is called an inference model. In this paper, we propose two inference models and provide automatic procedures for their implementation. An empirical assessment of our results is provided on the German credit dataset and on the PKDD Discovery Challenge 1999 financial dataset.
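The abstract says only that the burden on a protected group is measured by an extension of lift. One plausible form, assumed here for illustration, is the ratio of the confidence of a rule with the protected item in its premise to the confidence of the same rule without it. The Python sketch below computes that ratio on hypothetical decision records; the attribute names and data are invented.

```python
def confidence(records, premise, conclusion):
    """Fraction of records matching the premise that also match the conclusion."""
    covered = [r for r in records if all(r.get(a) == v for a, v in premise.items())]
    if not covered:
        raise ValueError("premise matches no records")
    hits = sum(all(r.get(a) == v for a, v in conclusion.items()) for r in covered)
    return hits / len(covered)


def extended_lift(records, protected, context, conclusion):
    """conf(protected, context -> conclusion) / conf(context -> conclusion)."""
    base = confidence(records, context, conclusion)
    if base == 0:
        raise ValueError("conclusion never occurs in this context")
    return confidence(records, {**protected, **context}, conclusion) / base


# Hypothetical history: 4 women (3 denied) and 4 men (1 denied) in one city.
records = (
    [{"sex": "F", "city": "X", "credit": "deny"}] * 3
    + [{"sex": "F", "city": "X", "credit": "grant"}]
    + [{"sex": "M", "city": "X", "credit": "deny"}]
    + [{"sex": "M", "city": "X", "credit": "grant"}] * 3
)
print(extended_lift(records, {"sex": "F"}, {"city": "X"}, {"credit": "deny"}))
```

A value above 1 means the decision is more frequent for the protected group than for the context as a whole; where to set a threshold is a policy choice outside this sketch.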
Automated De-Identification of Free-Text Medical Records
- BMC Medical Informatics and Decision Making
, 2008
Cited by 3 (1 self)
Background: Text-based patient medical records are a vital resource in medical research. In order to preserve patient confidentiality, however, the U.S. Health Insurance Portability and Accountability Act (HIPAA) requires that protected health information (PHI) be removed from such records before they can be disseminated. Manual de-identification of large medical record databases is prohibitively expensive, time-consuming and prone to error, giving rise to the need for a software system for large-scale, automated de-identification. Methods: We describe an automated de-identification Perl-based software package that is generally usable on most free-text medical records, e.g., nursing notes, discharge summaries, X-ray reports, etc. The algorithm uses lexical look-up tables, regular expressions, and simple heuristics to locate both HIPAA PHI, and an extended PHI set that includes doctors’ names and years of dates. To develop the de-identification
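The combination of look-up tables, regular expressions, and heuristics named in this abstract can be sketched in a few lines of Python. This is not the authors' Perl package: the name list, the patterns, and the placeholder tags are all hypothetical and far smaller than anything usable on real records.

```python
import re

KNOWN_NAMES = {"smith", "jones"}  # hypothetical look-up table of surnames
DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
TITLE_NAME = re.compile(r"\b(Dr|Mr|Mrs|Ms)\.?\s+([A-Z][a-z]+)")
WORD = re.compile(r"\b[A-Za-z]+\b")


def deidentify(text):
    """Replace dates, phone numbers, and likely names with placeholder tags."""
    text = DATE.sub("[DATE]", text)
    text = PHONE.sub("[PHONE]", text)
    # Heuristic: a capitalized word directly after a title is treated as a name.
    text = TITLE_NAME.sub(lambda m: f"{m.group(1)}. [NAME]", text)
    # Look-up table: replace any remaining word found in the name list.
    return WORD.sub(
        lambda m: "[NAME]" if m.group(0).lower() in KNOWN_NAMES else m.group(0),
        text,
    )


note = "Pt seen by Dr. Adams on 3/14/2007; call Smith at 555-123-4567."
print(deidentify(note))
```

The three mechanisms catch different things: patterns find structured items, the title heuristic finds names missing from the list, and the list finds names that appear without a title.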
Reidentification of Individuals in Chicago's Homicide Database: A Technical and Legal Study
, 2001
Cited by 2 (0 self)
Many government agencies, hospitals, and other organizations collect personal data of a sensitive nature. Often, these groups would like to release their data for statistical analysis by the scientific community, but do not want to cause the subjects of the data embarrassment or harassment. To resolve this conflict between privacy and progress, data is often deidentified before publication. In short, personally identifying information such as names, home addresses, and social security numbers is stripped from the data. We analyzed one such deidentified data set containing information about Chicago homicide victims over a span of three decades. By comparing the records in the Chicago data set with records in the Social Security Death Index, we were able to associate names with, or reidentify, 35% of the victims. This study details the reidentification method and results, and includes a legal review of U.S. regulations related to reidentification. Based on the findings of our project, we recommend removal of these databases from their online locations, and the establishment of national deidentification regulations.
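The linkage risk described in this abstract can be illustrated on toy data with a short Python sketch. The abstract does not say which fields were compared, so the field names below are assumptions, and all records are invented placeholders. A record counts as reidentified only when exactly one named record shares its values on the chosen fields.

```python
from collections import defaultdict


def reidentify(deidentified, named, keys):
    """Map index of each deidentified record to a name when the match is unique."""
    index = defaultdict(list)
    for person in named:
        index[tuple(person[k] for k in keys)].append(person["name"])
    matches = {}
    for i, record in enumerate(deidentified):
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:
            matches[i] = candidates[0]
    return matches


# Invented placeholder data; field names are assumptions of this sketch.
released = [
    {"birth_year": 1950, "death_date": "1990-01-02"},
    {"birth_year": 1960, "death_date": "1991-05-06"},
    {"birth_year": 1970, "death_date": "1992-07-08"},
]
public = [
    {"name": "Person A", "birth_year": 1950, "death_date": "1990-01-02"},
    {"name": "Person B", "birth_year": 1960, "death_date": "1991-05-06"},
    {"name": "Person C", "birth_year": 1960, "death_date": "1991-05-06"},
]
found = reidentify(released, public, ["birth_year", "death_date"])
print(found, len(found) / len(released))
```

The share of uniquely matched records is the quantity a data publisher would want to estimate before release, since removing names alone does not reduce it.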

