Anonymization of Set-Valued Data via Top-Down, Local Generalization, 2009
Cited by 45 (1 self)
Set-valued data, in which a set of values is associated with an individual, is common in databases ranging from market basket data, to medical databases of patients' symptoms and behaviors, to query engine search logs. Anonymizing this data is important if we are to reconcile the conflicting demands arising from the desire to release the data for study and the desire to protect the privacy of individuals represented in the data. Unfortunately, the bulk of existing anonymization techniques, which were developed for scenarios in which each individual is associated with only one sensitive value, are not well suited for set-valued data. In this paper we propose a top-down, partition-based approach to anonymizing set-valued data that scales linearly with the input size and scores well on an information-loss data quality metric. We further note that our technique can be applied to anonymize the infamous AOL query logs, and discuss the merits and challenges in anonymizing query logs using our approach.
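The top-down, partition-based idea in this abstract can be illustrated with a minimal sketch: recursively split the records on item presence, refusing any split that would leave a group smaller than the anonymity threshold k. This is only a toy illustration of the general partitioning strategy, not the paper's actual algorithm, generalization hierarchy, or information-loss metric; all names below are hypothetical.

```python
# Toy sketch of top-down partitioning for set-valued records.
# A split on an item is taken only if both resulting groups keep
# at least k records, preserving a k-anonymity-style guarantee.

def partition(records, items, k):
    """Recursively split `records` by item presence; stop splitting
    when either side of a split would fall below size k."""
    if not items:
        return [records]
    item, rest = items[0], items[1:]
    have = [r for r in records if item in r]
    lack = [r for r in records if item not in r]
    if len(have) < k or len(lack) < k:
        # Splitting here would break the size guarantee: keep the
        # group together (i.e., generalize away this item).
        return partition(records, rest, k)
    return partition(have, rest, k) + partition(lack, rest, k)

baskets = [{"milk", "eggs"}, {"milk"}, {"eggs", "beer"},
           {"beer"}, {"milk", "beer"}, {"eggs"}]
groups = partition(baskets, ["milk", "eggs", "beer"], k=2)
```

Every group returned has at least k members, so each record is indistinguishable from at least k-1 others within its group.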
Democrats, Republicans and Starbucks Afficionados. In: Proceedings of KDD 2011
Cited by 38 (0 self)
More and more technologies are taking advantage of the explosion of social media (Web search, content recommendation services, marketing, ad targeting, etc.). This paper focuses on the problem of automatically constructing user profiles, which can significantly benefit such technologies. We describe a general and robust machine learning framework for large-scale classification of social media users according to dimensions of interest. We report encouraging experimental results on three tasks with different characteristics: political affiliation detection, ethnicity identification, and detection of affinity for a particular business.
Vanity Fair: Privacy in Querylog Bundles
Cited by 21 (0 self)
A recently proposed approach to address privacy concerns in storing web search querylogs is bundling logs of multiple users together. In this work we investigate privacy leaks that are possible even when querylogs from multiple users are bundled together, without any user or session identifiers. We begin by quantifying users' propensity to issue own-name vanity queries and geographically revealing queries. We show that these propensities interact badly with two forms of vulnerabilities in the bundling scheme. First, structural vulnerabilities arise due to properties of the heavy tail of the user search frequency distribution, or the distribution of locations that appear within a user's queries. These heavy tails may cause a user to appear visibly different from other users in the same bundle. Second, we demonstrate analytical vulnerabilities based on the ability to separate the queries in a bundle into threads corresponding to individual users. These vulnerabilities raise privacy issues suggesting that bundling must be handled with great care.
A survey of query log privacy-enhancing techniques from a policy perspective. ACM Trans. Web, 2008
Cited by 19 (0 self)
As popular search engines face the sometimes conflicting interests of protecting privacy while retaining query logs for a variety of uses, numerous technical measures have been suggested to both enhance privacy and preserve at least a portion of the utility of query logs. This article seeks to assess seven of these techniques against three sets of criteria: (1) how well the technique protects privacy, (2) how well the technique preserves the utility of the query logs, and (3) how well the technique might be implemented as a user control. A user control is defined as a mechanism that allows individual Internet users to choose to have the technique applied to their own query logs.
Who Does What on the Web: A Large-scale Study of Browsing Behavior
Cited by 16 (0 self)
As the Web has become integrated into daily life, understanding how individuals spend their time online impacts domains ranging from public policy to marketing. It is difficult, however, to measure even simple aspects of browsing behavior via conventional methods—including surveys and site-level analytics—due to limitations of scale and scope. In part addressing these limitations, large-scale Web panel data are a relatively novel means for investigating patterns of Internet usage. In one of the largest studies of browsing behavior to date, we pair Web histories for 250,000 anonymized individuals with user-level demographics—including age, sex, race, education, and income—to investigate three topics. First, we examine how behavior changes as individuals spend more time online, showing that the heaviest users devote nearly twice as much of their time to social media relative to typical individuals. Second, we revisit the digital divide, finding that the frequency with which individuals turn to the Web for research, news, and healthcare is strongly related to educational background, but not as closely tied to gender and ethnicity. Finally, we demonstrate that browsing histories are a strong signal for inferring user attributes, including ethnicity and household income, a result that may be leveraged to improve ad targeting.
Sample complexity bounds for differentially private learning. In Proc. 24th COLT. Omnipress, 2011
Cited by 16 (3 self)
This work studies the problem of privacy-preserving classification – namely, learning a classifier from sensitive data while preserving the privacy of individuals in the training set. In particular, the learning algorithm is required in this problem to guarantee differential privacy, a very strong notion of privacy that has gained significant attention in recent years. A natural question to ask is: what is the sample requirement of a learning algorithm that guarantees a certain level of privacy and accuracy? We address this question in the context of learning with infinite hypothesis classes when the data is drawn from a continuous distribution. We first show that even for very simple hypothesis classes, any algorithm that uses a finite number of examples and guarantees differential privacy must fail to return an accurate classifier for at least some unlabeled data distributions. This result is unlike the case with either finite hypothesis classes or discrete data domains, in which distribution-free private learning is possible, as previously shown by Kasiviswanathan et al. (2008). We then consider two approaches to differentially private learning that get around this lower bound. The first approach is to use prior knowledge about the unlabeled data
Providing Privacy through Plausibly Deniable Search. In SDM, 2009
Cited by 14 (2 self)
Query-based web search is an integral part of many people's daily activities. Most do not realize that their search history can be used to identify them (and their interests). In July 2006, AOL released an anonymized search query log of some 600K randomly selected users. While valuable as a research tool, the anonymization was insufficient: individuals were identified from the contents of the queries alone [2]. Government requests for such logs increase the concern. To address this problem, we propose a client-centered approach of plausibly deniable search. Each user query is substituted with a standard, closely related query intended to fetch the desired results. In addition, a set of k-1 cover queries is issued; these have characteristics similar to the standard query but on unrelated topics. The system ensures that any of these k queries will produce the same set of k queries, giving k possible topics the user could have been searching for. We use a Latent Semantic Indexing (LSI) based approach to generate queries, and evaluate on the DMOZ [10] webpage collection to show the effectiveness of the proposed approach.
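The deniability invariant in this abstract (any of the k queries yields the same set of k queries) can be sketched with a fixed bundling of queries. The paper uses an LSI-based generator to pick semantically plausible covers; the hand-made bundles below are purely hypothetical placeholders for that machinery.

```python
# Sketch of the plausible-deniability invariant: queries are grouped
# into fixed bundles of size k, and issuing any query in a bundle
# sends the whole bundle, so an observer sees k candidate topics.

BUNDLES = [
    {"flu symptoms", "tax deadline", "hiking boots"},
    {"used cars", "piano lessons", "visa renewal"},
]

def deniable_search(query, bundles=BUNDLES):
    """Return the full bundle containing `query`. Every member of a
    bundle yields the same k-query set, which is the deniability
    property the paper's system must guarantee."""
    for bundle in bundles:
        if query in bundle:
            return bundle
    raise KeyError(f"no bundle covers {query!r}")
```

Because the mapping from query to bundle is fixed, an eavesdropper who observes the k issued queries cannot tell which one the user intended.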
What and How Children Search on the Web
Cited by 13 (2 self)
The Internet has become an important part of the daily life of children as a source of information and leisure activities. Nonetheless, given that most of the content available on the web is aimed at the general public, children are constantly exposed to inappropriate content, either because the language goes beyond their reading skills, because their attention span differs from grown-ups', or simply because the content is not targeted at children, as is the case with ads and adult content. In this work we employed a large query log sample from a commercial web search engine to identify the struggles and search behavior of children of age 6 to young adults of age 18. Concretely, we hypothesized that the large and complex volume of information to which children are exposed leads to ill-defined searches and to disorientation during the search process. For this purpose, we quantified their search difficulties based on query metrics (e.g. fraction of queries posed in natural language), session metrics (e.g. fraction of abandoned sessions) and click activity (e.g. fraction of ad clicks). We also used the search logs to retrace stages of child development. Concretely, we looked for changes in user interests (e.g. distribution of topics searched), language development (e.g. readability of the content accessed) and cognitive development (e.g. sentiment expressed in the queries) among children and adults. We observed that these metrics clearly demonstrate an increased level of confusion and unsuccessful search sessions among children. We also found a clear relation between the reading level of the clicked pages and the demographic characteristics of the users, such as age and average educational attainment of the zone in which the user is located.
Privacy in search logs. CoRR
Cited by 12 (4 self)
Search engine companies collect the “database of intentions”, the histories of their users' search queries. These search logs are a gold mine for researchers. Search engine companies, however, are wary of publishing search logs in order not to disclose sensitive information. In this paper, we develop a novel algorithm called ZEALOUS that for the first time enables publishing frequent keywords, queries, and clicks from a search log while achieving a very strong privacy guarantee called (ε, δ)-probabilistic differential privacy. An extensive experimental evaluation shows that search logs published with ZEALOUS can be used for research on both search quality and search efficiency with little loss in utility.
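A common pattern behind differentially private frequent-item publishing, in the spirit of what this abstract describes, is a two-threshold, noise-then-filter scheme: discard rare keywords, add calibrated noise to the surviving counts, then publish only those noisy counts clearing a second, higher threshold. The sketch below illustrates that pattern only; the thresholds and noise scale here are illustrative placeholders, not the calibrated parameters that give ZEALOUS its actual (ε, δ) guarantee.

```python
# Hedged sketch of two-threshold DP publishing of frequent keywords.
# Each user contributes each keyword at most once (via set()), which
# bounds the sensitivity of every count to 1 per user.
import random
from collections import Counter

def dp_frequent_keywords(logs, tau1=3, tau2=5.0, scale=1.0, rng=None):
    rng = rng or random.Random(0)  # seeded for reproducibility
    counts = Counter(kw for user_kws in logs for kw in set(user_kws))
    published = {}
    for kw, c in counts.items():
        if c < tau1:          # phase 1: drop rare keywords outright
            continue
        # Laplace(0, scale) noise as a difference of two exponentials.
        noisy = c + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        if noisy > tau2:      # phase 2: second, higher noisy threshold
            published[kw] = noisy
    return published

logs = [["privacy", "query logs"]] * 20 + [["rare keyword"]]
out = dp_frequent_keywords(logs, tau1=3, tau2=5.0, scale=0.1)
```

The rare keyword never reaches the noise stage, while heavily queried keywords survive both thresholds with high probability.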