Differentially Private Subspace Clustering
Citations: 2 (0 self)
Citations
2198 | Probability inequalities for sums of bounded random variables
- Hoeffding
- 1963
Citation Context ...ement. Then with probability at least 3/4 over random samples U, the following holds uniformly for all candidate subspace sets C: cost(C; X_S) ≤ 2·cost(C; X) + γ. (13) B.1 Proof of Lemma B.1. Lemma B.2 ([39]). Fix X and f : X → [0, M] for some positive constant M > 0. Let X_S be a subset of X with t elements, each drawn uniformly at random from X without replacement. Let ε, δ > 0. Then Pr[|E_X[f(x)] − E_{X_S} ... |
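For orientation, a standard form of Hoeffding's (1963) bound for sampling without replacement, which the truncated statement above appears to instantiate (the exact constants of the paper's Lemma B.2 may differ), is: for f : X → [0, M] and X_S a sample of t elements drawn uniformly without replacement from X,

    \Pr\Big[\big|\mathbb{E}_{X}[f(x)] - \mathbb{E}_{X_S}[f(x)]\big| \ge \epsilon\Big] \le 2\exp\!\left(-\frac{2t\epsilon^{2}}{M^{2}}\right).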
888 | Matrix Perturbation Theory
- STEWART, SUN
- 1990
Citation Context ... absolute constants c1 > 0, 0 < c2 < 1 such that for every t > 0, Pr[σ_n(A) ≤ t√n] ≤ c1·t + c2^n, where σ_n(A) is the least singular value of A. Theorem F.2 (Wedin's theorem; Theorem 4.1, pp. 260 in [45]). Let A, E ∈ R^{m×n} be given matrices with m ≥ n. Let A have the following singular value decomposition [U1 U2 U3]^T A [V1 V2] = [Σ1 0; 0 Σ2; 0 0], where U1, U2, U3, V1, V2 have orthonormal columns a... |
748 | From few to many: illumination cone models for face recognition under variable lighting and pose
- Georghiades, Belhumeur, et al.
- 2001
Citation Context ... privacy parameter ε is truly large (which means very little privacy protection). On the other hand, both Gibbs sampling and SuLQ subspace clustering give reasonably good performance. Figure 2 also shows that SuLQ scales poorly with the ambient dimension d. This is because SuLQ subspace clustering requires calibrating noise to a d × d sample covariance matrix, which induces much error when d is large. Gibbs sampling seems to be robust to various d settings. We also experiment on real-world datasets. The right two plots in Figure 2 report utility on a subset of the extended Yale Face Dataset B [13] for face clustering. 5 random individuals are picked, forming a subset of the original dataset with n = 320 data points (images). The dataset is preprocessed by projecting each individual onto a 9D affine subspace via PCA. Such preprocessing step was adopted in [32, 29] and was theoretically justified in [1]. Afterwards, ambient dimension of the entire dataset is reduced to d = 50 by random Gaussian projection. The plots show that Gibbs sampling significantly outperforms the other algorithms. [Footnote 4: In MPPCA latent variables y_i are sampled from a normal distribution N(0, ρ²I_q).] [Figure 2 plot residue omitted] ... |
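As a rough illustration of the preprocessing pipeline described in the excerpt above (per-subject PCA onto a 9-dimensional affine subspace, followed by random Gaussian projection down to d = 50), here is a minimal NumPy sketch; the function and variable names, and the exact projection scaling, are illustrative assumptions rather than the authors' code:

    import numpy as np

    def preprocess_faces(subjects, pca_dim=9, target_dim=50, seed=0):
        """subjects: list of (n_i x D) arrays, one per individual (D = raw pixel dimension)."""
        rng = np.random.default_rng(seed)
        projected = []
        for X in subjects:
            mu = X.mean(axis=0)                      # affine subspace: center first
            Xc = X - mu
            # top pca_dim principal directions via SVD of the centered data
            _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
            basis = Vt[:pca_dim]                     # (pca_dim x D)
            projected.append(mu + Xc @ basis.T @ basis)  # project onto the 9D affine subspace
        Xall = np.vstack(projected)                  # n total points, still D-dimensional
        D = Xall.shape[1]
        # random Gaussian projection to target_dim (scaling is an assumption)
        G = rng.normal(size=(D, target_dim)) / np.sqrt(target_dim)
        return Xall @ G                              # n x 50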
645 | Calibrating noise to sensitivity in private data analysis
- Dwork, McSherry, et al.
- 2006
Citation Context ... that if utility is measured in terms of exact clustering, then no private subspace clustering algorithm exists when neighboring databases are allowed to differ on an entire user profile. In addition, state-of-the-art subspace clustering methods like Sparse Subspace Clustering (SSC, [11]) lack a complete analysis of its clustering output, thanks to the notorious “graph connectivity” problem [21]. Finally, clustering could have high global sensitivity even if only cluster centers are released, as depicted in Figure 1. As a result, general private data releasing schemes like output perturbation [7, 8, 2] do not apply. In this work, we present a systematic and principled treatment of differentially private subspace clustering. To circumvent the negative result in [29], we use the perturbation of recovered low1 dimensional subspace from the ground truth as the utility measure. Our contributions are two-fold. First, we analyze two efficient algorithms based on the sample-aggregate framework [22] and established formal privacy and utility guarantees when data are generated from some stochastic model or satisfy certain deterministic separation conditions. New results on (non-private) subspace clus... |
522 | Lambertian reflectance and linear subspaces
- Basri, Jacobs
Citation Context ...ibrating noise to a d × d sample covariance matrix, which induces much error when d is large. Gibbs sampling seems to be robust to various d settings. We also experiment on real-world datasets. The right two plots in Figure 2 report utility on a subset of the extended Yale Face Dataset B [13] for face clustering. 5 random individuals are picked, forming a subset of the original dataset with n = 320 data points (images). The dataset is preprocessed by projecting each individual onto a 9D affine subspace via PCA. Such preprocessing step was adopted in [32, 29] and was theoretically justified in [1]. Afterwards, ambient dimension of the entire dataset is reduced to d = 50 by random Gaussian projection. The plots show that Gibbs sampling significantly outperforms the other algorithms. [Footnote 4: In MPPCA latent variables y_i are sampled from a normal distribution N(0, ρ²I_q).] [Figure 2 plot residue omitted; axes: K-means cost vs. log10 ε; legend: s.a. SSC, s.a. TSC, s.a. LRR, exp. SSC, exp. LRR, SuLQ-10, SuLQ-50] ... |
224 | Practical privacy: The SuLQ framework
- Blum, Dwork, et al.
- 2005
Citation Context ... that if utility is measured in terms of exact clustering, then no private subspace clustering algorithm exists when neighboring databases are allowed to differ on an entire user profile. In addition, state-of-the-art subspace clustering methods like Sparse Subspace Clustering (SSC, [11]) lack a complete analysis of its clustering output, thanks to the notorious “graph connectivity” problem [21]. Finally, clustering could have high global sensitivity even if only cluster centers are released, as depicted in Figure 1. As a result, general private data releasing schemes like output perturbation [7, 8, 2] do not apply. In this work, we present a systematic and principled treatment of differentially private subspace clustering. To circumvent the negative result in [29], we use the perturbation of recovered low1 dimensional subspace from the ground truth as the utility measure. Our contributions are two-fold. First, we analyze two efficient algorithms based on the sample-aggregate framework [22] and established formal privacy and utility guarantees when data are generated from some stochastic model or satisfy certain deterministic separation conditions. New results on (non-private) subspace clus... |
212 | Mechanism design via differential privacy
- Mcsherry
- 2007
Citation Context ...e two-fold. First, we analyze two efficient algorithms based on the sample-aggregate framework [22] and established formal privacy and utility guarantees when data are generated from some stochastic model or satisfy certain deterministic separation conditions. New results on (non-private) subspace clustering are obtained along our analysis, including a fully agnostic subspace clustering on well-separated datasets using stability arguments and exact clustering guarantee for thresholding-based subspace clustering (TSC, [14]) in the noisy setting. In addition, we employ the exponential mechanism [18] and propose a novel Gibbs sampler for sampling from this distribution, which involves a novel tweak in sampling from a matrix Bingham distribution. The method works well in practice and we show it is closely related to the well-known mixtures of probabilistic PCA model [27]. Related work Subspace clustering can be thought as a generalization of PCA and k-means clustering. The former aims at finding a single low-dimensional subspace and the latter uses zerodimensional subspaces as cluster centers. There has been extensive research on private PCA [2, 4, 10] and k-means [2, 22, 26]. Perhaps the ... |
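To make the exponential-mechanism reference concrete, below is a generic finite-candidate sketch of the mechanism of [18]; the paper's actual sampler works over a continuous space of subspaces via a matrix Bingham distribution, so this toy version only illustrates the score-based sampling principle, and the sensitivity argument is an assumption:

    import numpy as np

    def exponential_mechanism(candidates, score_fn, data, eps, sensitivity, rng=None):
        """Sample a candidate with probability proportional to exp(eps * score / (2 * sensitivity))."""
        rng = rng or np.random.default_rng()
        scores = np.array([score_fn(data, c) for c in candidates])
        logits = eps * scores / (2.0 * sensitivity)
        logits -= logits.max()          # numerical stability before exponentiation
        probs = np.exp(logits)
        probs /= probs.sum()
        return candidates[rng.choice(len(candidates), p=probs)]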
172 | Smooth sensitivity and sampling in private data analysis
- Nissim, Raskhodnikova, et al.
- 2007
Citation Context ...inally, clustering could have high global sensitivity even if only cluster centers are released, as depicted in Figure 1. As a result, general private data releasing schemes like output perturbation [7, 8, 2] do not apply. In this work, we present a systematic and principled treatment of differentially private subspace clustering. To circumvent the negative result in [29], we use the perturbation of recovered low1 dimensional subspace from the ground truth as the utility measure. Our contributions are two-fold. First, we analyze two efficient algorithms based on the sample-aggregate framework [22] and established formal privacy and utility guarantees when data are generated from some stochastic model or satisfy certain deterministic separation conditions. New results on (non-private) subspace clustering are obtained along our analysis, including a fully agnostic subspace clustering on well-separated datasets using stability arguments and exact clustering guarantee for thresholding-based subspace clustering (TSC, [14]) in the noisy setting. In addition, we employ the exponential mechanism [18] and propose a novel Gibbs sampler for sampling from this distribution, which involves a novel ... |
152 | Our data, ourselves: Privacy via distributed noise generation
- Dwork, Kenthapadi, et al.
Citation Context ... that if utility is measured in terms of exact clustering, then no private subspace clustering algorithm exists when neighboring databases are allowed to differ on an entire user profile. In addition, state-of-the-art subspace clustering methods like Sparse Subspace Clustering (SSC, [11]) lack a complete analysis of its clustering output, thanks to the notorious “graph connectivity” problem [21]. Finally, clustering could have high global sensitivity even if only cluster centers are released, as depicted in Figure 1. As a result, general private data releasing schemes like output perturbation [7, 8, 2] do not apply. In this work, we present a systematic and principled treatment of differentially private subspace clustering. To circumvent the negative result in [29], we use the perturbation of recovered low1 dimensional subspace from the ground truth as the utility measure. Our contributions are two-fold. First, we analyze two efficient algorithms based on the sample-aggregate framework [22] and established formal privacy and utility guarantees when data are generated from some stochastic model or satisfy certain deterministic separation conditions. New results on (non-private) subspace clus... |
128 | Robust recovery of subspace structures by low-rank representation. Available at http://arxiv.org/abs/1010
- Liu, Lin, et al.
Citation Context ...sults We provide numerical results of both the sample-aggregate and Gibbs sampling algorithms on synthetic and real-world datasets. We also compare with a baseline method implemented based on the k-plane algorithm [3] with perturbed sample covariance matrix via the SuLQ framework [2] (details presented in Appendix E). Three solvers are considered for the sample-aggregate framework: threshold-based subspace clustering (TSC, [14]), which has provable utility guarantee with sampleaggregation on stochastic models, along with sparse subspace clustering (SSC, [11]) and low-rank representation (LRR, [17]), the two state-of-the-art methods for subspace clustering. For Gibbs sampling, we use non-private SSC and LRR solutions as initialization for the Gibbs sampler. All methods are implemented using Matlab. For synthetic datasets, we first generate k random q-dimensional linear subspaces. Each subspace is generated by first sampling a d× q random Gaussian matrix and then recording its column space. n data points are then assigned to one of the k subspaces (clusters) uniformly at random. To generate a data point xi assigned with subspace S`, we first sample yi ∈ Rq with ‖yi‖2 = 1 uniformly at ran... |
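A minimal NumPy sketch of the synthetic-data generation just described (random q-dimensional subspaces taken as the column space of a d × q Gaussian matrix, uniform cluster assignment, unit-norm coefficient vectors); the additive noise term follows the paper's stochastic model stated elsewhere in these excerpts, and its scale here is an assumption:

    import numpy as np

    def generate_synthetic(n, d, q, k, sigma=0.0, seed=0):
        rng = np.random.default_rng(seed)
        # k random q-dimensional subspaces: orthonormal bases of Gaussian column spaces
        bases = [np.linalg.qr(rng.normal(size=(d, q)))[0] for _ in range(k)]
        labels = rng.integers(k, size=n)             # assign points to clusters uniformly
        X = np.empty((n, d))
        for i, z in enumerate(labels):
            y = rng.normal(size=q)
            y /= np.linalg.norm(y)                   # ||y||_2 = 1, uniform on the unit sphere
            noise = rng.normal(scale=sigma / np.sqrt(d), size=d)  # assumed N(0, sigma^2/d * I_d)
            X[i] = bases[z] @ y + noise
        return X, labels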
120 |
Asymptotics of sample eigenstructure for a large dimensional spiked covariance model, Statistica Sinica 17
- Paul
- 2007
Citation Context ... E^(ℓ) are stochastic, by standard analysis of PCA one can show that the top-q subspace of X^(ℓ) converges to the underlying subspace S*_ℓ in probability as the number of data points n_ℓ goes to infinity [43]. The theorem then holds because m = o(n) and hence n′ = n/m → ∞. C.1 Proof of Lemma 3.5. Proposition C.2. Suppose y_i = x_i + ε_i with ε_i ∼ N(0, σ²/d · I_d) and σ > 0. Then with probability at least 1 − n²e^{−... |
115 | Clustering appearances of objects under varying illumination conditions
- Ho, Yang, et al.
- 2003
Citation Context ...rmal privacy and utility guarantees; the other one asymptotically preserves differential privacy while having good performance in practice. Along the course of the proof, we also obtain two new provable guarantees for the agnostic subspace clustering and the graph connectivity problem which might be of independent interests. 1 Introduction Subspace clustering was originally proposed to solve very specific computer vision problems having a union-of-subspace structure in the data, e.g., motion segmentation under an affine camera model [11] or face clustering under Lambertian illumination models [15]. As it gains increasing attention in the statistics and machine learning community, people start to use it as an agnostic learning tool in social network [5], movie recommendation [33] and biological datasets [19]. The growing applicability of subspace clustering in these new domains inevitably raises the concern of data privacy, as many such applications involve dealing with sensitive information. For example, [19] applies subspace clustering to identify diseases from personalized medical data and [33] in fact uses subspace clustering as a effective tool to conduct linkage attacks on individ... |
102 | The Littlewood-Offord problem and invertibility of random matrices
- Rudelson, Vershynin
Citation Context ...s with privacy parameters ε and δ. Then the overall algorithm is (ε′, δ′)-differentially private with ε′ = √(2kT ln(1/δ))·ε + kTε(e^ε − 1), δ′ = (kT + 1)δ. Appendix F Concentration theorems. Theorem F.1 ([44], Theorem 1.2). Let A be an n × n matrix with entries i.i.d. sampled from the standard Gaussian distribution. Then there exist absolute constants c1 > 0, 0 < c2 < 1 such that for every t > 0, Pr[σ_n(A)... |
95 | Sparse subspace clustering: Algorithm, theory, and applications
- Elhamifar, Vidal
- 2013
Citation Context ...y and experiments that one of the presented methods enjoys formal privacy and utility guarantees; the other one asymptotically preserves differential privacy while having good performance in practice. Along the course of the proof, we also obtain two new provable guarantees for the agnostic subspace clustering and the graph connectivity problem which might be of independent interests. 1 Introduction Subspace clustering was originally proposed to solve very specific computer vision problems having a union-of-subspace structure in the data, e.g., motion segmentation under an affine camera model [11] or face clustering under Lambertian illumination models [15]. As it gains increasing attention in the statistics and machine learning community, people start to use it as an agnostic learning tool in social network [5], movie recommendation [33] and biological datasets [19]. The growing applicability of subspace clustering in these new domains inevitably raises the concern of data privacy, as many such applications involve dealing with sensitive information. For example, [19] applies subspace clustering to identify diseases from personalized medical data and [33] in fact uses subspace cluster... |
83 | The Effectiveness of Lloyd-Type Methods for the k-Means Problem.” FOCS
- Ostrovsky, Rabani, et al.
- 2006
Citation Context ...ition 3.2 (Well-separation condition for k-means subspace clustering). A dataset X is (φ, η, ψ)-well separated if there exist constants φ, η and ψ, all between 0 and 1, such that ∆²_k(X) ≤ min{ φ²∆²_{k−1}(X), ∆²_{k,−}(X) − ψ, ∆²_{k,+}(X) + η }, (6) where ∆_{k−1}, ∆_{k,−} and ∆_{k,+} are defined as ∆²_{k−1}(X) = min_{S_{1:k−1} ∈ S^d_q} cost({S_i}; X); ∆²_{k,−}(X) = min_{S_1 ∈ S^d_{q−1}, S_{2:k} ∈ S^d_q} cost({S_i}; X); and ∆²_{k,+}(X) = min_{S_1 ∈ S^d_{q+1}, S_{2:k} ∈ S^d_q} cost({S_i}; X). The first condition in Eq. (6), ∆²_k(X) ≤ φ²∆²_{k−1}(X), constrains that the input dataset X cannot be well clustered using k − 1 instead of k clusters. It was introduced in [23] to analyze stability of k-means solutions. For subspace clustering, we need another two conditions regarding the intrinsic dimension of each subspace. The condition ∆²_k(X) ≤ ∆²_{k,−}(X) − ψ asserts that replacing a q-dimensional subspace with a (q − 1)-dimensional one is not sufficient, while ∆²_k(X) ≤ ∆²_{k,+}(X) + η means an additional subspace dimension does not help much with clustering X. The following lemma is our main stability result for subspace clustering on well-separated datasets. It states that when a candidate clustering C is close to the optimal clustering C∗ in terms of clustering cost, ... |
77 | k-plane clustering
- Bradley, Mangasarian
Citation Context ...ℓ)); U_ℓ ∈ R^{d×q}, U_ℓ^T U_ℓ = I_{q×q}, (12) where A_ℓ = X^(ℓ) X^(ℓ)T is the unnormalized sample covariance matrix. Distribution of the form in Eq. (12) is a special case of the matrix Bingham distribution, which admits a Gibbs sampler [16]. We give implementation details in Appendix D.2 with modifications so that the resulting Gibbs sampler is empirically more efficient for a wide range of parameter settings. [Footnote 3: Recently [28] established full clustering guarantee for SSC, however, under strong assumptions.] 4.2 Discussion. The proposed Gibbs sampler resembles the k-plane algorithm for subspace clustering [3]. It is in fact a "probabilistic" version of k-plane since sampling is performed at each iteration rather than deterministic updates. Furthermore, the proposed Gibbs sampler could be viewed as posterior sampling for the following generative model: first sample U_ℓ uniformly at random from S^d_q for each subspace S_ℓ; afterwards, cluster assignments {z_i}_{i=1}^n are sampled such that Pr[z_i = j] = 1/k and x_i is set as x_i = U_ℓ y_i + P_{U_ℓ^⊥} w_i, where y_i is sampled uniformly at random from the q-dimensional unit ball and w_i ∼ N(0, I_d/ε). Connection between the above-mentioned generative model and Gibbs sampler... |
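For intuition, a small NumPy sketch of the generative model described above (U_l drawn uniformly over orthonormal frames via QR of a Gaussian matrix, y_i uniform in the q-dimensional unit ball, and noise w_i ~ N(0, I_d/eps) restricted to the orthogonal complement of U_l); this mirrors the text of the excerpt, not the authors' implementation:

    import numpy as np

    def sample_generative(n, d, q, k, eps, seed=0):
        rng = np.random.default_rng(seed)
        # U_l uniform over d x q orthonormal frames (QR of a Gaussian matrix)
        U = [np.linalg.qr(rng.normal(size=(d, q)))[0] for _ in range(k)]
        z = rng.integers(k, size=n)                  # Pr[z_i = j] = 1/k
        X = np.empty((n, d))
        for i, l in enumerate(z):
            y = rng.normal(size=q)
            y *= rng.uniform() ** (1.0 / q) / np.linalg.norm(y)  # uniform in the unit ball
            w = rng.normal(scale=1.0 / np.sqrt(eps), size=d)     # w_i ~ N(0, I_d / eps)
            w -= U[l] @ (U[l].T @ w)                 # restrict noise to the complement of U_l
            X[i] = U[l] @ y + w
        return X, z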
63 | A geometric analysis of subspace clustering with outliers
- Soltanolkotabi, Candes
Citation Context ... for agnostic k-means subspace clustering, with additional amount of per-coordinate Gaussian noise upper bounded by O(φ²√k/(ε(ψ−η))). Our bound is comparable to the one obtained in [22] for private k-means clustering, except for the (ψ−η) term which characterizes the well-separatedness under the subspace clustering scenario. 3.3 The stochastic setting. We further consider the case when data points are stochastically generated from some underlying "true" subspace set C* = {S*_1, · · · , S*_k}. Such settings were extensively investigated in previous development of subspace clustering algorithms [24, 25, 14]. Below we give precise definition of the considered stochastic subspace clustering model: The stochastic model. For every cluster ℓ associated with subspace S*_ℓ, a data point x_i^(ℓ) ∈ R^d belonging to cluster ℓ can be written as x_i^(ℓ) = y_i^(ℓ) + ε_i^(ℓ), where y_i^(ℓ) is sampled uniformly at random from {y ∈ S*_ℓ : ‖y‖_2 = 1} and ε_i ∼ N(0, σ²/d · I_d) for some noise parameter σ. Under the stochastic setting we consider the solver f to be the Threshold-based Subspace Clustering (TSC, [14]) algorithm. A simplified version of TSC is presented in Alg. 2. An alternative idea is to apply results in... |
45 | Clustering partially observed graphs via convex optimization
- Jalali, Chen, et al.
- 2011
Citation Context ... the proof, we also obtain two new provable guarantees for the agnostic subspace clustering and the graph connectivity problem which might be of independent interests. 1 Introduction Subspace clustering was originally proposed to solve very specific computer vision problems having a union-of-subspace structure in the data, e.g., motion segmentation under an affine camera model [11] or face clustering under Lambertian illumination models [15]. As it gains increasing attention in the statistics and machine learning community, people start to use it as an agnostic learning tool in social network [5], movie recommendation [33] and biological datasets [19]. The growing applicability of subspace clustering in these new domains inevitably raises the concern of data privacy, as many such applications involve dealing with sensitive information. For example, [19] applies subspace clustering to identify diseases from personalized medical data and [33] in fact uses subspace clustering as a effective tool to conduct linkage attacks on individuals in movie rating datasets. Nevertheless, privacy issues in subspace clustering have been less explored in the past literature, with the only exception of ... |
32 | The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, - Dwork, Roth - 2013 |
22 | Robust subspace clustering.
- Soltanolkotabi, Elhamifar, et al.
- 2014
Citation Context ... for agnostic k-means subspace clustering, with additional amount of per-coordinate Gaussian noise upper bounded by O(φ²√k/(ε(ψ−η))). Our bound is comparable to the one obtained in [22] for private k-means clustering, except for the (ψ−η) term which characterizes the well-separatedness under the subspace clustering scenario. 3.3 The stochastic setting. We further consider the case when data points are stochastically generated from some underlying "true" subspace set C* = {S*_1, · · · , S*_k}. Such settings were extensively investigated in previous development of subspace clustering algorithms [24, 25, 14]. Below we give precise definition of the considered stochastic subspace clustering model: The stochastic model. For every cluster ℓ associated with subspace S*_ℓ, a data point x_i^(ℓ) ∈ R^d belonging to cluster ℓ can be written as x_i^(ℓ) = y_i^(ℓ) + ε_i^(ℓ), where y_i^(ℓ) is sampled uniformly at random from {y ∈ S*_ℓ : ‖y‖_2 = 1} and ε_i ∼ N(0, σ²/d · I_d) for some noise parameter σ. Under the stochastic setting we consider the solver f to be the Threshold-based Subspace Clustering (TSC, [14]) algorithm. A simplified version of TSC is presented in Alg. 2. An alternative idea is to apply results in... |
21 | Turning big data into tiny data: Constant-size coresets for kmeans, pca and projective clustering.
- Feldman, Schmidt, et al.
- 2013
Citation Context ... to prove stability results as outlined in Eq. (5) for particular subspace clustering solvers and then apply Theorem 3.1. 3.2 The agnostic setting We first consider the setting when data points {xi}ni=1 are arbitrarily placed. Under such agnostic setting the optimal solution C∗ is defined as the one that minimizes the k-means cost as in Eq. (3). The solver f is taken to be any (1 + )-approximation2 of optimal k-means subspace clustering; that is, f always outputs subspaces C satisfying cost(C;X ) ≤ (1 + )cost(C∗;X ). Efficient core-set based approximation algorithms exist, for example, in [12]. The key task of this section it to identify assumptions under which the stability condition in Eq. (5) holds with respect to an approximate solver f . The example given in Figure 1 also suggests that identifiability issue arises when the input data X itself cannot be well clustered. For example, no two straight lines could well approximate data uniformly distributed on a circle. To circumvent the above-mentioned difficulty, we impose the following well-separation condition on the input data X : Definition 3.2 (Well-separation condition for k-means subspace clustering). A dataset X is (φ, η, ... |
16 | On concentration of probability.
- Janson
- 2002
Citation Context ...)(q − 1) ≤ 0.02. (17) The second inequality is an application of Eq. (5.2) in [42] and the last inequality is due to Eq. (16). On the other hand, by tail bounds of the binomial distribution (Theorem 1 in [41]) we have Pr[|C(a_i^(ℓ), rθ*)| > n_ℓ(p + 0.01)] ≤ e^{−0.01²n_ℓ²/(2(pn_ℓ + 0.01n_ℓ/3))} ≤ e^{−n_ℓ/400}, (18) where in the last inequality we used the fact that pn_ℓ ≤ 0.01n_ℓ. Since s̄ ≤ 0.02n_ℓ ≤ s̃, we proved that w... |
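For orientation, Theorem 1 in [41] is used here as a Bernstein-type tail bound for the binomial distribution; a standard form consistent with the exponent above (stated from memory, not quoted from [41]) is, for X ~ Bin(n, p) and t > 0,

    \Pr[X \ge np + t] \le \exp\!\left(-\frac{t^{2}}{2\,(np + t/3)}\right),

which, instantiated with t = 0.01 n_ℓ, gives the exponent displayed in Eq. (18).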
16 |
Mixtures of probabilistic principal component analyzers.
- Tipping, Bishop
- 1999
Citation Context ... (non-private) subspace clustering are obtained along our analysis, including a fully agnostic subspace clustering on well-separated datasets using stability arguments and exact clustering guarantee for thresholding-based subspace clustering (TSC, [14]) in the noisy setting. In addition, we employ the exponential mechanism [18] and propose a novel Gibbs sampler for sampling from this distribution, which involves a novel tweak in sampling from a matrix Bingham distribution. The method works well in practice and we show it is closely related to the well-known mixtures of probabilistic PCA model [27]. Related work Subspace clustering can be thought as a generalization of PCA and k-means clustering. The former aims at finding a single low-dimensional subspace and the latter uses zerodimensional subspaces as cluster centers. There has been extensive research on private PCA [2, 4, 10] and k-means [2, 22, 26]. Perhaps the most similar work to ours is [22, 4]. [22] applies the sample-aggregate framework to k-means clustering and [4] employs the exponential mechanism to recover private principal vectors. In this paper we give non-trivial generalization of both work to the private subspace clust... |
14 | Noisy sparse subspace clustering.
- Wang, Xu
- 2013
Citation Context ...ecause SuLQ subspace clustering requires calibrating noise to a d × d sample covariance matrix, which induces much error when d is large. Gibbs sampling seems to be robust to various d settings. We also experiment on real-world datasets. The right two plots in Figure 2 report utility on a subset of the extended Yale Face Dataset B [13] for face clustering. 5 random individuals are picked, forming a subset of the original dataset with n = 320 data points (images). The dataset is preprocessed by projecting each individual onto a 9D affine subspace via PCA. Such preprocessing step was adopted in [32, 29] and was theoretically justified in [1]. Afterwards, ambient dimension of the entire dataset is reduced to d = 50 by random Gaussian projection. The plots show that Gibbs sampling significantly outperforms the other algorithms. [Footnote 4: In MPPCA latent variables y_i are sampled from a normal distribution N(0, ρ²I_q).] [Figure 2 plot residue omitted; axes: K-means cost vs. log10 ε; legend: s.a. SSC, s.a. TSC, s.a. LRR, exp. SSC, exp. LRR, SuLQ-10, SuLQ-50] ... |
11 | Graph connectivity in sparse subspace clustering.
- Nasihatkon, Hartley
- 2011
Citation Context ...sufficient side information. It is perhaps reasonable why there is little work focusing on private subspace clustering, which is by all means a challenging task. For example, a negative result in [29] shows that if utility is measured in terms of exact clustering, then no private subspace clustering algorithm exists when neighboring databases are allowed to differ on an entire user profile. In addition, state-of-the-art subspace clustering methods like Sparse Subspace Clustering (SSC, [11]) lack a complete analysis of its clustering output, thanks to the notorious “graph connectivity” problem [21]. Finally, clustering could have high global sensitivity even if only cluster centers are released, as depicted in Figure 1. As a result, general private data releasing schemes like output perturbation [7, 8, 2] do not apply. In this work, we present a systematic and principled treatment of differentially private subspace clustering. To circumvent the negative result in [29], we use the perturbation of recovered low1 dimensional subspace from the ground truth as the utility measure. Our contributions are two-fold. First, we analyze two efficient algorithms based on the sample-aggregate framewo... |
10 | Robust subspace clustering via thresholding. arXiv:1307.4891,
- Heckel, Bolcskei
- 2013
Citation Context ...l subspace from the ground truth as the utility measure. Our contributions are two-fold. First, we analyze two efficient algorithms based on the sample-aggregate framework [22] and established formal privacy and utility guarantees when data are generated from some stochastic model or satisfy certain deterministic separation conditions. New results on (non-private) subspace clustering are obtained along our analysis, including a fully agnostic subspace clustering on well-separated datasets using stability arguments and exact clustering guarantee for thresholding-based subspace clustering (TSC, [14]) in the noisy setting. In addition, we employ the exponential mechanism [18] and propose a novel Gibbs sampler for sampling from this distribution, which involves a novel tweak in sampling from a matrix Bingham distribution. The method works well in practice and we show it is closely related to the well-known mixtures of probabilistic PCA model [27]. Related work Subspace clustering can be thought as a generalization of PCA and k-means clustering. The former aims at finding a single low-dimensional subspace and the latter uses zerodimensional subspaces as cluster centers. There has been exten... |
8 | Diameter bounds for equal area partitions of the unit sphere, Electron
- Leopardi
Citation Context ...roof is then completed by noting that |⟨y_i, y_j⟩ − ⟨x_i, x_j⟩| ≤ |⟨ε_i, x_j⟩| + |⟨ε_j, x_i⟩| + |⟨ε_i, ε_j⟩| ≤ (2√5·σ + 5σ²)·√(6 log n / d). Lemma C.3 ([14], Lemma 3; extracted from the proof of Lemma 6.2 in [42]). Let S^{d−1} = {x ∈ R^d : ‖x‖_2 = 1} denote the d-dimensional unit sphere. For an arbitrary p ∈ S^{d−1}, define C(p, θ) = {q ∈ S^{d−1} : ϑ(p, q) ≤ θ} where ϑ(p, q) = arccos(⟨p, q⟩) is the angle between p and q.... |
7 | Subspace clustering of high-dimensional data: a predictive approach.
- McWilliams, Montana
- 2014
Citation Context ...for the agnostic subspace clustering and the graph connectivity problem which might be of independent interests. 1 Introduction Subspace clustering was originally proposed to solve very specific computer vision problems having a union-of-subspace structure in the data, e.g., motion segmentation under an affine camera model [11] or face clustering under Lambertian illumination models [15]. As it gains increasing attention in the statistics and machine learning community, people start to use it as an agnostic learning tool in social network [5], movie recommendation [33] and biological datasets [19]. The growing applicability of subspace clustering in these new domains inevitably raises the concern of data privacy, as many such applications involve dealing with sensitive information. For example, [19] applies subspace clustering to identify diseases from personalized medical data and [33] in fact uses subspace clustering as a effective tool to conduct linkage attacks on individuals in movie rating datasets. Nevertheless, privacy issues in subspace clustering have been less explored in the past literature, with the only exception of a brief analysis and discussion in [29]. However, the al... |
7 | Guess who rated this movie: Identifying users through subspace clustering.
- Zhang, Fawaz, et al.
- 2012
Citation Context ... two new provable guarantees for the agnostic subspace clustering and the graph connectivity problem which might be of independent interests. 1 Introduction Subspace clustering was originally proposed to solve very specific computer vision problems having a union-of-subspace structure in the data, e.g., motion segmentation under an affine camera model [11] or face clustering under Lambertian illumination models [15]. As it gains increasing attention in the statistics and machine learning community, people start to use it as an agnostic learning tool in social network [5], movie recommendation [33] and biological datasets [19]. The growing applicability of subspace clustering in these new domains inevitably raises the concern of data privacy, as many such applications involve dealing with sensitive information. For example, [19] applies subspace clustering to identify diseases from personalized medical data and [33] in fact uses subspace clustering as a effective tool to conduct linkage attacks on individuals in movie rating datasets. Nevertheless, privacy issues in subspace clustering have been less explored in the past literature, with the only exception of a brief analysis and discus... |
4 | Robust and private bayesian inference.
- Dimitrakakis, Nelson, et al.
- 2014
Citation Context ... sampled uniformly at random from the qdimensional unit ball and wi ∼ N (0, Id/ε). Connection between the above-mentioned generative model and Gibbs sampler is formally justified in Appendix D.3. The generative model is strikingly similar to the well-known mixtures of probabilistic PCA (MPPCA, [27]) model by setting variance parameters σ` in MPPCA to √ 1/ε. The only difference is that yi are sampled uniformly at random from a unit ball 4 and noisewi is constrained to U⊥` , the complement space of U`. Note that this is closely related to earlier observation that “posterior sampling is private” [20, 6, 31], but different in that we constructed a model from a private procedure rather than the other way round. As the privacy parameter ε → ∞ (i.e., no privacy guarantee), we arrive immediately at the exact k-plane algorithm and the posterior distribution concentrates around the optimal k-means solution (C∗, z∗). This behavior is similar to what a small-variance asymptotic analysis on MPPCA models reveals [30]. On the other hand, the proposed Gibbs sampler is significantly different from previous Bayesian probabilisitic PCA formulation [34, 30] in that the subspaces are sampled from a matrix Bingham... |
3 |
Analyze Gauss: Optimal bounds for privacy-preserving principal component analysis.
- Dwork, Talwar, et al.
- 2014
Citation Context ...dition, we employ the exponential mechanism [18] and propose a novel Gibbs sampler for sampling from this distribution, which involves a novel tweak in sampling from a matrix Bingham distribution. The method works well in practice and we show it is closely related to the well-known mixtures of probabilistic PCA model [27]. Related work Subspace clustering can be thought as a generalization of PCA and k-means clustering. The former aims at finding a single low-dimensional subspace and the latter uses zerodimensional subspaces as cluster centers. There has been extensive research on private PCA [2, 4, 10] and k-means [2, 22, 26]. Perhaps the most similar work to ours is [22, 4]. [22] applies the sample-aggregate framework to k-means clustering and [4] employs the exponential mechanism to recover private principal vectors. In this paper we give non-trivial generalization of both work to the private subspace clustering setting. 2 Preliminaries 2.1 Notations For a vector x ∈ Rd, its p-norm is defined as ‖x‖p = ( ∑ i x p i ) 1/p. If p is not explicitly specified then the 2-norm is used. For a matrix A ∈ Rn×m, we use σ1(A) ≥ · · · ≥ σn(A) ≥ 0 to denote its singular values (assuming without loss of ... |
3 | Bayesian inference on principal component analysis using reversible jump markov chain monte carlo.
- Zhang, Chan, et al.
- 2004
Citation Context ...ier observation that “posterior sampling is private” [20, 6, 31], but different in that we constructed a model from a private procedure rather than the other way round. As the privacy parameter ε → ∞ (i.e., no privacy guarantee), we arrive immediately at the exact k-plane algorithm and the posterior distribution concentrates around the optimal k-means solution (C∗, z∗). This behavior is similar to what a small-variance asymptotic analysis on MPPCA models reveals [30]. On the other hand, the proposed Gibbs sampler is significantly different from previous Bayesian probabilisitic PCA formulation [34, 30] in that the subspaces are sampled from a matrix Bingham distribution. Finally, we remark that the proposed Gibbs sampler is only asymptotically private because Proposition 4.1 requires exact (or nearly exact [31]) sampling from Eq. (10). 5 Numerical results We provide numerical results of both the sample-aggregate and Gibbs sampling algorithms on synthetic and real-world datasets. We also compare with a baseline method implemented based on the k-plane algorithm [3] with perturbed sample covariance matrix via the SuLQ framework [2] (details presented in Appendix E). Three solvers are considere... |
2 |
Near-optimal algorithms for differentially private principal components.
- Chaudhuri, Sarwate, et al.
- 2012
Citation Context ...dition, we employ the exponential mechanism [18] and propose a novel Gibbs sampler for sampling from this distribution, which involves a novel tweak in sampling from a matrix Bingham distribution. The method works well in practice and we show it is closely related to the well-known mixtures of probabilistic PCA model [27]. Related work Subspace clustering can be thought as a generalization of PCA and k-means clustering. The former aims at finding a single low-dimensional subspace and the latter uses zerodimensional subspaces as cluster centers. There has been extensive research on private PCA [2, 4, 10] and k-means [2, 22, 26]. Perhaps the most similar work to ours is [22, 4]. [22] applies the sample-aggregate framework to k-means clustering and [4] employs the exponential mechanism to recover private principal vectors. In this paper we give non-trivial generalization of both work to the private subspace clustering setting. 2 Preliminaries 2.1 Notations For a vector x ∈ Rd, its p-norm is defined as ‖x‖p = ( ∑ i x p i ) 1/p. If p is not explicitly specified then the 2-norm is used. For a matrix A ∈ Rn×m, we use σ1(A) ≥ · · · ≥ σn(A) ≥ 0 to denote its singular values (assuming without loss of ... |
2 |
Privacy for free: Posterior sampling and stochastic gradient monte carlo.
- Wang, Fienberg, et al.
- 2015
Citation Context ... sampled uniformly at random from the qdimensional unit ball and wi ∼ N (0, Id/ε). Connection between the above-mentioned generative model and Gibbs sampler is formally justified in Appendix D.3. The generative model is strikingly similar to the well-known mixtures of probabilistic PCA (MPPCA, [27]) model by setting variance parameters σ` in MPPCA to √ 1/ε. The only difference is that yi are sampled uniformly at random from a unit ball 4 and noisewi is constrained to U⊥` , the complement space of U`. Note that this is closely related to earlier observation that “posterior sampling is private” [20, 6, 31], but different in that we constructed a model from a private procedure rather than the other way round. As the privacy parameter ε → ∞ (i.e., no privacy guarantee), we arrive immediately at the exact k-plane algorithm and the posterior distribution concentrates around the optimal k-means solution (C∗, z∗). This behavior is similar to what a small-variance asymptotic analysis on MPPCA models reveals [30]. On the other hand, the proposed Gibbs sampler is significantly different from previous Bayesian probabilisitic PCA formulation [34, 30] in that the subspaces are sampled from a matrix Bingham... |
1 |
Simulation of the matrix bingham-conmises-fisher distribution, with applications to multivariate and relational data.
- Hoff
- 2009
Citation Context ...ily done by sampling z_j from a categorical distribution. Update of S_ℓ: Let X^(ℓ) = {x_i ∈ X : z_i = ℓ} denote data points that are assigned to cluster ℓ and n_ℓ = |X^(ℓ)|. Denote X^(ℓ) ∈ R^{d×n_ℓ} as the matrix with columns corresponding to all data points in X^(ℓ). The distribution over S_ℓ conditioned on z can then be written as p(S_ℓ = range(U_ℓ) | z; X) ∝ exp(ε/2 · tr(U_ℓ^T A_ℓ U_ℓ)); U_ℓ ∈ R^{d×q}, U_ℓ^T U_ℓ = I_{q×q}, (12) where A_ℓ = X^(ℓ) X^(ℓ)T is the unnormalized sample covariance matrix. Distribution of the form in Eq. (12) is a special case of the matrix Bingham distribution, which admits a Gibbs sampler [16]. We give implementation details in Appendix D.2 with modifications so that the resulting Gibbs sampler is empirically more efficient for a wide range of parameter settings. [Footnote 3: Recently [28] established full clustering guarantee for SSC, however, under strong assumptions.] 4.2 Discussion. The proposed Gibbs sampler resembles the k-plane algorithm for subspace clustering [3]. It is in fact a "probabilistic" version of k-plane since sampling is performed at each iteration rather than deterministic updates. Furthermore, the proposed Gibbs sampler could be viewed as posterior sampling for the follow... |
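For orientation, a short NumPy sketch of the quantities entering Eq. (12): the per-cluster scatter matrix A_l and the unnormalized log-density evaluated at an orthonormal basis U of a candidate subspace. Actually sampling from this matrix Bingham distribution requires the Gibbs scheme of [16] and is not shown here:

    import numpy as np

    def bingham_log_density(U, X_l, eps):
        """Unnormalized log p(S = range(U) | z; X) = eps/2 * tr(U^T A U), with A = X_l X_l^T.

        U   : d x q orthonormal basis (U^T U = I_q)
        X_l : d x n_l matrix whose columns are the points currently assigned to cluster l
        """
        A = X_l @ X_l.T                  # unnormalized sample covariance, d x d
        return 0.5 * eps * np.trace(U.T @ A @ U)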
1 |
Differential privacy: an exploration of the privacy-utility landscape.
- Mir
- 2013
Citation Context ... sampled uniformly at random from the qdimensional unit ball and wi ∼ N (0, Id/ε). Connection between the above-mentioned generative model and Gibbs sampler is formally justified in Appendix D.3. The generative model is strikingly similar to the well-known mixtures of probabilistic PCA (MPPCA, [27]) model by setting variance parameters σ` in MPPCA to √ 1/ε. The only difference is that yi are sampled uniformly at random from a unit ball 4 and noisewi is constrained to U⊥` , the complement space of U`. Note that this is closely related to earlier observation that “posterior sampling is private” [20, 6, 31], but different in that we constructed a model from a private procedure rather than the other way round. As the privacy parameter ε → ∞ (i.e., no privacy guarantee), we arrive immediately at the exact k-plane algorithm and the posterior distribution concentrates around the optimal k-means solution (C∗, z∗). This behavior is similar to what a small-variance asymptotic analysis on MPPCA models reveals [30]. On the other hand, the proposed Gibbs sampler is significantly different from previous Bayesian probabilisitic PCA formulation [34, 30] in that the subspaces are sampled from a matrix Bingham... |
1 |
Differentially private k-means clustering. arXiv,
- Su, Cao, et al.
- 2015
Citation Context ...xponential mechanism [18] and propose a novel Gibbs sampler for sampling from this distribution, which involves a novel tweak in sampling from a matrix Bingham distribution. The method works well in practice and we show it is closely related to the well-known mixtures of probabilistic PCA model [27]. Related work Subspace clustering can be thought as a generalization of PCA and k-means clustering. The former aims at finding a single low-dimensional subspace and the latter uses zerodimensional subspaces as cluster centers. There has been extensive research on private PCA [2, 4, 10] and k-means [2, 22, 26]. Perhaps the most similar work to ours is [22, 4]. [22] applies the sample-aggregate framework to k-means clustering and [4] employs the exponential mechanism to recover private principal vectors. In this paper we give non-trivial generalization of both work to the private subspace clustering setting. 2 Preliminaries 2.1 Notations For a vector x ∈ Rd, its p-norm is defined as ‖x‖p = ( ∑ i x p i ) 1/p. If p is not explicitly specified then the 2-norm is used. For a matrix A ∈ Rn×m, we use σ1(A) ≥ · · · ≥ σn(A) ≥ 0 to denote its singular values (assuming without loss of generality that n ≤ m). ... |
1 | Clustering consistent sparse subspace clustering. arXiv,
- Wang, Wang, et al.
- 2015
Citation Context ...d×n_ℓ} as the matrix with columns corresponding to all data points in X^(ℓ). The distribution over S_ℓ conditioned on z can then be written as p(S_ℓ = range(U_ℓ) | z; X) ∝ exp(ε/2 · tr(U_ℓ^T A_ℓ U_ℓ)); U_ℓ ∈ R^{d×q}, U_ℓ^T U_ℓ = I_{q×q}, (12) where A_ℓ = X^(ℓ) X^(ℓ)T is the unnormalized sample covariance matrix. Distribution of the form in Eq. (12) is a special case of the matrix Bingham distribution, which admits a Gibbs sampler [16]. We give implementation details in Appendix D.2 with modifications so that the resulting Gibbs sampler is empirically more efficient for a wide range of parameter settings. [Footnote 3: Recently [28] established full clustering guarantee for SSC, however, under strong assumptions.] 4.2 Discussion. The proposed Gibbs sampler resembles the k-plane algorithm for subspace clustering [3]. It is in fact a "probabilistic" version of k-plane since sampling is performed at each iteration rather than deterministic updates. Furthermore, the proposed Gibbs sampler could be viewed as posterior sampling for the following generative model: first sample U_ℓ uniformly at random from S^d_q for each subspace S_ℓ; afterwards, cluster assignments {z_i}_{i=1}^n are sampled such that Pr[z_i = j] = 1/k and x_i is set as x_i... |
1 | A deterministic analysis of noisy sparse subspace clustering for dimensionality-reduced data. In
- Wang, Wang, et al.
- 2015
Citation Context ...logical datasets [19]. The growing applicability of subspace clustering in these new domains inevitably raises the concern of data privacy, as many such applications involve dealing with sensitive information. For example, [19] applies subspace clustering to identify diseases from personalized medical data and [33] in fact uses subspace clustering as a effective tool to conduct linkage attacks on individuals in movie rating datasets. Nevertheless, privacy issues in subspace clustering have been less explored in the past literature, with the only exception of a brief analysis and discussion in [29]. However, the algorithms and analysis presented in [29] have several notable deficiencies. For example, data points are assumed to be incoherent and it only protects the differential privacy of any feature of a user rather than the entire user profile in the database. The latter means it is possible for an attacker to infer with high confidence whether a particular user is in the database, given sufficient side information. It is perhaps reasonable why there is little work focusing on private subspace clustering, which is by all means a challenging task. For example, a negative result in [29]... |
1 | DP-space: Bayesian nonparametric subspace clustering with small-variance asymptotic analysis.
- Wang, Zhu
- 2015
Citation Context ...t random from a unit ball 4 and noisewi is constrained to U⊥` , the complement space of U`. Note that this is closely related to earlier observation that “posterior sampling is private” [20, 6, 31], but different in that we constructed a model from a private procedure rather than the other way round. As the privacy parameter ε → ∞ (i.e., no privacy guarantee), we arrive immediately at the exact k-plane algorithm and the posterior distribution concentrates around the optimal k-means solution (C∗, z∗). This behavior is similar to what a small-variance asymptotic analysis on MPPCA models reveals [30]. On the other hand, the proposed Gibbs sampler is significantly different from previous Bayesian probabilisitic PCA formulation [34, 30] in that the subspaces are sampled from a matrix Bingham distribution. Finally, we remark that the proposed Gibbs sampler is only asymptotically private because Proposition 4.1 requires exact (or nearly exact [31]) sampling from Eq. (10). 5 Numerical results We provide numerical results of both the sample-aggregate and Gibbs sampling algorithms on synthetic and real-world datasets. We also compare with a baseline method implemented based on the k-plane algori... |