
## A Three Component Latent Class Model for Robust Semiparametric Gene Discovery

### Citations

8731 |
Controlling the false discovery rate: a practical and powerful approach to multiple testing
- Benjamini
- 1995
Citation Context ..., and biological significance, that usually translates into a lower bound for the effect size (Kirk, 2007). Statistical significance is usually addressed by means of hypothesis testing and p-values, but in microarray experiments one must take into account the multiplicity issue. There are currently a lot of alternative ways for defining an error rate and keeping it under control and one can look at Westfall and Young (1993), and Farcomeni (2008) for a review of recent developments. Many modern procedures privilege in a first stage the control of the False Discovery Rate (FDR), as conceived in Benjamini and Hochberg (1995), and discard only at a second stage those genes with a fold-change between two experimental/biological conditions too close to 1. Formally, it is requested that the ratio between the average expression in one experimental condition is at least c times the average in the other condition, where typically c = 2, see e.g. Tusher, Tibshirani, and Chu (2001), Sabatti, Karsten, and Geschwind (2002). However, applying the criterion on the fold change after testing is inefficient, since it may deflate the power of the multiple testing procedure and even lead to loss of control (see the simulation stud... |

2479 |
Significance analysis of microarrays applied to the ionizing radiation response
- Tusher, Tibshirani, et al.
- 2001
Citation Context ...ding the effect size into the selection criterion is illustrated also by Yao, Rakhade, Li, Ahmed, Krauss, Draghici, and Leob (2004), who compare “selected” and “validated” genes. Also Zhu, Hero, Qin, and Swaroop (2005) link statistical significance with effect size. Alfo, Farcomeni, and Tardella (2007) and van de Wiel and Kim (2007) do this in the two-classes case with a paired design, briefly discussing generalizations. Other techniques are available that use the c-fold rule together with significance assessment, the most well-known being probably Significance Analysis of Microarrays (SAM) (Tusher et al., 2001). Alfo et al. (2007) focus on discovering up regulated genes as a two-class problem, merging down regulated with not differentially expressed genes. The discovery of down regulated genes can be implemented with a separate similar strategy. The procedure obviously loses power when merging two clusters of genes. van de Wiel and Kim (2007) incorporate the c-fold rule into FDR estimation, but not in the adopted statistical model, where the observed expressions are simply averaged. [Alfo' et al.: Robust Semiparametric Gene Discovery] In this work, unlike van de Wiel and Kim (2007) and the majority... |

1982 | Practical Optimization - Gill, Murray - 1981 |

319 |
Resampling-Based Multiple Testing: Examples and Methods for P-value Adjustment
- Westfall, Young
- 1993
Citation Context ...04) where the main goal is gene discovery, selected genes must fulfill two requirements: statistical significance, that is, generalizability of the estimated difference to the population of patients, and biological significance, that usually translates into a lower bound for the effect size (Kirk, 2007). Statistical significance is usually addressed by means of hypothesis testing and p-values, but in microarray experiments one must take into account the multiplicity issue. There are currently a lot of alternative ways for defining an error rate and keeping it under control and one can look at Westfall and Young (1993), and Farcomeni (2008) for a review of recent developments. Many modern procedures privilege in a first stage the control of the False Discovery Rate (FDR), as conceived in Benjamini and Hochberg (1995), and discard only at a second stage those genes with a fold-change between two experimental/biological conditions too close to 1. Formally, it is requested that the ratio between the average expression in one experimental condition is at least c times the average in the other condition, where typically c = 2, see e.g. Tusher, Tibshirani, and Chu (2001), Sabatti, Karsten, and Geschwind (2002). H... |

145 | Detecting differential gene expression with a semiparametric hierarchical mixture method - Newton, Noueiry, et al. - 2004 |

114 | Analyzing Microarray Gene Expression Data - McLachlan, Do, et al. - 2004 |

53 | Size, power and false discovery rates
- Efron
- 2007
Citation Context ... $\Psi$. The i-th gene can be assigned to the set corresponding to the highest estimated posterior probability, using then a simple Maximum a Posteriori (MaP) rule. In practice, different probability thresholds may be used to give more conservative lists of potential differentially expressed genes. A natural question is how to calibrate the probability threshold and thus the effectiveness of the cutoff. In order to select the probability threshold one can try to keep under control some error rate (such as the FDR) of the resulting procedure, see for instance McLachlan, Do, and Ambroise (2004) and Efron (2007). Recall that the FDR is the expected proportion of false discoveries over the number of selected genes, or zero if no gene is selected. Therefore, along the lines of Newton, Noueiry, Sarkar, and Ahlquist (2004), we may use the following estimates $\mathrm{FDR}[\tau] = \frac{\sum_{i:(1-w_{i0})>\tau} w_{i0}}{\mathrm{card}\{i : (1-w_{i0}) > \tau\}}$ (3), where $\mathrm{card}\{H\}$ is the cardinality of the set $\{H\}$. The threshold probability $\tau$ can be selected as the largest $\tau$ for which the estimated FDR is below a pre-specified level $\alpha$, i.e. $\tau_\alpha = \max\{\tau : \mathrm{FDR}[\tau] \le \alpha\}$. In practice, one need not evalu... |
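
The estimate in (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `estimated_fdr` and `admissible_thresholds`, the list `w0` of posterior null probabilities $w_{i0}$, and the candidate grid are hypothetical and do not come from the paper.

```python
def estimated_fdr(w0, tau):
    """Estimate (3): mean of w_i0 over genes with (1 - w_i0) > tau.

    w0 is a list of posterior probabilities of the null class (assumed
    to come from a fitted model). Returns 0.0 when no gene is selected,
    matching the convention quoted in the excerpt.
    """
    selected = [w for w in w0 if (1.0 - w) > tau]
    if not selected:
        return 0.0
    return sum(selected) / len(selected)


def admissible_thresholds(w0, alpha, grid):
    """Return the candidate thresholds whose estimated FDR is at most alpha.

    The grid of candidates is a hypothetical input; the excerpt then
    picks tau_alpha from the thresholds that satisfy this condition.
    """
    return [tau for tau in grid if estimated_fdr(w0, tau) <= alpha]


w0_example = [0.01, 0.05, 0.4, 0.9]
fdr_at_half = estimated_fdr(w0_example, 0.5)   # three genes selected
ok_taus = admissible_thresholds(w0_example, 0.05, [0.5, 0.9])
```

The helper deliberately stops at listing the admissible thresholds, so the choice of $\tau_\alpha$ stays with the rule stated in the excerpt.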

52 |
Simultaneously modeling joint and marginal distributions of multivariate categorical responses.
- Lang, Agresti
- 1994
Citation Context ...r a nominal observed variable; estimation can be simply accomplished by using a standard EM algorithm, leading to the following parameter estimates at the M-step: $\theta^{(t)}_{u|k} = \frac{\sum_i \sum_j \max(y_{ij}, 0)\, w_{ik}}{\sum_i \sum_j w_{ik}}$, $\theta^{(t)}_{l|k} = \frac{\sum_i \sum_j -\min(y_{ij}, 0)\, w_{ik}}{\sum_i \sum_j w_{ik}}$, $\theta^{(t)}_{0|k} = 1 - \theta_{u|k} - \theta_{l|k}$, $\pi^{(t)}_k = \frac{\sum_i w_{ik}}{G}$. The E and M steps are iterated until convergence. Constrained maximum likelihood estimation, given the modeling assumptions, can be accomplished by transforming the Fisher-scoring algorithm proposed by Lang and Agresti (1994) into an active-set method. [Submission to Statistical Applications in Genetics and Molecular Biology, http://www.bepress.com/sagmb] For instance, Vermunt (1999) shows how to transform a simple uni-dimensional Newton-type algorithm for ML estimation with equality constraints into an active-set method. In the present context, the standard EM algorithm should be modified as follows: at each M-step, the inequality constraints which are no longer necessary are de-activated (i.e. if $\theta_{u|1} > \theta_{u|0}$ the corresponding constraint is simply removed at the present iteration), while the ones which are violated are activated, see e.g. Gill, Murray, and Wright (1981), defining equality constraints for the param... |
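
As a rough illustration of the M-step updates quoted above, the following Python sketch computes the unconstrained estimates only. The function name `m_step` and the nested-list layout of `Y` and `W` are assumptions made for this example, and the active-set handling of the inequality constraints described in the excerpt is not implemented.

```python
def m_step(Y, W):
    """Unconstrained M-step for the three-valued latent class model.

    Y: G x m nested list of codes y_ij in {-1, 0, 1}.
    W: G x K nested list of posterior weights w_ik from the E-step.
    Returns (theta, pi), where theta[k] holds the conditional
    probabilities of the upper ("u"), lower ("l") and middle ("0") codes.
    Assumes every class has positive total weight (no zero denominator).
    """
    G = len(Y)
    K = len(W[0])
    theta, pi = [], []
    for k in range(K):
        # Weighted counts of +1 codes and of -1 codes for class k.
        up = sum(W[i][k] * sum(max(y, 0) for y in Y[i]) for i in range(G))
        low = sum(W[i][k] * sum(-min(y, 0) for y in Y[i]) for i in range(G))
        # Double sum over i and j of w_ik.
        denom = sum(W[i][k] * len(Y[i]) for i in range(G))
        theta_u = up / denom
        theta_l = low / denom
        theta.append({"u": theta_u, "l": theta_l, "0": 1.0 - theta_u - theta_l})
        pi.append(sum(W[i][k] for i in range(G)) / G)
    return theta, pi


theta_example, pi_example = m_step([[1, 1], [-1, 0]], [[1.0, 0.0], [0.0, 1.0]])
```

In a full implementation this step would be followed by the constraint check described in the excerpt, activating or releasing inequality constraints before the next E-step.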

22 |
A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion
- Farcomeni
- 2008
Citation Context ... discovery, selected genes must fulfill two requirements: statistical significance, that is, generalizability of the estimated difference to the population of patients, and biological significance, that usually translates into a lower bound for the effect size (Kirk, 2007). Statistical significance is usually addressed by means of hypothesis testing and p-values, but in microarray experiments one must take into account the multiplicity issue. There are currently a lot of alternative ways for defining an error rate and keeping it under control and one can look at Westfall and Young (1993), and Farcomeni (2008) for a review of recent developments. Many modern procedures privilege in a first stage the control of the False Discovery Rate (FDR), as conceived in Benjamini and Hochberg (1995), and discard only at a second stage those genes with a fold-change between two experimental/biological conditions too close to 1. Formally, it is requested that the ratio between the average expression in one experimental condition is at least c times the average in the other condition, where typically c = 2, see e.g. Tusher, Tibshirani, and Chu (2001), Sabatti, Karsten, and Geschwind (2002). However, applying the c... |

22 | High throughput screening of co-expressed gene pairs with controlled false discovery rate (FDR) and minimum acceptable strength - Zhu, Hero, et al. - 2005 |

21 |
Constrained monotone EM algorithms for finite mixture of multivariate Gaussians.
- Ingrassia, Rocci
- 2007
Citation Context ... constraints into an active-set method. In the present context, the standard EM algorithm should be modified as follows: at each M-step, the inequality constraints which are no longer necessary are de-activated (i.e. if $\theta_{u|1} > \theta_{u|0}$ the corresponding constraint is simply removed at the present iteration), while the ones which are violated are activated, see e.g. Gill, Murray, and Wright (1981), defining equality constraints for the parameter estimates at the current iteration. Obviously the “activation” step does not break down the monotone increasing behaviour of the loglikelihood, see also Ingrassia and Rocci (2007) for a more thorough discussion. 4 Gene Selection Strategies In order to select genes, we use a threshold for statistical significance, based on the posterior class probabilities estimated through the EM algorithm. The estimate of the posterior probability that the i-th gene is in one of the three sets is denoted with $w_{ik} = \Pr(Z_{ik} = 1 \mid y_i, \Psi)$, $k \in \{-1, 0, 1\}$. This probability is conditional on the observed data $Y$ and the MLE $\Psi$. The i-th gene can be assigned to the set corresponding to the highest estimated posterior probability, using then a simple Maximum a Posteriori (MaP) rule. In practice, ... |
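
The posterior probabilities $w_{ik}$ and the MaP rule can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes that the codes $y_{ij}$ are conditionally independent across slides given the class (the usual latent class assumption, not stated in the excerpt), and the names `posterior`, `map_class`, `pi` and `theta` are hypothetical.

```python
CLASSES = (-1, 0, 1)


def posterior(y_row, pi, theta):
    """Posterior class probabilities w_ik for one gene.

    y_row: list of codes in {-1, 0, 1} for the gene across slides.
    pi: dict mapping class k to its prior probability.
    theta: dict mapping class k to a dict {code: conditional probability}.
    Assumes conditional independence across slides given the class.
    """
    joint = {}
    for k in CLASSES:
        p = pi[k]
        for y in y_row:
            p *= theta[k][y]
        joint[k] = p
    total = sum(joint.values())
    return {k: joint[k] / total for k in CLASSES}


def map_class(w):
    """Maximum a posteriori rule: the class with the largest w_ik."""
    return max(CLASSES, key=lambda k: w[k])


pi_demo = {-1: 0.1, 0: 0.8, 1: 0.1}
theta_demo = {
    -1: {-1: 0.7, 0: 0.2, 1: 0.1},
    0: {-1: 0.1, 0: 0.8, 1: 0.1},
    1: {-1: 0.1, 0: 0.2, 1: 0.7},
}
w_demo = posterior([1, 1, 1], pi_demo, theta_demo)
label_demo = map_class(w_demo)
```

With many slides the product of probabilities can underflow; working with sums of logarithms would be the usual remedy.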

6 | Estimating the false discovery rate using nonparametric deconvolution - van de Wiel, Kim - 2007 |

5 | Thresholding rules for recovering a sparse signal from microarray experiments, Math. - Sabatti, Karsten, et al. - 2002 |

4 | Accuracy of cDNA microarray methods to detect small gene expression changes induced by neuregulin on breast epithelial cells - Yao, Rakhade, et al. - 2004 |

3 | Robust semiparametric mixing for detecting differentially expressed genes in microarray experiments. - Alfò, Farcomeni, et al. - 2007 |

3 |
Effect magnitude: a different focus
- Kirk
- 2007
Citation Context ...a real dataset on multiple sclerosis. KEYWORDS: differentially expressed genes, effect size, microarray data, mixture model ∗We wish to thank two anonymous referees and the associate editor for comments that helped improve the original paper. 1 Introduction In microarray studies (Amaratunga and Cabrera, 2004) where the main goal is gene discovery, selected genes must fulfill two requirements: statistical significance, that is, generalizability of the estimated difference to the population of patients, and biological significance, that usually translates into a lower bound for the effect size (Kirk, 2007). Statistical significance is usually addressed by means of hypothesis testing and p-values, but in microarray experiments one must take into account the multiplicity issue. There are currently a lot of alternative ways for defining an error rate and keeping it under control and one can look at Westfall and Young (1993), and Farcomeni (2008) for a review of recent developments. Many modern procedures privilege in a first stage the control of the False Discovery Rate (FDR), as conceived in Benjamini and Hochberg (1995), and discard only at a second stage those genes with a fold-change between t... |

3 |
Characterization of the mid-foregut transcriptome identifies genes regulated during lung bud induction, Gene Expr.
- Millien, Beane, et al.
- 2008
Citation Context ...ing to the following rule: $y_{ij} = \begin{cases} 1 & \text{if } \log f_{ij} > \log c_u \\ -1 & \text{if } \log f_{ij} < \log c_l \\ 0 & \text{otherwise} \end{cases}$ so that data become an $S$-valued matrix $Y$, where $S = \{-1, 0, 1\}$ corresponding to the i-th gene being, in the j-th slide, over an upper threshold $c_u$ ($y_{ij} = 1$), below a lower threshold $c_l$ ($y_{ij} = -1$) or within the two thresholds $(c_l, c_u)$ ($y_{ij} = 0$). Conventionally the rule with $1/c_l = c_u = c = 2$ is used to consider genes as functionally important but there is sometimes the need to raise the threshold as large as $c = 10$ and sometimes to keep it as small as $c = 1.5$ (Cheng, Fabrizio, Ge, Longo, and Li, 2007; Millien, Beane, Lenburg, Tsao, Lu, Spira, and Ramirez, 2008). Our aim is to model this discrete outcome $Y$, by considering that each gene may be drawn from one of the three subsets of genes $(G_{-1}, G_0, G_1)$. Let us denote with $Z_{ik}$ the latent variable indicating whether the i-th gene belongs to the k-th subset $G_k$, $k \in \{-1, 0, 1\}$ and $\theta_{u|k} = \Pr(f_{ij} > c_u \mid i \in G_k) = \Pr(y_{ij} = 1 \mid Z_{ik} = 1)$, $\theta_{l|k} = \Pr(f_{ij} < c_l \mid i \in G_k) = \Pr(y_{ij} = -1 \mid Z_{ik} = 1)$, $\theta_{0|k} = 1 - \theta_{u|k} - \theta_{l|k} = \Pr(c_l < f_{ij} < c_u \mid i \in G_k) = \Pr(y_{ij} = 0 \mid Z_{ik} = 1)$ the (conditional) probabilities for the i-th gene in the k-th set to yield a fold change respectively over the upper threshold, below the lower thr... |
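
The discretization rule can be illustrated with a short Python sketch. The function name `discretize` and the nested-list input are assumptions for this example; the default thresholds follow the conventional choice $1/c_l = c_u = 2$ mentioned in the excerpt.

```python
import math


def discretize(fold_changes, c_upper=2.0, c_lower=0.5):
    """Code fold changes f_ij as 1, -1 or 0 by comparing logs to thresholds.

    fold_changes: G x m nested list of positive fold changes.
    A value exactly on a threshold is coded 0, since both
    comparisons in the rule are strict.
    """
    log_cu, log_cl = math.log(c_upper), math.log(c_lower)
    coded = []
    for row in fold_changes:
        coded_row = []
        for f in row:
            log_f = math.log(f)
            if log_f > log_cu:
                coded_row.append(1)
            elif log_f < log_cl:
                coded_row.append(-1)
            else:
                coded_row.append(0)
        coded.append(coded_row)
    return coded


Y_example = discretize([[4.0, 1.0, 0.25], [2.5, 3.0, 0.9]])
```

Raising the thresholds, for example `discretize(data, c_upper=10.0, c_lower=0.1)`, corresponds to the stricter $c = 10$ rule mentioned in the excerpt.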

3 |
Controlling Bayes directional false discovery rate in random effects model
- Sarkar, Zhou
- 2008
Citation Context ...$H_{0i} : Z_{i0} = 1$. We could also be a bit more strict and request not only that genes in $G_0$ are not classified as differentially expressed, but also that up-regulated genes are not misclassified as down-regulated and the other way around. This is usually referred to as directional, or Type III, error rate control, and in practice reduces to substituting (3) with $\mathrm{FDR}[\tau] = \frac{\sum_{i:(1-w_{i0})>\tau} w_{i0} + \sum_{i:(1-w_{i0})>\tau \,\&\, w_{i1}>w_{i,-1}} w_{i,-1} + \sum_{i:(1-w_{i0})>\tau \,\&\, w_{i,-1}>w_{i1}} w_{i1}}{\mathrm{card}\{i : (1-w_{i0}) > \tau\}}$ (4), which leads to a bit more conservative procedure. For a deeper discussion on the directional FDR refer for instance to Sarkar and Zhou (2008). When the directional errors are not of equal importance, it is straightforward to modify (4) so that different thresholds are used to declare up and down regulation. 5 Simulation Study In this section we apply and compare our method on synthetic data, with a simulation scheme similar to the one discussed by van de Wiel and Kim (2007). We simulate samples of $G = 4000$ genes with observed (log) expression levels defined by: $\log f_{ij} = \mu_i + \varepsilon_{ij}$, $i = 1, \dots, G$, $j = 1, \dots, m$; where i and j index genes and tissue samples, respectively, and $m = 10$ or $m = 50$. Mean log-ratios are drawn from a mixtur... |
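
A minimal Python sketch of the directional estimate (4) follows. The function name `directional_fdr` and the representation of each gene's posterior probabilities as a dict keyed by class are assumptions for illustration only.

```python
def directional_fdr(W, tau):
    """Directional FDR estimate (4).

    W: list of dicts {-1: w_i,-1, 0: w_i0, 1: w_i1}, one per gene.
    For each selected gene, adds the null probability plus the
    probability of the opposite direction to the one declared.
    Returns 0.0 when no gene is selected.
    """
    numerator = 0.0
    count = 0
    for w in W:
        if (1.0 - w[0]) > tau:
            count += 1
            numerator += w[0]
            if w[1] > w[-1]:
                # Declared up-regulated: a down-regulated truth is an error.
                numerator += w[-1]
            elif w[-1] > w[1]:
                # Declared down-regulated: an up-regulated truth is an error.
                numerator += w[1]
    return numerator / count if count else 0.0


W_example = [
    {-1: 0.05, 0: 0.05, 1: 0.9},
    {-1: 0.7, 0: 0.1, 1: 0.2},
    {-1: 0.1, 0: 0.8, 1: 0.1},
]
dfdr_example = directional_fdr(W_example, 0.5)
```

Because the extra terms are non-negative, this estimate is never smaller than the one in (3) at the same threshold, which is consistent with the excerpt's remark that the procedure becomes more conservative. Ties between $w_{i1}$ and $w_{i,-1}$ contribute no directional term, mirroring the strict inequalities in (4).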

2 |
A general class of nonparametric models for ordinal categorical data
- Vermunt
- 1999
Citation Context ...hed by using a standard EM algorithm, leading to the following parameter estimates at the M-step: $\theta^{(t)}_{u|k} = \frac{\sum_i \sum_j \max(y_{ij}, 0)\, w_{ik}}{\sum_i \sum_j w_{ik}}$, $\theta^{(t)}_{l|k} = \frac{\sum_i \sum_j -\min(y_{ij}, 0)\, w_{ik}}{\sum_i \sum_j w_{ik}}$, $\theta^{(t)}_{0|k} = 1 - \theta_{u|k} - \theta_{l|k}$, $\pi^{(t)}_k = \frac{\sum_i w_{ik}}{G}$. The E and M steps are iterated until convergence. Constrained maximum likelihood estimation, given the modeling assumptions, can be accomplished by transforming the Fisher-scoring algorithm proposed by Lang and Agresti (1994) into an active-set method. For instance, Vermunt (1999) shows how to transform a simple uni-dimensional Newton-type algorithm for ML estimation with equality constraints into an active-set method. In the present context, the standard EM algorithm should be modified as follows: at each M-step, the inequality constraints which are no longer necessary are de-activated (i.e. if $\theta_{u|1} > \theta_{u|0}$ the corresponding constraint is simply removed at the present iteration), while the ones which are violated are activated, see e.g. Gill, Murray, and Wright (1981), defining equality constraints for the parameter estimates at the current iteration. Obviously the “... |

1 | Submission to Statistical Applications in Genetics and Molecular Biology http://www.bepress.com/sagmb - Lewin, Bochkina, et al. - 2007 |