
## The calculation of posterior distributions by data augmentation (1987)

Venue: Journal of the American Statistical Association

Citations: 923 (10 self)

### Citations

11953 | Maximum likelihood from incomplete data via the EM algorithm (with discussion) - Dempster, Laird, et al. - 1977
Citation Context ...ized and further developed in the influential paper of Dempster, Laird, and Rubin (1977), in which references to earlier research can be found. Briefly, based on a current estimate of the parameter value, the method seeks to compute the expected value of the log-likelihood of the complete data and then maximizes this expected log-likelihood to obtain the updated parameter value. Dempster et al. called this approach the EM algorithm because of the expectation and maximization calculations involved. Although the details of the EM algorithm are not of direct interest for the present article, the aspect of Dempster et al. (1977) that is most important for our purpose is the impressive list of examples, which includes missing data problems, mixture problems, factor analysis, iteratively reweighted least squares, and many others. In each example, enough detail is presented to show how the EM algorithm can be applied. By these examples, the authors make it clear that even in cases that at first sight may not appear to be an incomplete data problem, one may sometimes still profit by artificially formulating it as such to facilitate the maximum likelihood estimation. It seems that the potential usefulness of this problem ...
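The E and M calculations described in this excerpt can be made concrete with a minimal sketch on one of the listed examples, a two-component Gaussian mixture. All data, starting values, and variable names below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic incomplete data: the component labels are the "missing" part
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 200)])

w, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])  # illustrative initial guesses
sigma = np.array([1.0, 1.0])

def loglik(x, w, mu, sigma):
    dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.log(dens.sum(axis=1)).sum()

prev = -np.inf
for _ in range(50):
    # E step: expected complete-data sufficient statistics (responsibilities)
    dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    r = dens / dens.sum(axis=1, keepdims=True)
    # M step: maximize the expected complete-data log-likelihood
    n_k = r.sum(axis=0)
    w = n_k / len(x)
    mu = (r * x[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)
    cur = loglik(x, w, mu, sigma)
    assert cur >= prev - 1e-8  # EM never decreases the observed-data log-likelihood
    prev = cur
```

The monotonicity assertion inside the loop is the defining property of the iteration that the excerpt describes.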

939 | Stochastic Processes - Doob - 1953
Citation Context ...n be arbitrarily close to 1, depending on the starting value. This seems to be an intrinsic limit imposed by an unbounded parameter space and should not be regarded as a weakness of the method. Remark 3. The whole theory can be developed in the same way for finite or countable Θ. The simplest replacement for Condition (C) is to require K(θ, φ) > 0 for all θ, φ ∈ Θ. Weaker conditions exist but they are cumbersome to state. Remark 4. It is clear from properties (a) and (b) in the proof of Theorem 1 that T is a Markov transition operator. However, a search through standard references, including Doob (1953), does not produce results directly suitable for our use. In particular, the L1 convergence rate in Theorem 3 seems to be new. Remark 5. Similarly, there is a vast literature on fixed point operator equations and the method of successive substitution (see, e.g., Rall 1969, pp. 64-74). Again, we have not found results directly usable...

930 | Linear statistical inference and its applications - Rao - 1973
Citation Context ...xtent on the ease of implementation of the imputation and posterior steps. In general, neither step is guaranteed to be easy. There is a parallel limitation on the EM algorithm; namely, that in general both the E and M steps may be difficult to implement. There remains, however, a rich class of problems, especially those connected with exponential families, for which there are natural ways to carry out these steps. This is illustrated by the examples here and the examples in Dempster et al. (1977). Linkage Example. To illustrate the basic algorithm, we consider an example that was presented in Rao (1973) and reexamined in Dempster et al. (1977) and Louis (1982). In particular, from a genetic linkage model, it is believed that 197 animals are distributed multinomially into four categories, y = (y1, y2, y3, y4) = (125, 18, 20, 34), with cell probabilities (1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4). To illustrate the algorithm, y is augmented by splitting the first cell into two cells, one having cell probability 1/2 and the other having cell probability θ/4. Thus the augmented data set is given by x = (x1, x2, x3, x4, x5), where x1 + x2 = 125, x3 = y2, x4 = y3, and x5 = y4. The likel...
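The imputation and posterior steps for this linkage example can be sketched directly. The counts (125, 18, 20, 34) and the split of the first cell into probabilities 1/2 and θ/4 follow the excerpt; the flat prior on θ, and running the scheme as a single-chain Gibbs-style sampler rather than the paper's m-sample mixture scheme, are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.array([125, 18, 20, 34])
theta, draws = 0.5, []
for t in range(6000):
    # Imputation step: z ~ Binomial(125, (theta/4) / (1/2 + theta/4))
    p_split = (theta / 4) / (0.5 + theta / 4)
    z = rng.binomial(y[0], p_split)
    # Posterior step: with a flat prior, theta | z, y ~ Beta(z + y4 + 1, y2 + y3 + 1)
    theta = rng.beta(z + y[3] + 1, y[1] + y[2] + 1)
    if t >= 1000:  # discard burn-in draws
        draws.append(theta)
post_mean = float(np.mean(draws))
```

With these data the posterior concentrates near the well-known maximum likelihood estimate for this example (about 0.63), so the posterior mean provides a quick sanity check on the sampler.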

745 | Bayesian Inference in Statistical Analysis - Box, Tiao - 1992
Citation Context ... (14, 0, 1, 5) (same legend as in Fig. 1). The dashed and dotted lines are superimposed. 2. If x2 is known, then generate the unobserved observation from N(ρ(σ1/σ2)x2, σ1²(1 − ρ²)). The covariance matrix Σ is then generated from the current guess of the posterior distribution p(Σ | y). In the first iteration, ρ can be generated from U[−1, 1] and σ1² and σ2² can be generated from weighted χ² distributions. At succeeding iterations, the updated posterior p(Σ | y) is a mixture of inverted Wishart distributions. This last point follows from the fact that p(Σ | x) is an inverted Wishart distribution (Box and Tiao 1973, p. 428) when the prior of Σ is given as p(Σ) ∝ |Σ|^−(p+1)/2, where p is the dimension of the multivariate normal distribution. Thus, in the second step of the algorithm, we generate m observations from this mixture of inverted Wishart distributions and compute the associated correlation coefficient for each observation. Regarding the implementation of the algorithm, it is noted that the algorithm of Odell and Feiveson (1966) can be used to generate observations from the inverted Wishart distribution. The amount of computation in this algorithm is not extensive, since the computation is of ...
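The Odell and Feiveson (1966) generator predates modern libraries; the same second step (draw covariance matrices, then compute the correlation coefficient of each draw) can be sketched with `scipy.stats.invwishart`. The degrees of freedom and scale matrix below are placeholders, not the values from the paper's example:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(2)
df, scale = 10, np.array([[1.0, 0.5], [0.5, 1.0]])  # illustrative hyperparameters
# draw 500 covariance matrices from a single inverted Wishart component
draws = invwishart.rvs(df=df, scale=scale, size=500, random_state=rng)
# correlation coefficient implied by each drawn covariance matrix
rho = draws[:, 0, 1] / np.sqrt(draws[:, 0, 0] * draws[:, 1, 1])
```

For the mixture described in the excerpt, one would first pick a component at random and then draw from that component's inverted Wishart in the same way.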

339 | Accurate approximations for posterior moments and marginal densities - Tierney, Kadane - 1986 |

321 | Finding the observed information matrix when using the EM algorithm - Louis - 1982
Citation Context ...that the augmented data, x = (y, z), is straightforward to analyze. In general, this data augmentation scheme is used for the calculation of maximum likelihood estimates or posterior modes. For making inferential statements, the validity of the normal approximation is assumed and the precision of the estimate is given by the observed Fisher information. In most cases, however, it is not possible to obtain the Fisher information directly from the basic EM calculations, and one must do further calculations to obtain standard errors [see the discussion following Dempster et al. (1977); see also Louis (1982)]. Except in simple cases, it is difficult to obtain an indication of the validity of the normal approximation. In the present article, we are interested in the entire likelihood or posterior distribution, not just the maximizer and the curvature at the maximizer. The method we propose exploits the simplicity of the posterior distribution of the parameter given the augmented data, just as the EM algorithm exploits the simplicity of maximum likelihood estimation given the complete data. Even in large sample situations, when the normal approximation is expected to be valid, it would still be com...
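For the linkage example quoted elsewhere on this page, the "further calculations" for a standard error are easy to sketch, because the observed-data log-likelihood is available in closed form: with counts (125, 18, 20, 34) it is, up to a constant, 125 log(2 + θ) + 38 log(1 − θ) + 34 log θ. The grid maximization and finite-difference curvature below are illustrative choices, not the methods of Louis (1982):

```python
import numpy as np

def loglik(theta):
    # observed-data log-likelihood of the linkage example, constant terms dropped
    return 125 * np.log(2 + theta) + 38 * np.log(1 - theta) + 34 * np.log(theta)

# crude grid maximization; a root-finder on the score equation would also do
grid = np.linspace(0.01, 0.99, 100000)
mle = float(grid[np.argmax(loglik(grid))])

# observed information = minus the second derivative at the MLE (central difference)
h = 1e-4
obs_info = -(loglik(mle + h) - 2 * loglik(mle) + loglik(mle - h)) / h**2
se = float(1 / np.sqrt(obs_info))
```

The point of the excerpt stands: this direct route exists only because the observed-data likelihood here is simple; in general the EM iterations alone do not yield `obs_info`.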

195 | Linear Operators, Part I: General Theory - Dunford, Schwartz - 1988
Citation Context ...). There exists a subsequence {f_i′} such that ‖Tf_i′‖/‖f_i′‖ → α, and f_i′ converges to some f* weakly in L1. (d) Since the set {f_i′} is bounded and equicontinuous, we must actually have f_i′ converging to f* strongly in L1, and f* can be chosen to be continuous. (e) Hence α = lim(‖Tf_i′‖/‖f_i′‖) = ‖Tf*‖/‖f*‖. But ∫ f*(θ) dθ = 0; hence by Lemma 2, 0 ≤ α < 1. From this, the theorem follows directly. It remains to establish statements (a)-(e). Statement (e) needs no proof, statement (a) follows from elementary manipulation, and statement (b) is a well-known property of L1 spaces (see, e.g., Dunford and Schwartz 1958, p. 294). To prove (c), let {f_i″} be a subsequence of {f_i} such that ‖Tf_i″‖/‖f_i″‖ → α. Now by (a) and (b), {f_i″} is weakly sequentially compact, so there must exist a further subsequence {f_i′} of {f_i″} convergent weakly in L1. This establishes (c). Finally, (d) can be established by standard analytical arguments. Remark 1. One of the conditions of Theorem 3 requires that g0(θ)/g∗(θ) be uniformly bounded. For a compact parameter space Θ, this condition is automatic if Condition (C) holds, since under (C), g∗ is continuous and strictly positive. For an unbounded parameter space, we need...

72 | Multiple Imputations in Sample Surveys — A Phenomenological Bayesian Approach to Nonresponse - Rubin - 1978
Citation Context ...whereas our method relies on the simplicity of the posterior distribution of the parameter given the augmented data. Upon completion of both works, it was realized that when one identifies the unknown parameters as part of the missing values, then the two algorithms become essentially the same. Our present results are, to a considerable extent, anticipated in the work of Rubin. In particular, the two key concepts of data augmentation and multiple imputation have been advocated and studied by Rubin in a series of papers on inference in the presence of incomplete data (Dempster et al. 1977; Rubin 1978, 1980). 2. THE BASIC ALGORITHM The algorithm is motivated by the following simple representation of the desired posterior density: p(θ | y) = ∫ p(θ | z, y) p(z | y) dz (2.1), where p(θ | y) denotes the posterior density of the parameter θ given the data y, p(z | y) denotes the predictive density of the latent data z given y, and p(θ | z, y) denotes the conditional density of θ given the augmented data x = (z, y). The predictive density of z can, in turn, be related to the desired posterior density by p(z | y) = ∫ p(z | φ, y) p(φ | y) dφ. In the above equations, the sample space for the latent data z is denoted by Z and the parameter space for θ is denoted by Θ. (From...
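The two relations quoted in this excerpt combine into the fixed-point equation underlying the successive-substitution interpretation and the operator T and kernel K(θ, φ) mentioned in the other excerpts. A hedged reconstruction in the notation defined above (the original display equations were lost in extraction):

```latex
p(\theta \mid y) = \int_{Z} p(\theta \mid z, y)\, p(z \mid y)\, dz ,
\qquad
p(z \mid y) = \int_{\Theta} p(z \mid \phi, y)\, p(\phi \mid y)\, d\phi .
```

Substituting the second relation into the first gives g = Tg for g(θ) = p(θ | y):

```latex
g(\theta) = \int_{\Theta} K(\theta, \phi)\, g(\phi)\, d\phi ,
\qquad
K(\theta, \phi) = \int_{Z} p(\theta \mid z, y)\, p(z \mid \phi, y)\, dz .
```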

67 | Analysis of qualitative data - Haberman - 1979
Citation Context ...let observations, the value of θ that gives the closest (p2, p3, p4, p5) vector is found using least squares. The resulting histograms of the θ values (using 10,000 initial values and 3,000 accepted values) and the true posterior distribution are presented in Figure 6. An examination of this figure reveals that the estimated distribution of θ based on the restricted set of θ values is quite similar to the true distribution. 5. THE TRADITIONAL LATENT-CLASS MODEL The data in Table 2 represent the responses of 3,181 participants in the 1972, 1973, and 1974 General Social Surveys, as presented in Haberman (1979). The participants in these surveys are cross-classified by the year of the survey and their responses to each of three questions regarding abortion. Thus the cell entry represents the number of subjects who in year D = d give responses a to question A, b to question B, and c to question C. Regarding question A, subjects are asked, "Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if she is married Table 2. White Christian Subjects in the 1972-1974 General Social Surveys, Cross-Classified by Year of Survey and Responses to Three Ques...
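The traditional latent-class model described here can be sketched with a minimal EM fit: dichotomous items assumed conditionally independent given a dichotomous latent class X. The data below are synthetic (the GSS counts of Table 2 are not reproduced), the model uses only three items (the year variable D is omitted for brevity), and all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
# synthetic responses: class 0 tends to answer 1, class 1 tends to answer 0
true_pi, true_p = 0.6, np.array([[0.9, 0.8, 0.85], [0.2, 0.3, 0.25]])
x_lat = (rng.random(2000) < (1 - true_pi)).astype(int)  # latent class indicator
data = (rng.random((2000, 3)) < true_p[x_lat]).astype(int)

pi, p = 0.5, np.array([[0.7, 0.7, 0.7], [0.3, 0.3, 0.3]])  # starting values
for _ in range(200):
    # E step: posterior class probabilities given the three responses
    lik = np.stack(
        [np.prod(p[k] ** data * (1 - p[k]) ** (1 - data), axis=1) for k in (0, 1)],
        axis=1,
    )
    resp = lik * np.array([pi, 1 - pi])
    resp /= resp.sum(axis=1, keepdims=True)
    # M step: update the class weight and the item-response probabilities
    pi = float(resp[:, 0].mean())
    p = (resp.T @ data) / resp.sum(axis=0)[:, None]
```

This mirrors the conditional-independence structure in the excerpt: given the latent class, each item contributes an independent Bernoulli factor to the likelihood.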

63 | The analysis of systems of qualitative variables when some of the variables are unobservable: Part I - A modified latent structure approach - Goodman - 1974
Citation Context ...[Yes/No response patterns of Table 2 omitted] Source: Haberman (1979, p. 559). and does not want any more children." In question B, the italicized phrase is replaced with "if the family has a very low income and cannot afford any more children," and in question C it is replaced with "if she is not married and does not want to marry the man." For these data, Haberman (1979) considered several models, one of which is the traditional latent-class model. [See Goodman (1974a,b), Haberman (1979), or Clogg (1977) for an exposition of this model.] In this example, the traditional latent-class model assumes that the manifest variables (A, B, C, D) are conditionally independent, given a dichotomous latent variable (X). In other words, if the value of the dichotomous latent variable is known for a given participant, then knowledge of the response to a given question provides no further information regarding the responses to either of the other two questions. Haberman used the EM and scoring algorithms to obtain maximum likelihood estimates of the cell probabilities. O...

40 | Computational Solution of Nonlinear Operator Equations - Rall - 1969
Citation Context ...r. However, a search through standard references, including Doob (1953), does not produce results directly suitable for our use. In particular, the L1 convergence rate in Theorem 3 seems to be new. Remark 5. Similarly, there is a vast literature on fixed point operator equations and the method of successive substitution (see, e.g., Rall 1969, pp. 64-74). Again, we have not found results directly usable here. 7. PRACTICAL IMPLEMENTATION OF THE ALGORITHM As indicated in the introduction, if the sample size m is taken to be large in each iteration, then the algorithm can be interpreted as the method of successive substitution for solving a fixed point problem. In practice, however, it is inefficient to take m large during the first few iterations, when the estimated posterior distribution is far from the true distribution. Rather, it is suggested that m initially be small and then increased with successive iterations. In addition, we...
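The suggested schedule (small m early, larger m later) can be sketched on the linkage example using the paper's m-sample mixture approximation: at each iteration, draw m latent counts from current θ draws, then sample m new θ values from the equal-weight mixture of Beta components. The flat prior, the doubling schedule, and the cap on m are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
y = np.array([125, 18, 20, 34])
thetas = np.full(4, 0.5)  # current draws approximating the posterior
for it in range(12):
    m = min(4 * 2 ** it, 1600)  # small m in early iterations, larger m later
    # Imputation step: one latent count z per theta drawn from the current mixture
    base = rng.choice(thetas, size=m)
    z = rng.binomial(y[0], (base / 4) / (0.5 + base / 4))
    # Posterior step: sample from the equal-weight mixture of
    # Beta(z + y4 + 1, y2 + y3 + 1) components implied by the m imputations
    thetas = rng.beta(z + y[3] + 1, y[1] + y[2] + 1)
```

Keeping m small while the approximation is still far from the target wastes no effort on refining a poor mixture; the final, large-m iterations do the precise work.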

24 | Bayesian Analysis of Dichotomous Quantal Response Models - Zellner, Rossi - 1984
Citation Context ...ased on the entire posterior distribution (or the entire likelihood). The examples presented in this article will illustrate that a few steps of the iterative algorithm will provide a diagnostic for the adequacy of the normal approximation for the maximum likelihood estimate. In practice, one is often interested in the marginal distribution of various parameters of interest. Even if one can evaluate the joint posterior distribution, obtaining the marginal distribution can be difficult and is a topic of current interest (Smith, Skene, Shaw, Naylor, and Dransfield 1985; Tierney and Kadane 1985; Zellner and Rossi 1984). In the data augmentation setup, one is faced with the additional complication that the posterior distribution given the observed data may not be expressible in closed form. Ideally, one would want to choose the augmentation such that the posterior given the augmented data can be sampled from with ease. In cases where this cannot be done, one would have to resort to approximate sampling methods. The Dirichlet sampling scheme discussed in Section 4 provides a simple approach for approximate sampling in the case of multinomial data. Moreover, the recent works on marginalization referred to prev... |

23 | A numerical procedure to generate a sample covariance matrix. - Odell, Feiveson - 1966 |

19 | The Implementation of the Bayesian Paradigm - Smith, Skene, et al. - 1985 |

15 | Unrestricted and Restricted Maximum Likelihood Latent Structure Analysis: A Manual for Users, Working Paper 1977-09 - Clogg - 1977
Citation Context ...[Yes/No response patterns of Table 2 omitted] Source: Haberman (1979, p. 559). and does not want any more children." In question B, the italicized phrase is replaced with "if the family has a very low income and cannot afford any more children," and in question C it is replaced with "if she is not married and does not want to marry the man." For these data, Haberman (1979) considered several models, one of which is the traditional latent-class model. [See Goodman (1974a,b), Haberman (1979), or Clogg (1977) for an exposition of this model.] In this example, the traditional latent-class model assumes that the manifest variables (A, B, C, D) are conditionally independent, given a dichotomous latent variable (X). In other words, if the value of the dichotomous latent variable is known for a given participant, then knowledge of the response to a given question provides no further information regarding the responses to either of the other two questions. Haberman used the EM and scoring algorithms to obtain maximum likelihood estimates of the cell probabilities. One parameter of interest associated wi...

7 | Comments on "Maximum likelihood from incomplete data via the EM algorithm" - Murray - 1977
Citation Context ...the normal approximation, the estimated posterior distribution, and the true posterior, respectively. The dashed and dotted lines are superimposed. [Figure 2. Log-Posterior Density of θ (same data and legend as in Fig. 1).] ...algorithm would indicate the inadequacy of the normal approximation. 3. FUNCTIONALS OF THE MULTIVARIATE NORMAL COVARIANCE MATRIX In this section, the posterior distribution of the correlation coefficient from the bivariate normal distribution will be investigated. To illustrate, suppose that the data in Table 1 (Murray 1977) represent 12 observations from the bivariate normal distribution with μ1 = μ2 = 0, correlation coefficient ρ, and variances σ1² and σ2². Before proceeding to the formal analysis, we note that in the four pairs of observations, two pairs have correlation 1 and the remaining two pairs have correlation −1. Thus we can expect a nonunimodal posterior distribution for ρ in this data set. In such a case, the maximum likelihood estimate and the associated standard error will clearly be misleading. Furthermore, we point out that the information regarding σ1² and σ2² in the eight incomplete observations...

2 | Hypothesis Testing in Multiple Imputation With Emphasis on Mixed-Up Frequencies in Contingency Tables - Li - 1985
Citation Context ...ay not be expressible in closed form. Ideally, one would want to choose the augmentation such that the posterior given the augmented data can be sampled from with ease. In cases where this cannot be done, one would have to resort to approximate sampling methods. The Dirichlet sampling scheme discussed in Section 4 provides a simple approach for approximate sampling in the case of multinomial data. Moreover, the recent works on marginalization referred to previously may potentially be helpful in this regard. We wish to draw the reader's attention to the concurrent and independent work of K. H. Li (1985a,b), who has devised an algorithm for doing multiple imputation of missing values that is very similar in its formal structure to our method. Whereas the main goal in the present article is to exploit the data augmentation formulation in the Bayesian inference of parameters, in Li's work the initial focus, as well as the source of examples, has been the imputation of missing values. Thus the essential difference is that Li's method exploits the simplicity of the distribution of one component of the missing values given both the observed data and the remainder of the missing values, whereas our...

1 | Bayesian Estimation of Latent Roots and Vectors With Special Reference to the Bivariate Normal Distribution - Tiao, Fienberg - 1969
Citation Context ...fteenth iterations (m = 6,400). [Table 1. Twelve Observations From a Bivariate Normal Distribution.] In addition, the true posterior of the correlation coefficient, which is proportional to (1 − ρ²)^4.5/[(1.25 − ..., is also plotted. As is evident from the plot, the estimated posterior distribution recovers the bimodal nature of the true distribution. Finally, it is noted that the algorithm presented in this article can be used to examine the posterior distribution of any functional of the covariance matrix. For example, the posterior distribution of the largest eigenvalue of the covariance matrix (Tiao and Fienberg 1969) may be examined by simply computing the largest eigenvalue of each of the observations from the inverted Wishart distribution in the second step of the algorithm. 4. THE DIRICHLET SAMPLING PROCESS In the linkage example of Section 2, the augmented posterior distribution p(θ | x) is a beta distribution. Thus it is a trivial matter to carry out the P step. In more complicated models, the sampling of θ from p(θ | x) may not be so simple. We now present a primitive but generally applicable procedure, based on a Dirichlet sampling process, which can be used to approximately sample from the posterior distribut...