Bayesian Data Analysis for Data Mining
- In Handbook of Data Mining
, 2002
Introduction. The Bayesian approach to data analysis computes conditional probability distributions of quantities of interest (such as future observables) given the observed data. Bayesian analyses usually begin with a full probability model - a joint probability distribution for all the observable and unobservable quantities under study - and then use Bayes' theorem (Bayes, 1763) to compute the requisite conditional probability distributions (called posterior distributions). The theorem itself is innocuous enough. In its simplest form, if Q denotes a quantity of interest and D denotes data, the theorem states: $p(Q \mid D) = p(D \mid Q) \times p(Q) / p(D)$. This theorem prescribes the basis for statistical learning in the probabilistic framework. With $p(Q)$ regarded as a probabilistic statement of prior knowledge about Q before obtaining the data D, $p(Q \mid D)$ becomes a revised probabilistic statement of our knowledge about Q in the light of the data (Bernardo and Smith, 1994, p.2). The marginal lik
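As a small illustration of Bayes' theorem as stated in this abstract, the following sketch computes a posterior for a discrete quantity Q. The function name `posterior`, the coin example, and all numbers are assumptions made for illustration; none of them come from the chapter.

```python
def posterior(prior, likelihood):
    """Return p(Q | D) for a discrete Q via Bayes' theorem.

    prior:      dict mapping each value q to p(Q = q)
    likelihood: dict mapping each value q to p(D | Q = q)
    """
    # Numerator of Bayes' theorem for each q: p(D | q) * p(q).
    joint = {q: likelihood[q] * prior[q] for q in prior}
    # Denominator p(D): the joint summed over all values of Q.
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("the data have zero probability under every value of Q")
    return {q: joint[q] / evidence for q in joint}


# Hypothetical example: Q is which coin was tossed, D is "heads was observed".
prior = {"fair": 0.5, "biased": 0.5}        # assumed prior knowledge about Q
likelihood = {"fair": 0.5, "biased": 0.9}   # assumed p(heads | coin)
post = posterior(prior, likelihood)
print(post)
```

After one observed head, the posterior shifts toward the biased coin, which is the "revised probabilistic statement" the abstract describes.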
Selection Sampling from Large Data sets for Targeted Inference in Mixture Modeling
, 2009
One of the challenges of Markov chain Monte Carlo in large datasets is the need to scan through the whole data at each iteration of the sampler, which can be computationally prohibitive. Several approaches have been developed to address this, typically drawing computationally manageable subsamples of the data. Here we consider the specific case where most of the data from a mixture model provides little or no information about the parameters of interest, and we aim to select subsamples such that the information extracted is most relevant. The motivating application arises in flow cytometry, where several measurements from a vast number of cells are available. Interest lies in identifying specific rare cell subtypes and characterizing them according to their corresponding markers. We present a Markov chain Monte Carlo approach where an initial subsample of the full data is used to draw a further set of observations from a low probability region of interest, and describe how inferences can be made efficiently by reducing the dimensionality of the problem. Finally, we extend our method to a Sequential Monte Carlo framework whereby the targeted subsample is augmented sequentially as estimates improve, and introduce a stopping rule for determining the size of the targeted subsample. We implement our algorithm on a flow cytometry dataset, providing higher resolution inferences for rare cell subtypes.
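The idea of drawing a targeted subsample from a low probability region can be sketched in a heavily simplified form. This is not the paper's algorithm: it assumes one-dimensional data, a normal rare component, and rough estimates `rare_mean` and `rare_sd` standing in for what a mixture fit to an initial random subsample might provide. All names and numbers here are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def targeted_subsample(data, rare_mean, rare_sd, size):
    """Keep the `size` observations with the highest density under the
    (assumed) pilot estimate of the rare component."""
    ranked = sorted(data, key=lambda x: normal_pdf(x, rare_mean, rare_sd),
                    reverse=True)
    return ranked[:size]


# Simulated data: a large bulk population plus a rare subtype (assumed values).
rng = random.Random(0)
bulk = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
rare = [rng.gauss(6.0, 0.5) for _ in range(50)]
data = bulk + rare

# Hypothetical pilot estimates, as if obtained from an initial subsample fit.
targeted = targeted_subsample(data, rare_mean=6.0, rare_sd=0.5, size=50)
print(len(targeted), min(targeted), max(targeted))
```

A full analysis would then fit the model to the targeted observations while correcting for the non-random selection, which this sketch does not attempt.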

