MCMC diagnostics

As we saw with the orangutan example, posterior distributions are described in the JAGS output by “a mountain of numbers” – a large number of random draws from the posterior distribution of each parameter. We need to check two things: (1) bias: are the draws representative of the whole posterior distribution? and (2) precision: do we have enough draws to get sufficiently precise summary statistics?

Convergence

A Markov chain is a sequence of numbers where each one depends on the one before. The very first value is a random draw from the prior distribution, unless we provide specific initial values. Then the chain moves over the posterior distribution, eventually converging on the target distribution. The animation below shows this for the occupancy parameter psi for the salamanders analysis.

We don’t know the target distribution, but we’ll use the values produced with 1 million iterations with a very efficient algorithm instead. This is shown as the dotted red curve in the right plot, and the vertical dotted red line is its mean.

The chain moves quickly away from the starting value, but after just 200 iterations, it’s still very different from the target. We would like to have some measure of how far we are from the target, but if we knew the target distribution, we would not be doing this!

The usual way to assess convergence is to run multiple chains. If all the chains are close to the target distribution, they must be close to each other. (Unfortunately the converse is not true: they can be close to each other but far from the target.) The animation below shows 200 iterations for three chains:

Clearly, after only 200 iterations, the three chains are still a long way apart, as shown in the density plot. But no one runs only 200 iterations; the plots below show the situation after 2000 iterations:

Now the chains are very close together and we’d be happy to assume they had converged not only on each other but hopefully on the target distribution too. Checking the trace plots and density plots – with separate density curves for each chain – is an essential part of checking output.
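To experiment with these ideas without rerunning JAGS, here is a minimal R sketch that simulates three toy chains started from different values and plots their traces and densities with the coda package. The chains here are simulated AR(1) sequences, not draws from the salamander analysis:

# Three toy Markov chains started at different values; each converges
# to the same target (an AR(1) process with stationary mean 0.5).
library(coda)
set.seed(1)
makeChain <- function(start, n = 2000) {
  x <- numeric(n)
  x[1] <- start
  for (i in 2:n) x[i] <- 0.05 + 0.9 * x[i - 1] + rnorm(1, 0, 0.05)
  x
}
chains <- do.call(mcmc.list, lapply(c(0, 0.5, 1), function(s) mcmc(makeChain(s))))
plot(chains)  # trace plots (chains overlaid) and density plots

Cutting the chains short with window(chains, end = 200) reproduces the early-iteration disagreement described above.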

Convergence and Rhat

Checking all plots may be ideal, but with hundreds of nodes it’s tedious and won’t be done properly. So a numerical index of convergence is useful, at least to flag which plots we should look at carefully. With multiple chains, we can use Rhat, technically the “Potential Scale Reduction Factor” or PSRF. It compares the spreads of the individual chains with the spread of all the draws pooled.

There are various ways to calculate the spread, some quite computer-intensive. The Brooks-Gelman-Rubin version (Brooks, S.P. & Gelman, A. (1998) General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7, 434-455; p441) uses the difference between the 10th and 90th percentiles; this makes no normality assumptions, unlike methods based on variances. It's implemented in WinBUGS and the wiqid package.

For the chains of 200 draws from psi plotted above, the spreads of the three chains are 0.3066266, 0.2958543 and 0.3076736, with a mean of 0.245221. The spread for all 600 values pooled is 0.2689908. Rhat is the ratio of the pooled spread to the mean within-chain spread: 0.2689908 / 0.245221 = 1.096932, or 1.10 after rounding. The equivalent for the chains of length 2000 is 1.005.
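The interval-based calculation is easy to write in R. This is a minimal sketch, assuming chains is a matrix with one column of draws per chain; it is not wiqid's own code, which includes refinements:

# Interval-based (Brooks-Gelman-Rubin) Rhat: ratio of the pooled spread
# to the mean spread of the individual chains.
spread <- function(x) diff(quantile(x, c(0.1, 0.9), names = FALSE))
Rhat <- function(chains) {
  within <- mean(apply(chains, 2, spread))  # mean spread of the chains
  pooled <- spread(as.vector(chains))       # spread of all draws pooled
  pooled / within
}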

At convergence, Rhat should be 1. Many authors (e.g., Gelman, A., Carlin, J.B., Stern, H.S., & Rubin, D.B. (2004) Bayesian data analysis, 2nd edn. Chapman and Hall/CRC, Boca Raton, p297 (3rd edition, 2014, p287); Gelman, A. & Hill, J. (2007) Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, p352; Kruschke, J.K. (2014) Doing Bayesian data analysis: a tutorial with R, JAGS and Stan, 2nd edn. Elsevier, Amsterdam, p181) say that Rhat < 1.1 for all nodes indicates acceptable convergence, but I [Mike] like to see values < 1.05.

Precision

A basic idea of the “mountain of numbers” approach is that we can get summaries that are as precise as we like if we have a big enough mountain. In practice, big mountains can take a lot of computer time to generate, so we need some way to assess whether the mountain we have will give us an acceptable level of precision.

Monte Carlo error

For any method that uses random draws – not just JAGS analysis – the result will depend on which random numbers were used. Run the same procedure twice (without using set.seed) and you will get slightly different answers. If we ran the procedure very many times, we would see the spread of the results. The Monte Carlo standard error (MCSE) is an estimate of the size of that spread based on a single run.
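You can see this run-to-run variation with a toy example (plain Monte Carlo, not JAGS output):

# The same Monte Carlo estimate run twice: the answers differ slightly
# because different random numbers were used each time.
mean(rnorm(1000))  # one estimate of the mean of a standard normal
mean(rnorm(1000))  # a slightly different estimate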

The simplest procedure is to divide the chain of draws into batches; a chain of length 10,000 could be divided into 100 batches with 100 draws in each. Calculate the statistic you are interested in, usually the mean, for each batch, then get the MCSE from the distribution of the statistic across the batches. See Lunn et al (2013) The BUGS Book, p77, for details. The batch method is implemented in wiqid::getMCerror.
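Here is a minimal sketch of the batch calculation, assuming draws is a single chain whose length is a multiple of nBatches; it is not wiqid's own code:

# Batch-means Monte Carlo standard error (MCSE) for the mean of a chain.
mcseBatch <- function(draws, nBatches = 100) {
  batchMeans <- colMeans(matrix(draws, ncol = nBatches))  # one mean per batch
  sd(batchMeans) / sqrt(nBatches)  # SE of the overall mean from batch means
}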

Whether the estimated MCSE is adequately small depends on the uncertainty in the estimate of the mean: a given MCSE will matter less if the posterior is very broad than if it is very narrow. For that reason, it is usually expressed as a percentage of the SD of the posterior distribution. This is shown as MCEpc in plots generated by wiqid::diagPlot. Estimates of the mean of the distribution are generally acceptable if the MCSE is less than 5% of the SD. The end points of credible intervals are more variable than the mean, so an MCSE below 1.5% is recommended for reliable credible intervals (Lunn et al, 2013).
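With the hypothetical mcseBatch sketch above, the percentage is simply:

# MCSE as a percentage of the posterior SD (cf. MCEpc in diagPlot):
# aim for < 5% for means, < 1.5% for credible interval limits.
100 * mcseBatch(draws) / sd(draws)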

Effective sample size

While the MC error tells us directly what we want to know – how precise our summary statistics are – other measures are also popular.

A little experimentation will show that 1000 independent draws from a normal distribution will give adequately precise estimates of the population mean (root mean square error about 3% of the population SD). But Markov chains are not independent – by definition, each value is related to the previous value in the chain – and autocorrelation is often high. Hence the need to estimate the effective sample size (ESS) or effective number of draws (n.eff): the number of independent draws which would contain the same information as the Markov chain at hand. This is done by the function coda::effectiveSize, which underlies most such estimates.
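The effect of autocorrelation is easy to demonstrate by comparing independent draws with a strongly autocorrelated chain of the same length (a toy simulation, not JAGS output):

# Autocorrelation shrinks the effective sample size.
library(coda)
set.seed(42)
n <- 10000
iid <- rnorm(n)                                 # independent draws
ar  <- as.vector(arima.sim(list(ar = 0.9), n))  # autocorrelated AR(1) chain
effectiveSize(mcmc(iid))  # close to 10,000
effectiveSize(mcmc(ar))   # far fewer effective draws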

So what should we be aiming at for our ESS? Lunn et al (2013) suggest that 400 may be enough in practice for estimation of means, as it's equivalent to an MCSE of 5% of the SD. However, 4000 are recommended for proper estimation of a 95% credible interval: with that number the coverage will be between 94 and 96%. (On the other hand, Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., & Rubin, D.B. (2014) Bayesian data analysis, 3rd edn. Chapman and Hall/CRC, Boca Raton, p288, say that an ESS of 10 per chain is adequate.)
