# Data augmentation and closed captures

In order to write a JAGS model for abundance from mark-recapture data, we need to use a technique called “data augmentation”. This video gives the background:

This is a trial recording and I will do a better one if this all works.

With data augmentation, we recognise that our mark-recapture data are incomplete, the animals present but never detected are not included. We need to include all-zero rows for these, but we don’t know how many there are. The solution is to add lots of all-zero rows, and then estimate how many of those correspond to animals present but not detected.

We’ll look at a “closed capture” analysis; this type of analysis has been superseded by spatial capture-recapture (SCR), but it does provide a simple framework to try out data augmentation.

## The Kanha tiger data set

For our example, we will use camera trap data for tigers from Kanha National Park in northern India. This work is reported by Karanth et al (2004)1Karanth, K.U., Nichols, J.D., Kumar, N.S., Link, W.A., & Hines, J.E. (2004) Tigers and their prey: Predicting carnivore densities from prey abundance. Proceedings of the National Academy of Sciences, 101, 4854-4858. and Ullas Karanth and WCS India kindly provided the data.

The data are a data frame with rows for 26 individual animals and 10 columns for the capture occasions. A value of 1 indicates that the individual was captured somewhere on the trap array on that occasion. There is no information here on trap location or where the animals were caught. You can load and check the data with the following code:


str(tigs)
'data.frame':   26 obs. of  10 variables:
$V1 : int 1 1 1 0 0 0 0 0 0 0 ...$ V2 : int  0 0 1 1 1 1 0 0 0 0 ...
$V3 : int 0 0 0 1 0 0 1 1 0 0 ...$ V4 : int  1 0 1 0 1 0 0 1 1 1 ...
$V5 : int 0 0 0 0 0 0 0 0 0 1 ...$ V6 : int  0 0 0 0 0 0 1 0 0 0 ...
$V7 : int 0 0 0 0 0 0 1 0 0 0 ...$ V8 : int  1 1 0 1 0 0 0 0 0 1 ...
$V9 : int 1 0 1 1 0 0 0 0 1 1 ...$ V10: int  0 1 1 1 1 0 0 0 1 0 ...
summary(tigs)
# ... output not shown

# Get maximum likelihood estimates:
wiqid::closedCapM0(tigs)
Real values:
est   lowCI   uppCI
Nhat 28.4464 26.4289 39.9533
phat  0.2039  0.1536  0.2655

For the simple model we can aggregate the captures, so that $y$ is a vector with the number of occasions (out of $n$ = 10) that each animal was captured:

( y <- rowSums(tigs) )
[1] 4 3 5 5 3 1 3 2 3 4 3 1 1 2 1 2 2 2 1 1
[21] 3 1 2 1 1 1

## The closed-capture model

We are using the simplest closed-capture model, often referred to as M02Otis, D.L., Burnham, K.P., White, G.C., & Anderson, D.R. (1978) Statistical inference from capture data on closed animal populations. Wildlife Monographs, 62, 1-135. or N(.) p(.).

The model is based on the following assumptions:

1. Animals do not lose their marks and marks are correctly read: Tigers do not change their stripes, but it’s important that photographs are correctly identified.
2. The population is closed: no animals enter or leave the population during the study; no births, deaths, immigration or emigration: That is usually satisfied if the study period is short.
3. Every animal in the population has a non-zero probability of being captured: For territorial animals, we need to have a sufficiently dense array of camera traps so that every animal has at least one trap in their territory.
4. All animals have equal probability of capture: This is problematic! Even if there is no differences in behaviour among animals, some with have more than one trap in their territory, and will have a higher probability of capture.
5. Captures are independent: This is the least important assumption, and is reasonable here; tigers do tend to avoid each other, so there may be a little non-independence.

The violation of #4, all animals have equal probability of capture, is serious and we should be using a model that allows for heterogeneity or a spatial capture-recapture (SCR) model that deals separately with each trap. We will explore SCR models next, but will press ahead with the simple model on this page.

If you compare this with the model we used for salamanders, you will see that it is exactly the same apart from the change in notation: $\psi$ becomes $\Omega$ and $z_i$ becomes $w_i$. Next, the model in JAGS code, which is the same as the salamanders model apart from the notation:

# Save this in the file "closedCaptures.jags"
model{
# Likelihood
for(i in 1:M) {
# Ecological model
w[i] ~ dbern(omega)  # w=1 if present
# Observation model
y[i] ~ dbin(p * w[i], n)
}

# Priors
p ~ dbeta(1, 1) # Uninformative prior
omega ~ dbeta(1, 1)

# Derived values
N <- sum(w)
}

## R code for preparations and running the model

We will augment the data set by adding 30 zeros to the end of $y$ to represent 30 extra rows with all zeros in the data set. Then we put all the data into a list, including the parameter w, which we know must be 1 if the animal was captured, otherwise NA; we do not need to provide initial values for w. We complete the other preparations and run the model with jagsUI::jags:

# Augment the data
naug <- 30
yaug <- c(y, rep(0, naug))
( M <- length(yaug) )
# [1] 56

# Organise the data etc
jagsData <- list(y = yaug, n = 10, M = M,
w = ifelse(yaug > 0, 1, NA))
str(jagsData)
# List of 4
# $y: num [1:56] 4 3 5 5 3 1 3 2 3 4 ... #$ n: num 10
# $M: int 56 #$ w: num [1:56] 1 1 1 1 1 1 1 1 1 1 ...

wanted <- c("p", "omega", "N")

( outM0 <- jags(jagsData, NULL, wanted,
model="closedCaptures.jags", DIC=FALSE,
n.chains=3, n.iter=20000, n.thin=2) )
#         mean    sd   2.5%    50%  97.5% Rhat n.eff
# p      0.200 0.028  0.148  0.199  0.256    1 30000
# omega  0.527 0.077  0.379  0.525  0.682    1 14374
# N     29.519 2.455 26.000 29.000 35.000    1 17947

The model runs in less than 5 secs on my laptop, and the diagnostic statistics look good. We’ll convert to an mcmcOutput object and look at the diagnostic plots; apart from the usual diagnostics, we need to check that the posterior for $\Omega$ and $N$ are not constrained on the right.

# convert to mcmcOutput and do plots
library(mcmcOutput)
outB <- mcmcOutput(outM0, default='N')
diagPlot(outB)

plot(outB)

The diagnostic plot for $\Omega$ only goes up to 0.8 and the highest value for $N$ is 37, well below the value for $M$ = 56.

The maximum likelihood estimate for $N$ earlier was 28.45, and we see in the histogram that the mode is at 28. But the posterior distribution is skewed, with a heavy right tail, so the mean is higher than the mode.

## Prior distribution for N

We have not dealt with prior distributions for data augmentation models, simply treating them as the same as for an occupancy model.

By choosing a value for $M$ and using a uniform distribution for $\Omega$, we are implicitly setting a discrete uniform prior for $N$ over the range from 0 to $M$. And if we plan to check and increase $M$ until the posterior for $N$ is not constrained by the prior, it’s effectively a range from 0 to infinity.

There is no guarantee that a model with this prior will converge (Link 2013)3Link, W.A. (2013) A cautionary note on the discrete uniform prior for the binomial N. Ecology, 94, 2173-2179., and with some data sets we may be led to increase $M$ to values which are not biologically plausible.

If our favoured “uninformative” prior becomes nonsensical, we need to put on our biologist hats and devise a weakly informative prior. Link’s (2013) suggestion to use ${\rm Beta}(0.001, 1)$ as the prior for $\Omega$ has probability decreasing as $N$ increases, but it gives an almost infinite probability for $N = 0$, which is no good if we know the species is there. We can do better.

The area of Kanha NP is almost 1000 km2, and if we reckon that 10 tigers per 100 km2 is the limit of plausibility, we have a maximum of 100 animals. This is our value for $M$.

Maybe those who know Kanha reckon that 2 animals per 100 km2 is the most likely value: 20 will become the mode of our prior distribution. We set the spread of the prior with the concentration parameter: a value of 2 gives a uniform prior, and we want a bit more than that. We’ll plot the distribution for concentrations of 3, 4 and 5 and see which reflects our knowledge.

library(wiqid)

# Maximum plausible value
M <- 100
# Prior mode for N and omega
Nmode <- 20
omegaMode <- Nmode/M
# Range of values for N and omega
N <- 0:M
omega <- N/M

# Do the plots
op <- par(mfrow=c(1,3))
pN3 <- dbeta3(omega, Nmode/M, 3)/M
plot(N, pN3, main="mode=20, concentration=3")
pN4 <- dbeta3(omega, Nmode/M, 4)/M
plot(N, pN4, main="mode=20, concentration=4")
pN5 <- dbeta3(omega, Nmode/M, 5)/M
plot(N, pN5, main="mode=20, concentration=5")
par(op)

All three priors give zero probability for $N = 0$ and $N = 100$. My feeling is that the distribution should tail off gradually for the larger values rather than dropping steeply as in the left plot. If 100 is implausible, 99 and 98 are little better. The right plot gives smaller probabilities for just 1 or 2 animals, which seems right.

JAGS uses the shape parameters for the beta distribution, and we can calculate those with:

pars <- getBeta3Par(Nmode/M, 5)
pars
#      shape1 shape2
# [1,]    1.6    3.4

Then we change the code for the prior for omega in the JAGS model file:

omega ~ dbeta(1.6, 3.4) #Informative prior

We change the data augmentation step so that $M = 100$ and rerun the model with the new prior. Here are my results:

#         mean    sd   2.5%    50%  97.5% Rhat n.eff
# p      0.200 0.028  0.147  0.199  0.256    1 22422
# omega  0.296 0.050  0.204  0.294  0.400    1 30000
# N     29.484 2.432 26.000 29.000 35.000    1 30000

For the Kanha tiger data, the results are the same, differences are well within the range we’d expect when running the same model twice. The histogram for the posterior of $N$ looks the same as before, with the same mean and same HDI interval.

The new model takes longer to run than the old model. Run time increases linearly with $M$, so it makes sense to try the uniform prior with a smaller $M$ and only resort to the informative prior and maximum plausible $M$ if that does not work.