Simulating data is an excellent way to understand a model. You really have to think through how the model works in detail. And by running hundreds of simulations, you can see how good the analysis method is when applied to data with known true values, and check the bias and errors.

## The biological process

We’ll simulate data for 150 sites, *i* = 1, 2, …, 150; these are our spatial replicates. The expected number at each site is the same, $\lambda = 2.5$. The sites are all identical and independent, and the differences in the numbers at each, $N_i$, are due to random location of individual animals, so we use a Poisson distribution:

$$N_i \sim {\rm Poisson}(\lambda)$$

```r
# Biological model
nSites <- 150   # Spatial replicates
lambda <- 2.5   # Expected number at each site
set.seed(1)
N <- rpois(nSites, lambda)
table(N)
# N
# 0  1  2  3  4  5  6  7
# 8 30 43 36 19  8  3  3
plot(table(N))
```

Most sites have 1 to 4 animals, some have none, and the maximum is 7. We would get different numbers if we chose a different seed.

## The observation model

Now we collect data on our population. We visit each site 3 times, *j* = 1, 2, 3, and record the number of animals we detect. The number of animals, $N_i$, is the same at each visit, but we don’t detect all the animals, so the counts differ among visits.

We’ll set the probability of detecting an individual on a single visit, *p*, at 0.6, the same value for all individuals at all sites on all visits, and detections are independent. There are no misidentifications or double counts. The counts are drawn from a binomial distribution:

$$C_{ij} \sim {\rm Binomial}(N_i, p)$$

```r
# Observation model
nOcc <- 3   # Temporal replicates
p <- 0.6    # Probability of detecting an individual
C <- matrix(NA, nSites, nOcc)
for(j in 1:nOcc) {
  C[, j] <- rbinom(nSites, N, p)
}
head(cbind(truth=N, C))
#      truth
# [1,]     1 0 0 1
# [2,]     2 1 2 1
# [3,]     3 2 2 1
# [4,]     5 3 3 5
# [5,]     1 1 1 1
# [6,]     5 4 4 1
```

Of course, the counts are on average lower than the true value, so an estimate of the population based on counts will be biased low.
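This is easy to check directly: each count has expectation $p N_i$, so the mean count should be about 60% of the mean true abundance. A quick sketch, regenerating the same simulated data with the seed used above:

```r
# With detection probability p, each count has expectation p * N[i],
# so the mean count is only about 60% of the mean true abundance.
set.seed(1)
N <- rpois(150, 2.5)                           # biological model, as above
C <- matrix(rbinom(150 * 3, N, 0.6), 150, 3)   # N is recycled column-wise
mean(N)   # mean true abundance
mean(C)   # mean count, close to 0.6 * mean(N)
```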

## Old-style analysis

Before Andy Royle’s 2004 paper, the best estimates were obtained by taking the maximum count for each of the sites.

```r
# Naive analysis
Cmax <- apply(C, 1, max)
head(cbind(truth=N, C, Cmax=Cmax))
#      truth       Cmax
# [1,]     1 0 0 1    1
# [2,]     2 1 2 1    2
# [3,]     3 2 2 1    2
# [4,]     5 3 3 5    5
# [5,]     1 1 1 1    1
# [6,]     5 4 4 1    4
# Estimate total population size:
sum(Cmax)
# 316
sum(N)  # True value
# 379
# Number of sites occupied?
sum(Cmax > 0)
# 140
sum(N > 0)  # True value
# 142
```

The maximum count is still biased low, though less biased than a count from a single visit. The estimate of overall abundance is only 83% of the true value. In contrast, we did detect the species at most of the sites where they occur, so even the naive estimate of occupancy is not too bad.
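The occupancy result makes sense when we work out the chance of missing an occupied site entirely: all detection trials across all three visits must fail. A quick calculation:

```r
# Probability of recording zero detections at a site holding N animals
# over 3 visits: all 3*N independent detection trials must fail,
# each with probability 1 - p = 0.4.
p <- 0.6; nOcc <- 3
missProb <- function(N) (1 - p)^(nOcc * N)
missProb(1)  # 0.064: even single-animal sites are rarely missed
missProb(2)  # about 0.004
```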

## N-mixture analysis

We run the N-mixture analysis with a simple Poisson-binomial model, which corresponds to the data-generation scenario. Here are the Kruschke-style diagram and the JAGS code:

```r
# File name "Nmix1.jags"
model{
  for(i in 1:nSites) {
    # Biological model
    N[i] ~ dpois(lambda)
    # Observation model
    for(j in 1:nOcc) {
      C[i,j] ~ dbin(p, N[i])
    }
  }
  # Priors
  lambda ~ dunif(0, 10)
  p ~ dbeta(1, 1)
  # Derived variable
  Ntotal <- sum(N)
}
```

We use a uniform Beta(1,1) prior for the probability of detection, *p*, and a uniform prior for $\lambda$ between 0 and 10. The upper limit here is chosen so that it does not constrain the posterior for $\lambda$.

Now we can bundle the data and run the analysis:

```r
JAGSdata <- list(nSites=nSites, nOcc=nOcc, C=C)
str(JAGSdata)
# List of 3
# $ nSites: num 150
# $ nOcc  : num 3
# $ C     : int [1:150, 1:3] 0 1 2 3 1 4 3 3 2 0 ...
wanted <- c("lambda", "p", "Ntotal")
library(jagsUI)
jagsOut <- jags(JAGSdata, NULL, wanted, "Nmix1.jags", DIC=FALSE,
    n.chains=3, n.iter=20000, n.adapt=5000, parallel=TRUE)
# Error in checkForRemoteErrors(val) :
#   3 nodes produced errors; first error: Error in node C[18,1]
#   Node inconsistent with parents
```

If you are lucky, that will run without error, but we were not lucky this time. We trusted JAGS to generate the starting values for all the parameters, including `lambda` and `N`, by drawing random values from the priors. A problem arises when the starting value for `N` is smaller than one of the counts. In our case, `C[18,1]` = 6, so if the initial value of `N[18]` is less than 6 we get this error.
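In fact this failure is the likely outcome, not bad luck. Assuming the starting value of `lambda` lands near its true value of 2.5, a prior draw for `N[18]` falls below 6 most of the time:

```r
# JAGS initialises N[18] with a draw from its Poisson(lambda) prior;
# any value below the largest count at that site (6) gives the data
# zero likelihood. Assuming a starting lambda near 2.5:
ppois(5, 2.5)  # P(N < 6), about 0.96, so the error is very likely
```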

We can avoid this by specifying starting values for `N` in an `inits` function, which returns a named list with the required starting values. We could simply set the same large value of `N` for all sites, or use the maximum count at each site, as we do here:

```r
# Specify initial values
Cmax <- apply(C, 1, max)
inits <- function() list(N = Cmax)
str(inits())
# List of 1
# $ N: int [1:150] 1 2 2 5 1 4 5 3 2 0 ...
jagsOut <- jags(JAGSdata, inits, wanted, "Nmix1.jags", DIC=FALSE,
    n.chains=3, n.iter=20000, n.adapt=5000, parallel=TRUE)
mco <- mcmcOutput(jagsOut)
diagPlot(mco)
plot(mco)
summary(mco)
#            mean     sd median     l95     u95  Rhat MCEpc
# lambda    2.691  0.204  2.680   2.302   3.097 1.001 1.499
# p         0.571  0.035  0.573   0.503   0.638 1.000 1.735
# Ntotal  402.567 23.056 400.000 359.000 446.000 1.000 1.821
```

Convergence is good, and the posterior for `lambda` is well away from the upper limit of the prior, i.e., 10.

The estimates of `Ntotal` and `lambda` are a bit high; the true value for this simulated population is 379, with a realised lambda of 379/150 = 2.53. And the estimate of `p` is a bit low, which makes sense: if we underestimate `p`, we shall overestimate `lambda`.
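The two parameters trade off because the data pin down their product, the expected count per site per visit, much more tightly than either parameter alone. Comparing the product of the posterior means above with the truth:

```r
# The well-identified quantity is lambda * p, the expected count
# per site per visit.
2.691 * 0.571   # product of the posterior means, about 1.54
2.5 * 0.6       # true lambda * p = 1.5
```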

We can’t judge a modelling procedure on the basis of one simulation. I did 1000 simulations with the model and found that `lambda` and `Ntotal` are biased high, with mean `lambda` of 2.54 and mean `Ntotal` of 380 (vs. 2.5 × 150 = 375), while the mean estimate of `p` was 0.59. Root mean squared errors (RMSEs) were 0.13 for `lambda`, 19.6 for `Ntotal`, and 0.03 for `p`.
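Summarising such a simulation study takes only a few lines once the estimates are collected. A sketch, using a hypothetical helper (`simSummary` is not part of the original code):

```r
# Hypothetical helper: summarise a vector of point estimates from
# repeated simulations against the known true value.
simSummary <- function(est, truth) {
  c(meanEst = mean(est),                    # mean of the estimates
    bias    = mean(est) - truth,            # positive = biased high
    RMSE    = sqrt(mean((est - truth)^2)))  # root mean squared error
}
# e.g. simSummary(lambdaEstimates, truth = 2.5)
```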

Download a ZIP file with all the code here.