N-mixture model with simulated data

Simulating data is an excellent way to understand a model: you really have to think through how the model works in detail. And by running hundreds of simulations, you can see how well the analysis method performs when applied to data with known true values, and check its bias and errors.

The biological process

We’ll simulate data for 150 sites, i = 1, 2, …, 150; these are our spatial replicates. The expected number at each site is the same, $\lambda = 2.5$. The sites are all identical and independent, and the differences in the numbers at each, $N_i$, are due to random location of individual animals, so we use a Poisson distribution:$$N_i \sim {\rm Poisson}(\lambda)$$

# Biological model
nSites <- 150  # Spatial replicates
lambda <- 2.5  # Expected number at each site

set.seed(1)
N <- rpois(nSites, lambda)
table(N)
# N
#  0  1  2  3  4  5  6  7
#  8 30 43 36 19  8  3  3
plot(table(N))

Most sites have 1 to 4 animals, some have none, and the maximum is 7. We would get different numbers if we chose a different seed.

The observation model

Now we collect data on our population. We visit each site 3 times, j = 1, 2, 3, and record the number of animals we detect. The number of animals, $N_i$, is the same at each visit, but we don’t detect every animal, so the counts vary from visit to visit.

We’ll set the probability of detecting an individual on a single visit, p, at 0.6, the same value for all individuals at all sites on all visits, and detections are independent. There are no misidentifications or double counts. The counts are drawn from a binomial distribution:$$C_{ij} \sim {\rm Binomial}(N_i, p)$$

# Observation model
nOcc <- 3  # Temporal replicates
p <- 0.6   # Probability of detecting an individual

C <- matrix(NA, nSites, nOcc)
for(j in 1:nOcc) {
  C[, j] <- rbinom(nSites, N, p)
}
head(cbind(truth=N, C))
#      truth
# [1,]     1 0 0 1
# [2,]     2 1 2 1
# [3,]     3 2 2 1
# [4,]     5 3 3 5
# [5,]     1 1 1 1
# [6,]     5 4 4 1

Of course, the counts are on average lower than the true value, so an estimate of the population based on counts will be biased low.
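Since each animal is detected with probability p, the expected count at a site is $pN_i$, so the overall mean count should be close to p times the mean true abundance. A quick check, continuing from the simulation code above:

```r
# Each count has expectation p * N[i], so on average
# the counts underestimate abundance by the factor p:
mean(C) / mean(N)  # close to p = 0.6
```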

Old-style analysis

Before Andy Royle’s 2004 paper, the best estimates were obtained by taking the maximum count for each of the sites.

# Naive analysis
Cmax <- apply(C, 1, max)
head(cbind(truth=N, C, Cmax=Cmax))
#      truth       Cmax
# [1,]     1 0 0 1    1
# [2,]     2 1 2 1    2
# [3,]     3 2 2 1    2
# [4,]     5 3 3 5    5
# [5,]     1 1 1 1    1
# [6,]     5 4 4 1    4

# Estimate total population size:
sum(Cmax)
# 316
sum(N)  # True value
# 379

# Number of sites occupied? 
sum(Cmax > 0)
# 140
sum(N > 0)  # True value
# 142

The maximum count is still biased low, though less biased than a count from a single visit. The estimate of overall abundance is only 83% of the true value. In contrast, we did detect the species at most of the sites where they occur, so even the naive estimate of occupancy is not too bad.
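To see why the maximum count is still biased low: the maximum of the three counts equals $N_i$ only if every animal at the site is detected on at least one visit. A small Monte Carlo sketch for a single site, with N = 5 and p = 0.6 assumed for illustration:

```r
# Distribution of the maximum of 3 binomial counts at one site
# with N = 5 animals and p = 0.6 (values assumed for illustration)
set.seed(42)
maxima <- replicate(10000, max(rbinom(3, size = 5, prob = 0.6)))
mean(maxima)  # roughly 3.9, well below the true N = 5
```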

N-mixture analysis

We run the N-mixture analysis with a simple Poisson-binomial model, which corresponds to the data generation scenario. Here are the Kruschke-style diagram and the JAGS code:

# File name "Nmix1.jags"
model{
  for(i in 1:nSites) {
    # Biological model
    N[i] ~ dpois(lambda)
    # Observation model
    for(j in 1:nOcc) {
      C[i,j] ~ dbin(p, N[i])
    }
  }
  # Priors
  lambda ~ dunif(0, 10)
  p ~ dbeta(1,1)
  # Derived variable
  Ntotal <- sum(N)
}

We use a uniform Beta(1,1) prior for the probability of detection, p, and a uniform prior for $\lambda$ between 0 and 10. The upper limit here is chosen so that it does not constrain the posterior for $\lambda$.

Now we can bundle the data and run the analysis:

JAGSdata <- list(nSites=nSites, nOcc=nOcc, C=C)
str(JAGSdata)
# List of 3
#  $ nSites: num 150
#  $ nOcc  : num 3
#  $ C     : int [1:150, 1:3] 0 1 2 3 1 4 3 3 2 0 ...

wanted <- c("lambda", "p", "Ntotal")

library(jagsUI)
jagsOut <- jags(JAGSdata, NULL, wanted, "Nmix1.jags", DIC=FALSE,
  n.chains=3, n.iter=20000, n.adapt=5000, parallel=TRUE)
# Error in checkForRemoteErrors(val) :
#   3 nodes produced errors; first error: Error in node C[18,1]
# Node inconsistent with parents

If you are lucky, that will run without error, but we were not lucky this time. We trusted JAGS to generate the starting values for all the parameters, including lambda and N, by drawing random values from the priors. A problem arises when the starting value for N is smaller than one of the counts. In our case, C[18,1] = 6, so if the initial value of N[18] is less than 6 we get this error.

We can avoid this by specifying starting values for N in an inits function. This produces a named list with the required starting values. We could simply set the same large value of N for all sites, or use the maximum count for each site, as we do here:

# Specify initial values
Cmax <- apply(C, 1, max)
inits <- function() list(N = Cmax)
str(inits())
# List of 1
#  $ N: int [1:150] 1 2 2 5 1 4 5 3 2 0 ...

jagsOut <- jags(JAGSdata, inits, wanted, "Nmix1.jags", DIC=FALSE,
  n.chains=3, n.iter=20000, n.adapt=5000, parallel=TRUE)
mco <- mcmcOutput(jagsOut)
diagPlot(mco)
plot(mco)
summary(mco)
#           mean     sd  median     l95     u95  Rhat MCEpc
# lambda   2.691  0.204   2.680   2.302   3.097 1.001 1.499
# p        0.571  0.035   0.573   0.503   0.638 1.000 1.735
# Ntotal 402.567 23.056 400.000 359.000 446.000 1.000 1.821

Convergence is good and lambda is well away from the upper limit of the prior, ie, 10.

The estimates of Ntotal and lambda are a bit high; the true value for this simulated population is 379 with a realised lambda of 379/150 = 2.53. And the estimate of p is a bit low, which makes sense – if we underestimate p we shall overestimate lambda.
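This trade-off happens because the data mainly pin down the product $\lambda p$, the expected count per visit, so an error in one parameter is offset by an opposite error in the other. Checking with the posterior means reported above:

```r
# The mean count constrains the product lambda * p,
# so an underestimate of p is offset by an overestimate of lambda:
2.691 * 0.571  # estimated lambda * p, about 1.54
2.5 * 0.6      # true lambda * p = 1.5
```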

We can’t judge a modelling procedure on the basis of one simulation. I did 1000 simulations with the model and found that lambda and Ntotal are biased slightly high, with mean lambda of 2.54 and Ntotal 380 (vs. 2.5 x 150 = 375), while the mean estimate of p was 0.59. Root mean squared errors (RMSEs) were 0.13 for lambda, 19.6 for Ntotal and 0.03 for p.

Real vs simulated data

Here we worked with simulated data generated by a Poisson distribution, ie, simulating the results for animals independently and randomly distributed. In reality, animals are rarely independent and some degree of clustering is normal. Also some sites may have 0 animals because they are unsuitable for the species. Using the Poisson distribution for real data will usually give disastrously bad estimates!

The negative binomial is a much better choice. More to come!
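As a preview, clustered populations can be simulated by swapping rpois for rnbinom with the same mean; this is just a sketch, with an assumed dispersion parameter (size = 1 is not from the post above; smaller values give more clustering):

```r
# Sketch: negative binomial abundances with the same mean, mu = 2.5,
# and an assumed dispersion parameter size = 1
set.seed(1)
Nclust <- rnbinom(150, mu = 2.5, size = 1)
var(Nclust) / mean(Nclust)  # well above 1: overdispersed relative to Poisson
```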

Download a ZIP file with all the code here (includes NIMBLE code).

2 thoughts on “N-mixture model with simulated data”

  1. Hi, Mike. Have really benefited from this site and your curated web page. I have seen zero-inflated models where either phi or its complement, theta = (1-phi), is taken as the Bernoulli trial on habitat suitability/zero-inflation. These variations should yield the same result with changing signs on lambda (aka mu) covariates. True?

    1. Brian, Glad you have found these pages useful.

      The “more to come” at the end of the page above refers to zero-inflated and over-dispersed models. It’ll be a while before I get around to those: currently in the middle of a workshop, and distance sampling may be higher priority. In the meantime, there’s an overview of the options here.

      This data set was created with just a Poisson model. If you tried a zero-inflated analysis, I’d expect a small amount of zero inflation and a corresponding higher value of lambda for the occupied sites.
