In this section we will look at the simplest occupancy model: one species and one time period. The species may be present at a site but not be detected, but we assume that the species is not recorded when it is absent, ie, no **false positives**. We also assume **closure**, that is the species is either present or absent at a site throughout the study; individual animals can come and go, but the status of the species must not change. To estimate (and correct for) probability of detection, we need replicate data for some of the sites, and we assume these replicate observations are independent.

### Salamanders

For our example we’ll use the data collected in 2001 for Blue Ridge two-lined salamanders (*Eurycea wilderae*) in the Great Smokey Mountains National Park in the USA. These data are used by MacKenzie et al (2018, p135)^{1}MacKenzie, D.I., Nichols, J.D., Royle, A.J., Pollock, K.H., Bailey, L.L., & Hines, J.E. (2018) *Occupancy estimation and modeling : inferring patterns and dynamics of species occurrence* (2nd edn) Elsevier Publishing.

Each site consisted of a 50 x 3m plot where natural cover objects were examined and a coverboard transect with 5 coverboards at 10m intervals. The 39 sites were visited 5 times at 2-week intervals. The data consist of the number of visits where the target species was detected.

**# Create vector of number of detections
( y <- rep(4:0, c(1,4,1,12,21)) )
# [1] 4 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 0 0
# [21] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
table(y)
# y
# 0 1 2 3 4
# 21 12 1 4 1
n <- 5 # No. of surveys at each site
( nSites <- length(y) )
# [1] 39
mean(y > 0) # naive occupancy estimate
# [1] 0.4615**

Salamanders were detected at least once at 18 of the 39 sites, while 21 sites had no detections. If we were to assume no detection means absence, ie, probability of detection if present = 1, we would estimate occupancy as 18/39 = 0.46. But that would be naïve; if it were true, the data would consist entirely of 5s and 0s.

## The model

#### Likelihood

Our likelihood model has 2 parts. The biological model describes the way we think presence works, with a parameter $z_i = 1$ if site $i$ is occupied and $z_i = 0$ if not occupied; $z$ is a latent variable, inferred from the observed capture histories. For this simple model, $z_i$ is drawn from a Bernoulli distribution with probability parameter $\psi$: $$ z_i \sim {\rm Bernoulli}(\psi)$$

The second part of the model describes the observation process. A detection is a success, and we model it with a binomial distribution. If the site is occupied ($z_i = 1$) the probability of success is $p$, but if it’s not occupied ($z_i = 0$) probability of success is 0. Thus: $$y_i \sim {\rm Binomial}(n, pz_i)$$

#### Priors

We need priors for both $\psi$ and $p$. They are probabilities, so we use the Beta distribution. If we had good information about these salamanders at these sites we could use informative priors, but here we’ll use minimally informative uniform priors: $$\psi \sim {\rm Beta}(1, 1)$$ $$p \sim {\rm Beta}(1, 1)$$

## The model diagram

### The JAGS code

**# File name "salamanders.jags"
model {
# Likelihood
for(i in 1:nSites) {
# Biological model
z[i] ~ dbern(psi) # z=1 if occupied
# Observation model
y[i] ~ dbin(p * z[i], n)
}
# Priors
p ~ dbeta(1, 1) # Uninformative prior
psi ~ dbeta(1, 1)
# Derived values
N <- sum(z)
}**

*JAGS syntax: Note that the *`dbin `

*distribution has probability as the first argument and the number of trials as the second argument; this differs from the conventional order.*

## R code to run the model

**# Organise the data etc
jagsData <- list(y = y, n = n, nSites = nSites)
str(jagsData)
# List of 3
# $ y : int [1:39] 4 3 3 3 3 2 1 1 1 1 ...
# $ n : num 5
# $ nSites: int 39
wanted <- c("p", "psi", "N")
( outSal <- jags(jagsData, NULL, wanted,
model="salamanders.jags", DIC=FALSE,
n.chains=3, n.iter=10000) )
# ...
# Error in jags.model(file = model.file, data = data, inits = inits, n.chains = n.chains, :
# Error in node y[1]
# Node inconsistent with parents
y[1]
# [1] 4**

Hmm, that didn’t work. The node `y[1]`

is inconsistent with parents; I checked `y[1]`

and it’s 4, ie, salamanders were detected on 4 out of 5 visits to site #1. In our model we have `y[i] ~ dbin(p * z[i], n)`

, so the parents of `y[1]`

are `p`

, `z[1]`

and `n`

.

The problem is due to the starting value of `z[1]`

, which we left to JAGS to generate – we didn’t specify any initial values. JAGS draws random 0/1 values from a Bernoulli distribution. If it sets `z[1] = 0`

, ie, site #1 is not occupied, then detecting the species on 4 visits is obviously not possible, hence the error. The usual way to solve this is to specify initial values for `z`

, often making them all 1s.

**# Specify initial values for 'z'
inits <- function() list(z = rep(1, nSites))
inits()
# $z
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [21] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
( outSal <- jags(jagsData, inits, wanted,
model="salamanders.jags", DIC=FALSE,
n.chains=3, n.iter=11000, n.burn=1000) )
# ...
# mean sd 2.5% 50% 97.5% Rhat n.eff
# p 0.259 0.056 0.157 0.257 0.374 1 11013
# psi 0.613 0.123 0.395 0.604 0.884 1 9768
# N 24.103 4.061 19.000 23.000 34.000 1 9065
mcmcOutput::diagPlot(outSal)**

But wait! Why is JAGS is picking starting values anyway? We know that the first 18 sites are occupied, so the corresponding *z*‘s have to be 1; *z* is a *partially observed* variable, with some values known and others latent. We can pass a vector of values for *z* to JAGS as data, with 1 if the species was detected at that site and NA where it was not detected and may be present or absent. JAGS will only work on estimating the unknown `z`

‘s and the model will take less time to run.

**# Add 'z' to the data
z <- ifelse(y > 0, 1, NA)
jagsData2 <- list(y = y, n = n, nSites = nSites, z = z)
str(jagsData2)
# List of 3
# $ y : int [1:39] 4 3 3 3 3 2 1 1 1 1 ...
# $ n : num 5
# $ nSites: int 39
# $ z : num [1:39] 1 1 1 1 1 1 1 1 1 1 ...
( outSal2 <- jags(jagsData2, NULL, wanted,
model="salamanders.jags", DIC=FALSE,
n.chains=3, n.iter=11000, n.burn=1000) )
# ...
# mean sd 2.5% 50% 97.5% Rhat n.eff
# p 0.259 0.056 0.159 0.257 0.374 1.000 7020
# psi 0.611 0.123 0.396 0.603 0.882 1.000 8604
# N 24.082 4.009 18.000 23.000 34.000 1.001 3609
mcmcOutput::diagPlot(outSal2)**

The diagnostic plots look good, though we should increase the number of iterations to get good estimates of the credible interval; it runs quickly, so that would be easy.

### Looking at the results

**# convert to mcmcOutput and do nice plots of marginal probabilities
library(mcmcOutput)
salB <- mcmcOutput(outSal2, default='psi')
plot(salB)**

The plots above show the marginal probabilities: the plot for `psi `

ignores the values of `p`

. Looking at those plots it’s easy to forget that the estimates of psi and p are not independent. We’ll plot the two together and estimate the correlation:

**# Look at joint probability
plot(salB$psi, salB$p, pch=19, cex=0.2, col="#00000033",
xlab="psi", ylab="p", las=1)
abline(v=mean(salB$psi), h=mean(salB$p), col='blue')
# ... and correlation
cor(salB$psi, salB$p)
# [1] -0.5487149**

We see a negative correlation between the posterior distributions of `psi `

and `p`

, with high `p`

related to low `psi`

. The correlation of -0.5 is moderate, but we should still take it into account if we are including both in a calculation.

There’s not a lot we can explore with this simple data set. In the next section we’ll work with a data set for willow tits in Switzerland, where we have covariates.

Download a ZIP file with the code here.

Good evening,

I started to look out on Bayesian statistics only recently. So, maybe, the question I’m about to do is the product of my scarse knowledge on this topic. However, can someone please explain why the diagnostic plot looks good? Based on which this inference was made? Many thanks.

Marcello

Marcello,

Have you looked at our diagnostics page? You’ll find examples of diagnostic plots that don’t look good here and here.

Dear Admin,

Many thanks. I didn’t look at these pages. Both are very useful. Thanks again.

Sincerely,

Marcello