Occupancy with site covariates

In the last section we looked at a simple single-species, single-season closed occupancy model. We still keep those assumptions, but now we model probability of occupancy as a function of habitat covariates.

Estimating the effects of predictors requires more information than just estimating $\psi$. We should count an additional 20 sites for each coefficient in the model. In the example below, we have 237 sites and we want to estimate 3 coefficients, so that should work well.

Swiss willow tits

Our example data come from the Swiss Breeding Bird Survey (“Monitoring Häufige Brutvögel” MHB). Surveys are done 2-3 times per year during the breeding season at 267 1-km square quadrats across Switzerland. We will use the data for willow tits (Poecile montanus) analysed by Royle and Dorazio (2008, p.87ff)1Royle, J.A. & Dorazio, R.M. (2008) Hierarchical modeling and inference in ecology: the analysis of data from populations, metapopulations and communities. Academic Press..

You can download and check the data with the following code:

# Read in data
wt <- read.csv("http://bcss.org.my/data/willowtits.csv",
  comment="#")
str(wt)
#'data.frame':   237 obs. of  15 variables:
# $ y.1   : int  0 0 0 0 0 0 0 0 1 0 ...
# $ y.2   : int  0 0 0 0 0 0 0 0 1 0 ...
# $ y.3   : int  0 0 0 0 0 0 0 0 0 1 ...
# $ c.1   : int  0 0 0 0 0 0 0 0 2 0 ...
# $ c.2   : int  0 0 0 0 0 0 0 0 1 0 ...
# $ c.3   : int  0 0 0 0 0 0 0 0 0 1 ...
# $ elev  : int  420 450 1050 1110 510 630 590 530 ...
# $ forest: int  3 21 32 35 2 60 5 13 50 57 ...
# $ dur.1 : int  240 160 120 180 210 150 115 155 165 ...
# $ dur.2 : int  240 155 105 170 225 180 105 140 165 ...
# $ dur.3 : int  240 140 85 145 235 195 95 135 180 ...
# $ day.1 : int  29 13 30 23 28 17 16 24 25 21 ...
# $ day.2 : int  58 39 47 44 56 56 37 47 46 38 ...
# $ day.3 : int  73 62 74 71 73 73 76 74 70 50 ...
# $ length: num  6.2 5.1 4.3 5.4 3.6 6.1 5.1 3.7 ...

Here’s what the variables are:

  • y.1 to y.3 : Detection/nondetection (1/0) on each of three visits to each quadrat;
  • c.1 to c.3 : Counts of territories on each of the three visits;
  • elev : quadrat elevation (m);
  • forest : forest cover (%);
  • dur.1 to dur.3 : duration of each visit (mins);
  • day.1 to day.3 : Julian date of each visit, 1 = 1st April;
  • length : length of the transect walked within the quadrat (km).

In this section we will use the detection/nondetection data (y.1 to y.3) and the elevation (elev) and forest covariates.

Data preparation

Since we are not using the survey-level covariates (dur.1-dur.3 and day.1-day.3), we can simplify the model by aggregating the detection data to produce vectors with the number of visits to each site $n$ and the number of visits when willow tits were detected $y$.

# Extract and aggregate the detection data:
Y <- wt[, 1:3]
( n <- rowSums(!is.na(Y)) ) # number of visits
#[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3...
( y <- rowSums(Y, na.rm=TRUE) ) # number with detection
#[1] 0 0 0 0 0 0 0 0 2 1 0 0 0 0 1 2 0 3 0 1 0 0...

We will also standardise the elevation and forest data so that our predictors have mean 0 and standard deviation (SD) 1. We use the standardize function in the wiqid package.

# Standardise the covariates:
library(wiqid)
elevS <- standardize(wt$elev)
mean(elevS)
# [1] 8.442321e-17
sd(elevS)
# [1] 1
forestS <- standardize(wt$forest)
mean(forestS)
# [1] 8.606644e-17
sd(forestS)
# [1] 1

I like to add an ‘S’ to the end of the variable to indicate that it’s standardised. The means are not quite zero because the SD is adjusted after centering.

The model

Biological process

As with the salamander model, $z_i$ is drawn from a Bernoulli distribution. But now the probability of occupancy depends on the elevation and forest cover, and is different for each site:$$z_i \sim {\rm Bernoulli}(\psi_i)$$

Because $\psi$ is a probability we use a logistic link, with an expression that’s essentially a logistic regression model. Switzerland is mountainous, with elevations ranging from 250 to 2750m; it’s likely that willow tit occupancy is highest at some middle elevation, so we include elevation^2 as a predictor to give a “humpy” curve rather than a straight line. $${\rm logit}(\psi_i) = \beta_0 + \beta_{for} Forest_i + \beta_{elev} Elev_i + \beta_{elev2} Elev_i^2$$

Observation process

We won’t use covariates for the probability of detection, so the observation model is the same as before:$$y_i \sim {\rm Binomial}(n_i, pz_i)$$

Priors

Intercept: The intercept, $\beta_0$, is the probability of occupancy on the logit scale for a site with all the predictors equal to zero. Since we standardised the covariates to form the predictors, that’s a site with forest cover 34.8% and elevation 1183m. I have difficulty imagining a suitable prior on the logit scale, but I can easily define a uniform prior for the probability and then convert to the logit scale: $$\psi_0 \sim {\rm Beta}(1,1)$$ $$ \beta_0 = {\rm logit}(\psi_0)$$

Coefficients: Since we are working on the logit scale and we have standardised the predictors to have SD = 1, we’d be very surprised if the coefficients were outside the range $\pm 5$. We’ll use ${\rm Uniform}(-5, 5)$, as we did with the socks in the box.

Probability of detection: We can use a uniform Beta prior for this, just as we did for the salamanders example.

Model diagram

JAGS code

# Save this in the file "wt_siteCovs.jags"
model{
  # Likelihood
  for(i in 1:nSites) {
    # biological model
    logit(psi[i]) <- b0 + bFor * fst[i] + 
        bElev * ele[i] + bElev2 * ele2[i]
    z[i] ~ dbern(psi[i])
    # observation model
    y[i] ~ dbin(p * z[i], n[i])
  }

  # Priors
  psi0 ~ dbeta(1, 1)
  b0 <- logit(psi0)
  bFor ~ dunif(-5, 5)    # forest
  bElev ~ dunif(-5, 5)   # elevation
  bElev2 ~ dunif(-5, 5)  # elevation^2
  p ~ dbeta(1, 1)
  
  # Derived variable
  N <- sum(z)
}

R code for preparations and running the model

We need to include the predictors in the data to be passed to JAGS. We also include $\tt z$, with 1 for sites where willow tits were detected and NA elsewhere. My initial run with $\tt n.iter = 5000$ gave $\tt n.eff$ values that were too small, so I increased that to get values of 10,000 or more; it still takes less than 1 minute to run provided we run the chains in parallel.

library(jagsUI)

# Prepare the data object:
jagsData <- list(y = y, n = n, nSites = length(n),
                z = ifelse(y > 0, 1, NA),
                fst = forestS, ele = elevS, ele2 = elevS^2)
str(jagsData)
# List of 6
# $ y     : num [1:237] 0 0 0 0 0 0 0 0 2 1 ...
# $ n     : num [1:237] 3 3 3 3 3 3 3 3 3 3 ...
# $ nSites: int 237
# $ z     : num [1:237] NA NA NA NA NA NA NA NA 1 1 ...
# $ fst   : num [1:237] -1.15389 -0.50052 -0.10124 0.00766 ...
# $ ele   : num [1:237] -1.18 -1.133 -0.205 -0.112 -1.041 ...
# $ ele2  : num [1:237] 1.392 1.2847 0.0421 0.0126 1.0828 ...

wanted <- c("p", "psi0", "b0", "bFor", "bElev", "bElev2",
  "N")

# Run the model (20 secs)
( out1 <- jags(jagsData, NULL, wanted, 
  model="wt_siteCovs.jags", DIC=FALSE,
  n.chains=3, n.iter=10000, parallel=TRUE) )
# ...
#          mean    sd   2.5%    50%  97.5%   Rhat n.eff
# p       0.787 0.029  0.727  0.788  0.840  1.000 30000
# psi0    0.452 0.067  0.323  0.452  0.587  1.001  1870
# b0     -0.195 0.277 -0.739 -0.194  0.352  1.001  1852
# bFor    0.871 0.238  0.419  0.867  1.355  1.000  6115
# bElev   2.070 0.315  1.489  2.057  2.722  1.001  1547
# bElev2 -1.156 0.279 -1.726 -1.147 -0.641  1.002  1232
# N      80.510 1.335 79.000 80.000 84.000  1.000 30000

mcmcOutput::diagPlot(out1)

The diagnostic plots look fine. In particular we see that the posterior distributions for the coefficients, bFor, bElev and bElev2, are well clear of the limits of $\pm 5$ imposed by our prior.

The coefficient values show that forest has a positive effect – occupancy is higher with higher forest cover – and the elevation effect has a hump (bElev2 is negative)

Plotting the output

The best way understand the results is to generate some plots. We’ll start with a simple plot of how occupancy varies with elevation for sites with average forest cover (35%) with corresponds to forestS = 0. We’ll use the point estimates for the coefficients in out1$mean.

# Create a vector of values for the x axis
range(wt$elev)
# [1]  250 2750
xx <- seq(250, 2750, length=101)
# Need to scale xx the same way we scaled the elev data
xxS <- standardize2match(xx, wt$elev)

logit.psi <- with(out1$mean, b0 + bElev * xxS +
    bElev2 * xxS^2)
psi <- plogis(logit.psi)
plot(xx, psi, type='l')

Occupancy reaches a maximum – according to our model – at an elevation of 1760 for sites with average forest cover. That’s quite high: let’s check what values for forest cover were observed.

plot(wt$elev, wt$forest)
abline(h=35)

High forest cover (70-80%) only occurs up to about 2000m, and even average cover only gets to 2200m. So the line beyond 2200m on our simple plot of occupancy above is nonsense. We’ll do new plots for three different values of forest cover, 0%, 35% and 80%. We’ll also plot credible intervals, for which we need to use the whole MCMC chains, not just the point estimates. That will give us a vector of 30,000 values for each elevation and forest cover value; we’ll do these in a for loop, just keeping the values we need to plot.

# Standardize the forest cover values
( forS <- standardize2match(c(0, 35, 80), wt$forest) )
# [1] -1.262784855  0.007657883  1.641084260

# Get the data to plot (mean and credible interval)
toplot0 <- matrix(NA, 101, 3)
for(i in 1:101) {
  logit.psi.tmp <- with(out1$sims.list, 
      b0 + bFor * forS[1] + bElev * xxS[i] +
      bElev2 * xxS[i]^2)
  psi.tmp <- plogis(logit.psi.tmp)
  toplot0[i, 1] <- mean(psi.tmp)
  toplot0[i, 2:3] <- hdi(psi.tmp)
}

which(xx == 2200)
# [1] 79
toplot35 <- matrix(NA, 79, 3)
for(i in 1:79) {
  logit.psi.tmp <- with(out1$sims.list, 
      b0 + bFor * forS[2] + bElev * xxS[i] +
      bElev2 * xxS[i]^2)
  psi.tmp <- plogis(logit.psi.tmp)
  toplot35[i, 1] <- mean(psi.tmp)
  toplot35[i, 2:3] <- hdi(psi.tmp)
}

which(xx == 2000)
# [1] 71
toplot80 <- matrix(NA, 79, 3)
for(i in 1:71) {
  logit.psi.tmp <- with(out1$sims.list, 
      b0 + bFor * forS[3] + bElev * xxS[i] + 
      bElev2 * xxS[i]^2)
  psi.tmp <- plogis(logit.psi.tmp)
  toplot80[i, 1] <- mean(psi.tmp)
  toplot80[i, 2:3] <- hdi(psi.tmp)
}

# Plot semi-transparent polygons first, then add the lines
plot(xx, toplot0[, 1], ylim=0:1, type='n', las=1,
  xlab="Elevation", ylab="Occupancy")
polygon(x=c(xx, rev(xx)),
  y=c(toplot0[, 2], rev(toplot0[, 3])),
  col=adjustcolor('yellow', 0.5), border=NA)
polygon(x=c(xx[1:79], rev(xx[1:79])),
  y=c(toplot35[, 2], rev(toplot35[, 3])),
  col=adjustcolor('skyblue', 0.5), border=NA)
polygon(x=c(xx[1:71], rev(xx[1:71])),
  y=c(toplot80[, 2], rev(toplot80[, 3])),
  col=adjustcolor('lightgreen', 0.5), border=NA)
lines(xx, toplot0[, 1], lwd=2, col='brown')
lines(xx[1:79], toplot35[, 1], lwd=2, col='blue')
lines(xx[1:71], toplot80[, 1], lwd=2, col='darkgreen')
legend('topleft', c("80% forest", "35% forest",
    "0% forest"),
    lwd=2, col=c('darkgreen', 'blue', 'brown'), bty='n')

Distribution map of Swiss willow tits

Once we have fitted our model, we can predict the probability of occupancy for other 1×1 km quadrats, provided we have the values of the covariates for each of them. The unmarked package has a data set with elevation and forest cover for all of Switzerland.

library(unmarked)
data(Switzerland)
str(Switzerland)
# 'data.frame':   42275 obs. of  5 variables:
# $ x        : int  910942 910942 911942 911942 911942 911942 ...
# $ y        : int  54276 55276 54276 55276 56276 57276 ...
# $ elevation: int  340 340 380 390 357 357 500 472 462 472 ...
# $ forest   : int  6 11 4 72 9 5 0 43 6 0 ...
# $ water    : int  0 1 0 3 7 5 0 0 0 0 ...

# Convert elevation and forest cover to rasters and plot
library(raster)
elevR <- rasterFromXYZ(Switzerland[,1:3])
forR <- rasterFromXYZ(Switzerland[,c(1,2,4)])
par(mfrow=1:2)
plot(elevR, main="Elevation", axes=FALSE, box=FALSE,
    col=terrain.colors(255))
plot(forR, main="Forest cover", axes=FALSE, box=FALSE)
par(mfrow=c(1, 1))

Since we have the habitat variables in a data frame, we could use those to calculate $\psi$ and then convert that to a raster for plotting. But as Ugyen pointed out in the comments, you will usually be starting with a raster layer with those variables, so we’ll work with the rasters. Ordinary arithmetical operations work fine for rasters, but some functions – notably standardize2match and plogis – do not.

We standardise the elevation and forest data to match the way we standardised the original data for the willow tits, then we can calculate occupancy for every pixel of the map and plot it. We’ll use the means of the posterior distributions to do this.

# Scale the elevation and forest covariates as we did for the willow tits data
elevRS <- (elevR - mean(wt$elev)) / sd(wt$elev)
forRS <- (forR - mean(wt$forest)) / sd(wt$forest)

# Calculate occupancy for each pixel of the map
logit.psi <- with(out1$mean, b0 + bFor * forRS + 
    bElev * elevRS + bElev2 * elevRS^2)
psiR <- 1 / (1 + exp(-logit.psi))

# Now plot the new raster
mapPalette <- colorRampPalette(c("grey", "yellow",
  "orange", "red"))
plot(psiR, col=mapPalette(100),
    main="Willow tit occupancy",
    axes=FALSE, box=FALSE)

Once we have the raster for occupancy, we can easily calculate an overall mean occupancy figure for willow tits in Switzerland:

# Overall mean occupancy in Switzerland
mean(values(psiR), na.rm=TRUE)
# 0.2677942

What next?

This data set is used by Royle & Dorazio (2008) Hierarchical modeling and inference in ecology and Hooten & Hobbs (2015) A guide to Bayesian model selection for ecologists. Ecological Monographs, 85, 3-28. We plan to add pages for goodness-of-fit checks and model selection in the future. But priority goes to an analysis with survey covariates for detection probability.

Download a ZIP file with the code here (includes NIMBLE code).

14 thoughts on “Occupancy with site covariates”

  1. This looks great! Is there a possibility to demonstrate how to produce predictions maps from raster layers (importing the raster files) by using the parameter estimates (beta coefficients) from the model? Not an urgency but we can think of that later.

    1. That’s a good point Ugyen, I’ve rewritten the code to work with rasters instead of the data frame.

  2. Mike,
    This is really good.
    Would it be possible to elaborate on categorical co-variates? I know that many students struggle with these in Bayesian model code and to interpret output?
    Only thought – or direct to some relevant examples and code?
    Thanks!

    1. Hi Lourens, Yes, that’s on the to-do list, preferably with an example where we do have a categorical covariate. In the meantime, there’s a bit on the topic here.

  3. I can’t wait to learn in practice how to employ Reversible Jump MCMC for variable selection in JAGS! Thank you for this awsome tutorial!
    Any estimation when you may provide variable selection stuff?

    1. No plans for RJMCMC I’m afraid. That’s the most complicated way to do model selection and there’s currently no easy “recipe” for RJMCMC, you have to reinvent it to deal with your specific set of models. See the Hooten & Hobbs paper cited at the end of the post and go from there if you want to try it.

  4. Hi Mike,

    Your posts have been a valuable resource on my quest to learn JAGS. My question here concerns structuring occupancy matrices to deal with missing covariate values.

    I am modeling changes in occupancy over time as a function of a set of covariates that also change over time. Sites have been surveyed between 2 and 8 times over a 20-year period. When a site is surveyed, both occupancy and covariate data are collected. I have configured my dataset so that data from the first survey at each site, regardless of when that first survey occurred, is in the first column. The second column contains data from the second survey, and so on for 8 columns (representing maximum number of surveys completed at any site). Because the number of times each site has been surveyed varies, the data contains many NAs. For example, a site that was sampled twice, but most recently during the final year that surveys occurred, would have data in the first two columns and NAs for the rest. The real problem here is supplying the covariate data to support estimation of the NAs in the occupancy data. Have you ever worked with a situation like this? Any thoughts you have are much appreciated. Thank you!

    Bryant

    1. Hi Bryant,

      That’s the third question in the last week about dynamic occupancy models – maybe we should move that topic to the top of the to-do list.

      This is not really on topic, so I’ll contact you be email.

  5. Hi Mike,

    Thanks for this informative tutorial. I apology for the question but, as far as the Willow tit occupancy figure is concerned, is there the possibility to change the scale bar interval? I’m running a similar example but my scale bar starts from 0.4 up to 0.7 (thus following the occupancy values in the raster). However, I would like to change them starting from 0.0 up to 1.0 with 0.2 interval units. What is the syntax I have to put? Many thanks in advance.

    Kind regards,
    Marcello

    1. Hi Marcello,

      The scale bar has to show the same values as in the plot, so you need to fix the plot. Just add zlim=c(0,1) to the call to plot, ie,
      plot(psiR, col=mapPalette(100), main="Willow tit occupancy", axes=FALSE, box=FALSE, zlim=c(0,1))

      Don’t know how to fix the intervals shown when you have a continuous color scale, maybe something in axis.args would do it.

  6. Hi everyone, I am working on single species modelling with many sites covariates in unmarked. Can someone help me to plot the effect of the covariates? Best regards,

Leave a Reply to admin Cancel reply

Your email address will not be published.

The maximum upload file size: 1 MB. You can upload: image, document, text, archive. Drop file here