Multi-species occupancy with covariates

In the previous unit we worked with a data set with capture histories but no information on habitat. Here we want to show how habitat covariates can be included in the model.

Birds on islands

Greg Irving and colleagues studied resident forest bird species on islands in the Chiew Larn hydropower reservoir in southern Thailand and the near-by mainland. Details are here. Thanks to Greg Irving for providing the data files.

They collected data for a number of habitat variables, but a preliminary analysis showed that the area of the island and distance from the mainland were enough to get good predictions of occupancy; adding further covariates to the model did not improve performance.

You can download and look at the data with the following code:

CL <- read.csv("http://bcss.org.my/data/ChiewLarn_birds.csv",
    comment="#")
str(CL)
# 'data.frame':   24 obs. of  83 variables:
# $ siteID: chr  "IS05" "IS06" "IS07" "IS39" ...
# $ area  : num  9.763 49.785 1.499 0.835 0.827 ...
# $ MLdist: num  725 250 150 425 67 575 325 320 1000 25 ...
# $ PinTailed_Parrotfinch : int  3 3 1 1 3 3 3 3 3 3 ...
# $ BlackNaped_Monarch    : int  3 3 0 0 1 3 2 3 1 2 ...
# [...]
# $ RedBearded_BeeEater   : int  0 0 0 0 0 0 0 0 0 0 ...

Ymat <- as.matrix(CL[, -(1:3)])
rownames(Ymat) <- CL$siteID
table(Ymat)
#    0    1    2    3 
# 1370  249  137  164

colSums(Ymat)
# PinTailed_Parrotfinch   BlackNaped_Monarch 
#                    68                   52 
#     Everetts_Whiteeye    Wreathed_Hornbill 
#                    52                   49
# [...]
# Oriental_Magpie_Robin  RedBearded_BeeEater 
#                     1                    1

nOcc <- 3  # Each site visited 3 times
( nSobs <- ncol(Ymat) )   # Number of species observed
# 80
( nSites <- nrow(Ymat) )  # Number of sites
# 24

# Prepare predictors
logArea <- log(CL$area)
exp(mean(logArea))  # 10.13575 ha
logAreaS <- wiqid::standardize(logArea)
mean(CL$MLdist)     # 285.2918 m
MLdistS <- wiqid::standardize(CL$MLdist)

We have data for 24 sites, 23 islands and the mainland, and 80 species. As with the previous example, the number of detections per species ranges widely, with a maximum of 68, and 6 species with just 1 detection each.

The predictors we’ll use are log(area) and distance from mainland. We standardise both to have mean = 0 and SD = 1.

The probability of occupancy differs among sites, depending on the covariates, just as for the willow tits example, but now replicated for each species $k$: $${\rm logit}(\psi_{ik}) = \beta_{0,k} + \beta_{area,k} logArea_i + \beta_{dist,k} MLdist_i$$ where the coefficients, $\beta$, differ among species.

Species level priors

We model the intercept and each of the coefficients as random variables drawn from a normal distribution with mean and SD to be estimated:$$\beta_{x,k} \sim {\rm Normal}(\mu_{x}, \sigma_{x})$$

In the previous unit we included correlation between $\psi$ and $p$; similarly here we include correlation between $\beta_0$ and $p$.

Community level hyperpriors

The intercept, $\beta_{0,k}$, is the probability on the logit scale of occupancy of species $k$ at a site with area = 10.14 ha, 285 m from the mainland, and $\mu_0$ is the mean of the intercepts for the community. We thus use a uniform Beta(1,1) prior for the probability and convert to the logit scale:$$\overline{\beta_0} \sim {\rm Beta}(1,1)$$ $$\mu_0 = {\rm logit}(\overline{\beta_0})$$and we use a uniform prior for the SD as usual: $$\sigma_0 \sim {\rm Uniform}(0, 5)$$

We are working on the logit scale with standardised predictors, so a value for a coefficient outside $\pm 5$ would indicate a problem. We use Uniform(-5, 5) priors for $\mu$ and check the output to see if the posterior is constrained by this:$$\mu_{area} \sim {\rm Uniform}(-5,5)$$ and we use a uniform prior for the SD also$$\sigma_{area} \sim {\rm Uniform}(0,5)$$We use similar hyperpriors for distance.

Here’s the JAGS code for the model:

# MSOM model with covariates
# File name "MSOM_cov.jags"

model{
  for(k in 1:nSobs){  # Loop through species
    # Likelihood
    for(i in 1:nSites) {
       # Ecological model
      logit(psi[i, k]) <- b0[k] + bArea[k] * logArea[i] +
          bDist[k] * MLdist[i]
       z[i, k] ~ dbern(psi[i, k])
       # Observation model
       y[i, k] ~ dbin(p[k] * z[i, k], nOcc)
    }

    # Priors (species level)
    b0[k] ~ dnorm(mu.b0, tau.b0)
    mu.eta[k] <- mu.lp + rho * sd.lp/sd.b0 *
        (b0[k] - mu.b0)
    lp[k] ~ dnorm(mu.eta[k], tau.eta)
    p[k] <- ilogit(lp[k])

    bArea[k] ~ dnorm(mu.area, tau.area)
    bDist[k] ~ dnorm(mu.dist, tau.dist)

  }

  # Hyperpriors (community level)
  b0.mean ~ dbeta(1, 1)
  mu.b0 <- logit(b0.mean)
  sd.b0 ~ dunif(0, 5)
  tau.b0 <- 1/sd.b0^2

  mu.betalpsi1 ~ dnorm(0,0.1)
  sd.area ~ dunif(0, 5)
  tau.area <- 1/sd.area^2
  mu.dist ~ dnorm(0,0.1)
  sd.dist ~ dunif(0,5)
  tau.dist <- 1/sd.dist^2

  p.mean ~ dbeta(1, 1)
  mu.lp <- logit(p.mean)
  sd.lp ~ dunif(0, 5)
  tau.lp <- 1/sd.lp^2

  rho ~ dunif(-1, 1)
  tau.eta <- tau.lp/(1 - rho^2)
}

Running the analysis

Here’s the code to prepare the data, run the analysis and check the output. Mixing tends to be poor for the SDs of the coefficients, so to get good MCEs we increase the number of iterations. We are monitoring a lot of nodes, so we do some brutal thinning to keep the size of the output object reasonable. We also tell jags not to calculate statistics for psi and z, which have 1920 nodes each, just return the MCMC chains.

# Pack up everything for JAGS
z <- (Ymat > 0)*1
z[z == 0] <- NA
jagsData <- list(y = Ymat, nSobs = nSobs, nSites = nSites,
    nOcc = nOcc, z = z, logArea = logAreaS, MLdist = MLdistS)
str(jagsData)
# List of 5
# $ y     : int [1:24, 1:80] 3 3 1 1 3 3 3 3 3 3 ...
# $ nSobs : int 80
# $ nSites: int 24
# $ nOcc  : num 3
# $ z     : num [1:24, 1:80] 1 1 1 1 1 1 1 1 1 1 ...
# $ logArea: num [1:24] -0.0205 0.8702 -1.045 -1.3652 -1.3702 ...
# $ MLdist : num [1:24] 1.543 -0.124 -0.475 0.49 -0.766 ...

wanted <- c("b0.mean", "mu.b0", "sd.b0", 
    "mu.area", "sd.area", "mu.dist", "sd.dist",
    "p.mean", "sd.lp", "mu.lp",
    "b0", "bArea", "bDist", "p", "psi", "z")

# takes 10 mins
library(jagsUI)
out <- jags(jagsData, NULL, wanted, "MSOM_cov.jags",
    n.chains=3, n.adapt=1000, n.iter=60000, n.thin=30,
    codaOnly=c("psi", "z"), parallel=TRUE, DIC=FALSE)
mco <- mcmcOutput(out)
mcoKey <- mco[1:10]
diagPlot(mcoKey)
summary(mcoKey)
#           mean    sd median    l95    u95  Rhat MCEpc
# b0.mean  0.233 0.055  0.229  0.133  0.342 1.000 1.506
# mu.b0   -1.216 0.309 -1.217 -1.829 -0.624 1.000 1.487
# sd.b0    2.521 0.312  2.497  1.906  3.098 0.999 1.881
# mu.area  2.587 0.271  2.570  2.085  3.132 1.000 2.393
# sd.area  0.865 0.332  0.874  0.145  1.462 0.999 3.582
# mu.dist -0.642 0.132 -0.636 -0.900 -0.392 0.994 1.782
# sd.dist  0.494 0.207  0.499  0.028  0.840 1.008 3.001
# p.mean   0.470 0.030  0.471  0.411  0.526 1.001 1.383
# sd.lp    0.766 0.128  0.751  0.528  1.022 1.002 1.883
# mu.lp   -0.120 0.119 -0.117 -0.360  0.104 1.001 1.383

The Rhats and MCEs are good, and the diagnostic plots are ok. Posterior distributions are not constrained by our uniform priors.

Plotting $\psi$ and $p$

Intercepts

For the basic model we plotted the posterior distributions of $\psi$ for each species. With this model, that differs among sites, so we’ll plot the intercept, $\psi_0$, the probability of occupancy for a site with area = 10.14 ha, 285 m from the mainland.

# Convert intercepts to the probability scale
psi0 <- plogis(out$sims.list$b0)
psi0.mean <- colMeans(psi0)
psi0.CRI <- apply(psi0, 2, quantile, probs=c(0.025, 0.975))

op <- par(mfrow=1:2)
plot(psi0.mean, 1:nSobs, xlim=0:1,
    pch=16, xlab="psi0", ylab="Species")
segments(psi0.CRI[1,], 1:nSobs, psi0.CRI[2,], 1:nSobs)
abline(v=plogis(out$mean$mu.b0), col='red')
abline(v=plogis(out$mean$mu.b0 + c(2, -2)*out$mean$sd.b0),
    col='red', lty=3)

plot(out$mean$p, 1:nSobs, xlim=0:1,
    pch=16, xlab="p", ylab="Species")
segments(out$q2.5$p, 1:nSobs,out$q97.5$p, 1:nSobs)
abline(v=plogis(out$mean$mu.lp), col='red')
abline(v=plogis(out$mean$mu.lp + c(2, -2)*out$mean$sd.lp),
    col='red', lty=3)
par(op)

The species with most detections are at the bottom of the plot and have high $\psi_0$ and high $p$, as we’d expect. The top 6 species have only 1 detection each: the estimates of $p$ for these are relatively high, around 0.4, and the small number of detections seems to be due to very low occupancy rates.

Effect of covariates

We use the posterior means for the coefficients for each species to predict its occupancy probability, $\psi$, for a range of values of area and distance from the mainland. Then we plot each of these.

# Values for the x axis for Area
range(logArea)
logxx <- seq(-0.28, 8.62,, 101)
xxA <- exp(logxx)
# Standardise logxx with the mean and SD of logArea
logxxS <- wiqid::standardize2match(logxx, logArea)

# Get the predictions for each value of xxA
predA <- matrix(NA, 101, nSobs)
for(i in 1:101)
  predA[i,] <- plogis(out$mean$b0 +
      out$mean$bArea * logxxS[i])
matplot(xxA, predA, type='l', log='x', lty=1,
    ylab="Occupancy", xlab="Area (ha)")

# Do the same for distance to mainland
range(CL$MLdist)
xxD <- seq(0, 1000,, 101)
xxDS <- wiqid::standardize2match(xxD, CL$MLdist)

predD <- matrix(NA, 101, nSobs)
for(i in 1:101)
  predD[i,] <- plogis(out$mean$b0 +
      out$mean$bDist * xxDS[i])
matplot(xxD, predD, type='l', lty=1,
    ylab="Occupancy", xlab="Distance from mainland (m)")

For all species the probability of occupancy increases with the area of the island, and for all but one (Cream-vented Bulbul) it decreases with distance from the mainland.

Working with the $z$ matrix

The $z$ matrix – a sites x species matrix with 1 for presence and 0 for absence – is the key to a lot of community analysis. The JAGS output above includes z. We extract and check it with this code:

z <- out$sims.list$z
str(z)
# num [1:6000, 1:24, 1:80] 1 1 1 1 1 1 1 1 1 1 ...
range(z)
# [1] 0 1

How many species at each site?

The first question we might ask is “How many of these 80 species occur at each site?” The answer is going to be higher than the number observed since detection is imperfect. For each iteration of the MCMC chain we sum across rows in the $z$ matrix, so get an MCMC chain for richness at each site, and we can then plot the posterior distribution for each. We’ll add the number of species observed at each site.

siteN <- apply(z, 1:2, sum)
siteNobs <- rowSums(Ymat > 0)
windows(5,5)
op <- par(mfrow=c(3,3), mar=c(5,1,4,1), ask=TRUE)
for(i in 1:nSites) {
  plot(table(siteN[, i]), frame=FALSE, yaxt='n', ylab="",
  xlab="Number of species", main=CL$siteID[i])
  abline(v=siteNobs[i]-0.1, col='grey', lwd=2, lend='butt')
}
par(op)

The code above plots all 24 sites, but we’ll only show a selection here:

It may be more interesting to plot species richness against the site covariates.

op <- par(mfrow=1:2)
# Richness vs distance from mainland
plot(CL$MLdist, colMeans(siteN), pch=16, las=1,
    ylab="Species richness", xlab="Distance from mainland (m)")
cri <- apply(siteN, 2, quantile, probs=c(0.025, 0.975))
segments(CL$MLdist, cri[1,], CL$MLdist, cri[2,])

# Similarity vs area
plot(CL$area, colMeans(siteN), pch=16, las=1, log="x",
    ylab="Species Richness", xlab="Area (ha)")
cri <- apply(siteN, 2, quantile, probs=c(0.025, 0.975))
segments(CL$area, cri[1,], CL$area, cri[2,])
par(op)

How many sites per species?

We could ask the question the other way around: “At how many sites does each species occur?” We sum the columns in the $z$ matrix to get this:

spsN <- apply(z, c(1,3), sum)
spsNobs <- colSums(Ymat > 0)

The plotting code is similar to that for site-wise species richness. Again we show a selection only.

The first species, Pin-tailed Parrotfinch, had the highest total detections and was detected at all 24 sites. The last 2 species, Oriental Magpie Robin and Red-bearded Bee-eater were each detected once, but $\psi_0$ for the Magpie Robin has a wide CrI, so it could (just!) plausibly be present everywhere, while for the Bee-eater it’s more precisely estimated.

Which islands are most similar to the mainland?

The $z$ matrix is the starting point for estimating site similarity using the Jaccard or Sørensen indices for example. Any pair of sites can be compared, but the obvious question here is to compare each of the islands to the mainland. We’ll use the Jaccard index.

The code below calculates the indices and plots the results against the covariates.

# Function to calculate Jaccard index
Jaccard <- function(z1, z2)
  sum(z1 & z2) / sum(z1 | z2)

ref <- 24  # reference site (mainland)
niter <- dim(z)[1]
jaccSites <- matrix(NA, niter, nSites)
for(i in 1:niter) {
  tmp <- z[i,,]
  jaccSites[i, ] <- apply(tmp, 1, Jaccard, z2=tmp[ref,])
}

op <- par(mfrow=1:2)
# Similarity vs distance from mainland
plot(CL$MLdist, colMeans(jaccSites), pch=16, ylim=0:1, las=1,
    ylab="Jaccard index", xlab="Distance from mainland (m)")
cri <- apply(jaccSites, 2, HDInterval::hdi)
segments(CL$MLdist, cri[1,], CL$MLdist, cri[2,])

# Similarity vs area
plot(CL$area, colMeans(jaccSites), pch=16, ylim=0:1, las=1, log="x",
    ylab="Jaccard index", xlab="Area (ha)")
cri <- apply(jaccSites, 2, HDInterval::hdi)
segments(CL$area, cri[1,], CL$area, cri[2,])
title(main="Similarity of islands to mainland",
    outer=TRUE, line=-3)
par(op)

Clearly, proximity to the mainland does not lead to similarity, but large islands are more similar to the mainland in their species composition than small islands.

Which species have similar distributions?

And of course we can ask a similar question about similarity of species: “Which species are most similar in the sites they occupy?” This time we’ll do a full species x species matrix and plot the results as a dendrogram.

Base R has a dist function, but it only has a limited number of distance/similarity options; the wiqid::distShell function allows us to plug in our Jaccard function, but it expects a species x sites matrix, so we need to transpose our matrix. We’ll use the posterior means of the Jaccard indices for the next step.

library(wiqid)
simMCMC <- array(NA_real_, c(niter, nSobs, nSobs))
dimnames(simMCMC) <- list(NULL, names(spsNobs), names(spsNobs))
for(i in 1:niter)
  simMCMC[i,,] <- as.matrix(distShell(t(z[i,,]), Jaccard))
  # Takes 2.5 mins to do 6000
meanSim <- apply(simMCMC, 2:3, mean)
str(meanSim)
# num [1:80, 1:80] 0 0.877 1 0.894 0.888 ...
# - attr(*, "dimnames")=List of 2
#  ..$ : chr [1:80] "PinTailed_Parrotfinch" ...
#  ..$ : chr [1:80] "PinTailed_Parrotfinch" ...

For the dendrogram we need the distances (or dissimilarities) between species, and they need to be in an object of class dist. With 80 species, the dendrogram is too big to display on laptop screens, so we provide code to create a PDF file.

JaccDistSps <- as.dist(1 - meanSim)
par("mar" = c(4,2,1, 5), cex = 0.5)
plot(as.dendrogram(hclust(JaccDistSps), hang=-1), horiz=TRUE)

pdf("Jaccard species plot.pdf", 7, 10, paper="a4")
par("mar" = c(4,2,1,15), cex=0.5)
plot(as.dendrogram(hclust(JaccDistSps), hang=-1), horiz=TRUE)
dev.off()

What’s next?

There are more than 80 species of forest birds in the Chiew Larn study area; some were spotted there outside the formal surveys; we know they are there but have no data to estimate occupancy or probability of detection.

Hierarchical modelling allows us to get better estimates for species with only 1 detection. Can it help to estimate parameters for species with 0 detections?

In the next unit we’ll try this out and see if we can estimate the number of species in the community which were not detected, and hence the total species richness.

Download a ZIP file with data and code here.