# Logistic regression

Galton’s data were people’s heights, ie, continuous measure. Often in wildlife data analysis we have binary data (present/absent, marked/unmarked, dead/alive,…) and need to calculate probabilities and see how they relate to covariates.

We can use regression, but with a twist. Probabilities have to be in the range {0, 1}, but the basic regression model $$y_i = \beta_0 + \beta_1 \times x_i$$ can produce values outside that range. We get around this by using a link function, which converts numbers in the range $-\infty$ to $+\infty$ to numbers between 0 and 1. The usual link is the logistic or “logit” function, hence the name “logistic regression”. The plot below shows the shape of the logistic function:

With this link, when y = ∞, p = 1, when y = –∞, p = 0, and when y = 0, p = 0.5.

## The socks-in-box data

The example data set comes from an activity we do during the Boot Camp, where participants try to throw socks into a box from various distances and with either hand1This is based on Morrell, C H; R E Auer. 2007. Trashball: A logistic regression classroom activity. J Statistics Education 15:1. On-line at www.amstat.org/publications/jse/v15n1/morrell.html.

For each throw we record the distance from the box, which hand was used (“good hand” = the hand you write with) and the person’s name and gender, and of course the result, whether the sock went into the box or not. We also ask them to roll a 10-sided die and record the score. The full data set is now huge, so we’ll use a subset with data for 105 people. You can download and explore the data with the following R code:

sox <- read.csv("http://bcss.org.my/data/soxInBox_105.csv", comment="#")
#     Group  Name result gender distance hand die
# 1 1601RRI Laura      1 female        2 good   1
# 2 1601RRI Laura      1 female        2  bad   5
# 3 1601RRI Laura      0 female        3 good   1
# 4 1601RRI Laura      0 female        3  bad   5
# 5 1601RRI Laura      0 female        4 good   6
# 6 1601RRI Laura      0 female        4  bad   4
str(sox)
# 'data.frame':   840 obs. of  7 variables:
# $Group : Factor w/ 6 levels "1601RRI","1607NUS",..: 1 1 1 1 1 1 1 1 1 1 ... #$ Name    : Factor w/ 105 levels "Abid","Aiyat",..: 50 50 50 50 50 50 50 50 74 74 ...
# $result : int 1 1 0 0 0 0 0 0 1 1 ... #$ gender  : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
# $distance: int 2 2 3 3 4 4 5 5 2 2 ... #$ hand    : Factor w/ 2 levels "bad","good": 2 1 2 1 2 1 2 1 2 1 ...
# $die : int 1 5 1 5 6 4 3 6 3 3 ... summary(sox) # Group Name result gender # 1601RRI :176 Abid : 8 Min. :0.0000 female:368 # 1607NUS :152 Aiyat : 8 1st Qu.:0.0000 male :472 # 1609Bhutan:168 Aji : 8 Median :0.0000 # 1610Jogja :152 Akmal : 8 Mean :0.4476 # 1611SamaJ :120 Alfons : 8 3rd Qu.:1.0000 # 1612RRI : 72 Amirul : 8 Max. :1.0000 # (Other):792 # distance hand die # Min. :2.00 bad :420 Min. :0.000 # 1st Qu.:2.75 good:420 1st Qu.:2.000 # Median :3.50 Median :4.000 # Mean :3.50 Mean :4.414 # 3rd Qu.:4.25 3rd Qu.:7.000 # Max. :5.00 Max. :9.000 ( N <- nrow(sox) ) # [1] 840 ### The model We’ll start with a simple model with only two covariates: distance and hand. This looks similar to the simple regression model except that we have to work with the logistic link: $${\rm logit}(p_i) = \beta_0 + \beta_{distance} \times distance_i + \beta_{hand} \times hand_i$$ $$y_i \sim {\rm Bernoulli} (p_i)$$ Here$y_i$is the result, 1 if the sock went into the box, 0 otherwise. This is drawn from a Bernoulli distribution with probability of success$p_i$. ### The priors As with simple regression, we’ll centre the distance value by subtracting the mean (3.5 m) so that the intercept is sensible. Distance then goes from -1.5 to +1.5. We will convert the hand variable from a factor to a number (JAGS can’t handle factors), with 0 for bad hand and 1 for good hand. With logistic regression, the sensible range for${\rm logit}(p_i)$is quite limited: -5 corresponds to a probability of 0.007 and +5 to 0.993. Since the covariates here are limited in range, the coefficients will also be limited. I like to use uniform priors when exploring models for reasons given here. We’ll use${\rm Uniform}(-5, 5)$. The intercept tells us the probability (on the logit scale) of getting the sock into the box when all the covariates are 0, ie, at 3.5m from the box with bad hand. We may not have an intuitions about$\beta_0$, so we put a uniform Beta prior on the underlying probability and then convert to the logit scale. $$p_0 \sim {\rm Beta}(1,1)$$ $$\beta_0 = {\rm logit}(p_0)$$ ### The JAGS code  # Save this in the file "sox_DH.jags" model{ # Likelihood for(i in 1:N) { logit(p[i]) <- b0 + bDist * distance[i] + bHand * hand[i] y[i] ~ dbern(p[i]) } # Priors p0 ~ dbeta(1, 1) b0 <- logit(p0) bDist ~ dunif(-5, 5) bHand ~ dunif(-5, 5) } ### Running the model Here’s the R code to prepare the data and run the model: # Centre the distance values distanceC <- sox$distance - mean(sox$distance) # Convert the 'hand' factor to numeric, 0/1 hand <- as.numeric(sox$hand) - 1

jagsData <- list(N = N, y = sox$result, distance = distanceC, hand = hand) str(jagsData) List of 4 #$ N       : int 840
# $y : int [1:840] 1 1 0 0 0 0 0 0 1 1 ... #$ distance: num [1:840] -1.5 -1.5 -0.5 -0.5 0.5 0.5 1.5 ...
# \$ hand    : num [1:840] 1 0 1 0 1 0 1 0 1 0 ...

# List the parameters we want JAGS to produce:
wanted <- c("p0", "bDist", "bHand")

library(jagsUI)
( soxDH <- jags(jagsData, NULL, wanted, "sox_DH.jags",
n.chains=3, n.iter=10000, parallel=TRUE, DIC=FALSE) )
# ...
#         mean    sd   2.5%    50%  97.5%  Rhat n.eff
# p0     0.387 0.026  0.336  0.387  0.439     1  5952
# bDist -0.940 0.077 -1.094 -0.939 -0.791     1 30000
# bHand  0.388 0.157  0.083  0.387  0.696     1  7200
# ...
wiqid::diagPlot(soxDH)

Here I’ve used the diagPlot function from the wiqid package to do the diagnostic plots. We see that all three chains are mixing well and there’s no sign that the uniform priors are constraining the posterior.

The effect of increasing the distance is negative, as we’d expect, and the effect of the good hand is positive. Changing hands from bad to good is equivalent to moving forward about 40cm.

### The die score

What is the effect of the die on success in throwing the sock? Modify the code to include the die score as well as distance and hand. You should centre the die scores as we did distance by subtracting the mean.

### Non-independence

The analysis above assumes that 840 observations of sock-throwing are independent; one throw gives us no information about any other throw, at least from the same distance and with the same hand. But that isn’t true. We have data for only 105 people, some of whom are better at throwing socks than others. We can deal with this by including a parameter in the model to reflect the ability of the person throwing. That will be a hierarchical model, the topic of the next page.