Logistic regression

Galton’s data were people’s heights, ie, continuous measure. Often in wildlife data analysis we have binary data (present/absent, marked/unmarked, dead/alive,…) and need to calculate probabilities and see how they relate to covariates.

We can use regression, but with a twist. Probabilities have to be in the range {0, 1}, but the basic regression model $$y_i = \beta_0 + \beta_1 \times x_i$$ can produce values outside that range. We get around this by using a link function, which converts numbers in the range $-\infty$ to $+\infty$ to numbers between 0 and 1. The usual link is the logistic or “logit” function, hence the name “logistic regression”. The plot below shows the shape of the logistic function:

With this link, when y = ∞, p = 1, when y = –∞, p = 0, and when y = 0, p = 0.5.

The socks-in-box data

The example data set comes from an activity we do during the Boot Camp, where participants try to throw socks into a box from various distances and with either hand1This is based on Morrell, C H; R E Auer. 2007. Trashball: A logistic regression classroom activity. J Statistics Education 15:1. On-line at www.amstat.org/publications/jse/v15n1/morrell.html.

For each throw we record the distance from the box, which hand was used (“good hand” = the hand you write with) and the person’s name and gender, and of course the result, whether the sock went into the box or not. We also ask them to roll a 10-sided die and record the score. The full data set is now huge, so we’ll use a subset with data for 105 people. You can download and explore the data with the following R code:

sox <- read.csv("http://bcss.org.my/data/soxInBox_105.csv", comment="#")
head(sox)
#     Group  Name result gender distance hand die
# 1 1601RRI Laura      1 female        2 good   1
# 2 1601RRI Laura      1 female        2  bad   5
# 3 1601RRI Laura      0 female        3 good   1
# 4 1601RRI Laura      0 female        3  bad   5
# 5 1601RRI Laura      0 female        4 good   6
# 6 1601RRI Laura      0 female        4  bad   4
str(sox)
# 'data.frame':   840 obs. of  7 variables:
# $ Group   : Factor w/ 6 levels "1601RRI","1607NUS",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ Name    : Factor w/ 105 levels "Abid","Aiyat",..: 50 50 50 50 50 50 50 50 74 74 ...
# $ result  : int  1 1 0 0 0 0 0 0 1 1 ...
# $ gender  : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
# $ distance: int  2 2 3 3 4 4 5 5 2 2 ...
# $ hand    : Factor w/ 2 levels "bad","good": 2 1 2 1 2 1 2 1 2 1 ...
# $ die     : int  1 5 1 5 6 4 3 6 3 3 ...
summary(sox)
#        Group          Name         result          gender   
# 1601RRI   :176   Abid   :  8   Min.   :0.0000   female:368  
# 1607NUS   :152   Aiyat  :  8   1st Qu.:0.0000   male  :472  
# 1609Bhutan:168   Aji    :  8   Median :0.0000               
# 1610Jogja :152   Akmal  :  8   Mean   :0.4476               
# 1611SamaJ :120   Alfons :  8   3rd Qu.:1.0000               
# 1612RRI   : 72   Amirul :  8   Max.   :1.0000               
#                  (Other):792                                
#    distance      hand          die       
# Min.   :2.00   bad :420   Min.   :0.000  
# 1st Qu.:2.75   good:420   1st Qu.:2.000  
# Median :3.50              Median :4.000  
# Mean   :3.50              Mean   :4.414  
# 3rd Qu.:4.25              3rd Qu.:7.000  
# Max.   :5.00              Max.   :9.000
( N <- nrow(sox) )
# [1] 840

The model

We’ll start with a simple model with only two covariates: distance and hand. This looks similar to the simple regression model except that we have to work with the logistic link: $$ {\rm logit}(p_i) = \beta_0 + \beta_{distance} \times distance_i + \beta_{hand} \times hand_i$$ $$y_i \sim {\rm Bernoulli} (p_i)$$ Here $y_i$ is the result, 1 if the sock went into the box, 0 otherwise. This is drawn from a Bernoulli distribution with probability of success $p_i$.

The priors

As with simple regression, we’ll centre the distance value by subtracting the mean (3.5 m) so that the intercept is sensible. Distance then goes from -1.5 to +1.5. We will convert the hand variable from a factor to a number (JAGS can’t handle factors), with 0 for bad hand and 1 for good hand.

With logistic regression, the sensible range for ${\rm logit}(p_i)$ is quite limited: -5 corresponds to a probability of 0.007 and +5 to 0.993. Since the covariates here are limited in range, the coefficients will also be limited. I like to use uniform priors when exploring models for reasons given here. We’ll use ${\rm Uniform}(-5, 5)$.

The intercept tells us the probability (on the logit scale) of getting the sock into the box when all the covariates are 0, ie, at 3.5m from the box with bad hand. We may not have an intuitions about $\beta_0$, so we put a uniform Beta prior on the underlying probability and then convert to the logit scale. $$p_0 \sim {\rm Beta}(1,1)$$ $$\beta_0 = {\rm logit}(p_0)$$

The JAGS code


# Save this in the file "sox_DH.jags"
model{
   # Likelihood
   for(i in 1:N) {
     logit(p[i]) <- b0 + bDist * distance[i] + bHand * hand[i]
     y[i] ~ dbern(p[i])
   }
   # Priors
   p0 ~ dbeta(1, 1)
   b0 <- logit(p0)
   bDist ~ dunif(-5, 5)
   bHand ~ dunif(-5, 5)
 }

Running the model

Here’s the R code to prepare the data and run the model:

# Centre the distance values
distanceC <- sox$distance - mean(sox$distance)
# Convert the 'hand' factor to numeric, 0/1
hand <- as.numeric(sox$hand) - 1

jagsData <- list(N = N, y = sox$result, distance = distanceC, hand = hand)
str(jagsData)
List of 4
# $ N       : int 840
# $ y       : int [1:840] 1 1 0 0 0 0 0 0 1 1 ...
# $ distance: num [1:840] -1.5 -1.5 -0.5 -0.5 0.5 0.5 1.5 ...
# $ hand    : num [1:840] 1 0 1 0 1 0 1 0 1 0 ...

# List the parameters we want JAGS to produce:
wanted <- c("p0", "bDist", "bHand")

library(jagsUI)
( soxDH <- jags(jagsData, NULL, wanted, "sox_DH.jags",
  n.chains=3, n.iter=10000, parallel=TRUE, DIC=FALSE) )
# ...
#         mean    sd   2.5%    50%  97.5%  Rhat n.eff
# p0     0.387 0.026  0.336  0.387  0.439     1  5952
# bDist -0.940 0.077 -1.094 -0.939 -0.791     1 30000
# bHand  0.388 0.157  0.083  0.387  0.696     1  7200
# ...
wiqid::diagPlot(soxDH)

Here I’ve used the diagPlot function from the wiqid package to do the diagnostic plots. We see that all three chains are mixing well and there’s no sign that the uniform priors are constraining the posterior.

The effect of increasing the distance is negative, as we’d expect, and the effect of the good hand is positive. Changing hands from bad to good is equivalent to moving forward about 40cm.

The die score

What is the effect of the die on success in throwing the sock? Modify the code to include the die score as well as distance and hand. You should centre the die scores as we did distance by subtracting the mean.

Non-independence

The analysis above assumes that 840 observations of sock-throwing are independent; one throw gives us no information about any other throw, at least from the same distance and with the same hand. But that isn’t true. We have data for only 105 people, some of whom are better at throwing socks than others. We can deal with this by including a parameter in the model to reflect the ability of the person throwing. That will be a hierarchical model, the topic of the next page.

Leave a Reply

Your email address will not be published. Required fields are marked *