Logistic regression

Galton’s data were people’s heights, ie, a continuous measure. Often in wildlife data analysis we have binary data (present/absent, marked/unmarked, dead/alive, …) and need to calculate probabilities and see how they relate to covariates.

We can use regression, but with a twist. Probabilities have to be in the range 0 to 1, but the basic regression model $$\mu_i = \beta_0 + \beta_1 \times x_i$$ can produce values outside that range. We get around this by using a link function: the logit link converts probabilities between 0 and 1 to values in the range $-\infty$ to $+\infty$, and its inverse, the logistic function, converts back, hence the name “logistic regression”. The plot below shows the shape of the logistic function:

With this link, when $\mu = \infty$, $p = 1$; when $\mu = -\infty$, $p = 0$; and when $\mu = 0$, $p = 0.5$.

We usually use functions in R or JAGS to convert between $\mu$ and p, but here are the equations: $$\mu = \log\left(\frac p {1-p}\right)$$ $$p = \frac{e^\mu}{1 + e^\mu}$$
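
In R, for example, these conversions are built in: qlogis is the logit function and plogis is its inverse, the logistic function. A quick check of the values used later on this page:

qlogis(0.5)       # logit: 0
plogis(0)         # inverse logit: 0.5
plogis(c(-5, 5))  # 0.00669 and 0.99331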

The socks-in-box data

The example data set comes from an activity we do during the Boot Camp, where participants try to throw socks into a box from various distances and with either hand. (This is based on Morrell, C.H. & R.E. Auer. 2007. Trashball: A logistic regression classroom activity. Journal of Statistics Education 15(1), online at www.amstat.org/publications/jse/v15n1/morrell.html.)

For each throw we record the distance from the box, which hand was used (“good hand” = the hand you write with) and the person’s name and gender, and of course the result, whether the sock went into the box or not. We also ask them to roll a 10-sided die and record the score. The full data set is now huge, so we’ll use a subset with data for 105 people. You can download and explore the data with the following R code:

sox <- read.csv("http://bcss.org.my/data/soxInBox_105.csv", 
    comment="#", stringsAsFactors=TRUE)
head(sox)
#     Group  Name result gender distance hand die
# 1 1601RRI Laura      1 female        2 good   1
# 2 1601RRI Laura      1 female        2  bad   5
# 3 1601RRI Laura      0 female        3 good   1
# 4 1601RRI Laura      0 female        3  bad   5
# 5 1601RRI Laura      0 female        4 good   6
# 6 1601RRI Laura      0 female        4  bad   4
str(sox)
# 'data.frame':   840 obs. of  7 variables:
# $ Group   : Factor w/ 6 levels "1601RRI","1607NUS",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ Name    : Factor w/ 105 levels "Abid","Aiyat",..: 50 50 50 50 50 50 50 50 74 74 ...
# $ result  : int  1 1 0 0 0 0 0 0 1 1 ...
# $ gender  : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
# $ distance: int  2 2 3 3 4 4 5 5 2 2 ...
# $ hand    : Factor w/ 2 levels "bad","good": 2 1 2 1 2 1 2 1 2 1 ...
# $ die     : int  1 5 1 5 6 4 3 6 3 3 ...
summary(sox)
#        Group          Name         result          gender   
# 1601RRI   :176   Abid   :  8   Min.   :0.0000   female:368  
# 1607NUS   :152   Aiyat  :  8   1st Qu.:0.0000   male  :472  
# 1609Bhutan:168   Aji    :  8   Median :0.0000               
# 1610Jogja :152   Akmal  :  8   Mean   :0.4476               
# 1611SamaJ :120   Alfons :  8   3rd Qu.:1.0000               
# 1612RRI   : 72   Amirul :  8   Max.   :1.0000               
#                  (Other):792                                
#    distance      hand          die       
# Min.   :2.00   bad :420   Min.   :0.000  
# 1st Qu.:2.75   good:420   1st Qu.:2.000  
# Median :3.50              Median :4.000  
# Mean   :3.50              Mean   :4.414  
# 3rd Qu.:4.25              3rd Qu.:7.000  
# Max.   :5.00              Max.   :9.000
( N <- nrow(sox) )
# [1] 840

Note: From R version 4.0, the default for stringsAsFactors changed to FALSE. We do need strings to be converted to factors, so we now have to specify stringsAsFactors = TRUE.

The model

Likelihood

We’ll start with a simple model with only two covariates: distance and hand. This looks similar to the simple regression model except that we have to work with the logistic link: $$ {\rm logit}(p_i) = \beta_0 + \beta_{distance} \times distance_i + \beta_{hand} \times hand_i$$ $$y_i \sim {\rm Bernoulli} (p_i)$$ Here $y_i$ is the result, 1 if the sock went into the box, 0 otherwise. This is drawn from a Bernoulli distribution with probability of success $p_i$.

Priors

As with simple regression, we’ll centre the distance value by subtracting the mean (3.5 m) so that the intercept is sensible. Distance then goes from -1.5 to +1.5. We will convert the hand variable from a factor to a number (JAGS can’t handle factors), with 0 for bad hand and 1 for good hand.

With logistic regression, the sensible range for ${\rm logit}(p_i)$ is quite limited: -5 corresponds to a probability of 0.007 and +5 to 0.993. Since the covariates here are limited in range, the coefficients will also be limited. I like to use uniform priors when exploring models, for reasons given here, so we’ll use ${\rm Uniform}(-5, 5)$ priors for the coefficients and check the posteriors to see whether they are being constrained by the priors.

The intercept tells us the probability (on the logit scale) of getting the sock into the box when all the covariates are 0, ie, at 3.5 m from the box with the bad hand. We may not have any intuitions about $\beta_0$, so we put a uniform Beta prior on the underlying probability and then convert to the logit scale. $$p_0 \sim {\rm Beta}(1,1)$$ $$\beta_0 = {\rm logit}(p_0)$$
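
As an aside (a quick simulation of my own, not part of the original workflow), a ${\rm Beta}(1,1)$ prior on $p_0$ implies a standard logistic distribution for $\beta_0$, with almost all of its mass between -5 and +5:

# Simulate the prior implied for b0 by the Beta(1, 1) prior on p0
p0 <- rbeta(1e5, 1, 1)  # uniform draws on (0, 1)
b0 <- qlogis(p0)        # convert to the logit scale
hist(b0, breaks = 50)   # bell-shaped, centred on 0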


The JAGS code


# Save this in the file "sox_DH.jags"
model{
   # Likelihood
   for(i in 1:N) {
     logit(p[i]) <- b0 + bDist * distance[i] + 
         bHand * hand[i]
     y[i] ~ dbern(p[i])
   }
   # Priors
   p0 ~ dbeta(1, 1)
   b0 <- logit(p0)
   bDist ~ dunif(-5, 5)
   bHand ~ dunif(-5, 5)
}

Running the model

Here’s the R code to prepare the data and run the model:

# Centre the distance values
distanceC <- sox$distance - mean(sox$distance)
# Convert the 'hand' factor to numeric, 0/1
hand <- as.numeric(sox$hand) - 1

jagsData <- list(N = N, y = sox$result,
    distance = distanceC, hand = hand)
str(jagsData)
# List of 4
# $ N       : int 840
# $ y       : int [1:840] 1 1 0 0 0 0 0 0 1 1 ...
# $ distance: num [1:840] -1.5 -1.5 -0.5 -0.5 0.5 0.5 1.5 ...
# $ hand    : num [1:840] 1 0 1 0 1 0 1 0 1 0 ...

# List the parameters we want JAGS to produce:
wanted <- c("p0", "bDist", "bHand")

library(jagsUI)
( soxDH <- jags(jagsData, NULL, wanted, "sox_DH.jags",
  n.chains=3, n.iter=10000, parallel=TRUE, DIC=FALSE) )
# ...
#         mean    sd   2.5%    50%  97.5%  Rhat n.eff
# p0     0.387 0.026  0.336  0.387  0.439     1  5952
# bDist -0.940 0.077 -1.094 -0.939 -0.791     1 30000
# bHand  0.388 0.157  0.083  0.387  0.696     1  7200
# ...
library(mcmcOutput)
diagPlot(soxDH)

Here I’ve used the diagPlot function from the mcmcOutput package to do the diagnostic plots. We see that all three chains are mixing well and there’s no sign that the uniform priors are constraining the posterior.
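
As a quick sanity check of my own (not part of the original workflow), the same model can be fitted by maximum likelihood with R’s glm function, which uses the logit link by default for binomial data; the estimates should be close to the posterior means in the table above:

# Frequentist cross-check, using the objects created earlier
glmFit <- glm(sox$result ~ distanceC + hand, family = binomial)
coef(glmFit)  # intercept near logit(0.387), slopes near bDist and bHand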

Next we plot the posterior distributions for the parameters:

mco <- mcmcOutput(soxDH)
plot(mco)

The effect of increasing the distance is clearly negative, as we’d expect, and the effect of the good hand is positive. Changing hands from bad to good is equivalent to moving forward about 40 cm.
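
We can check that trade-off directly from the posterior draws, which jagsUI stores in the sims.list component of the output:

# Distance equivalent of switching from bad to good hand
tradeOff <- soxDH$sims.list$bHand / -soxDH$sims.list$bDist
mean(tradeOff)  # roughly 0.4 m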

The die score

What is the effect of the die score on success in throwing the sock? Modify the code to include the die score as well as distance and hand. You should centre the die scores, as we did for distance, by subtracting the mean; a sketch of the data preparation is given below.
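
As a starting point, the data preparation might look like this (a sketch only; the model file also needs a bDie coefficient with its own dunif(-5, 5) prior, following the pattern of bDist and bHand above):

# Centre the die scores, as we did for distance
dieC <- sox$die - mean(sox$die)
jagsData <- list(N = N, y = sox$result,
    distance = distanceC, hand = hand, die = dieC)
wanted <- c("p0", "bDist", "bHand", "bDie")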

Non-independence

The analysis above assumes that the 840 observations of sock-throwing are independent: one throw gives us no information about any other throw, at least from the same distance and with the same hand. But that isn’t true. We have data for only 105 people, some of whom are better at throwing socks than others. We can deal with this by including a parameter in the model to reflect the ability of the person throwing. That will be a hierarchical model, the topic of the next page.

Download a ZIP file with the code here.

9 thoughts on “Logistic regression”

  1. Good evening,

    Unfortunately, regarding the sentence “I like to use uniform priors when exploring models for reasons given here”: when I click on “here” I’m not taken to the appropriate page, but to a different one. Please, can anyone fix the problem (assuming it isn’t on my end; in that case, sorry about that)? Many thanks.

  2. Furthermore, I also checked the simple regression tutorial, but I’m not clear about the meaning of the following sentences:

    1. As with simple regression, we’ll centre the distance value by subtracting the mean (3.5 m) so that the intercept is sensible. Distance then goes from -1.5 to +1.5.

    2. With logistic regression, the sensible range for logit(pi) is quite limited: -5 corresponds to a probability of 0.007 and +5 to 0.993.

    Many thanks again.

    1. Sorry, regarding the following question:

      1. As with simple regression, we’ll centre the distance value by subtracting the mean (3.5 m) so that the intercept is sensible. Distance then goes from -1.5 to +1.5.

      I solved the issue.

    2. Re 2: The idea is that for sock-throwing, a probability of success < 1% is practically zero and > 99% is practically certain. If we got results that big or small, we’d want to take another look at the analysis.
      It depends on the context, though: lots of people buy lottery tickets when the probability of winning is ≪ 1%, but the actual probability is only of interest when lots of people are playing and the prizes are huge.

      1. Thanks very much for the explanation and, especially, for your patience.

  3. Good evening,

    I repeated the exercise, this time also including the “die” covariate. Please, can anyone tell me if my results match yours? Here is the attached Word file. Many thanks in advance!

    Logistic Regression
