Logistic regression

Galton’s data were people’s heights, i.e., a continuous measure. Often in wildlife data analysis we have binary data (present/absent, marked/unmarked, dead/alive, …) and need to estimate probabilities and see how they relate to covariates.

We can use regression, but with a twist. Probabilities have to be in the range [0, 1], but the basic regression model $$\mu_i = \beta_0 + \beta_1 \times x_i$$ can produce values anywhere from $-\infty$ to $+\infty$. We get around this by using a link function, which connects the unbounded regression scale to probabilities between 0 and 1. The usual link is the logistic or “logit” function, hence the name “logistic regression”. The plot below shows the shape of the logistic function:

With this link, when $\mu = \infty$, $p = 1$; when $\mu = -\infty$, $p = 0$; and when $\mu = 0$, $p = 0.5$.

We usually use functions in R or JAGS to convert between $\mu$ and p, but here are the equations: $$\mu = \log\left(\frac p {1-p}\right)$$ $$p = \frac{e^\mu}{1 + e^\mu}$$
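For example, base R provides these conversions as `qlogis()` (the logit) and `plogis()` (its inverse); a minimal check of the equations above, not part of the original tutorial:

```r
# Convert probabilities to the logit scale and back
p <- c(0.007, 0.5, 0.993)
mu <- qlogis(p)     # logit: log(p / (1 - p))
round(mu, 2)        # approximately -4.95  0.00  4.95
plogis(mu)          # inverse logit: exp(mu) / (1 + exp(mu)), recovers p
```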

The socks-in-box data

The example data set comes from an activity we do during the Boot Camp, where participants try to throw socks into a box from various distances and with either hand.[1]

[1] This is based on Morrell, C. H. & R. E. Auer. 2007. Trashball: A logistic regression classroom activity. Journal of Statistics Education 15(1). On-line at www.amstat.org/publications/jse/v15n1/morrell.html.

For each throw we record the distance from the box, which hand was used (“good hand” = the hand you write with) and the person’s name and gender, and of course the result, whether the sock went into the box or not. We also ask them to roll a 10-sided die and record the score. The full data set is now huge, so we’ll use a subset with data for 105 people. You can download and explore the data with the following R code:

sox <- read.csv("http://bcss.org.my/data/soxInBox_105.csv",
comment="#", stringsAsFactors=TRUE)
head(sox)
#     Group  Name result gender distance hand die
# 1 1601RRI Laura      1 female        2 good   1
# 2 1601RRI Laura      1 female        2  bad   5
# 3 1601RRI Laura      0 female        3 good   1
# 4 1601RRI Laura      0 female        3  bad   5
# 5 1601RRI Laura      0 female        4 good   6
# 6 1601RRI Laura      0 female        4  bad   4
str(sox)
# 'data.frame':   840 obs. of  7 variables:
# $ Group   : Factor w/ 6 levels "1601RRI","1607NUS",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ Name    : Factor w/ 105 levels "Abid","Aiyat",..: 50 50 50 50 50 50 50 50 74 74 ...
# $ result  : int 1 1 0 0 0 0 0 0 1 1 ...
# $ gender  : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
# $ distance: int 2 2 3 3 4 4 5 5 2 2 ...
# $ hand    : Factor w/ 2 levels "bad","good": 2 1 2 1 2 1 2 1 2 1 ...
# $ die     : int 1 5 1 5 6 4 3 6 3 3 ...

summary(sox)
#        Group         Name         result          gender
#  1601RRI   :176   Abid   :  8   Min.   :0.0000   female:368
#  1607NUS   :152   Aiyat  :  8   1st Qu.:0.0000   male  :472
#  1609Bhutan:168   Aji    :  8   Median :0.0000
#  1610Jogja :152   Akmal  :  8   Mean   :0.4476
#  1611SamaJ :120   Alfons :  8   3rd Qu.:1.0000
#  1612RRI   : 72   Amirul :  8   Max.   :1.0000
#                   (Other):792
#     distance      hand          die
#  Min.   :2.00   bad :420   Min.   :0.000
#  1st Qu.:2.75   good:420   1st Qu.:2.000
#  Median :3.50              Median :4.000
#  Mean   :3.50              Mean   :4.414
#  3rd Qu.:4.25              3rd Qu.:7.000
#  Max.   :5.00              Max.   :9.000

( N <- nrow(sox) )
# [1] 840

Note: From R version 4.0, the default for stringsAsFactors changed to FALSE. We do need strings to be converted to factors, so now have to specify stringsAsFactors = TRUE.

The model

Likelihood

We’ll start with a simple model with only two covariates: distance and hand. This looks similar to the simple regression model except that we have to work with the logit link: $${\rm logit}(p_i) = \beta_0 + \beta_{distance} \times distance_i + \beta_{hand} \times hand_i$$ $$y_i \sim {\rm Bernoulli}(p_i)$$ Here $y_i$ is the result, 1 if the sock went into the box, 0 otherwise. It is drawn from a Bernoulli distribution with probability of success $p_i$.

Priors

As with simple regression, we’ll centre the distance values by subtracting the mean (3.5 m) so that the intercept is sensible. Centred distance then runs from -1.5 to +1.5. We will convert the hand variable from a factor to a number (JAGS can’t handle factors), with 0 for bad hand and 1 for good hand.

With logistic regression, the sensible range for ${\rm logit}(p_i)$ is quite limited: -5 corresponds to a probability of 0.007 and +5 to 0.993. Since the covariates here are limited in range, the coefficients will also be limited. I like to use uniform priors when exploring models for reasons given here, so we’ll use ${\rm Uniform}(-5, 5)$ priors for the coefficients and check the posteriors to see if they are being constrained by the prior.
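To see where those probabilities come from, convert the ends of the ${\rm Uniform}(-5, 5)$ range back to the probability scale (a quick check using base R’s `plogis()`, the inverse logit):

```r
# Probabilities corresponding to logit values -5, 0 and +5
round(plogis(c(-5, 0, 5)), 3)
# 0.007 0.500 0.993
```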
The intercept tells us the probability (on the logit scale) of getting the sock into the box when all the covariates are 0, i.e., at 3.5 m from the box with the bad hand. We may not have any intuitions about $\beta_0$, so we put a uniform Beta prior on the underlying probability and then convert to the logit scale: $$p_0 \sim {\rm Beta}(1,1)$$ $$\beta_0 = {\rm logit}(p_0)$$

Here’s the diagram:

The JAGS code

# Save this in the file "sox_DH.jags"
model{
  # Likelihood
  for(i in 1:N) {
    logit(p[i]) <- b0 + bDist * distance[i] + bHand * hand[i]
    y[i] ~ dbern(p[i])
  }

  # Priors
  p0 ~ dbeta(1, 1)
  b0 <- logit(p0)
  bDist ~ dunif(-5, 5)
  bHand ~ dunif(-5, 5)
}

Running the model

Here’s the R code to prepare the data and run the model:

# Centre the distance values
distanceC <- sox$distance - mean(sox$distance)

# Convert the 'hand' factor to numeric, 0/1
hand <- as.numeric(sox$hand) - 1
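A quick sanity check of the factor-to-numeric conversion, using a small stand-alone example: factor levels are ordered alphabetically, so "bad" becomes 1 and "good" becomes 2, and subtracting 1 gives the 0/1 coding the model expects.

```r
# Factor levels sort alphabetically: "bad" = 1, "good" = 2
f <- factor(c("good", "bad", "good"))
as.numeric(f) - 1   # 1 0 1: "bad" -> 0, "good" -> 1
```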

jagsData <- list(N = N, y = sox$result, distance = distanceC, hand = hand)
str(jagsData)
# List of 4
# $ N       : int 840
# $ y       : int [1:840] 1 1 0 0 0 0 0 0 1 1 ...
# $ distance: num [1:840] -1.5 -1.5 -0.5 -0.5 0.5 0.5 1.5 ...
# $ hand    : num [1:840] 1 0 1 0 1 0 1 0 1 0 ...

# List the parameters we want JAGS to produce:
wanted <- c("p0", "bDist", "bHand")

library(jagsUI)
( soxDH <- jags(jagsData, NULL, wanted, "sox_DH.jags",
n.chains=3, n.iter=10000, parallel=TRUE, DIC=FALSE) )
# ...
#         mean    sd   2.5%    50%  97.5%  Rhat n.eff
# p0     0.387 0.026  0.336  0.387  0.439     1  5952
# bDist -0.940 0.077 -1.094 -0.939 -0.791     1 30000
# bHand  0.388 0.157  0.083  0.387  0.696     1  7200
# ...
library(mcmcOutput)
diagPlot(soxDH)

Here I’ve used the diagPlot function from the mcmcOutput package to do the diagnostic plots. We see that all three chains are mixing well and there’s no sign that the uniform priors are constraining the posterior.

Next we plot the posterior distributions for the parameters:

mco <- mcmcOutput(soxDH)
plot(mco)

The effect of increasing the distance is clearly negative, as we’d expect, and the effect of the good hand is positive. Changing hands from bad to good is equivalent to moving forward about 40cm.
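That equivalence is just the ratio of the posterior means printed above: how far you would have to move (in metres) for distance to change ${\rm logit}(p)$ by as much as switching hands does. A back-of-envelope check:

```r
# Distance with the same effect on logit(p) as switching from bad to good hand:
# bHand / |bDist|, using the posterior means from the output above
0.388 / 0.940   # roughly 0.41 m, i.e. about 40 cm
```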

The die score

What is the effect of the die on success in throwing the sock? Modify the code to include the die score as well as distance and hand. You should centre the die scores as we did distance by subtracting the mean. See the model example.
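As a starting point for the exercise, the die scores can be centred exactly as distance was; here is a sketch on a short made-up vector (the first few scores in the data), with the rest of the modification following the same pattern as distance and hand.

```r
# Centring a covariate: subtract the mean so 0 is the average score
die <- c(1, 5, 1, 5, 6, 4)   # example scores only
dieC <- die - mean(die)
mean(dieC)                    # 0, up to rounding error
```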

Non-independence

The analysis above assumes that 840 observations of sock-throwing are independent; one throw gives us no information about any other throw, at least from the same distance and with the same hand. But that isn’t true. We have data for only 105 people, some of whom are better at throwing socks than others. We can deal with this by including a parameter in the model to reflect the ability of the person throwing. That will be a hierarchical model, the topic of the next page.

9 thoughts on “Logistic regression”

1. Marcello Franchini says:

Good evening,

Unfortunately, considering the sentence “I like to use uniform priors when exploring models for reasons given here”, when I click on “here” I’m not taken to the appropriate page but to a different one. Please, can anyone fix the problem (assuming it’s not on my end; in that case, sorry about that)? Many thanks.

1. Marcello Franchini says:

Don’t worry. Thank you.

2. Marcello Franchini says:

Furthermore, I checked even within the simple regression tutorial but I have not clear what is the meaning of the following sentences:

1. As with simple regression, we’ll centre the distance value by subtracting the mean (3.5 m) so that the intercept is sensible. Distance then goes from -1.5 to +1.5.

2. With logistic regression, the sensible range for logit(pi) is quite limited: -5 corresponds to a probability of 0.007 and +5 to 0.993.

Many thanks again.

1. Marcello Franchini says:

Sorry, regarding the following question:

1. As with simple regression, we’ll centre the distance value by subtracting the mean (3.5 m) so that the intercept is sensible. Distance then goes from -1.5 to +1.5.

I solved the issue.

Re 2: The idea is that for sock-throwing, a probability of success < 1% is practically zero and > 99% is practically certain. If we got results that big or small, we’d want to take another look at the analysis.
It depends on the context though, lots of people buy lottery tickets when the probability of winning is ≪ 1%, but the actual probability is only of interest when lots of people are playing and prizes are huge.

1. Marcello Franchini says:

Thanks very much for the explanation and, especially, for your patience.

3. Marcello Franchini says:

Good evening,

I repeated the exercise, also considering the “die” covariate. Please, can anyone tell me if my results match yours? Here is the attached Word file. Many thanks in advance!
