Simple regression

As our example of a simple linear model we’ll use the very first published example, by Francis Galton in 1886 (Galton (1886) Regression Towards Mediocrity in Hereditary Stature, The Journal of the Anthropological Institute of Great Britain and Ireland, Vol. 15, 246-263). He was interested in the heights of parents and their children and offered prizes for families who sent him family records: these are obviously not a random sample, but likely representative of family relations in the population.

We will use only the data for heights of mothers and daughters; where a mother had more than one daughter, we selected one at random. We also converted heights from inches to cm. You can download and explore the data with the following R code:

galton <- read.csv("http://bcss.org.my/data/galtonF.csv", comment="#")
head(galton)
#   Mother Daughter
# 1  170.2    175.3
# 2  168.9    166.4
# 3  162.6    172.7
# 4  162.6    170.2
# 5  148.6    158.8
# 6  172.7    176.5
str(galton)
# 'data.frame':   168 obs. of  2 variables:
#  $ Mother  : num  170 169 163 163 149 ...
#  $ Daughter: num  175 166 173 170 159 ...
plot(Daughter ~ Mother, data=galton)
(fit <- lm(Daughter ~ Mother, data=galton))
# Call:
# lm(formula = Daughter ~ Mother, data = galton)
#
# Coefficients:
# (Intercept)       Mother  
#     96.9597       0.4062  
abline(fit, col='red')

As we might have expected, tall women tend to have tall daughters and short women short daughters. The feature that Galton noted was that tall women have daughters who are on average less tall than their mothers, and short women have daughters who are less short. He called this phenomenon “Regression Towards Mediocrity” – today we say “regression to the mean”. It’s historically odd that his term “regression” has been attached to this approach to modelling!

A note on terminology: The variable we want to predict – in this case the daughter’s height – is the response, and the mother’s height is the covariate. The lm function calculates two coefficients: the intercept is the predicted value of the response variable when the covariate is zero (or when all the covariates are zero in multiple regression) and the slope is the change in the response associated with a one-unit difference in the covariate.
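
To make this concrete, here’s a prediction for one mother done by hand from the coefficients above; the 165 cm height is just an illustrative value:

b <- coef(fit)                 # intercept and slope from the lm fit
b[1] + b[2] * 165              # 96.9597 + 0.4062 * 165 = about 164 cm
predict(fit, newdata=data.frame(Mother=165))   # the same, via predict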

The model

For JAGS, we need to specify our model carefully. We’d like to write $$y_i = \beta_0 + \beta_1 \times x_i$$ where $y_i$ is the response, $x_i$ the covariate, $\beta_0$ the intercept and $\beta_1$ the slope. But as the scatterplot above shows, the actual response values are often far from the theoretical values. So instead we calculate a mean value ($\mu_i$) and assume the actual observations are normally distributed around that mean: $$\mu_i = \beta_0 + \beta_1 \times x_i$$ $$y_i \sim {\rm Normal}(\mu_i, \sigma)$$ Note that $\sigma$ is included as a parameter in the Bayesian model, in contrast to lm.
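
To get a feel for what $\sigma$ does, we can simulate a data set from this model and compare it with the real scatterplot; the coefficients below are roughly those lm found, and the 5.7 cm scatter is just an illustrative guess:

x <- galton$Mother
mu <- 97 + 0.41 * x                          # the deterministic part
y.sim <- rnorm(length(x), mean=mu, sd=5.7)   # add the Normal scatter
plot(x, y.sim)                               # compare with the real data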

Priors

We need priors for the intercept, slope and $\sigma$. The intercept is a problem: it’s the expected height of the daughter of a mother who is 0 cm tall! That’s biological nonsense, but easily fixed. We centre the mothers’ heights by subtracting the mean. Our new predictor is the difference between a mother’s height and the mean, and the intercept is now the expected height of a daughter for a mother of mean height.

mean(galton$Mother)
# [1] 162.84
motherC <- galton$Mother - mean(galton$Mother)
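
A quick check that centring changes only the intercept; the slope is untouched:

lm(galton$Daughter ~ motherC)
# The slope is still 0.4062; the intercept is now about 163.1 cm,
# the expected daughter's height for a mother of average height.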

A gamma prior, which only allows positive values, would be good for the intercept, b0, with a mean of, say, 160 cm and a fairly large SD of 20 cm. We can check the shape of the curve with this code:

library(wiqid)
curve(dgamma2(x, 160, 20), 100, 250)
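
dgamma2 is wiqid’s mean-and-SD parameterisation of the gamma distribution. If you prefer base R, convert the mean and SD to shape and rate first (shape = mean^2/SD^2, rate = mean/SD^2):

shape <- 160^2 / 20^2          # = 64
rate <- 160 / 20^2             # = 0.4
curve(dgamma(x, shape, rate), 100, 250)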

We would expect the slope, b1, to be positive, but let’s imagine this is 1886 and we don’t have proper data on this – Galton himself couldn’t find any. So we’ll use a normal prior with mean 0 and a broad SD of 10; b1 = 10 would mean that a 1 cm difference in the mother’s height corresponds to a 10 cm difference in the daughter’s height, which is huge. For sigma we’ll use a uniform prior with range 0 to 10, and we’ll check that it does not constrain the posterior.
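
As with the gamma prior, it’s worth plotting these to check they look sensible:

curve(dnorm(x, 0, 10), -40, 40)    # slope prior: vague, centred on 0
curve(dunif(x, 0, 10), -1, 11)     # sigma prior: flat from 0 to 10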

The JAGS code

Now we can put together the JAGS code. I’ve used parent/child instead of mother/daughter so that we can use the same code to analyse data for fathers and sons. The dgamma distribution in JAGS needs shape and rate, which we will calculate in R and pass to JAGS in the data.


# Save this in the file "galton.jags"
model{
  # Likelihood
  for(i in 1:N) {
    mu[i] <- b0 + b1 * parent[i]   # the linear predictor
    child[i] ~ dnorm(mu[i], tau)   # Normal scatter around the mean
  }
  # Priors
  b0 ~ dgamma(shape, rate)
  b1 ~ dnorm(0, 0.01)
  sigma ~ dunif(0, 10)
  tau <- 1/sigma^2                 # dnorm in JAGS needs the precision
}

JAGS syntax: JAGS uses the for loop structure to indicate how the values in the parent and child vectors are related to mu. Notice also that JAGS uses the precision (tau) instead of sigma in the call to dnorm; we still prefer to use a prior for sigma and then calculate the corresponding tau.
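
This is why the prior for b1 appears as dnorm(0, 0.01) in the code above: a precision of 0.01 corresponds to the SD of 10 we chose.

1/sqrt(0.01)                   # SD corresponding to a precision of 0.01
# [1] 10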

Running the model

Here’s the R code to run the model and check the output.

gampars <- getGammaPar(160, 20)
jagsData <- list(child=galton$Daughter, parent=motherC, N=length(motherC),
  shape=gampars[1], rate=gampars[2])
str(jagsData)
# List of 5
# $ child : num [1:168] 175 166 173 170 159 ...
# $ parent: num [1:168] 7.357 6.057 -0.243 -0.243 -14.243 ...
# $ N     : int 168
# $ shape : num 64
# $ rate  : num 0.4 

wanted <- c("b0", "b1", "sigma")

library(jagsUI)
out <- jags(jagsData, NULL, wanted, "galton.jags",
  n.chains=3, n.iter=1e4, n.burnin=0, parallel=TRUE, DIC=FALSE)
# Processing function input....... 
#
# Done. 
# 
# Beginning parallel processing using 3 cores. Console output will be suppressed.
#
# Parallel processing completed.
#
# Calculating statistics....... 
#
# Done. 
out
# JAGS output for model 'galton.jags', generated by jagsUI.
# ...
#          mean    sd    2.5%     50%   97.5% overlap0 f Rhat n.eff
# b0    163.105 0.443 162.231 163.106 163.970    FALSE 1    1 17339
# b1      0.406 0.075   0.259   0.407   0.553    FALSE 1    1 30000
# sigma   5.690 0.313   5.118   5.673   6.342    FALSE 1    1 30000
#...
diagPlot(out)

This is a simple model, so we do not need to supply initial values. The default adaptation draws (100) were adequate, and they provided sufficient burn-in. The diagnostic plots look good; see here for more on interpreting these plots.
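
jagsUI also provides its own diagnostics; for example (assuming a reasonably recent version of the package):

jagsUI::traceplot(out)         # trace plots for each monitored parameter
max(unlist(out$Rhat))          # largest Rhat; should be close to 1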

The results

The main result of interest is the effect of the mother’s height, the slope b1. We’ll use the plotting functions in the wiqid package to look at the posterior distribution of this.

# Convert to a 'Bwiqid' object
bw <- as.Bwiqid(out)
plot(bw, "b1")

Mother’s height has a positive association with daughter’s height: a 1 cm difference in mother’s height is associated with a 0.4 cm difference in daughter’s height. If regression to the mean were not occurring and daughters were on average just as tall as their mothers, b1 would be 1; we can be sure that’s not true.
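
We can put a number on that certainty directly from the MCMC draws, as the proportion of posterior draws with b1 at or above 1:

mean(out$sims.list$b1 >= 1)    # posterior probability that b1 >= 1
# effectively 0: b1 = 1 is about 8 SDs above the posterior mean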

If each generation of daughters were closer to the mean height than their mothers, women would all be almost equally tall after a few hundred years. Why is there still variation in women’s heights? The clue is $\sigma$: each generation, the Normal scatter around the mean reintroduces the variation that regression to the mean removes.
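
A small simulation shows the balance, plugging in the posterior means from above: the shrinkage towards the mean in each generation is offset by the fresh Normal scatter, so the spread settles at about sigma / sqrt(1 - b1^2) = 6.2 cm instead of collapsing to zero.

set.seed(1)                        # any seed; for reproducibility only
h <- rnorm(1000, 163.1, 6)         # a starting population of heights
for(gen in 1:50)
  h <- 163.1 + 0.406 * (h - 163.1) + rnorm(1000, 0, 5.69)
sd(h)                              # close to 5.69/sqrt(1 - 0.406^2) = 6.23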

As Galton noted, a very tall person could be the average child of an exceptionally tall parent, but is more likely to be the exceptionally tall child of an average parent.

Next -> Binary data and logistic regression