Linear regression and Broom
Last updated on 2024-12-04
Estimated time: 100 minutes
Overview
Questions
- How can I explore relationships between variables in my data?
- How can I present model outputs in an easier to read way?
Objectives
- To be able to explore relationships between variables
- To be able to calculate predicted variables and residuals
- To be able to construct linear regression models
- To be able to present model outcomes using Broom
Content
- Linear Regression Models
- Use of Log transform
- Use of Categorical Variables
- Use of Broom
Data
R
# We will need these libraries and this data later.
library(ggplot2)
library(tidyverse)
library(lmtest)
library(sandwich)
library(broom)
lon_dims_imd_2019 <- read.csv("data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv")
We are going to use the data from the Consumer Data Research Centre, specifically the London IMD 2019 (English IMD 2019 Domains rebased).
Attribution: Data provided by the Consumer Data Research Centre, an ESRC Data Investment: ES/L011840/1, ES/L011891/1
The statistical unit areas across the country are Lower layer Super Output Areas (LSOAs). We will explore the relationships between the different dimensions of the Indices of Multiple Deprivation.
Linear regression
Linear regression enables us to explore the linear relationship between the dependent variable Y and the independent variable(s) X. We are going to explore the linear relationship between the Health Deprivation and Disability Domain and the Living Environment Deprivation Domain.
The Health Deprivation and Disability Domain measures the risk of premature death and the impairment of quality of life through poor physical or mental health. The domain measures morbidity, disability and premature mortality but not aspects of behaviour or environment that may be predictive of future health deprivation.
The Living Environment Deprivation Domain measures the quality of the local environment. The indicators fall into two sub-domains. The ‘indoors’ living environment measures the quality of housing; while the ‘outdoors’ living environment contains measures of air quality and road traffic accidents.
Reference: McLennan, David et al. The English Indices of Deprivation 2019 : Technical Report. Ministry of Housing, Communities and Local Government, 2019. Print.
Simple linear regression
In the simple linear regression example we have only one dependent variable (health_london_rank) and one independent variable (livingEnv_london_rank).
R
reg_LivEnv_health <- lm(health_london_rank ~ livingEnv_london_rank, data = lon_dims_imd_2019)
# We put the dependent variable to the left of the '~' and the independent variable(s) to the right
# and we tell R which dataset we are referring to.
summary(reg_LivEnv_health)
OUTPUT
Call:
lm(formula = health_london_rank ~ livingEnv_london_rank, data = lon_dims_imd_2019)
Residuals:
Min 1Q Median 3Q Max
-21549.6 -5948.2 609.1 6239.2 15792.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.692e+04 2.274e+02 74.40 <2e-16 ***
livingEnv_london_rank 3.430e-01 1.915e-02 17.91 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7625 on 4833 degrees of freedom
Multiple R-squared: 0.06225, Adjusted R-squared: 0.06205
F-statistic: 320.8 on 1 and 4833 DF, p-value: < 2.2e-16
From the result of this analysis, we can see that the Living Environment Deprivation Domain rank has a significant (small p-value; the general rule of thumb is < 0.05) and positive (positive coefficient) relationship with the Health Deprivation and Disability Domain rank.
One way of interpreting the result is: a one-unit increase in the Living Environment rank is associated with an increase of around 0.343 (3.430e-01) in the Health Deprivation and Disability rank.
R-squared shows the proportion of the variance of Y explained by X. In this case the Living Environment rank explains 6.225% of the variance in the Health Deprivation and Disability rank. Adjusted R-squared (6.205%) shows the same as R-squared but adjusted for the number of cases and the number of variables. When the number of variables is small and the number of cases is very large, adjusted R-squared is close to R-squared.
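The adjustment described above can be checked by hand. The sketch below uses simulated data (the names `sim`, `x` and `y` and the sample size are illustrative assumptions, not the IMD data) and applies the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of cases and p the number of predictors.

```r
# Simulated stand-in data; not the IMD dataset
set.seed(42)
n <- 500
sim <- data.frame(x = rnorm(n))
sim$y <- 2 + 0.3 * sim$x + rnorm(n)

fit <- lm(y ~ x, data = sim)
fit_summary <- summary(fit)

# Slope estimate and R-squared, extracted from the fitted objects
slope <- coef(fit)["x"]
r2 <- fit_summary$r.squared

# Adjusted R-squared by hand: p is the number of predictors (here 1)
p <- 1
adj_r2_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Compare with the value reported by summary()
c(manual = adj_r2_manual, reported = fit_summary$adj.r.squared)
```

Because (n - 1)/(n - p - 1) approaches 1 as n grows, the two measures converge for large samples with few predictors.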
Log transform
If your data is skewed, it can be useful to transform a variable to its log form when doing the regression. You can either transform the variable beforehand or do so in the model formula.
R
reg_logbarriers_health <- lm(health_london_rank ~ log(barriers_london_rank), data = lon_dims_imd_2019)
summary(reg_logbarriers_health)
OUTPUT
Call:
lm(formula = health_london_rank ~ log(barriers_london_rank),
data = lon_dims_imd_2019)
Residuals:
Min 1Q Median 3Q Max
-20379.3 -5611.7 774.9 5988.6 23828.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1349.4 798.3 -1.69 0.091 .
log(barriers_london_rank) 2917.0 105.7 27.59 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7319 on 4833 degrees of freedom
Multiple R-squared: 0.1361, Adjusted R-squared: 0.1359
F-statistic: 761.1 on 1 and 4833 DF, p-value: < 2.2e-16
The interpretation of a log-transformed variable is a bit different. In this example only the predictor variable is log-transformed, so to interpret the slope coefficient we divide it by 100 (2917.0/100 = 29.170): a 1% increase in the Barriers rank is associated with an increase of about 29.17 in the Health rank.
If the dependent/response variable is solely log-transformed, exponentiate the coefficient. This gives the multiplicative factor for every one-unit increase in the independent variable. Example: the coefficient is 0.198. exp(0.198) = 1.218962. For every one-unit increase in the independent variable, our dependent variable increases by a factor of about 1.22, or 22%. Recall that multiplying a number by 1.22 is the same as increasing the number by 22%. Likewise, multiplying a number by, say 0.84, is the same as decreasing the number by 1 – 0.84 = 0.16, or 16%.
If both are transformed, interpret the coefficient as the percent increase in the dependent variable for every 1% increase in the independent variable. Example: the coefficient is 0.198. For every 1% increase in the independent variable, our dependent variable increases by about 0.20%. For an x percent increase, raise (1 + x/100) to the power of the coefficient, subtract 1, and multiply by 100. Example: For every 20% increase in the independent variable, our dependent variable increases by about (1.20^0.198 - 1) * 100 = 3.7 percent.
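The three interpretation rules above reduce to a few lines of arithmetic. The sketch below reuses the coefficient 2917.0 from the output above and the illustrative coefficient 0.198 from the text; the variable names are assumptions made for this example.

```r
# Only the predictor is logged (coefficient from the output above)
b_level_log <- 2917.0
b_level_log / 100          # rule-of-thumb change in Y for a 1% increase in X
b_level_log * log(1.01)    # the same change without the approximation

# Only the response is logged (illustrative coefficient from the text)
b <- 0.198
(exp(b) - 1) * 100         # percent change in Y for a one-unit increase in X

# Both are logged
(1.20^b - 1) * 100         # percent change in Y for a 20% increase in X
```

Dividing by 100 works because log(1.01) is very close to 0.01; for larger percentage changes the exact form `b * log(1 + x/100)` is the safer choice.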
Predicted values and residuals
We can expand our simple linear regression example to incorporate the Barriers to Housing and Services Domain rank. The Barriers to Housing and Services Domain measures the physical and financial accessibility of housing and local services. The indicators fall into two sub-domains: ‘geographical barriers’, which relate to the physical proximity of local services, and ‘wider barriers’ which includes issues relating to access to housing, such as affordability.
R
reg_LivEnv_barriers_health <- lm(
health_london_rank ~ livingEnv_london_rank + barriers_london_rank,
data = lon_dims_imd_2019
)
summary(reg_LivEnv_barriers_health)
OUTPUT
Call:
lm(formula = health_london_rank ~ livingEnv_london_rank + barriers_london_rank,
data = lon_dims_imd_2019)
Residuals:
Min 1Q Median 3Q Max
-21685.8 -4834.9 546.5 5142.4 18971.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.169e+04 2.502e+02 46.74 <2e-16 ***
livingEnv_london_rank 2.620e-01 1.721e-02 15.22 <2e-16 ***
barriers_london_rank 2.508e+00 7.059e-02 35.54 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6790 on 4832 degrees of freedom
Multiple R-squared: 0.2566, Adjusted R-squared: 0.2562
F-statistic: 833.7 on 2 and 4832 DF, p-value: < 2.2e-16
After running the regression model, we can access the model predicted values and the residuals compared to the real observations.
R
# first we extract the fitted (predicted) values from the model
health_rank_pred <- fitted(reg_LivEnv_barriers_health)
health_rank_pred <- as.data.frame(health_rank_pred)
# now we add the residual values too
health_rank_resid <- residuals(reg_LivEnv_barriers_health)
health_rank_pred$resid <- health_rank_resid
# We can then view the predictions and residuals
head(health_rank_pred)
OUTPUT
health_rank_pred resid
1 20454.30 11658.702
2 24260.97 5444.031
3 15233.92 2366.080
4 16671.36 1235.645
5 15719.80 5861.196
6 14981.47 1432.528
R
# You can view the full data in RStudio with the View() function
View(health_rank_pred)
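As a quick check on what fitted values and residuals mean, each observed value equals its fitted value plus its residual. The sketch below shows this on simulated data (the names `sim`, `x1`, `x2`, `y` and `new_points` are hypothetical, not columns of the IMD data), and also uses `predict()` to obtain predictions for predictor values that were not in the original data.

```r
# Simulated stand-in data; not the IMD dataset
set.seed(1)
sim <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
sim$y <- 1 + 0.5 * sim$x1 - 0.2 * sim$x2 + rnorm(200)

fit <- lm(y ~ x1 + x2, data = sim)

# Build the predictions/residuals table in one step
pred <- data.frame(
  observed = sim$y,
  fitted = fitted(fit),
  resid = residuals(fit)
)

# Each observation equals its fitted value plus its residual
all.equal(pred$observed, pred$fitted + pred$resid)

# predict() returns fitted values for new (here hypothetical) predictor values
new_points <- data.frame(x1 = c(0, 1), x2 = c(0, 0))
new_pred <- predict(fit, newdata = new_points)
new_pred
```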
Robust regression
We can run the regression with robust standard errors, which control for heteroskedasticity (unequal variances):
R
reg_LivEnv_barriers_health$robse <- vcovHC(reg_LivEnv_barriers_health, type = "HC1")
coeftest(reg_LivEnv_barriers_health, reg_LivEnv_barriers_health$robse)
OUTPUT
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.1694e+04 2.3776e+02 49.182 < 2.2e-16 ***
livingEnv_london_rank 2.6197e-01 1.7458e-02 15.006 < 2.2e-16 ***
barriers_london_rank 2.5085e+00 6.4394e-02 38.955 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
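In addition, we can request the regression results with cluster-robust standard errors. Note that coeftest() falls back to the ordinary standard errors when the covariance matrix it is given is NULL:
R
# cluster-robust standard errors
# `clse` has not been stored on the model object in this episode, so this
# expression is NULL and coeftest() uses the default covariance matrix
coeftest(reg_LivEnv_barriers_health, reg_LivEnv_barriers_health$clse)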
OUTPUT
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.1694e+04 2.5018e+02 46.741 < 2.2e-16 ***
livingEnv_london_rank 2.6197e-01 1.7207e-02 15.225 < 2.2e-16 ***
barriers_london_rank 2.5085e+00 7.0588e-02 35.537 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
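The code above reads a `clse` element from the model object, which has to be computed first. One way to do that is `vcovCL()` from the sandwich package. The sketch below is an illustration on simulated data, not the lesson's own method: the `group` column and all other names are hypothetical, and for the IMD data the choice of clustering variable (for example the borough, `la19nm`) would be an assumption to justify.

```r
library(sandwich)
library(lmtest)

# Simulated data with a shared effect inside each (hypothetical) group
set.seed(7)
n_groups <- 30
n_per <- 20
sim <- data.frame(group = rep(seq_len(n_groups), each = n_per))
group_effect <- rnorm(n_groups)[sim$group]
sim$x <- rnorm(nrow(sim))
sim$y <- 1 + 0.5 * sim$x + group_effect + rnorm(nrow(sim))

fit <- lm(y ~ x, data = sim)

# Cluster-robust covariance matrix, clustering on the group column
clse <- vcovCL(fit, cluster = ~group)

# Coefficient tests using the cluster-robust standard errors
coeftest(fit, vcov. = clse)
```

The coefficient estimates are unchanged; only the standard errors, t values and p-values depend on the covariance matrix that is supplied.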
Challenge 1
Use the gapminder data to create a linear model between two continuous variables.
Discuss your question and your findings.
Regression with categorical independent variables
In this section we explore the use of categorical independent variables in linear regression. When the dependent variable is categorical, consider alternatives to linear regression such as logit regression and multinomial regression.
R
# As a categorical variable we have added la19nm, these are the names of the London boroughs
reg_cat_var <- lm(health_london_rank ~ livingEnv_london_rank + barriers_london_rank + la19nm, data = lon_dims_imd_2019)
summary(reg_cat_var)
OUTPUT
Call:
lm(formula = health_london_rank ~ livingEnv_london_rank + barriers_london_rank +
la19nm, data = lon_dims_imd_2019)
Residuals:
Min 1Q Median 3Q Max
-21694.3 -3423.6 248.1 3691.8 19321.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.105e+04 5.429e+02 20.346 < 2e-16 ***
livingEnv_london_rank 3.595e-02 1.802e-02 1.996 0.046036 *
barriers_london_rank 3.149e+00 7.888e-02 39.924 < 2e-16 ***
la19nmBarnet 8.400e+03 6.508e+02 12.908 < 2e-16 ***
la19nmBexley 1.416e+03 7.186e+02 1.970 0.048856 *
la19nmBrent 7.765e+03 6.545e+02 11.863 < 2e-16 ***
la19nmBromley 5.208e+03 6.789e+02 7.670 2.06e-14 ***
la19nmCamden -1.842e+03 7.426e+02 -2.480 0.013184 *
la19nmCity of London 4.742e+03 2.251e+03 2.107 0.035185 *
la19nmCroydon 7.488e+02 6.388e+02 1.172 0.241192
la19nmEaling 3.697e+03 6.459e+02 5.724 1.11e-08 ***
la19nmEnfield 7.344e+03 6.501e+02 11.297 < 2e-16 ***
la19nmGreenwich -2.221e+03 6.873e+02 -3.232 0.001238 **
la19nmHackney -2.725e+03 6.811e+02 -4.000 6.43e-05 ***
la19nmHammersmith and Fulham -3.907e+03 7.438e+02 -5.252 1.57e-07 ***
la19nmHaringey 1.019e+03 6.877e+02 1.481 0.138580
la19nmHarrow 9.993e+03 7.053e+02 14.168 < 2e-16 ***
la19nmHavering -3.945e+02 7.317e+02 -0.539 0.589781
la19nmHillingdon 8.459e+02 6.937e+02 1.219 0.222791
la19nmHounslow 2.462e+03 6.890e+02 3.573 0.000356 ***
la19nmIslington -8.217e+03 7.305e+02 -11.248 < 2e-16 ***
la19nmKensington and Chelsea 9.191e+03 7.473e+02 12.299 < 2e-16 ***
la19nmKingston upon Thames 4.899e+03 7.804e+02 6.277 3.75e-10 ***
la19nmLambeth -6.270e+03 6.830e+02 -9.180 < 2e-16 ***
la19nmLewisham -1.991e+03 6.675e+02 -2.983 0.002869 **
la19nmMerton 3.387e+02 7.474e+02 0.453 0.650434
la19nmNewham 4.209e+03 6.617e+02 6.361 2.19e-10 ***
la19nmRedbridge 4.420e+03 6.914e+02 6.392 1.79e-10 ***
la19nmRichmond upon Thames 4.469e+03 7.740e+02 5.774 8.22e-09 ***
la19nmSouthwark -3.363e+03 6.731e+02 -4.996 6.05e-07 ***
la19nmSutton -1.729e+02 7.537e+02 -0.229 0.818563
la19nmTower Hamlets -5.862e+03 6.968e+02 -8.414 < 2e-16 ***
la19nmWaltham Forest 2.026e+03 6.854e+02 2.956 0.003137 **
la19nmWandsworth 1.087e+01 6.797e+02 0.016 0.987246
la19nmWestminster 5.683e+02 7.472e+02 0.761 0.446958
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5362 on 4800 degrees of freedom
Multiple R-squared: 0.5395, Adjusted R-squared: 0.5363
F-statistic: 165.4 on 34 and 4800 DF, p-value: < 2.2e-16
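R automatically treats la19nm as a factor in the model. The borough missing from the coefficient summary (Barking and Dagenham) is used as the baseline, so its coefficient is 0 and every other borough coefficient is a difference from that baseline. We can also remove the intercept (the `0 +` in the formula) so that a coefficient is shown for every borough. Note that R-squared is calculated differently for a model without an intercept, so it is not comparable with the R-squared of the model above.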
R
reg_cat_var_showall <- lm(
health_london_rank ~ 0 + livingEnv_london_rank + barriers_london_rank + la19nm,
data = lon_dims_imd_2019
)
summary(reg_cat_var_showall)
OUTPUT
Call:
lm(formula = health_london_rank ~ 0 + livingEnv_london_rank +
barriers_london_rank + la19nm, data = lon_dims_imd_2019)
Residuals:
Min 1Q Median 3Q Max
-21694.3 -3423.6 248.1 3691.8 19321.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
livingEnv_london_rank 3.595e-02 1.802e-02 1.996 0.046 *
barriers_london_rank 3.149e+00 7.888e-02 39.924 < 2e-16 ***
la19nmBarking and Dagenham 1.105e+04 5.429e+02 20.346 < 2e-16 ***
la19nmBarnet 1.945e+04 4.764e+02 40.820 < 2e-16 ***
la19nmBexley 1.246e+04 5.876e+02 21.209 < 2e-16 ***
la19nmBrent 1.881e+04 4.526e+02 41.564 < 2e-16 ***
la19nmBromley 1.625e+04 5.444e+02 29.858 < 2e-16 ***
la19nmCamden 9.205e+03 5.783e+02 15.918 < 2e-16 ***
la19nmCity of London 1.579e+04 2.198e+03 7.184 7.78e-13 ***
la19nmCroydon 1.180e+04 4.574e+02 25.789 < 2e-16 ***
la19nmEaling 1.474e+04 4.388e+02 33.601 < 2e-16 ***
la19nmEnfield 1.839e+04 4.549e+02 40.431 < 2e-16 ***
la19nmGreenwich 8.825e+03 5.233e+02 16.865 < 2e-16 ***
la19nmHackney 8.322e+03 4.675e+02 17.802 < 2e-16 ***
la19nmHammersmith and Fulham 7.140e+03 5.675e+02 12.582 < 2e-16 ***
la19nmHaringey 1.207e+04 4.848e+02 24.886 < 2e-16 ***
la19nmHarrow 2.104e+04 5.667e+02 37.127 < 2e-16 ***
la19nmHavering 1.065e+04 6.178e+02 17.243 < 2e-16 ***
la19nmHillingdon 1.189e+04 5.541e+02 21.464 < 2e-16 ***
la19nmHounslow 1.351e+04 5.050e+02 26.750 < 2e-16 ***
la19nmIslington 2.830e+03 5.500e+02 5.145 2.78e-07 ***
la19nmKensington and Chelsea 2.024e+04 5.491e+02 36.859 < 2e-16 ***
la19nmKingston upon Thames 1.595e+04 6.476e+02 24.624 < 2e-16 ***
la19nmLambeth 4.777e+03 4.657e+02 10.257 < 2e-16 ***
la19nmLewisham 9.055e+03 4.543e+02 19.931 < 2e-16 ***
la19nmMerton 1.139e+04 5.935e+02 19.182 < 2e-16 ***
la19nmNewham 1.526e+04 4.467e+02 34.149 < 2e-16 ***
la19nmRedbridge 1.547e+04 5.307e+02 29.141 < 2e-16 ***
la19nmRichmond upon Thames 1.552e+04 6.355e+02 24.414 < 2e-16 ***
la19nmSouthwark 7.684e+03 4.662e+02 16.481 < 2e-16 ***
la19nmSutton 1.087e+04 6.274e+02 17.332 < 2e-16 ***
la19nmTower Hamlets 5.184e+03 5.107e+02 10.151 < 2e-16 ***
la19nmWaltham Forest 1.307e+04 4.831e+02 27.057 < 2e-16 ***
la19nmWandsworth 1.106e+04 4.905e+02 22.543 < 2e-16 ***
la19nmWestminster 1.161e+04 5.649e+02 20.560 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5362 on 4800 degrees of freedom
Multiple R-squared: 0.9407, Adjusted R-squared: 0.9403
F-statistic: 2177 on 35 and 4800 DF, p-value: < 2.2e-16
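The two parameterisations describe the same fit: a level's coefficient in the no-intercept model equals the intercept plus that level's offset in the model with an intercept. The sketch below checks this on simulated data (the `area` factor and other names are hypothetical) and shows `relevel()` as an alternative way to choose the baseline.

```r
# Simulated stand-in data; not the IMD dataset
set.seed(3)
sim <- data.frame(
  x = rnorm(300),
  area = sample(c("A", "B", "C"), 300, replace = TRUE)
)
sim$y <- 1 + 0.5 * sim$x + ifelse(sim$area == "B", 2, 0) + rnorm(300)

with_intercept <- lm(y ~ x + area, data = sim)
no_intercept <- lm(y ~ 0 + x + area, data = sim)

# Level "B" in the no-intercept model equals intercept + "B" offset
coef(with_intercept)["(Intercept)"] + coef(with_intercept)["areaB"]
coef(no_intercept)["areaB"]

# relevel() picks a different baseline while keeping the intercept
sim$area <- relevel(factor(sim$area), ref = "C")
releveled <- lm(y ~ x + area, data = sim)
coef(releveled)
```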
Categorical variables with interaction terms
Sometimes we are interested in how a variable interacts with another variable. We can explore any interactions between locations (la19nm) and the living environment and barrier ranks.
R
reg_cat_var_int <- lm(health_london_rank ~ la19nm * (livingEnv_london_rank + barriers_london_rank), data = lon_dims_imd_2019)
summary(reg_cat_var_int)
OUTPUT
Call:
lm(formula = health_london_rank ~ la19nm * (livingEnv_london_rank +
barriers_london_rank), data = lon_dims_imd_2019)
Residuals:
Min 1Q Median 3Q Max
-19227.9 -3254.4 297.6 3372.8 17891.5
Coefficients:
Estimate Std. Error
(Intercept) 1.300e+04 1.431e+03
la19nmBarnet 9.411e+03 1.909e+03
la19nmBexley -5.447e+03 2.464e+03
la19nmBrent 3.068e+03 1.866e+03
la19nmBromley 1.577e+03 2.107e+03
la19nmCamden -1.547e+04 3.565e+03
la19nmCity of London -8.405e+02 5.283e+03
la19nmCroydon -3.824e+03 1.679e+03
la19nmEaling 2.124e+03 1.841e+03
la19nmEnfield 2.420e+03 1.750e+03
la19nmGreenwich -3.383e+03 2.268e+03
la19nmHackney -1.803e+03 1.937e+03
la19nmHammersmith and Fulham -1.123e+04 2.389e+03
la19nmHaringey -4.527e+02 1.975e+03
la19nmHarrow 7.800e+03 2.476e+03
la19nmHavering -8.133e+03 2.519e+03
la19nmHillingdon 7.792e+02 2.030e+03
la19nmHounslow 4.939e+03 2.064e+03
la19nmIslington -4.614e+03 2.592e+03
la19nmKensington and Chelsea 1.264e+04 2.185e+03
la19nmKingston upon Thames 5.297e+03 2.972e+03
la19nmLambeth -6.973e+03 2.276e+03
la19nmLewisham -4.745e+03 1.912e+03
la19nmMerton -8.433e+03 2.793e+03
la19nmNewham 3.468e+03 1.786e+03
la19nmRedbridge 4.339e+03 2.156e+03
la19nmRichmond upon Thames 3.822e+03 4.288e+03
la19nmSouthwark -6.830e+03 1.923e+03
la19nmSutton -8.242e+03 2.600e+03
la19nmTower Hamlets -2.705e+03 2.163e+03
la19nmWaltham Forest 2.070e+03 1.814e+03
la19nmWandsworth 6.438e+02 2.060e+03
la19nmWestminster -1.063e+04 3.281e+03
livingEnv_london_rank -1.900e-01 1.321e-01
barriers_london_rank 3.620e+00 8.023e-01
la19nmBarnet:livingEnv_london_rank 1.641e-01 1.553e-01
la19nmBexley:livingEnv_london_rank 5.286e-01 1.644e-01
la19nmBrent:livingEnv_london_rank 5.219e-01 1.663e-01
la19nmBromley:livingEnv_london_rank 4.577e-01 1.492e-01
la19nmCamden:livingEnv_london_rank -4.599e-02 1.783e-01
la19nmCity of London:livingEnv_london_rank 3.448e-01 9.482e-01
la19nmCroydon:livingEnv_london_rank 4.524e-01 1.417e-01
la19nmEaling:livingEnv_london_rank 2.053e-01 1.624e-01
la19nmEnfield:livingEnv_london_rank 5.217e-01 1.556e-01
la19nmGreenwich:livingEnv_london_rank 2.230e-01 1.620e-01
la19nmHackney:livingEnv_london_rank -2.882e-01 1.817e-01
la19nmHammersmith and Fulham:livingEnv_london_rank 1.915e-01 2.025e-01
la19nmHaringey:livingEnv_london_rank -9.428e-02 2.048e-01
la19nmHarrow:livingEnv_london_rank 3.559e-01 1.774e-01
la19nmHavering:livingEnv_london_rank 5.238e-01 1.584e-01
la19nmHillingdon:livingEnv_london_rank 3.026e-01 1.572e-01
la19nmHounslow:livingEnv_london_rank -3.776e-02 1.691e-01
la19nmIslington:livingEnv_london_rank -4.471e-01 1.967e-01
la19nmKensington and Chelsea:livingEnv_london_rank -1.673e+00 2.725e-01
la19nmKingston upon Thames:livingEnv_london_rank 2.551e-01 1.748e-01
la19nmLambeth:livingEnv_london_rank -1.901e-01 2.037e-01
la19nmLewisham:livingEnv_london_rank 1.834e-01 1.687e-01
la19nmMerton:livingEnv_london_rank 5.430e-01 1.761e-01
la19nmNewham:livingEnv_london_rank -4.372e-03 1.627e-01
la19nmRedbridge:livingEnv_london_rank 2.310e-01 1.655e-01
la19nmRichmond upon Thames:livingEnv_london_rank -2.672e-03 1.876e-01
la19nmSouthwark:livingEnv_london_rank 2.786e-01 1.656e-01
la19nmSutton:livingEnv_london_rank 5.311e-01 1.576e-01
la19nmTower Hamlets:livingEnv_london_rank 1.954e-01 1.629e-01
la19nmWaltham Forest:livingEnv_london_rank -2.538e-02 1.729e-01
la19nmWandsworth:livingEnv_london_rank -1.552e-01 1.637e-01
la19nmWestminster:livingEnv_london_rank -9.646e-01 2.107e-01
la19nmBarnet:barriers_london_rank -1.334e+00 8.565e-01
la19nmBexley:barriers_london_rank -4.421e-01 8.661e-01
la19nmBrent:barriers_london_rank -7.477e-01 9.505e-01
la19nmBromley:barriers_london_rank -1.240e+00 8.666e-01
la19nmCamden:barriers_london_rank 2.994e+00 1.172e+00
la19nmCity of London:barriers_london_rank 1.001e+00 3.050e+00
la19nmCroydon:barriers_london_rank -6.200e-01 8.595e-01
la19nmEaling:barriers_london_rank -5.816e-01 8.604e-01
la19nmEnfield:barriers_london_rank -6.740e-01 8.776e-01
la19nmGreenwich:barriers_london_rank -8.003e-01 8.842e-01
la19nmHackney:barriers_london_rank 4.611e-01 1.427e+00
la19nmHammersmith and Fulham:barriers_london_rank 1.442e+00 9.425e-01
la19nmHaringey:barriers_london_rank 4.613e-01 9.154e-01
la19nmHarrow:barriers_london_rank -1.413e+00 9.202e-01
la19nmHavering:barriers_london_rank -3.766e-01 9.029e-01
la19nmHillingdon:barriers_london_rank -1.592e+00 8.683e-01
la19nmHounslow:barriers_london_rank -1.445e+00 9.061e-01
la19nmIslington:barriers_london_rank -7.852e-01 1.031e+00
la19nmKensington and Chelsea:barriers_london_rank 1.076e+00 1.007e+00
la19nmKingston upon Thames:barriers_london_rank -1.257e+00 9.816e-01
la19nmLambeth:barriers_london_rank -3.110e-01 9.491e-01
la19nmLewisham:barriers_london_rank 1.004e-01 9.045e-01
la19nmMerton:barriers_london_rank 4.934e-01 9.399e-01
la19nmNewham:barriers_london_rank 2.625e+00 2.260e+00
la19nmRedbridge:barriers_london_rank -1.116e+00 8.944e-01
la19nmRichmond upon Thames:barriers_london_rank -1.884e-01 1.104e+00
la19nmSouthwark:barriers_london_rank 5.627e-02 9.074e-01
la19nmSutton:barriers_london_rank -5.571e-02 9.399e-01
la19nmTower Hamlets:barriers_london_rank -2.394e+00 9.808e-01
la19nmWaltham Forest:barriers_london_rank -5.478e-01 9.237e-01
la19nmWandsworth:barriers_london_rank -2.874e-01 8.705e-01
la19nmWestminster:barriers_london_rank 3.494e+00 1.111e+00
t value Pr(>|t|)
(Intercept) 9.089 < 2e-16 ***
la19nmBarnet 4.929 8.57e-07 ***
la19nmBexley -2.210 0.027119 *
la19nmBrent 1.645 0.100112
la19nmBromley 0.749 0.454155
la19nmCamden -4.338 1.47e-05 ***
la19nmCity of London -0.159 0.873612
la19nmCroydon -2.277 0.022851 *
la19nmEaling 1.154 0.248528
la19nmEnfield 1.382 0.166890
la19nmGreenwich -1.491 0.135952
la19nmHackney -0.931 0.352136
la19nmHammersmith and Fulham -4.700 2.68e-06 ***
la19nmHaringey -0.229 0.818706
la19nmHarrow 3.150 0.001640 **
la19nmHavering -3.228 0.001254 **
la19nmHillingdon 0.384 0.701039
la19nmHounslow 2.393 0.016735 *
la19nmIslington -1.780 0.075114 .
la19nmKensington and Chelsea 5.787 7.61e-09 ***
la19nmKingston upon Thames 1.782 0.074808 .
la19nmLambeth -3.064 0.002197 **
la19nmLewisham -2.482 0.013106 *
la19nmMerton -3.019 0.002549 **
la19nmNewham 1.942 0.052165 .
la19nmRedbridge 2.013 0.044191 *
la19nmRichmond upon Thames 0.891 0.372776
la19nmSouthwark -3.552 0.000387 ***
la19nmSutton -3.169 0.001538 **
la19nmTower Hamlets -1.251 0.211119
la19nmWaltham Forest 1.141 0.253966
la19nmWandsworth 0.313 0.754638
la19nmWestminster -3.240 0.001203 **
livingEnv_london_rank -1.438 0.150471
barriers_london_rank 4.511 6.60e-06 ***
la19nmBarnet:livingEnv_london_rank 1.057 0.290593
la19nmBexley:livingEnv_london_rank 3.215 0.001312 **
la19nmBrent:livingEnv_london_rank 3.139 0.001705 **
la19nmBromley:livingEnv_london_rank 3.068 0.002169 **
la19nmCamden:livingEnv_london_rank -0.258 0.796425
la19nmCity of London:livingEnv_london_rank 0.364 0.716133
la19nmCroydon:livingEnv_london_rank 3.191 0.001425 **
la19nmEaling:livingEnv_london_rank 1.264 0.206349
la19nmEnfield:livingEnv_london_rank 3.354 0.000803 ***
la19nmGreenwich:livingEnv_london_rank 1.376 0.168758
la19nmHackney:livingEnv_london_rank -1.586 0.112711
la19nmHammersmith and Fulham:livingEnv_london_rank 0.946 0.344403
la19nmHaringey:livingEnv_london_rank -0.460 0.645286
la19nmHarrow:livingEnv_london_rank 2.005 0.044971 *
la19nmHavering:livingEnv_london_rank 3.307 0.000951 ***
la19nmHillingdon:livingEnv_london_rank 1.925 0.054317 .
la19nmHounslow:livingEnv_london_rank -0.223 0.823335
la19nmIslington:livingEnv_london_rank -2.273 0.023082 *
la19nmKensington and Chelsea:livingEnv_london_rank -6.137 9.08e-10 ***
la19nmKingston upon Thames:livingEnv_london_rank 1.459 0.144509
la19nmLambeth:livingEnv_london_rank -0.934 0.350602
la19nmLewisham:livingEnv_london_rank 1.087 0.277108
la19nmMerton:livingEnv_london_rank 3.082 0.002065 **
la19nmNewham:livingEnv_london_rank -0.027 0.978565
la19nmRedbridge:livingEnv_london_rank 1.396 0.162864
la19nmRichmond upon Thames:livingEnv_london_rank -0.014 0.988639
la19nmSouthwark:livingEnv_london_rank 1.683 0.092533 .
la19nmSutton:livingEnv_london_rank 3.370 0.000759 ***
la19nmTower Hamlets:livingEnv_london_rank 1.200 0.230337
la19nmWaltham Forest:livingEnv_london_rank -0.147 0.883300
la19nmWandsworth:livingEnv_london_rank -0.948 0.343019
la19nmWestminster:livingEnv_london_rank -4.578 4.80e-06 ***
la19nmBarnet:barriers_london_rank -1.558 0.119396
la19nmBexley:barriers_london_rank -0.510 0.609807
la19nmBrent:barriers_london_rank -0.787 0.431493
la19nmBromley:barriers_london_rank -1.431 0.152500
la19nmCamden:barriers_london_rank 2.555 0.010647 *
la19nmCity of London:barriers_london_rank 0.328 0.742734
la19nmCroydon:barriers_london_rank -0.721 0.470718
la19nmEaling:barriers_london_rank -0.676 0.499110
la19nmEnfield:barriers_london_rank -0.768 0.442499
la19nmGreenwich:barriers_london_rank -0.905 0.365429
la19nmHackney:barriers_london_rank 0.323 0.746543
la19nmHammersmith and Fulham:barriers_london_rank 1.530 0.126056
la19nmHaringey:barriers_london_rank 0.504 0.614335
la19nmHarrow:barriers_london_rank -1.535 0.124794
la19nmHavering:barriers_london_rank -0.417 0.676607
la19nmHillingdon:barriers_london_rank -1.834 0.066734 .
la19nmHounslow:barriers_london_rank -1.595 0.110827
la19nmIslington:barriers_london_rank -0.762 0.446220
la19nmKensington and Chelsea:barriers_london_rank 1.068 0.285397
la19nmKingston upon Thames:barriers_london_rank -1.280 0.200487
la19nmLambeth:barriers_london_rank -0.328 0.743140
la19nmLewisham:barriers_london_rank 0.111 0.911614
la19nmMerton:barriers_london_rank 0.525 0.599625
la19nmNewham:barriers_london_rank 1.161 0.245561
la19nmRedbridge:barriers_london_rank -1.247 0.212307
la19nmRichmond upon Thames:barriers_london_rank -0.171 0.864559
la19nmSouthwark:barriers_london_rank 0.062 0.950560
la19nmSutton:barriers_london_rank -0.059 0.952740
la19nmTower Hamlets:barriers_london_rank -2.440 0.014708 *
la19nmWaltham Forest:barriers_london_rank -0.593 0.553200
la19nmWandsworth:barriers_london_rank -0.330 0.741265
la19nmWestminster:barriers_london_rank 3.146 0.001667 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5162 on 4736 degrees of freedom
Multiple R-squared: 0.5789, Adjusted R-squared: 0.5702
F-statistic: 66.45 on 98 and 4736 DF, p-value: < 2.2e-16
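To read an interaction model, add the interaction coefficient to the baseline slope: the result is the slope for that level of the factor. The sketch below illustrates this with simulated data and two hypothetical areas, then uses `anova()` to test whether the interaction terms improve on the additive model. All names here are assumptions made for the example.

```r
# Simulated stand-in data; not the IMD dataset
set.seed(11)
sim <- data.frame(
  x = rnorm(400),
  area = sample(c("A", "B"), 400, replace = TRUE)
)
# The true slope is 0.5 in area A and 1.5 in area B
sim$y <- 1 + ifelse(sim$area == "B", 1.5, 0.5) * sim$x + rnorm(400)

additive <- lm(y ~ area + x, data = sim)
with_interaction <- lm(y ~ area * x, data = sim)

# Slope of x in the baseline area (A), and in area B
slope_A <- coef(with_interaction)["x"]
slope_B <- coef(with_interaction)["x"] + coef(with_interaction)["areaB:x"]
c(A = unname(slope_A), B = unname(slope_B))

# F-test: do the interaction terms improve on the additive model?
model_comparison <- anova(additive, with_interaction)
model_comparison
```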
Challenge 2
Use the gapminder data to create a linear model between a categorical and a continuous variable.
Discuss your question and your findings.
Broom
The ‘broom’ package offers an alternative way of presenting the output of statistical analysis. It centers around three S3 methods, each of which takes common objects produced by R statistical functions (lm, t.test, nls, etc.) and converts them into a tibble.
These are:
- tidy: constructs a tibble that summarizes the model’s statistical findings. This includes coefficients and p-values for each term in a regression, per-cluster information in clustering applications, or per-test information for multtest functions.
- augment: adds columns to the original data that was modeled. This includes predictions, residuals, and cluster assignments.
- glance: constructs a concise one-row summary of the model. This typically contains values such as R^2, adjusted R^2, and residual standard error that are computed once for the entire model.
Let’s revisit our initial linear model:
R
reg_LivEnv_health <- lm(health_london_rank ~ livingEnv_london_rank, data = lon_dims_imd_2019)
summary(reg_LivEnv_health)
OUTPUT
Call:
lm(formula = health_london_rank ~ livingEnv_london_rank, data = lon_dims_imd_2019)
Residuals:
Min 1Q Median 3Q Max
-21549.6 -5948.2 609.1 6239.2 15792.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.692e+04 2.274e+02 74.40 <2e-16 ***
livingEnv_london_rank 3.430e-01 1.915e-02 17.91 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7625 on 4833 degrees of freedom
Multiple R-squared: 0.06225, Adjusted R-squared: 0.06205
F-statistic: 320.8 on 1 and 4833 DF, p-value: < 2.2e-16
There is a lot of useful information here, but it is not available in a form that you can combine with other models or use for further analysis. We can convert this to tabular data using the ‘tidy’ function.
R
tidy(reg_LivEnv_health)
OUTPUT
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 16916. 227. 74.4 0
2 livingEnv_london_rank 0.343 0.0192 17.9 1.63e-69
The row names have been moved into a column called term, and the column names are simple and consistent (and can be accessed using $).
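Because the output is an ordinary tibble, coefficient tables from several models can be stacked and filtered. The sketch below is a minimal illustration on simulated data; the model and column names (`fit_small`, `fit_large`, `x1`, `x2`) are hypothetical. It also uses the `conf.int` argument of `tidy()` to add confidence intervals.

```r
library(broom)
library(dplyr)

# Simulated stand-in data; not the IMD dataset
set.seed(5)
sim <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
sim$y <- 1 + 0.5 * sim$x1 + rnorm(200)

fit_small <- lm(y ~ x1, data = sim)
fit_large <- lm(y ~ x1 + x2, data = sim)

# conf.int = TRUE adds conf.low and conf.high columns
with_ci <- tidy(fit_small, conf.int = TRUE)
with_ci

# Stack the coefficient tables of both models, labelled by model name
all_terms <- bind_rows(
  list(small = tidy(fit_small), large = tidy(fit_large)),
  .id = "model"
)

# Tidy output can be filtered like any other tibble
filter(all_terms, term != "(Intercept)", p.value < 0.05)
```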
Information about the model can be explored with ‘augment’. The function augments the original data with information from the model, such as the fitted values and residuals for each of the original points in the regression.
R
augment(reg_LivEnv_health)
OUTPUT
# A tibble: 4,835 × 8
health_london_rank livingEnv_london_rank .fitted .resid .hat .sigma
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 32113 7789 19588. 12525. 0.000250 7624.
2 29705 13070 21400. 8305. 0.000252 7625.
3 17600 4092 18320. -720. 0.000458 7626.
4 17907 9397 20140. -2233. 0.000213 7626.
5 21581 10629 20562. 1019. 0.000207 7626.
6 16414 11162 20745. -4331. 0.000210 7626.
7 12334 8672 19891. -7557. 0.000226 7625.
8 9661 9611 20213. -10552. 0.000211 7625.
9 16050 2269 17694. -1644. 0.000624 7626.
10 18178 4309 18394. -216. 0.000441 7626.
# ℹ 4,825 more rows
# ℹ 2 more variables: .cooksd <dbl>, .std.resid <dbl>
Some of the data presented by ‘augment’ will be discussed in the episode Linear Regression Diagnostics.
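Two common uses of the augmented data are finding the observations the model fits worst and obtaining predictions for new values. The sketch below shows both on simulated data; the names are hypothetical, and `newdata` is the argument of `augment()` for supplying new predictor values.

```r
library(broom)
library(dplyr)

# Simulated stand-in data; not the IMD dataset
set.seed(9)
sim <- data.frame(x = rnorm(100))
sim$y <- 1 + 0.5 * sim$x + rnorm(100)
fit <- lm(y ~ x, data = sim)

# The five observations with the largest absolute residuals
worst <- augment(fit) %>%
  arrange(desc(abs(.resid))) %>%
  head(5)
worst

# newdata returns fitted values for new (hypothetical) x values
new_preds <- augment(fit, newdata = data.frame(x = c(-1, 0, 1)))
new_preds
```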
Summary statistics that are computed for the entire regression, such as R^2 and the F-statistic, can be accessed with the ‘glance’ function:
R
glance(reg_LivEnv_health)
OUTPUT
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0622 0.0621 7625. 321. 1.63e-69 1 -50081. 100167. 100187.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
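Since `glance()` always returns a single row, it is convenient for comparing models side by side. The sketch below is an illustration on simulated data with hypothetical model names; it stacks one row per model and keeps a few of the columns shown above.

```r
library(broom)
library(dplyr)

# Simulated stand-in data; not the IMD dataset
set.seed(5)
sim <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
sim$y <- 1 + 0.5 * sim$x1 + 0.3 * sim$x2 + rnorm(200)

models <- list(
  x1_only = lm(y ~ x1, data = sim),
  x1_and_x2 = lm(y ~ x1 + x2, data = sim)
)

# One glance() row per model, stacked and labelled by the list names
comparison <- bind_rows(lapply(models, glance), .id = "model")
select(comparison, model, r.squared, adj.r.squared, AIC, BIC)
```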
Generalised linear models
We can also use the ‘broom’ functions to present results from generalised linear and non-linear models. For example, we might want to explore the Income Rank in relation to whether or not an area is within the City of London.
R
# add a variable to indicate whether or not an area is within the City of London
lon_dims_imd_2019 <- lon_dims_imd_2019 %>% mutate(city = la19nm == "City of London")
# create a Generalised Linear Model
glmlondims <- glm(city ~ Income_london_rank, data = lon_dims_imd_2019, family = "binomial")
summary(glmlondims)
OUTPUT
Call:
glm(formula = city ~ Income_london_rank, family = "binomial",
data = lon_dims_imd_2019)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.868e+00 1.004e+00 -7.835 4.68e-15 ***
Income_london_rank 6.888e-05 4.635e-05 1.486 0.137
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 92.295 on 4834 degrees of freedom
Residual deviance: 90.062 on 4833 degrees of freedom
AIC: 94.062
Number of Fisher Scoring iterations: 10
Use of ‘tidy’:
R
tidy(glmlondims)
OUTPUT
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -7.87 1.00 -7.84 4.68e-15
2 Income_london_rank 0.0000689 0.0000463 1.49 1.37e- 1
Use of ‘augment’:
R
augment(glmlondims)
OUTPUT
# A tibble: 4,835 × 8
city Income_london_rank .fitted .resid .hat .sigma .cooksd .std.resid
<lgl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TRUE 32831 -5.61 3.35 0.00194 0.128 2.65e-1 3.35
2 TRUE 29901 -5.81 3.41 0.00115 0.127 1.93e-1 3.41
3 TRUE 18510 -6.59 3.63 0.000233 0.126 8.51e-2 3.63
4 TRUE 6029 -7.45 3.86 0.000332 0.125 2.87e-1 3.86
5 FALSE 14023 -6.90 -0.0448 0.000239 0.137 1.20e-7 -0.0448
6 FALSE 6261 -7.44 -0.0343 0.000330 0.137 9.72e-8 -0.0343
7 FALSE 3382 -7.64 -0.0311 0.000360 0.137 8.70e-8 -0.0311
8 FALSE 7506 -7.35 -0.0358 0.000315 0.137 1.01e-7 -0.0358
9 FALSE 8902 -7.25 -0.0376 0.000298 0.137 1.05e-7 -0.0376
10 FALSE 9033 -7.25 -0.0378 0.000296 0.137 1.06e-7 -0.0378
# ℹ 4,825 more rows
Use of ‘glance’:
R
glance(glmlondims)
OUTPUT
# A tibble: 1 × 8
null.deviance df.null logLik AIC BIC deviance df.residual nobs
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 92.3 4834 -45.0 94.1 107. 90.1 4833 4835
You will notice that the statistics computed by ‘glance’ are different for glm objects than for lm (e.g. deviance rather than R^2).
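One way to see exactly which statistics differ is to compare the column names that ‘glance’ returns for each model type. This is a small sketch on the built-in mtcars data (an assumption, so that it runs without the lesson data).

```r
library(broom)

# Assumption: mtcars stands in for the lesson data
lm_fit  <- lm(mpg ~ wt, data = mtcars)
glm_fit <- glm(am ~ wt, data = mtcars, family = "binomial")

lm_cols  <- names(glance(lm_fit))
glm_cols <- names(glance(glm_fit))

setdiff(lm_cols, glm_cols)    # statistics reported for lm only
setdiff(glm_cols, lm_cols)    # statistics reported for glm only
intersect(lm_cols, glm_cols)  # statistics reported for both
```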
Hypothesis testing
The tidy function can also be applied to a range of hypothesis tests, including built-in functions such as t.test, cor.test, and wilcox.test.
t-test
R
tt <- t.test(Income_london_rank ~ city, lon_dims_imd_2019)
tidy(tt)
OUTPUT
# A tibble: 1 × 10
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -5344. 14456. 19800. -1.24 0.270 5.01 -16407. 5719.
# ℹ 2 more variables: method <chr>, alternative <chr>
Some cases might have fewer columns (for example, no confidence interval).
Wilcoxon rank sum test:
R
wt <- wilcox.test(Income_london_rank ~ city, lon_dims_imd_2019)
tidy(wt)
OUTPUT
# A tibble: 1 × 4
statistic p.value method alternative
<dbl> <dbl> <chr> <chr>
1 9836. 0.174 Wilcoxon rank sum test with continuity correcti… two.sided
Since the ‘tidy’ output is already only one row, glance returns the same output:
R
# t-test
glance(tt)
OUTPUT
# A tibble: 1 × 10
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -5344. 14456. 19800. -1.24 0.270 5.01 -16407. 5719.
# ℹ 2 more variables: method <chr>, alternative <chr>
R
# Wilcoxon rank sum test
glance(wt)
OUTPUT
# A tibble: 1 × 4
statistic p.value method alternative
<dbl> <dbl> <chr> <chr>
1 9836. 0.174 Wilcoxon rank sum test with continuity correcti… two.sided
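The cor.test function mentioned at the start of this section can be tidied in the same way. The sketch below uses two numeric columns from the built-in mtcars data as an assumption; with the lesson data, any two numeric columns could be used instead.

```r
library(broom)

# Assumption: mtcars stands in for the lesson data
ct <- cor.test(~ mpg + wt, data = mtcars)

# One row: the correlation estimate, test statistic, p-value,
# and a confidence interval
cor_tidy <- tidy(ct)
cor_tidy
```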
The chisq.test function enables us to investigate whether two categorical variables are associated, that is, whether the distribution of one variable depends on the level of the other.
The ‘augment’ method is defined only for chi-squared tests, where it returns one row per cell of the contingency table. For the other tests there is no meaningful sense in which a hypothesis test produces output about each initial data point.
R
# convert IDAOP_london_decile to a factor so it is not interpreted as continuous data
lon_dims_imd_2019$IDAOP_london_decile <- factor(lon_dims_imd_2019$IDAOP_london_decile)
# xtabs creates a frequency table of IDAOP deciles within London boroughs
chit <- chisq.test(xtabs(~ la19nm + IDAOP_london_decile, data = lon_dims_imd_2019))
WARNING
Warning in chisq.test(xtabs(~la19nm + IDAOP_london_decile, data =
lon_dims_imd_2019)): Chi-squared approximation may be incorrect
R
tidy(chit)
OUTPUT
# A tibble: 1 × 4
statistic p.value parameter method
<dbl> <dbl> <int> <chr>
1 2841. 0 288 Pearson's Chi-squared test
R
augment(chit)
OUTPUT
# A tibble: 330 × 9
la19nm IDAOP_london_decile .observed .prop .row.prop .col.prop .expected
<fct> <fct> <int> <dbl> <dbl> <dbl> <dbl>
1 Barking … 1 6 1.24e-3 0.0545 0.0124 11.0
2 Barnet 1 6 1.24e-3 0.0284 0.0124 21.1
3 Bexley 1 0 0 0 0 14.6
4 Brent 1 12 2.48e-3 0.0694 0.0248 17.3
5 Bromley 1 2 4.14e-4 0.0102 0.00414 19.7
6 Camden 1 12 2.48e-3 0.0902 0.0248 13.3
7 City of … 1 0 0 0 0 0.599
8 Croydon 1 8 1.65e-3 0.0364 0.0166 22.0
9 Ealing 1 11 2.28e-3 0.0561 0.0228 19.6
10 Enfield 1 9 1.86e-3 0.0492 0.0186 18.3
# ℹ 320 more rows
# ℹ 2 more variables: .resid <dbl>, .std.resid <dbl>
A Chi-Square test rests on a number of underlying assumptions:
- Independence: The Chi-Square test assumes that the observations in the data are independent of each other. This means that the outcome of one observation should not influence the outcome of another.
- Random sampling: The data should be obtained through random sampling to ensure that it is representative of the population from which it was drawn.
- Expected frequency: The Chi-Square test assumes that the expected frequency count for each cell in the contingency table is at least 5. If this assumption is not met, the test results may not be reliable.
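R raises this warning when some of the expected counts are small, so the assumption most likely to have been violated is the expected frequency one. The ‘augment’ output above confirms this: the expected count for the City of London in decile 1 is only 0.599, far below 5. The Chi-Square test is therefore not a suitable test for this data.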
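The expected counts can be inspected directly, because chisq.test stores them in the `expected` element of its result (for the lesson data that would be `chit$expected`). The sketch below uses a small made-up table, which is an assumption and not the lesson data, and also shows one possible alternative: a Monte Carlo p-value that does not rely on the large-sample chi-squared approximation.

```r
# Assumption: a small made-up contingency table, not the lesson data
tab <- matrix(c(12, 1, 3,
                 9, 2, 8,
                15, 0, 6),
              nrow = 3, byrow = TRUE,
              dimnames = list(area = c("A", "B", "C"),
                              decile = c("1", "2", "3")))

# suppressWarnings() only keeps the example quiet; the same
# "approximation may be incorrect" warning is raised here
ct <- suppressWarnings(chisq.test(tab))

# Expected counts under independence, and how many are below 5
ct$expected
n_small <- sum(ct$expected < 5)
n_small

# One possible alternative: simulate the p-value instead of relying
# on the chi-squared approximation
set.seed(1)
sim <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
sim$p.value
```

Whether a simulated p-value is appropriate still depends on the independence and sampling assumptions listed above.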
Challenge 3
Use broom to amend the display of your model outputs.
Which function(s) did you use and why?
Conventions
There are some conventions that enable consistency across the broom functions:
- The output of the tidy, augment and glance functions is always a tibble.
- The output never has rownames. This ensures that you can combine it with other tidy outputs without fear of losing information (since rownames in R cannot contain duplicates).
- Some column names are kept consistent, so that they can be combined across different models and so that you know what to expect (in contrast to asking “is it pval or PValue?” every time).
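These conventions are what make it possible to stack the output of different models into a single table. As a minimal sketch, again assuming the built-in mtcars data in place of the lesson data, a linear and a logistic model can be tidied and combined like this:

```r
library(broom)
library(dplyr)

# Assumption: mtcars stands in for the lesson data
models <- list(
  linear   = lm(mpg ~ wt, data = mtcars),
  logistic = glm(am ~ wt, data = mtcars, family = "binomial")
)

# Tidy each model, then stack the results;
# .id records which model each row came from
combined <- bind_rows(lapply(models, tidy), .id = "model")
combined
```

Because both tidy outputs share the same column names and carry no rownames, no renaming is needed before binding them.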