Logistic Regression

Last updated on 2024-12-04 | Edit this page

Estimated time: 120 minutes

Overview

Questions

  • How can I identify factors for antibiotic resistance?
  • How can I check the validity of model?

Objectives

  • To be able to construct regression models for a binary outcome
  • To be able to calculate predicted variables and residuals
  • To be able to present model outcomes using Broom

Content


  • Exploratory Data Analysis
  • Model Creation and Estimation
  • Reporting the Logistic Regression Results

Data


R

# We will need these libraries and this data later.
library(aod)
library(broom)
library(ggplot2)
library(lubridate)
library(odds.n.ends)
library(tidyverse)

amrData <- read.csv("data/dig_health_hub_amr.csv")

The data used in this episode was provided by Simon Thelwall from the UKHSA. It has been created to represent the sort of data that might be obtained from the Second Generation Surveillance System (SGSS). The data has 100,000 rows of 12 variables.

Exploratory Data Analysis


We can preview our data by using ‘head’:

R

head(amrData)

OUTPUT

  id        dob  spec_date sex_male     region had_surgery_past_yr
1  1 1981-04-04 2014-05-25        1     London                   0
2  2 1969-05-22 2014-05-25        0       East                   0
3  3 1935-11-20 2014-05-25        0     London                   1
4  4 1967-05-19 2014-05-25        1     London                   0
5  5 1961-06-30 2014-05-25        0 North West                   0
6  6 1987-01-11 2014-05-25        1 North East                   0
                                                ethnicity imd organism coamox
1                                      Other ethnic group   4  E. coli      1
2                                      Other ethnic group   3  E. coli      0
3 Black, Black British, Black Welsh, Caribbean or African   2  E. coli      1
4                         Mixed or Multiple ethnic groups   5  E. coli      0
5                         Mixed or Multiple ethnic groups   4  E. coli      0
6                         Mixed or Multiple ethnic groups   4  E. coli      0
  cipro gentam
1     0      0
2     0      0
3     0      0
4     0      0
5     0      0
6     0      0

We can also request information about variable names and data types:

R

sapply(amrData, class)

OUTPUT

                 id                 dob           spec_date            sex_male
          "integer"         "character"         "character"           "integer"
             region had_surgery_past_yr           ethnicity                 imd
        "character"           "integer"         "character"           "integer"
           organism              coamox               cipro              gentam
        "character"           "integer"           "integer"           "integer" 

We can see that dates are currently stored as the char data type. We also do not know the age of the participant when the sample was taken.

R

# Calculate age (in years) as of their last birthday and add as an additional variable to our data.
# The %--% and %/% are synax specific to lubridate. 
# In the first part we are asking it to find the difference between the two dates.
# We are then rounding down to the nearest year.

amrData <- amrData %>%
  mutate(
    age_years_sd = (dob %--% spec_date) %/% years(1)
  )

We can also convert ‘spec_date’, the date the specimen was taken from text to a date:

R

# Convert char to date and store as additional variable
amrData <- amrData %>%
  mutate(
    spec_date_YMD = as.Date(amrData$spec_date)
  )

We can use a histogram to explore the age distribution of the participants:

R

# histogram of age
ageHisto <- amrData %>%
  ggplot(aes(x = age_years_sd)) +
  geom_histogram(bins = 10, color = "white") +
  theme_minimal(base_size = 14, base_family = "sans") +
  labs(x = "Age when specimen taken (years)", y = "Frequency")
ageHisto

We can also look at where the specimens were processed:

R

xtabs(~region, data = amrData)

OUTPUT

region
                    East            East Midlands                   London
                   11084                    11160                    11253
              North East               North West               South East
                   11082                    10919                    11101
              South West            West Midlands Yorkshire and The Humber
                   11278                    11032                    11091 

and for which organism:

R

xtabs(~organism, data = amrData)

OUTPUT

organism
E. coli
 100000 

In addition, we can use cross-tabulation to identify if the specimen indicated resistance to one or more of Coamoxiclav, Gentamicin and Ciprofloxacin for the participants:

R

xtabs(~ coamox + cipro + gentam, data = amrData)

OUTPUT

, , gentam = 0

      cipro
coamox     0     1
     0 61844  5814
     1 24319  3109

, , gentam = 1

      cipro
coamox     0     1
     0  2815   362
     1  1510   227

We can see from our table that only 227 participants indicated reistance to Coamoxiclav, Gentamicin and Ciprofloxacin. Coamoxiclav appears to have the highest individual indication of resistance. We will explore indicators to Coamoxiclav first.

Model Creation and Estimation


As the dependent variable we want to explore is binary (0,1), we will use a binomial generalised linear model.

R

coamox_logit <- glm(coamox ~ age_years_sd + sex_male, data = amrData, family = "binomial")

summary(coamox_logit)

OUTPUT


Call:
glm(formula = coamox ~ age_years_sd + sex_male, family = "binomial",
    data = amrData)

Coefficients:
               Estimate Std. Error  z value Pr(>|z|)
(Intercept)  -3.2215791  0.0284626 -113.186  < 2e-16 ***
age_years_sd  0.0404280  0.0004449   90.867  < 2e-16 ***
sex_male      0.0540494  0.0146011    3.702 0.000214 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 120724  on 99999  degrees of freedom
Residual deviance: 111460  on 99997  degrees of freedom
AIC: 111466

Number of Fisher Scoring iterations: 4

In this initial model focusing on demographic indicators, age at time of sample being taken and whether or not the participant is male both seem to be statistically significant.

age_years_sd: For every unit increase in age_years_sd the log-odds of Coamoxiclav resistance increase by 0.0404280.

sex_male: The difference in the log-odds of Coamoxiclav resistance between males and non-males is 0.0540494.

Older participants and male participants have higher log-odds of Coamoxicalv resistance.

We can check to see that our indicators sex_male and age_years_sd are independent:

R

# check VIF for no perfect multicollinearity assumption
car::vif(coamox_logit)

OUTPUT

age_years_sd     sex_male
    1.000015     1.000015 

We can also check the linearity of the variable age_years_sd

R

# make a variable of the logit of the predicted values
logit.use <- log(coamox_logit$fitted.values / (1 - coamox_logit$fitted.values))

# make a small data frame with the logit variable and the age predictor
linearity.data <- data.frame(logit.use, age = coamox_logit$model$age_years_sd)

# create a plot with linear and actual relationships shown
linearPlot <- linearity.data %>%
  ggplot(aes(x = age, y = logit.use)) +
  geom_point(aes(size = "Observation"), color = "blue", alpha = .6) +
  geom_smooth(se = FALSE, aes(color = "Loess curve")) +
  geom_smooth(method = lm, se = FALSE, aes(color = "linear")) +
  theme_minimal(base_size = 14, base_family = "serif") +
  labs(x = "Age in years on sample date", y = "Log-odds of coamox resistance predicted probability") +
  scale_color_manual(name = "Type of fit line", values = c("red", "black")) +
  scale_size_manual(values = 1.5, name = "")

linearPlot

OUTPUT

`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
`geom_smooth()` using formula = 'y ~ x'

Challenge 1

Update the model coamox_logit to include had_surgery_past_yr as an independent variable. What is the log-odds reported and is it statistically significant?

You may choose to create a new glm:

R

coamox_surg_logit <- glm(coamox ~ age_years_sd + sex_male + had_surgery_past_yr, data = amrData, family = "binomial")

summary(coamox_surg_logit)

OUTPUT


Call:
glm(formula = coamox ~ age_years_sd + sex_male + had_surgery_past_yr,
    family = "binomial", data = amrData)

Coefficients:
                      Estimate Std. Error  z value Pr(>|z|)
(Intercept)         -3.2407115  0.0286059 -113.288  < 2e-16 ***
age_years_sd         0.0404424  0.0004451   90.870  < 2e-16 ***
sex_male             0.0538298  0.0146056    3.686 0.000228 ***
had_surgery_past_yr  0.1802123  0.0238522    7.555 4.18e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 120724  on 99999  degrees of freedom
Residual deviance: 111404  on 99996  degrees of freedom
AIC: 111412

Number of Fisher Scoring iterations: 4

had_surgery_past_yr: The difference in the log-odds of Coamoxiclav resistance between those who have had surgery in the past year and those who have not is 0.1802123. It is statistically significant.

You may also want to check for culticollinearity:

R

car::vif(coamox_surg_logit)

OUTPUT

       age_years_sd            sex_male had_surgery_past_yr
           1.000156            1.000018            1.000144 

As the value of GVIF is lower than 4, it suggests that the assumption of independnce between the variables is held.

Incorporating a multi-level factor

So far we have considered binary and continuos variable indicators in our model(s). Our data set also contains some categorical variables: region, ethinicity and imd.

We are going to incorporate region into our model:

R

# To include reporting for all regions I will include the 0 flag
coamox_region_logit <- glm(coamox ~ 0 + age_years_sd + sex_male + region, data = amrData, family = "binomial")
summary(coamox_region_logit)

OUTPUT


Call:
glm(formula = coamox ~ 0 + age_years_sd + sex_male + region,
    family = "binomial", data = amrData)

Coefficients:
                                Estimate Std. Error z value Pr(>|z|)
age_years_sd                    0.040432   0.000445  90.867  < 2e-16 ***
sex_male                        0.054155   0.014602   3.709 0.000208 ***
regionEast                     -3.246673   0.035328 -91.900  < 2e-16 ***
regionEast Midlands            -3.213569   0.035077 -91.616  < 2e-16 ***
regionLondon                   -3.204244   0.034972 -91.625  < 2e-16 ***
regionNorth East               -3.219060   0.035140 -91.606  < 2e-16 ***
regionNorth West               -3.217271   0.035244 -91.285  < 2e-16 ***
regionSouth East               -3.183229   0.035058 -90.799  < 2e-16 ***
regionSouth West               -3.222338   0.035057 -91.917  < 2e-16 ***
regionWest Midlands            -3.236012   0.035275 -91.736  < 2e-16 ***
regionYorkshire and The Humber -3.255309   0.035302 -92.213  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138629  on 100000  degrees of freedom
Residual deviance: 111452  on  99989  degrees of freedom
AIC: 111474

Number of Fisher Scoring iterations: 4

It initially seems that all regions are significant and have a negative log-odds of Coamoxiclav resistance.

We can use the Wald test to explore the overall effect of region. We will use the wald.test function of the aod library. The order in which the coefficients are given in the table of coefficients is the same as the order of the terms in the model. This is important because the wald.test function refers to the coefficients by their order in the model. We use the wald.test function. b supplies the coefficients, while Sigma supplies the variance covariance matrix of the error terms, finally Terms tells R which terms in the model are to be tested, in this case, terms 3 to 11, are the three terms for the different regions.

R

# model with just the regions added
wald.test(b = coef(coamox_region_logit), Sigma = vcov(coamox_region_logit), Terms = 3:11)

OUTPUT

Wald test:
----------

Chi-squared test:
X2 = 12815.9, df = 9, P(> X2) = 0.0

The chi-squared test statistic of 12815.9, with nine degrees of freedom is associated with a p-value of 0.00 indicating that the overall effect of region is statistically significant.

Challenge 2

Update the model you created in Challenge 1 to include either ethnicity or imd as an independent variable. What is the log-odds reported and is it statistically significant?

If you chose to incorporate imd, this would first need to be converted to a factor

R

amrData$imd <- factor(amrData$imd)

coamox_surg_imd_logit <- glm(
  coamox ~ age_years_sd + sex_male + had_surgery_past_yr + imd,
  data = amrData,
  family = "binomial"
)

summary(coamox_surg_imd_logit)

OUTPUT


Call:
glm(formula = coamox ~ age_years_sd + sex_male + had_surgery_past_yr +
    imd, family = "binomial", data = amrData)

Coefficients:
                      Estimate Std. Error  z value Pr(>|z|)
(Intercept)         -3.2405156  0.0320951 -100.966  < 2e-16 ***
age_years_sd         0.0404419  0.0004451   90.866  < 2e-16 ***
sex_male             0.0538963  0.0146067    3.690 0.000224 ***
had_surgery_past_yr  0.1801131  0.0238532    7.551 4.32e-14 ***
imd2                -0.0135384  0.0231010   -0.586 0.557840
imd3                -0.0044812  0.0230452   -0.194 0.845821
imd4                -0.0046000  0.0230556   -0.200 0.841857
imd5                 0.0216405  0.0230414    0.939 0.347627
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 120724  on 99999  degrees of freedom
Residual deviance: 111401  on 99992  degrees of freedom
AIC: 111417

Number of Fisher Scoring iterations: 4

As we did not include the 0 flag IMD quintile 1 has been used a reference for the other Quintiles. Quintiles 2 to 5 are not significantly significant. We may want to test that the coefficient for imd=1 is equal to the coefficient for imd=2. The first line of code below creates a vector l that defines the test we want to perform. In this case, we want to test the difference (subtraction) of the terms for imd=1 and imd=2 (i.e., the 4th and 5th terms in the model). To contrast these two terms, we multiply one of them by 1, and the other by -1. The other terms in the model are not involved in the test, so they are multiplied by 0. The second line of code below uses L=l to tell R that we wish to base the test on the vector l (rather than using the Terms option as we did above).

R

l <- cbind(0, 0, 0, 1, -1, 0, 0, 0)
wald.test(b = coef(coamox_surg_imd_logit), Sigma = vcov(coamox_surg_imd_logit), L = l)

OUTPUT

Wald test:
----------

Chi-squared test:
X2 = 33.9, df = 1, P(> X2) = 5.7e-09

The chi-squared test statistic of 33.9, with 1 degree of freedom is associated with a p-value of 5.7e-09 indicating that the overall effect of region is statistically significant.

Reporting the Logistic Regression Results


For our model coamox_region_logit there are various things that we can report, and different functions and packages that can be used.

We can report the CIs using profiled log-likelihood or using standard errors for our model.

R

# CIs using log-likelihood
confint(coamox_region_logit)

OUTPUT

Waiting for profiling to be done...

OUTPUT

                                     2.5 %      97.5 %
age_years_sd                    0.03956087  0.04130508
sex_male                        0.02553710  0.08277680
regionEast                     -3.31607182 -3.17758553
regionEast Midlands            -3.28247264 -3.14497346
regionLondon                   -3.27293944 -3.13585237
regionNorth East               -3.28808726 -3.15033821
regionNorth West               -3.28650481 -3.14834796
regionSouth East               -3.25209244 -3.11466610
regionSouth West               -3.29120225 -3.15377971
regionWest Midlands            -3.30530677 -3.16702900
regionYorkshire and The Humber -3.32465587 -3.18627353

R

## CIs using standard errors
confint.default(coamox_region_logit)

OUTPUT

                                     2.5 %      97.5 %
age_years_sd                    0.03955954  0.04130374
sex_male                        0.02553490  0.08277421
regionEast                     -3.31591576 -3.17743094
regionEast Midlands            -3.28231824 -3.14482054
regionLondon                   -3.27278672 -3.13570109
regionNorth East               -3.28793328 -3.15018570
regionNorth West               -3.28634826 -3.14819290
regionSouth East               -3.25194196 -3.11451707
regionSouth West               -3.29104882 -3.15362773
regionWest Midlands            -3.30515053 -3.16687424
regionYorkshire and The Humber -3.32449918 -3.18611831

Alternatively we may be interested in the odds-ratio:

R

## odds ratios only
exp(coef(coamox_region_logit))

OUTPUT

                  age_years_sd                       sex_male
                    1.04126013                     1.05564775
                    regionEast            regionEast Midlands
                    0.03890341                     0.04021282
                  regionLondon               regionNorth East
                    0.04058958                     0.03999265
              regionNorth West               regionSouth East
                    0.04006426                     0.04145157
              regionSouth West            regionWest Midlands
                    0.03986174                     0.03932038
regionYorkshire and The Humber
                    0.03856891 

R

# including confidence intervals
exp(cbind(OR = coef(coamox_region_logit), confint(coamox_region_logit)))

OUTPUT

Waiting for profiling to be done...

OUTPUT

                                       OR      2.5 %     97.5 %
age_years_sd                   1.04126013 1.04035382 1.04217000
sex_male                       1.05564775 1.02586597 1.08629932
regionEast                     0.03890341 0.03629513 0.04168618
regionEast Midlands            0.04021282 0.03753533 0.04306807
regionLondon                   0.04058958 0.03789487 0.04346269
regionNorth East               0.03999265 0.03732517 0.04283764
regionNorth West               0.04006426 0.03738429 0.04292298
regionSouth East               0.04145157 0.03869316 0.04439333
regionSouth West               0.03986174 0.03720909 0.04269046
regionWest Midlands            0.03932038 0.03668796 0.04212858
regionYorkshire and The Humber 0.03856891 0.03598490 0.04132558

These can also be presented using broom:

R

# as Log-Odds
tidy(coamox_region_logit)

OUTPUT

# A tibble: 11 × 5
   term                           estimate std.error statistic  p.value
   <chr>                             <dbl>     <dbl>     <dbl>    <dbl>
 1 age_years_sd                     0.0404  0.000445     90.9  0
 2 sex_male                         0.0542  0.0146        3.71 0.000208
 3 regionEast                      -3.25    0.0353      -91.9  0
 4 regionEast Midlands             -3.21    0.0351      -91.6  0
 5 regionLondon                    -3.20    0.0350      -91.6  0
 6 regionNorth East                -3.22    0.0351      -91.6  0
 7 regionNorth West                -3.22    0.0352      -91.3  0
 8 regionSouth East                -3.18    0.0351      -90.8  0
 9 regionSouth West                -3.22    0.0351      -91.9  0
10 regionWest Midlands             -3.24    0.0353      -91.7  0
11 regionYorkshire and The Humber  -3.26    0.0353      -92.2  0       

R

# as ORs
tidy(coamox_region_logit, exp = TRUE)

OUTPUT

# A tibble: 11 × 5
   term                           estimate std.error statistic  p.value
   <chr>                             <dbl>     <dbl>     <dbl>    <dbl>
 1 age_years_sd                     1.04    0.000445     90.9  0
 2 sex_male                         1.06    0.0146        3.71 0.000208
 3 regionEast                       0.0389  0.0353      -91.9  0
 4 regionEast Midlands              0.0402  0.0351      -91.6  0
 5 regionLondon                     0.0406  0.0350      -91.6  0
 6 regionNorth East                 0.0400  0.0351      -91.6  0
 7 regionNorth West                 0.0401  0.0352      -91.3  0
 8 regionSouth East                 0.0415  0.0351      -90.8  0
 9 regionSouth West                 0.0399  0.0351      -91.9  0
10 regionWest Midlands              0.0393  0.0353      -91.7  0
11 regionYorkshire and The Humber   0.0386  0.0353      -92.2  0       

The odd.n.ends package provides a wide range of reporting tools

R

# Odds ratios and confidence intervals
coamox_region_logitOR <- odds.n.ends::odds.n.ends(coamox_region_logit)

OUTPUT

Waiting for profiling to be done...

R

coamox_region_logitCI <- coamox_region_logitOR$`Predictor odds ratios and 95% CI`
coamox_region_logitCI

OUTPUT

                                       OR      2.5 %     97.5 %
age_years_sd                   1.04126013 1.04035382 1.04217000
sex_male                       1.05564775 1.02586597 1.08629932
regionEast                     0.03890341 0.03629513 0.04168618
regionEast Midlands            0.04021282 0.03753533 0.04306807
regionLondon                   0.04058958 0.03789487 0.04346269
regionNorth East               0.03999265 0.03732517 0.04283764
regionNorth West               0.04006426 0.03738429 0.04292298
regionSouth East               0.04145157 0.03869316 0.04439333
regionSouth West               0.03986174 0.03720909 0.04269046
regionWest Midlands            0.03932038 0.03668796 0.04212858
regionYorkshire and The Humber 0.03856891 0.03598490 0.04132558

R

# model fit
modfit <- coamox_region_logitOR$`Count R-squared (model fit): percent correctly predicted`
modfit

OUTPUT

[1] 70.938

R

# Other model statistics
odds.n.ends::odds.n.ends(coamox_region_logit)

OUTPUT

Waiting for profiling to be done...

OUTPUT

$`Logistic regression model significance`
Chi-squared        d.f.           p
  27177.424          11       <.001

$`Contingency tables (model fit): frequency predicted`
                Number observed
Number predicted      1      0    Sum
             1     4613   4510   9123
             0    24552  66325  90877
             Sum  29165  70835 100000

$`Count R-squared (model fit): percent correctly predicted`
[1] 70.938

$`Model sensitivity`
[1] 0.158169

$`Model specificity`
[1] 0.9363309

$`Predictor odds ratios and 95% CI`
                                       OR      2.5 %     97.5 %
age_years_sd                   1.04126013 1.04035382 1.04217000
sex_male                       1.05564775 1.02586597 1.08629932
regionEast                     0.03890341 0.03629513 0.04168618
regionEast Midlands            0.04021282 0.03753533 0.04306807
regionLondon                   0.04058958 0.03789487 0.04346269
regionNorth East               0.03999265 0.03732517 0.04283764
regionNorth West               0.04006426 0.03738429 0.04292298
regionSouth East               0.04145157 0.03869316 0.04439333
regionSouth West               0.03986174 0.03720909 0.04269046
regionWest Midlands            0.03932038 0.03668796 0.04212858
regionYorkshire and The Humber 0.03856891 0.03598490 0.04132558

As you can see we have a range of probabilities, odds ratios and log-odds. We need to be aware of which we are referring to. One way is to look at the range of the estimates. Probabilities always have a range from zero to 1. Logit units generally range from about -4 to +4, with zero meaning an equal probability of no event or the event outcome occurring. Odds ratios can range from very small (but positive) numbers to very large positive numbers.

These odds ratios versions of the estimates are more easily interpretable than logit scores. Odds ratios of less than one means that an increase in that predictor makes the outcome less likely to occur, and an odds ratio of greater than one means that an increase in that predictor makes the outcome more likely to occur.

Discussion

  • What other indicators could we have included in our model?
  • What question(s) would they help us to answer?

Challenge 3

We have explored potential indicators of Coamoxiclav resistance.

Explore potential indicators of either Gentamicin or Ciprofloxacin resistance.

Present your analysis steps and findings in a Quarto report.