Broom
Last updated on 2025-03-12 | Edit this page
Overview
Questions
- How can I present model outputs in an easier to read way?
Objectives
- To be able to present model outcomes using Broom
Data
R
# We will need these libraries and this data later.
library(ggplot2)
library(tidyverse)
library(lmtest)
library(sandwich)
library(broom)
hgtwgt_survey <- read.csv("data/hgt_wgt.csv")
amrData <- read.csv("data/dig_health_hub_amr_v2.csv")
Broom
The ‘broom’ package offers an alternative way of presenting the output of statistical analysis. It centers around three S3 methods, each of which take common objects produced by R statistical functions (lm, t.test, nls, etc) and convert them into a tibble.
These are:
- tidy: constructs a tibble that summarizes the model’s statistical findings. This includes coefficients and p-values for each term in a regression, per-cluster information in clustering applications, or per-test information for multtest functions.
- augment: add columns to the original data that was modeled. This includes predictions, residuals, and cluster assignments.
- glance: construct a concise one-row summary of the model. This typically contains values such as R^2, adjusted R^2, and residual standard error that are computed once for the entire model.
Let’s revisit the linear model exploring the relationship between height and weight:
R
reg_weight_height <- lm(weight.kg. ~ height.cm., data = hgtwgt_survey)
summary(reg_weight_height)
OUTPUT
Call:
lm(formula = weight.kg. ~ height.cm., data = hgtwgt_survey)
Residuals:
Min 1Q Median 3Q Max
-53.257 -11.782 -0.578 12.686 55.936
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.37137 9.11826 2.892 0.00392 **
height.cm. 0.30844 0.05373 5.740 1.32e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.44 on 843 degrees of freedom
(155 observations deleted due to missingness)
Multiple R-squared: 0.03762, Adjusted R-squared: 0.03648
F-statistic: 32.95 on 1 and 843 DF, p-value: 1.318e-08
There is a lot of useful information, but it not available in a way so that you can combine it with other models or do further analysis. We can convert this to tabular data using the ‘tidy’ function.
R
tidy(reg_weight_height)
OUTPUT
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 26.4 9.12 2.89 0.00392
2 height.cm. 0.308 0.0537 5.74 0.0000000132
The row names have been moved into a column called term, and the column names are simple and consistent (and can be accessed using $).
Information about the model can be explored with ‘augment’. The function augments the original data with information from the model, such as the fitted values and residuals for each of the original points in the regression.
R
augment(reg_weight_height)
OUTPUT
# A tibble: 845 × 9
.rownames weight.kg. height.cm. .fitted .resid .hat .sigma .cooksd
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 84.4 186. 83.7 0.669 0.00353 18.5 0.00000234
2 2 51.5 161. 75.9 -24.4 0.00181 18.4 0.00159
3 3 85.0 151. 72.8 12.1 0.00414 18.4 0.000904
4 4 72.0 175. 80.4 -8.42 0.00149 18.5 0.000156
5 5 70.9 170. 78.8 -7.94 0.00119 18.5 0.000110
6 7 48.7 156. 74.3 -25.6 0.00279 18.4 0.00271
7 8 82.3 168. 78.1 4.24 0.00121 18.5 0.0000321
8 9 68.0 171. 79.0 -10.9 0.00120 18.5 0.000212
9 11 88.4 156. 74.5 13.9 0.00265 18.4 0.000751
10 13 71.0 168. 78.2 -7.24 0.00120 18.5 0.0000923
# ℹ 835 more rows
# ℹ 1 more variable: .std.resid <dbl>
Some of the data presented by ‘augment’ will be discussed in the supplemental episode Linear Regression Diagnostics.
Summary statistics are computed for the entire regression, such as R^2 and the F-statistic can be accessed with the ‘glance’ function:
R
glance(reg_weight_height)
OUTPUT
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.0376 0.0365 18.4 33.0 0.0000000132 1 -3661. 7328. 7342.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Generalised linear models
We can also use the ‘broom’ functions to present data from Generalised linear and non-linear models. For example, if we wanted to explore height in relation to gender.
R
# create a Generalised Linear Model
glmheightgender <- glm(gender.M. ~ height.cm., hgtwgt_survey, family = "binomial")
summary(glmheightgender)
OUTPUT
Call:
glm(formula = gender.M. ~ height.cm., family = "binomial", data = hgtwgt_survey)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.296726 0.958811 -0.309 0.757
height.cm. 0.002548 0.005646 0.451 0.652
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1271.2 on 919 degrees of freedom
Residual deviance: 1271.0 on 918 degrees of freedom
(80 observations deleted due to missingness)
AIC: 1275
Number of Fisher Scoring iterations: 3
Use of ‘tidy’:
R
tidy(glmheightgender)
OUTPUT
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.297 0.959 -0.309 0.757
2 height.cm. 0.00255 0.00565 0.451 0.652
Use of ‘augment’:
R
augment(glmheightgender)
OUTPUT
# A tibble: 920 × 9
.rownames gender.M. height.cm. .fitted .resid .hat .sigma .cooksd
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 186. 0.177 1.10 0.00324 1.18 0.00136
2 2 0 161. 0.113 -1.23 0.00169 1.18 0.000950
3 3 0 151. 0.0871 -1.21 0.00390 1.18 0.00214
4 4 1 175. 0.150 1.11 0.00136 1.18 0.000588
5 5 1 170. 0.137 1.12 0.00109 1.18 0.000477
6 7 0 156. 0.0996 -1.22 0.00262 1.18 0.00146
7 8 0 168. 0.130 -1.23 0.00111 1.18 0.000636
8 9 0 171. 0.138 -1.24 0.00110 1.18 0.000632
9 11 1 156. 0.101 1.13 0.00248 1.18 0.00113
10 13 0 168. 0.132 -1.23 0.00110 1.18 0.000630
# ℹ 910 more rows
# ℹ 1 more variable: .std.resid <dbl>
Use of ‘glance’:
R
glance(glmheightgender)
OUTPUT
# A tibble: 1 × 8
null.deviance df.null logLik AIC BIC deviance df.residual nobs
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 1271. 919 -636. 1275. 1285. 1271. 918 920
You will notice that the statistics computed by ‘glance’ are different for glm objects than for lm (e.g. deviance rather than R^2).
Hypothesis testing
The tidy function can also be applied a range of hypotheses tests, such as built-in functions like t.test, cor.test, and wilcox.test.
t-test
R
tt <- t.test(height.cm. ~ gender.M., hgtwgt_survey)
tidy(tt)
OUTPUT
# A tibble: 1 × 10
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.349 169. 170. -0.451 0.652 905. -1.87 1.17
# ℹ 2 more variables: method <chr>, alternative <chr>
Some cases might have fewer columns (for example, no confidence interval).
Wilcox test:
R
wt <- wilcox.test(height.cm. ~ gender.M., hgtwgt_survey)
tidy(wt)
OUTPUT
# A tibble: 1 × 4
statistic p.value method alternative
<dbl> <dbl> <chr> <chr>
1 103132 0.586 Wilcoxon rank sum test with continuity correcti… two.sided
Since the ‘tidy’ output is already only one row, glance returns the same output:
R
# t-test
glance(tt)
OUTPUT
# A tibble: 1 × 10
estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.349 169. 170. -0.451 0.652 905. -1.87 1.17
# ℹ 2 more variables: method <chr>, alternative <chr>
R
# Wilcox test
glance(wt)
OUTPUT
# A tibble: 1 × 4
statistic p.value method alternative
<dbl> <dbl> <chr> <chr>
1 103132 0.586 Wilcoxon rank sum test with continuity correcti… two.sided
The chisq.test enables us to investigate whether changes in one categorical variable are related to changes in another categorical variable.
Let’s explore if there are differences in gender of those tested in each truste.
R
#let's organise our data for the test
# Select the columns of interest
selected_data <- amrData %>% select(trst_cd, sex_male)
# Create a contingency table for trust and gender
contingency_table <- table(selected_data$trst_cd, selected_data$sex_male)
# Perform chi-square test
chi_square_test <- chisq.test(contingency_table)
WARNING
Warning in chisq.test(contingency_table): Chi-squared approximation may be
incorrect
R
# View the results
print(chi_square_test)
OUTPUT
Pearson's Chi-squared test
data: contingency_table
X-squared = 264.8, df = 237, p-value = 0.1037
R
# we can use tidy on our results
tidy(chi_square_test)
OUTPUT
# A tibble: 1 × 4
statistic p.value parameter method
<dbl> <dbl> <int> <chr>
1 265. 0.104 237 Pearson's Chi-squared test
The ‘augment’ method is defined only for chi-squared tests, since there is no meaningful sense, for other tests, in which a hypothesis test produces output about each initial data point.
R
augment(chi_square_test)
OUTPUT
# A tibble: 476 × 9
Var1 Var2 .observed .prop .row.prop .col.prop .expected .resid .std.resid
<fct> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 R0A 0 2 0.002 1 0.00401 0.998 1.00 1.42
2 R1A 0 1 0.001 0.143 0.00200 3.49 -1.33 -1.89
3 R1C 0 4 0.004 0.571 0.00802 3.49 0.271 0.385
4 R1D 0 1 0.001 0.5 0.00200 0.998 0.00200 0.00283
5 R1E 0 0 0 0 0 2.00 -1.41 -2.00
6 R1F 0 2 0.002 1 0.00401 0.998 1.00 1.42
7 R1G 0 2 0.002 1 0.00401 0.998 1.00 1.42
8 R1H 0 0 0 0 0 1.50 -1.22 -1.73
9 R1J 0 4 0.004 0.571 0.00802 3.49 0.271 0.385
10 R1K 0 2 0.002 0.667 0.00401 1.50 0.411 0.582
# ℹ 466 more rows
There are a number of underlying assumptions of a Chi-Square test, these are:
Independence: The Chi-Square test assumes that the observations in the data are independent of each other. This means that the outcome of one observation should not influence the outcome of another.
Random sampling: The data should be obtained through random sampling to ensure that it is representative of the population from which it was drawn.
Expected frequency: The Chi-Square test assumes that the expected frequency count for each cell in the contingency table should be at least 5. If this assumption is not met, the test results may not be reliable.
As we have received a warning about the reliability of our test, it is likely that one of these assumptions has not been met, and that this is not a suitable test for this data.
Challenge 3
Use broom to amend the display of your model outputs.
Which function(s) did you use and why?
Conventions
There are some conventions that enable consistency across the broom functions, these are: * The output of the tidy, augment and glance functions is always a tibble.
- The output never has rownames. This ensures that you can combine it with other tidy outputs without fear of losing information (since rownames in R cannot contain duplicates).
- Some column names are kept consistent, so that they can be combined across different models and so that you know what to expect (in contrast to asking “is it pval or PValue?” every time).
Key Points
- Broom can be used to create reusable outputs from various analyses, in the form of tibbles.