Basic Statistics: describing, modelling and reporting
Last updated on 2024-12-04
Estimated time: 80 minutes
Overview
Questions
- How can I detect the type of data I have?
- How can I make meaningful summaries of my data?
Objectives
- To be able to describe the different types of data
- To be able to do basic data exploration of a real dataset
- To be able to calculate descriptive statistics
- To be able to perform statistical inference on a dataset
Content
- Types of Data
- Exploring your dataset
- Descriptive Statistics
- Inferential Statistics
Data
R
# We will need these libraries and this data later.
library(tidyverse)
library(ggplot2) # already attached by tidyverse; the explicit call is harmless
# loading data
lon_dims_imd_2019 <- read.csv("data/English_IMD_2019_Domains_rebased_London_by_CDRC.csv")
# create a binary membership variable for City of London (for later examples)
lon_dims_imd_2019 <- lon_dims_imd_2019 %>% mutate(city = la19nm == "City of London")
We are going to use the data from the Consumer Data Research Centre, specifically the London IMD 2019 (English IMD 2019 Domains rebased) data. Attribution: Data provided by the Consumer Data Research Centre, an ESRC Data Investment: ES/L011840/1, ES/L011891/1
The statistical units used to provide indices of relative deprivation across the country are Lower layer Super Output Areas (LSOAs). The dimensions of deprivation include income, employment, education, health, crime, barriers to housing and services, and the living environment. We have added a variable, city, indicating whether or not an LSOA is within the City of London.
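Before going further, it is worth taking a quick look at the structure of the dataset. A minimal sketch using glimpse() from dplyr (loaded as part of the tidyverse):
R
# quick overview: one line per variable, showing its type and first few values
glimpse(lon_dims_imd_2019)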
The big picture
- Research often seeks to answer a question about a larger population by collecting data on a small sample
- Data collection:
- Many variables
- For each person/unit.
- This procedure, sampling, must be controlled so as to ensure representative data.
Descriptive and inferential statistics
Callout
Just as data in general are of different types (for example, numeric vs text data), statistical data are assigned to different levels of measurement. The level of measurement determines how we can describe and model the data.
Describing data
- Continuous variables
- Discrete variables
Callout
How do we convey information on what our data look like, using numbers or figures?
Describing continuous data.
First establish the distribution of the data. You can visualise this with a histogram.
R
ggplot(lon_dims_imd_2019, aes(x = barriers_london_rank)) +
geom_histogram()
OUTPUT
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What is the distribution of this data?
The raw rank values are difficult to visualise, so we can take the log of the values and plot those. Try this command:
R
ggplot(lon_dims_imd_2019, aes(x = log(barriers_london_rank))) +
geom_histogram()
OUTPUT
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What is the distribution of this data?
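Both histograms print the stat_bin() message because geom_histogram() defaults to 30 bins. You can silence it by choosing a bin width explicitly; a minimal sketch (the value 0.25 is an arbitrary choice for the log scale, not taken from the lesson):
R
# choose a bin width suited to the scale of the variable
ggplot(lon_dims_imd_2019, aes(x = log(barriers_london_rank))) +
geom_histogram(binwidth = 0.25)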
Parametric vs non-parametric analysis
- Parametric analysis assumes that
- The data follows a known distribution
- It can be described using parameters
- Examples of distributions include the normal, Poisson, and exponential distributions.
- Non-parametric data
- The data can't be said to follow a known distribution
Emphasise that parametric does not mean normal: the normal distribution is only one of many parametric distributions.
Describing parametric and non-parametric data
How do you use numbers to convey what your data look like?
- Parametric data
- Use the parameters that describe the distribution.
- For a Gaussian (normal) distribution - use mean and standard deviation
- For a Poisson distribution - use average event rate
- etc.
- Non Parametric data
- Use the median (the middle value when the data are ranked from lowest to highest) and the interquartile range (the value 75% of the way up the ranked list minus the value 25% of the way up)
- You can use the command summary(data_frame_name) to get these numbers for each variable.
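For a single variable, you can also compute these statistics directly. A minimal sketch using base R:
R
# five-number summary plus the mean for one variable
summary(lon_dims_imd_2019$barriers_london_rank)
# or the median and interquartile range individually
median(lon_dims_imd_2019$barriers_london_rank, na.rm = TRUE)
IQR(lon_dims_imd_2019$barriers_london_rank, na.rm = TRUE)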
Mean versus standard deviation
- What does standard deviation mean?
- Both graphs have the same mean (center), but the second one has data which is more spread out.
R
# small standard deviation
dummy_1 <- rnorm(1000, mean = 10, sd = 0.5)
dummy_1 <- as.data.frame(dummy_1)
ggplot(dummy_1, aes(x = dummy_1)) +
geom_histogram()
OUTPUT
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
R
# larger standard deviation
dummy_2 <- rnorm(1000, mean = 10, sd = 200)
dummy_2 <- as.data.frame(dummy_2)
ggplot(dummy_2, aes(x = dummy_2)) +
geom_histogram()
OUTPUT
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Get them to plot the graphs. Explain that we are generating random data from normal distributions with the same mean but different standard deviations, and plotting them.
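You can confirm the difference in spread numerically; a minimal sketch (the values will vary from run to run because the data are random; call set.seed() first if you want reproducible draws):
R
# sample standard deviations of the two simulated datasets
sd(dummy_1$dummy_1)
sd(dummy_2$dummy_2)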
Calculating mean and standard deviation
R
mean(lon_dims_imd_2019$barriers_london_rank, na.rm = TRUE)
OUTPUT
[1] 2418
Calculate the standard deviation and confirm that it is the square root of the variance:
R
sdbarriers <- sd(lon_dims_imd_2019$barriers_london_rank, na.rm = TRUE)
print(sdbarriers)
OUTPUT
[1] 1395.889
R
varbarriers <- var(lon_dims_imd_2019$barriers_london_rank, na.rm = TRUE)
print(varbarriers)
OUTPUT
[1] 1948505
R
sqrt(varbarriers) == sdbarriers
OUTPUT
[1] TRUE
The na.rm argument tells R to ignore missing values in the variable.
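Exact equality held here, but == comparisons of floating-point numbers can fail for purely numerical reasons. A safer sketch uses all.equal(), which allows a small tolerance:
R
# TRUE if the two values are equal up to a small numerical tolerance
isTRUE(all.equal(sqrt(varbarriers), sdbarriers))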
Describing discrete data
- Frequencies
R
table(lon_dims_imd_2019$la19nm)
OUTPUT
Barking and Dagenham Barnet Bexley
110 211 146
Brent Bromley Camden
173 197 133
City of London Croydon Ealing
6 220 196
Enfield Greenwich Hackney
183 151 144
Hammersmith and Fulham Haringey Harrow
113 145 137
Havering Hillingdon Hounslow
150 161 142
Islington Kensington and Chelsea Kingston upon Thames
123 103 98
Lambeth Lewisham Merton
178 169 124
Newham Redbridge Richmond upon Thames
164 161 115
Southwark Sutton Tower Hamlets
166 121 144
Waltham Forest Wandsworth Westminster
144 179 128
- Proportions
R
areastable <- table(lon_dims_imd_2019$la19nm)
prop.table(areastable)
OUTPUT
Barking and Dagenham Barnet Bexley
0.022750776 0.043640124 0.030196484
Brent Bromley Camden
0.035780765 0.040744571 0.027507756
City of London Croydon Ealing
0.001240951 0.045501551 0.040537746
Enfield Greenwich Hackney
0.037849018 0.031230610 0.029782834
Hammersmith and Fulham Haringey Harrow
0.023371251 0.029989659 0.028335057
Havering Hillingdon Hounslow
0.031023785 0.033298862 0.029369183
Islington Kensington and Chelsea Kingston upon Thames
0.025439504 0.021302999 0.020268873
Lambeth Lewisham Merton
0.036814891 0.034953464 0.025646329
Newham Redbridge Richmond upon Thames
0.033919338 0.033298862 0.023784902
Southwark Sutton Tower Hamlets
0.034332989 0.025025853 0.029782834
Waltham Forest Wandsworth Westminster
0.029782834 0.037021717 0.026473630
Contingency tables of frequencies can also be tabulated with table(). For example:
R
table(
lon_dims_imd_2019$la19nm,
lon_dims_imd_2019$IDAOP_london_decile
)
OUTPUT
1 2 3 4 5 6 7 8 9 10
Barking and Dagenham 6 11 23 25 22 12 7 4 0 0
Barnet 6 7 13 15 18 32 29 37 29 25
Bexley 0 3 2 5 11 11 15 24 30 45
Brent 12 19 24 28 43 18 17 11 1 0
Bromley 2 3 6 9 10 12 20 19 41 75
Camden 12 19 14 18 14 10 9 16 11 10
City of London 0 0 1 0 0 0 0 1 0 4
Croydon 8 7 16 25 23 20 29 24 29 39
Ealing 11 18 22 23 24 23 31 21 18 5
Enfield 9 19 27 22 26 17 16 22 18 7
[ reached getOption("max.print") -- omitted 23 rows ]
This leads quite naturally to the question of whether there is any association between the observed frequencies.
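As a pointer only (we do not pursue it in this lesson), such an association could be assessed with a chi-squared test on the contingency table:
R
# chi-squared test of independence between borough and IDAOP decile;
# with many sparse cells, expect a warning that the approximation may be poor
chisq.test(table(
lon_dims_imd_2019$la19nm,
lon_dims_imd_2019$IDAOP_london_decile
))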
Inferential statistics
Meaningful analysis
- What is your hypothesis - what is your null hypothesis?
Callout
The null hypothesis is always that the level of the independent variable has no effect on the level of the dependent variable.
What type of variables (data type) do you have?
What are the assumptions of the test you are using?
Interpreting the result
Testing significance
- p-value < 0.05: conventionally treated as statistically significant
- p-values between 0.03 and 0.049: would benefit from further testing
- 0.05 is not a magic number.
Comparing means
It all starts with a hypothesis
- Null hypothesis
- “There is no difference in mean height between men and women” (mean_height_men - mean_height_women = 0)
- Alternative hypothesis
- “There is a difference in mean height between men and women”
More on hypothesis testing
The null hypothesis (H0) assumes that the true mean difference (μd) is equal to zero.
The two-tailed alternative hypothesis (H1) assumes that μd is not equal to zero.
The upper-tailed alternative hypothesis (H1) assumes that μd is greater than zero.
The lower-tailed alternative hypothesis (H1) assumes that μd is less than zero.
Remember: hypotheses are never about data, they are about the processes which produce the data. The value of μd is unknown. The goal of hypothesis testing is to determine the hypothesis (null or alternative) with which the data are more consistent.
Comparing means
Is there a difference, in absolute terms, between the mean income ranks of the Lower-layer Super Output Areas across London boroughs?
R
lon_dims_imd_2019 %>%
group_by(la19nm) %>%
summarise(avg = mean(Income_london_rank)) %>%
arrange(la19nm, .locale = "en")
OUTPUT
# A tibble: 33 × 2
la19nm avg
<chr> <dbl>
1 Barking and Dagenham 7786.
2 Barnet 17049.
3 Bexley 18592.
4 Brent 11500.
5 Bromley 20826.
6 Camden 14359.
7 City of London 19800.
8 Croydon 14686.
9 Ealing 13718.
10 Enfield 11403.
# ℹ 23 more rows
Is the difference between the income ranks statistically significant?
t-test
Assumptions of a t-test
One independent categorical variable with 2 groups and one dependent continuous variable
The dependent variable is approximately normally distributed in each group
The observations are independent of each other
For Student's original t-statistic, the variances in both groups must be more or less equal. This constraint should probably be abandoned in favour of always using the more conservative Welch test, which is R's default.
Doing a t-test
R
t.test(health_london_rank ~ city, data = lon_dims_imd_2019)$statistic
OUTPUT
t
-0.5183242
R
t.test(health_london_rank ~ city, data = lon_dims_imd_2019)$parameter
OUTPUT
df
5.015827
Notice that the full result of the test contains more data than is output by default.
Write a paragraph in markdown format reporting this test result, including the t-statistic, the degrees of freedom, the confidence interval, and the p-value to 4 decimal places. To do this, include your R code inline with your text, rather than in an R code chunk.
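One way to start (a sketch; the object name ttest_result is our choice) is to store the test once and then reference its components:
R
# store the test result so each component can be referenced inline
ttest_result <- t.test(health_london_rank ~ city, data = lon_dims_imd_2019)
ttest_result$statistic # t-statistic
ttest_result$parameter # degrees of freedom
ttest_result$conf.int # 95% confidence interval
round(ttest_result$p.value, 4) # p-value to 4 decimal places
In R Markdown, inline code such as `r round(ttest_result$p.value, 4)` will then render the value inside your sentence.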
More than two levels of IV
While the t-test is sufficient where there are two levels of the IV, for situations where there are more than two we use the ANOVA family of procedures. To show this, we will compare the health rank across the London boroughs (the la19nm variable). If the ANOVA result is statistically significant, we will use a post-hoc test method to do pairwise comparisons (here Tukey's Honest Significant Differences).
R
anovamodel <- aov(lon_dims_imd_2019$health_london_rank ~ lon_dims_imd_2019$la19nm)
summary(anovamodel)
OUTPUT
Df Sum Sq Mean Sq F value Pr(>F)
lon_dims_imd_2019$la19nm 32 1.156e+11 3.614e+09 94.3 <2e-16 ***
Residuals 4802 1.840e+11 3.832e+07
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R
TukeyHSD(anovamodel)
OUTPUT
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = lon_dims_imd_2019$health_london_rank ~ lon_dims_imd_2019$la19nm)
$`lon_dims_imd_2019$la19nm`
diff lwr
Barnet-Barking and Dagenham 14629.43408 11864.704458
Bexley-Barking and Dagenham 10324.16189 7356.018133
Brent-Barking and Dagenham 8976.53678 6109.643394
Bromley-Barking and Dagenham 13160.91347 10362.721854
Camden-Barking and Dagenham 8842.99282 5813.159809
City of London-Barking and Dagenham 8728.87879 -1126.986222
Croydon-Barking and Dagenham 5472.01364 2726.731357
Ealing-Barking and Dagenham 7444.55056 4643.802426
Enfield-Barking and Dagenham 9856.81595 7020.532420
Greenwich-Barking and Dagenham 3308.37658 361.423935
Hackney-Barking and Dagenham -2303.09343 -5280.080786
Hammersmith and Fulham-Barking and Dagenham 3374.21360 225.344643
Haringey-Barking and Dagenham 4977.28683 2004.748485
Harrow-Barking and Dagenham 14760.22064 11750.476548
Havering-Barking and Dagenham 9781.09212 6830.002345
Hillingdon-Barking and Dagenham 7988.28148 5080.156381
Hounslow-Barking and Dagenham 6633.44686 3647.394114
Islington-Barking and Dagenham -824.36918 -3909.451677
Kensington and Chelsea-Barking and Dagenham 12368.11342 9144.725551
Kingston upon Thames-Barking and Dagenham 14113.29035 10847.712764
Lambeth-Barking and Dagenham 672.68029 -2178.519304
Lewisham-Barking and Dagenham 2032.14605 -847.904610
Merton-Barking and Dagenham 10555.97287 7476.768867
Newham-Barking and Dagenham 3111.49058 214.182117
Redbridge-Barking and Dagenham 12151.31254 9243.187437
upr p adj
Barnet-Barking and Dagenham 17394.16370 0.0000000
Bexley-Barking and Dagenham 13292.30565 0.0000000
Brent-Barking and Dagenham 11843.43017 0.0000000
Bromley-Barking and Dagenham 15959.10510 0.0000000
Camden-Barking and Dagenham 11872.82584 0.0000000
City of London-Barking and Dagenham 18584.74380 0.1877166
Croydon-Barking and Dagenham 8217.29592 0.0000000
Ealing-Barking and Dagenham 10245.29869 0.0000000
Enfield-Barking and Dagenham 12693.09947 0.0000000
Greenwich-Barking and Dagenham 6255.32923 0.0085802
Hackney-Barking and Dagenham 673.89392 0.4778538
Hammersmith and Fulham-Barking and Dagenham 6523.08255 0.0185918
Haringey-Barking and Dagenham 7949.82518 0.0000001
Harrow-Barking and Dagenham 17769.96473 0.0000000
Havering-Barking and Dagenham 12732.18190 0.0000000
Hillingdon-Barking and Dagenham 10896.40658 0.0000000
Hounslow-Barking and Dagenham 9619.49961 0.0000000
Islington-Barking and Dagenham 2260.71332 1.0000000
Kensington and Chelsea-Barking and Dagenham 15591.50128 0.0000000
Kingston upon Thames-Barking and Dagenham 17378.86794 0.0000000
Lambeth-Barking and Dagenham 3523.87988 1.0000000
Lewisham-Barking and Dagenham 4912.19670 0.6922986
Merton-Barking and Dagenham 13635.17688 0.0000000
Newham-Barking and Dagenham 6008.79904 0.0179605
Redbridge-Barking and Dagenham 15059.43763 0.0000000
[ reached getOption("max.print") -- omitted 503 rows ]
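The pairwise differences and their confidence intervals can also be plotted, although with this many boroughs the axis labels will be crowded. A minimal sketch:
R
# plot the family of pairwise mean differences with their confidence intervals
plot(TukeyHSD(anovamodel), las = 1)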
Regression Modelling
The most common use of regression modelling is to explore the relationship between two continuous variables, for example between Income_london_rank and health_london_rank in our data. We can first determine whether there is any significant correlation between the values and, if there is, plot the relationship.
R
cor.test(lon_dims_imd_2019$Income_london_rank, lon_dims_imd_2019$health_london_rank)
OUTPUT
Pearson's product-moment correlation
data: lon_dims_imd_2019$Income_london_rank and lon_dims_imd_2019$health_london_rank
t = 92.907, df = 4833, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7903110 0.8105571
sample estimates:
cor
0.8006626
R
ggplot(lon_dims_imd_2019, aes(Income_london_rank, health_london_rank)) +
geom_point() +
geom_smooth()
OUTPUT
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
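geom_smooth() defaults to a GAM smoother for a dataset of this size. To draw the straight-line fit that lm() will produce below, you can request the method explicitly; a minimal sketch:
R
# fit and draw a straight line rather than the default GAM smoother
ggplot(lon_dims_imd_2019, aes(Income_london_rank, health_london_rank)) +
geom_point() +
geom_smooth(method = "lm")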
Having decided that a further investigation of this relationship is worthwhile, we can create a linear model with the function lm().
R
modelone <- lm(lon_dims_imd_2019$Income_london_rank ~ lon_dims_imd_2019$health_london_rank)
summary(modelone)
OUTPUT
Call:
lm(formula = lon_dims_imd_2019$Income_london_rank ~ lon_dims_imd_2019$health_london_rank)
Residuals:
Min 1Q Median 3Q Max
-15354 -3547 -102 3458 24528
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.223e+03 2.039e+02 -15.80 <2e-16 ***
lon_dims_imd_2019$health_london_rank 8.634e-01 9.293e-03 92.91 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5087 on 4833 degrees of freedom
Multiple R-squared: 0.6411, Adjusted R-squared: 0.641
F-statistic: 8632 on 1 and 4833 DF, p-value: < 2.2e-16
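Before relying on modelone, it is worth a quick look at the residual diagnostics. A minimal sketch using base R's plot method for lm objects:
R
# four standard diagnostic plots: residuals vs fitted, Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(modelone)
par(mfrow = c(1, 1)) # restore the default layout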
Regression with a categorical IV (the t-test)
Run the following code chunk and compare the results to the t-test conducted earlier.
R
# convert the logical city variable to a factor; note that the result is
# printed rather than saved, since it is not assigned back to lon_dims_imd_2019
lon_dims_imd_2019 %>%
mutate(city = factor(city))
OUTPUT
ls11cd la19nm IDAOP_london_rank IDAOP_london_decile
1 E01000001 City of London 32,820 10
2 E01000002 City of London 31,938 10
3 E01000003 City of London 16,377 8
4 E01000005 City of London 3,885 3
IDACI_london_rank IDACI_london_decile Income_london_rank Income_london_decile
1 32806 10 32831 10
2 29682 10 29901 10
3 27063 9 18510 7
4 9458 4 6029 2
employment_london_rank employment_london_decile crime_london_rank
1 32742 10 32662
2 31190 10 32789
3 15103 5 29363
4 7833 2 31059
crime_london_decile barriers_london_rank barriers_london_decile
1 10 2679 6
2 10 3645 8
3 10 984 3
4 10 1003 3
livingEnv_london_rank livingEnv_london_decile health_london_rank
1 7789 4 32113
2 13070 7 29705
3 4092 2 17600
4 9397 5 17907
health_london_decile edu_london_rank edu_london_decile city
1 10 32842 10 TRUE
2 9 32832 10 TRUE
3 4 26386 8 TRUE
4 4 12370 2 TRUE
[ reached 'max' / getOption("max.print") -- omitted 4831 rows ]
R
modelttest <- lm(lon_dims_imd_2019$health_london_rank ~ lon_dims_imd_2019$city)
summary(modelttest)
OUTPUT
Call:
lm(formula = lon_dims_imd_2019$health_london_rank ~ lon_dims_imd_2019$city)
Residuals:
Min 1Q Median 3Q Max
-19251.6 -6392.6 367.4 6853.4 12362.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20481.6 113.3 180.75 <2e-16 ***
lon_dims_imd_2019$cityTRUE 1478.2 3216.6 0.46 0.646
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7874 on 4833 degrees of freedom
Multiple R-squared: 4.37e-05, Adjusted R-squared: -0.0001632
F-statistic: 0.2112 on 1 and 4833 DF, p-value: 0.6458

Note that the coefficient's t value differs from the earlier t-test in sign (lm() reports the cityTRUE coefficient, i.e. TRUE minus FALSE, while t.test() reports FALSE minus TRUE) and in magnitude and degrees of freedom (lm() assumes equal variances, whereas t.test() defaults to the Welch correction).