PC Labs for SO5041: Week 3
Table of Contents
Week 3 Lab: Data entry
Data collected in class
In previous years, I collected information from undergraduates in class. Here it is represented numerically:
| Age | Sex | Reg | Maths | Dist |
|---|---|---|---|---|
| 37 | 1 | 2 | 3 | 27 |
| 29 | 2 | 5 | 3 | 2 |
| 22 | 1 | 7 | 2 | -9 |
| 20 | 2 | 1 | 2 | 1 |
| 20 | 2 | 1 | 2 | 6 |
| 20 | 2 | 1 | 3 | 24 |
| 21 | 2 | 2 | 2 | 0 |
| 21 | 2 | 2 | 2 | 0 |
| 20 | 2 | 7 | 2 | 0 |
| 22 | 1 | 1 | 3 | 1 |
| 20 | 1 | 2 | 3 | 25 |
| 20 | 2 | 3 | 3 | 1 |
| 20 | 2 | 5 | 2 | 5 |
| 20 | 2 | 3 | 3 | 75 |
| 21 | 1 | 3 | 3 | 20 |
| 21 | 2 | 1 | 3 | 5 |
| 21 | 2 | 1 | 3 | 7 |
| 21 | 1 | 1 | 3 | 3 |
| 19 | 1 | 1 | 2 | 40 |
| 19 | 2 | 2 | 3 | 5 |
Sex is coded with
- 1: Male
- 2: Female
Reg is region of birth and is coded with
- Munster
- Leinster
- Connacht-ulster
- NI
- GB
- EU
- US
- Other
Dist is distance of home from UL, in miles.
Maths is last maths done, coded with
- Post LC
- LC(h)
- LC(o)
- JC(h)
- JC(o)
- Lower
What to do?
Copy this data into a spreadsheet, and save the result as a "Text CSV" file.
Alternatively, use an editor like Notepad to replace all whitespace with commas. It should look like this:
Age,Sex,Reg,Maths,Dist 37,1,2,3,27 29,2,5,3,2 22,1,7,2,-9 [ ... ]
We can read CSV files into R (you may need to do install.packages(tidyverse) first).
library(tidyverse) df <- read_csv("/tmp/x.csv") ## Change the file specification to match where you saved it
You have a pretty anonymous data set, with lots of unlabeled numbers. Inspect it in the Environment window, or in the main window:
> df
# A tibble: 20 × 5
Age Sex Reg Maths Dist
<dbl> <dbl> <dbl> <dbl> <dbl>
1 37 1 2 3 27
2 29 2 5 3 2
3 22 1 7 2 -9
4 20 2 1 2 1
5 20 2 1 2 6
6 20 2 1 3 24
7 21 2 2 2 0
8 21 2 2 2 0
9 20 2 7 2 0
10 22 1 1 3 1
11 20 1 2 3 25
12 20 2 3 3 1
13 20 2 5 2 5
14 20 2 3 3 75
15 21 1 3 3 20
16 21 2 1 3 5
17 21 2 1 3 7
18 21 1 1 3 3
19 19 1 1 2 40
20 19 2 2 3 5
This is a "tibble", which is a form of dataframe. You can access the variables within it as df$Sex, df$Age, etc.
Categorical variables
R has a special form of representation for categorical variables, a "factor". Factors have defined values, and each value has a label. To make the sex variable a labelled factor (1 => "Male", 2 => "Female"), do the following:
df$Sex <- factor(df$Sex, levels=1:2, labels=c("Male", "Female")) table(df$Sex)
The levels option goes from the minimum to the maximum underlying values. This is usually 1 to the number of values, so 1:X usually works (if the values have gaps, a list works).
The labels option gives a list of labels to apply. (The c() function stands for concatenate or combine, and makes a list of its inputs.) There needs to be exactly one label for each level.
Note that in the example above, the existing variable is being changed (overwritten). You may choose to create a new variable instead:
df$Sex2 <- factor(df$Sex, levels=1:2, labels=c("Male", "Female"))
Repeat this with the other categorical variables, Sex, Reg and Maths.
Explore the data
Tabulate the categorical variables (table(), or tab1() from the epiDisplay library), and summarise the quantitative ones (summary(), mean(), sd(), etc.).
Be clear about which summary is appropriate for the different variables. Tabulation is usually helpful where there are relatively few distinct values, and summaries such as means and standard deviations are only meaningful for interval and ratio variables. Age falls into both camps, as it has few distinct values, but is numerically meaningful.
Missing values
Note that the distance variable contains a negative value, which is obviously incorrect as a distance. This is a standard way to record a missing value. If you calculate the mean with the data as it exists, it will therefore be incorrect.
mean(df$Dist)
We need to recode the missing values to R's "NA" value. As a one-off for a single calculation you can do this:
mean(ifelse(df$Dist == -9, NA, df$Dist), na.rm=TRUE )
This changes the -9 but nothing else, and the na.rm=TRUE option tells the mean() function to use the non-NA values only (otherwise it will fail).
To make this change permanent within the dataframe:
df <- mutate(df, Dist = ifelse(Dist == -9, NA, Dist)) mean(df$Dist, na.rm=TRUE)