PC Labs for SO5041: Week 3

Week 3 Lab: Data entry

Week 3 Lab: Data entry

Data collected in class

In previous years, I collected information from undergraduates in class. Here it is represented numerically:

Age	Sex	Reg	Maths	Dist
37	1	2	3	27
29	2	5	3	2
22	1	7	2	-9
20	2	1	2	1
20	2	1	2	6
20	2	1	3	24
21	2	2	2	0
21	2	2	2	0
20	2	7	2	0
22	1	1	3	1
20	1	2	3	25
20	2	3	3	1
20	2	5	2	5
20	2	3	3	75
21	1	3	3	20
21	2	1	3	5
21	2	1	3	7
21	1	1	3	3
19	1	1	2	40
19	2	2	3	5

Sex is coded with

1: Male
2: Female

Reg is region of birth and is coded with

Munster
Leinster
Connacht-ulster
NI
GB
EU
US
Other

Dist is distance of home from UL, in miles.

Maths is last maths done, coded with

Post LC
LC(h)
LC(o)
JC(h)
JC(o)
Lower

What to do?

Copy this data into a spreadsheet, and save the result as a "Text CSV" file.

Alternatively, use an editor like Notepad to replace all whitespace with commas. It should look like this:

Age,Sex,Reg,Maths,Dist
37,1,2,3,27
29,2,5,3,2
22,1,7,2,-9
[ ... ]

We can read CSV files into R (you may need to do install.packages(tidyverse) first).

library(tidyverse)
df <- read_csv("/tmp/x.csv") ## Change the file specification to match where you saved it

You have a pretty anonymous data set, with lots of unlabeled numbers. Inspect it in the Environment window, or in the main window:

> df
# A tibble: 20 × 5
     Age   Sex   Reg Maths  Dist
   <dbl> <dbl> <dbl> <dbl> <dbl>
 1    37     1     2     3    27
 2    29     2     5     3     2
 3    22     1     7     2    -9
 4    20     2     1     2     1
 5    20     2     1     2     6
 6    20     2     1     3    24
 7    21     2     2     2     0
 8    21     2     2     2     0
 9    20     2     7     2     0
10    22     1     1     3     1
11    20     1     2     3    25
12    20     2     3     3     1
13    20     2     5     2     5
14    20     2     3     3    75
15    21     1     3     3    20
16    21     2     1     3     5
17    21     2     1     3     7
18    21     1     1     3     3
19    19     1     1     2    40
20    19     2     2     3     5

This is a "tibble", which is a form of dataframe. You can access the variables within it as df$Sex, df$Age, etc.

Categorical variables

R has a special form of representation for categorical variables, a "factor". Factors have defined values, and each value has a label. To make the sex variable a labelled factor (1 => "Male", 2 => "Female"), do the following:

df$Sex <- factor(df$Sex, levels=1:2, labels=c("Male", "Female"))
table(df$Sex)

The levels option goes from the minimum to the maximum underlying values. This is usually 1 to the number of values, so 1:X usually works (if the values have gaps, a list works).

The labels option gives a list of labels to apply. (The c() function stands for concatenate or combine, and makes a list of its inputs.) There needs to be exactly one label for each level.

Note that in the example above, the existing variable is being changed (overwritten). You may choose to create a new variable instead:

df$Sex2 <- factor(df$Sex, levels=1:2, labels=c("Male", "Female"))

Repeat this with the other categorical variables, Sex, Reg and Maths.

Explore the data

Tabulate the categorical variables (table(), or tab1() from the epiDisplay library), and summarise the quantitative ones (summary(), mean(), sd(), etc.).

Be clear about which summary is appropriate for the different variables. Tabulation is usually helpful where there are relatively few distinct values, and summaries such as means and standard deviations are only meaningful for interval and ratio variables. Age falls into both camps, as it has few distinct values, but is numerically meaningful.

Missing values

Note that the distance variable contains a negative value, which is obviously incorrect as a distance. This is a standard way to record a missing value. If you calculate the mean with the data as it exists, it will therefore be incorrect.

mean(df$Dist)

We need to recode the missing values to R's "NA" value. As a one-off for a single calculation you can do this:

mean(ifelse(df$Dist == -9, NA, df$Dist), na.rm=TRUE )

This changes the -9 but nothing else, and the na.rm=TRUE option tells the mean() function to use the non-NA values only (otherwise it will fail).

To make this change permanent within the dataframe:

df <- mutate(df, Dist = ifelse(Dist == -9, NA, Dist))
mean(df$Dist, na.rm=TRUE)

Age	Sex	Reg	Maths	Dist
37	1	2	3	27
29	2	5	3	2
22	1	7	2	-9
20	2	1	2	1
20	2	1	2	6
20	2	1	3	24
21	2	2	2	0
21	2	2	2	0
20	2	7	2	0
22	1	1	3	1
20	1	2	3	25
20	2	3	3	1
20	2	5	2	5
20	2	3	3	75
21	1	3	3	20
21	2	1	3	5
21	2	1	3	7
21	1	1	3	3
19	1	1	2	40
19	2	2	3	5

Age	Sex	Reg	Maths	Dist
37	1	2	3	27
29	2	5	3	2
22	1	7	2	-9
20	2	1	2	1
20	2	1	2	6
20	2	1	3	24
21	2	2	2	0
21	2	2	2	0
20	2	7	2	0
22	1	1	3	1
20	1	2	3	25
20	2	3	3	1
20	2	5	2	5
20	2	3	3	75
21	1	3	3	20
21	2	1	3	5
21	2	1	3	7
21	1	1	3	3
19	1	1	2	40
19	2	2	3	5