PC Labs for SO5041: Week 1

Week 1

Working with R

We begin by downloading a data file (an extract from the US National Longitudinal Survey of Women, 1988).

We then load some libraries into R and read the file:

library(ggplot2)
library(foreign)
nlsw88 = read.dta("nlsw88.dta")
names(nlsw88)

The variable nlsw88 is a "data frame", where rows are cases/observations and columns are variables. Variables within the data frame are referred to using $, e.g., nlsw88$occupation.
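
To make the rows-and-columns idea concrete, here is a minimal sketch using a small invented data frame. The name toy and its values are assumptions for illustration only and are not part of the NLSW extract.

```r
# Hypothetical toy data frame (NOT the NLSW data): 4 cases, 2 variables
toy <- data.frame(
  occupation = factor(c("Sales", "Managers", "Sales", "Clerical")),
  wage = c(7.5, 12.0, 6.25, 9.0)
)

names(toy)   # variable names
nrow(toy)    # number of cases (rows)
ncol(toy)    # number of variables (columns)
toy$wage     # one variable, extracted with $
str(toy)     # compact overview of each variable's type
```

The same commands should work on nlsw88 once the file has been read.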

In the top-right window in RStudio, you will see details of the variables. You can also get the list of variables by entering the command names(nlsw88).

Tabulating categorical variables

Some of these variables have small numbers of distinct values (e.g., occupation), whereas some are quantities like money or time (e.g., wage). We can summarise the values of categorical variables using frequency tables:

table(nlsw88$occupation)

That's the default frequency table in R: not very readable. There are other options, for instance in the epiDisplay library. This needs to be added to R, so first do the following to download and install the library:

install.packages("epiDisplay")

Then try its tab1 function:

library(epiDisplay)
tab1(nlsw88$occupation, sort.group = "decreasing", cum.percent = TRUE, graph = FALSE)
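
If you would rather not install an extra library, a similar table can be put together in base R. The following is only a sketch: the vector occ is invented for illustration, and the column names in freq are arbitrary choices, not anything epiDisplay produces.

```r
# Hypothetical toy factor standing in for nlsw88$occupation
occ <- factor(c("Sales", "Managers", "Sales", "Clerical", "Sales", "Managers"))

counts <- sort(table(occ), decreasing = TRUE)   # counts, largest first
pct <- 100 * prop.table(counts)                 # percentage of cases

# Assemble counts, percentages and cumulative percentages in one table
freq <- data.frame(
  category = names(counts),
  n = as.vector(counts),
  percent = round(as.vector(pct), 1),
  cum_percent = round(cumsum(as.vector(pct)), 1)
)
print(freq)
```

Replacing occ with nlsw88$occupation should give the equivalent table for the real data.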

Summarising quantitative variables

Variables like occupation or union are categorical, "factors" in R terminology: they have a more-or-less small number of distinct values and can usefully be tabulated. Variables like wage have very large numbers of different values, but can be summarised in other ways:

## Univariate summary of quantitative variable

summary(nlsw88$wage)
mean(nlsw88$wage)
sd(nlsw88$wage)
print(quantile(nlsw88$wage))
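
The sketch below uses a handful of invented wages to show how these summaries behave when one value is very high and one is missing. The numbers are assumptions for illustration; na.rm = TRUE is the standard argument for telling these functions to ignore missing values.

```r
# Hypothetical wages (NOT from the NLSW data), with one missing value (NA)
wage <- c(5.5, 7.0, 8.25, 10.0, 12.5, 40.0, NA)

mean(wage)                   # NA: missing values propagate by default
mean(wage, na.rm = TRUE)     # mean of the six observed values
median(wage, na.rm = TRUE)   # less affected by the one very high wage
sd(wage, na.rm = TRUE)       # standard deviation of the observed values
quantile(wage, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
IQR(wage, na.rm = TRUE)      # distance between the 25% and 75% quantiles
```

Comparing the mean and the median is a quick way to spot a skewed variable: here the single wage of 40 pulls the mean well above the median.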

Better bar-charts

epiDisplay's tab1 produces a bar chart as a by-product (unless you set graph = FALSE, as above). R's ggplot2 library produces higher quality graphics. Experiment with these commands:

ggplot(nlsw88, aes(x=occupation)) + geom_bar(color="red", fill="pink")
ggplot(nlsw88, aes(x=occupation)) + geom_bar(color="red", fill="pink") + coord_flip()
ggplot(nlsw88, aes(x=occupation)) + geom_bar(color="red", fill="pink") + coord_flip() + ggtitle("Bar chart: Occupation")
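
The three commands above differ only in what is added with +. One way to avoid retyping is to store a plot in a variable and add to it, as in this sketch. The data frame toy and the names p, p_flipped and p_titled are invented for illustration.

```r
library(ggplot2)

# Hypothetical toy data standing in for nlsw88
toy <- data.frame(
  occupation = factor(c("Sales", "Managers", "Sales", "Clerical", "Sales", "Managers"))
)

# Store the basic bar chart once, then build on it step by step
p <- ggplot(toy, aes(x = occupation)) + geom_bar(color = "red", fill = "pink")
p_flipped <- p + coord_flip()                              # horizontal bars
p_titled <- p_flipped + ggtitle("Bar chart: Occupation")   # add a title

print(p_titled)   # a stored plot is drawn only when it is printed
```

Typing the name of a stored plot on its own at the console also prints it.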

Histograms

Histograms are a good way to summarise quantitative variables. The R default is somewhat ugly but simple:

hist(nlsw88$wage)

ggplot2 is more flexible, if verbose:

ggplot(nlsw88, aes(x=wage)) + geom_histogram() + ggtitle("Histogram of Wage")
ggplot(nlsw88, aes(x=wage)) + geom_histogram(color="white", fill="red") + ggtitle("Histogram of Wage")
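
The look of a histogram depends heavily on how the values are grouped into bins. By default geom_histogram() uses 30 bins and prints a message suggesting you choose your own. The following sketch, using invented wages, shows one way to control the bins in base R and in ggplot2. The values and the bin width of 5 are arbitrary assumptions for illustration.

```r
library(ggplot2)

# Hypothetical wages standing in for nlsw88$wage
toy <- data.frame(
  wage = c(3.2, 4.8, 5.5, 6.1, 7.0, 7.4, 8.25, 9.9, 10.0, 12.5, 18.0, 40.0)
)

# Base R: ask for roughly 8 bins (hist() treats a single number as a suggestion)
h <- hist(toy$wage, breaks = 8, plot = FALSE)
h$breaks   # bin boundaries actually used
h$counts   # number of cases in each bin

# ggplot2: set the bin width directly (here 5 units of wage per bin)
p <- ggplot(toy, aes(x = wage)) +
  geom_histogram(binwidth = 5, color = "white", fill = "red") +
  ggtitle("Histogram of Wage")
print(p)
```

Try a few different values of binwidth on the real wage variable to see how the shape of the histogram changes.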