PC Labs for SO5041: Week 1
Week 1
Installing R and RStudio
Working with R
We begin by downloading a data file (an extract from the US National Longitudinal Survey of Women, 1988).
We then load some libraries into R and read the file:
library(ggplot2)
library(foreign)
nlsw88 = read.dta("nlsw88.dta")
names(nlsw88)
The variable nlsw88 is a "data frame", where rows are cases/observations and columns are variables. Variables within the data frame are referred to using $, e.g., nlsw88$occupation.
In the top-right window in RStudio, you will see details of the variables. You can also get the list of variables by entering the command names(nlsw88).
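To try the $ notation without the data file, here is a minimal sketch using a tiny made-up data frame. The name toy and all of its values are invented for illustration; only the column names mimic nlsw88.

```r
# Hypothetical miniature data frame; the values are invented, not taken from nlsw88
toy <- data.frame(
  occupation = factor(c("Sales", "Managers", "Sales", "Laborers")),
  wage = c(7.5, 12.0, 6.3, 5.1)
)

names(toy)             # variable (column) names
nrow(toy)              # number of cases (rows)
toy$wage               # one variable, extracted with $
class(toy$occupation)  # categorical variables are stored as factors
str(toy)               # compact overview of every variable and its type
```

The same commands should work on nlsw88 once it has been read in.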
Tabulating categorical variables
Some of these variables have small numbers of distinct values (e.g., occupation), whereas some are quantities like money or time (e.g., wage). We can summarise the values of categorical variables using frequency tables:
table(nlsw88$occupation)
That's the default frequency table in R: not very readable. There are other options, for instance in the epiDisplay library. This needs to be added to R, so first do the following to download and install the library:
install.packages("epiDisplay")
Then try its tab1 function:
library(epiDisplay)
tab1(nlsw88$occupation, sort.group = "decreasing", cum.percent = TRUE, graph = FALSE)
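If epiDisplay is not available, a similar table (sorted counts, percentages and cumulative percentages) can be put together in base R. This is only a sketch of the idea, not how tab1 itself is implemented; the factor occ and its values are made up and stand in for nlsw88$occupation.

```r
# Hypothetical factor; in the lab this would be nlsw88$occupation
occ <- factor(c("Sales", "Managers", "Sales", "Laborers", "Sales", "Managers"))

counts <- sort(table(occ), decreasing = TRUE)  # frequencies, largest first
pct <- 100 * prop.table(counts)                # share of each category, in percent

freq_table <- data.frame(
  category = names(counts),
  n = as.vector(counts),
  percent = round(as.vector(pct), 1),
  cum_percent = round(cumsum(as.vector(pct)), 1)
)
print(freq_table)
```

Building the table as a data frame keeps the counts and percentages side by side, which is easier to read than the default table() output.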
Summarising quantitative variables
Variables like occupation
or union
are categorical, "factors" in R terminology: they have a more-or-less small number of distinct values and can usefully be tabulated. Variables like wage
have very large numbers of different values, butcan be summarised otherwise:
## Univariate summary of quantitative variable
summary(nlsw88$wage)
mean(nlsw88$wage)
sd(nlsw88$wage)
print(quantile(nlsw88$wage))
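One thing to watch for with these functions is missing data. Whether wage in nlsw88 has any missing values is not stated here, so the sketch below uses an invented vector with one NA to show how mean(), sd() and quantile() behave and how the na.rm argument changes that.

```r
# Hypothetical wage vector with one missing value (NA); the values are invented
wage <- c(7.5, 12.0, NA, 5.1, 6.3)

mean(wage)                 # NA: by default a missing value propagates
mean(wage, na.rm = TRUE)   # drop the NA before averaging
sd(wage, na.rm = TRUE)
quantile(wage, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
summary(wage)              # summary() reports the number of NAs separately
```

If a summary unexpectedly comes back as NA, checking for missing values with summary() is a reasonable first step.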
Better bar-charts
The tab1 function in epiDisplay produces a bar chart as a by-product (suppressed above with graph = FALSE). R's ggplot2 library produces higher quality graphics. Experiment with these commands:
ggplot(nlsw88, aes(x=occupation)) + geom_bar(color="red", fill="pink")
ggplot(nlsw88, aes(x=occupation)) + geom_bar(color="red", fill="pink") + coord_flip()
ggplot(nlsw88, aes(x=occupation)) + geom_bar(color="red", fill="pink") + coord_flip() + ggtitle("Bar chart: Occupation")
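Each of those commands repeats the previous one and adds a layer. A ggplot2 plot can also be stored in a variable and extended step by step, which avoids the retyping. A minimal sketch, assuming ggplot2 is installed; the data frame toy and the names p, p_flipped and p_titled are made up for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for nlsw88; the values are invented
toy <- data.frame(occupation = factor(c("Sales", "Managers", "Sales", "Laborers")))

# Base bar chart, stored in a variable rather than drawn immediately
p <- ggplot(toy, aes(x = occupation)) +
  geom_bar(color = "red", fill = "pink")

p_flipped <- p + coord_flip()                             # horizontal bars
p_titled <- p_flipped + ggtitle("Bar chart: Occupation")  # add a title

print(p_titled)  # the plot is drawn when the object is printed
```

Typing just p_titled at the console has the same effect as print(p_titled).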
Histograms
Histograms are a good way to summarise quantitative variables. The R default is somewhat ugly but simple:
hist(nlsw88$wage)
ggplot2 is more flexible, if verbose:
ggplot(nlsw88, aes(x=wage)) + geom_histogram() + ggtitle("Histogram of Wage")
ggplot(nlsw88, aes(x=wage)) + geom_histogram(color="white", fill="red") + ggtitle("Histogram of Wage")
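Called without arguments, geom_histogram() uses 30 bins and prints a message suggesting a better choice. The look of a histogram depends heavily on bin width, so it is worth setting it explicitly. The sketch below assumes ggplot2 is installed and uses invented wage values; a bin width of 1 is an arbitrary choice for illustration, not a recommendation for nlsw88.

```r
library(ggplot2)

# Hypothetical wage values, invented for illustration
toy <- data.frame(wage = c(3.2, 4.8, 5.1, 5.9, 6.3, 7.5, 8.0, 9.4, 12.0, 21.7))

# ggplot2: bars one unit wide instead of the default 30 bins
h <- ggplot(toy, aes(x = wage)) +
  geom_histogram(binwidth = 1, color = "white", fill = "red") +
  ggtitle("Histogram of Wage (binwidth = 1)")
print(h)

# Base R: breaks sets the bin edges, which must cover the whole range of the data
hist(toy$wage, breaks = seq(0, 25, by = 1))
```

Trying a few different widths on the real data is a quick way to see how much the shape of the distribution depends on this choice.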