PC Labs for SO5041: Week 2
1. Week 2: Univariate and bivariate analysis
1.1. Setup (from lab 1)
To set up the NLSW88 data set, run the following code in RStudio (see the previous lab's notes to download the data file). Note that the setwd() command moves R to the location where the data file should be; replace "C:/Users/yourname/Documents/SO5041/labs" with whatever corresponds to your location. Note also that if the foreign library is not installed you will need to run install.packages("foreign") first.
setwd("C:/Users/yourname/Documents/SO5041/labs")
library(foreign)
nlsw88 <- read.dta("nlsw88.dta")
1.2. Univariate summaries (recap)
One-way descriptive summaries depend on the type of variable:
- Categorical
  - Nominal: numbers relate to categories which are "just different". All we can do is enumerate the different types.
  - Ordinal: categories have a natural order but steps between categories are not defined.
- Quantitative (sometimes called continuous): numbers that have natural meaning like time, money, distance, age, weight, or counts; steps between successive numbers are consistent, and summaries like the mean make sense.
1.2.1. Numerical summaries of categorical variables
Categorical variables typically have few values and we can use frequency tables to summarise them:
- Frequency table:
table(nlsw88$occupation)
- Better version:
library(epiDisplay)
tab1(nlsw88$occupation)
> tab1(nlsw88$occupation)
nlsw88$occupation :
                       Frequency %(NA+) %(NA-)
Professional/Technical       317   14.1   14.2
Managers/Admin               264   11.8   11.8
Sales                        726   32.3   32.5
Clerical/Unskilled           102    4.5    4.6
Craftsmen                     53    2.4    2.4
Operatives                   246   11.0   11.0
Transport                     28    1.2    1.3
Laborers                     286   12.7   12.8
Farmers                        1    0.0    0.0
Farm laborers                  9    0.4    0.4
Service                       16    0.7    0.7
Household workers              2    0.1    0.1
Other                        187    8.3    8.4
NA's                           9    0.4    0.0
Total                       2246  100.0  100.0
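The two percentage columns in the tab1() output differ only in their denominator: %(NA+) keeps the missing cases in the total, %(NA-) drops them. As a rough check, the sketch below recomputes both in base R from the frequencies printed above. The object names counts and n_missing are illustrative only and are not part of epiDisplay.

```r
# Frequencies copied from the tab1() output above (illustrative object names)
counts <- c("Professional/Technical" = 317, "Managers/Admin" = 264,
            "Sales" = 726, "Clerical/Unskilled" = 102, "Craftsmen" = 53,
            "Operatives" = 246, "Transport" = 28, "Laborers" = 286,
            "Farmers" = 1, "Farm laborers" = 9, "Service" = 16,
            "Household workers" = 2, "Other" = 187)
n_missing <- 9  # the NA's row of the table

# %(NA+): missing cases stay in the denominator
pct_na_in <- round(100 * counts / (sum(counts) + n_missing), 1)
# %(NA-): missing cases are dropped from the denominator
pct_na_out <- round(100 * counts / sum(counts), 1)

cbind(Frequency = counts, "%(NA+)" = pct_na_in, "%(NA-)" = pct_na_out)
```

With the lab data itself, table(nlsw88$occupation, useNA = "ifany") gives the counts including the missing category.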
1.2.2. Graphical summaries of categorical variables
We will use the ggplot2 library for graphics.
Categorical variables can usually be summarised with bar charts or pie charts. Pie charts aren't very effective, but they're popular. They're easier without ggplot2 (note how the code uses the table() function to calculate the category sizes):
pie(table(nlsw88$occupation))
Bar charts are preferred:
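library(ggplot2)
# Default bar chart, flipped so the occupation labels are readable
ggplot(nlsw88, aes(x=occupation)) + geom_bar() + ggtitle("Occupation") + coord_flip()
# Same chart with custom outline and fill colours
ggplot(nlsw88, aes(x=occupation)) + geom_bar(color="red", fill="pink") + ggtitle("Occupation") + coord_flip()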
1.2.3. Numerical summaries of quantitative variables
Scale and count variables (time, money, weights, counts etc.) often have too many distinct values to summarise in a frequency table. But since they are numerically meaningful we can summarise them with statistics such as the mean, median, standard deviation and range.
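summary(nlsw88$wage)    # minimum, quartiles, median, mean and maximum
mean(nlsw88$wage)
median(nlsw88$wage)
sd(nlsw88$wage)         # standard deviation
quantile(nlsw88$wage)   # 0%, 25%, 50%, 75% and 100% quantiles
max(nlsw88$wage) - min(nlsw88$wage)   # range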
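To see what each function returns, it can help to run them on a vector small enough to check by hand. The values below are made up for illustration and are not from NLSW88. Note that range() returns the minimum and maximum rather than their difference, and that these functions return NA when a value is missing unless na.rm = TRUE is set.

```r
x <- c(2, 4, 4, 5, 10)   # toy values, not NLSW88 data

mean(x)          # arithmetic mean
median(x)        # middle value once sorted
sd(x)            # sample standard deviation (divides by n - 1)
range(x)         # minimum and maximum, as a vector of length 2
diff(range(x))   # width of the range, same as max(x) - min(x)

# With a missing value, the mean is NA unless the NA is dropped explicitly
y <- c(x, NA)
mean(y)                 # NA
mean(y, na.rm = TRUE)   # same as mean(x)
```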
1.2.4. Graphical summaries of quantitative variables
Histograms are a good way of summarising the distribution of a quantitative variable with many distinct values. Here is how, using both base R and ggplot2:
hist(nlsw88$wage)
ggplot(nlsw88, aes(x=wage)) + geom_histogram() + ggtitle("Histogram of Wage")
ggplot(nlsw88, aes(x=wage)) + geom_histogram(color="white", fill="red") + ggtitle("Histogram of Wage")
A boxplot focuses on the median and quartiles, and also shows the range and any outliers. For a single variable it is quite sparse; it is more useful in bivariate analysis.
ggplot(nlsw88, aes(x=wage)) + geom_boxplot() + coord_flip()
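The numbers behind a boxplot can be inspected with boxplot.stats() from base R. This is a small sketch on made-up values (not NLSW88 data), using the default rule that points more than 1.5 times the box length beyond the box are flagged as outliers. Base R's boxplot() relies on this function; geom_boxplot() in ggplot2 follows a similar rule, though its quartiles can differ slightly.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)   # toy values with one extreme point

bs <- boxplot.stats(x)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points drawn individually as outliers
```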
1.3. Two-way analyses
Now we will work with several ways of analysing two variables together: bivariate analysis.
We will look at three types of combination:
- categorical by categorical
- categorical by continuous (interval/ratio)
- continuous by continuous.
We will consider numerical and graphical techniques.
1.3.1. Categorical by categorical
Cross-tabulation is the easiest, and quite a powerful, method for looking at the relationship between two categorical (nominal, ordinal) or grouped variables.
Cross-tabulate occupation and collgrad in base R:
table(nlsw88$occupation, nlsw88$collgrad)
Once again epiDisplay does a nicer job:
tabpct(nlsw88$occupation, nlsw88$collgrad, percent="row", graph=FALSE)
Get either row or column percentages by setting the option percent="row" or percent="col".
What patterns do you see in the table? Which percentages (row or column) are easier to interpret?
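For comparison, row and column percentages can also be computed in base R with prop.table(). This sketch uses a tiny made-up data frame (the name toy and its values are illustrative only); with the lab data the same calls would apply to table(nlsw88$occupation, nlsw88$collgrad).

```r
# Toy data standing in for two categorical variables (not NLSW88 values)
toy <- data.frame(
  occupation = c("Sales", "Sales", "Sales", "Professional", "Professional", "Other"),
  collgrad   = c("no", "no", "yes", "yes", "yes", "no")
)
tab <- table(toy$occupation, toy$collgrad)

# Row percentages: each row sums to 100 (up to rounding)
row_pct <- round(100 * prop.table(tab, margin = 1), 1)
# Column percentages: each column sums to 100 (up to rounding)
col_pct <- round(100 * prop.table(tab, margin = 2), 1)

row_pct
col_pct
```

Row percentages answer "of the people in this occupation, what share are graduates?", while column percentages answer "of the graduates, what share are in this occupation?".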
- Graphing
Create a clustered bar chart. The example here uses occupation and union; try it with occupation and collgrad as well.
ggplot(nlsw88, aes(x = occupation, fill = union)) + geom_bar(position="dodge") + coord_flip()
Awkward way to drop the missing occupations:
ggplot(subset(nlsw88, !is.na(occupation)), aes(x = occupation, fill = union)) + geom_bar(position="dodge") + coord_flip()
Clustered means the union/non-union bars are clustered within occupation. If you want to cluster occupation within union, change the aes() part to aes(x = union, fill = occupation). Which is better?
We can stack bars instead of clustering them by changing the position keyword:
ggplot(subset(nlsw88, !is.na(occupation)), aes(x = occupation, fill = union)) + geom_bar(position="stack") + coord_flip()
Experiment with stacked and clustered, and with which variable to use as the cluster or stack variable. Do you see the same patterns as in the table?
Another option is to highlight proportions within one category (like row percentages):
ggplot(subset(nlsw88, !is.na(union)), aes(x = occupation, fill = union)) + geom_bar(position="fill") + coord_flip()
1.3.2. Interval/Ratio by Categorical
Sometimes we have a continuous variable that "varies with" a categorical one. Income may vary with gender, or with educational qualifications.
With the NLSW88 data, see how wage varies with variables such as collgrad or occupation. Here for occupation:
aggregate(wage ~ occupation, nlsw88, mean)
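The function passed to aggregate() can return more than one statistic, which makes it easy to put the mean next to the median and the group size. A minimal sketch on made-up data (toy and its values are illustrative, not from NLSW88):

```r
toy <- data.frame(
  occupation = c("Sales", "Sales", "Sales", "Professional", "Professional"),
  wage       = c(4, 5, 9, 10, 14)
)

# One row per occupation; the wage column holds mean, median and count
res <- aggregate(wage ~ occupation, toy,
                 function(w) c(mean = mean(w), median = median(w), n = length(w)))
res
```

With the lab data, replace toy with nlsw88. Note that the formula interface drops rows where wage or occupation is missing.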
Graphically, we can represent this with a barchart where the height of the bar represents the mean of the continuous variable for that value of the categorical one:
ggplot(subset(nlsw88, !is.na(occupation)), aes(occupation, wage)) + geom_bar(stat="summary", fun="mean") + coord_flip()
Box plots focus on medians and quartiles, and give a somewhat more detailed picture of the distribution than just the mean. They are good for comparing distributions across categories.
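As a sketch of that comparison, base R's boxplot() accepts a formula of the form outcome ~ group. The data frame below is made up for illustration (not NLSW88 values); with the lab data the equivalent call would be boxplot(wage ~ occupation, data = nlsw88, horizontal = TRUE), or geom_boxplot() with aes(occupation, wage) in ggplot2.

```r
toy <- data.frame(
  occupation = factor(rep(c("Professional", "Sales"), each = 5)),
  wage       = c(8, 10, 12, 14, 40,   # Professional
                 3, 4, 5, 6, 7)       # Sales
)

# One box per occupation; horizontal = TRUE plays the role of coord_flip()
bp <- boxplot(wage ~ occupation, data = toy, horizontal = TRUE)

bp$stats[3, ]   # the median of each group
bp$out          # points flagged as outliers
```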
1.3.3. Interval/Ratio by Interval/Ratio
We will find out about numerical methods for summarising the relationship between pairs of interval variables later, but for now the scatterplot is very useful. With the NLSW88 data, compare all three possible pairs of the following variables:
- Years of education (grade)
- Wage rate (wage)
- Work experience (ttl_exp)
We can generate a scatterplot as follows:
ggplot(nlsw88, aes(y=wage, x=ttl_exp)) + geom_point()
We can improve the appearance (colour, transparency, titles and labels) as follows:
ggplot(nlsw88, aes(y=wage, x=ttl_exp)) +
  geom_point(color="darkred", alpha=0.4) +
  ggtitle("Scatter of Wage and Experience") +
  ylab("Wage Rate") +
  xlab("Years of work experience")
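To look at all three pairs at once, one option is a scatterplot matrix. The sketch below uses base R's pairs() on a small made-up data frame (illustrative values only); with the lab data the call would be pairs(nlsw88[, c("grade", "wage", "ttl_exp")]). combn() simply lists the pairs being compared.

```r
vars <- c("grade", "wage", "ttl_exp")

# The three possible pairs, one per column
pair_names <- combn(vars, 2)
pair_names

# Toy data with the same column names (not NLSW88 values)
toy <- data.frame(
  grade   = c(10, 12, 12, 14, 16, 18),
  wage    = c(4.5, 6.0, 5.2, 8.1, 9.7, 15.3),
  ttl_exp = c(14, 11, 9, 12, 8, 10)
)

# Scatterplot matrix: every variable plotted against every other
pairs(toy[, vars])
```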