PC Labs for SO5041: Week 2

1. Week 2: Univariate and bivariate analysis

1.1. Setup (from lab 1)

To set up the NLSW88 data set, run the following code in RStudio (see the previous lab's notes for how to download the data file). The setwd() command moves R to the folder where the data file should be; replace "C:/Users/yourname/Documents/SO5041/labs" with the path to your own folder. If the foreign library is not installed, run install.packages("foreign") first.

setwd("C:/Users/yourname/Documents/SO5041/labs")
library(foreign)
nlsw88 <- read.dta("nlsw88.dta")

1.2. Univariate summaries (recap)

One-way descriptive summaries depend on the type of variable:

  • Categorical
    • Nominal: numbers stand for categories which are "just different". All we can do is count how many cases fall in each category.
    • Ordinal: categories have a natural order, but the steps between categories are not defined.
  • Quantitative (sometimes called continuous): numbers that have natural meaning like time, money, distance, age, weight, or counts; steps between successive numbers are consistent, and summaries like the mean make sense.
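
A quick way to see how R treats a variable is to inspect its type. The sketch below uses a small made-up data frame (the names toy, occupation, health and wage, and all the values, are illustrative assumptions, not taken from the NLSW88 file); the same checks should work on columns of nlsw88.

```r
# Hypothetical toy data: one nominal, one ordinal, one quantitative variable
toy <- data.frame(
  occupation = factor(c("Sales", "Service", "Sales", "Other")),
  health     = factor(c("poor", "good", "fair", "good"),
                      levels = c("poor", "fair", "good"), ordered = TRUE),
  wage       = c(7.5, 12.1, 9.8, 15.0)
)

# Nominal: an unordered factor; counting categories is all that makes sense
is.factor(toy$occupation)
is.ordered(toy$occupation)
table(toy$occupation)

# Ordinal: an ordered factor; comparisons work, arithmetic does not
is.ordered(toy$health)
toy$health[1] < toy$health[2]   # is "poor" below "good"?

# Quantitative: numeric, so the mean is meaningful
is.numeric(toy$wage)
mean(toy$wage)
```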

1.2.1. Numerical summaries of categorical variables

Categorical variables typically have few values and we can use frequency tables to summarise them:

  • Frequency table:
table(nlsw88$occupation)
  • Better version:
library(epiDisplay)
tab1(nlsw88$occupation)
> tab1(nlsw88$occupation)
nlsw88$occupation : 
                       Frequency   %(NA+)   %(NA-)
Professional/Technical       317     14.1     14.2
Managers/Admin               264     11.8     11.8
Sales                        726     32.3     32.5
Clerical/Unskilled           102      4.5      4.6
Craftsmen                     53      2.4      2.4
Operatives                   246     11.0     11.0
Transport                     28      1.2      1.3
Laborers                     286     12.7     12.8
Farmers                        1      0.0      0.0
Farm laborers                  9      0.4      0.4
Service                       16      0.7      0.7
Household workers              2      0.1      0.1
Other                        187      8.3      8.4
NA's                           9      0.4      0.0
  Total                     2246    100.0    100.0
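
If epiDisplay is not available, similar percentages can be computed in base R with table() and prop.table(). This is a minimal sketch on a made-up factor (occ and its values are assumptions for illustration); it mirrors the two percentage columns above, one counting missing values in the denominator and one excluding them.

```r
# Hypothetical toy factor with one missing value
occ <- factor(c("Sales", "Sales", "Service", "Other", NA, "Sales"))

# Counts, keeping missing values as their own category
counts_with_na <- table(occ, useNA = "ifany")

# Percentages with NA in the denominator (like the %(NA+) column)
pct_with_na <- round(100 * prop.table(counts_with_na), 1)

# Percentages with NA excluded (like the %(NA-) column)
pct_without_na <- round(100 * prop.table(table(occ)), 1)

pct_with_na
pct_without_na
```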

1.2.2. Graphical summaries of categorical variables

We will use the ggplot2 library for graphics.

Categorical variables can usually be summarised with barcharts or piecharts. Piecharts aren't very effective, but they're popular. They're easier without ggplot2 (note how it uses the table() function to calculate the category sizes):

pie(table(nlsw88$occupation))

Bar charts are preferred:

library(ggplot2)
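# Basic bar chart of occupation, flipped so the category labels are readable
ggplot(nlsw88, aes(x=occupation)) + geom_bar() + ggtitle("Occupation") + coord_flip()
# Same chart with a red outline and pink fill
ggplot(nlsw88, aes(x=occupation)) + geom_bar(color="red", fill="pink") + ggtitle("Occupation") + coord_flip()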

1.2.3. Numerical summaries of quantitative variables

Scale and count variables (time, money, weights, counts etc.) often have too many distinct values to summarise in a frequency table. But since they are numerically meaningful we can summarise them with statistics such as the mean, median, standard deviation and range.

summary(nlsw88$wage)
mean(nlsw88$wage)
median(nlsw88$wage)
sd(nlsw88$wage)
quantile(nlsw88$wage)
max(nlsw88$wage) - min(nlsw88$wage)
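
Two practical points are worth a sketch: most of these functions return NA if the variable has any missing values unless na.rm = TRUE is given, and the interquartile range is a useful companion to the range. The vector below is made up for illustration (wage here is a hypothetical toy vector, not the NLSW88 variable).

```r
# Hypothetical wage values with one high value and one missing observation
wage <- c(4.0, 5.5, 6.0, 7.5, 9.0, 12.0, 40.0, NA)

# Without na.rm = TRUE, a single NA makes the result NA
mean(wage)

# Drop missing values before summarising
mean(wage, na.rm = TRUE)
median(wage, na.rm = TRUE)
sd(wage, na.rm = TRUE)

# range() returns the minimum and maximum; diff() gives the distance between them
diff(range(wage, na.rm = TRUE))

# Interquartile range: distance between the 25th and 75th percentiles
IQR(wage, na.rm = TRUE)
```

In this toy example the mean (12) sits well above the median (7.5) because of the single high value, which is one reason to look at both.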

1.2.4. Graphical summaries of quantitative variables

Histograms are a good way of summarising the distribution of a quantitative variable with many distinct values. Here is how, using both base R and ggplot2:

hist(nlsw88$wage)
ggplot(nlsw88, aes(x=wage)) + geom_histogram() + ggtitle("Histogram of Wage")
ggplot(nlsw88, aes(x=wage)) + geom_histogram(color="white", fill="red") + ggtitle("Histogram of Wage")

A boxplot focuses on the median and quartiles, and also shows the range and outliers. For a single variable it is quite sparse; it is more useful in bivariate analysis.

ggplot(nlsw88, aes(x=wage)) + geom_boxplot() + coord_flip()

1.3. Two-way analyses

Now we will work with several ways of analysing two variables together: bivariate analysis.

We will look at three types of combination:

  • categorical by categorical
  • categorical by continuous (interval/ratio)
  • continuous by continuous.

We will consider numerical and graphical techniques.

1.3.1. Categorical by categorical

Cross-tabulation is the easiest, and quite a powerful, method for looking at the relationship between categorical (nominal, ordinal) or grouped data.

Cross-tabulate occupation and collgrad in base R:

table(nlsw88$occupation, nlsw88$collgrad)

Once again epiDisplay does a nicer job:

tabpct(nlsw88$occupation, nlsw88$collgrad, percent="row", graph=FALSE)

Get either row or column percentages by adding the option percent="row" or percent="col".
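
To make the difference between row and column percentages concrete, here is a minimal base R sketch using prop.table() on a made-up data set (toy and its values are assumptions for illustration, reusing the variable names occupation and collgrad).

```r
# Hypothetical toy data: two categorical variables
toy <- data.frame(
  occupation = c("Sales", "Sales", "Sales", "Service", "Service", "Other"),
  collgrad   = c("yes", "no", "no", "no", "no", "yes")
)

tab <- table(toy$occupation, toy$collgrad)

# Row percentages: each row sums to 100
# (share of graduates and non-graduates within each occupation)
row_pct <- round(100 * prop.table(tab, margin = 1), 1)

# Column percentages: each column sums to 100
# (occupational mix within each collgrad group)
col_pct <- round(100 * prop.table(tab, margin = 2), 1)

row_pct
col_pct
```

Row percentages answer "within this occupation, what share are graduates?", while column percentages answer "among graduates, what share are in this occupation?".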

What patterns do you see in the table? Which percentages (row or column) are easier to interpret?

  1. Graphing

    Create a clustered bar chart for two categorical variables, here occupation and union:

    ggplot(nlsw88, aes(x = occupation, fill = union)) +  geom_bar(position="dodge") + coord_flip()
    

    Awkward way to drop the missing occupations:

    ggplot(subset(nlsw88, !is.na(occupation)), aes(x = occupation, fill = union)) +  geom_bar(position="dodge") + coord_flip()
    

    Clustered means the union/non-union bars are clustered within occupation. If you want to cluster occupation within union, change the aes() part to aes(x = union, fill = occupation). Which is better?

    We can stack bars instead of clustering them, by changing the position keyword:

    ggplot(subset(nlsw88, !is.na(occupation)), aes(x = occupation, fill = union)) +  geom_bar(position="stack") + coord_flip()
    

    Experiment with stacked and clustered, and with which variable to use as the cluster or stack variable. Do you see the same patterns as in the table?

    Another option is to highlight proportions within one category (like row percentages):

    ggplot(subset(nlsw88, !is.na(union)), aes(x = occupation, fill = union)) + geom_bar(position="fill") + coord_flip()
    

1.3.2. Interval/Ratio by Categorical

Sometimes we have a continuous variable that "varies with" a categorical one. Income may vary with gender, or with educational qualifications.

With the NLSW88 data, see how wage varies with variables such as collgrad or occupation. Here for occupation:

aggregate(wage ~ occupation, nlsw88, mean)

Graphically, we can represent this with a barchart where the height of the bar represents the mean of the continuous variable for that value of the categorical one:

ggplot(subset(nlsw88, !is.na(occupation)), aes(occupation, wage)) + geom_bar(stat="summary", fun="mean") + coord_flip()

Box plots focus on medians and quartiles, and give a somewhat more detailed picture of the distribution than just the mean. They are good for comparing distributions across categories.
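
A boxplot by category could be sketched as follows. The data frame here is made up so that the example runs on its own (toy and its values are assumptions); with the lab data loaded, using subset(nlsw88, !is.na(occupation)) in place of toy should give the corresponding plot.

```r
library(ggplot2)

# Hypothetical toy data standing in for nlsw88
toy <- data.frame(
  occupation = rep(c("Sales", "Service", "Other"), each = 5),
  wage       = c(5, 6, 7, 9, 20,  4, 5, 5, 6, 8,  6, 8, 10, 12, 15)
)

# One box per occupation; coord_flip() keeps long category labels readable
p <- ggplot(toy, aes(x = occupation, y = wage)) +
  geom_boxplot() +
  coord_flip() +
  ggtitle("Wage by occupation")
print(p)

# The medians marked by the line inside each box
aggregate(wage ~ occupation, toy, median)
```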

1.3.3. Interval/Ratio by Interval/Ratio

We will find out about numerical methods for summarising the relationship between pairs of interval variables later, but for now the scatterplot is very useful. With the NLSW88 data, compare all three possible pairs of the following variables:

  • Years of education (grade)
  • Wage rate (wage)
  • Work experience (ttl_exp)

We can generate a scatterplot as follows:

ggplot(nlsw88, aes(y=wage, x=ttl_exp)) + geom_point()

We can improve the appearance (colour, transparency, titles and labels) as follows:

ggplot(nlsw88, aes(y=wage, x=ttl_exp)) + geom_point(color="darkred", alpha=0.4) + ggtitle("Scatter of Wage and Experience") +
  ylab("Wage Rate") +
  xlab("Years of work experience")
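
To look at all three pairs at once, base R's pairs() function draws a grid of scatterplots. This is a sketch on a made-up data frame (toy and its values are assumptions, reusing the variable names grade, wage and ttl_exp); with the lab data loaded, pairs(nlsw88[, vars]) should do the same for the real variables.

```r
# Hypothetical toy data using the three variable names from the lab
toy <- data.frame(
  grade   = c(12, 12, 16, 10, 18, 14),
  wage    = c(6.5, 7.0, 12.0, 5.0, 15.5, 9.0),
  ttl_exp = c(10, 14, 8, 15, 6, 12)
)

vars <- c("grade", "wage", "ttl_exp")

# List the three possible pairs of variables (one pair per column)
combn(vars, 2)

# Draw every pairwise scatterplot in a single grid
pairs(toy[, vars])
```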

Date: Sep 18 2023

Author: Brendan Halpin, Department of Sociology, University of Limerick

Created: 2025-09-15 Mon 09:48
