
1. Week 2 Lab

1.1. Tables in R

R has lots of ways of making tables, but the results in base R are unsatisfactory. I've written a function to make slightly better tables. Download the tabxw.R file (right-click and "SAVE AS": don't open and save, Windows will mess it up) and save it in your working folder (I recommend you use something like C:\Users\yourname\so5032\labs\ or similar).

(Note that running R code downloaded from the web can be dangerous: make sure you know its origin.)

Use code like this to move R to your working folder and load the code:

getwd() ## show where you are
setwd("C:/Users/yourname/so5032/labs/") ## note "/" not "\"
source("tabxw.R")

Once the function is loaded, read in the NLSW88 data and try the following:

library(foreign)
nlsw88 <- read.dta("https://teaching.sociology.ul.ie/so5041/nlsw88.dta")
tabxw(nlsw88, occupation, union)

For percentages:

tabxw(nlsw88, occupation, union, tabtype="row")
tabxw(nlsw88, occupation, union, tabtype="col")

Expected values and residuals:

tabxw(nlsw88, occupation, union, tabtype="expected")
tabxw(nlsw88, occupation, union, tabtype="resid")

1.2. Reminder: recoding data

Occupation has a number of small categories. We can use recode() to create a new variable (requires the dplyr library). See SO5041 lab 4 for more on recoding.

library(dplyr)
nlsw88$occ2 <- recode(nlsw88$occupation,
                      "Transport" = "Other",
                      "Farmers" = "Other",
                      "Farm laborers" = "Other",
                      "Service" = "Other",
                      "Household workers" = "Other",
                      "Other" = "Other")
tabxw(nlsw88, occ2, union, tabtype="observed", chisq=TRUE)

Note that the chi-sq warning we get with occupation no longer occurs.

1.3. Putting printed tables into R

We can enter tables into R quite easily using the following strategy. Given a table that looks like this:

	Agree	Disagree	Total
Male	122	223	345
Female	268	1632	1900
Total	390	1855	2245

we can put it into R like this:

source("tabxw.R")
tabdata <- data.frame(n = c(122, 223, 268, 1632),
                      gender = factor(c(1,1,2,2), labels=c("Male", "Female")),
                      att = factor(c(1,2,1,2), labels=c("Agree", "Disagree")))
tabxw(tabdata, gender, att, wt=n)

       Agree Disagree Total
Male     122      223   345
Female   268     1632  1900
Total    390     1855  2245

Run this syntax, and run a χ² test:

tabxw(tabdata, gender, att, wt = n, chisq=TRUE)
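For comparison, the same quantities can be obtained in base R. This is a minimal sketch, not a description of how tabxw() works internally (the lab does not say); it rebuilds the table with xtabs() and reads the expected counts and residuals from chisq.test(). Whether tabxw() applies a continuity correction is not stated, so correct = FALSE is an assumption.

```r
# Rebuild the 2x2 table from section 1.3 (same counts as above)
tabdata <- data.frame(n = c(122, 223, 268, 1632),
                      gender = factor(c(1, 1, 2, 2), labels = c("Male", "Female")),
                      att = factor(c(1, 2, 1, 2), labels = c("Agree", "Disagree")))
tab <- xtabs(n ~ gender + att, data = tabdata)

# Pearson chi-squared test without Yates' continuity correction (an assumption)
res <- chisq.test(tab, correct = FALSE)

expected  <- res$expected    # row total * column total / grand total
raw_resid <- tab - expected  # observed minus expected
adj_resid <- res$stdres      # standardised (adjusted) residuals

print(res)
print(round(expected, 1))
print(round(raw_resid, 1))
print(round(adj_resid, 2))
```

The raw residuals always sum to zero across the table; the adjusted residuals are on a roughly standard-normal scale, which makes individual cells easier to compare.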

1.3.1. Aside: expanding the table

We can use uncount() (in tidyr, loaded by tabxw.R) to turn this table into a data frame with 2245 cases:

tablong <- uncount(tabdata, n)
chisq.test(table(tablong$gender, tablong$att))

This can be useful, but it is an inefficient way of representing the data (2245 lines instead of 4).

1.4. Analysing a real table

This is a table relating social class of origin and highest educational qualification:

                   |               qual
             class |      Univ  2nd level  Incomplet |     Total
-------------------+---------------------------------+----------
          Prof/Man |      1025       1566        767 |      3358 
Routine non-manual |       124        687        713 |      1524 
    Skilled manual |        31        483        464 |       978 
    Semi/unskilled |        18        361        716 |      1095 
-------------------+---------------------------------+----------
             Total |      1198       3097       2660 |      6955 

Source: British Household Panel Survey 2001

By hand, calculate the odds ratio comparing prof/man versus semi/unskilled in their chances of having a university education (university versus anything else). Interpret it.

Do the same for routine-non-manual versus semi/unskilled and skilled versus semi/unskilled. Is there a pattern in the three ORs?
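As a reminder of the arithmetic, here is a sketch of an odds ratio calculation on the smaller gender-by-attitude table from section 1.3 (not the class table, which is left as the exercise). The variable names are illustrative only.

```r
# Counts from the gender-by-attitude table in section 1.3
male_agree   <- 122;  male_disagree   <- 223
female_agree <- 268;  female_disagree <- 1632

# Odds of agreeing within each group
odds_male   <- male_agree / male_disagree
odds_female <- female_agree / female_disagree

# Odds ratio: male odds relative to female odds
odds_ratio <- odds_male / odds_female

# Equivalent cross-product form, convenient for hand calculation
cross_product <- (male_agree * female_disagree) / (male_disagree * female_agree)

print(c(odds_male = odds_male, odds_female = odds_female, odds_ratio = odds_ratio))
```

An odds ratio above 1 here means men have higher odds of agreeing than women. The same steps apply to each pair of classes once qual is collapsed to university versus anything else.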

Enter the table in R using the strategy outlined above, and use tabxw to do the following (this assumes you name the data frame tabdat2, and the row and column variables class and qual respectively):

- Analyse the pattern of percentages: tabxw(tabdat2, class, qual, wt=n, tabtype="row")
- Analyse the pattern of expected values and raw residuals:
  tabxw(tabdat2, class, qual, wt=n, tabtype="expected")
  tabxw(tabdat2, class, qual, wt=n, tabtype="resid")
- Analyse the adjusted residuals: tabxw(tabdat2, class, qual, wt=n, tabtype="adjres")
- Run the χ² test and interpret it
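The data entry step for this table might look like the following sketch, which applies the strategy from section 1.3 to the 4x3 table. The counts are taken from the printed table; the use of rep() to generate the row and column codes is just one convenient choice, and the third qual label is kept as truncated in the printed output.

```r
# Cell counts read row by row from the printed table
tabdat2 <- data.frame(
  n = c(1025, 1566, 767,   # Prof/Man
         124,  687, 713,   # Routine non-manual
          31,  483, 464,   # Skilled manual
          18,  361, 716),  # Semi/unskilled
  # each class value repeats once per qual column
  class = factor(rep(1:4, each = 3),
                 labels = c("Prof/Man", "Routine non-manual",
                            "Skilled manual", "Semi/unskilled")),
  # qual cycles through its three values within each class
  qual = factor(rep(1:3, times = 4),
                labels = c("Univ", "2nd level", "Incomplet"))
)

# Check the entry against the printed margins before analysing
check <- xtabs(n ~ class + qual, data = tabdat2)
print(addmargins(check))
```

Comparing the margins with the printed totals (6955 overall) is a quick way to catch typing errors before running tabxw().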

1.5. Astrology: see last week's lab

The data from which the zodiac graphic in last week's lab was drawn can be accessed as follows:

astro <- read.dta("https://teaching.sociology.ul.ie/so5032/astrogss.dta")
tabxw(astro, zodiac, astrosci)

(Also a right-click download: save locally and do astro <- read.dta("astrogss.dta").)

This data has survey weights, a value for each case to weight it up or down according to how much it is under- or over-represented in the sample compared to, e.g., Census figures. We can incorporate the weights (in variable wtss) as follows:

tabxw(astro, zodiac, astrosci, wt=wtss)
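To see what weighting does to a table, here is a small sketch in base R. The data frame toy and its values are entirely made up for illustration (they are not from the astrology data), and xtabs() stands in for tabxw(): each cell of the weighted table is the sum of the weights of the cases in it, rather than a count of cases.

```r
# Hypothetical toy data: six cases with made-up weights
toy <- data.frame(
  sign = factor(c("Aries", "Aries", "Leo", "Leo", "Leo", "Aries")),
  view = factor(c("Yes", "No", "Yes", "Yes", "No", "Yes")),
  w    = c(0.5, 1.5, 1.0, 2.0, 0.5, 1.0)
)

# Unweighted: every case counts as 1
unweighted <- xtabs(~ sign + view, data = toy)

# Weighted: every case counts as its weight
weighted <- xtabs(w ~ sign + view, data = toy)

print(unweighted)
print(weighted)
```

Cases with weights below 1 are scaled down and those above 1 are scaled up, so the weighted percentages can differ from the unweighted ones even though the cases are the same.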

See if you can produce the table from last week (row percentages), and the table of adjusted residuals. Can you get the adjusted residuals to match the colour scheme?

1.6. An ordinal view of class and education

Return to the class/qualification table. Note that both variables have an ordinal interpretation. There may be a simple association between them (higher with higher, lower with lower). Begin by calculating the Pearson correlation and the Spearman rank correlation. To do this it is convenient to use the expanded version of the table, but we need to force R to ignore the fact that class and qual are categorical variables, i.e., factors in R terms (this is what as.numeric() does: it returns each factor's level codes as numbers).

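tablong <- uncount(tabdat2, n) ## expand the class/qual table (the earlier tablong held gender and att)
cor(as.numeric(tablong$class), as.numeric(tablong$qual))
cor(as.numeric(tablong$class), as.numeric(tablong$qual), method="spearman")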

Run and interpret the gamma test. What does it tell you? Compare the pattern of association shown in the adjusted residuals with the gamma, and consider which gives you the better summary.
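The lab does not show which command produces gamma, so the following is a hand-rolled sketch of the Goodman-Kruskal gamma calculation from concordant and discordant pairs. The helper name gk_gamma is hypothetical, and it is demonstrated on the 2x2 table from section 1.3 rather than on the class table.

```r
# Goodman-Kruskal gamma for a table whose rows and columns are in ordinal order
gk_gamma <- function(tab) {
  tab <- as.matrix(tab)
  nr <- nrow(tab)
  nc <- ncol(tab)
  conc <- 0  # concordant pairs: second case is higher on both variables
  disc <- 0  # discordant pairs: second case is higher on one, lower on the other
  for (i in seq_len(nr)) {
    for (j in seq_len(nc)) {
      if (i < nr && j < nc) {
        conc <- conc + tab[i, j] * sum(tab[(i + 1):nr, (j + 1):nc])
      }
      if (i < nr && j > 1) {
        disc <- disc + tab[i, j] * sum(tab[(i + 1):nr, 1:(j - 1)])
      }
    }
  }
  (conc - disc) / (conc + disc)
}

# Gender-by-attitude counts from section 1.3
small <- matrix(c(122, 223,
                  268, 1632),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Male", "Female"), c("Agree", "Disagree")))
g <- gk_gamma(small)
print(g)
```

For a 2x2 table gamma reduces to (ad - bc)/(ad + bc). To apply gk_gamma() to the class-by-qual table, pass the 4x3 matrix of counts with rows and columns in their ordinal order.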

1.7. Spurious association and suppression

Use the scouting example (right-click, SAVE AS) to explore an association that differs when you take account of more variables.

source("scout.R")
tabxw(scouting, scout, delinq, wt=n)

Start by calculating a measure of association for the scouting by delinquency table, then for each of the sub-panels (odds ratios would be a good idea, since this is 2×2). Then do the same for the three subtables in the church by scouting by delinquency table. Finally, figure out how the three subtables without association add up into a 2-way table with association.

1.8. Tables and complex association

Load and run this R file:

source("dpbig.R")

Agresti uses data on race and the death penalty (loaded by the code above) to illustrate the possible complexity of a three-way relationship. The data classify the sentence handed down in murder trials in Florida by the defendant's race and the victim's race.

Look first at the defendant/penalty table, then at the three-way table (i.e., defendant/penalty controlling for victim's race). Calculating odds ratios would also be useful here. What is going on with this data set?
