PC Labs for SO5032
Week 3 Lab Multiple regression
Maths and Height
Load the following data:
library(foreign)
mh <- read.dta("http://teaching.sociology.ul.ie/so5032/mathsheight.dta")
This is artificial data, where maths is score in a maths test, and height is height in centimetres. The sample is the whole student body of a secondary school, from first years to sixth years. Examine the correlations between the variables, numerically and graphically. Then regress maths on height. Interpret the output, and relate it to the scatter plot.
Correlation:
cor(mh[,c("maths", "height")])
or
cor(mh$maths,mh$height)
Scatterplot:
library(ggplot2)
ggplot(mh, aes(x = height, y = maths)) + geom_point()
Regression:
mod1 <- lm(data=mh, maths ~ height)
summary(mod1)
Draw the regression line on the scatterplot:
ggplot(mh, aes(x = height, y = maths)) +
  geom_point() +
  geom_function(fun=function(x) mod1$coefficients["(Intercept)"] +
                  mod1$coefficients["height"]*x, size=2, color="red")
Consider controlling for year. First, compare the maths/height scatterplot across year:
ggplot(subset(mh, year == 1), aes(x = height, y = maths)) + geom_point()
ggplot(subset(mh, year == 2), aes(x = height, y = maths)) + geom_point()
. . .
Or better, all in one:
ggplot(mh, aes(x = height, y = maths)) + geom_point() + facet_grid(year ~ .)
What does this tell you? Do the within-year correlations make it clearer?
tempdf <- subset(mh, year == 1); cor(tempdf$maths, tempdf$height)
tempdf <- subset(mh, year == 2); cor(tempdf$maths, tempdf$height)
. . .
Then fit the regression including year as well as height as explanatory variables:
summary(lm(data=mh, maths ~ height + year))
Interpret the output.
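To see why controlling for year can change the picture, here is a minimal sketch on simulated data. It does not use the lab file: the data frame sim and every number in it are made up, and it is built so that height and maths both rise with year but are unrelated within a year. Whether mathsheight.dta behaves the same way is for you to check.

```r
# Simulated stand-in for the lab data: all names and numbers are hypothetical.
set.seed(1)
n_per_year <- 100
year <- rep(1:6, each = n_per_year)

# Height and maths both rise with year, but are unrelated within a year.
height <- 140 + 6 * year + rnorm(length(year), sd = 6)
maths  <- 30 + 8 * year + rnorm(length(year), sd = 8)
sim <- data.frame(year, height, maths)

# Pooled correlation, ignoring year.
cor_all <- cor(sim$maths, sim$height)

# One correlation per year group.
cor_by_year <- sapply(split(sim, sim$year),
                      function(d) cor(d$maths, d$height))

# Height coefficient without and with the control for year.
b_raw <- coef(lm(maths ~ height, data = sim))["height"]
b_adj <- coef(lm(maths ~ height + year, data = sim))["height"]

print(round(cor_all, 2))
print(round(cor_by_year, 2))
print(round(c(raw = unname(b_raw), adjusted = unname(b_adj)), 2))
```

In this simulation the pooled correlation is strong while the within-year correlations hover around zero, and the height coefficient shrinks sharply once year enters the model.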
Hours, gender and income
We will use a small extract from the British Household Panel Survey with information on hours worked, income and gender. Load it into R as follows:
library(foreign)
oind <- read.dta("http://teaching.sociology.ul.ie/so5032/oind.dta")
First fit a bivariate regression using hours to predict income.
mod1 <- lm(data=oind, Income ~ Hours)
Then do a t-test comparing income across gender:
t.test(oind$Income ~ oind$Gender, var.equal=TRUE)
Compare the results you get by regression:
summary(lm(data=oind, Income ~ Gender))
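The equal-variance t-test and the regression on a two-category predictor are the same test. A minimal sketch on simulated data (the data frame sim and the level names "male" and "female" are assumptions, not taken from oind.dta):

```r
# Hypothetical stand-in for oind: variable names mirror the lab, values are simulated.
set.seed(2)
sim <- data.frame(Gender = factor(rep(c("male", "female"), each = 50)))
sim$Income <- 1000 + 200 * (sim$Gender == "male") + rnorm(100, sd = 300)

tt  <- t.test(Income ~ Gender, data = sim, var.equal = TRUE)
mod <- summary(lm(Income ~ Gender, data = sim))

# Factor levels sort alphabetically, so "female" is the reference category
# and the regression coefficient is named "Gendermale".
t_ttest <- unname(tt$statistic)
t_reg   <- mod$coefficients["Gendermale", "t value"]
p_ttest <- tt$p.value
p_reg   <- mod$coefficients["Gendermale", "Pr(>|t|)"]

# Same magnitude; the sign depends on which group is subtracted from which.
print(c(t_ttest = t_ttest, t_reg = t_reg))
print(c(p_ttest = p_ttest, p_reg = p_reg))
```

The regression coefficient is the difference between the two group means, and the intercept is the mean of the reference group.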
Now look again at the bivariate regression of income on hours:
summary(lm(data=oind, Income ~ Hours))
Now fit the regression with both explanatory variables:
summary(lm(data=oind, Income ~ Hours + Gender))
Interpret the results.
Draw the two regression lines on paper.
Interactions
Fit separate models for men and women, and attempt to draw the regression lines:
summary(lm(data=subset(oind, Gender=="male"), Income ~ Hours))
summary(lm(data=subset(oind, Gender=="female"), Income ~ Hours))
The assumption of parallel lines seems not to be justified.
Note that a similar result is obtained by fitting a regression with an interaction term.
summary(lm(data=oind, Income ~ Hours*Gender))
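The link between the separate models and the interaction model can be checked directly: the interaction coefficients are the differences in intercept and slope between the two groups. A minimal sketch on simulated data, where sim, its values and the level names are assumptions rather than the contents of oind.dta:

```r
# Hypothetical stand-in for oind, with different slopes by gender built in.
set.seed(3)
n <- 200
sim <- data.frame(
  Gender = factor(rep(c("female", "male"), each = n / 2)),
  Hours  = runif(n, 10, 50)
)
slope <- ifelse(sim$Gender == "male", 30, 20)
sim$Income <- 200 + slope * sim$Hours + rnorm(n, sd = 100)

# Separate regressions, one per group.
b_f <- coef(lm(Income ~ Hours, data = subset(sim, Gender == "female")))
b_m <- coef(lm(Income ~ Hours, data = subset(sim, Gender == "male")))

# One regression with an interaction term.
b_int <- coef(lm(Income ~ Hours * Gender, data = sim))

# Reference group (female): intercept and slope are read off directly.
female_line <- c(b_int["(Intercept)"], b_int["Hours"])
# Other group (male): add the main effect and the interaction.
male_line <- c(b_int["(Intercept)"] + b_int["Gendermale"],
               b_int["Hours"] + b_int["Hours:Gendermale"])

print(rbind(separate = b_f, interaction = female_line))
print(rbind(separate = b_m, interaction = male_line))
```

The two approaches give the same fitted lines; what the interaction model adds is a single test, on the Hours:Gender coefficient, of whether the slopes differ.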
Direct and indirect effects
indir <- read.dta("https://teaching.sociology.ul.ie/so5032/indirect.dta")
lm(data=indir, ownscore ~ fatherscore)
lm(data=indir, ownscore ~ education)
lm(data=indir, education ~ fatherscore)
lm(data=indir, ownscore ~ education + fatherscore)
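One way to read these regressions is as a decomposition: the total effect of fatherscore on ownscore equals the direct effect (its coefficient when education is controlled) plus the indirect effect (the effect of fatherscore on education times the effect of education on ownscore). With OLS this holds exactly. A minimal sketch on simulated data, where the variable names follow the lab but all values and effect sizes are made up:

```r
# Hypothetical stand-in for indirect.dta: names mirror the lab, numbers are simulated.
set.seed(4)
n <- 500
fatherscore <- rnorm(n, mean = 50, sd = 10)
education   <- 5 + 0.2 * fatherscore + rnorm(n, sd = 2)
ownscore    <- 10 + 0.3 * fatherscore + 2 * education + rnorm(n, sd = 5)
sim <- data.frame(fatherscore, education, ownscore)

# Total effect: fatherscore alone.
total <- coef(lm(ownscore ~ fatherscore, data = sim))["fatherscore"]

# Path from fatherscore to education.
a <- coef(lm(education ~ fatherscore, data = sim))["fatherscore"]

# Full model: direct effect of fatherscore, and effect of education.
full   <- coef(lm(ownscore ~ education + fatherscore, data = sim))
direct <- full["fatherscore"]
b      <- full["education"]

indirect <- a * b

print(c(total = unname(total),
        direct = unname(direct),
        indirect = unname(indirect),
        direct_plus_indirect = unname(direct + indirect)))
```

The same arithmetic can be applied to the coefficients from the four models fitted on indir.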