PC Labs for SO5032

Week 3 Lab Multiple regression

Maths and Height

Load the following data:

library(foreign)
library(ggplot2)
mh <- read.dta("http://teaching.sociology.ul.ie/so5032/mathsheight.dta")

This is artificial data, where maths is the score in a maths test and height is height in centimetres. The sample is the whole student body of a secondary school, from first years to sixth years. Examine the correlation between the two variables, numerically and graphically. Then regress maths on height. Interpret the output, and relate it to the scatter plot.

Correlation:

cor(mh[,c("maths", "height")])

cor(mh$maths,mh$height)

Scatterplot:

ggplot(mh, aes(x = height, y = maths)) + geom_point()

Regression:

mod1 <- lm(data=mh, maths ~ height)
summary(mod1)

Draw the regression line on the scatterplot:

# linewidth replaces size for lines in ggplot2 3.4.0 and later
ggplot(mh, aes(x = height, y = maths)) +
    geom_point() +
    geom_function(fun=function(x) coef(mod1)["(Intercept)"] +
                                  coef(mod1)["height"]*x, linewidth=2, color="red")

Consider controlling for year. First, compare the maths/height scatterplot across years:

ggplot(subset(mh, year == 1), aes(x = height, y = maths)) + geom_point()
ggplot(subset(mh, year == 2), aes(x = height, y = maths)) + geom_point()
. . .

Or better, all in one:

ggplot(mh, aes(x = height, y = maths)) + geom_point() + facet_grid(year ~ .)

What does this tell you? Do the following commands make it clearer?

tempdf <- subset(mh, year == 1); cor(tempdf$maths, tempdf$height)
tempdf <- subset(mh, year == 2); cor(tempdf$maths, tempdf$height)
. . .

Then fit the regression including year as well as height as explanatory variables: lm(data=mh, maths ~ height + year). Interpret the output.
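To see why controlling for year can change the height coefficient, here is a minimal sketch using simulated data. The numbers are invented for illustration and are not taken from mathsheight.dta; the sketch assumes that both height and maths rise with year, and that height has no effect of its own.

```r
# Simulated data only: invented values, not the lab data set
set.seed(1)
n      <- 600
year   <- sample(1:6, n, replace = TRUE)
height <- 140 + 6 * year + rnorm(n, sd = 7)   # older pupils are taller
maths  <- 30 + 8 * year + rnorm(n, sd = 10)   # older pupils score higher; height plays no role

# Height coefficient without and with year in the model
b_simple  <- coef(lm(maths ~ height))["height"]
b_control <- coef(lm(maths ~ height + year))["height"]

print(c(simple = unname(b_simple), controlled = unname(b_control)))
```

In this sketch the simple regression shows a clear positive height effect, which shrinks towards zero once year is in the model. Compare this pattern with what you find in the real data.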

Hours, gender and income

We will use a small extract from the British Household Panel Survey with information on hours worked, income and gender. Load it into R as follows:

library(foreign)
oind <- read.dta("http://teaching.sociology.ul.ie/so5032/oind.dta")

First fit a bivariate regression using hours to predict income.

mod1 <- lm(data=oind, Income ~ Hours)

Then do a t-test comparing income across gender:

t.test(oind$Income ~ oind$Gender, var.equal=TRUE)

Compare with the results you get by regression:

summary(lm(data=oind, Income ~ Gender))
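The equal-variance t-test and the regression on a two-category variable are the same analysis in different clothing. The sketch below checks this on simulated data; the variable names and values are hypothetical and do not come from oind.dta.

```r
# Simulated data only: invented values, not the BHPS extract
set.seed(2)
gender <- factor(rep(c("female", "male"), each = 50))
income <- 1000 + 200 * (gender == "male") + rnorm(100, sd = 300)

tt  <- t.test(income ~ gender, var.equal = TRUE)
fit <- summary(lm(income ~ gender))$coefficients

# Difference in group means (male minus female) versus the regression coefficient
diff_means <- unname(tt$estimate[2] - tt$estimate[1])
coef_male  <- fit["gendermale", "Estimate"]

# The t statistics agree up to sign: t.test() takes first level minus second,
# while the regression coefficient is second level minus first
t_test <- unname(tt$statistic)
t_reg  <- fit["gendermale", "t value"]

print(c(diff_means = diff_means, coef_male = coef_male))
print(c(t_test = t_test, t_reg = t_reg))
print(c(p_test = tt$p.value, p_reg = fit["gendermale", "Pr(>|t|)"]))
```

Note that the equivalence relies on var.equal=TRUE; the default Welch test in R does not assume equal variances and will generally give a slightly different t statistic and p-value.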

Now refit the bivariate regression of income on hours, this time viewing the summary:

summary(lm(data=oind, Income ~ Hours))

Now fit the multiple regression with both explanatory variables:

summary(lm(data=oind, Income ~ Hours + Gender))

Interpret the results.

Draw the two regression lines on paper.
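As an aid to drawing the lines, the fitted model can be written out for each group. The notation here is an assumption for illustration: G is a 0/1 indicator for gender (R treats the first factor level as the reference category, G = 0), and b_0, b_1, b_2 are the estimated intercept, Hours coefficient and Gender coefficient from the output.

```latex
\begin{align*}
\widehat{\text{Income}} &= b_0 + b_1\,\text{Hours} + b_2\,G \\
G = 0:\quad \widehat{\text{Income}} &= b_0 + b_1\,\text{Hours} \\
G = 1:\quad \widehat{\text{Income}} &= (b_0 + b_2) + b_1\,\text{Hours}
\end{align*}
```

The two lines share the slope b_1 and differ only in their intercepts, by b_2, so they are parallel by construction.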

Interactions

Fit separate models for men and women, and attempt to draw the two regression lines:

summary(lm(data=subset(oind, Gender=="male"), Income ~ Hours))
summary(lm(data=subset(oind, Gender=="female"), Income ~ Hours))

The assumption of parallel lines seems not to be justified.

Note that the same two regression lines are obtained by fitting a single regression with an interaction term:

summary(lm(data=oind, Income ~ Hours*Gender))
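The link between the separate models and the interaction model can be checked directly. This is a minimal sketch on simulated data (invented values and lower-case variable names, not the BHPS extract), assuming a two-level gender factor with "female" as the reference level.

```r
# Simulated data only: invented values, not the BHPS extract
set.seed(3)
n      <- 200
gender <- factor(rep(c("female", "male"), each = n / 2))
hours  <- runif(n, 10, 50)
income <- ifelse(gender == "male", 200 + 30 * hours, 300 + 20 * hours) +
          rnorm(n, sd = 100)
d <- data.frame(income, hours, gender)

# Separate models for each group
b_f <- coef(lm(income ~ hours, data = subset(d, gender == "female")))
b_m <- coef(lm(income ~ hours, data = subset(d, gender == "male")))

# One model with an interaction
b_i <- coef(lm(income ~ hours * gender, data = d))

# Reference group (female): intercept and slope are read off directly
female_line <- unname(b_i[c("(Intercept)", "hours")])
# Other group (male): add the gender and interaction terms
male_line <- unname(c(b_i["(Intercept)"] + b_i["gendermale"],
                      b_i["hours"] + b_i["hours:gendermale"]))

print(rbind(separate = unname(b_f), interaction = female_line))
print(rbind(separate = unname(b_m), interaction = male_line))
```

The intercepts and slopes match exactly. The standard errors will generally differ, because the interaction model pools the residual variance across both groups.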

Direct and indirect effects

indir <- read.dta("https://teaching.sociology.ul.ie/so5032/indirect.dta")
lm(data=indir, ownscore ~ fatherscore)
lm(data=indir, ownscore ~ education)
lm(data=indir, education ~ fatherscore)
lm(data=indir, ownscore ~ education + fatherscore)
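One way to read these four models together is to split the total effect of fatherscore into a direct part and an indirect part that runs through education. The sketch below uses the same variable names but simulated values (invented for illustration, not taken from indirect.dta).

```r
# Simulated data only: invented values, not the lab data set
set.seed(4)
n           <- 500
fatherscore <- rnorm(n, mean = 50, sd = 10)
education   <- 5 + 0.2 * fatherscore + rnorm(n, sd = 2)
ownscore    <- 10 + 0.3 * fatherscore + 2 * education + rnorm(n, sd = 8)

# Total effect: fatherscore alone
total <- coef(lm(ownscore ~ fatherscore))["fatherscore"]

# Path from fatherscore to education
a <- coef(lm(education ~ fatherscore))["fatherscore"]

# Model with both predictors: direct effect, and the effect of education
full     <- coef(lm(ownscore ~ education + fatherscore))
direct   <- full["fatherscore"]
indirect <- a * full["education"]

print(c(total = unname(total), direct = unname(direct),
        indirect = unname(indirect), sum = unname(direct + indirect)))
```

For linear models fitted by least squares on the same cases, the total effect equals the direct effect plus the indirect effect exactly. Check whether the coefficients from the lab data follow the same pattern.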