SO5032: Lab Materials

1. Week 8 Lab
- 1.1. Logging the dependent variable
- 1.2. Data Archives
  - 1.2.1. Key links

1. Week 8 Lab

1.1. Logging the dependent variable

Linear regression requires the assumption that the residuals are (approximately) normally distributed, and that their variance (spread) is not associated with any of the explanatory variables. One way this assumption is violated is when the dependent variable has a multiplicative (rather than an additive) relationship with the explanatory variables: random variation around high predictions will then be bigger than around low predictions. This is the case with wage in the NLSW-88 data.

Work through this example as presented, making sure you understand what is happening at each step.

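# read.dta() from the foreign package reads Stata files
library(foreign)
library(ggplot2)
nlsw88 <- read.dta("https://teaching.sociology.ul.ie/so5041/nlsw88.dta")

# Linear model of wage on total work experience
modlin <- lm(data=nlsw88, wage ~ ttl_exp)
nlsw88$plin <- predict(modlin)             # predicted wage
nlsw88$rlin <- nlsw88$wage - nlsw88$plin   # residual: observed minus predicted

# Residuals against experience: look for spread that changes with ttl_exp
ggplot(data=nlsw88, aes(x=ttl_exp, y=rlin)) + geom_point()

# Distribution of the residuals, with ggplot2 and with base R
ggplot(data=nlsw88, aes(x=rlin)) + geom_histogram()

hist(nlsw88$rlin)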

The variation around the predicted wage is higher for people with more experience (and therefore higher predicted wages) than for people with less experience. This is "heteroscedasticity": a violation of the assumption of homoscedasticity (constant residual variance).
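To see how a multiplicative relationship produces this pattern, here is a minimal sketch using simulated data rather than NLSW-88. The variable names (`exper`, `wage_sim`) and the parameter values are made up for illustration. It fits a linear model to data generated multiplicatively and compares the spread of the residuals in the lower and upper halves of experience.

```r
# Simulated (hypothetical) data: wage depends multiplicatively on experience
set.seed(1)
n <- 2000
exper <- runif(n, 0, 30)
wage_sim <- exp(1 + 0.05 * exper + rnorm(n, sd = 0.5))

# Fit a linear model on the original (unlogged) scale
mod_sim <- lm(wage_sim ~ exper)
res_sim <- residuals(mod_sim)

# Compare the residual spread below and above median experience
low <- exper <= median(exper)
sd_low  <- sd(res_sim[low])
sd_high <- sd(res_sim[!low])
c(low = sd_low, high = sd_high)
```

The residual standard deviation should be clearly larger in the high-experience half, which is the same fan shape seen in the scatterplot of rlin against ttl_exp.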

If the relationship is multiplicative (i.e., a unit increase in experience tends to increase wage by a percentage, rather than a fixed amount) we can cope with this by taking the log of the dependent variable:

nlsw88$lw <- log(nlsw88$wage)
modlog <- lm(data=nlsw88, lw ~ ttl_exp)
nlsw88$plog <- predict(modlog)
nlsw88$rlog <- nlsw88$lw - nlsw88$plog
ggplot(data=nlsw88, aes(x=ttl_exp, y=rlog)) + geom_point()

We interpret the slope coefficient as the additive effect on the log of wage. Thus the antilog of the slope coefficient is a multiplicative effect on wage: each additional unit of experience multiplies the predicted wage by that factor.
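As an illustration of this interpretation, the following sketch uses simulated data in which the true slope on the log scale is 0.04 (all names and values here are hypothetical, not taken from NLSW-88). It converts the estimated slope into a multiplier and a percentage change.

```r
# Simulated (hypothetical) log wages with a true slope of 0.04 per year
set.seed(2)
n <- 2000
exper <- runif(n, 0, 30)
lw_sim <- 1.5 + 0.04 * exper + rnorm(n, sd = 0.4)

mod_sim <- lm(lw_sim ~ exper)
b <- unname(coef(mod_sim)["exper"])  # additive effect on log wage
mult <- exp(b)                       # multiplicative effect on wage
pct <- 100 * (mult - 1)              # the same effect as a percentage
c(slope = b, multiplier = mult, percent = pct)
```

With the real data, the same calculation would be exp(coef(modlog)["ttl_exp"]). For small slopes the percentage is close to 100 times the slope itself, which is why log-scale coefficients are often read directly as approximate proportional effects.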

There is a small problem when it comes to predictions. We can get the predicted log wage directly, but if we just take the antilog it will tend to be too low. (Simply put, the result is the antilog of the average of the logged data, which is not the same as the average of the original data.)

nlsw88$eplog <- exp(nlsw88$plog)
summary(nlsw88$plin)
summary(nlsw88$eplog)

We see that while the average of the linear regression prediction is close to the average of the data, the average of the antilog of the log prediction is too low. We need to apply a small adjustment based on the spread of the data around the regression line, measured by the root mean square error (RMSE).

The RMSE is given in the summary() output (where it is labelled "Residual standard error"), and can be accessed as summary(modlog)$sigma.
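The form of the adjustment can be motivated by the mean of a log-normal distribution. The sketch below assumes that the residuals on the log scale are normally distributed with constant variance; the symbols (w for wage, x for experience, a and b for the intercept and slope, sigma for the residual standard deviation) are notation introduced here, not taken from the lab.

```latex
\begin{align*}
\log(w_i) &= a + b\,x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \\
E[w_i \mid x_i] &= e^{a + b x_i}\; E\!\left[e^{\varepsilon_i}\right]
                 = e^{a + b x_i}\; e^{\sigma^2/2}
\end{align*}
```

Under these assumptions, the antilog of the predicted log wage needs to be multiplied by exp(sigma^2/2), with the RMSE standing in for sigma. If the log-scale residuals are far from normal, this factor is only approximate.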

# Adjust the antilog prediction by exp(RMSE^2/2)
nlsw88$eplogadj <- nlsw88$eplog*exp(summary(modlog)$sigma^2/2)

# Compare the three sets of predictions with observed wages (up to 20)
ggplot(data=subset(nlsw88, wage<=20)) +
  geom_point(aes(x=ttl_exp, y=wage)) +
  geom_point(aes(x=ttl_exp, y=plin, color="Linear")) +
  geom_point(aes(x=ttl_exp, y=eplog, color="Unadj log")) +
  geom_point(aes(x=ttl_exp, y=eplogadj, color="Adj log"))
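The effect of the adjustment can also be checked numerically. The following sketch uses simulated data with normally distributed errors on the log scale (names such as `wage_sim`, `pred_raw` and `pred_adj` are hypothetical and chosen so as not to overwrite the objects created above). It compares the mean of the observed values with the means of the unadjusted and adjusted predictions.

```r
# Simulated (hypothetical) data with normal errors on the log scale
set.seed(3)
n <- 5000
exper <- runif(n, 0, 30)
wage_sim <- exp(1 + 0.05 * exper + rnorm(n, sd = 0.6))

sim_mod <- lm(log(wage_sim) ~ exper)
pred_raw <- exp(fitted(sim_mod))                        # unadjusted antilog prediction
pred_adj <- pred_raw * exp(summary(sim_mod)$sigma^2 / 2) # adjusted prediction

# Means of the observed values and of the two sets of predictions
c(observed = mean(wage_sim),
  unadjusted = mean(pred_raw),
  adjusted = mean(pred_adj))
```

The unadjusted mean should fall noticeably below the observed mean, while the adjusted mean should be much closer to it. The same comparison can be made on the real data with mean(nlsw88$wage), mean(nlsw88$eplog) and mean(nlsw88$eplogadj).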

1.2. Data Archives

The second item of business today is simply to examine what is available through the Irish Social Science Data Archive website. You are encouraged to browse, particularly through the documentation held there, and to follow your own interests. However, it would be good to explore the main holdings systematically, including resources from the CSO, the ESRI, and so on. It is also worth investigating the external links that the website provides:

https://www.understandingsociety.ac.uk/
