SO5032: Lab Materials
1. Week 8 Lab
1.1. Logging the dependent variable
Linear regression requires the assumption that the residuals are normally distributed, and that their variance (spread) is not associated with any of the explanatory variables. One way this assumption is violated is when the dependent variable has a multiplicative (rather than additive) relationship with the explanatory variables: random variation around high predictions will then be bigger than around low predictions. This is the case with wage in the NLSW-88 data.
Work through this example as presented, making sure you understand what is happening at each step.
sysuse nlsw88
reg wage ttl_exp
predict plin
// View residuals
predict rlin, res
scatter rlin ttl
hist rlin, norm
The variation around the predicted wage is higher for people with higher experience (and therefore higher predicted wages) than for people with lower experience. This is "heteroscedasticity", or violation of the assumption of homoscedasticity.
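The visual impression from the residual plot can be backed up with a formal check. The following is a minimal sketch, not part of the original exercise: it assumes the nlsw88 data are already loaded, and uses Stata's built-in post-estimation commands `estat hettest` (the Breusch-Pagan/Cook-Weisberg test) and `rvfplot` (residuals against fitted values).

```stata
// Assumes nlsw88 is already loaded (sysuse nlsw88)
// Re-run the linear model quietly so it is the active estimation
quietly reg wage ttl_exp

// Breusch-Pagan / Cook-Weisberg test; H0: constant residual variance
estat hettest

// Residuals against fitted values, with a reference line at zero
rvfplot, yline(0)
```

A small p-value from `estat hettest` is evidence against constant variance, which would be consistent with the fanning-out pattern in the scatter plot.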
If the relationship is multiplicative (i.e., a unit increase in experience tends to increase wage by a percentage, rather than a fixed amount) we can cope with this by taking the log of the dependent variable:
gen lw = log(wage)
reg lw ttl_exp
di _b[ttl_exp]
di exp(_b[ttl_exp])
predict rlog, res
scatter rlog ttl
We interpret the slope coefficient as the additive effect on the log of wage. The antilog of the slope coefficient is therefore a multiplicative effect on wage.
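It can be easier to read the multiplicative effect as a percentage change. As a small sketch, assuming the log-wage regression above (`reg lw ttl_exp`) is still the most recent estimation, the percentage change in wage per extra year of experience can be displayed as follows.

```stata
// Assumes "reg lw ttl_exp" is the most recent estimation command
// Exact percentage change in wage per unit increase in ttl_exp
di 100*(exp(_b[ttl_exp]) - 1)

// Approximation that works for small coefficients: 100 times the slope
di 100*_b[ttl_exp]
```

The two numbers should be close when the coefficient is small, and drift apart as it grows.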
There is a small problem when it comes to predictions. We can get the predicted log wage directly, but if we just take the antilog it will tend to be too low. (Simply put, the result is the antilog of the average of the logged data, which is not the same as the average of the original data.)
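The same point can be seen without any regression at all. This is an illustrative sketch that assumes the variable `lw` created earlier (`gen lw = log(wage)`) still exists: it compares the antilog of the mean of the logged wage with the mean of wage itself.

```stata
// Assumes lw = log(wage) has already been generated
// Antilog of the mean of log wage (the geometric mean of wage)
su lw, meanonly
di exp(r(mean))

// Ordinary (arithmetic) mean of wage, for comparison
su wage, meanonly
di r(mean)
```

The first figure cannot exceed the second, since the geometric mean of positive values is never larger than their arithmetic mean.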
predict plog
gen eplog=exp(plog)
label var eplog "Predicted log wage: underestimate"
label var plin "Predicted wage (linear)"
su wage plin eplog
We see that while the average of the linear regression prediction is close to the data, the average of the antilog of the log prediction is too low. We need to apply a small adjustment based on the spread of the data around the regression line, the root mean squared error (RMSE).
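The form of the adjustment can be motivated as follows. This is a sketch under the assumption, not stated explicitly in the lab, that the errors of the log model are normally distributed with constant variance $\sigma^2$; the symbols $\beta_0$, $\beta_1$, $\varepsilon$ and $\sigma$ are notation introduced here for illustration.

```latex
\begin{aligned}
\log(\text{wage}) &= \beta_0 + \beta_1\,\text{ttl\_exp} + \varepsilon,
  \qquad \varepsilon \sim N(0, \sigma^2) \\
E[\text{wage} \mid \text{ttl\_exp}]
  &= e^{\beta_0 + \beta_1\,\text{ttl\_exp}} \, E\!\left[e^{\varepsilon}\right]
   = e^{\beta_0 + \beta_1\,\text{ttl\_exp}} \, e^{\sigma^2/2}
\end{aligned}
```

Under that assumption, the antilog of the predicted log wage needs to be multiplied by $e^{\sigma^2/2}$, with the RMSE of the log regression serving as the estimate of $\sigma$.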
gen eplogadj = eplog*exp(e(rmse)^2/2)
label var eplogadj "Predicted log wage: corrected"
su wage plin eplog eplogadj
scatter wage eplogadj eplog plin ttl if wage<15, msize(0.7 0.1 0.1 0.1)
1.2. Data Archives
Today's second item of business is simply to examine what is available through the Irish Social Science Data Archive website. You are encouraged to browse, particularly through the documentation held there, and follow your own interests. However, it would be good to explore the main holdings systematically, including resources from the CSO, the ESRI, and so on. It is also worth following the outbound links that the website provides:
1.2.1. Key links
- ISSDA: Irish Social Science Data Archive
- CESSDA: Consortium of European Social Science Data Archives
- The CSO
- UK Data Archive
- ZACAT
- Eurobarometer at GESIS (post 2006)
- The ICPSR
- The European Social Survey
- IPUMS
- Generations and Gender Programme