SO5032: Lab Materials
1. Week 7 Lab
1.1. Non-linearity
1.1.1. Sketch a quadratic function
Using pen and paper and/or Excel, plot the curve \(Y=20+0.75X-0.03X^2\).
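As a check on the sketch, the turning point of the curve can be found by setting the slope to zero (a short derivation added for illustration):

\[
\begin{aligned}
\frac{dY}{dX} &= 0.75 - 0.06X = 0 \quad\Rightarrow\quad X = \frac{0.75}{0.06} = 12.5,\\
Y(12.5) &= 20 + 0.75(12.5) - 0.03(12.5)^2 = 20 + 9.375 - 4.6875 = 24.6875.
\end{aligned}
\]

Because the coefficient on \(X^2\) is negative, this is a maximum: the curve starts at \(Y=20\) when \(X=0\), rises to about 24.7 at \(X=12.5\), and is back at \(Y=20\) by \(X=25\).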
1.1.2. Age example
Age often has non-linear effects: how to handle it?
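use http://teaching.sociology.ul.ie/so5032/example1
* Set ages under 15 to missing and drop very high incomes
replace age = . if age < 15
drop if income > 6000
scatter income age, mfcol(blue%05) mlcolor(white%00)
* Linear fit: whole sample, then younger and older groups separately
reg income age
reg income age if age<=34
reg income age if age>34
* Grouped effect: five-year age bands entered as categories
gen ageg = 5*int(age/5)
reg income i.ageg
* Quadratic effect: the c. prefix marks age as continuous
reg income c.age##c.age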
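When the quadratic model is fitted with age treated as continuous, the age at which predicted income peaks is \(-b_1/(2b_2)\), where \(b_1\) and \(b_2\) are the coefficients on age and on age squared. A minimal sketch, assuming the example1 data are loaded as above; the age range passed to margins is a guess at the range in the data and should be adjusted:

```stata
* Quadratic in age; c. tells Stata to treat age as continuous
reg income c.age##c.age

* Turning point of the fitted curve, with a standard error
nlcom -_b[age] / (2 * _b[c.age#c.age])

* Predicted income at selected ages (the range 15 to 75 is an assumption)
margins, at(age = (15(5)75))
marginsplot
```

The marginsplot curve can be compared by eye with the grouped model's coefficients to see whether a quadratic is a reasonable summary of the age pattern.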
1.1.3. Model a non-linear relationship
Run this do-file to load the data for this exercise:
do http://teaching.sociology.ul.ie/so5032/labs/birth
Fit models predicting birth rate using GNP as
- a linear effect
- a quadratic effect (GNP plus squared GNP)
- logged GNP and
- a grouped effect.
Consider the fit of the four models.
Plot the four predicted values as lines/curves on the same graph: how do they compare? Plot the residuals as well.
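One possible way to set up the four models is sketched below. The variable names birthrate and gnp are assumptions (check the actual names with describe after running the do-file), and quartiles are only one of several reasonable ways to group GNP.

```stata
* ASSUMPTION: the outcome is called birthrate and the predictor gnp
reg birthrate gnp                      // linear effect
predict p_lin
predict r_lin, residuals
display "Linear adj. R2:    " e(r2_a)

reg birthrate c.gnp##c.gnp             // quadratic effect
predict p_quad
predict r_quad, residuals
display "Quadratic adj. R2: " e(r2_a)

gen lgnp = ln(gnp)                     // logged GNP
reg birthrate lgnp
predict p_log
predict r_log, residuals
display "Logged adj. R2:    " e(r2_a)

xtile gnpq = gnp, nquantiles(4)        // grouped effect (quartiles, an assumption)
reg birthrate i.gnpq
predict p_grp
predict r_grp, residuals
display "Grouped adj. R2:   " e(r2_a)

* Observed values with the four sets of predictions overlaid
twoway (scatter birthrate gnp) (line p_lin p_quad p_log p_grp gnp, sort)

* Residuals against GNP for one model; repeat for the others
scatter r_quad gnp, yline(0)
```

The grouped model's predictions appear as a step function, which makes it a useful reference for judging whether the smooth curves follow the data.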
1.2. Murder, mayhem and model search
Load this data set for US states, which details various crime and other statistics.
use http://teaching.sociology.ul.ie/so5032/labs/agrestistates
Explore the data set and search for a good model to predict the murder rate, using t-tests, adjusted \(R^2\) and F-tests as appropriate. Start with bivariate regressions, then move on to fuller models.
Do you observe multicollinearity, that is, correlated explanatory variables that each have a strong effect when entered alone but are both weak when entered together?
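A few standard commands help with this kind of exploration. The sketch below uses the variable names mr, m, w and h that appear in the residuals example in this lab; it is one illustrative set of predictors, not a recommended final model.

```stata
* Overview of the variables in the data set
describe
summarize

* A bivariate regression as a starting point
reg mr m

* Pairwise correlations among candidate explanatory variables
correlate m w h

* Fuller model, followed by variance inflation factors
reg mr m w h
estat vif
```

High correlations between predictors, together with large variance inflation factors, suggest that the individual coefficients are being estimated imprecisely.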
1.2.1. Residuals
When you have a model with which you are satisfied, generate residuals and examine them: look at their distribution (histogram) and their scatter plot with each of the explanatory variables, and examine cases with large residuals (positive or negative).
reg mr m w h
predict r, res
scatter r m
scatter r w
scatter r h
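The distribution check and the search for large residuals can be sketched in the same way. The name rstd is hypothetical, and the cut-off of 2 for standardized residuals is a common rule of thumb rather than something set by this lab.

```stata
reg mr m w h

* Standardized residuals are easier to judge against a fixed cut-off
predict rstd, rstandard

* Distribution of the residuals with a normal curve overlaid
histogram rstd, normal

* Cases with large residuals; missing values are excluded explicitly
* because Stata treats missing as larger than any number
list if abs(rstd) > 2 & !missing(rstd)
```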
1.2.2. Cook's Distance
A measure of how much influence a case has on the model is "Cook's Distance". This is related to the residual, but is a better measure of how much the case affects the estimation. It is available from the same dialogue as the residuals. Generate it, and examine it in comparison with the other variables, and with the residuals.
predict cd, cooksd
Having done that, remove the case with the highest Cook's Distance from the data and fit the regression again. Do the results differ? If so, how would you explain that? In research terms, would it make sense to remove this case? Would it change the substantive inferences you could make?
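A minimal sketch of this comparison, excluding the case with an if condition rather than deleting it from the data. The names cdist, maxcd, full and trimmed are hypothetical, chosen to avoid clashing with variables that already exist.

```stata
reg mr m w h
estimates store full

* Cook's Distance under a fresh name, and its maximum
predict cdist, cooksd
summarize cdist
scalar maxcd = r(max)

* Which case is it?
list if cdist == maxcd

* Refit without that case and put the two sets of results side by side
reg mr m w h if cdist < maxcd
estimates store trimmed
estimates table full trimmed, se
```

Using an if condition keeps the full data set in memory, so the case can easily be brought back for further checks.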
Note: the dfbeta command creates a new variable for each explanatory variable in the model, showing the effect each case has on the parameter estimates:
dfbeta
The Cook's distance is an overall measure, while the DFBETAs show the effect variable by variable.
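The predict command can also produce the DFBETA for a single named variable, which gives control over the name of the new variable. In the sketch below, dfb_m is a hypothetical name, and the threshold \(2/\sqrt{n}\) is a commonly cited rule of thumb, not part of this lab.

```stata
reg mr m w h

* DFBETA for the coefficient on m only
predict dfb_m, dfbeta(m)

* Cases that shift the coefficient on m by more than the rule-of-thumb threshold
list if abs(dfb_m) > 2/sqrt(e(N)) & !missing(dfb_m)
```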