SO5032: Lab Materials

1. Week 7 Lab

1.1. Non-linearity

1.1.1. Sketch a quadratic function

Using pen and paper and/or Excel, plot the curve \(Y=20+0.75X-0.03X^2\).
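
As a check on the sketch, note that the coefficient on \(X^2\) is negative, so the curve rises to a maximum and then falls. The turning point is where the slope is zero:

\[
\frac{dY}{dX} = 0.75 - 0.06X = 0 \quad\Rightarrow\quad X = 12.5, \qquad Y = 20 + 0.75(12.5) - 0.03(12.5)^2 = 24.6875
\]

A few further points help to fix the shape: \(Y=20\) at \(X=0\), \(Y=24.5\) at \(X=10\), \(Y=20\) at \(X=25\) and \(Y=15.5\) at \(X=30\).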

1.1.2. Age example

Age often has a non-linear effect: how should we handle it?


replace age = . if age < 15

drop if income >6000

scatter income age, mfcol(blue%05) mlcolor(white%00)

reg income age
reg income age if age<=34
reg income age if age>34

* Group age into five-year bands: 15, 20, 25, ...
gen ageg = 5*int(age/5)

reg income i.ageg

* c. marks age as continuous; a bare age##age would be treated as categorical
reg income c.age##c.age
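
One way to read the quadratic model is to plot its predicted values and locate its turning point. The following is a minimal sketch, assuming the data from the steps above are still in memory. The age range passed to `at()` is an illustrative assumption, not something taken from the data.

```stata
* Quadratic in age, with age marked as continuous
reg income c.age##c.age

* Predicted income at ages 15 to 75 in five-year steps
* (the range 15-75 is an assumption: adjust it to the data)
margins, at(age=(15(5)75))
marginsplot

* Age at the turning point of the fitted curve: -b1/(2*b2)
nlcom -_b[age]/(2*_b[c.age#c.age])
```

Because the squared term is written in factor-variable notation, `margins` knows that age and age squared move together, which would not be the case with a hand-made squared variable.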

1.1.3. Model a non-linear relationship

Run this do-file to load data on


Fit models predicting birth rate using GNP as

  • a linear effect
  • a quadratic effect (GNP plus squared GNP)
  • logged GNP and
  • a grouped effect.

Consider the fit of the four models.

Plot the four predicted values as lines/curves on the same graph: how do they compare? Plot the residuals as well.
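
The four models and the comparison plot might be set up along the following lines. This is only a sketch: `birthrate` and `gnp` are placeholder variable names (check the do-file for the real ones), and grouping GNP into quartiles with `xtile` is one possible choice of grouping, not the required one.

```stata
* birthrate and gnp are hypothetical names: substitute the real ones

* 1. Linear effect
reg birthrate gnp
estimates store lin
predict p_lin
predict r_lin, residuals

* 2. Quadratic effect
reg birthrate c.gnp##c.gnp
estimates store quad
predict p_quad
predict r_quad, residuals

* 3. Logged GNP
gen lgnp = ln(gnp)
reg birthrate lgnp
estimates store logged
predict p_log
predict r_log, residuals

* 4. Grouped effect (quartiles are an assumption)
xtile gnpg = gnp, nq(4)
reg birthrate i.gnpg
estimates store grouped
predict p_grp
predict r_grp, residuals

* Compare fit across the four models
estimates table lin quad logged grouped, stats(N r2_a aic bic)

* Observed data with the four sets of predicted values
twoway (scatter birthrate gnp) (line p_lin p_quad p_log p_grp gnp, sort)

* Residuals against GNP, one model at a time
scatter r_lin gnp, yline(0)
```

Repeating the last command with `r_quad`, `r_log` and `r_grp` shows whether any systematic pattern remains in the residuals of each model.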

1.2. Murder, mayhem and model search

Load this data set for US states, which details various crime and other statistics.


Explore the data set and search for a good model to predict the murder rate, using t-tests, adjusted \(R^2\) and F-tests as appropriate. Start with bivariate regressions, then move on to fuller models.

Do you observe multicollinearity? That is, do you find correlated explanatory variables that are each strong on their own but both weak when entered together?
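
A possible way to check for this is sketched below. `mr` is the murder rate variable used in the commands later in this lab, while `x1` and `x2` are placeholders for any pair of explanatory variables you suspect of being closely related.

```stata
* x1 and x2 are hypothetical names: substitute two real predictors

* How strongly are the two predictors correlated?
correlate x1 x2

* Each predictor on its own
reg mr x1
reg mr x2

* Both together: do the t-statistics collapse?
reg mr x1 x2

* Variance inflation factors for the joint model
estat vif
```

Large variance inflation factors, together with coefficients that lose significance only in the joint model, point towards multicollinearity.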

1.2.1. Residuals

When you have a model with which you are satisfied, generate residuals, and examine them: look at their distribution (histogram), their scatter plot with each of the explanatory variables, and examine cases with large residuals (positive or negative).

reg mr m w h
predict r, res
scatter r m
scatter r w
scatter r h
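
The remaining checks could look like the following sketch, which assumes the regression above was the last model fitted and that `r` holds its residuals. The cut-off of two standard deviations is a common rule of thumb rather than a fixed standard.

```stata
* Distribution of the residuals, with a normal curve overlaid
histogram r, normal

* Standardised residuals make "large" easier to judge
predict rs, rstandard

* Cases more than two standard deviations from zero, either side
list mr m w h r rs if abs(rs) > 2 & !missing(rs)
```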

1.2.2. Cook's Distance

A measure of how much influence a case has on the model is "Cook's Distance". This is related to the residual, but is a better measure of how much the case affects the estimation. It is available from the same dialogue as the residuals. Generate it, and examine it in comparison with the other variables, and with the residuals.

predict cd, cooksd

Having done that, remove the case with the highest Cook's Distance from the data and fit the regression again. Do the results differ? If so, how would you explain that? In research terms, would it make sense to remove this case? Would it change the substantive inferences you could make?
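
One way to do this without deleting anything from the data is sketched below. It assumes `cd` has been generated as above and reuses the example model `reg mr m w h`; substitute your own model.

```stata
* Put the most influential cases first
* (gsort places missing values last in a descending sort)
gsort -cd
list mr m w h cd in 1/5

* Refit, excluding the case with the largest Cook's distance
reg mr m w h if cd < cd[1]
```

Using an `if` condition rather than `drop` keeps the case in the data set, so the two sets of estimates can be compared side by side.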

Note: the dfbeta command creates a new variable for each explanatory variable in the model, showing the effect each case has on the parameter estimates:
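
A minimal sketch of that step, again using the example model from the residuals section. The `_dfbeta_` variable names are those created by recent versions of Stata (older versions used a `DF` prefix), and the cut-off of \(2/\sqrt{n}\) is a commonly quoted rule of thumb, not a strict test.

```stata
* dfbeta uses the most recently fitted regression
reg mr m w h
dfbeta

* One new variable per explanatory variable: _dfbeta_1, _dfbeta_2, ...
summarize _dfbeta_*

* Cases with a large effect on the first coefficient
list mr m w h _dfbeta_1 if abs(_dfbeta_1) > 2/sqrt(e(N)) & !missing(_dfbeta_1)
```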


The Cook's distance is an overall measure, while the DFBETAs show the effect variable by variable.

Author: Brendan Halpin

Created: 2023-03-07 Tue 11:38