# SO5032: Lab Materials

## 1. Week 7 Lab

### 1.1. Non-linearity

#### 1.1.1. Sketch a quadratic function

Using pen and paper and/or Excel, plot the curve \(Y=20+0.75X-0.03X^2\).
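As a check on a hand-drawn sketch, the same curve can be drawn in Stata. The slope of the curve is \(0.75-0.06X\), which is zero at \(X=12.5\), so the curve peaks there at \(Y=24.6875\). The following is a minimal sketch; the plotting range of 0 to 40 is an arbitrary assumption, not part of the exercise.

```stata
* Plot Y = 20 + 0.75X - 0.03X^2 over an assumed range of X from 0 to 40
* The vertical line marks the turning point at X = 12.5
twoway function y = 20 + 0.75*x - 0.03*x^2, range(0 40) xline(12.5) ytitle("Y") xtitle("X")
```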

#### 1.1.2. Age example

Age often has non-linear effects. How should we model it?

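    * Load the example data
    use http://teaching.sociology.ul.ie/so5032/example1
    * Set ages under 15 to missing and drop very high incomes
    replace age = . if age < 15
    drop if income > 6000
    scatter income age, mfcol(blue%05) mlcolor(white%00)
    * Linear fit: everyone, then younger and older respondents separately
    reg income age
    reg income age if age <= 34
    reg income age if age > 34
    * Age in five-year bands, entered as a categorical variable
    gen ageg = 5*int(age/5)
    reg income i.ageg
    * Quadratic in age; the c. prefix marks age as continuous
    reg income c.age##c.age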
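The shape implied by a quadratic model is easier to judge from predicted values than from the coefficients. The following is a minimal sketch, assuming the example data are already in memory. It writes the quadratic as `c.age##c.age` so that Stata treats age as continuous, and the age range of 15 to 75 is an assumption to be checked against the data.

```stata
* Sketch: assumes the example1 data are in memory, prepared as in this section
* Quadratic in age, with age treated as continuous
reg income c.age##c.age
* Predicted income at five-year steps of age
* 15 to 75 is an assumed range: check the real one with -summarize age-
margins, at(age=(15(5)75))
* Plot the predictions with their confidence intervals
marginsplot
```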

#### 1.1.3. Model a non-linear relationship

Run this do-file to load the data:

do http://teaching.sociology.ul.ie/so5032/labs/birth

Fit models predicting birth rate using GNP as

- a linear effect
- a quadratic effect (GNP plus squared GNP)
- logged GNP and
- a grouped effect.

Consider the fit of the four models.

Plot the four predicted values as lines/curves on the same graph: how do they compare? Plot the residuals as well.
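One possible way to organise this exercise is sketched below. The variable names `birthrate` and `gnp` are hypothetical, since the names created by the do-file are not shown here, and grouping GNP into quartiles is only one of many reasonable choices. Adjusted \(R^2\) appears in each `reg` output and can be used to compare fit.

```stata
* Sketch only. birthrate and gnp are HYPOTHETICAL variable names:
* run -describe- after the do-file and substitute the real ones.

* 1. Linear effect
reg birthrate gnp
predict p_lin
predict r_lin, residuals

* 2. Quadratic effect (GNP plus squared GNP)
reg birthrate c.gnp##c.gnp
predict p_quad
predict r_quad, residuals

* 3. Logged GNP (requires gnp > 0)
gen lgnp = ln(gnp)
reg birthrate lgnp
predict p_log
predict r_log, residuals

* 4. Grouped effect; quartiles are an arbitrary choice of grouping
xtile gnpg = gnp, nq(4)
reg birthrate i.gnpg
predict p_grp
predict r_grp, residuals

* Predicted values from all four models over the observed data
twoway (scatter birthrate gnp) (line p_lin p_quad p_log p_grp gnp, sort)

* Residuals from all four models, overlaid, against GNP
scatter r_lin r_quad r_log r_grp gnp, yline(0)
```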

### 1.2. Murder, mayhem and model search

Load this data set for US states, which details various crime and other statistics.

use http://teaching.sociology.ul.ie/so5032/labs/agrestistates

Explore the data set and search for a good model to predict the murder rate, using t-tests, adjusted \(R^2\) and F-tests as appropriate. Start by looking at bivariate regressions, then move on to fuller models.

Do you observe **multicollinearity**? That is, correlated explanatory
variables that are strong individually but both weak when entered
together?
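A model search along these lines might start as in the following sketch. It assumes the data set is loaded and borrows the variable names `mr`, `m`, `w` and `h` from the residuals example in this lab; the other variables in the data set are worth exploring in the same way. Variance inflation factors from `estat vif` are one common check for multicollinearity.

```stata
* Sketch: assumes the agrestistates data are in memory.
* mr, m, w and h are the names used in the residuals example in this lab.
describe

* Pairwise correlations with p-values
pwcorr mr m w h, sig

* Bivariate regressions first
reg mr m
reg mr w
reg mr h

* Fuller model, then variance inflation factors as a collinearity check
reg mr m w h
estat vif
```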

#### 1.2.1. Residuals

When you have a model with which you are satisfied, generate residuals and examine them: look at their distribution (`histogram`), their scatter plot with each of the explanatory variables, and examine cases with large residuals (positive or negative).

    reg mr m w h
    predict r, res
    scatter r m
    scatter r w
    scatter r h
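The distribution and the large cases can be examined as in the sketch below. It uses standardised residuals, which make "large" easier to judge; the cutoff of 2 is a rule of thumb rather than a fixed standard, and `rs` is an arbitrary new variable name.

```stata
* Sketch: assumes the data and model from this section
reg mr m w h

* Standardised residuals, with a normal curve overlaid on the histogram
predict rs, rstandard
histogram rs, normal

* List the cases with large positive or negative residuals
list if abs(rs) > 2 & !missing(rs)
```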

#### 1.2.2. Cook's Distance

A measure of how much influence a case has on the model is "Cook's Distance". This is related to the residual, but is a better measure of how much the case affects the estimation. It is available from the same dialogue as the residuals. Generate it, and examine it in comparison with the other variables, and with the residuals.

predict cd, cooksd

Having done that, remove the case with the highest Cook's Distance from the data and fit the regression again. Do the results differ? If so, how would you explain that? In research terms, would it make sense to remove this case? Would doing so change the substantive inferences you could make?
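One way to do this without deleting anything from the data is to sort on Cook's Distance and refit the model on all but the first case. This is a minimal sketch, assuming `cd` has been created as above after `reg mr m w h`.

```stata
* Sketch: assumes cd was created by -predict cd, cooksd- after -reg mr m w h-
* Sort so the most influential case comes first (gsort puts missing values last)
gsort -cd

* Inspect that case
list in 1

* Refit without it and compare the coefficients with the original model
reg mr m w h in 2/l
```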

Note: the `dfbeta` command creates a new variable for each explanatory variable in the model, showing the effect each case has on the parameter estimates:

dfbeta

Cook's Distance is an overall measure, while the DFBETAs show the effect variable by variable.
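To look at one explanatory variable at a time, a DFBETA can also be requested through `predict`. The sketch below does this for `m`; the variable name `dfm` is arbitrary, and the cutoff of \(2/\sqrt{n}\) is a common convention rather than a strict rule.

```stata
* Sketch: assumes the data and model from this section
reg mr m w h

* DFBETA for the coefficient on m only
predict dfm, dfbeta(m)

* Flag cases beyond the conventional 2/sqrt(n) cutoff
list if abs(dfm) > 2/sqrt(e(N)) & !missing(dfm)
```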