# SO5032: Lab Materials

## 1 Week 6 Lab

### 1.1 Murder, mayhem and model search

Load this data set for US states, which details various crime and other statistics.

    use http://teaching.sociology.ul.ie/so5032/labs/agrestistates

Explore the data set, and search for a good model to predict the murder rate,
using t-tests, adjusted R², and F-tests as appropriate. Start by
looking at bivariate regressions, then move on to fuller models.

Do you observe **multicollinearity**? That is, are there correlated explanatory
variables that are each strong predictors on their own but both weak when entered
together?
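One way to check this is sketched below. It is only an illustration: the variable names `mr`, `m`, `w` and `h` are taken from the commands used later in this lab, and the pair `m` and `w` is an arbitrary choice, not a claim about which variables are actually collinear in these data.

```stata
use http://teaching.sociology.ul.ie/so5032/labs/agrestistates, clear
* Pairwise correlations among the outcome and the candidate predictors
pwcorr mr m w h, sig
* Bivariate fits first, then the two predictors together
regress mr m
regress mr w
regress mr m w
* Variance inflation factors for the joint model just fitted
estat vif
```

If two predictors are each significant alone, lose significance together, and show a high correlation or a large VIF, that pattern is consistent with multicollinearity.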

#### 1.1.1 Residuals

When you have a model with which you are satisfied, generate residuals,
and examine them: look at their distribution (`histogram`), their
scatter plot with each of the explanatory variables, and examine cases
with large residuals (positive or negative).

    reg mr m w h
    predict r, res
    scatter r m
    scatter r w
    scatter r h
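To look at the distribution and pick out the large residuals, something like the following may help. It is a sketch that assumes the states data are still in memory; `rstd_chk` is a hypothetical variable name, and the cutoff of 2 is a common rule of thumb rather than a fixed standard.

```stata
* Assumes the agrestistates data are in memory and rstd_chk does not exist yet
quietly regress mr m w h
* Standardized residuals are easier to compare against a fixed cutoff
predict rstd_chk, rstandard
* Distribution of the residuals with a normal curve overlaid
histogram rstd_chk, normal
* List cases more than about two standard deviations from the fitted line
list mr m w h rstd_chk if abs(rstd_chk) > 2 & !missing(rstd_chk)
```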

#### 1.1.2 Cook's Distance

A measure of how much influence a case has on the model is "Cook's Distance". This is related to the residual, but is a better measure of how much the case affects the estimation. It is available from the same dialogue as the residuals. Generate it, and examine it in comparison with the other variables, and with the residuals.

    predict cd, cooksd
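For reference, one standard way of writing Cook's distance makes its link to the residual explicit. The symbols are notation introduced here, not taken from the lab: $e_i$ is the residual for case $i$, $h_{ii}$ its leverage, $p$ the number of estimated coefficients (including the constant), and $s^2$ the residual mean square.

$$D_i = \frac{e_i^{2}}{p\,s^{2}} \cdot \frac{h_{ii}}{(1-h_{ii})^{2}}$$

A case with a large residual but low leverage can therefore still have little influence, while a high-leverage case can be influential even with a moderate residual.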

Having done that, remove the case with the highest Cook's Distance from the data and fit the regression again. Do the results differ? If so, how would you explain that? In research terms, would it make sense to remove this case, and would removing it change the substantive inferences you could make?
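A possible way to do this without actually deleting the case is to exclude it with an `if` condition and compare the two fits side by side. This is a sketch under assumptions: the states data are in memory, and `cd_chk`, `cd_max`, `full` and `trimmed` are hypothetical names chosen for the illustration.

```stata
* Assumes the agrestistates data are in memory
regress mr m w h
estimates store full
predict cd_chk, cooksd
* Record the largest Cook's distance and show the case it belongs to
quietly summarize cd_chk
scalar cd_max = r(max)
list if cd_chk == scalar(cd_max)
* Refit without that case
regress mr m w h if cd_chk < scalar(cd_max)
estimates store trimmed
* Coefficients and standard errors from both fits, side by side
estimates table full trimmed, se stats(N r2_a)
```

Using `if` rather than `drop` keeps the full data set available for later steps.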

Note: the `dfbeta` command creates a new variable for each explanatory
variable in the model, showing the effect each case has on the parameter estimates:

    dfbeta

The Cook's distance is an overall measure, while the DFBETAs show the effect variable by variable.
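The DFBETAs can be inspected in much the same way as the residuals. The sketch below assumes the states data are in memory and a Stata version in which `dfbeta` accepts the `stub()` option; the prefix `dfb_chk_` is a hypothetical name, and the cutoff of 2/sqrt(n) is a commonly quoted rule of thumb, not a requirement of this lab.

```stata
* Assumes the agrestistates data are in memory
quietly regress mr m w h
* One new variable per predictor: dfb_chk_1 (for m), dfb_chk_2 (w), dfb_chk_3 (h)
dfbeta, stub(dfb_chk_)
summarize dfb_chk_*
* Cases whose removal would shift the coefficient on m by a notable amount
list mr m w h dfb_chk_* if abs(dfb_chk_1) > 2/sqrt(e(N)) & !missing(dfb_chk_1)
```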

### 1.2 Logging the dependent variable

Linear regression requires the assumption that the residuals are normally distributed, and that their variance (spread) is not associated with any of the explanatory variables. One way this assumption is violated is when the dependent variable has a multiplicative (rather than an additive) relationship with the explanatory variables: random variation around high predictions will then be bigger than around low predictions. This is the case with wage in the NLSW-88 data.

Work through this example as presented, making sure you understand what is happening at each step.

    sysuse nlsw88, clear
    reg wage ttl_exp
    predict plin
    // View residuals
    predict rlin, res
    scatter rlin ttl

The variation around the predicted wage is higher for people with more experience (and therefore higher predicted wages) than for people with less experience. This is "heteroscedasticity", a violation of the assumption of homoscedasticity.
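Besides the scatter plot, Stata offers a built-in residual-versus-fitted plot and a formal test after `regress`. The following is a minimal sketch that assumes the nlsw88 data are already in memory; it refits the linear model so that the post-estimation commands apply to it.

```stata
* Assumes nlsw88 is in memory
quietly regress wage ttl_exp
* Residuals against fitted values; a fan shape suggests heteroscedasticity
rvfplot, yline(0)
* Breusch-Pagan / Cook-Weisberg test; the null hypothesis is constant variance
estat hettest
```

A small p-value from `estat hettest` counts as evidence against constant residual variance.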

If the relationship is multiplicative (i.e., a unit increase in experience tends to increase wage by a percentage, rather than a fixed amount) we can cope with this by taking the log of the dependent variable:

    gen lw = log(wage)
    reg lw ttl_exp
    di _b[ttl_exp]
    di exp(_b[ttl_exp])
    predict rlog, res
    scatter rlog ttl

We interpret the slope coefficient as the additive effect on the log of wage. Thus the antilog of the slope coefficient is a multiplicative effect on wage.
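The step from additive to multiplicative can be written out as follows, using $a$ for the intercept, $b$ for the slope and $x$ for experience (notation introduced here for illustration):

$$\log(\text{wage}) = a + b\,x \;\Longrightarrow\; \text{wage} = e^{a}\,\left(e^{b}\right)^{x}$$

Each extra unit of $x$ multiplies the predicted wage by $e^{b}$, which corresponds to a percentage change of $100\,(e^{b}-1)$.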

There is a small problem when it comes to predictions. We can get the predicted log wage directly, but if we just take the antilog it will tend to be too low. (Simply put, the result is the antilog of the average of the logged data, which is not the same as the average of the original data.)

    predict plog
    gen eplog=exp(plog)
    label var eplog "Predicted log wage: underestimate"
    label var plin "Predicted wage (linear)"
    su wage plin eplog

We see that while the average of the linear regression prediction is close to the data, the average of the antilog of the log prediction is too low. We need to apply a small adjustment based on the spread of the data around the regression line, measured by the root mean squared error (RMSE) of the log regression.

    gen eplogadj = eplog*exp(e(rmse)^2/2)
    label var eplogadj "Predicted log wage: corrected"
    su wage plin eplog eplogadj
    scatter wage eplogadj eplog plin ttl if wage<15, msize(0.7 0.1 0.1 0.1)
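The factor `exp(e(rmse)^2/2)` can be motivated as follows, assuming the errors of the log model are approximately normal with constant variance $\sigma^{2}$. Here $a$, $b$, $x$ and $\varepsilon$ are notation introduced for illustration, and the RMSE of the log regression serves as the estimate of $\sigma$.

$$\log(\text{wage}) = a + b\,x + \varepsilon,\quad \varepsilon \sim N(0,\sigma^{2}) \;\Longrightarrow\; E[\text{wage}\mid x] = e^{a+bx}\,E\!\left[e^{\varepsilon}\right] = e^{a+bx}\,e^{\sigma^{2}/2}$$

Because $e^{\sigma^{2}/2} > 1$ whenever $\sigma > 0$, the plain antilog $e^{a+bx}$ is always below the expected wage. If the residuals are far from normal, this correction is only approximate.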