# SO5032: Lab Materials

## 1 Week 6 Lab

### 1.1 Murder, mayhem and model search

Load this data set for US states, which details various crime and other statistics.

```
use http://teaching.sociology.ul.ie/so5032/labs/agrestistates
```

Explore the data set and search for a good model to predict the murder rate, using t-tests, adjusted R² and F-tests as appropriate. Start with bivariate regressions, then move on to fuller models.
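
One way to organise the search is sketched here. It assumes that `mr` is the murder rate and that `m`, `w` and `h` are candidate explanatory variables; these names are taken from the residuals example later in this lab, so run `describe` to confirm what each variable holds and to find other candidates.

```stata
// Load the data and check what each variable is (names below are assumed)
use http://teaching.sociology.ul.ie/so5032/labs/agrestistates, clear
describe

// Bivariate regressions: one explanatory variable at a time
foreach x of varlist m w h {
    quietly regress mr `x'
    display "`x': b = " %8.3f _b[`x'] "  t = " %6.2f _b[`x']/_se[`x'] "  Adj-R2 = " %5.3f e(r2_a)
}

// Fuller model: check the t-tests, the overall F and the adjusted R-squared
regress mr m w h
display "Adj-R2 = " %5.3f e(r2_a) "   F = " %7.2f e(F)
```

Comparing the adjusted R² of the fuller model with the bivariate values shows whether the extra variables earn their place.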

Do you observe multicollinearity, that is, correlated explanatory variables that are each strong individually but both weak when entered together?
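
A quick check for this, sketched with the same assumed variables (`mr`, `m`, `w`, `h`), is to look at the pairwise correlations and at the variance inflation factors. `estat vif` is a post-estimation command for `regress`; a common rule of thumb treats values above about 10 as a concern, though that cut-off is a convention rather than a test.

```stata
use http://teaching.sociology.ul.ie/so5032/labs/agrestistates, clear

// Correlations among the explanatory variables, with significance levels
pwcorr m w h, sig

// Variance inflation factors for the full model
regress mr m w h
estat vif
```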

#### 1.1.1 Residuals

When you have a model with which you are satisfied, generate residuals, and examine them: look at their distribution (`histogram`), their scatter plot with each of the explanatory variables, and examine cases with large residuals (positive or negative).

```
reg mr m w h
predict r, res
scatter r m
scatter r w
scatter r h
```
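
The commands above cover the scatter plots. A sketch of the other two steps (the distribution, and the cases with large residuals) might look like this. The standardised residual and the cut-off of 2 are conventional choices rather than part of the lab instructions, and `rs` is a variable name introduced here for illustration.

```stata
use http://teaching.sociology.ul.ie/so5032/labs/agrestistates, clear
regress mr m w h
predict r, res

// Distribution of the residuals, with a normal curve overlaid
histogram r, normal

// Standardised residuals make "large" easier to judge
predict rs, rstandard
list mr m w h r rs if abs(rs) > 2 & !missing(rs)
```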

#### 1.1.2 Cook's Distance

A measure of how much influence a case has on the model is "Cook's Distance". This is related to the residual, but is a better measure of how much the case affects the estimation. It is available from the same dialogue as the residuals. Generate it, and examine it in comparison with the other variables, and with the residuals.

```
predict cd, cooksd
```
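
To compare Cook's Distance with the residuals and pick out the most influential cases, something along these lines should work. The 4/n threshold is a common rule of thumb, not a value given in this lab.

```stata
use http://teaching.sociology.ul.ie/so5032/labs/agrestistates, clear
regress mr m w h
predict r, res
predict cd, cooksd

// Cook's distance against the residuals
scatter cd r

// The most influential cases first; 4/n is a common rule-of-thumb cut-off
gsort -cd
list mr m w h r cd in 1/5
list mr m w h r cd if cd > 4/e(N) & !missing(cd)
```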

Having done that, remove the case with the highest Cook's Distance from the data and fit the regression again. Do the results differ? If so, how would you explain that? In research terms, would it make sense to remove this case? Would doing so change the substantive inferences you could make?
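
A minimal sketch of this comparison follows, excluding the case with an `if` condition rather than dropping it, so the data stay intact. The names `full`, `trimmed` and `cd_max` are introduced here for illustration.

```stata
use http://teaching.sociology.ul.ie/so5032/labs/agrestistates, clear

// Full-sample model
regress mr m w h
estimates store full
predict cd, cooksd

// Refit without the single most influential case
quietly summarize cd
scalar cd_max = r(max)
regress mr m w h if cd < cd_max
estimates store trimmed

// Side-by-side coefficients and standard errors
estimates table full trimmed, b(%9.3f) se(%9.3f) stats(N r2_a)
```

Look for coefficients that change sign, size or significance between the two columns.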

Note: the `dfbeta` command creates a new variable for each explanatory variable in the model, showing the effect each case has on the parameter estimates:

```
dfbeta
```

The Cook's distance is an overall measure, while the DFBETAs show the effect variable by variable.
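
A sketch of how the DFBETAs might be inspected is given here. It assumes the new variables are named `_dfbeta_1`, `_dfbeta_2` and so on, as in recent Stata versions; check the output of `dfbeta` for the actual names in your version. The 2/sqrt(n) threshold is a common rule of thumb.

```stata
use http://teaching.sociology.ul.ie/so5032/labs/agrestistates, clear
regress mr m w h
dfbeta

// One DFBETA variable per explanatory variable
// (assumed to be named _dfbeta_1, _dfbeta_2, ... as in recent Stata versions)
summarize _dfbeta_*

// Flag cases that move the first coefficient noticeably;
// 2/sqrt(n) is a common rule-of-thumb threshold
list mr m w h _dfbeta_1 if abs(_dfbeta_1) > 2/sqrt(e(N)) & !missing(_dfbeta_1)
```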

### 1.2 Logging the dependent variable

Linear regression requires the assumption that the residuals are normally distributed, and that their variance (spread) is not associated with any of the explanatory variables. One way this assumption is violated is if the dependent variable has a multiplicative (rather than an additive) relationship with the explanatory variables: random variations around high predictions will then be bigger than around small predictions. This is the case with wage in the NLSW-88 data.

Work through this example as presented, making sure you understand what is happening at each step.

```
sysuse nlsw88

reg wage ttl_exp
predict plin

// View residuals
predict rlin, res
scatter rlin ttl
```

The variation around the predicted wage is higher for people with higher experience (and therefore higher predicted wages) than for people with lower experience. This is "heteroscedasticity", or violation of the assumption of homoscedasticity.
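
As a complement to reading the scatter plot by eye, Stata offers a residual-versus-fitted plot and a formal test after `regress`. This is a sketch of an optional extra check, not part of the worked example.

```stata
sysuse nlsw88, clear
regress wage ttl_exp

// Residuals against fitted values: a fan shape suggests heteroscedasticity
rvfplot, yline(0)

// Breusch-Pagan / Cook-Weisberg test; H0 is constant residual variance
estat hettest
```

A small p-value from `estat hettest` is evidence against constant variance.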

If the relationship is multiplicative (i.e., a unit increase in experience tends to increase wage by a percentage, rather than a fixed amount) we can cope with this by taking the log of the dependent variable:

```
gen lw = log(wage)
reg lw ttl_exp

di _b[ttl_exp]
di exp(_b[ttl_exp])

predict rlog, res
scatter rlog ttl
```

We interpret the slope coefficient as the additive effect on the log of wage. Thus the antilog of the slope coefficient is a multiplicative effect on wage.
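
It can help to see the three readings of the same coefficient side by side. This sketch refits the log model and prints the additive effect, the multiplicative effect and the equivalent percentage change.

```stata
sysuse nlsw88, clear
gen lw = log(wage)
quietly regress lw ttl_exp

// Additive effect on log(wage) of one extra unit of experience
display "b              = " %6.4f _b[ttl_exp]
// Multiplicative effect on wage
display "exp(b)         = " %6.4f exp(_b[ttl_exp])
// The same thing expressed as a percentage change in wage
display "percent change = " %6.2f 100*(exp(_b[ttl_exp]) - 1)
```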

There is a small problem when it comes to predictions. We can get the predicted log wage directly, but if we just take the antilog it will tend to be too low. (Simply put, the result is the antilog of the average of the logged data, which is not the same as the average of the original data.)
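
The point in parentheses can be checked without any regression at all. In this sketch the antilog of the mean of log wage (the geometric mean) is compared with the ordinary mean of wage; `mean_wage` and `geo_mean` are names introduced here for illustration.

```stata
sysuse nlsw88, clear
gen lw = log(wage)

// Arithmetic mean of wage
quietly summarize wage
scalar mean_wage = r(mean)

// Antilog of the mean of log(wage): the geometric mean
quietly summarize lw
scalar geo_mean = exp(r(mean))

display "mean of wage        = " %6.3f mean_wage
display "antilog of mean log = " %6.3f geo_mean
```

The second figure is the smaller of the two, which is the same shortfall that affects the antilogged predictions.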

```
predict plog
gen eplog=exp(plog)

label var eplog "Predicted log wage: underestimate"
label var plin "Predicted wage (linear)"
su wage plin eplog
```

We see that while the average of the linear regression prediction is close to the data, the average of the antilog of the log prediction is too low. We need to apply a small adjustment based on the spread of the data around the regression line, measured by the root mean squared error (RMSE) of the log regression.
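
The form of the adjustment follows from the mean of a log-normal distribution. The sketch below assumes the residuals of the log model are normal with constant variance, written here as $\sigma^2$; the symbols $a$, $b$ and $\varepsilon$ are notation introduced for this illustration.

```latex
\begin{aligned}
\log y &= a + b x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2) \\
E[y \mid x] &= e^{a + b x}\, E\!\left[e^{\varepsilon}\right] = e^{a + b x}\, e^{\sigma^2 / 2}
\end{aligned}
```

In practice $\sigma$ is estimated by the RMSE of the log regression, so the antilogged prediction is multiplied by $e^{\mathrm{RMSE}^2/2}$.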

```
gen eplogadj = eplog*exp(e(rmse)^2/2)

label var eplogadj "Predicted log wage: corrected"

su wage plin eplog eplogadj

scatter wage eplogadj eplog plin ttl if wage<15, msize(0.7 0.1 0.1 0.1)
```

Created: 2018-02-26 Mon 09:52

