SO5032: Lab Materials
1. Week 4 Lab
1.1. Mental health
This file contains code that relates a mental impairment score to SES (socioeconomic status) and a negative life-events score. Run it as follows:
source("https://teaching.sociology.ul.ie/so5032/mental.R")
Fit the regression model predicting impairment from the other two variables. Interpret the model.
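As a minimal sketch of the fitting step, the following uses simulated data rather than the real file. The data frame `toy` and the column names `impair`, `ses` and `life` are hypothetical, so substitute the names you find in the lab data (check with `names(mental)`).

```r
# Simulated stand-in for the lab data; all names and values are hypothetical
set.seed(1)
n <- 40
toy <- data.frame(ses = runif(n, 0, 100), life = rpois(n, 40))
toy$impair <- 28 - 0.1 * toy$ses + 0.1 * toy$life + rnorm(n, sd = 4)

# Regress the impairment score on both explanatory variables
mod1 <- lm(impair ~ ses + life, data = toy)

# Coefficients, standard errors, t-tests, R-squared and the overall F-test
summary(mod1)
```

Each slope in the summary is the expected change in the outcome for a one-unit change in that variable, holding the other variable constant.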
1.2. Predicted values
Using the regression results, calculate the predicted values by hand (with a calculator) for the first few cases, i.e. plug their values on the independent variables into the fitted equation. Then, assuming the regression is saved as mod1, run mental$yhat <- predict(mod1) to have R store the predicted values as a new column in the data frame. Were your calculations correct?
- Do a scatter plot of the predicted values versus the observed values
- Are the predicted values close to the real ones?
- Calculate the correlation between the predicted and observed values, and relate it to the R² from the regression
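The steps above can be sketched on simulated data. The data frame `toy` and the variables `x1`, `x2` and `y` are hypothetical stand-ins for the lab data, not the real variables.

```r
# Simulated data; names and values are hypothetical
set.seed(2)
toy <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(30)
mod <- lm(y ~ x1 + x2, data = toy)

# Predicted value "by hand": intercept + b1 * x1 + b2 * x2
b <- coef(mod)
yhat_hand <- b["(Intercept)"] + b["x1"] * toy$x1 + b["x2"] * toy$x2

# Predicted values from R, stored as a new column
toy$yhat <- predict(mod)
all.equal(unname(yhat_hand), unname(toy$yhat))

# Scatter plot of predicted against observed values
plot(toy$y, toy$yhat, xlab = "Observed", ylab = "Predicted")

# The squared correlation between observed and predicted values
# should match the R-squared reported by the model
cor(toy$y, toy$yhat)^2
summary(mod)$r.squared
```

For a linear regression with an intercept, the last two numbers agree: R² is the squared correlation between the observed and the predicted values.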
1.3. Hypothesis tests
With the model containing the two explanatory variables, carry out hypothesis tests on the conditional effects of the two variables. Can you reject the null hypothesis in either case?
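The t-tests reported by `summary()` can be reproduced by hand, which may help in seeing what is being tested. This is a sketch on simulated data, with hypothetical names `toy`, `x1`, `x2` and `y`.

```r
# Simulated data; x2 has no true effect in this simulation
set.seed(3)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 0.5 * toy$x1 + rnorm(50)
mod <- lm(y ~ x1 + x2, data = toy)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
ctab <- coef(summary(mod))

# t statistic = estimate / standard error
t_hand <- ctab[, "Estimate"] / ctab[, "Std. Error"]

# Two-sided p-value from the t distribution on the residual degrees of freedom
p_hand <- 2 * pt(abs(t_hand), df = df.residual(mod), lower.tail = FALSE)

cbind(t_hand, p_hand)

# 95% confidence intervals for the same parameters
confint(mod)
```

The null hypothesis for each row is that the parameter is zero, given the other variable in the model. Reject it at the 5% level when the p-value is below 0.05 or, equivalently, when the 95% confidence interval excludes zero.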
1.4. Adjusted R²
F-tests can be used to test a model globally, and also to compare two nested models, one of which has extra variables. An approximate but quicker way to do this is to look at Adjusted R², which is R² scaled to take account of the number of cases and the number of parameters, in a calculation similar to that for the F-statistic. Unlike R², Adjusted R² can fall as variables are added to the model, if they contribute little to the fit.
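As a sketch of the calculation, the usual formula is 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of cases and p the number of explanatory variables. The helper `adj_r2()` and the simulated variables below are hypothetical, written only to check the formula against R's own output.

```r
# Simulated data: 'noise' is unrelated to y by construction
set.seed(4)
n <- 60
toy <- data.frame(x1 = rnorm(n), noise = rnorm(n))
toy$y <- 1 + toy$x1 + rnorm(n)

small <- lm(y ~ x1, data = toy)
big   <- lm(y ~ x1 + noise, data = toy)

# Hypothetical helper: adjusted R-squared from R-squared, n and p
adj_r2 <- function(mod) {
  r2 <- summary(mod)$r.squared
  n  <- nobs(mod)
  p  <- length(coef(mod)) - 1   # explanatory variables, excluding the intercept
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# R-squared never falls when a variable is added
c(r2_small = summary(small)$r.squared, r2_big = summary(big)$r.squared)

# Adjusted R-squared penalises the extra parameter, so it can fall
c(adj_small = adj_r2(small), adj_big = adj_r2(big))
```

The hand calculation should match `summary(big)$adj.r.squared`.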
1.5. F-tests
R's regression output presents the result of an F-test against the null (intercept-only) model, at the bottom of the summary(lm()) output. This tests the null hypothesis that the true values of all parameters other than the intercept are zero.
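The global F-test can also be pulled out of the summary object and checked against an explicit comparison with the intercept-only model. This is a sketch on simulated data; `toy`, `x1`, `x2`, `y` and `null_mod` are hypothetical names.

```r
# Simulated data; names and values are hypothetical
set.seed(5)
toy <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
toy$y <- 1 + 0.8 * toy$x1 + rnorm(40)
mod <- lm(y ~ x1 + x2, data = toy)

# fstatistic holds the F value with its numerator and denominator df
f <- summary(mod)$fstatistic
p_global <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
p_global

# The same test, expressed as a comparison with the intercept-only model
null_mod <- lm(y ~ 1, data = toy)
anova(null_mod, mod)
```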
To compare two "nested" models (one has all the variables of the other, plus some extra ones), fit and save both models and compare the two as follows:
library(foreign)
counties <- read.dta("https://teaching.sociology.ul.ie/so5032/labs/agresticounties.dta")
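## Smaller model: c predicted by u alone
cmod1 <- lm(c ~ u, data = counties)
## Larger model: adds i and hs, so cmod1 is nested within it
cmod2 <- lm(c ~ u + i + hs, data = counties)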
anova(cmod1, cmod2)
Interpret that result.
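To see where the F value in the anova() table comes from, it can be rebuilt from the residual sums of squares (RSS) of the two models: F = ((RSS1 - RSS2)/(df1 - df2)) / (RSS2/df2). This sketch uses simulated data with hypothetical variables `a`, `b`, `d` and `y`, not the counties data.

```r
# Simulated data; names and values are hypothetical
set.seed(6)
toy <- data.frame(a = rnorm(50), b = rnorm(50), d = rnorm(50))
toy$y <- 2 + toy$a + 0.5 * toy$b + rnorm(50)

m1 <- lm(y ~ a, data = toy)           # smaller model
m2 <- lm(y ~ a + b + d, data = toy)   # larger model, m1 nested within it

# Residual sums of squares and residual degrees of freedom
rss1 <- deviance(m1)
rss2 <- deviance(m2)
df1 <- df.residual(m1)
df2 <- df.residual(m2)

# Drop in RSS per extra parameter, relative to the larger model's residual variance
f_hand <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
p_hand <- pf(f_hand, df1 - df2, df2, lower.tail = FALSE)

c(F = f_hand, p = p_hand)
anova(m1, m2)
```

A small p-value means the extra variables jointly improve the fit by more than would be expected by chance. A large one means the smaller model is adequate.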
1.6. Note: Dummy variables
If you have a categorical explanatory variable, you can enter it as a set of n-1 "dummy" variables, where n is the number of categories. A dummy variable takes the value 1 when the original variable takes the corresponding value, and 0 otherwise:
| Original | d1 | d2 | d3 |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 |
In this example, the original variable takes the values 1 to 4. There are three dummy variables, d1 to d3, taking the values 0 and 1, each corresponding to one value of the original variable. For value 4 of the original variable, all three dummy variables have the value 0, which makes category 4 the reference category. Once the dummy variables are entered in a regression analysis, each parameter estimate is interpreted as the effect on the dependent variable of being in that category compared with category 4, holding the other variables constant.
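There is no need to create dummy variables yourself in R. As long as the variable is a factor, R will do it automatically. Note that by default R takes the first level of the factor as the reference category, not the last as in the table above; use relevel() to choose a different one.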
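A small sketch of how R codes a factor, using a made-up four-category variable `g` and outcome `y` (both hypothetical). `model.matrix()` shows the dummy variables R builds behind the scenes, and `relevel()` changes the reference category to 4 to match the table.

```r
# Made-up data: two cases in each of four categories
toy <- data.frame(
  g = factor(c(1, 2, 3, 4, 1, 2, 3, 4)),
  y = c(5, 7, 6, 9, 4, 8, 7, 10)
)

# Default coding: the first level (1) is the reference,
# so the dummy columns are g2, g3 and g4
model.matrix(~ g, data = toy)

# Make category 4 the reference, as in the table
toy$g4 <- relevel(toy$g, ref = "4")
mod <- lm(y ~ g4, data = toy)
coef(mod)
```

With a single factor as the only predictor, the intercept is the mean of the reference category (9.5 here) and each dummy coefficient is the difference between that category's mean and the reference mean (-5, -2 and -3).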
1.7. Predicting wage
Using the NLSW-88 data, fit a regression model with continuous and categorical explanatory variables to predict wage. Look at age, grade, work experience, job tenure, occupation, industry, union membership, and any other variable you consider relevant. Use dummy coding as necessary, along with t-tests, delta-F tests, and adjusted R-squared, to find an overall model that makes sense to you. Interpret the model.
library(foreign)  ## provides read.dta()
nlsw88 <- read.dta("https://teaching.sociology.ul.ie/so5041/nlsw88.dta")
## Drop cases with missing values
nlsw88 <- subset(nlsw88, !is.na(grade) &
                         !is.na(occupation) &
                         !is.na(hours) &
                         !is.na(industry) &
                         !is.na(union) &
                         !is.na(tenure))
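Dropping incomplete cases before modelling matters because anova() can only compare nested models fitted to the same set of cases. As a sketch, the same filtering can be written with `complete.cases()`. The toy data frame and the vector `keep` below are hypothetical and only illustrate the equivalence.

```r
# Toy data with scattered missing values (made-up numbers)
toy <- data.frame(
  wage   = c(10, 12, NA, 9, 15),
  grade  = c(12, NA, 16, 12, 18),
  tenure = c(1, 3, 5, NA, 7)
)

# Explicit version, in the style of the lab code
a <- subset(toy, !is.na(grade) & !is.na(tenure))

# Same filter using complete.cases() on the chosen columns
keep <- c("grade", "tenure")
b <- toy[complete.cases(toy[, keep]), ]

# Both keep the same rows
nrow(a)
nrow(b)
```

Only the columns listed in `keep` are checked, so a case with a missing value on some other variable is retained. Add any further variable you plan to use in a model to that list.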