Labs for Unit B1: Correlation and Regression

1. Lab 2

1.1. Maths and Height

Load the data at


This is the maths/height example we have considered. Examine the correlations between the variables, numerically and graphically (corr varlist and scatter yvar xvar). Then regress maths on height:

reg maths height

Interpret the output, and relate it to the scatter plot.

Consider controlling for year. First, compare the maths/height scatterplot across year: scatter maths height, by(year). What does this tell you?

An alternative way to graph it is:

separate maths, by(year)
scatter maths1-maths6 height

Then fit the regression including year as well as height as explanatory variables: reg maths height year. Interpret the output.

1.2. Crime: spurious association

The Agresti & Finlay text book discusses county-level data on crime. You can load it into Stata as follows:


Graph crime rate against education, and against income. Note the relationship.

For example:

scatter c hs

Repeat the exercise distinguishing between counties according to level of urbanisation. E.g.

scatter c hs if u<=60 || scatter c hs if u>60

(Even better, make a variable with several levels of urbanisation, to get more detail in the graph.) Does this change your interpretation?

Approach the same issue using regression:

reg c hs

What is the effect of education on crime? Is this plausible? Now add the counties' level of urbanisation to the regression. What changes and what does it mean?

reg c hs u

1.3. Direct and indirect effects

Intergenerational transmission can often work through chains of causality. This data set contains information on the respondent's job prestige score, his/her father's job prestige score when he/she was growing up, and his/her level of education:


Look at the relationships between all pairs of variables, and then focus on the father's score/own score relationship.

1.4. Interaction effects

Interaction is where the effect of one variable depends on the level of another. For instance, in predicting income, both age and education may have direct effects, but the effect of education may be less for older people (due to the time elapsed since leaving education). That is, an additional year of education may boost income more for young people than for older people. We can accommodate this by adding an extra variable to the model, consisting of age multiplied by education. If this is significant, it tells us how the effect of education changes for each year of age.

1.4.1. Interaction between binary and continuous variables

A small data set drawn from the British Household Panel Survey allows us to look at an interaction between gender and hours in predicting income. Load the data and execute the following syntax:

gen female = osex==2
gen inter = female*ojbhrs
reg ofimn female
reg ofimn female ojbhrs
reg ofimn female ojbhrs inter

Draw the regression lines for the second and third models. Fit bi-variate models for men and women separately (i.e. only ojbhrs as the explanatory variable) and compare the findings from those models.

1.4.2. Interaction between two continuous variables

Use the NLSW88 data file, and create the log of wage as before. Fit a regression model that predicts the log of wage using ttl_exp and grade. We can create an interaction between ttl_exp and grade by multiplying them together: add this variable to the model.

Note that Stata has syntax that can fit interactions more conveniently:

sysuse nlsw88
gen lw = log(wage)
gen intvar = ttl_exp*grade
reg lw ttl_exp grade intvar

reg lw c.ttl_exp##c.grade

How do you interpret the model? Try by drawing the line relating ttl_exp to lw for grade at specific values (say 6, 10, 14). Do the same thing for grade's effect on lw setting ttl_exp at 6, 10, 14.

1.5. Inference when adding variables

1.5.1. F-tests

Stata's regression output presents the result of an F-test against the null model (top-right of output) but doesn't do incremental F-tests. A handy add-on for this can be installed using ssc install ftest. Using it means you need to fit a model, store its details, fit another and compare the two:

ssc install ftest
reg c u
estimates store urban
reg c u i hs
ftest urban

Interpret that result, and compare it with the result of testing reg c i hs and reg c i hs u.

1.5.2. Adjusted R2

F-tests can be used to globally test a model, and also do compare two models, one with extra variables. An approximate but quicker way to do this is to look at Adjusted R2, which is R2 scaled to take account of the number of cases and number of parameters, in a calculation similar to that for the F-statistic. Adjusted R2 can fall as variables are added to the model, unlike R2, if their contribution is insignificant.

Date: May 2019

Author: Brendan Halpin

Created: 2023-05-31 Wed 18:11