Labs for Unit B1: Correlation and Regression
1. Lab 1
1.1. Correlation
1.1.1. Practice app
See http://teaching.sociology.ul.ie:3838/apps/corrgame
Try to guess the correlation. Keep trying until you get good at it!
1.2. Linear Regression
1.2.1. Florida Crime
The following code will run a regression with county-level crime rate as the outcome, and county-level median income as the explanatory variable (the data are from Florida).
    clear
    use http://teaching.sociology.ul.ie/ssrm/unitb1/floridacrime
    scatter crime income
    reg crime income
Run the code and examine the output.
- Write out the \(Y = a + bX\) equation
- Report \(R^2\)
- Test the hypothesis that income is associated with crime
- Predict crime for income = 20 and income = 30
- Draw the regression line (on paper)
- Calculate the predicted value and residual (error term) for case 5 and relate them to the observed value (do list in 5 to see the values).
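The hand calculations in the list above can be checked with a short sketch like the following. It assumes the data set and variable names used in the regression above; crime_hat and crime_res are hypothetical names chosen here so that they do not clash with variables created elsewhere in the lab. For every case, the observed value equals the predicted value plus the residual.

```stata
* Load the Florida data and refit the regression from the text
clear
use http://teaching.sociology.ul.ie/ssrm/unitb1/floridacrime
regress crime income

* Predicted crime at two income values, using the stored a (_cons) and b
display "Predicted crime at income = 20: " _b[_cons] + _b[income]*20
display "Predicted crime at income = 30: " _b[_cons] + _b[income]*30

* Predicted value and residual for every case
* (crime_hat and crime_res are illustrative names, not part of the data set)
predict double crime_hat, xb
predict double crime_res, residuals

* Case 5: observed crime = crime_hat + crime_res
list crime income crime_hat crime_res in 5
```

Run this before sorting the data, since sorting changes which county is case 5.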
You can verify your predicted values and the line by getting Stata to do the work:
    predict ypred
    sort income
    line ypred income || scatter crime income
1.2.2. Practice app
See the practice app: http://teaching.sociology.ul.ie:3838/bivar
1.2.3. NLSW
Execute the following commands to load the National Longitudinal Survey of Women data set that comes with Stata, and look at the distribution of the wage variable:
    clear
    sysuse nlsw88
    su wage
Consider the following variables as potential predictors of the wage variable:
- age
- ttl_exp, total lifetime work experience
- grade, years of education
- union, whether a member of a union
Let's consider wage as the "dependent variable", to be explained by the others (ignoring union for the moment as it only has two values). Create scatterplots for wage (on the Y-axis) against each of the other variables. Consider the correlations too (e.g., corr ttl_exp wage). Can you see much of a relationship?
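One way to produce all the scatterplots and correlations in a single pass is a loop over the candidate predictors. This is only a sketch: the graph names sc_age, sc_ttl_exp and sc_grade are hypothetical labels chosen here so that each plot stays open in its own window.

```stata
sysuse nlsw88, clear

* One scatterplot and one correlation per candidate predictor
foreach x of varlist age ttl_exp grade {
    * name() keeps each graph in memory instead of overwriting the last one
    scatter wage `x', name(sc_`x', replace)
    corr wage `x'
}
```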
Now do regression analyses:
    reg wage varname
replacing varname with each of the other variables one at a time as the independent variable. There are two key things to look at: the \(R^2\) figure and the parameter estimate (Coef. for the independent variable, along with its significance). Which variables affect wage much? Do any not affect it at all?
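If you would like the key figures side by side, the regressions can be run in a loop and the slope, \(R^2\) and p-value printed on one line each. This is a sketch rather than the required method: it relies on the results that regress stores (e(r2), e(df_r), _b[] and _se[]), and the local macro p is just an illustrative name.

```stata
sysuse nlsw88, clear

foreach x of varlist age ttl_exp grade {
    quietly regress wage `x'
    * Two-sided p-value for the slope, from the t distribution
    local p = 2*ttail(e(df_r), abs(_b[`x']/_se[`x']))
    display "`x': Coef. = " %7.3f _b[`x'] "   R2 = " %6.4f e(r2) "   p = " %6.4f `p'
}
```

It is still worth running each regression without quietly at least once, so that you learn to read the full output table.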
Interpret the results: in each case ask the question, "what happens to the predicted value of wage if the value of X changes by one unit?" For two different values of the independent variable (X), calculate the predicted value of wage. See where these fall on the scatterplot, and where the regression line would lie. Does it seem like a good summary of the relationship?
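As an illustration with grade as X, the following sketch computes the predicted wage at two values of X and overlays the fitted line on the scatterplot. The values 12 and 16 are arbitrary choices made here for the example; any two values within the range of the data would do.

```stata
sysuse nlsw88, clear
regress wage grade

* Predicted wage at two illustrative values of X
display "Predicted wage at grade = 12: " _b[_cons] + _b[grade]*12
display "Predicted wage at grade = 16: " _b[_cons] + _b[grade]*16

* The two predictions differ by 4 times the slope
display "Difference: " 4*_b[grade]

* Scatterplot with the fitted regression line overlaid
twoway scatter wage grade || lfit wage grade
```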
If \(R^2\) is big, the independent variable "explains" the dependent variable "a lot". However, it is possible for \(R^2\) to be small and yet for the independent variable to have a systematic effect (i.e., a very low p-value): this independent variable may be only one thing among many that affect the dependent variable.
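For reference, the two quantities measure different things. In the standard definitions below, \(\hat{Y}_i\) is the predicted value for case i, \(\bar{Y}\) is the mean of Y, and \(\mathrm{SE}(b)\) is the standard error of the slope; this notation is introduced here and is not used elsewhere in the lab.

\[
R^2 = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2},
\qquad
t = \frac{b}{\mathrm{SE}(b)}
\]

\(R^2\) compares the residual variation with the total variation in Y, while the t statistic compares the slope with its own uncertainty. Because \(\mathrm{SE}(b)\) shrinks as the sample grows, a modest slope can be clearly distinguishable from zero in a large sample even when most of the variation in Y is left unexplained.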
1.3. Union effects
Test the effect of union on wage. Note that union is a binary variable (i.e., yes/no represented as 1/0). Use a t-test in the first instance, and then fit a regression. Compare the results. Do the same relating grade to union. Note that unionised workers tend to earn more and be better educated. Could it be that the union effect is simply due to them being better educated? That is, for workers with similar education, does union status matter?
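A minimal sketch of the comparison follows. It relies on the results that ttest stores (r(mu_1), r(mu_2) and r(t)); the scalar names diff_ttest and t_ttest are hypothetical, chosen here for illustration. With a 0/1 explanatory variable, the regression slope is the difference between the two group means, and the default (equal-variance) t-test gives a t statistic of the same size. Stata reports the t-test difference as group 0 minus group 1, so its sign is the opposite of the regression coefficient.

```stata
sysuse nlsw88, clear

* Two-sample t-test of mean wage by union status
ttest wage, by(union)
* Union mean minus non-union mean, saved for comparison
scalar diff_ttest = r(mu_2) - r(mu_1)
scalar t_ttest = r(t)

* The same comparison as a regression on the 0/1 variable
regress wage union
display "Difference in means (t-test): " diff_ttest
display "Coefficient on union (reg):   " _b[union]
display "t (t-test): " t_ttest "   t (reg): " _b[union]/_se[union]

* The same pair of analyses with grade as the outcome
ttest grade, by(union)
regress grade union
```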
Fit the wage/grade regression for unionised and non-unionised workers separately, and think about the results (make scatterplots too). Do:
    reg wage grade if union==0
    predict p0
    reg wage grade if union==1
    predict p1
Having saved the predicted values, we can plot the two regressions simultaneously:
    sort grade
    line p0 p1 grade || scatter wage grade
The || syntax allows us to combine plots. The following example exploits this a little more to make a plot that distinguishes by union status even more clearly:
    line p0 p1 grade || scatter wage grade if union==0 || scatter wage grade if union==1
1.4. Two explanatory variables
You can also fit a model with both union status and grade explaining wage. Fit a regression with both grade and union as explanatory variables. Interpret the parameter estimates.
Compare your results to the previous separate regressions, and the t-test.
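As a starting point, the model can be fitted as sketched below. The fitted equation has the form \(Y = a + b_1 X_1 + b_2 X_2\), with grade as \(X_1\) and union as \(X_2\); the subscripted symbols are notation introduced here. The value grade = 12 is an arbitrary choice for illustration.

```stata
sysuse nlsw88, clear

* Wage explained by years of education and union status together
regress wage grade union

* Predicted wage for two workers with the same education (grade = 12)
* who differ only in union status
display "Predicted wage, grade 12, non-union: " _b[_cons] + _b[grade]*12
display "Predicted wage, grade 12, union:     " _b[_cons] + _b[grade]*12 + _b[union]
```

The two predictions differ by exactly the coefficient on union, which is what it means to compare workers with similar education.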