Week 3 PC Lab

Recap: Univariate analysis

One-way descriptive summaries depend on the type of variable:

Two-way analyses

Now we will work with several ways of analysing two variables together: bivariate analysis.

We will look at three types of combination: categorical by categorical, interval/ratio by categorical, and interval/ratio by interval/ratio.

We will consider numerical and graphical techniques.

Categorical by categorical

Cross-tabulation is the simplest, and a quite powerful, way of looking at the relationship between two categorical (nominal or ordinal) or grouped variables.

Use the command

use http://teaching.sociology.ul.ie/so5041/labs/gssexamp

to load an extract from the 1991 U.S. General Social Survey.

Cross-tabulate race and region:

tab race region

Get either row or column percentages by adding the option row or col at the end of the command (options come after a comma). What patterns do you see in the table? Which percentages (row or column) are easier to interpret?
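As a sketch of the option syntax, the two variants might be run as follows. The clear option on use and the nofreq option on tab are additions for illustration and are not part of the exercise as set.

```stata
* Load the GSS extract (clear drops any data currently in memory)
use http://teaching.sociology.ul.ie/so5041/labs/gssexamp, clear

* Row percentages: each row (a category of race) sums to 100 across regions
tab race region, row

* Column percentages: each column (a region) sums to 100 across categories of race
tab race region, col

* Optional: suppress the raw counts and show the percentages only
tab race region, row nofreq
```

Comparing the two tables side by side is one way to decide which set of percentages answers the question you have in mind.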


With the same variables, create a clustered bar chart. You will first need a constant for the chart to count:

gen x = 1

Then draw the chart:

graph hbar (count) x, over(race) over(region) asyvars

If you add the option stack you get a stacked rather than clustered bar chart. Experiment with both types, and with which variable to use as the cluster or stack variable. Do you see the same patterns as in the table? See what happens if you drop the asyvars option.
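The variations suggested here could be tried in sequence, roughly as below. This is only a sketch: it assumes the GSS extract is already in memory and that no variable called x exists yet.

```stata
* A constant to count: (count) x then gives the number of cases behind each bar
* (skip this line if you have already created x)
gen x = 1

* Clustered: one cluster per region, one bar per category of race within it
graph hbar (count) x, over(race) over(region) asyvars

* Stacked: one bar per region, divided into segments by race
graph hbar (count) x, over(race) over(region) asyvars stack

* Roles swapped: clusters by race, bars by region
graph hbar (count) x, over(region) over(race) asyvars
```

Each graph command replaces the previous graph window, so look at one chart before running the next.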

Interval/Ratio by Categorical

Sometimes we have a continuous variable that "varies with" a categorical one: income may vary with gender, or with educational qualifications. The command

bysort groupvar: su continvar

gives summary statistics, including the mean, of the continuous variable (continvar) for each value of the categorical one (groupvar).

With the 1991 U.S. General Social Survey, look at how occupational prestige (prestg80) varies with variables such as region, race and sex (as the group variable).
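Filling in the variable names, the commands might look like the sketch below. It assumes the GSS extract is already in memory. The final tabstat line is not part of the exercise; it is shown only as a more compact alternative.

```stata
* Summary statistics of occupational prestige within each region
bysort region: su prestg80

* The same for the other grouping variables
bysort race: su prestg80
bysort sex: su prestg80

* Alternative (not in the lab text): one table of means, SDs and counts per group
tabstat prestg80, by(sex) statistics(mean sd n)
```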

Graphically, we can represent this with a bar chart where the length of each bar represents the mean of the continuous variable for that value of the categorical one:

graph hbar (mean) prestg80, over(region)

Box plots focus on medians and quartiles, and give a somewhat more detailed picture of the distribution than just the mean. Try graph box prestg80, over(region), etc.
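To compare the two kinds of graph across the other grouping variables, something like the following could be tried. It assumes the GSS extract is already in memory. The graph hbox line is an addition for illustration: it draws the same box plot horizontally.

```stata
* Mean prestige by sex and by race, as horizontal bars
graph hbar (mean) prestg80, over(sex)
graph hbar (mean) prestg80, over(race)

* Box plots for the same groupings
graph box prestg80, over(sex)
graph box prestg80, over(race)

* Horizontal box plot, which can be easier to read with long category labels
graph hbox prestg80, over(region)
```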

Interval/Ratio by Interval/Ratio

We will find out about numerical methods for summarising the relationship between pairs of interval variables later, but for now the scatterplot is very useful. With the U.S. General Social Survey, compare all three possible pairs of the following variables:

To do this, try scatter age prestg80 and so on.
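Note that scatter takes the vertical-axis variable first, so the order of the two names decides which variable appears on which axis. A small sketch using only the pair named above, assuming the GSS extract is already in memory:

```stata
* scatter yvar xvar: the first variable is plotted on the vertical axis
scatter age prestg80

* The same pair with the axes swapped: prestige (y) against age (x)
scatter prestg80 age
```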

What relationships do you see?

Brendan Halpin
Department of Sociology, University of Limerick
F1-002, x 3147; brendan.halpin@ul.ie