# PC Labs for SO5041: Week 2

## 1. Week 2: Univariate and bivariate analysis

One-way descriptive summaries depend on the type of variable:

- Categorical
  - Nominal variables: the numbers stand for categories that are "just different". All we can do is enumerate the different types:
    - Frequency table (ignore cumulative percents): `sysuse nlsw88`, then `tab occupation`
    - Pie chart: `graph pie, over(occupation)`
    - Bar chart: `graph bar, over(occupation)`
  - Ordinal variables: the categories are different, but have an order. We can use all of the above summaries, plus:
    - Median and related measures: `centile age`, then `centile age, centile(25 50 75)`
- Scale or measurement variables: here the number is inherently meaningful, as a count or a measurement. We can use all of the above summaries (if necessary by grouping the variable into bands), plus summaries that take advantage of the scale property:
  - Mean and standard deviation: `su age`
  - Histograms: `histogram age`
  - Box plots: `graph box age`
  - Scale variables differ on whether they are "interval" or "ratio". Variables where zero really means zero are ratio variables, and ratios are meaningful: we can say, for example, that 30 is 50% more than 20. Most real-world measurements are ratio variables. Some measurement scales have arbitrary zeros (temperature in centigrade or Fahrenheit, or opinion scales, for example), and there ratios don't make sense.
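Grouping a scale variable into bands, as mentioned above, can be sketched as follows. This is only an illustration: the variable name `ageband` and the cut points are assumptions chosen for this example, not part of the lab.

```stata
* Load the example data used in this lab
sysuse nlsw88, clear

* Group age into bands. The cut points are illustrative assumptions;
* observations outside the range of at() would be set to missing.
egen ageband = cut(age), at(34,38,42,47)

* The banded variable can now be summarised like a categorical one
tab ageband
graph bar, over(ageband)
```

Each band is labelled by its lower cut point, so a value of 38 in `ageband` covers ages from 38 up to, but not including, 42.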

### 1.1. Two-way analyses

Now we will work with several ways of analysing two variables together: **bivariate** analysis.

We will look at three types of combination:

- categorical by categorical
- categorical by continuous (interval/ratio)
- continuous by continuous.

We will consider numerical and graphical techniques.

#### 1.1.1. Categorical by categorical

Cross-tabulation is the easiest, and quite a powerful, method for looking at the relationship between two categorical (nominal, ordinal) or grouped variables.

Cross-tabulate `occupation` and `collgrad`:

`tab occupation collgrad`

Get either row or column percentages by adding the option `row` or `col` at the end of the command (options come after a comma). What patterns do you see in the table? Which percentages (row or column) are easier to interpret?
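As a brief sketch of the options just described, the commands below request row and column percentages in turn. The `nofreq` option, which is not mentioned in the lab text, suppresses the raw counts so that only percentages are shown.

```stata
* Load the example data
sysuse nlsw88, clear

* Row percentages: within each occupation, the share who are graduates
tab occupation collgrad, row

* Column percentages: within each graduate group, the share in each occupation
tab occupation collgrad, col

* Row percentages only, without the cell counts
tab occupation collgrad, row nofreq
```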

- Graphing

With the same variables, use `graph hbar, over(collgrad) over(occupation) asyvars` to create a clustered bar chart. If you add the option `stack` you get a stacked rather than clustered bar chart. Experiment with both types, and with which variable to use as the cluster or stack variable. Do you see the same patterns as in the table? See what happens if you drop the `asyvars` option. See also what happens with `graph hbar, over(occupation) over(collgrad) asyvars`.
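The variants discussed above can be run one after another, for example from a do-file. The last command is an addition not taken from the lab text: it assumes that the `percentages` option, combined with `(count)` and `stack`, rescales each bar to 100%, which should make the bars comparable to row percentages in the table. Treat it as a sketch to experiment with.

```stata
* Load the example data
sysuse nlsw88, clear

* Clustered bars: one bar per collgrad value within each occupation
graph hbar, over(collgrad) over(occupation) asyvars

* Stacked bars: the collgrad segments are stacked within each occupation
graph hbar, over(collgrad) over(occupation) asyvars stack

* Swap the roles of the two variables
graph hbar, over(occupation) over(collgrad) asyvars

* Assumed variant: stacked bars rescaled to 100% within each occupation
graph hbar (count), over(collgrad) over(occupation) asyvars stack percentages
```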

#### 1.1.2. Interval/Ratio by Categorical

Sometimes we have a continuous variable that "varies with" a categorical
one. Income may vary with gender, or with educational qualifications.
`bysort` *groupvar*: `su` *continvar* allows us to get the mean value of
the continuous variable (*continvar*) for each value of the categorical
one (*groupvar*).

With the NLSW88 data, see how wage varies with variables such as `collgrad` or `occupation`.
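A minimal sketch of these group-wise summaries follows. The `tabstat` line is an addition not taken from the lab text; it is one compact way of putting the same statistics in a single table.

```stata
* Load the example data
sysuse nlsw88, clear

* Mean (and SD, min, max) of wage for each value of the grouping variable
bysort collgrad: su wage
bysort occupation: su wage

* Alternative: one compact table of selected statistics by group
tabstat wage, by(occupation) statistics(mean sd median n)
```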

Graphically, we can represent this with a bar chart where the height of the bar represents the mean of the continuous variable for that value of the categorical one: `graph hbar (mean) wage, over(occupation)`.

Box plots focus on medians and quartiles, and give a somewhat more detailed picture of the distribution than just the mean. Try `graph box wage, over(occupation)`, etc.
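The following sketch puts the mean-based and median-based views side by side. The `sort(1)` suboption and the horizontal `graph hbox` form are additions not taken from the lab text; they are used here only because long occupation labels tend to be easier to read that way.

```stata
* Load the example data
sysuse nlsw88, clear

* Mean wage by occupation, with bars ordered by the mean
graph hbar (mean) wage, over(occupation, sort(1))

* Median wage by occupation, for comparison with the mean
graph hbar (median) wage, over(occupation, sort(1))

* Box plots: vertical, then horizontal
graph box wage, over(occupation)
graph hbox wage, over(occupation)
```

Where the mean and median bars differ noticeably for an occupation, the box plot usually shows why, typically a skewed distribution or a few extreme values.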

#### 1.1.3. Interval/Ratio by Interval/Ratio

We will find out about numerical methods for summarising the relationship between pairs of interval variables later, but for now the scatterplot is very useful. With the NLSW88 data, compare all three possible pairs of the following variables:

- Years of education (`grade`)
- Wage rate (`wage`)
- Work experience (`ttl_exp`)

What relationships do you see?

To do this, try `scatter wage ttl_exp` and so on.
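Spelled out, the three pairwise scatterplots look like this. The final command is an addition not taken from the lab text: `graph matrix` draws all the pairs in a single figure, and `half` keeps only the lower triangle.

```stata
* Load the example data
sysuse nlsw88, clear

* The three possible pairs, one plot each (y variable first, x variable second)
scatter wage ttl_exp
scatter wage grade
scatter ttl_exp grade

* All pairs at once in a scatterplot matrix
graph matrix grade wage ttl_exp, half
```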