PC Labs for SO5041: Week 1

Week 1: Intro

The main software we will use is:

  • Stata: for data handling, statistical analysis, graphics and presentation
  • Excel spreadsheet: for smaller analyses, scratch work, alternative graphics, etc.

Today, we will take a brief look at Stata.

  • Log on, and start Stata: type "Stata" in the search box and click on the App icon
  • The Stata window has four panels: the big one is for output, and the wide one at the bottom is for entering commands
  • Load a data set: Type sysuse nlsw88 in the command panel and press return. This command loads an example file that comes with Stata (it's an extract from the 1988 US National Longitudinal Study of Women).
  • The bottom left window shows you what variables are in the dataset
  • You can get more information about the variables by entering the command describe or des for short
  • You can look at the variables by entering the command list or by clicking on one of the data-editor icons in the toolbar
  • What does the data look like? You can summarise a variable by entering the command summarize varname (su for short): for example, su age
  • If the variable has a small number of distinct values, you can tabulate it: e.g., tab occupation
  • Do this with a number of variables to get a feel for the data
  • We can also make two-way tables: for example, tab occ never_married
  • Graphs are easy: try histogram age or graph pie, over(occupation). Explore the graphics menu too.
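
The steps above can be narrowed to particular variables or cases. The commands below are a minimal sketch, assuming nlsw88 is already loaded; the choice of variables and of the first ten cases is only for illustration.

. * Describe just two variables rather than the whole dataset
. describe age wage
. * List three variables for the first ten cases only
. list age wage occupation in 1/10
. * Two-way table with row percentages added
. tab occupation never_married, row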

Univariate analysis: one variable at a time

Univariate analysis involves looking at one variable at a time. It's less interesting than bivariate analysis, but it is an important building block. It gives us numerical and graphical summaries of single variables.

Numerical summaries

We need to distinguish between types of variables when we are summarising them. A key distinction is whether a variable has categories (e.g., religion) or whether the numbers have intrinsic meaning (e.g., count variables like how many children are in a family, or measures like time spent watching screens). If a variable has categories (or a small number of distinct values), we can summarise it completely with a frequency table:

. sysuse nlsw88
. tab industry

Note this gives you the number in each category, the percentage and the cumulative percentage (the last makes sense only when the variable is ordered).
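
The frequency table can be adjusted with options after a comma. As a short sketch (the two options shown are just examples of what is available), sort orders the categories from most to least frequent, and missing adds a row for any cases with no recorded value:

. * Categories in descending order of frequency
. tab industry, sort
. * Include missing values as their own category
. tab industry, missing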

When a variable consists of numbers that are meaningful, summaries like the mean and spread are useful.

. summarize wage

This gives the number of cases, mean, standard deviation, and highest and lowest values for the variable. These give a good deal of information about the variable. What happens if we try to tabulate this variable?

. tab wage

You get pages and pages of output because it has so many values. Press the Q key to escape. However, not all variables that are meaningful numbers are unsuited to tabulation. Age, for instance, does not have many values in this data set:

. tab age

We can get the median of a numerically-meaningful variable too:

. centile age

Note that in the tab age output, the cumulative percentage first goes above 50% at age 39, which is the value centile reports as the median.
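
The median is the 50th centile, and the same command can report other centiles if asked. As an illustration (the particular centiles requested here are an arbitrary choice), this gives the lower quartile, median and upper quartile of age:

. * 25th, 50th (median) and 75th centiles of age
. centile age, centile(25 50 75)

Each value can be checked against the cumulative percentage column of tab age in the same way as the median.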

Get the median for wage too, and compare it with the mean for wage:

. centile wage
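
To see the mean and median side by side, two other commands may be convenient. This is a sketch of one way to do it, not the only one: summarize with the detail option reports percentiles (the 50% line is the median) alongside the mean, and tabstat prints only the statistics requested.

. * Mean, standard deviation and percentiles in one display
. summarize wage, detail
. * A compact table with just the chosen statistics
. tabstat wage, statistics(mean median sd)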

Graphical summaries

The same division of variables matters for graphical summaries. Where they are categorical (or have relatively few distinct values), we can use bar charts (where the height of the bar is proportional to the number in the category) or pie charts (where the angle of each wedge gives the same information).

. graph bar, over(industry)
. graph hbar, over(industry)
. graph pie, over(industry)
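
The bar charts can be made to show the number of cases explicitly, and long category labels can be angled so they do not overlap. The following is a sketch that assumes a reasonably recent version of Stata (older versions may not accept (count) without a variable); the 45-degree angle is an arbitrary choice.

. * Bar height is the number of cases in each industry
. graph bar (count), over(industry, label(angle(45)))
. * The horizontal version leaves more room for the labels
. graph hbar (count), over(industry)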

Where the number is meaningful (like wage), we can make histograms and box plots:

. histogram wage
. graph box wage
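
Both graph types take options that change how the variable is displayed. The commands below are illustrative only; the number of bins and the grouping variable (race) are arbitrary choices.

. * Histogram with 20 bars, showing counts rather than densities
. histogram wage, bin(20) frequency
. * For a variable with few distinct values, one bar per value
. histogram age, discrete frequency
. * Separate box plots of wage for each category of race
. graph box wage, over(race)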