PC Labs for SO5041: Week 1
Week 1: Intro
The main software we will use is:
- Stata: for data handling, statistical analysis, graphics and presentation
- Excel spreadsheet: for smaller analyses, scratch work, alternative graphics, etc.
Today, we will take a brief look at Stata.
- Log on, and start Stata: type "Stata" in the search box and click on the App icon
- The Stata window has four panels: the big one is for output, and the wide one at the bottom is for entering commands
- Load a data set: Type
sysuse nlsw88
in the command panel and press return. This command loads an example file that comes with Stata (it's an extract from the 1988 US National Longitudinal Survey of Women).
- The bottom left window shows you what variables are in the dataset
- You can get more information about the variables by entering the command
describe
or
des
for short
- You can look at the variables by entering the command
list
or by clicking on one of the data-editor icons in the toolbar
- What does the data look like? You can summarise variables by entering the command
summarize varname
(where varname is the name of a variable), for example
su age
- If the variable has a small number of distinct elements, you can tabulate it: e.g.,
tab occupation
- Do this with a number of variables to get a feel for the data
- We can also make two-way tables: for example,
tab occ never_married
- Graphs are easy: try
histogram age
or
graph pie, over(occupation)
Explore the graphics menu too.
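The steps above can also be collected in one place and run together. The following is a minimal sketch, assuming you paste it into Stata's do-file editor (not covered above) rather than typing each line in the command panel. Apart from the clear option, it uses only commands already introduced.

```stata
* Load the example data set shipped with Stata
sysuse nlsw88, clear

* Variable names, types and labels
describe

* Numerical summary of one variable
summarize age

* One-way and two-way frequency tables
tabulate occupation
tabulate occupation never_married

* Simple graphs
histogram age
graph pie, over(occupation)
```

The clear option is an addition here: it lets sysuse replace any data already in memory, so the block can be run more than once.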
Univariate analysis: one variable at a time
Univariate analysis involves looking at one variable at a time. It's less interesting than bivariate, but it is an important building block. It gives us numerical and graphical summaries of single variables.
Numerical summaries
We need to distinguish between types of variables when we are summarising them. A key distinction is whether a variable has categories (e.g., religion) or whether the numbers have implicit meaning (e.g., count variables like how many children are in a family, or measures like time spent watching screens). If a variable has categories (or small numbers of values), we can summarise it completely with a frequency table:
. sysuse nlsw88
. tab industry
Note this gives you the number in each category, the percentage and the cumulative percentage (the last makes sense only when the variable is ordered).
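To see why the cumulative percentage only makes sense for ordered variables, it may help to compare an unordered variable with an ordered one. This sketch assumes the grade variable (current grade completed) from the same data set, which is not used elsewhere in this lab.

```stata
sysuse nlsw88, clear

* Unordered categories: the Cum. column has no useful interpretation
tabulate industry

* Ordered values: Cum. gives the percentage at or below each grade
tabulate grade
```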
When a variable consists of numbers that are meaningful, summaries like the mean and spread are useful.
. summarize wage
This gives the number of cases, mean, standard deviation, and highest and lowest values for the variable. These give a good deal of info about the variable. What happens if we try to tabulate this variable?
. tab wage
You get pages and pages of output because it has so many values. Press the Q key to escape. However, not all variables that are meaningful numbers are unsuited to tabulation. Age, for instance, does not have many values in this data set:
. tab age
We can get the median of a numerically-meaningful variable too:
. centile age
Note that in the tab age output, the cumulative percentage first goes above 50% at 39, which centile determines is the median value.
Get the median for wage too, and compare it with the mean for wage:
. centile wage
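A quicker way to see the mean and median side by side is the detail option of summarize, which reports percentiles (the 50th percentile is the median). This is a minimal sketch; the display line relies on the results that summarize leaves behind in r().

```stata
sysuse nlsw88, clear

* detail adds percentiles, skewness, etc. to the usual summary
summarize wage, detail

* r(mean) and r(p50) are stored by summarize with the detail option
display "mean = " r(mean) "   median = " r(p50)
```

If the mean is noticeably higher than the median, that suggests the distribution has a long tail of high values.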
Graphical summaries
The same division of variables matters for graphical summaries. Where they are categorical (or have relatively few distinct values), we can use bar charts (where the height of the bar is proportional to the number in the category) or pie charts (where the angle of each wedge gives the same info).
. graph bar, over(industry)
. graph hbar, over(industry)
. graph pie, over(industry)
Where the number is meaningful (like wage) we can make histograms and box plots:
. histogram wage
. graph box wage
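The default histogram and box plot can be adjusted with options. This is a sketch of a few that are likely to be useful; the choice of 20 bins is arbitrary.

```stata
sysuse nlsw88, clear

* Fewer, wider bars, with the vertical axis in percent rather than density
histogram wage, bin(20) percent

* Age takes only whole-number values, so give each value its own bar
histogram age, discrete

* Horizontal version of the box plot
graph hbox wage
```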