The main software we will be using will be:
Today, we will take a brief look at Stata.
S -> Specialist Software -> Stata or type "Stata" in the
sysuse nlsw88 in the command panel and press return. This command loads an example file that comes with Stata (it's an extract from the 1988 US National Longitudinal Study of Women).
list or by clicking on one of the data-editor icons in the toolbar
summarize varname, for example
tab occ never_married
graph pie, over(occupation). Explore the graphics menu too.
Univariate analysis involves looking at one variable at a time. It's less interesting than bivariate, but it is an important building block. It gives us numerical and graphical summaries of single variables.
We need to distinguish between types of variables when we are summarising them. A key distinction is whether a variable has categories (e.g., religion) or whether its numbers are meaningful in themselves (e.g., count variables like how many children are in a family, or measures like time spent watching screens). If a variable has categories (or only a small number of distinct values), we can summarise it completely with a frequency table:
. sysuse nlsw88
. tab industry
Note this gives you the number in each category, the percentage and the cumulative percentage (the last makes sense only when the variable is ordered).
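As a small sketch of how the frequency table can be varied, the commands below use two standard options of tabulate. They are written without the leading dot so they can be pasted into a do-file; the choice of options is just an illustration, not part of the exercise.

```stata
* Load the example data, replacing anything already in memory
sysuse nlsw88, clear

* Same table, but with categories ordered from most to least frequent
tabulate industry, sort

* Same table, but with missing values shown as a category of their own
tabulate industry, missing
```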
When a variable consists of numbers that are meaningful, summaries like the mean and spread are useful.
. summarize wage
This gives the number of cases, mean, standard deviation, and highest and lowest values for the variable. These tell us a good deal about the variable. What happens if we try to tabulate it instead?
. tab wage
You get pages and pages of output because it has so many values. Press the Q key to escape. However, not all variables that are meaningful numbers are unsuited to tabulation. Age, for instance, does not have many values in this data set:
. tab age
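One way to check in advance whether a variable is practical to tabulate is to count its distinct values without printing the table. The sketch below relies on the stored result r(r), the number of rows in a one-way table; the comparison of age and wage is only an illustration.

```stata
* Load the example data, replacing anything already in memory
sysuse nlsw88, clear

* Run the tabulation silently and report how many rows it would have
quietly tabulate age
display "distinct values of age:  " r(r)

quietly tabulate wage
display "distinct values of wage: " r(r)
```

A small count suggests a frequency table is readable; a large one suggests using summarize instead.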
We can get the median of a numerically-meaningful variable too:
. centile age
Note that in the tab age output, the cumulative percentage first goes above 50% at age 39, which is what makes 39 the median value.
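The median is just the 50th centile, so the same command can report other centiles as well. A minimal sketch, assuming we also want the quartiles (the choice of 25 and 75 is ours, not part of the exercise):

```stata
* Load the example data, replacing anything already in memory
sysuse nlsw88, clear

* Lower quartile, median and upper quartile of age in one table
centile age, centile(25 50 75)
```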
Get the median for wage too, and compare it with the mean for wage:
. centile wage
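The mean and median can also be obtained side by side from a single command. The sketch below uses the detail option of summarize, which adds percentiles to the output and stores the median as r(p50); the display line is only a convenience for comparing the two.

```stata
* Load the example data, replacing anything already in memory
sysuse nlsw88, clear

* Detailed summary: includes percentiles, skewness and kurtosis
summarize wage, detail

* Print the mean and the median (50th percentile) next to each other
display "mean = " r(mean) "   median = " r(p50)
```

When the mean is clearly above the median, as tends to happen with wages, it suggests a distribution with a long right tail.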
The same division of variables matters for graphical summaries. Where they are categorical (or have relatively few distinct values), we can use bar charts (where the height of the bar is proportional to the number in the category) or pie charts (where the angle of each wedge gives the same info).
. gen n = 1
. graph bar (count) n, over(industry)
. graph hbar (count) n, over(industry)
. graph pie, over(industry)
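Bar charts of categories are often easier to read when the bars are ordered by size. A hedged sketch, reusing the helper variable n from above and the sort() and descending suboptions of over(); the ordering is a presentational choice, not something the exercise requires.

```stata
* Load the example data, replacing anything already in memory
sysuse nlsw88, clear

* Helper variable: counting it gives the number of cases per category
generate n = 1

* Horizontal bars, ordered by the first plotted statistic (the count),
* largest category at the top
graph hbar (count) n, over(industry, sort(1) descending)
```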
Where the number is meaningful (like wage), we can make histograms and box plots:
. histogram wage
. graph box wage
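Both commands take options that change what is shown. The sketch below is one possible variation: the number of bins (20) and the grouping variable (union) are arbitrary choices made for illustration.

```stata
* Load the example data, replacing anything already in memory
sysuse nlsw88, clear

* Histogram with 20 bins, showing counts rather than densities on the y axis
histogram wage, bin(20) frequency

* Separate box plots of wage for union members and non-members
graph box wage, over(union)
```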