PC Labs for SO5041: Week 4
Week 4 Lab: Modifying Data
Editing data
We have seen how to enter data, and how to load data from existing files. It is also possible to change variables and to create new variables.
Creating new variables
generate varname = expression is used to create new variables. The varname is the name of the new variable that will be created. The expression is any mathematical expression, but often you will want to base it on existing variables. For instance:
generate three = 2 + 1
gen mark = markpct/100
gen total = part1 + part2 + part3
Many arithmetical, statistical, mathematical and other functions are available. Enter help functions to get an overview. This simple example creates a new variable that is the square root of an existing variable, x:
gen sqrootx = sqrt(x)
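The generate examples above can be tried on a small made-up data set. This is only a sketch: the three rows and their values are invented for illustration and are not part of the Week 4 data.

```stata
* Toy data: three hypothetical cases (values are invented)
clear
input part1 part2 part3 markpct x
10 20 30 55 4
5 15 25 70 9
0 10 20 100 16
end

* New variable from a sum of existing variables
gen total = part1 + part2 + part3
* New variable that rescales a percentage to the range 0-1
gen mark = markpct/100
* New variable from a function of an existing variable
gen sqrootx = sqrt(x)
list
```

Each gen command adds one column to the data set; list shows the old and new variables side by side.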
Load the data file week4.dta (this is an extract from the 2016 European Social Survey for Ireland):
use http://teaching.sociology.ul.ie/so5041/labs/week4
Make a new variable that is the difference between the respondent's work hours and his/her partner's work hours (wkhtot and wkhtotp).
Summarise the variable. Hint: use the commands summarize and hist.
Compare the distributions of the difference according to the respondent's gender (variable gndr).
Recoding variables
We can also recode variables to give them more convenient values. For instance, if we have age in years (agea in the Week 4 data) we could recode it into groups as follows:
gen agegr = agea
recode agegr 0/15=1 16/20=2 21/35=3 36/50=4 51/999=5
label define agg 1 "0/15" 2 "16/20" 3 "21/35" 4 "36/50" 5 "Over 50"
label values agegr agg
Note that the label commands are not necessary, but are very helpful. If you make a habit of putting them in, it makes your work self-documenting.
I am making a strong assumption that age is in whole years here. For instance, if someone in the data set is recorded as 15.5 years, their value won't be changed. If the groups overlap (e.g., 0/15=1 15/20=2) this won't be a problem, even though it appears ambiguous. Values are assigned according to the first part that applies to them, so in this case 15 year-olds go into category 1, not category 2 (if someone is recorded as 15.5 they go into category 2).
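The behaviour described above can be checked on a few invented ages. This is a minimal sketch, assuming a toy variable called age (not the agea variable in the lab file), that compares overlapping and non-overlapping rules.

```stata
* Toy data: four hypothetical ages, one of them fractional
clear
input age
14
15
15.5
20
end

* Overlapping rules: 15 matches the first rule, 15.5 only the second
gen overlap = age
recode overlap 0/15=1 15/20=2

* Non-overlapping rules: 15.5 matches neither rule and is left unchanged
gen gap = age
recode gap 0/15=1 16/20=2
list
```

Comparing the overlap and gap columns in the listing shows why the whole-years assumption matters when the groups do not overlap.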
Sometimes, categorical variables have several small categories, which we might like to lump together. Do tab prtvtbie to see an example of this using voting behaviour. We may want to group all the small parties together for simplicity. Do tab prtvtbie, nolabel to see what the numbers behind the labels are (another way to do this would be label list prtvtbie). Categories 8, 9 and 10 are very small and could usefully be lumped together. Recode them all to the same value (10 is good, because it already has the label "Other"). Tabulate the variable again.
Note that if you simply recode prtvtbie you will lose the original values, so it is often a good idea to create a new variable as a copy of the original, and to recode that. You can then apply the old labels to the new variable with label values newvar prtvtbie (the name of the label set is often the same as the name of the variable; if it is not, the command describe will show the names of variables and the names of their labels).
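The copy-then-recode approach can be sketched on made-up data. The variable party, its four values and its labels are all hypothetical stand-ins for prtvtbie, chosen only so that the example runs without the lab file.

```stata
* Toy data: a hypothetical party variable with a label set of the same name
clear
input party
1
8
9
10
end
label define party 1 "Party A" 8 "Small party B" 9 "Small party C" 10 "Other"
label values party party

* Copy the original, then recode the copy so the original survives
gen party2 = party
recode party2 (8 9 = 10)
* Attach the existing label set to the new variable
label values party2 party
tab party2
tab party
```

The second tab confirms that the original variable still has its separate small categories.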
Selecting and Excluding Cases
Sometimes we need to exclude certain cases from consideration. One example is "missing values": where a variable has a value that is not useful (respondent refused, didn't know, made no sense) we can declare this as missing and it is not used in analysis. Other times we may simply want to exclude certain cases, for instance to look at the income distribution for women only, or to calculate the mean earnings for people who have earnings greater than zero (i.e. there are many people not working whose earnings are exactly zero: this is a meaningful value but we may often wish to ignore it).
Missing values
Missing values are usually coded as numbers which will not occur in reality. Thus for instance, if people refuse to give their age, their response may be coded 999 or -9: these are impossible values for age so there is no confusion. For income, 999 would not do because it could be a real value. When missing values are used like this it is very important that Stata knows they are not meaningful values, and thus it is necessary to declare them.
Download the data file http://teaching.sociology.ul.ie/so5041/labs/week4raw.dta and load it into Stata:
clear
use http://teaching.sociology.ul.ie/so5041/labs/week4raw
This is the same file as above but with missing values included. Numbers too big for the range are used to represent missing.
- Use summarize to look at the distribution of agea. Note the minimum, the maximum and the mean.
- Now declare the missing values: recode agea 999=. (Stata uses "." to indicate missing).
- Re-run the summary and note the difference.
Because the 999s were treated as valid values the first time, the mean was much higher. When they are excluded, the mean falls and the number of cases falls.
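The effect can be seen on a tiny invented example. This is a sketch with four made-up ages, one of which is the missing code 999. It is not the week4raw file.

```stata
* Toy data: three hypothetical ages plus one case coded 999 for missing
clear
input agea
25
40
55
999
end

* Before: 999 is counted as a real age and inflates the mean
summarize agea
* Declare 999 as missing
recode agea 999=.
* After: one case fewer, and the mean is based on the real ages only
summarize agea
```

The second summary reports three observations instead of four, and the maximum is now a plausible age.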
Selecting cases: if and in
Stata has two useful keywords for selecting cases, if and in. These restrict the command to a certain subset of cases (e.g., if sex==2) or a certain range (e.g., in 1/100). For instance, compare the output of list to list in 1/10.
If we want to restrict analysis to women only, we can use if:
clear
use http://teaching.sociology.ul.ie/so5041/labs/week4
su wkhtot if gndr==2
tab prtvtbie if gndr==2
Note that we use == to test for equality. For not-equals we use !=, and note also >, <, <= etc. We can combine conditions too:
tab prtvtbie if gndr==2 & wkhtot<60
tab prtvtbie if gndr==2 & inrange(wkhtot,30,50)
tab prtvtbie gndr if prtvtbie==1 | prtvtbie==2
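To see exactly which cases each condition selects, it can help to count them on a small made-up data set. In this sketch the five rows are invented; only the variable names gndr and wkhtot are borrowed from the lab data.

```stata
* Toy data: five hypothetical respondents (gndr 2 = women, as above)
clear
input gndr wkhtot
1 40
1 65
2 20
2 35
2 45
end

* in selects by position: the first three rows only
list in 1/3
* & means "and": women working fewer than 60 hours
count if gndr==2 & wkhtot<60
* inrange() includes both end points: women working 30 to 50 hours
count if gndr==2 & inrange(wkhtot,30,50)
* | means "or": very short or very long hours
count if wkhtot<25 | wkhtot>60
```

Replacing count with list in any of these lines shows the selected rows themselves.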
Using if and summarize, calculate the mean own work hours for men and women separately.
Do-files
Stata commands can be kept in files, called do-files. It can be good practice to use do-files to organise your work. You can open a do-file from within Stata, and execute single commands or blocks of commands within it. It can be a useful strategy, when you have a particular task in hand, to build up the minimum set of commands necessary to go from your data file to the desired result. Note that you can copy commands from the Review window into the do-file editor.
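As an illustration, a short do-file for part of this lab might look like the sketch below. The file name week4.do is hypothetical, and the commands are simply those used earlier in this lab, collected in one place.

```stata
* week4.do (hypothetical file name): from data file to result
clear
use http://teaching.sociology.ul.ie/so5041/labs/week4

* Mean work hours, women only
su wkhtot if gndr==2
* Party voted for, women only
tab prtvtbie if gndr==2
```

Assuming the file is saved in the current working directory, it can be run in one go by typing do week4.do in the Command window.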