We have seen how to enter data, and how to load data from existing files. It is also possible to change variables and to create new variables.

`generate` *varname* = *expression* is used to create new variables. The *varname* is the name of the new variable that will be created. The *expression* is any mathematical expression, but often you will want to base it on existing variables. For instance:

    generate three = 2 + 1
    gen mark = markpct/100
    gen total = part1 + part2 + part3

Lots of arithmetical, statistical, mathematical and other functions are available. Enter `help functions` to get an overview. This is a simple example that creates a new variable holding the square root of an existing variable `x`:

    gen sqrootx = sqrt(x)
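As a minimal sketch of how these pieces fit together, the commands below build a small toy data set and derive new variables from it. The data and the names `logx` and `halfx` are made up for illustration and are not part of the lab files. `replace` is the companion to `generate` for changing a variable that already exists.

```stata
* Toy data (hypothetical): five observations with x = 1, 4, 9, 16, 25
clear
set obs 5
generate x = _n^2

* New variables based on the existing variable x
generate sqrootx = sqrt(x)
generate logx = ln(x)
generate halfx = x/2

* Change an existing variable in place: round halfx to whole numbers
replace halfx = round(halfx)

list
```

Running `generate` twice with the same *varname* gives an error, which is why changes to an existing variable go through `replace`.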

Using the data file week4.dta (e.g. `use http://teaching.sociology.ul.ie/so5041/labs/week4`), make a new variable that is the difference between the respondent's work hours and his/her partner's work hours (`wkhtot` and `wkhtotp`). Next, create a variable that represents the ratio. Summarise these variables and plot their relationship. (Hint: use the commands `summarize` and `scatter`.)

We can also recode variables to give them more convenient values. For instance, if we have age in years we could recode it into groups as follows:

    gen agegr = agea
    recode agegr 0/15=1 16/20=2 21/35=3 36/50=4 51/999=5
    label define agg 1 "0/15" 2 "16/20" 3 "21/35" 4 "36/50" 5 "51/999"
    label values agegr agg

(Note that the label commands are not necessary, but are very helpful.)

I am making a strong assumption that age is in whole years here. For instance, if someone in the data set is recorded as 15.5 years, their value won't be changed, because it falls in the gap between two groups. If the groups overlap (e.g., `0/15=1 15/20=2`) this won't be a problem, even though it appears ambiguous. Values are assigned according to the first rule that applies to them, so in this case 15-year-olds go into category 1, not category 2 (if someone is recorded as 15.1 they go into category 2).
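The behaviour described above can be checked on a toy data set. This is only a sketch: the variable names `age`, `gap` and `overlap` and the values are invented for illustration.

```stata
* Toy data (hypothetical): ages on and between the group boundaries
clear
input age
15
15.1
15.5
16
20
end

* Rules with a gap between 15 and 16: values strictly between them are left alone
generate gap = age
recode gap 0/15=1 16/20=2

* Overlapping rules: the first matching rule wins, so 15 goes to category 1
generate overlap = age
recode overlap 0/15=1 15/20=2

list
```

In the listing, `gap` should still hold 15.1 and 15.5 for the second and third cases, while `overlap` should put them in category 2.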

Sometimes, categorical variables have several small categories, which we might like to lump together. Do `tab prtvtaie` to see an example of this using voting behaviour. We may want to group all the small parties together for simplicity. Do `tab prtvtaie, nolabel` to see what the numbers behind the labels are (another way to do this would be `label list prtvtaie`). Categories 6, 8 and 10 are very small and could usefully be lumped together. Recode them all to the same value (10 is good, because it already has the label "Other"). Tabulate the variable again.

Note that if you simply recode `prtvtaie` you will lose the original values, so it is often a good idea to create a new variable as a copy of the original, and to recode that. You can then apply the old labels to the new variable by `label values` *newvar* `prtvtaie` (the name of the labels is often the same as the name of the variable; if it is not, the command `describe` will show names of variables and names of labels).
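The copy-then-recode pattern might look like the following sketch. It uses an invented variable `party` with made-up categories and labels rather than the real `prtvtaie`, so the codes here are assumptions chosen only to illustrate the steps.

```stata
* Toy data (hypothetical): a labelled categorical variable
clear
input party
1
2
6
8
10
end
label define party 1 "Party A" 2 "Party B" 6 "Party C" 8 "Party D" 10 "Other"
label values party party

* Work on a copy so the original values are kept
generate party2 = party
recode party2 (6 8 = 10)

* Attach the existing set of labels to the new variable
label values party2 party

tabulate party
tabulate party2
```

Comparing the two tables shows the small categories folded into "Other" in `party2`, with `party` left untouched.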

Sometimes we need to exclude certain cases from consideration. One example is "missing values": where a variable has a value that is not useful (respondent refused, didn't know, made no sense) we can declare this as missing and it is not used in analysis. Other times we may simply want to exclude certain cases, for instance to look at the income distribution for women only, or to calculate the mean earnings for people who have earnings greater than zero (i.e. there are many people not working whose earnings are exactly zero: this is a meaningful value but we may often wish to ignore it).

Missing values are usually coded as numbers which will not occur in reality. Thus for instance, if people refuse to give their age, their response may be coded 999 or -9: these are impossible values for age so there is no confusion. For income, 999 would not do because it could be a real value. When missing values are used like this it is very important that Stata knows they are not meaningful values, and thus it is necessary to declare them.

Download the data file example6.dta and load it into Stata (e.g., `use http://teaching.sociology.ul.ie/so5041/labs/example6`). This is the same file as above but with missing values included. The number -9 represents missing.

- Use `summarize` to look at the distribution of `grsearn`. Note the minimum and the mean.
- Now declare the missing values: `recode grsearn -9=.`. Stata uses "`.`" to indicate missing.
- Re-run the summary and note the difference.

Because the -9s were treated as valid values the first time, the mean was much lower. When they are excluded, the mean rises and the number of cases falls.
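The same effect can be seen on a tiny made-up data set, where the arithmetic is easy to follow by hand. The variable `earn` and its values are hypothetical and not taken from example6.dta.

```stata
* Toy data (hypothetical): two refusals coded -9 and three real values
clear
input earn
-9
200
300
-9
400
end

* Before declaring missing: 5 cases, minimum -9, mean (200+300+400-18)/5 = 176.4
summarize earn

* Declare -9 as missing
recode earn -9=.

* After: 3 cases, minimum 200, mean 300
summarize earn
```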

`if` and `in`
Stata has two useful keywords for selecting cases, `if` and `in`. These restrict the command to a certain subset of cases (e.g., `if sex==2`) or a certain range (e.g., `in 1/100`). For instance, compare the output of `list` to `list in 1/10`.

If we want to restrict analysis to women only, we can use `if`:

    clear
    use http://teaching.sociology.ul.ie/so5041/labs/week4
    su wkhtot if gndr==2
    tab prtvtaie if gndr==2

Note that we use `==` to test for equality. For not-equals we use `!=`, and note also `>`, `<`, `<=` etc. We can combine conditions too, using `&` (and) and `|` (or):

    tab prtvtaie if gndr==2 & wkhtot<60
    tab prtvtaie if gndr==2 & inrange(wkhtot,30,50)
    tab prtvtaie gndr if prtvtaie==1 | prtvtaie==2
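A small sketch on invented data shows how these conditions select cases, and one thing to watch for: Stata treats missing (`.`) as larger than any number, so a `>` condition also picks up missing values unless they are excluded. The variables `gndr` and `hours` below are toy stand-ins, not the lab data.

```stata
* Toy data (hypothetical): gender and weekly hours, one value missing
clear
input gndr hours
1 40
2 35
2 .
1 60
2 20
end

* in selects by position: the first three cases
list in 1/3

* if selects by condition: one woman works fewer than 30 hours
count if gndr==2 & hours<30

* Missing counts as "greater than 30", so this counts 2 cases
count if gndr==2 & hours>30

* Excluding missing explicitly leaves 1 case
count if gndr==2 & hours>30 & !missing(hours)

* inrange() never matches missing values: 2 cases, mean 37.5
summarize hours if inrange(hours, 30, 50)
```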

Using `if` and `summarize`, calculate the mean own work hours for men and women separately.

Stata commands can be kept in files, called do-files. It can be good practice to use do-files to organise your work. You can open a do-file from within Stata, and execute single commands or blocks of commands within it. It can be a useful strategy, when you have a particular task in hand, to build up the minimum set of commands necessary to go from your data file to the desired result. Note that you can copy commands from the Review window into the do-file editor.
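As an illustration, a do-file for the missing-values task above might contain no more than the following. The file name `missing.do` is a hypothetical choice; the commands are the ones already used in this lab.

```stata
* missing.do (hypothetical file name)
* Minimal path from the data file to the corrected summary of grsearn
clear
use http://teaching.sociology.ul.ie/so5041/labs/example6
recode grsearn -9=.
summarize grsearn
```

Saved under that name in the working directory, it can be run in one go with `do missing.do`, or line by line from the do-file editor.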

Brendan Halpin

Department of Sociology, University of Limerick

F1-002, x 3147; brendan.halpin@ul.ie