PC Labs for SO5041: Week 4

Week 4 Lab: Modifying Data
- Editing data
  - Creating new variables
  - Recoding variables
- Selecting and Excluding Cases
  - Missing values
  - Selecting cases: if and in

Editing data

We have seen how to enter data, and how to load data from existing files. It is also possible to change variables and to create new variables.

Creating new variables

We can create new variables in a dataframe or tibble.

Load the data file week4.dta (this is an extract from the 2016 European Social Survey for Ireland):

library(foreign)

week4 <- read.dta("https://teaching.sociology.ul.ie/so5041/week4.dta")

To create new variables in a dataframe, df, we can simply declare them:

df$newvar <- df$oldvar + 1
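As a minimal sketch of this pattern, the following uses a small made-up data frame (toy, with the hypothetical columns hours and partner_hours; these are not the ESS variables). It also illustrates that arithmetic involving a missing value gives a missing result.

```r
# Toy data frame; the column names and values are invented for illustration
toy <- data.frame(hours         = c(40, 35, NA, 20),
                  partner_hours = c(38, 40, 30, NA))

# Declare a new column as a function of existing ones
toy$hours_diff <- toy$hours - toy$partner_hours

toy$hours_diff            # NA wherever either input is NA
summary(toy$hours_diff)   # summary() reports the number of NAs separately
```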

Make a new variable in the week4 data frame that is the difference between the respondent's work hours and his/her partner's work hours (wkhtot and wkhtotp).

Summarise the variable. Hint: use the commands summary() and hist().

Compare the distribution of the difference by the respondent's gender (variable gndr), using the aggregate() function described in Lab 2. The example below applies it to the respondent's own hours, wkhtot; adapt it to your new variable.

aggregate(wkhtot ~ gndr, week4, mean)
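For reference, here is a small sketch of how the formula interface of aggregate() behaves, using an invented data frame (scores and its columns are hypothetical). By default the formula method drops rows with a missing value in any variable named in the formula, so no na.rm argument is needed.

```r
# Invented data: one missing score in group "b"
scores <- data.frame(group = c("a", "a", "b", "b", "b"),
                     score = c(10, 20, 5, NA, 15))

# Mean of score within each level of group; the row with NA is dropped
means <- aggregate(score ~ group, data = scores, FUN = mean)
means
```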

Recoding variables

We can also recode variables to give them more convenient values. For instance, if we have age in years (agea in the Week 4 data) we could recode it into groups as follows:

week4$agegroup <- cut(week4$agea, c(0,15,20,35,50,999))
table(week4$agegroup)

The cut() function creates labels automatically: (20,35] means greater than 20, up to and including 35.
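The boundary behaviour can be checked on a few invented ages, as in the sketch below. The labels argument in the second call is optional, and the label text shown is only one possible choice (it assumes ages are whole numbers).

```r
ages <- c(15, 16, 20, 21, 35, 36)   # invented values on either side of the breaks

# Default: intervals are open on the left and closed on the right
default_groups <- cut(ages, c(0, 15, 20, 35, 50, 999))
as.character(default_groups)

# Optional: supply your own labels, one per interval
labelled <- cut(ages, c(0, 15, 20, 35, 50, 999),
                labels = c("0-15", "16-20", "21-35", "36-50", "51+"))
table(labelled)
```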

Sometimes, categorical variables have several small categories, which we might like to lump together. Do table(week4$prtvtbie) to see an example of this using voting behaviour. We may want to group all the small parties together for simplicity. This variable is a factor, consisting of (more or less hidden) integer values with attached labels (see https://datacarpentry.github.io/r-socialsci/02-starting-with-data.html#factors). We can see the structure of a factor like this:

> levels(week4$prtvtbie)
 [1] "Anti-Austerity Alliance - People Before Profit"
 [2] "Fianna Fáil"                                   
 [3] "Fine Gael"                                     
 [4] "Green Party"                                   
 [5] "Independents"                                  
 [6] "Labour"                                        
 [7] "Sinn Féin"                                     
 [8] "Social Democrats"                              
 [9] "Socialist Party - United Left Alliance"        
[10] "Other"                                         
[11] "Not applicable"                                
[12] "Refusal"                                       
[13] "Don't know"                                    
[14] "No answer"

Categories 8, 9 and 10 are very small and could usefully be lumped together. Also the missing values could be condensed. For factors, we need to use the special NA_character_ missing value code:

# recode() here is the dplyr function; the dplyr:: prefix makes sure that
# version is used even if another package defines its own recode()
week4$prtvtbie_recode <- dplyr::recode(week4$prtvtbie,
                                       "Social Democrats" = "Other",
                                       "Socialist Party - United Left Alliance" = "Other",
                                       "Not applicable" = NA_character_,
                                       "Refusal" = NA_character_,
                                       "Don't know" = NA_character_,
                                       "No answer" = NA_character_)
library(epiDisplay)
tab1(week4$prtvtbie_recode)

A more compact way to do the recode, using case_when() from dplyr:

week4$prtvtbie_recode <- dplyr::case_when(
  week4$prtvtbie %in% c("Social Democrats",
                        "Socialist Party - United Left Alliance") ~ "Other",
  week4$prtvtbie %in% c("Not applicable", "Refusal",
                        "Don't know", "No answer") ~ NA_character_,
  TRUE ~ as.character(week4$prtvtbie))
# case_when() returns text here, so convert the result back to a factor
week4$prtvtbie_recode <- factor(week4$prtvtbie_recode)

Note that if you assign the recoded values back to week4$prtvtbie you will lose the original values, so it is often a good idea to recode to a new variable. The TRUE ~ as.character(week4$prtvtbie) part of the recode means to copy all values not explicitly mentioned.
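For comparison, the same kind of recode can be sketched in base R by editing the factor's levels directly. The factor below is invented for illustration (party and its values are hypothetical stand-ins, not the ESS variable). Assigning the same label to several levels merges them, and assigning NA to a level turns its cases into missing values.

```r
# Invented factor standing in for a party-choice variable
party <- factor(c("Labour", "Other", "Social Democrats", "Refusal",
                  "Labour", "No answer"))
as.integer(party)   # the underlying integer codes
levels(party)       # the labels attached to those codes

party_recode <- party   # work on a copy so the original is kept

# Merge a small category into "Other"
levels(party_recode)[levels(party_recode) %in% c("Social Democrats")] <- "Other"

# Turn the non-response categories into missing values
levels(party_recode)[levels(party_recode) %in% c("Refusal", "No answer")] <- NA

table(party_recode, useNA = "ifany")
```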

Selecting and Excluding Cases

Sometimes we need to exclude certain cases from consideration. One example is "missing values": where a variable has a value that is not useful (respondent refused, didn't know, made no sense) we can declare this as missing and it is not used in analysis. Other times we may simply want to exclude certain cases, for instance to look at the income distribution for women only, or to calculate the mean earnings for people who have earnings greater than zero (i.e. there are many people not working whose earnings are exactly zero: this is a meaningful value but we may often wish to ignore it).

Missing values

Missing values are usually coded as numbers which will not occur in reality. Thus for instance, if people refuse to give their age, their response may be coded 999 or -9: these are impossible values for age so there is no confusion. For income, 999 would not do because it could be a real value. When missing values are coded like this it is very important that R knows they are not meaningful values, so they must be converted to NA before analysis.

The week4raw file is a version of week4 in which the missing-value codes have not been converted to NA.

week4raw <- read.dta("https://teaching.sociology.ul.ie/so5041/week4raw.dta")

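# Inspect the raw variable and look for implausible values such as 999
summary(week4raw$agea)
table(week4raw$agea)

# Replace the missing-value code 999 with NA, then take the mean of the valid ages
tempvar <- ifelse(week4raw$agea == 999, NA, week4raw$agea)
mean(tempvar, na.rm=TRUE)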
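When several codes stand for missing, a sketch along the following lines may be more convenient than nested ifelse() calls. The vector and the codes (777, 888, 999) are assumptions made for illustration; check the actual codes in your data with table() first.

```r
raw_age <- c(23, 45, 999, 31, 888, 67, 777)   # invented values
missing_codes <- c(777, 888, 999)             # assumed codes, not taken from the ESS documentation

clean_age <- raw_age                          # keep the raw vector untouched
clean_age[clean_age %in% missing_codes] <- NA

mean(raw_age)                   # distorted by the missing-value codes
mean(clean_age, na.rm = TRUE)   # mean of the valid ages only
```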

Selecting cases: `if` and `in`

R has many ways of selecting subsets of a dataframe.

Base R:

newdf <- week4[week4$gndr=="Female", ]
table(newdf$gndr)
summary(newdf$wkhtot)

Using the dplyr package (part of the tidyverse world):

library(dplyr)
newdf <- filter(week4, gndr=="Female")
table(newdf$gndr)
summary(newdf$wkhtot)
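Selection methods can differ when the selection variable itself has missing values. In the sketch below (invented data; people and its columns are hypothetical), logical indexing with [ keeps a row of NAs for each case where the condition is NA, whereas base R's subset() drops those cases, as does dplyr's filter(). Wrapping the condition in which() makes [ drop them too.

```r
# Invented data with one missing value in the selection variable
people <- data.frame(sex   = c("Female", "Male", NA, "Female"),
                     hours = c(30, 40, 25, 35))

by_bracket <- people[people$sex == "Female", ]          # includes an all-NA row
by_which   <- people[which(people$sex == "Female"), ]   # cases with an NA condition dropped
by_subset  <- subset(people, sex == "Female")           # cases with an NA condition dropped

nrow(by_bracket)
nrow(by_which)
nrow(by_subset)
```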

Exercise: this time using the week4raw dataframe, calculate the difference between own hours (wkhtot) and partner's hours (wkhtotp), taking account of missing values (hint: very high values are used for missing; use table(week4raw$wkhtot) to see them). Calculate the overall mean, then the mean for men and women separately.