PC Labs for SO5041: Week 10

Week 10 Lab: Independent sample t-test, correlation

Hypothesis tests in R

Load the last lab's data into R as follows:

library(foreign)
data = read.csv("https://teaching.sociology.ul.ie/so5041/labs/hypotest.csv")

(This is an example of how to read a CSV file directly into R from the web.)

First, generate a new variable that is the difference between before and after:

data$diff <- data$After - data$Before

Then, use the t.test() function to compare this with zero:

t.test(data$diff)

Interpret the output. If you still have your calculations from last week, compare with your results.

Note that you can do a paired-sample t-test in one step with the following:

t.test(data$After, data$Before, paired=TRUE)
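To see why the two approaches are equivalent, here is a minimal sketch using made-up before/after scores (not the lab data). It computes the t statistic on the differences by hand and compares it with both forms of t.test(); the variable names are illustrative only.

```r
# Made-up before/after scores for eight cases (not the lab data)
before <- c(12, 15, 11, 18, 14, 16, 13, 17)
after  <- c(14, 15, 13, 21, 15, 18, 12, 20)
d <- after - before

# One-sample t statistic on the differences, computed by hand
n      <- length(d)
t_hand <- mean(d) / (sd(d) / sqrt(n))

# The same test two ways with t.test()
one_sample <- t.test(d)
paired     <- t.test(after, before, paired = TRUE)

# All three t statistics should agree
c(by_hand    = t_hand,
  one_sample = unname(one_sample$statistic),
  paired     = unname(paired$statistic))
```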

Hypothesis testing exam marks

Say that in last year's Leaving Cert English exam, the average mark achieved was 62.1%. A year later, the Dept of Education wants quick feedback on whether the standard has changed. A random sample of 100 scripts is assessed and marked. The average mark is 65.2%, with a standard deviation of 12.4%.

Conduct a test of the hypothesis that the standard has changed, using a 95% level of confidence. Report and interpret your findings.
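Once you have attempted this by hand, you can check your arithmetic in R. The following is one possible sketch, assuming a two-sided one-sample t-test computed from the summary figures in the question; the object names are illustrative.

```r
# Figures taken from the exercise text
mu0  <- 62.1   # last year's mean mark
xbar <- 65.2   # sample mean
s    <- 12.4   # sample standard deviation
n    <- 100    # sample size

se     <- s / sqrt(n)                                    # standard error of the mean
t_stat <- (xbar - mu0) / se                              # test statistic
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)               # two-sided p-value
ci     <- xbar + c(-1, 1) * qt(0.975, df = n - 1) * se   # 95% confidence interval

t_stat
p_val
ci
```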

Independent Sample t-test in R

Load the following file, which contains information on gender and work hours for a UK sample:

library(foreign)
week11a <- read.dta("https://teaching.sociology.ul.ie/so5041/week11a.dta")
week11a$ojbhrs <- replace(week11a$ojbhrs, which(week11a$ojbhrs<0), NA)

Note how the third line above recodes the missing-value codes on the ojbhrs variable (-9 to -1) to NA.
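The effect of this recoding can be seen on a small made-up vector (not the lab data): negative codes become NA and are then left out of summaries when na.rm = TRUE is used.

```r
# Made-up hours vector with negative missing-value codes
hrs <- c(40, -9, 35, -1, 20, -7)

# Replace every negative value with NA
hrs_clean <- replace(hrs, which(hrs < 0), NA)
hrs_clean

# Summaries must now be told to skip the NAs
mean(hrs_clean, na.rm = TRUE)
```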

Use summaries and graphs to get a sense of the male-female differences. E.g.,

summary(week11a[week11a$osex=="Male", ]$ojbhrs)

Then use an independent sample t-test, i.e., t.test(varname ~ groupvar, data = dataframe), replacing varname, groupvar and dataframe with the names of the appropriate variables and data frame.
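One way to run an independent sample t-test in R is the formula interface, outcome ~ group. The sketch below uses simulated data as a stand-in for the lab file; the data frame sim and its columns hours and sex are assumptions for illustration.

```r
set.seed(1)
# Simulated stand-in for the lab data: weekly hours for two groups
sim <- data.frame(
  hours = c(rnorm(200, mean = 40, sd = 10), rnorm(200, mean = 30, sd = 12)),
  sex   = factor(rep(c("Male", "Female"), each = 200))
)

# Group means, for a first look
tapply(sim$hours, sim$sex, mean)

# Independent sample t-test using the formula interface: outcome ~ group
result <- t.test(hours ~ sex, data = sim)
result
```

The printed output reports the mean in each group, the t statistic, the degrees of freedom, the p-value and a confidence interval for the difference in means.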

Comparing distributions

You can visually compare the two work-hours distributions with these commands.

First, side-by-side:

library(ggplot2)
ggplot(week11a, aes(x=ojbhrs, fill=osex)) +
    geom_histogram(position="identity", colour="grey40", alpha=0.2, bins = 10) +
    facet_grid(. ~ osex)

Second, overlaid:

ggplot(week11a, aes(x=ojbhrs, fill=osex)) +
    geom_histogram(color="black", alpha=0.33, position="identity") +
    ggtitle("Histogram of Hours Worked")

Different distribution shapes?

If the two groups have clearly different standard deviations, we should use the unequal-variance (Welch) version of the t-test (the standard deviation is the square root of the variance). In R, t.test() does this by default; you can make it explicit with t.test(varname ~ groupvar, data = dataframe, var.equal=FALSE), while var.equal=TRUE gives the equal-variance version. However, with this large sample, you will notice that it makes very little difference. Allowing for unequal variance usually matters only for small samples.
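The comparison can be sketched on simulated data with two groups of different spread (not the lab data; the names y and g are illustrative). It shows the group standard deviations and then both versions of the test.

```r
set.seed(2)
# Simulated two-group data with different spreads (not the lab data)
g <- factor(rep(c("A", "B"), each = 500))
y <- c(rnorm(500, mean = 40, sd = 8), rnorm(500, mean = 35, sd = 14))

# Standard deviation within each group
tapply(y, g, sd)

welch  <- t.test(y ~ g, var.equal = FALSE)  # unequal-variance (Welch) test
pooled <- t.test(y ~ g, var.equal = TRUE)   # equal-variance (pooled) test

# With equal group sizes the t statistics agree; the degrees of freedom differ
c(welch = unname(welch$statistic), pooled = unname(pooled$statistic))
c(welch = unname(welch$parameter), pooled = unname(pooled$parameter))
```

With samples this large, the change in degrees of freedom has almost no effect on the p-value, which is why the choice matters mainly in small samples.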

Testing proportions in R

Refer to the extract of the school-leavers' survey:

example6 <- read.dta("https://teaching.sociology.ul.ie/so5041/labs/example6.dta")

According to CSO population estimates, the proportion male in the relevant age group in the population was 51.6% when this survey was collected. Calculate the proportion male in the data, and construct a confidence interval around it. Is there any evidence that the sample was drawn from a population with a different proportion? In other words, is this sample consistent with (representative of) the contemporary population?

Do this by hand first, then get R to do the work. The prop.test() function is analogous to the t.test() function.

table(example6$sex)
prop.test(1104,  (1104+1093), p=0.516)
prop.test(table(example6$sex), p=0.516)

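Your results should be very close but not identical: by default prop.test() applies a continuity correction and reports a score-based (Wilson) confidence interval rather than the simple normal-approximation interval.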
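To check the by-hand calculation, the normal-approximation version can be sketched as follows. The counts are those used in the prop.test() call above; treating 1104 as the number of males is an assumption about the ordering of table(example6$sex).

```r
# Counts from the prop.test() call above; treating 1104 as the number of
# males is an assumption about the order of table(example6$sex)
males <- 1104
n     <- 1104 + 1093
p0    <- 0.516                      # CSO population proportion

p_hat <- males / n                  # sample proportion
se    <- sqrt(p_hat * (1 - p_hat) / n)
ci    <- p_hat + c(-1, 1) * qnorm(0.975) * se   # 95% CI, normal approximation

# z test of H0: p = p0, using the standard error under the null
z     <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_val <- 2 * pnorm(-abs(z))

p_hat
ci
z
p_val
```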

Comparing proportions across groups

With the same data, calculate the proportion unemployed or looking for a first job, and compare it by sex:

example6$UE <- example6$empstat=="unemployed" | example6$empstat=="looking for 1st job"
prop.table(table(example6$UE, example6$sex), margin=2)
library(epiDisplay)
tabpct(example6$UE, example6$sex, percent="col")

There is apparently a difference in the proportion unemployed between men and women. Use R to test this, and interpret the result.

prop.test(table(example6$sex, 1-example6$UE))

(Note: prop.test() tests the proportion in the first category of the outcome variable, so we need to give it 1-example6$UE instead of example6$UE, which has TRUE as its second value.)
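The ordering issue can be seen with a small made-up logical vector (not the lab data): table() puts FALSE before TRUE, while subtracting from 1 turns TRUE into 0, which then sorts first.

```r
# Made-up logical vector: TRUE means unemployed
ue <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

table(ue)       # categories ordered FALSE, TRUE
table(1 - ue)   # categories ordered 0 (was TRUE), 1 (was FALSE)
```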

Then run a chi-squared test on the table, and compare the inferences:

chisq.test(table(example6$sex, example6$UE))

That is, comparing a proportion over two groups is actually creating a 2-by-2 table, and the inferences from the prop.test() and the chisq.test() functions should be the same.
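This equivalence can be checked on a made-up 2-by-2 table (the counts below are invented for illustration, not taken from the survey). Both functions are run with their default settings.

```r
# Made-up 2-by-2 table: rows are groups, columns are outcome (yes/no)
tab <- matrix(c(45, 155,
                30, 170),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"), outcome = c("yes", "no")))

prop_res <- prop.test(tab)    # compares the proportion "yes" across groups
chi_res  <- chisq.test(tab)   # tests independence of group and outcome

# Both apply the continuity correction by default on a 2-by-2 table
c(prop.test = unname(prop_res$statistic), chisq.test = unname(chi_res$statistic))
c(prop.test = prop_res$p.value, chisq.test = chi_res$p.value)
```

The two test statistics and p-values should match; prop.test() additionally reports the two sample proportions and a confidence interval for their difference.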