PC Labs for SO5041: Week 8
Week 8 Lab: The chi-squared test
Spreadsheets and the chi-squared test
As we have seen, spreadsheets are very useful for manipulating data and presenting numbers. Today we use a spreadsheet to manipulate tables, and to calculate expected values and the chi-squared statistic.
Exercise 1: entering a table
The following table is available here as a spreadsheet. Fill in the missing row, column and grand totals using the =sum() function, marking the relevant cells with the mouse. For example, type in the first row, go to where you want the row total to be, enter =sum(, mark the row cells with the mouse, enter ) and press return.
| Male | Female | Total |
Employed | 388 | 380 | 768 |
Unemployed | 67 | 46 | 113 |
Looking for 1st job | 170 | 151 | 321 |
Student | 471 | 490 | 961 |
Other | 8 | 26 | 34 |
Total | 1104 | 1093 | 2197 |
Do not type in the totals (the numbers shown in bold in the original table)! Use formulas instead.
Another tip: rather than entering the same formula repeatedly, you can copy it – for instance, if you copy the row 1 total formula down one, it now totals row 2. Formulas by default use relative references to other cells. For instance, a formula in B1 referring to A1 is really referring to "one cell to the left", so if we copy it to C23 the new formula refers to B23.
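If you want to check your spreadsheet totals independently, the same sums can be sketched in a few lines of Python. This is only a cross-check, not part of the spreadsheet exercise, and the variable names are illustrative.

```python
# Observed counts from the table above: (male, female) for each row.
observed = {
    "Employed": (388, 380),
    "Unemployed": (67, 46),
    "Looking for 1st job": (170, 151),
    "Student": (471, 490),
    "Other": (8, 26),
}

# Row totals: the equivalent of =sum() across each row.
row_totals = {label: sum(counts) for label, counts in observed.items()}

# Column totals: the equivalent of =sum() down each column.
col_totals = [sum(col) for col in zip(*observed.values())]

# Grand total: the row totals and the column totals must add up to the same number.
grand_total = sum(col_totals)

for label, total in row_totals.items():
    print(f"{label:<20} {total:>5}")
print(f"{'Total':<20} {col_totals[0]:>5} {col_totals[1]:>5} {grand_total:>5}")
```

If the sum of the row totals and the sum of the column totals disagree, one of the formulas is covering the wrong range.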
Exercise 2: Calculating percentages
Copy the entire table to a nearby location and delete the numbers in the body. Calculate the row proportion (percentage) in each cell by using a formula that divides the corresponding cell in the original table by the row total. If you want to copy this formula from the "male" column to the "female" one, you need to "de-relativise" it a bit because the position of the row total does not move. To do this just put a $ in the reference to the row total: for instance, =C4/$E4. If you copy this one cell right it will become =D4/$E4.
To get percentages, you can either multiply by 100, or set the format (Format -> Cells -> Percentage).
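The row-percentage calculation can also be sketched in Python as a check on your spreadsheet. The division by the row total plays the role of the fixed $E4 reference; the names used here are illustrative.

```python
# Observed counts from the Exercise 1 table: (male, female) for each row.
observed = {
    "Employed": (388, 380),
    "Unemployed": (67, 46),
    "Looking for 1st job": (170, 151),
    "Student": (471, 490),
    "Other": (8, 26),
}

# For each row, divide every cell by that row's total and scale to a percentage.
row_pct = {}
for label, counts in observed.items():
    row_total = sum(counts)  # stays fixed across the row, like $E4
    row_pct[label] = tuple(100 * c / row_total for c in counts)

for label, (male, female) in row_pct.items():
    print(f"{label:<20} {male:5.1f}% {female:5.1f}%")
```

Each row of percentages should add up to 100, which is a quick way to spot a formula that points at the wrong total.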
Exercise 3: Expected values
If there is no association between two variables, we would expect the row percentages in each row to follow more or less the same pattern as the percentages across the column totals (the same is true for column percentages). We can therefore calculate the expected values from the column and row totals. One way is, for each cell, to multiply the column total's share of the grand total by the row total. Another way is to multiply the row total by the column total and divide by the grand total.
Copy the table again (for simplicity, perhaps key in the totals). For each cell replace the observed value with a formula for the expected value. Can you see big differences between the observed and expected values?
Calculate percentages based on the expected values, and verify that they are the same column by column, row by row.
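As a cross-check on the spreadsheet, here is a minimal Python sketch of both ways of computing expected values described above. It assumes the observed counts from Exercise 1; the variable names are illustrative.

```python
# Observed counts from the Exercise 1 table: (male, female) for each row.
observed = {
    "Employed": (388, 380),
    "Unemployed": (67, 46),
    "Looking for 1st job": (170, 151),
    "Student": (471, 490),
    "Other": (8, 26),
}

row_totals = {label: sum(counts) for label, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(col_totals)

# Method 1: row total times column total, divided by the grand total.
expected = {
    label: tuple(row_totals[label] * ct / grand_total for ct in col_totals)
    for label in observed
}

# Method 2: column total's share of the grand total, times the row total.
col_share = [ct / grand_total for ct in col_totals]
expected_alt = {
    label: tuple(share * row_totals[label] for share in col_share)
    for label in observed
}

for label in observed:
    obs = observed[label]
    exp = expected[label]
    print(f"{label:<20} observed {obs[0]:>4} {obs[1]:>4}   "
          f"expected {exp[0]:7.2f} {exp[1]:7.2f}")
```

The two methods are algebraically the same, so they should agree to rounding error. The row percentages of the expected table are identical in every row because each row is just the column shares scaled by a different row total.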
Exercise 4: The chi-squared test
A standard test for association in tables is called the chi-squared test. This involves calculating for each cell the quantity (O-E)²/E, where O means observed value and E means expected value. When added up for the entire table this quantity is called the chi-squared statistic: the larger the individual differences between the observed and expected values, the larger it will be. This statistic has a known distribution. For instance, for a table this size (5 rows by 2 columns, giving 4 degrees of freedom), there is only a 5% chance that sampling variability would result in a chi-squared statistic of 9.488 or greater when there is really no association in the population. Thus if the chi-squared statistic is less than 9.488 we consider the table consistent with there being no association in the population, but if it is greater than 9.488 we consider it evidence that there probably is an association in the population.
Calculate the chi-squared statistic and assess whether there seems to be association or not.
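The calculation can be sketched in Python to check your spreadsheet result. This is a minimal sketch: it recomputes the expected values from the totals, sums (O-E)²/E over all cells, and compares the result with the 9.488 threshold quoted above. Variable names are illustrative.

```python
# Observed counts from the Exercise 1 table: (male, female) for each row.
observed = {
    "Employed": (388, 380),
    "Unemployed": (67, 46),
    "Looking for 1st job": (170, 151),
    "Student": (471, 490),
    "Other": (8, 26),
}

row_totals = {label: sum(counts) for label, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(col_totals)

# Sum (O - E)^2 / E over every cell in the body of the table.
chi2 = 0.0
for label, counts in observed.items():
    for o, col_total in zip(counts, col_totals):
        e = row_totals[label] * col_total / grand_total  # expected value
        chi2 += (o - e) ** 2 / e

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(col_totals) - 1)

CRITICAL_5PCT = 9.488  # 5% critical value for 4 degrees of freedom, as given in the text
print(f"chi-squared = {chi2:.3f} with {df} degrees of freedom")
print("evidence of association" if chi2 > CRITICAL_5PCT else "consistent with no association")
```

Printing the per-cell contributions instead of only the sum shows which rows drive the statistic, which is useful when interpreting the result.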
Chi-squared test in Stata
The table above is drawn from the School Leavers' Survey, a subset of which is available here (or, in Stata, type the command use http://teaching.sociology.ul.ie/so5041/labs/example6). Download and load into Stata; recode the empstat variable to match the table and re-create the crosstab, with the chi-squared statistic (hint: the option chi after a comma). Compare Stata's value to the one you calculated by hand.
T-table, Normal distribution and hypothesis testing
Using the tables, find the critical value (z-score or t-score) for the 90%, 95% and 99% confidence levels for:
- A large sample (use normal distribution)
- A sample of 50 (use t-distribution)
- A sample of 25 (use t-distribution)
- A sample of 4 (use t-distribution)
Make a table of your results – what patterns do you see?
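If you want to check the values you read from the printed tables, the following Python sketch computes them numerically using only the standard library. It assumes two-sided levels and n - 1 degrees of freedom for the t-distribution. The helper functions t_pdf, t_cdf and t_critical are written here for illustration (numerical integration plus bisection); they are not library functions.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, integrating the density on [0, x] with Simpson's rule."""
    if x == 0:
        return 0.5
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(level, df):
    """Two-sided critical value, found by bisection on the CDF."""
    target = 0.5 + level / 2  # e.g. 0.975 for the 95% level
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def z_critical(level):
    """Two-sided critical value from the standard normal distribution."""
    return NormalDist().inv_cdf(0.5 + level / 2)


levels = (0.90, 0.95, 0.99)
print("large sample (normal)", [round(z_critical(lv), 3) for lv in levels])
for n in (50, 25, 4):
    df = n - 1  # degrees of freedom for a single-sample mean
    print(f"n = {n:>2} (t, df = {df:>2})", [round(t_critical(lv, df), 3) for lv in levels])
```

Printed tables often list only selected degrees of freedom (for example 40, 50, 60), so a table lookup for a sample of 50 may differ from the computed value in the third decimal place.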
Suppose you have been given information from a sample: mean age is 45 years, standard deviation 14 and sample size 50. Make a 95% confidence interval around the mean age, first using the normal distribution. Then repeat the exercise using the t-distribution. How do the two confidence intervals differ and which should you use?
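A minimal Python sketch of the two confidence intervals, for checking your hand calculation. It assumes the standard error is the standard deviation divided by the square root of the sample size, and it takes the t critical value 2.010 (49 degrees of freedom, 95% two-sided) from a standard t-table rather than computing it.

```python
import math
from statistics import NormalDist

mean, sd, n = 45.0, 14.0, 50  # sample information given in the text

# Standard error of the mean.
se = sd / math.sqrt(n)

# Critical values for a 95% two-sided interval.
z_crit = NormalDist().inv_cdf(0.975)  # normal distribution, about 1.96
t_crit = 2.010  # assumed table value for df = n - 1 = 49

normal_ci = (mean - z_crit * se, mean + z_crit * se)
t_ci = (mean - t_crit * se, mean + t_crit * se)

print(f"standard error: {se:.3f}")
print(f"normal 95% CI:  {normal_ci[0]:.2f} to {normal_ci[1]:.2f}")
print(f"t 95% CI:       {t_ci[0]:.2f} to {t_ci[1]:.2f}")
```

The t-based interval is slightly wider because the t critical value is larger than the normal one; the gap shrinks as the sample size grows.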
Suppose someone has claimed that the average age in this population is 40 years – what does your data say about this claim?
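One way to assess the claim is a one-sample t-test of the hypothesis that the population mean is 40. The Python sketch below assumes a two-sided test at the 5% level and again takes the critical value 2.010 (49 degrees of freedom) from a standard t-table.

```python
import math

mean, sd, n = 45.0, 14.0, 50  # sample information given in the text
claimed_mean = 40.0           # the value claimed for the population

# Test statistic: how many standard errors the sample mean is from the claim.
se = sd / math.sqrt(n)
t_stat = (mean - claimed_mean) / se

t_crit = 2.010  # assumed table value for df = 49, 5% two-sided

reject = abs(t_stat) > t_crit
print(f"t statistic: {t_stat:.3f}, critical value: {t_crit}")
print("reject the claim at the 5% level" if reject
      else "data are consistent with the claim")
```

This gives the same answer as checking whether 40 lies inside the 95% confidence interval built with the t-distribution.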