PC Labs for SO5041: Week 7

Week 7 Lab: The Chi-squared test

Spreadsheets and the Chi-squared test

As we have seen, spreadsheets are very useful for manipulating data and presenting numbers. Today we use a spreadsheet to manipulate tables, and to calculate expected values and the chi-squared statistic.

Exercise 1: entering a table

The following table is available here as a spreadsheet. Fill in the missing row, column and grand totals, using the =sum() function, marking the relevant cells with the mouse. For example, type in the first row, go to where you want the row total to be, enter =sum(, mark the row cells with the mouse, enter ) and press return.

                       Male   Female   Total
Employed                388      380     768
Unemployed               67       46     113
Looking for 1st job     170      151     321
Student                 471      490     961
Other                     8       26      34
Total                  1104     1093    2197

Do not type in the totals (the numbers in the Total row and the Total column)! Use formulas instead.

Another tip: rather than entering the same formula repeatedly, you can copy it – for instance, if you copy the row 1 total formula down one, it now totals row 2. Formulas by default use relative references to other cells. For instance, a formula in B1 referring to A1 is really referring to "one cell to the left", so if we copy it to C23 the new formula refers to B23.
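
If you want to double-check your spreadsheet totals, the same sums can be computed outside the spreadsheet. The following is only a sketch in Python for checking your work, not part of the lab itself; the names observed, row_totals, col_totals and grand_total are made up for illustration.

```python
# Observed counts from the table: (Male, Female) for each row.
observed = {
    "Employed": (388, 380),
    "Unemployed": (67, 46),
    "Looking for 1st job": (170, 151),
    "Student": (471, 490),
    "Other": (8, 26),
}

# Row totals: the equivalent of =sum() across each row.
row_totals = {label: sum(counts) for label, counts in observed.items()}

# Column totals: the equivalent of =sum() down each column.
col_totals = [sum(column) for column in zip(*observed.values())]

# Grand total: the sum of all row totals.
grand_total = sum(row_totals.values())

for label, total in row_totals.items():
    print(f"{label:<22}{total:>6}")
print(f"{'Column totals':<22}{col_totals}")
print(f"{'Grand total':<22}{grand_total:>6}")
```

The values printed should agree with the Total row and column you built with formulas.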

Exercise 2: Calculating percentages

Copy the entire table to a nearby location and delete the numbers in the body. Calculate the row proportion (percentage) in each cell by using a formula that divides the corresponding cell in the original table by the row total. If you want to copy this formula from the "male" column to the "female" one, you need to "de-relativise" it a bit because the position of the row total does not move. To do this just put a $ in the reference to the row total: for instance, =C4/$E4. If you copy this one cell right it will become =D4/$E4.

To get percentages, you can either multiply by 100, or set the format (Format -> Cells -> Percentage).
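
As a cross-check on the spreadsheet formulas, the row percentages can be sketched in Python as follows. This assumes the same table of observed counts; the names observed and row_pct are illustrative only. Dividing each cell by its own row total plays the role of the $-anchored reference to the row total.

```python
# Observed counts from the table: (Male, Female) for each row.
observed = {
    "Employed": (388, 380),
    "Unemployed": (67, 46),
    "Looking for 1st job": (170, 151),
    "Student": (471, 490),
    "Other": (8, 26),
}

# Row percentage = cell count / row total * 100.
row_pct = {}
for label, counts in observed.items():
    row_total = sum(counts)
    row_pct[label] = [100 * count / row_total for count in counts]

for label, (male, female) in row_pct.items():
    print(f"{label:<22}{male:>7.2f}%{female:>8.2f}%")
```

Each row's percentages should add up to 100.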

Exercise 3: Expected values

If there is no association between two variables, we would expect the row percentages in each row to follow more or less the same pattern as the percentages across the column totals (the same is true for column percentages and the row totals). We can therefore calculate the expected values from the column and row totals. One way is to multiply, for each cell, the column total's share of the grand total by the row total. Another way is to multiply the row total by the column total and divide by the grand total.

Copy the table again (for simplicity, perhaps key in the totals). For each cell replace the observed value with a formula for the expected value. Can you see big differences between the observed and expected values?

Calculate percentages based on the expected values, and verify that the row percentages are identical in every row and the column percentages identical in every column.
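
The second method described above (row total times column total, divided by the grand total) can be sketched in Python to check your spreadsheet. The names observed and expected are made up for this illustration.

```python
# Observed counts from the table: (Male, Female) for each row.
observed = {
    "Employed": (388, 380),
    "Unemployed": (67, 46),
    "Looking for 1st job": (170, 151),
    "Student": (471, 490),
    "Other": (8, 26),
}

row_totals = {label: sum(counts) for label, counts in observed.items()}
col_totals = [sum(column) for column in zip(*observed.values())]
grand_total = sum(col_totals)

# Expected count under no association:
# row total * column total / grand total.
expected = {
    label: [row_totals[label] * col_total / grand_total for col_total in col_totals]
    for label in observed
}

for label, values in expected.items():
    print(f"{label:<22}" + "".join(f"{value:>10.2f}" for value in values))
```

The expected values in each row add up to the observed row total, and likewise for each column, which is a useful check on your formulas.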

Exercise 4: The Chi-squared test

A standard test for association in tables is the chi-squared test. For each cell we calculate the quantity (O-E)^2/E, where O is the observed value and E is the expected value. Added up over the entire table, this quantity is called the chi-squared statistic: the larger the individual differences between the observed and expected values, the larger it will be. This statistic has a known distribution. For instance, for a table this size (5 rows by 2 columns, so 4 degrees of freedom), there is only a 5% chance that sampling variability would produce a chi-squared statistic of 9.488 or greater if there were really no association in the population. Thus if the chi-squared statistic is less than 9.488 we consider the table consistent with there being no association in the population, but if it is greater than 9.488 we consider it evidence that there probably is an association in the population.

Calculate the chi-squared statistic and assess whether there seems to be association or not.
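
To check the spreadsheet result, the whole calculation can be sketched in Python: expected values from the totals, then (O-E)^2/E summed over all cells. The names below (observed, chi2, df, critical_5pct) are illustrative; the critical value 9.488 is the one quoted above.

```python
# Observed counts from the table: (Male, Female) for each row.
observed = {
    "Employed": (388, 380),
    "Unemployed": (67, 46),
    "Looking for 1st job": (170, 151),
    "Student": (471, 490),
    "Other": (8, 26),
}

row_totals = {label: sum(counts) for label, counts in observed.items()}
col_totals = [sum(column) for column in zip(*observed.values())]
grand_total = sum(col_totals)

# Sum (O - E)^2 / E over every cell in the body of the table.
chi2 = 0.0
for label, counts in observed.items():
    for j, o in enumerate(counts):
        e = row_totals[label] * col_totals[j] / grand_total
        chi2 += (o - e) ** 2 / e

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(col_totals) - 1)

critical_5pct = 9.488  # 5% critical value for 4 degrees of freedom, from the text
print(f"chi-squared = {chi2:.3f} on {df} degrees of freedom")
print("exceeds 5% critical value:", chi2 > critical_5pct)
```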

Chi-squared test in Stata

The table above is drawn from the School Leavers' Survey, a subset of which is available here (or load it directly in Stata with the command use https://teaching.sociology.ul.ie/so5041/labs/example6). Download the data and load it into Stata; recode the empstat variable to match the table and re-create the crosstab, with the chi-squared statistic (hint: the option chi after a comma). Compare Stata's value with the one you calculated by hand.

T-table, Normal distribution and hypothesis testing

Using the tables, find the critical value (z-score or t-score) for the 90%, 95% and 99% confidence levels for:

  1. A large sample (use normal distribution)
  2. A sample of 50 (use t-distribution)
  3. A sample of 25 (use t-distribution)
  4. A sample of 4 (use t-distribution)

Make a table of your results – what patterns do you see?
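
For the large-sample case, the normal critical values can also be computed rather than read from a table. The sketch below uses NormalDist from Python's standard statistics module; the names levels and z_crit are illustrative. The standard library has no t-distribution, so the t values still need the tables (or a statistics package).

```python
from statistics import NormalDist

# Two-sided confidence levels asked for in the exercise.
levels = [0.90, 0.95, 0.99]

# For a two-sided interval at a given level, the critical value cuts off
# (1 - level) / 2 in the upper tail of the standard normal distribution.
z_crit = {level: NormalDist().inv_cdf(1 - (1 - level) / 2) for level in levels}

for level, z in z_crit.items():
    print(f"{level:.0%} level: z = {z:.3f}")
```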

Suppose you have been given information from a sample: mean age is 45 years, standard deviation 14 and sample size 50. Make a 95% confidence interval around the mean age, first using the normal distribution. Then repeat the exercise using the t-distribution. How do the two confidence intervals differ and which should you use?
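
The two intervals can be sketched in Python as mean plus or minus critical value times standard error, where the standard error is the standard deviation divided by the square root of the sample size. Note the assumption: the t critical value of 2.010 is a table look-up for 49 degrees of freedom (your printed table may only list nearby values such as 40 or 50), and all variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mean, sd, n = 45, 14, 50

# Standard error of the mean.
se = sd / sqrt(n)

# 95% critical value from the normal distribution.
z_crit = NormalDist().inv_cdf(0.975)

# ASSUMPTION: 95% critical value for t with n - 1 = 49 degrees of freedom,
# taken from a printed t-table rather than computed.
t_crit = 2.010

normal_ci = (mean - z_crit * se, mean + z_crit * se)
t_ci = (mean - t_crit * se, mean + t_crit * se)

print(f"standard error: {se:.3f}")
print(f"normal 95% CI:  {normal_ci[0]:.2f} to {normal_ci[1]:.2f}")
print(f"t 95% CI:       {t_ci[0]:.2f} to {t_ci[1]:.2f}")
```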

Suppose someone has claimed that the average age in this population is 40 years – what does your data say about this claim?
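
One way to approach this is to compute how many standard errors the sample mean lies from the claimed value and compare that with the critical value. The sketch below assumes a two-sided test at the 5% level and again uses 2.010 as a table look-up for 49 degrees of freedom; the variable names are made up for illustration.

```python
from math import sqrt

mean, sd, n = 45, 14, 50
claimed_mean = 40

# Standard error of the mean.
se = sd / sqrt(n)

# Test statistic: distance between sample mean and claimed mean,
# measured in standard errors.
t_stat = (mean - claimed_mean) / se

# ASSUMPTION: two-sided 5% critical value for 49 degrees of freedom,
# taken from a printed t-table rather than computed.
t_crit = 2.010

print(f"t statistic: {t_stat:.3f}")
print("beyond the 5% critical value:", abs(t_stat) > t_crit)
```

Equivalently, you can check whether 40 falls inside or outside your 95% confidence interval.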