Tables and formulas are available here. These will be available during the exam. For today's purposes they include the tables of the chi-squared distribution and of Student's t distribution. Some of the formulas will be covered in subsequent lectures.
As we have seen, spreadsheets are very useful for manipulating data and presenting numbers. Today we use a spreadsheet to manipulate tables, and to calculate expected values and the chi-squared statistic.
The following table is available here as
Fill in the missing row, column and grand totals, using the =sum() function, marking the relevant cells with the mouse. For example, type in the first row, go to where you want the row total to be, enter =sum(, mark the row cells with the mouse, enter ) and press Enter.
|Looking for 1st job||170||151||321|
Another tip: rather than entering the same formula repeatedly, you can copy it -- for instance, if you copy the row 1 total formula down one, it now totals row 2. Formulas by default use relative references to other cells. For instance, a formula in B1 referring to A1 is really referring to "one cell to the left", so if we copy it to C23 the new formula refers to B23.
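If you would like to cross-check your spreadsheet totals outside the spreadsheet, the following Python sketch computes row, column and grand totals in the same way. The counts are made up for illustration (they are not the survey table), and the variable names are just assumptions chosen for this example.

```python
# Hypothetical counts for illustration only (NOT the School Leavers' Survey data).
# Each inner list is one row of the table; the two columns stand for male and female.
observed = [
    [120, 80],
    [60, 90],
    [50, 50],
    [30, 70],
    [40, 10],
]

# Row totals: the equivalent of =sum() across each row.
row_totals = [sum(row) for row in observed]

# Column totals: zip(*observed) regroups the cells column by column.
col_totals = [sum(col) for col in zip(*observed)]

# Grand total: the sum of the row totals (equal to the sum of the column totals).
grand_total = sum(row_totals)

print("Row totals:   ", row_totals)
print("Column totals:", col_totals)
print("Grand total:  ", grand_total)
```

A useful check in either tool: the row totals and the column totals should add up to the same grand total.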
Copy the entire table to a nearby location and delete the
numbers in the body. Calculate the row proportion (percentage) in each
cell by using a formula that divides the corresponding cell in the
original table by the row total. If you want to copy this formula
from the "male" column to the "female" one,
you need to "de-relativise" it a bit because the position of the
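row total does not move. To do this just put a $ in front of the column letter in the reference to the row total: for instance, =B2/$D2 (assuming, purely for illustration, that the count is in B2 and the row total in D2). If you copy this one cell right it will become =C2/$D2.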
To get percentages, you can either multiply by 100, or set the cell format to percentage (Format -> Cells -> Percentage).
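The row-percentage step can be sketched in Python as a cross-check. Again the counts are hypothetical, not the survey data; the point is only that every cell in a row is divided by that same row's total, which is what fixing the row-total reference achieves in the spreadsheet.

```python
# Hypothetical counts for illustration only (NOT the School Leavers' Survey data).
observed = [
    [120, 80],
    [60, 90],
    [50, 50],
    [30, 70],
    [40, 10],
]

row_percentages = []
for row in observed:
    # The row total stays the same for every cell in the row,
    # like a spreadsheet reference whose column has been fixed.
    row_total = sum(row)
    row_percentages.append([100 * cell / row_total for cell in row])

# Print one line per row, one decimal place per percentage.
for row in row_percentages:
    print("  ".join(f"{p:5.1f}%" for p in row))
```

Each printed row should add up to 100%.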
If there is no association between two variables, we would expect the row percentages in each row to follow more or less the same pattern as the percentages in the column totals (and likewise the column percentages to follow the pattern in the row totals). We can therefore calculate the expected values from the row and column totals. One way is, for each cell, to multiply the column total's share of the grand total by the row total. Another way, which gives the same result, is to multiply the row total by the column total and divide by the grand total.
Copy the table again (for simplicity, perhaps key in the totals). For each cell replace the observed value with a formula for the expected value. Can you see big differences between the observed and expected values?
Calculate percentages based on the expected values, and verify that they are the same column by column, row by row.
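Both ways of calculating expected values described above can be sketched in Python, which also makes it easy to confirm that they agree. The counts are hypothetical (not the survey table) and the variable names are assumptions made for this example.

```python
# Hypothetical counts for illustration only (NOT the School Leavers' Survey data).
observed = [
    [120, 80],
    [60, 90],
    [50, 50],
    [30, 70],
    [40, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Method 1: the column total's share of the grand total, times the row total.
expected_by_share = [
    [(c / grand_total) * r for c in col_totals] for r in row_totals
]

# Method 2: row total times column total, divided by the grand total.
expected = [
    [r * c / grand_total for c in col_totals] for r in row_totals
]

# Row percentages of the expected values: identical in every row.
expected_row_pct = [
    [100 * e / r for e in row] for row, r in zip(expected, row_totals)
]

for obs_row, exp_row in zip(observed, expected):
    print("observed", obs_row, "expected", exp_row)
print("expected row percentages:", expected_row_pct)
```

With these made-up counts the two column totals are equal, so every expected value is simply half its row total; with real data the expected row percentages will instead match the column-total percentages, whatever they are.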
A standard test for association in tables is called the chi-squared test. This involves calculating for each cell the quantity (O-E)²/E, where O means observed value and E means expected value. When added up for the entire table this quantity is called the chi-squared statistic: the larger the individual differences between the observed and expected values, the larger this will be. This statistic has a known distribution. For instance, for a table this size (4 degrees of freedom), there is only a 5% chance that sampling variability would result in a chi-squared statistic of 9.488 or greater when there was really no association in the population. Thus if the chi-squared statistic is less than 9.488 we consider the table as being consistent with there being no association in the population, but if it is greater than 9.488 we consider it evidence that there probably is association in the population.
Calculate the chi-squared statistic and assess whether there seems to be association or not.
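As a cross-check on the hand calculation, here is a minimal Python sketch of the chi-squared statistic, summing (O-E)²/E over all cells and comparing the result with the 5% critical value of 9.488 quoted above. The counts are hypothetical (not the survey table), chosen to have five rows and two columns so that the same critical value applies.

```python
# Hypothetical counts for illustration only (NOT the School Leavers' Survey data).
observed = [
    [120, 80],
    [60, 90],
    [50, 50],
    [30, 70],
    [40, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected value for each cell: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum of (O - E)^2 / E over every cell.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# 5% critical value for 4 degrees of freedom, as quoted in the text.
CRITICAL_VALUE_5PCT = 9.488

print("chi-squared:", chi_squared, "df:", degrees_of_freedom)
if chi_squared > CRITICAL_VALUE_5PCT:
    print("Evidence of association at the 5% level")
else:
    print("Consistent with no association")
```

The critical value is hard-coded because it depends on the table's size; for a table with a different number of rows or columns, look up the value for the matching degrees of freedom.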
The table above is drawn from the School Leavers' Survey, a subset of which is available here (or, in Stata, run the command use http://teaching.sociology.ul.ie/so5041/labs/example6). Download and load it into Stata; recode the empstat variable to match the table and re-create the crosstab, with the chi-squared statistic (hint: the option chi after a comma). Compare Stata's value to the one you calculated by hand.
Using the tables, calculate the t-score for the 90, 95 and 99% levels for:
Make a table of your results -- what patterns do you see?
Suppose you have been given information from a sample: mean age is 45 years, standard deviation 14 and sample size 50. Make a 95% confidence interval around the mean age, first using the normal distribution. Then repeat the exercise using the t-distribution. How do the two confidence intervals differ and which should you use?
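The two confidence intervals can be sketched in Python using the figures given above (mean 45, standard deviation 14, sample size 50). The normal critical value is computed from the standard library; the t critical value of 2.010 for 49 degrees of freedom is an approximate value entered by hand, and is an assumption here: a printed table may only list nearby degrees of freedom, in which case use the closest row.

```python
from math import sqrt
from statistics import NormalDist

# Sample information from the exercise.
mean_age = 45.0
std_dev = 14.0
n = 50

# Standard error of the mean.
standard_error = std_dev / sqrt(n)

# 95% two-sided critical value from the normal distribution (about 1.96).
z_crit = NormalDist().inv_cdf(0.975)

# 95% two-sided critical value from the t distribution with n - 1 = 49
# degrees of freedom. Assumed approximate table value, entered by hand.
t_crit = 2.010

normal_ci = (mean_age - z_crit * standard_error,
             mean_age + z_crit * standard_error)
t_ci = (mean_age - t_crit * standard_error,
        mean_age + t_crit * standard_error)

print(f"Standard error: {standard_error:.3f}")
print(f"Normal 95% CI:  {normal_ci[0]:.2f} to {normal_ci[1]:.2f}")
print(f"t 95% CI:       {t_ci[0]:.2f} to {t_ci[1]:.2f}")
```

The t-based interval comes out slightly wider, because the t critical value is larger than the normal one for any finite sample size.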
Suppose someone has claimed that the average age in this population is 40 years -- what does your data say about this claim?
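One way to assess the claim is to ask how many standard errors the sample mean lies from the claimed value, and compare that with the critical value. The sketch below does this in Python for a two-sided test at the 5% level; as an assumption, it uses an approximate t critical value of 2.010 for 49 degrees of freedom, entered by hand rather than taken from the course tables.

```python
from math import sqrt

# Sample information from the exercise, and the claimed population mean.
mean_age = 45.0
std_dev = 14.0
n = 50
claimed_mean = 40.0

# Standard error of the mean.
standard_error = std_dev / sqrt(n)

# Distance between sample mean and claimed mean, in standard errors.
t_statistic = (mean_age - claimed_mean) / standard_error

# Approximate two-sided 5% critical value for 49 degrees of freedom
# (assumed table value, entered by hand).
t_crit = 2.010

# The claim is hard to reconcile with the data if |t| exceeds the critical value.
claim_rejected = abs(t_statistic) > t_crit

print(f"t statistic: {t_statistic:.2f}")
print("Claim rejected at the 5% level:", claim_rejected)
```

This is equivalent to checking whether the claimed value of 40 falls inside the 95% confidence interval around the sample mean.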