Guillaume Calmettes
The $\chi^2$ statistic: comparing proportions to "hypothesized" proportions.
Note:
The $\chi^2$ test for independence can also be applied to a 2x2 contingency table, and will give you the same results (p-value or confidence intervals) as if you were comparing the difference between the two proportions, or their ratio.
| Response var. \ Explanatory var. | Cat. #1 | Cat. #2 |
|---|---|---|
| Yes | 50 | 65 |
| No | 30 | 45 |
So far, we’ve learned how to do inference for a difference in means IF the categorical variable has only two categories (i.e. compare two groups)
Today, we’ll learn how to do hypothesis tests for a difference in means (or medians) across multiple categories (i.e. compare more than two groups)
As young students in Australia, Dominic Kelly and his friends enjoyed watching ants gather on pieces of sandwiches. Later, as a university student, Dominic decided to study this with a more formal experiment. He chose three types of sandwich fillings to compare:
To conduct the experiment he randomly chose a sandwich, broke off a piece, and left it on the ground near an ant hill. After several minutes he placed a jar over the sandwich bit and counted the number of ants. He repeated the process, allowing time for ants to return to the hill after each trial, until he had eight samples for each of the three sandwich fillings.
Dominic's data are shown below:
Number of ants visiting sandwiches:

| | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | $\bar{x}$ | $s$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Vegemite | 18 | 29 | 42 | 42 | 31 | 21 | 38 | 25 | 30.75 | 9.25 |
| Peanut Butter | 43 | 59 | 22 | 25 | 36 | 47 | 19 | 21 | 34.0 | 14.63 |
| Ham & Pickles | 44 | 34 | 36 | 49 | 54 | 65 | 59 | 53 | 49.25 | 10.79 |
Do ants have a preferential sandwich filling?
$H_0$: $\mu_\texclass{warning}{1}=\mu_\texclass{info}{2}=\mu_\texclass{danger}{3}$
$H_a$: At least one $\mu_i\neq\mu_j$
We have 3 groups. Why not just carry out a bunch of pair-wise analyses of the difference in mean?
$H_a$: $\mu_\texclass{warning}{1}\neq\mu_\texclass{info}{2}$
$H_a$: $\mu_\texclass{warning}{1}\neq\mu_\texclass{danger}{3}$
$H_a$: $\mu_\texclass{info}{2}\neq\mu_\texclass{danger}{3}$
Each test is carried out at $\alpha=0.05$, so the risk of making a Type I error (rejecting the null when it is true) is $5\%$ each time.
This risk of making at least one Type I error increases as we multiply the number of tests on the same data: with three independent tests at $\alpha=0.05$, it already reaches $1-(1-0.05)^3\approx 14\%$.
We need a way to make multiple comparisons while keeping an overall Type I error of $\alpha=0.05$ (or whichever level is specified): an alternative approach that uses one overall test comparing all means at once.
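To see why pairwise testing inflates the error rate, here is a quick back-of-the-envelope calculation (a sketch that assumes the tests are independent, which is only an approximation since they share data):

```python
# Chance of at least one false positive across m independent tests,
# each carried out at significance level alpha
alpha = 0.05
for m in (1, 3, 10):
    fwer = 1 - (1 - alpha) ** m
    print(f"{m:2d} tests -> P(at least one Type I error) = {fwer:.1%}")
```

With three pairwise comparisons the overall (family-wise) error rate is already about $14\%$, not $5\%$.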
Last time, we used an overall test ($\chi^2$) to compare multiple proportions; we can probably do the same for comparing means.
If we have two means to compare, we just need to look at their difference (or ratio) to measure how far apart they are.
Suppose we wanted to compare three means. How could I create something that would measure how different all three means are?
Calculate the average differences between the three means!
| | $\bar{x}$ | $s$ |
|---|---|---|
| Vegemite | 30.75 | 9.25 |
| Peanut Butter | 34.0 | 14.63 |
| Ham & Pickles | 49.25 | 10.79 |
$d_1=30.75-34.0=\texclass{success}{-3.25}$
$d_2=34.0-49.25=\texclass{success}{-15.25}$
$d_3=49.25-30.75=\texclass{success}{18.5}$
How can we combine those 3 differences into a single statistic of interest so they don't cancel out?
The average (or mean) of the absolute values of the differences in the conditional means, referred to as the MAD statistic, provides a measure of how far apart the sample means are on average.
For the ants/sandwiches study:
$\mathrm{MAD}=\frac{\mid d_1\mid + \mid d_2 \mid + \mid d_3 \mid}{3}$
$\mathrm{MAD}=\frac{\mid-3.25\mid+\mid-15.25\mid+\mid 18.5\mid}{3}$
$\mathrm{MAD}=12.33$
The average distance from one mean to another is 12.33.
On average, the numbers of ants attracted by two different sandwiches differ by 12.33. Is that a lot or not?
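The calculation above can be sketched in a few lines of Python, working directly from the three sample means:

```python
# Sample means from the ants/sandwiches study
means = [30.75, 34.0, 49.25]   # Vegemite, Peanut Butter, Ham & Pickles

# Absolute pairwise differences between the means
diffs = [abs(means[i] - means[j])
         for i in range(len(means)) for j in range(i + 1, len(means))]

mad = sum(diffs) / len(diffs)
print(round(mad, 2))  # 12.33
```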
$H_0$: There is no association between which kind of sandwich is left on the ground and the number of ants that are attracted by it.
All three long-term means ($\mu$) are the same.
$H_0$: $\mu_\texclass{warning}{1}=\mu_\texclass{info}{2}=\mu_\texclass{danger}{3}$
$H_a$: There is an association between the kind of sandwich and the number of ants that are attracted.
$H_a$: at least one of the means is different
If the sandwich filling doesn't affect the number of ants attracted, then any of the counts Dominic recorded could have been observed for any of the sandwich fillings.
All the recorded counts could have come from the same population, and this particular arrangement of numbers just happened by chance.
To simulate the null hypothesis, we randomize the groups (shuffle all the pooled ant counts), redistribute the shuffled numbers into 3 groups (same sample sizes as the original groups), and calculate the resulting MAD statistic each time.
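The shuffling procedure can be sketched as follows (10000 reshuffles; the exact p-value will vary slightly with the random seed):

```python
import numpy as np

rng = np.random.default_rng(42)

# Ant counts for each sandwich filling
vegemite = [18, 29, 42, 42, 31, 21, 38, 25]
peanut   = [43, 59, 22, 25, 36, 47, 19, 21]
ham      = [44, 34, 36, 49, 54, 65, 59, 53]

def mad(groups):
    """Mean of the absolute pairwise differences between group means."""
    m = [np.mean(g) for g in groups]
    return np.mean([abs(m[i] - m[j])
                    for i in range(len(m)) for j in range(i + 1, len(m))])

mad_original = mad([vegemite, peanut, ham])          # 12.33

pooled = np.array(vegemite + peanut + ham)
null_mads = []
for _ in range(10000):
    s = rng.permutation(pooled)                      # shuffle the pooled counts
    null_mads.append(mad([s[:8], s[8:16], s[16:]]))  # redistribute into 3 groups

p_value = np.mean(np.array(null_mads) >= mad_original)
```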
$MAD_\mathrm{original}=12.33$
Number of ants visiting sandwiches:

| | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | $\bar{x}$ |
|---|---|---|---|---|---|---|---|---|---|
| Vegemite | 18 | 29 | 42 | 42 | 31 | 21 | 38 | 25 | 30.75 |
| Peanut Butter | 43 | 59 | 22 | 25 | 36 | 47 | 19 | 21 | 34.0 |
| Ham & Pickles | 44 | 34 | 36 | 49 | 54 | 65 | 59 | 53 | 49.25 |
Number of ants visiting sandwiches:

| | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | $\bar{x}$ |
|---|---|---|---|---|---|---|---|---|---|
| Vegemite | 47 | 49 | 22 | 42 | 31 | 65 | 38 | 25 | 39.88 |
| Peanut Butter | 53 | 34 | 43 | 21 | 21 | 59 | 54 | 25 | 38.75 |
| Ham & Pickles | 18 | 19 | 44 | 36 | 42 | 29 | 36 | 59 | 35.38 |
$MAD_1=3.0$
Number of ants visiting sandwiches:

| | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | $\bar{x}$ |
|---|---|---|---|---|---|---|---|---|---|
| Vegemite | 44 | 29 | 36 | 36 | 43 | 42 | 19 | 25 | 34.25 |
| Peanut Butter | 59 | 25 | 18 | 34 | 47 | 53 | 31 | 21 | 36 |
| Ham & Pickles | 22 | 21 | 49 | 65 | 42 | 59 | 54 | 38 | 43.75 |
$MAD_2=6.33$
Repeat this process 10000 times! ($MAD_1 ... MAD_{10000}$)
We obtain 10000 MAD statistics that could have been observed if there was no association between the sandwich filling and the number of ants attracted to it.
Where does the initial MAD statistic lie in this distribution? What is the probability of getting such an extreme statistic by chance?
The probability of observing such a difference in the number of ants attracted by the sandwiches if the filling had no effect would be $1.3\%$.
There is evidence that ants do not prefer sandwich fillings equally.
From this null hypothesis testing, we can conclude that ants do not prefer sandwich fillings equally. Which filling is particularly attractive?
Several options are possible as follow-up tests:
Vegemite vs Peanut Butter:
$30.75 - 34.0 = -3.25$ with $95\%$ CI $(-14.75, 7.50)$
Vegemite vs Ham & Pickles:
$30.75 - 49.25 = -18.50$ with $95\%$ CI $(-27.50, -9.375)$
Peanut Butter vs Ham & Pickles:
$34.0 - 49.25 = -15.25$ with $95\%$ CI $(-26.75, -3.00)$
Ham & Pickles is a big hit at every party in the ant world!
Our original data did not have obvious outliers and the mean was a good-enough descriptor for each of our groups, but other datasets could contain anomalies.
What could we do if this happens?
Same thing but using the median!!
Just calculate a MAD-like statistic using the median of each group instead of the mean (Mean Median Difference).
Using resampling, you can adapt your statistic of interest to the characteristics of your data!
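A sketch of that flexibility: the same pairwise-difference statistic with a pluggable center. The `mad_like` helper name is illustrative, not from the original study.

```python
import numpy as np

vegemite = [18, 29, 42, 42, 31, 21, 38, 25]
peanut   = [43, 59, 22, 25, 36, 47, 19, 21]
ham      = [44, 34, 36, 49, 54, 65, 59, 53]

def mad_like(groups, center=np.median):
    """Mean absolute pairwise difference between group centers;
    pass center=np.mean to recover the classic MAD statistic."""
    c = [center(g) for g in groups]
    return np.mean([abs(c[i] - c[j])
                    for i in range(len(c)) for j in range(i + 1, len(c))])

stat = mad_like([vegemite, peanut, ham])   # median-based version
```

The rest of the randomization test is unchanged: shuffle, redistribute, recompute `stat`, and compare.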
Pros:
Cons??:
As with comparing multiple proportions ($\chi^2$), comparing multiple means also has its own, more complex statistic, whose null distribution can be closely modeled by theory.
This new statistic is called an $F$-statistic and the resampling-based (or theory-based) distribution that estimates the null distribution is called an $F$-distribution.
What is this $F$-statistic?
Consider these 3 groups (same means in the left and right panels). Intuitively, in which case do the groups seem more different from one another? Why?
Total variability = Variability between groups + Variability within groups
For the same total variability, the greater the variability between groups is compared to the variability within groups, the more likely the groups are to differ from one another.
An ANalysis Of VAriance (ANOVA) compares the variability between groups to the variability within groups. The statistic of interest that reflects this comparison between the two types of variability is the $F$-statistic.
How to measure variability between groups?
How to measure variability within groups?
How to compare the two measures?
How to determine significance?
To characterize the variability between groups, we consider how far each sample mean ($\bar{x}_k$) is from the "grand mean" ($\bar{x}$, the mean of the means).
To characterize the variability within groups, we consider how far each data point ($x_{ik}$) inside each sample is from its sample mean ($\bar{x}_k$).
| Total variability | Variability between groups | Variability within groups |
|---|---|---|
| $\sum_{i=1}^n(x_{i}-\bar{x})^2$ | $\sum_{k=1}^{K}n_{k}(\bar{x}_k-\bar{x})^2$ | $\sum_{k=1}^{K}\sum_{i=1}^{n_k}(x_{ik}-\bar{x}_k)^2$ |
| Total sum of squares (SST) | Sum of squares due to groups (SSG) | "Error" sum of squares (SSE) |
|---|---|---|
| $\sum_{i=1}^n(x_{i}-\bar{x})^2$ | $\sum_{k=1}^{K}n_{k}(\bar{x}_k-\bar{x})^2$ | $\sum_{k=1}^{K}\sum_{i=1}^{n_k}(x_{ik}-\bar{x}_k)^2$ |
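A short numerical check on the ants data: computing the three sums of squares directly shows that the total variability decomposes exactly into the between-group and within-group pieces.

```python
import numpy as np

groups = [np.array([18, 29, 42, 42, 31, 21, 38, 25]),   # Vegemite
          np.array([43, 59, 22, 25, 36, 47, 19, 21]),   # Peanut Butter
          np.array([44, 34, 36, 49, 54, 65, 59, 53])]   # Ham & Pickles

pooled = np.concatenate(groups)
grand_mean = pooled.mean()                               # 38.0 here

sst = ((pooled - grand_mean) ** 2).sum()                 # total variability
ssg = sum(len(g) * (g.mean() - grand_mean) ** 2
          for g in groups)                               # between groups
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)   # within groups

assert np.isclose(sst, ssg + sse)                        # SST = SSG + SSE
```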
The $F$-statistic is a ratio of the variability between groups to the variability within groups, but uses the mean variability instead of the total variability for each component.
$F=\frac{\texclass{warning}{\textrm{Mean variability between groups}}}{\texclass{info}{\textrm{Mean variability within groups}}}$
$F=\frac{\texclass{warning}{\textrm{Mean SSG}}}{\texclass{info}{\textrm{Mean SSE}}}=\frac{\texclass{warning}{\textrm{SSG}/(K-1)}}{\texclass{info}{\textrm{SSE}/(n-K)}}$, where $K$ is the number of groups and $n$ the total number of observations.
Notes:
The $F$-statistic equation can then be expressed as ($K$ groups, $n$ observations in total):
$F=\frac{\sum_{k=1}^{K}n_{k}(\bar{x}_k-\bar{x})^2/(K-1)}{\sum_{k=1}^{K}\sum_{i=1}^{n_k}(x_{ik}-\bar{x}_k)^2/(n-K)}$
Note:
Don't worry, nobody calculates it by hand; it would be very tedious. You can rely on technology to do the hard work for you.
The $F$-statistic is just another statistic, like the MAD; it doesn't change the approach to comparing different groups.
Null hypothesis testing:
To simulate the null hypothesis, we randomize the groups (shuffle the observed numbers of ants among the fillings) and calculate the resulting $F$-statistic each time.
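This is the same shuffling procedure as before, with the $F$-statistic in place of the MAD (a sketch; the exact p-value varies slightly with the seed):

```python
import numpy as np

rng = np.random.default_rng(42)

pooled = np.array([18, 29, 42, 42, 31, 21, 38, 25,     # Vegemite
                   43, 59, 22, 25, 36, 47, 19, 21,     # Peanut Butter
                   44, 34, 36, 49, 54, 65, 59, 53])    # Ham & Pickles

def f_stat(data, sizes=(8, 8, 8)):
    """F = (SSG / (K-1)) / (SSE / (n-K)) for consecutive groups in `data`."""
    edges = np.cumsum((0,) + sizes)
    groups = [data[edges[i]:edges[i + 1]] for i in range(len(sizes))]
    n, K = len(data), len(groups)
    grand = data.mean()
    ssg = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    sse = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ssg / (K - 1)) / (sse / (n - K))

f_original = f_stat(pooled)                  # about 5.63 for the real grouping
null_f = [f_stat(rng.permutation(pooled)) for _ in range(10000)]
p_value = np.mean(np.array(null_f) >= f_original)
```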
$F_\mathrm{original}=5.63$
Number of ants visiting sandwiches:

| | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | $\bar{x}$ |
|---|---|---|---|---|---|---|---|---|---|
| Vegemite | 18 | 29 | 42 | 42 | 31 | 21 | 38 | 25 | 30.75 |
| Peanut Butter | 43 | 59 | 22 | 25 | 36 | 47 | 19 | 21 | 34.0 |
| Ham & Pickles | 44 | 34 | 36 | 49 | 54 | 65 | 59 | 53 | 49.25 |
Number of ants visiting sandwiches:

| | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | $\bar{x}$ |
|---|---|---|---|---|---|---|---|---|---|
| Vegemite | 47 | 49 | 22 | 42 | 31 | 65 | 38 | 25 | 39.88 |
| Peanut Butter | 53 | 34 | 43 | 21 | 21 | 59 | 54 | 25 | 38.75 |
| Ham & Pickles | 18 | 19 | 44 | 36 | 42 | 29 | 36 | 59 | 35.38 |
$F_1=0.21$
Number of ants visiting sandwiches:

| | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | $\bar{x}$ |
|---|---|---|---|---|---|---|---|---|---|
| Vegemite | 44 | 29 | 36 | 36 | 43 | 42 | 19 | 25 | 34.25 |
| Peanut Butter | 59 | 25 | 18 | 34 | 47 | 53 | 31 | 21 | 36 |
| Ham & Pickles | 22 | 21 | 49 | 65 | 42 | 59 | 54 | 38 | 43.75 |
$F_2=1.06$
Repeat this process 10000 times! ($F_1 ... F_{10000}$)
We obtain 10000 $F$-statistics that could have been observed if there was no association between the sandwich filling and the number of ants attracted to it.
Where does the initial $F$-statistic lie in this distribution? What is the probability of getting such an extreme statistic by chance?
The probability of observing such a difference in the number of ants attracted by the sandwiches if the filling had no effect would be $1.3\%$.
There is evidence that ants do not prefer sandwich fillings equally.
Using the $F$-statistic as our statistic of interest, we obtain the same p-value as when we used the MAD, and our conclusion is the same, i.e. that ants do not prefer sandwich fillings equally. Which filling is particularly attractive?
Several options are possible as follow-up tests (as for the MAD analysis):
Vegemite vs Peanut Butter:
$30.75 - 34.0 = -3.25$ with $95\%$ CI $(-14.75, 7.50)$
Vegemite vs Ham & Pickles:
$30.75 - 49.25 = -18.50$ with $95\%$ CI $(-27.50, -9.375)$
Peanut Butter vs Ham & Pickles:
$34.0 - 49.25 = -15.25$ with $95\%$ CI $(-26.75, -3.00)$
Ham & Pickles is a big hit at every party in the ant world!
The distribution of the $F$-statistic under the assumption that the null hypothesis is true can be described theoretically by a particular distribution, called the $F$-distribution, which depends on the degrees of freedom for the variability between groups ($K-1$) and within groups ($n-K$).
A p-value is obtained by computing the area under the curve for $F\geq F_\mathrm{original}$.
(Here again, the p-value obtained is pretty close to what we calculated by resampling, and the conclusions of our study are the same.)
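A sketch of that theory-based calculation using SciPy's $F$-distribution (degrees of freedom $K-1=2$ and $n-K=21$ for the ants study):

```python
from scipy import stats

f_original = 5.63     # observed F-statistic
df_between = 3 - 1    # K - 1 (three sandwich fillings)
df_within = 24 - 3    # n - K (24 trials in total)

# Area under the F(2, 21) density to the right of the observed statistic
p_value = stats.f.sf(f_original, df_between, df_within)
```

This tail area lands near $1\%$, consistent with the resampling-based p-value of $1.3\%$.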
If you use the MAD, the idea behind comparing multiple ($>2$) means is the same as when you compared only two means.
MAD-like statistics can be designed to fit your particular dataset or study (if you want to analyze the average relative risk, this is possible too).
The MAD can only be used with a resampling approach (no theory describes the distribution of the MAD or MAD-like statistics).
The $F$-statistic compares the variability between groups (sample means) to the variability within groups (variance within the groups). The bigger the variability between groups compared to the variability within groups, the bigger the $F$-statistic.