Stats 13

Lecture 15

What about comparing
more than 2 means

Guillaume Calmettes

Last time

The $\chi^2$ statistic, to compare proportions to "hypothesized" proportions.

  • Pattern of a proportion distribution ($\chi^2$ test for goodness-of-fit)
  • Association of a response variable with an explanatory variable ($\chi^2$ test for independence)

Note:
The $\chi^2$ test for independence can also be applied to a 2x2 contingency table, and will give you the same results (p-value or confidence interval) as comparing the difference in proportions, or the ratio of the two proportions.

                    Explanatory var.
Response variable   Cat. #1   Cat. #2
Yes                 50        65
No                  30        45

Multiple explanatory categories for quantitative variables

So far, we’ve learned how to do inference for a difference in means IF the categorical variable has only two categories (i.e. compare two groups)

Today, we’ll learn how to do hypothesis tests for a difference in means (or medians) across multiple categories (i.e. compare more than two groups)

Gluttonous ants

As young students in Australia, Dominic Kelly and his friends enjoyed watching ants gather on pieces of sandwiches. Later, as a university student, Dominic decided to study this with a more formal experiment. He chose three types of sandwich fillings to compare:

  • Vegemite
  • Peanut Butter
  • Ham and Pickles

To conduct the experiment he randomly chose a sandwich, broke off a piece, and left it on the ground near an ant hill. After several minutes he placed a jar over the sandwich bit and counted the number of ants. He repeated the process, allowing time for ants to return to the hill after each trial, until he had eight samples for each of the three sandwich fillings.

Gluttonous ants

Dominic's data are shown below:

Number of ants visiting sandwiches
#1 #2 #3 #4 #5 #6 #7 #8 $\bar{x}$ $s$
Vegemite 18 29 42 42 31 21 38 25 30.75 9.25
Peanut Butter 43 59 22 25 36 47 19 21 34.0 14.63
Ham & Pickles 44 34 36 49 54 65 59 53 49.25 10.79

Gluttonous ants

Do ants have a preferential sandwich filling?

$H_0$: $\mu_\texclass{warning}{1}=\mu_\texclass{info}{2}=\mu_\texclass{danger}{3}$

$H_a$: At least one $\mu_i\neq\mu_j$

Comparing means

We have 3 groups. Why not just carry out a bunch of pairwise analyses of the differences in means?

$H_0$: $\mu_\texclass{warning}{1}=\mu_\texclass{info}{2}$

$H_0$: $\mu_\texclass{warning}{1}=\mu_\texclass{danger}{3}$

$H_0$: $\mu_\texclass{info}{2}=\mu_\texclass{danger}{3}$

Controlling for Type I error

Each test is carried out at $\alpha=0.05$, so the risk of making a Type I error (rejecting the null when it is true) is $5\%$ each time.

This risk of making a Type I error increases as we multiply the number of tests on the same data:

  • At the $5\%$ significance level, the probability of making at least one type I error for 3 tests would be $14\%$
  • Comparing 4 means (6 tests), this jumps to $26\%$
  • Comparing 5 means (10 tests), this jumps to $40\%$
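These percentages come from a quick calculation, sketched below under the simplifying assumption that the tests are independent (only an approximation, since the tests reuse the same data):

```python
# Probability of making at least one Type I error across m independent
# tests, each carried out at significance level alpha = 0.05.
alpha = 0.05
for k, m in [(3, 3), (4, 6), (5, 10)]:  # k means -> m = k*(k-1)/2 pairwise tests
    p_any = 1 - (1 - alpha) ** m
    print(f"{k} means, {m} tests: P(at least one Type I error) = {p_any:.0%}")
# -> 14%, 26%, 40%
```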

We need to make multiple comparisons with an overall Type I error of $\alpha=0.05$ (or whichever level is specified). We need an alternative approach that uses one overall test that compares all means at once.

Overall test (part 1: the simple way)

Last time we used an overall test ($\chi^2$) to compare multiple proportions; we can probably do the same to compare means.

If we have two means to compare, we just need to look at their difference (or ratio) to measure how far apart they are.

Suppose we wanted to compare three means. How could we create a single number that measures how different all three means are?

Calculate the average of the differences between the three means!

Overall test (part 1: the simple way)

$\bar{x}$ $s$
Vegemite 30.75 9.25
Peanut Butter 34.0 14.63
Ham & Pickles 49.25 10.79

$d_1=30.75-34.0=\texclass{success}{-3.25}$

$d_2=34.0-49.25=\texclass{success}{-15.25}$

$d_3=49.25-30.75=\texclass{success}{18.5}$

How can we combine those 3 differences into a single statistic of interest so they don't cancel out?

Mean Absolute Difference (MAD) statistic

The average (or mean) of the absolute values of the differences in the conditional means, referred to as the MAD statistic, provides a measure of how far apart the sample means are on average.

$$\mathrm{MAD}=\frac{\sum_{\substack{i,j \\ i < j}}\mid\bar{x}_i-\bar{x}_j\mid}{\binom{k}{2}}$$

where $k$ is the number of groups, so $\binom{k}{2}$ is the number of pairs of means (3 pairs for 3 groups).

Let's get MAD!

For the ants/sandwiches study:

$\mathrm{MAD}=\frac{\mid d_1\mid + \mid d_2 \mid + \mid d_3 \mid}{3}$

$\mathrm{MAD}=\frac{\mid-3.25\mid+\mid-15.25\mid+\mid 18.5 \mid}{3}$

$\mathrm{MAD}=12.33$

The average distance between any two of the group means is 12.33.

On average, the number of ants attracted by one sandwich filling differs from another filling by 12.33. Is that a lot or not?
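As a sketch, the whole calculation fits in a few lines of Python (the helper name mad_statistic is mine, not from the lecture):

```python
from itertools import combinations

def mad_statistic(groups):
    """Mean of the absolute pairwise differences between the group means."""
    means = [sum(g) / len(g) for g in groups]
    diffs = [abs(m1 - m2) for m1, m2 in combinations(means, 2)]
    return sum(diffs) / len(diffs)

vegemite      = [18, 29, 42, 42, 31, 21, 38, 25]
peanut_butter = [43, 59, 22, 25, 36, 47, 19, 21]
ham_pickles   = [44, 34, 36, 49, 54, 65, 59, 53]
groups = [vegemite, peanut_butter, ham_pickles]

print(round(mad_statistic(groups), 2))  # 12.33
```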

Simulating the null (MAD)

$H_0$: There is no association between which kind of sandwich is left on the ground and the number of ants that are attracted by it.

All three long-term means ($\mu$) are the same.
$H_0$: $\mu_\texclass{warning}{1}=\mu_\texclass{info}{2}=\mu_\texclass{danger}{3}$

$H_a$: There is an association between the kind of sandwich and the number of ants that are attracted.

$H_a$: at least one of the means is different

Simulating the null (MAD)

$H_0$: There is no association between which kind of sandwich is left on the ground and the number of ants that are attracted by it.

All three long-term means ($\mu$) are the same.
$H_0$: $\mu_\texclass{warning}{1}=\mu_\texclass{info}{2}=\mu_\texclass{danger}{3}$

If the sandwich filling doesn't affect the number of ants that are attracted, then all the numbers of ants Dominic recorded for each of his trials could have been observed for any of the sandwich fillings.

All the numbers of ants recorded could have come from the same population, and this particular allocation of numbers across fillings just happened by chance.

Simulating the null (MAD)

To simulate the null hypothesis, we randomize the groups: shuffle all the pooled numbers of ants, redistribute the shuffled numbers into 3 groups (with the same sample sizes as the original groups), and calculate the resulting MAD statistic each time.

$MAD_\mathrm{original}=12.33$

Original data:

Number of ants visiting sandwiches
#1 #2 #3 #4 #5 #6 #7 #8 $\bar{x}$
Vegemite 18 29 42 42 31 21 38 25 30.75
Peanut Butter 43 59 22 25 36 47 19 21 34.0
Ham & Pickles 44 34 36 49 54 65 59 53 49.25

Shuffle 1:

Number of ants visiting sandwiches
#1 #2 #3 #4 #5 #6 #7 #8 $\bar{x}$
Vegemite 47 49 22 42 31 65 38 25 39.88
Peanut Butter 53 34 43 21 21 59 54 25 38.75
Ham & Pickles 18 19 44 36 42 29 36 59 35.37

$MAD_1=3.0$

Shuffle 2:

Number of ants visiting sandwiches
#1 #2 #3 #4 #5 #6 #7 #8 $\bar{x}$
Vegemite 44 29 36 36 43 42 19 25 34.25
Peanut Butter 59 25 18 34 47 53 31 21 36.0
Ham & Pickles 22 21 49 65 42 59 54 38 43.75

$MAD_2=6.33$

Repeat this process 10000 times! ($MAD_1 ... MAD_{10000}$)
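A minimal sketch of this shuffling loop in Python, reusing mad_statistic and the data lists from the earlier snippet (simulate_null and the seed are my own choices; the exact p-value will vary slightly from run to run):

```python
import random

def simulate_null(groups, statistic, n_sims=10_000, seed=13):
    """Pool all observations, shuffle them, redistribute them into groups of
    the original sizes, and collect the statistic from each shuffled dataset."""
    rng = random.Random(seed)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    null_stats = []
    for _ in range(n_sims):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        null_stats.append(statistic(shuffled))
    return null_stats

null_mads = simulate_null(groups, mad_statistic)
mad_obs = mad_statistic(groups)                                  # 12.33
p_value = sum(m >= mad_obs for m in null_mads) / len(null_mads)  # ~0.013
```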

Null MAD distribution

We obtain 10000 MAD statistics that could have been observed if there was no association between the sandwich filling and the number of ants attracted to it.

Where does the initial MAD statistic lie in this distribution? What is the probability of getting such an extreme statistic by chance?

The probability of observing such a difference in the number of ants attracted by the sandwiches if the filling had no effect would be $1.3\%$.
There is evidence that ants do not prefer sandwich fillings equally.

Which filling is particularly attractive?

From this null hypothesis testing, we can conclude that ants do not prefer sandwich fillings equally. Which filling is particularly attractive?

Several options are possible as follow-up tests:

  • Confidence interval for a single mean
  • Confidence interval for a difference in two means
  • Pairwise comparison for a difference in two means

Vegemite vs Peanut Butter:
$30.75 - 34.0 = -3.25$ with $95\%$ CIs ($-14.75, 7.50$)

Vegemite vs Ham & Pickles:
$30.75 - 49.25 = -18.50$ with $95\%$ CIs ($-27.50, -9.375$)

Peanut Butter vs Ham & Pickles:
$34.0 - 49.25 = -15.25$ with $95\%$ CIs ($-26.75, -3.00$)

Ham & Pickles is a big hit at every party in the ant world!
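The slides don't say how these intervals were computed; since they are not symmetric around the observed differences, a percentile bootstrap is a plausible mechanism. Here is a sketch for one pairwise comparison, reusing the data lists from the earlier snippet (bootstrap_diff_ci and the seed are mine):

```python
import random

def bootstrap_diff_ci(a, b, n_boot=10_000, seed=13):
    """Percentile-bootstrap 95% CI for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resample_a = [rng.choice(a) for _ in a]  # resample with replacement
        resample_b = [rng.choice(b) for _ in b]
        diffs.append(sum(resample_a) / len(resample_a)
                     - sum(resample_b) / len(resample_b))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

print(bootstrap_diff_ci(vegemite, peanut_butter))  # near (-14.75, 7.50)
```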

What about outliers?

Our original data did not contain obvious outliers, and the mean was a good descriptor for each of our groups, but other datasets could contain such anomalies.

What could we do if this happens?

Same thing but using the median!!

Just calculate a MAD-like statistic using the median of each group instead of the mean (the mean of the absolute differences between the group medians).

Using resampling, you can adapt your statistic of interest to the characteristics of your data!
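For instance, a hypothetical median-based variant drops straight into the earlier shuffling routine:

```python
from statistics import median
from itertools import combinations

def mmd_statistic(groups):
    """MAD-like statistic computed on group medians instead of means."""
    medians = [median(g) for g in groups]
    diffs = [abs(m1 - m2) for m1, m2 in combinations(medians, 2)]
    return sum(diffs) / len(diffs)

# The shuffling routine is unchanged; only the statistic is swapped:
# null_mmds = simulate_null(groups, mmd_statistic)
```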

Recap on the MAD statistic

Pros:

  • Easy to understand in the context of the study and easy to relate directly to the data (average distance between the means of the different groups)
  • Can be adapted to better reflect the particular shape of the data (compare medians instead of means if outliers are present)
  • Does not require any distributional assumptions

Cons??:

  • The value of the MAD is specific to each dataset analyzed (you cannot directly compare one MAD value to another) and it carries no information about the sample size (it is not a standardized statistic). Some people think this is annoying.
  • The distribution of the MAD cannot be predicted by theory. Some people think this is annoying too.

Overall test (part 2: the complicated way)

Like comparing multiple proportions ($\chi^2$), comparing multiple means also has its own more complex statistic, with a distribution that can be modeled quite well by theory.

This new statistic is called an $F$-statistic and the resampling-based (or theory-based) distribution that estimates the null distribution is called an $F$-distribution.

What is this $F$-statistic?

Variability in a multiple-group dataset

Consider these 3 groups (same means in the left and right panels). Intuitively, in which case do the groups seem more different from one another? Why?

Variability in a multiple-group dataset

Total variability $=$ Variability between groups $+$ Variability within groups

Variability in a multiple-group dataset

For the same total variability, the greater the variability between groups is compared to the variability within groups, the more likely the groups are to differ from one another.

Analysis of variance

An ANalysis Of VAriance (ANOVA) compares the variability between groups to the variability within groups. The statistic of interest that reflects this comparison is the $F$-statistic.

Total variability $=$ Variability between groups $+$ Variability within groups

How to measure variability between groups?

How to measure variability within groups?

How to compare the two measures?

How to determine significance?

Sums of squares

To characterize the variability between groups, we consider how far each sample mean ($\bar{x}_k$) is from the "Grand mean" ($\bar{x}$, mean of the means).

To characterize the variability within groups, we consider how far each data point ($x_{ik}$) inside each sample is from its sample mean ($\bar{x}_k$).

Total variability $=$ Variability between groups $+$ Variability within groups

$$\sum_{i=1}^{n}(x_{i}-\bar{x})^2 = \sum_{k=1}^{K}n_k(\bar{x}_k-\bar{x})^2 + \sum_{k=1}^{K}\sum_{i=1}^{n_k}(x_{ik}-\bar{x}_k)^2$$

Sums of squares

To characterize the variability between groups, we consider how far each sample mean ($\bar{x}_k$) is from the "Grand mean" ($\bar{x}$, mean of the means).

To characterize the variability within groups, we consider how far each data point ($x_{ik}$) inside each sample is from its sample mean ($\bar{x}_k$).

Total sum of squares (SST) $=$ Sum of squares due to groups (SSG) $+$ "Error" sum of squares (SSE)

$$\underbrace{\sum_{i=1}^{n}(x_{i}-\bar{x})^2}_{\textrm{SST}} = \underbrace{\sum_{k=1}^{K}n_k(\bar{x}_k-\bar{x})^2}_{\textrm{SSG}} + \underbrace{\sum_{k=1}^{K}\sum_{i=1}^{n_k}(x_{ik}-\bar{x}_k)^2}_{\textrm{SSE}}$$
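A sketch verifying this decomposition on the ant data, reusing the groups list from the earlier snippet (sums_of_squares is my own helper name):

```python
def sums_of_squares(groups):
    """Decompose the total variability into between- and within-group parts."""
    pooled = [x for g in groups for x in g]
    grand_mean = sum(pooled) / len(pooled)
    group_means = [sum(g) / len(g) for g in groups]
    ssg = sum(len(g) * (m - grand_mean) ** 2
              for g, m in zip(groups, group_means))
    sse = sum((x - m) ** 2
              for g, m in zip(groups, group_means) for x in g)
    sst = sum((x - grand_mean) ** 2 for x in pooled)
    return sst, ssg, sse

sst, ssg, sse = sums_of_squares(groups)
print(sst, ssg, sse)  # 4474.0 = 1561.0 (SSG) + 2913.0 (SSE) for the ant data
```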

The $F$-statistic

The $F$-statistic is a ratio of the variability between groups to the variability within groups, but uses the mean variability rather than the total variability for each component.

$F=\frac{\texclass{warning}{\textrm{Mean variability between groups}}}{\texclass{info}{\textrm{Mean variability within groups}}}$

$F=\frac{\texclass{warning}{\textrm{Mean SSG}}}{\texclass{info}{\textrm{Mean SSE}}}=\frac{\texclass{warning}{\textrm{SSG}/(k-1)}}{\texclass{info}{\textrm{SSE}/(n-k)}}$

Notes:

  • $(k-1)$ and $(n-k)$ are the respective degrees of freedom for the variability between and within groups (the degrees of freedom for the total variability are $(n-1)$)
  • This is like normalizing the differences between each group mean and the "Grand mean" by their respective variances ($s^2$). The $F$-statistic is a standardized statistic.

The $F$-statistic

The $F$-statistic equation can then be expressed as:

$F=\frac{ \frac{\sum_{i=1}^k n_i(\bar{x}_i-\bar{x})^2}{(k-1)} }{ \frac{\sum_{i=1}^k (n_i-1)s_i^2}{(n-k)} }$

Note:
Don't worry, nobody calculates this by hand; it would be very tedious. You can rely on technology to do the hard work for you.
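For instance, with the ant data (a sketch assuming scipy is available; the manual line reuses the sums-of-squares helper from the previous snippet):

```python
from scipy import stats

# Letting scipy do the work on the ant data:
f_stat, p_theory = stats.f_oneway(vegemite, peanut_butter, ham_pickles)
print(round(f_stat, 2))  # 5.63

# Equivalent manual computation from the sums of squares above,
# with k = 3 groups and n = 24 observations in total:
k, n = 3, 24
f_manual = (ssg / (k - 1)) / (sse / (n - k))  # 780.5 / 138.71 = 5.63
```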

The $F$-statistic

The $F$-statistic is just another statistic, like the MAD; using it doesn't change the approach for comparing different groups.

Null hypothesis testing:

  1. State hypotheses
  2. Calculate a statistic, based on your sample data
  3. Create a distribution of this statistic, as it would be observed if the null hypothesis were true
  4. Measure how extreme your test statistic from (2) is, as compared to the distribution generated in (3)

Simulating the null ($F$)

To simulate the null hypothesis, we randomize the groups exactly as before (shuffle the pooled numbers of ants across fillings) and calculate the resulting $F$-statistic each time.

$F_\mathrm{original}=5.63$

Original data:

Number of ants visiting sandwiches
#1 #2 #3 #4 #5 #6 #7 #8 $\bar{x}$
Vegemite 18 29 42 42 31 21 38 25 30.75
Peanut Butter 43 59 22 25 36 47 19 21 34.0
Ham & Pickles 44 34 36 49 54 65 59 53 49.25

Shuffle 1:

Number of ants visiting sandwiches
#1 #2 #3 #4 #5 #6 #7 #8 $\bar{x}$
Vegemite 47 49 22 42 31 65 38 25 39.88
Peanut Butter 53 34 43 21 21 59 54 25 38.75
Ham & Pickles 18 19 44 36 42 29 36 59 35.37

$F_1=0.21$

Shuffle 2:

Number of ants visiting sandwiches
#1 #2 #3 #4 #5 #6 #7 #8 $\bar{x}$
Vegemite 44 29 36 36 43 42 19 25 34.25
Peanut Butter 59 25 18 34 47 53 31 21 36.0
Ham & Pickles 22 21 49 65 42 59 54 38 43.75

$F_2=1.06$

Repeat this process 10000 times! ($F_1 ... F_{10000}$)
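Reusing simulate_null and sums_of_squares from the earlier sketches, only the statistic function changes (f_statistic is my own helper name):

```python
def f_statistic(groups):
    """F-statistic computed from the sums-of-squares decomposition."""
    sst, ssg, sse = sums_of_squares(groups)
    k = len(groups)
    n = sum(len(g) for g in groups)
    return (ssg / (k - 1)) / (sse / (n - k))

null_fs = simulate_null(groups, f_statistic)               # same shuffling
f_obs = f_statistic(groups)                                # 5.63
p_value = sum(f >= f_obs for f in null_fs) / len(null_fs)  # ~0.013
```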

Null $F$ distribution

We obtain 10000 $F$-statistics that could have been observed if there was no association between the sandwich filling and the number of ants attracted to it.

Where does the initial $F$-statistic lie in this distribution? What is the probability of getting such an extreme statistic by chance?

The probability of observing such a difference in the number of ants attracted by the sandwiches if the filling had no effect would be $1.3\%$.
There is evidence that ants do not prefer sandwich fillings equally.

Which filling is particularly attractive?

Using the $F$-statistic as our statistic of interest, we obtain the same p-value as when we used the MAD, and our conclusions are the same: ants do not prefer sandwich fillings equally. Which filling is particularly attractive?

Several options are possible as follow-up tests (as for the MAD analysis):

  • Confidence interval for a single mean
  • Confidence interval for a difference in two means
  • Pairwise comparison for a difference in two means

Vegemite vs Peanut Butter:
$30.75 - 34.0 = -3.25$ with $95\%$ CIs ($-14.75, 7.50$)

Vegemite vs Ham & Pickles:
$30.75 - 49.25 = -18.50$ with $95\%$ CIs ($-27.50, -9.375$)

Peanut Butter vs Ham & Pickles:
$34.0 - 49.25 = -15.25$ with $95\%$ CIs ($-26.75, -3.00$)

Ham & Pickles is a big hit at every party in the ant world!

Theory approach: the $F$-distribution

The distribution of the $F$-statistic under the assumption that the null hypothesis is true can be described theoretically by a particular distribution, called the $F$-distribution, which depends on the degrees of freedom for the variability between $(k-1)$ and within $(n-k)$ groups.

A p-value is obtained by computing the area under the curve for $F\geq F_{original}$.

(Here again, the p-value obtained is pretty close to what we calculated by resampling, and the conclusions of our study are the same.)
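With $k=3$ groups and $n=24$ observations, the degrees of freedom are $(2, 21)$; a one-line check with scipy's $F$-distribution (assuming scipy is available):

```python
from scipy import stats

# Area under the F(k-1, n-k) = F(2, 21) curve to the right of F_original:
p_theory = stats.f.sf(5.63, dfn=2, dfd=21)
print(round(p_theory, 3))  # about 0.011, close to the simulated 1.3%
```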

Comparing multiple means

If you use the MAD, the idea behind comparing multiple ($>2$) means is the same as when you compared only two means.

MAD-like statistics can be designed to fit your particular dataset or study (if you want to analyze the average relative risk, for example, this is possible too).

The MAD can only be used with a resampling approach (there is no theory describing the distribution of the MAD or MAD-like statistics).

The $F$-statistic compares the variability between groups (sample means) to the variability within groups (variance within the groups). The bigger the variability between groups compared to the variability within groups, the bigger the $F$-statistic.