Stats 13

Lecture 7

Comparing two means

Guillaume Calmettes

Last time

How to compare 2 proportions

Playing for a sell out crowd

Sports teams prefer to play in front of their own fans rather than at the opposing team's site. Having a sell out crowd should provide even more excitement and lead to an even better performance, right?

Well, consider the Oklahoma City Thunder (NBA team) in its second season (2008-2009) after moving from Seattle. This team had a win-loss record that was actually worse for home games with a sell out crowd (3 wins and 15 losses) than for home games without a sell out crowd (12 wins and 11 losses).

                 Sell out crowd?
Game outcome?    Yes   No   Total
Win                3   12      15
Loss              15   11      26
Total             18   23      41

Playing for a sell out crowd

Sports teams prefer to play in front of their own fans rather than at the opposing team's site. Having a sell out crowd should provide even more excitement and lead to an even better performance, right?

Well, consider the Oklahoma City Thunder (NBA team) in its second season (2008-2009) after moving from Seattle. This team had a win-loss record that was actually worse for home games with a sell out crowd (3 wins and 15 losses) than for home games without a sell out crowd (12 wins and 11 losses).

                 Sell out crowd?
Game outcome?     Yes     No
Win              0.17   0.52
Loss             0.83   0.48
Total               1      1

$\textrm{Relative risk}=\frac{0.52}{0.17}=3.1$
(3.1 times more likely to win if it is not a sell out crowd game)

Simulating the null

Null hypothesis:
There is no association between playing in front of a sell out crowd and winning/losing a game.

If there were no association between the variables, what would be the probability of observing an association as strong as the one we observed ($RR=3.1$) between whether a game is a sell out and winning or losing?

To evaluate the statistical significance of the observed association in our groups, we will investigate how large the $RR$ in conditional proportions tends to be just from the random assignment of outcomes (win or loss) to the explanatory variable groups (sell out crowd or not).
=> we will do a simulation in which the chances of winning a game are the same whether it is a sell out or not.

Simulating the null

  1. Pool all the 41 wins/losses together
  2. Shuffle all 41 games and randomly redistribute into two stacks
    • One with 18 games (representing the sell out crowd games)
    • Another 23 games (representing the games without sell out crowd)
  3. Calculate the relative proportion and relative risk for the simulation
  4. Repeat 10000 times (we obtain 10000 RR that we could have observed if the presence or not of a sell out crowd did not matter)
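The shuffling procedure above can be sketched in a few lines of Python (a sketch assuming NumPy is available; the counts come from the table: 15 wins and 26 losses pooled, then redistributed into stacks of 18 sell-out and 23 non-sell-out games):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: pool all 41 outcomes together: 15 wins (1) and 26 losses (0)
outcomes = np.array([1] * 15 + [0] * 26)

n_sims = 10_000
rr_sims = np.empty(n_sims)
for i in range(n_sims):
    shuffled = rng.permutation(outcomes)                 # step 2: shuffle
    sellout, non_sellout = shuffled[:18], shuffled[18:]  # stacks of 18 and 23
    p_yes, p_no = sellout.mean(), non_sellout.mean()     # win proportions
    # step 3: RR of winning without a sell out crowd vs with one
    # (guard against division by zero in the rare all-loss shuffle)
    rr_sims[i] = p_no / p_yes if p_yes > 0 else np.inf

# Two-sided p-value: shuffles at least as extreme as the observed RR of 3.1
p_value = np.mean((rr_sims >= 3.1) | (rr_sims <= 1 / 3.1))
```

With 10000 shuffles, `p_value` lands in the neighborhood of the 2.6% reported on the next slide (the exact value varies with the random seed).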

Simulating the null

What would be the probability of observing a RR of 3.1 if there was no association between sell out crowd and winning or losing?

=> The probability is very low (only 2.6% of the simulations produced an RR at least as extreme, counting both directions).

10000 simulations

We have strong evidence supporting the idea that the Oklahoma City Thunder are less likely to win when they play in front of a sell out crowd.

95% confidence intervals

What is the true value of how much more likely the Oklahoma City Thunder are to win when there is not a sell out crowd compared to when they play in a packed stadium?

1- Draw a bootstrap sample from the sell out crowd games
2- Draw a bootstrap sample from the non sell out crowd games
3- Compute your statistics (relative risk of the conditional proportions) and store this value.

4- Repeat this process (steps 1-3) 10000 times

5- Determine the $\frac{\alpha}{2}$ and $(1-\frac{\alpha}{2})$ percentiles in your stored result array
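The five bootstrap steps above can be sketched in Python (assuming NumPy; the rare resample with zero sell-out wins produces an infinite RR and is dropped before taking percentiles):

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed samples: 1 = win, 0 = loss
sellout = np.array([1] * 3 + [0] * 15)       # 18 sell out crowd games
non_sellout = np.array([1] * 12 + [0] * 11)  # 23 non sell out crowd games

n_sims = 10_000
rr_boot = np.empty(n_sims)
for i in range(n_sims):
    s = rng.choice(sellout, size=sellout.size, replace=True)           # step 1
    ns = rng.choice(non_sellout, size=non_sellout.size, replace=True)  # step 2
    # step 3: RR of winning without a sell out crowd vs with one
    rr_boot[i] = ns.mean() / s.mean() if s.mean() > 0 else np.inf

# step 5: alpha/2 and 1 - alpha/2 percentiles (alpha = 0.05)
lower, upper = np.percentile(rr_boot[np.isfinite(rr_boot)], [2.5, 97.5])
```

The resulting `(lower, upper)` interval should be close to the $[1.3, 11]$ reported on the next slide.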

95% confidence intervals

What is the true value of how much more likely the Oklahoma City Thunder are to win when there is not a sell out crowd compared to when they play in a packed stadium?

=> The Oklahoma City Thunder are 1.3 to 11 times more likely to win when they do not play in front of a sell out crowd

Are we just going to accept this conclusion?
Why is this NBA team less likely to win when there is a sell out crowd?
What could explain this?

Possible explanations for home court disadvantage

There are three possible explanations for this odd finding that the team had a better winning percentage with smaller crowds:

  1. Random chance (but we pretty much ruled it out)

  2. The sell out crowd caused the Thunder to play worse, perhaps because of pressure or nervousness

  3. The sell out crowd did not cause a worse performance, and some other issue (variable) explains why they had a worse winning percentage with sell out crowds

=> In other words, for #3, a third variable is at play, which is related to both the crowd size and the game outcome. (confounding variable)

Possible explanations for home court disadvantage

Association does not mean causation

Two variables are associated (or related), if the value of one variable gives you information about the value of the other variable.
=> When comparing groups, this means that the proportions or means take on different values in the different groups.

A confounding variable is a variable that is related both to the explanatory and to the response variable in such a way that its effects on the response variable cannot be separated from the effects of the explanatory variable.

Always think critically when you observe a cause-and-effect conclusion.

Avoiding confounding variables

In an observational study, the groups you compare are "just there", that is they are defined by what you see rather than by what you do.

In an experiment, you actively create the groups by what you choose to do. More formally, you assign the conditions to be compared. These conditions may be one or more treatments or a control (a group you do nothing to).

In a randomized experiment, you use a chance device to make the assignments. The role of the random assignment is to balance out potentially confounding variables among the explanatory variable groups, giving us the potential to draw cause-and-effect conclusions. (e.g.: you flip a coin before a game to assign whether it will be a sell-out or not)

In a double-blind study, neither the subjects nor those evaluating the response variable know which treatment group the subject is in. (ex: the teams and the statistician do not know if this is a sell out crowd game or not)

Avoiding confounding variables

The different possible study designs

Comparing two means

Bicycle weight and commuting time

The UK government’s Cycle to Work scheme allows an employee to purchase a bicycle (up to a cost of $1560) at a significant discount by using tax incentives, provided the bicycle is used for commuting to and from work. The initiative aims to “promote healthier journeys to work and reduce environmental pollution.”

A British researcher, Jeremy Groves, decided to take advantage of this opportunity to buy a new bike (and publish a statistical study about it).

"Bicycle weight and commuting time: randomised trial" - BMJ, (2010) 341:c6801

Bicycle weight and commuting time

I purchased a bike at the top end of the cost allowed by the scheme and opted for a carbon frame because it was significantly lighter than my existing bicycle’s steel frame. The wheels were lighter and tyres narrower too. All were factors that made me believe that the extra £950 I had spent would get me to work in a trice.

One sunny morning, I got to work in 43 minutes, the fastest I could recall. My steel bike was consigned to a corner of the garage to gather dust—until I had a puncture. The next day I was back on my old steel bike. I fitted the cycle computer, set off . . . and discovered I had got to work in 44 minutes. “Hang on,” I thought, “was that minute worth £950 or was it a fluke?” There was only one answer: a randomised trial.

Bicycle weight and commuting time

Randomized experiment conducted by Jeremy Groves

Research question:
Groves wanted to know if bicycle weight affected his commute to work.

Experimental design:
For 56 days (January to July) Groves tossed a coin to decide if he would bike the 27 miles to work on his carbon frame bike (20.9lbs) or steel frame bicycle (29.75lbs).
He recorded the commute time for each trip.

What are the observational units? What are the variables? What type/kind of variables are those?

Bicycle weight and commuting time

Null hypothesis:
There is no association between which bike is used and commute time. Commute time is not affected by which bike is used.

Alternative hypothesis:
There is an association between which bike is used and commute time. Commute time is affected by which bike is used.

The parameters of interest are:
$\mu_{\mathrm{carbon}}$ = Long term average commute time with carbon frame bike
$\mu_{\mathrm{steel}}$ = Long term average commute time with steel frame bike

H$_0$: $\mu_{\mathrm{carbon}} = \mu_{\mathrm{steel}}$
($\mu_{\mathrm{carbon}} - \mu_{\mathrm{steel}} = 0$)

H$_a$: $\mu_{\mathrm{carbon}}\neq\mu_{\mathrm{steel}}$
($\mu_{\mathrm{carbon}} - \mu_{\mathrm{steel}} \neq 0$)

Bicycle weight and commuting time

H$_0$: $\mu_{\mathrm{carbon}} = \mu_{\mathrm{steel}}$
($\mu_{\mathrm{carbon}} - \mu_{\mathrm{steel}} = 0$)

H$_a$: $\mu_{\mathrm{carbon}}\neq\mu_{\mathrm{steel}}$
($\mu_{\mathrm{carbon}} - \mu_{\mathrm{steel}} \neq 0$)

Remember:
The hypotheses are about the association between commute time and bike used, not just the 56 trips of Groves.
Hypotheses are always about populations or processes, not the sample data.

Bicycle weight and commuting time

                     Frame type
                  Carbon    Steel
Sample size           26       30
Sample mean (min) 108.34   107.81
Sample SD (min)     6.25     4.89

The sample average and variability for commute time were higher for the carbon frame bike. Does this indicate a tendency? Or could a higher average just come from the random assignment?

Perhaps the carbon frame bike was randomly assigned to days where traffic was heavier or weather slowed down Dr. Groves on his way to work?

Bicycle weight and commuting time

                     Frame type
                  Carbon    Steel
Sample size           26       30
Sample mean (min) 108.34   107.81
Sample SD (min)     6.25     4.89

Is it possible to get a difference of 0.53 minutes if commute time isn’t affected by the bike used?
Yes it’s possible, how likely though?
=> Is the observed difference in commute time arising solely from the randomness in the assignment of the frame type used each day?

Simulating the bike commute null distribution

Statistic of interest?
The observed difference in average commute time
$d=\bar{x}_{\mathrm{carbon}} - \bar{x}_{\mathrm{steel}}$

We can simulate this study with index cards.

  1. Write all 56 times on 56 cards
  2. Shuffle all 56 cards and randomly redistribute into two stacks
    • One with 26 cards (representing the times for the carbon-frame bike)
    • Another 30 cards (representing the times for the steel-frame bike)
  3. Calculate the difference in the average time between the two stacks of cards

Shuffling assumes the null hypothesis of no association between commute time and bike.
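The index-card procedure translates directly to code. The raw 56 commute times are not reproduced in these slides, so the sketch below (assuming NumPy) generates synthetic stand-in times from the reported summary statistics; with Groves's actual recorded times you would simply replace the two arrays:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data: the slides only give summary statistics
# (carbon: n=26, mean 108.34, SD 6.25; steel: n=30, mean 107.81, SD 4.89)
carbon = rng.normal(108.34, 6.25, size=26)
steel = rng.normal(107.81, 4.89, size=30)

observed = carbon.mean() - steel.mean()
pooled = np.concatenate([carbon, steel])   # step 1: pool all 56 times

n_sims = 10_000
diffs = np.empty(n_sims)
for i in range(n_sims):
    shuffled = rng.permutation(pooled)                      # step 2: shuffle
    diffs[i] = shuffled[:26].mean() - shuffled[26:].mean()  # step 3: 26 vs 30

# Two-sided p-value: shuffles at least as extreme as the observed difference
p_value = np.mean(np.abs(diffs) >= abs(observed))
```

Shuffling the pooled times is exactly the null hypothesis in action: any time could have come from either bike.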

Simulating the bike commute null distribution (difference in means)

Shuffling procedure

Simulating the random assignment of bike frame (difference in means)

Statistic calculated for each simulation:
difference in mean ($\bar{x}_{\mathrm{carbon}}-\bar{x}_{\mathrm{steel}}$)

Simulating the random assignment of bike frame (difference in means)

How many simulations gave a result as extreme as the initial statistic?
($\bar{x}_{\mathrm{carbon}}-\bar{x}_{\mathrm{steel}}$)

Note: "as extreme as" in both directions (2 tailed test)
Upper limit:
$\bar{x}_{\mathrm{carbon}}-\bar{x}_{\mathrm{steel}}\geq0.53$
Lower limit:
$\bar{x}_{\mathrm{carbon}}-\bar{x}_{\mathrm{steel}}\leq-0.53$

Statistic calculated for each simulation:
difference in mean ($\bar{x}_{\mathrm{carbon}}-\bar{x}_{\mathrm{steel}}$)
10000 simulations

What does this p-value mean?

If the mean commute times for the bikes were the same in the long run, and we were repeating random assignment of the lighter bike (carbon) to 26 days and the heavier (steel) to 30 days, a difference at least as extreme as 0.53 minutes would occur in about 72% of the repetitions.
=> Therefore, we don’t have evidence that the commute times for the two bikes will differ in the long run.

Have we proven that the bike Groves chooses is not associated with commute time? (Can we conclude the null?)
No! A large p-value is not “strong evidence that the null hypothesis is true.” It suggests that the null hypothesis is plausible.
=> There could be a long-term difference, but one that is very small.

Scope of conclusions of the study

Can we generalize our conclusion to a larger population?

Two Key questions:

Was the sample randomly obtained from a larger population?
No, Groves commuted on consecutive days which didn’t include all seasons.
=> We cannot generalize to all the commute trips

Were the observational units randomly assigned to treatments?
Yes, he flipped a coin to choose which bike to ride each day
=> We can draw cause-and-effect conclusions.

We can’t generalize beyond Groves and his two bikes. A limitation is that this study is not double-blind. The researcher and the subject (which happened to be the same person) were not blind to which treatment (bike) was being used. (Perhaps Groves likes his old bike and wanted to show it was just as good as the new carbon-frame bike for commuting to work.)

Simulating the random assignment of bike frame (difference in medians)

Statistic of interest?
The observed difference in median commute time
$d=m_{\mathrm{carbon}} - m_{\mathrm{steel}}$

We can simulate this study with index cards.

  1. Write all 56 times on 56 cards
  2. Shuffle all 56 cards and randomly redistribute into two stacks
    • One with 26 cards (representing the times for the carbon-frame bike)
    • Another 30 cards (representing the times for the steel-frame bike)
  3. Calculate the difference in the median time between the two stacks of cards

Shuffling assumes the null hypothesis of no association between commute time and bike.

Simulating the random assignment of bike frame (difference in medians)

Statistic calculated for each simulation:
difference in median ($\mathrm{m}_{\mathrm{carbon}}-\mathrm{m}_{\mathrm{steel}}$)

Simulating the random assignment of bike frame (difference in medians)

How many simulations gave a result as extreme as the initial statistic?
($\mathrm{m}_{\mathrm{carbon}}-\mathrm{m}_{\mathrm{steel}}$)

Note: "as extreme as" in both directions (2 tailed test)
Upper limit:
$\mathrm{m}_{\mathrm{carbon}}-\mathrm{m}_{\mathrm{steel}}\geq0.17$
Lower limit:
$\mathrm{m}_{\mathrm{carbon}}-\mathrm{m}_{\mathrm{steel}}\leq-0.17$

Statistic calculated for each simulation:
difference in median ($\mathrm{m}_{\mathrm{carbon}}-\mathrm{m}_{\mathrm{steel}}$)
10000 simulations

But isn't getting an estimate of the true value of the parameter what's really important?

I couldn't agree more!!!
You're right, let's compute the $95\%$ confidence interval of the difference in medians between the times obtained with the carbon frame bike and the times obtained with the steel frame bike

Resampling confidence intervals

Resampling confidence intervals

  1. Draw a bootstrap sample from sample 1
  2. Draw a bootstrap sample from sample 2
  3. Compute your statistics (ex: difference in median times) and store this value
  4. Repeat this process (steps 1-3) 10000 times
  5. Determine the $\frac{\alpha}{2}$ and $(1-\frac{\alpha}{2})$ percentiles in your stored result array
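These five steps, applied to the median, can be sketched as follows (assuming NumPy; as before, the arrays are synthetic stand-ins generated from the reported summary statistics, since the raw times are not in the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data matching the reported summary statistics;
# with the actual 56 recorded times you would use those arrays instead
carbon = rng.normal(108.34, 6.25, size=26)
steel = rng.normal(107.81, 4.89, size=30)

n_sims = 10_000
diffs = np.empty(n_sims)
for i in range(n_sims):
    b_carbon = rng.choice(carbon, size=carbon.size, replace=True)  # step 1
    b_steel = rng.choice(steel, size=steel.size, replace=True)     # step 2
    diffs[i] = np.median(b_carbon) - np.median(b_steel)            # step 3

# step 5: alpha/2 and 1 - alpha/2 percentiles (alpha = 0.05)
lower, upper = np.percentile(diffs, [2.5, 97.5])
```

Note that only step 3 changed compared to the difference-in-means version: the resampling machinery is identical for any statistic.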

Resampling confidence intervals

The true value of the difference in median times for Groves to ride the 27 mi using a carbon frame bike versus a steel frame bike is in the interval $[-4.82, 4.68]$ min.

=> Zero (the "no difference in median times" value) is in this interval. So we have no evidence against the possibility that Groves would have obtained these results even if there were no difference in the times a trip takes on either bike.

Resampling for comparing 2 groups

NHST

  1. Pool the data into a single population
  2. Shuffle the population and separate it into 2 groups with the same sizes as the original 2 samples
  3. Calculate the statistic of interest ($\bar{x}_1-\bar{x}_2$, $m_1-m_2$, ...) and store this statistic
  4. Repeat (#2-#3) 10000 times, you obtain 10000 statistics of interest defining a distribution centered around your null hypothesis ($0$ if null is "no difference")
  5. Count how many simulations resulted in a statistic of interest at least as extreme as (in one or both directions) the original observed statistic
  6. Divide this count by the number of simulations (10000), this is your (one-sided or two-sided) p-value
  7. Is this p-value below your $\alpha$ threshold?
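The NHST recipe above can be wrapped in one small reusable function (a sketch assuming NumPy; the name `permutation_pvalue` is just an illustration, and the `statistic` argument can be any function of two arrays):

```python
import numpy as np

def permutation_pvalue(sample1, sample2, statistic, n_sims=10_000, seed=0):
    """Two-sided permutation p-value for any two-sample statistic."""
    rng = np.random.default_rng(seed)
    sample1, sample2 = np.asarray(sample1), np.asarray(sample2)
    observed = statistic(sample1, sample2)
    pooled = np.concatenate([sample1, sample2])  # step 1: pool the data
    n1 = sample1.size
    sims = np.empty(n_sims)
    for i in range(n_sims):
        shuffled = rng.permutation(pooled)       # step 2: shuffle and split
        sims[i] = statistic(shuffled[:n1], shuffled[n1:])  # step 3
    # steps 5-6: fraction of shuffles at least as extreme (both directions)
    return np.mean(np.abs(sims) >= abs(observed))
```

For instance, `permutation_pvalue(times1, times2, lambda a, b: np.median(a) - np.median(b))` runs the difference-in-medians version of the test.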

$(1-\alpha)$ confidence intervals

  1. Keep the samples separated, these are the 2 populations you're going to draw bootstrap samples from
  2. Draw a bootstrap sample (with replacement) from each population
  3. Calculate the statistic of interest ($\bar{x}_1-\bar{x}_2$, $m_1-m_2$, ...) and store this statistic
  4. Repeat (#2-#3) 10000 times, you obtain 10000 statistics of interest defining a distribution centered around your initial observed statistic
  5. Determine the values of the $\frac{\alpha}{2}$ and $(1-\frac{\alpha}{2})$ percentiles in the array of the 10000 statistics of interest you obtained
  6. These values are the lower and upper limits of your $(1-\alpha)$ confidence intervals
  7. Is your null hypothesis value inside this interval?
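The confidence-interval recipe admits the same kind of generic helper (again a sketch assuming NumPy, with `bootstrap_ci` a hypothetical name):

```python
import numpy as np

def bootstrap_ci(sample1, sample2, statistic, alpha=0.05, n_sims=10_000, seed=0):
    """Percentile bootstrap (1 - alpha) CI for any two-sample statistic."""
    rng = np.random.default_rng(seed)
    sample1, sample2 = np.asarray(sample1), np.asarray(sample2)
    sims = np.empty(n_sims)
    for i in range(n_sims):
        b1 = rng.choice(sample1, size=sample1.size, replace=True)  # step 2
        b2 = rng.choice(sample2, size=sample2.size, replace=True)
        sims[i] = statistic(b1, b2)                                # step 3
    # step 5: alpha/2 and 1 - alpha/2 percentiles of the bootstrap statistics
    return np.percentile(sims, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```

The key contrast with the NHST helper: here the two samples stay separate and are resampled with replacement, so the distribution is centered on the observed statistic rather than on the null value.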

Theory: Two-sample $t$-test

When comparing means between two groups, the null hypothesis is typically H$_0$ ∶ $\mu_1=\mu_2$ or, equivalently, H$_0$ ∶ $\mu_1-\mu_2=0$.
Thus the “Null parameter” is usually equal to zero and we use the difference in means for two samples, $\bar{x}_1 − \bar{x}_2$, as the “Sample statistic”.

If the underlying populations are reasonably normal or the sample sizes are large, we can estimate the standard error (standard deviation of the sampling distribution) of $\bar{x}_1 − \bar{x}_2$ with $SE = \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$, where $s_1$ and $s_2$ are the standard deviations in the two samples.

Note: When we use the sample standard deviations in estimating SE, we need to switch to a $t$-distribution rather than the standard normal when finding a p-value.

Theory: Two-sample $t$-test

For a two-sample $t$-test, you are making the (strong) assumption that your samples come from normally distributed populations. You can then compute the ($t$) distribution of the difference in means of 2 samples drawn from these underlying distributions.

Theory: Two-sample $t$-test

To test H$_0$ ∶ $\mu_1=\mu_2$ vs H$_a$ ∶ $\mu_1\neq\mu_2$ (or a one-tail alternative $\mu_1>\mu_2$ or $\mu_1<\mu_2$) based on samples of sizes n$_1$ and n$_2$ from the two groups, we use the two-sample $t$-statistic.

$t=\frac{\textrm{Statistic}-\textrm{Null value}}{SE}$

$t=\frac{(\bar{x}_1-\bar{x}_2)-0}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}$
where $\bar{x}_1$ and $\bar{x}_2$ are the means and $s_1$ and $s_2$ are the standard deviations for the respective samples.

$t=\frac{(108.34-107.81)-0}{\sqrt{\frac{6.25^2}{26}+\frac{4.89^2}{30}}}=0.35$
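The arithmetic above is easy to check by hand with the standard library alone:

```python
import math

n1, m1, s1 = 26, 108.34, 6.25   # carbon frame: size, mean, SD
n2, m2, s2 = 30, 107.81, 4.89   # steel frame: size, mean, SD

se = math.sqrt(s1**2 / n1 + s2**2 / n2)   # standard error of the difference
t = ((m1 - m2) - 0) / se                  # null value is 0
# t comes out to about 0.35, matching the slide
```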

Theory: Two-sample $t$-test

To test H$_0$ ∶ $\mu_1=\mu_2$ vs H$_a$ ∶ $\mu_1\neq\mu_2$ (or a one-tail alternative $\mu_1>\mu_2$ or $\mu_1<\mu_2$) based on samples of sizes n$_1$ and n$_2$ from the two groups, we use the two-sample $t$-statistic.

$t=\frac{(108.34-107.81)-0}{\sqrt{\frac{6.25^2}{26}+\frac{4.89^2}{30}}}=0.35$

The initial observation is $0.35$ SD away from the center of the (null) distribution.
Using this $t$-statistic & the degrees of freedom of the $t$-distribution, we can obtain a p-value by referring to a $t$-statistic table.

Theory: confidence intervals

If the underlying populations are reasonably normal or the sample sizes are large, we can estimate the standard error (standard deviation of the sampling distribution) of $\bar{x}_1 − \bar{x}_2$ with $SE = \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$, where $s_1$ and $s_2$ are the standard deviations in the two samples.

If the distribution for a statistic follows the shape of a $t$ distribution with standard error SE, we find a confidence interval for the parameter using:

$\textrm{Sample Statistic} \pm t^*\times SE$

where $t^*$ is chosen so that the proportion between −$t^*$ and +$t^*$ in the $t$-distribution is the desired level of confidence.

If the sample size in each group is large:

  • $t^*\simeq2$ ($1.96$) for the $95\%$ confidence intervals
  • $t^*\simeq1.645$ for the $90\%$ confidence intervals
  • $t^*\simeq2.576$ for the $99\%$ confidence intervals

Theory: confidence intervals

If the underlying populations are reasonably normal or the sample sizes are large, we can estimate the standard error (standard deviation of the sampling distribution) of $\bar{x}_1 − \bar{x}_2$ with $SE = \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$, where $s_1$ and $s_2$ are the standard deviations in the two samples.

If the distribution for a statistic follows the shape of a $t$ distribution with standard error SE, we find a confidence interval for the parameter using:

$\textrm{Sample Statistic} \pm t^*\times SE$

$SE=\sqrt{\frac{6.25^2}{26}+\frac{4.89^2}{30}}=1.51$

Mean & $95\%$ confidence intervals:
$0.53\pm2\times1.51$
$0.53\pm3.02$
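The same interval, checked in code (standard library only, using the large-sample $t^*\simeq2$ multiplier from the previous slide):

```python
import math

se = math.sqrt(6.25**2 / 26 + 4.89**2 / 30)    # about 1.51
diff = 108.34 - 107.81                         # 0.53 min
t_star = 2                                     # large-sample 95% multiplier
lower, upper = diff - t_star * se, diff + t_star * se
# the interval spans zero, consistent with the large p-value found earlier
```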

Theory: What about the median?

Can't we just get an equation to model the distribution of the sampling distribution of the difference in medians?

NO!!! Remember, the median is a procedure, not the result of a formula, so there is no theorem or formula that can predict what the sampling distribution of the difference in medians would look like. There is no two-sample $t$-test for the median.

Using resampling statistics, you are not limited to studying the mean, for which formulas and theorems are available.
Using resampling statistics, you can study any statistic that is relevant to your study, even a very exotic one, since you can always simulate what the sampling distribution of that particular statistic would look like.