Guillaume Calmettes
How to compare 2 proportions
Sports teams prefer to play in front of their own fans rather than at the opposing team's site. Having a sell out crowd should provide even more excitement and lead to an even better performance, right?
Well, consider the Oklahoma City Thunder (NBA team) in its second season (2008-2009) after moving from Seattle. This team had a win-loss record that was actually worse for home games with a sell out crowd (3 wins and 15 losses) than for home games without a sell out crowd (12 wins and 11 losses).
| Game outcome | Sell out crowd: Yes | Sell out crowd: No | Total |
|---|---|---|---|
| Win | 3 | 12 | 15 |
| Loss | 15 | 11 | 26 |
| Total | 18 | 23 | 41 |
The same data expressed as conditional proportions of game outcome within each crowd group:
| Game outcome | Sell out crowd: Yes | Sell out crowd: No |
|---|---|---|
| Win | 0.17 | 0.52 |
| Loss | 0.83 | 0.48 |
| Total | 1 | 1 |
$\textrm{Relative risk}=\frac{0.52}{0.17}=3.1$
(3.1 times more likely to win if it is not a sell out crowd game)
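As a quick check, here is a minimal Python sketch (variable names are ours, not part of the original slides) that reproduces the conditional proportions and the relative risk from the counts in the table:

```python
# Counts from the Thunder's 2008-2009 home games (table above)
sellout = {"win": 3, "loss": 15}       # sell out crowd games
no_sellout = {"win": 12, "loss": 11}   # non sell out crowd games

# Conditional proportions of winning within each group
p_win_sellout = sellout["win"] / (sellout["win"] + sellout["loss"])              # 3/18  ~ 0.17
p_win_no_sellout = no_sellout["win"] / (no_sellout["win"] + no_sellout["loss"])  # 12/23 ~ 0.52

relative_risk = p_win_no_sellout / p_win_sellout
print(round(relative_risk, 1))  # 3.1
```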
Null hypothesis:
There is no association between playing in front of
a sell out crowd and winning/losing a game.
If there were no association between the variables, what would be the probability of observing an association as strong as the one we observed ($RR=3.1$) between whether a game is a sell out and whether it is won or lost?
To evaluate the statistical significance of the observed
association in our groups, we
will investigate how large the $RR$ in conditional
proportions tends to be just from the random assignment of
outcomes (win or loss) to the explanatory variable groups
(sell out crowd or not).
=> we will do a simulation in which the chance of winning a game is the same
whether it is a sell out game or not.
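Below is a minimal Python sketch of such a simulation (not the course's own tool). The two-sided counting rule used at the end ($RR\geq3.1$ or $RR\leq1/3.1$) is one reasonable convention for "as extreme in both directions"; the 2.6% quoted below comes from the original simulation, so a re-run will give a slightly different value.

```python
import numpy as np

rng = np.random.default_rng()

# The 41 home games: 15 wins and 26 losses in total (1 = win, 0 = loss)
outcomes = np.array([1] * 15 + [0] * 26)
n_sellout = 18                      # number of sell out crowd games
observed_rr = (12 / 23) / (3 / 18)  # ~ 3.1

n_sims = 10_000
simulated_rr = np.empty(n_sims)
for i in range(n_sims):
    shuffled = rng.permutation(outcomes)   # random re-assignment of outcomes to games
    p_sell = shuffled[:n_sellout].mean()   # proportion of wins among "sell out" games
    p_no = shuffled[n_sellout:].mean()     # proportion of wins among "non sell out" games
    # a shuffle with no wins in the sell out group is treated as infinitely extreme
    simulated_rr[i] = p_no / p_sell if p_sell > 0 else np.inf

p_value = np.mean((simulated_rr >= observed_rr) | (simulated_rr <= 1 / observed_rr))
print(p_value)
```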
What would be the probability of observing a RR of 3.1 if there was no association between sell out crowd and winning or losing?
=> The probability is very low: only 2.6% of the 10000 simulations had a RR at least as extreme, in either direction.
We have strong evidence supporting the idea that it is less likely for the Oklahoma City Thunder to win when they play in front of a sell out crowd.
What is the true value of how much more likely the Oklahoma City Thunder are to win when there is not a sell out crowd, compared to when they play in a packed stadium?
1- Draw a bootstrap sample from the sell out crowd games
2- Draw a bootstrap sample from the non sell out crowd games
3- Compute your statistic (the relative risk of the conditional proportions) and store this value.
4- Repeat this process (steps 1-3) 10000 times
5- Determine the $\frac{\alpha}{2}$ and $(1-\frac{\alpha}{2})$ percentiles in your stored result array
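A possible Python sketch of these five steps (again, not the course's own tool; bootstrap resamples in which the sell out group happens to contain no wins are simply dropped here, which may differ from how the class software handles them):

```python
import numpy as np

rng = np.random.default_rng()

# Observed home games (1 = win, 0 = loss)
sellout = np.array([1] * 3 + [0] * 15)       # sell out crowd games
no_sellout = np.array([1] * 12 + [0] * 11)   # non sell out crowd games

n_boot = 10_000
boot_rr = np.full(n_boot, np.nan)
for i in range(n_boot):
    s = rng.choice(sellout, size=sellout.size, replace=True)          # step 1
    ns = rng.choice(no_sellout, size=no_sellout.size, replace=True)   # step 2
    if s.mean() > 0:                                                  # step 3
        boot_rr[i] = ns.mean() / s.mean()

alpha = 0.05                                                          # steps 4-5
ci = np.nanpercentile(boot_rr, [100 * alpha / 2, 100 * (1 - alpha / 2)])
print(ci)   # the slides report roughly [1.3, 11]
```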
What is the true value of how much more likely the Oklahoma City Thunder are to win when there is not a sell out crowd, compared to when they play in a packed stadium?
=> The Oklahoma City Thunder are 1.3 to 11 times more likely to win when they do not play in front of a sell out crowd
Are we just going to accept this conclusion?
Why is this NBA team less likely to win when there is a sell out crowd?
What could explain this?
There are three possible explanations for this odd finding that the team had a better winning percentage with smaller crowds:
1. Random chance (but we pretty much ruled it out)
2. The sell out crowd caused the Thunder to play worse, perhaps because of pressure or nervousness
3. The sell out crowd did not cause a worse performance, and some other issue (variable) explains why they had a worse winning percentage with sell out crowds
=> In other words, for #3, a third variable is at play, which is related to both the crowd size and the game outcome. (confounding variable)
Two variables are associated
(or related), if
the value of one variable gives you
information about the value of the other variable.
=> When comparing groups, this means that the proportions
or means take on different values in the different groups.
A confounding variable is a variable that is related both to the explanatory and to the response variable in such a way that its effects on the response variable cannot be separated from the effects of the explanatory variable.
Always think critically when you observe a cause-and-effect conclusion.
In an observational study, the groups you compare are "just there", that is they are defined by what you see rather than by what you do.
In an experiment, you actively create the groups by what you choose to do. More formally, you assign the conditions to be compared. These conditions may be one or more treatments or a control (a group you do nothing to).
In a randomized experiment, you use a chance device to make the assignments. The role of the random assignment is to balance out potentially confounding variables among the explanatory variable groups, giving us the potential to draw cause-and-effect conclusions. (e.g.: you flip a coin before a game to assign whether it will be a sell out or not)
In a double-blind study, neither the subjects nor those evaluating the response variable know which treatment group the subject is in. (ex: the teams and the statistician do not know if this is a sell out crowd game or not)
The different possible study designs
The UK government’s Cycle to Work scheme allows an employee to purchase a bicycle (up to a cost of $1560) at a significant discount by using tax incentives, provided the bicycle is used for commuting to and from work. The initiative aims to “promote healthier journeys to work and reduce environmental pollution.”
A British researcher, Jeremy Groves, decided to take advantage of this opportunity to buy a new bike (and publish a statistical study about it).
"Bicycle weight and commuting time: randomised trial" - BMJ, (2010) 341:c6801
I purchased a bike at the top end of the cost allowed by the scheme and opted for a carbon frame because it was significantly lighter than my existing bicycle’s steel frame. The wheels were lighter and tyres narrower too. All were factors that made me believe that the extra £950 I had spent would get me to work in a trice.
One sunny morning, I got to work in 43 minutes, the fastest I could recall. My steel bike was consigned to a corner of the garage to gather dust—until I had a puncture. The next day I was back on my old steel bike. I fitted the cycle computer, set off . . . and discovered I had got to work in 44 minutes. “Hang on,” I thought, “was that minute worth £950 or was it a fluke?” There was only one answer: a randomised trial.
Randomized experiment conducted by Jeremy Groves
Research question:
Groves wanted to know if bicycle weight affected his commute to work.
Experimental design:
For 56 days (January to July) Groves tossed a coin to decide if he would bike the 27 miles to work on his carbon frame bike (20.9 lbs) or his steel frame bicycle (29.75 lbs). He recorded the commute time for each trip.
What are the observational units? What are the variables? What type/kind of variables are those?
Null hypothesis:
There is no association between which bike is used and
commute time. Commute time is not affected by which
bike is used.
Alternative hypothesis:
There is an association between which bike is used and
commute time. Commute time is affected by which bike is used.
The parameters of interest are:
$\mu_{\mathrm{carbon}}$ = Long term average commute time with carbon frame bike
$\mu_{\mathrm{steel}}$ = Long term average commute time with steel frame bike
H$_0$: $\mu_{\mathrm{carbon}} = \mu_{\mathrm{steel}}$
($\mu_{\mathrm{carbon}} - \mu_{\mathrm{steel}} = 0$)
H$_a$: $\mu_{\mathrm{carbon}}\neq\mu_{\mathrm{steel}}$
($\mu_{\mathrm{carbon}} - \mu_{\mathrm{steel}} \neq 0$)
Remember:
The hypotheses are about the association between commute time
and bike used, not just the 56 trips of Groves.
Hypotheses are always about populations or processes, not the sample data.
| Frame type | Carbon | Steel |
|---|---|---|
| Sample size | 26 | 30 |
| Sample mean (min) | 108.34 | 107.81 |
| Sample SD (min) | 6.25 | 4.89 |
The sample average and variability for commute time were higher for the carbon frame bike. Does this indicate a tendency? Or could a higher average just come from the random assignment?
Perhaps the carbon frame bike was randomly assigned to days where traffic was heavier or weather slowed down Dr. Groves on his way to work?
Is it possible to get a difference of 0.53 minutes if commute time isn’t affected by the bike used?
Yes it’s possible, how likely though?
=> is the observed difference in commute time arising solely
from the randomness in the assignment of the frame type used each day?
Statistic of interest?
The observed difference in average commute time
$d=\bar{x}_{\mathrm{carbon}} - \bar{x}_{\mathrm{steel}}$
We can simulate this study with index cards.
Shuffling assumes the null hypothesis of no association between commute time and bike.
Shuffling procedure
Statistic calculated for each simulation:
difference in mean
($\bar{x}_{\mathrm{carbon}}-\bar{x}_{\mathrm{steel}}$)
How many simulations gave a result as extreme as the
initial statistic?
($\bar{x}_{\mathrm{carbon}}-\bar{x}_{\mathrm{steel}}$)
Note: "as extreme as" in both
directions (2 tailed test)
Upper limit:
$\bar{x}_{\mathrm{carbon}}-\bar{x}_{\mathrm{steel}}\geq0.53$
Lower limit:
$\bar{x}_{\mathrm{carbon}}-\bar{x}_{\mathrm{steel}}\leq-0.53$
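Here is one way this shuffling could be coded in Python. The commute-time arrays below are only placeholders drawn from normal distributions matching the summary table, since the raw data (published with the BMJ paper) are not reproduced here; with the real data the slides report a p-value of about 0.72.

```python
import numpy as np

rng = np.random.default_rng()

# Placeholder data only: normal draws matching the summary table above
carbon = rng.normal(108.34, 6.25, size=26)
steel = rng.normal(107.81, 4.89, size=30)

def permutation_p_value(group1, group2, stat=np.mean, n_sims=10_000):
    """Two-tailed permutation p-value for stat(group1) - stat(group2)."""
    observed = stat(group1) - stat(group2)
    pooled = np.concatenate([group1, group2])
    n1 = len(group1)
    diffs = np.empty(n_sims)
    for i in range(n_sims):
        shuffled = rng.permutation(pooled)            # re-deal the 56 commute times
        diffs[i] = stat(shuffled[:n1]) - stat(shuffled[n1:])
    # "as extreme as" the observed statistic, in both directions
    return np.mean(np.abs(diffs) >= abs(observed))

print(permutation_p_value(carbon, steel))
```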
Null distribution of the statistic ($\bar{x}_{\mathrm{carbon}}-\bar{x}_{\mathrm{steel}}$) calculated for each of the 10000 simulations.
If the mean commute times for the bikes were the same in the long
run, and we were repeating random assignment of the lighter bike
(carbon) to 26 days and the heavier (steel) to 30 days,
a difference at least as extreme as 0.53 minutes would occur in
about 72% of the repetitions.
=> Therefore, we don’t have evidence that the commute times for
the two bikes will differ in the long run.
Have we proven that the bike Groves chooses is not associated
with commute time? (Can we conclude the
null?)
No! A large p-value is not “strong
evidence that the null hypothesis is true.”
It suggests that the null hypothesis is plausible.
=> There could be a long-term difference just like the one we saw, but it
is just very small.
Can we generalize our conclusion to a larger population?
Two Key questions:
Was the sample randomly obtained from a larger population?
No, Groves commuted on consecutive days which didn’t include all seasons.
=> We cannot generalize to all the commute trips
Were the observational units randomly assigned to treatments?
Yes, he flipped a coin to choose which bike to ride each day
=> We can draw cause-and-effect conclusions.
We can’t generalize beyond Groves and his two bikes. A limitation is that this study is not double-blind. The researcher and the subject (who happened to be the same person) were not blind to which treatment (bike) was being used. (Perhaps Groves likes his old bike and wanted to show it was just as good as the new carbon frame bike for commuting to work.)
Statistic of interest?
The observed difference in median commute time
$d=m_{\mathrm{carbon}} - m_{\mathrm{steel}}$
We can simulate this study with index cards.
Shuffling assumes the null hypothesis of no association between commute time and bike.
Statistic calculated for each simulation:
difference in median
($\mathrm{m}_{\mathrm{carbon}}-\mathrm{m}_{\mathrm{steel}}$)
How many simulations gave a result as extreme as the
initial statistic?
($\mathrm{m}_{\mathrm{carbon}}-\mathrm{m}_{\mathrm{steel}}$)
Note: "as extreme as" in both
directions (2 tailed test)
Upper limit:
$\mathrm{m}_{\mathrm{carbon}}-\mathrm{m}_{\mathrm{steel}}\geq0.17$
Lower limit:
$\mathrm{m}_{\mathrm{carbon}}-\mathrm{m}_{\mathrm{steel}}\leq-0.17$
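Reusing the permutation sketch from the means section, switching to the median is a one-argument change (still with the placeholder data):

```python
# Same shuffling procedure, but comparing medians instead of means
print(permutation_p_value(carbon, steel, stat=np.median))
```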
Null distribution of the statistic ($\mathrm{m}_{\mathrm{carbon}}-\mathrm{m}_{\mathrm{steel}}$) calculated for each of the 10000 simulations.
Let's compute the $95\%$ confidence interval of the difference in medians between the times obtained with the carbon frame bike and the times obtained with the steel frame bike.
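A minimal bootstrap sketch for this interval, again with placeholder data matching the summary table rather than Groves' actual commute times (so the printed interval will not exactly match the one reported below):

```python
import numpy as np

rng = np.random.default_rng()

# Placeholder data only: normal draws matching the summary table
carbon = rng.normal(108.34, 6.25, size=26)
steel = rng.normal(107.81, 4.89, size=30)

n_boot = 10_000
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    c = rng.choice(carbon, size=carbon.size, replace=True)   # bootstrap sample, carbon days
    s = rng.choice(steel, size=steel.size, replace=True)     # bootstrap sample, steel days
    boot_diffs[i] = np.median(c) - np.median(s)

print(np.percentile(boot_diffs, [2.5, 97.5]))   # 95% CI for the difference in medians
```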
The true value of the difference in median times for Groves to ride the 27 mi using a carbon frame bike versus a steel frame bike is in the interval $[-4.82, 4.68]$ min.
=> Zero (the "no difference in median times" value) is in this interval. So we do not have evidence against the possibility that Groves could have obtained these results even if there were no difference in the time a trip takes on either bike.
NHST (significance level $\alpha$) & confidence intervals
When comparing means between two groups, the null hypothesis is typically H$_0$: $\mu_1=\mu_2$ or, equivalently, H$_0$: $\mu_1-\mu_2=0$.
Thus the “Null parameter” is usually equal to zero and we use the difference in means for two samples, $\bar{x}_1 - \bar{x}_2$, as the “Sample statistic”.
If the underlying populations are reasonably normal or the sample sizes are large, we can estimate the standard error (standard deviation of the sampling distribution) of $\bar{x}_1 - \bar{x}_2$ with $SE = \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$, where $s_1$ and $s_2$ are the standard deviations in the two samples.
Note: When we use the sample standard deviations to estimate SE, we need to switch to a $t$-distribution rather than the standard normal when finding a p-value.
For a two-sample $t$-test, you are making the (strong) assumption that your samples come from normally distributed populations. You can then derive the ($t$) distribution of the difference in means of two samples drawn from these underlying distributions.
To test H$_0$: $\mu_1=\mu_2$ vs H$_a$: $\mu_1\neq\mu_2$ (or a one-tail alternative $\mu_1>\mu_2$ or $\mu_1<\mu_2$) based on samples of sizes n$_1$ and n$_2$ from the two groups, we use the two-sample $t$-statistic.
$t=\frac{\textrm{Statistic}-\textrm{Null value}}{SE}$
$t=\frac{(\bar{x}_1-\bar{x}_2)-0}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}$
where $\bar{x}_1$ and $\bar{x}_2$ are the means and $s_1$
and $s_2$ are the standard deviations
for the respective samples.
$t=\frac{(108.34-107.81)-0}{\sqrt{\frac{6.25^2}{26}+\frac{4.89^2}{30}}}=0.35$
The observed difference is $0.35$ standard errors away from the center
of the (null) distribution.
Using this $t$-statistic & the degrees of freedom of the
$t$-distribution, we can obtain a p-value by referring to a $t$-statistic table.
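In Python, scipy can play the role of the $t$-statistic table. The degrees of freedom below use the Welch-Satterthwaite approximation, which is one common convention (the slides do not state which one was used):

```python
import numpy as np
from scipy import stats

# Summary statistics from the table above
x1, s1, n1 = 108.34, 6.25, 26   # carbon frame
x2, s2, n2 = 107.81, 4.89, 30   # steel frame

se = np.sqrt(s1**2 / n1 + s2**2 / n2)
t = (x1 - x2) / se                         # ~ 0.35

# Welch-Satterthwaite approximation of the degrees of freedom
df = se**4 / ((s1**2 / n1)**2 / (n1 - 1) + (s2**2 / n2)**2 / (n2 - 1))

p_value = 2 * stats.t.sf(abs(t), df)       # two-tailed p-value, ~ 0.73
print(round(t, 2), round(p_value, 2))
```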
If the distribution for a statistic follows the shape of a $t$ distribution with standard error SE, we find a confidence interval for the parameter using:
$\textrm{Sample Statistic} \pm t^*\times SE$
where $t^*$ is chosen so that the proportion between −$t^*$ and +$t^*$ in the $t$-distribution is the desired level of confidence.
If the sample size in each group is large, $t^*\approx 2$ for a $95\%$ confidence level:
$SE=\sqrt{\frac{6.25^2}{26}+\frac{4.89^2}{30}}=1.51$
Mean & $95\%$ confidence intervals:
$0.53\pm2\times1.51$
$0.53\pm3.02$
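The same interval can be computed with an exact $t^*$ from scipy instead of the rounded value of 2 (the degrees of freedom here are the Welch approximation from the sketch above, an assumption on our part):

```python
import numpy as np
from scipy import stats

se = np.sqrt(6.25**2 / 26 + 4.89**2 / 30)   # ~ 1.51
t_star = stats.t.ppf(0.975, df=47)          # ~ 2.01, close to the 2 used above
diff = 108.34 - 107.81                      # 0.53 min

print(diff - t_star * se, diff + t_star * se)   # roughly -2.5 to 3.6 minutes
```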
Can't we just get an equation to model the sampling distribution of the difference in medians?
NO!!! Remember, the median is obtained by a procedure (sorting the data and taking the middle value), not by a formula, so there is no theorem or formula that can predict what the sampling distribution of the difference in medians would look like. There is no two-sample $t$-test for the median.
Using resampling statistics, you are not limited to studying the mean, for which formulas and theorems are available.
Using resampling statistics, you can study any statistic that is relevant to your study, even if it is a very exotic statistic, since each time you can simulate what the sampling distribution of this particular statistic would look like.