Guillaume Calmettes
Statistical significance
Would the observed result be surprising if the
observed phenomenon
was only governed by random chance?
Hypothesis testing for one proportion
How likely would we observe such an extreme
proportion of successes under the null hypothesis (=random chance)?
Based on these data, can we say that the UCLA Bruins are better than the USC Trojans at basketball?
What are the observational units?
Each UCLA-USC basketball game
What is the variable measured?
Is it Categorical or Quantitative?
Whether or not UCLA won the game (categorical)
We are interested in investigating whether the UCLA Bruins are better
than the USC Trojans at
basketball, and we think they are.
Alternative hypothesis: When they are playing against
the Trojans, the Bruins are more likely to win a game
($\pi>0.5$).
What is the null hypothesis in this investigation?
UCLA and USC are equally likely to win a game ($\pi=0.5$).
The recorded proportion of victories of the Bruins over the 249 games
could be explained by chance alone.
We know that the Bruins won 140 out of 249 games against USC.
What is the observed statistic?
Proportion of UCLA wins:
$\overset{\hat{}}{p}=\frac{140}{249}=0.562$
If UCLA and USC were equally likely to win a game, would it be surprising to observe $\overset{\hat{}}{p}\geq0.562$ for UCLA over 249 games?
We can use simulations to investigate whether this provides us with strong enough evidence against the null and if the proportion of victories of the Bruins over the Trojans is statistically significant.
Single simulation?
249 random draws of 0 or 1 (each equally likely under the null)
Statistic of interest?
The proportion of 1s (wins) in the sample
Number of simulations?
10000
10000 simulations
of 249 game outcomes each
(probability of a UCLA win under the null: $\pi=0.5$)
| Simulation # | Random sample ($n=249$) | sample $\overset{\hat{}}{p}$ |
|---|---|---|
| 1 | 1 0 0 ... 1 0 0 1 | 0.504 |
| 2 | 0 0 0 ... 0 1 0 0 | 0.531 |
| 3 | 1 1 0 ... 0 1 1 1 | 0.486 |
| ... | ... | ... |
| 9999 | 0 1 1 ... 0 0 0 1 | 0.497 |
| 10000 | 0 1 1 ... 0 1 1 0 | 0.519 |
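A minimal sketch of how such a simulated null distribution could be generated, assuming Python with numpy (the variable names and the random seed are illustrative choices, not part of the original analysis):

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility

n_games = 249      # games per simulated sample
n_sims = 10000     # number of simulated samples
pi_null = 0.5      # null hypothesis: UCLA and USC equally likely to win

# Each row is one simulated set of 249 games: 1 = UCLA win, 0 = UCLA loss
samples = rng.binomial(1, pi_null, size=(n_sims, n_games))

# Statistic of interest: proportion of wins in each simulated sample
null_phats = samples.mean(axis=1)

print(null_phats[:3])   # first few simulated sample proportions
```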
Where is the null distribution centered?
At $\pi=0.5$
(the null hypothesis value)
What do we need to count?
How many simulations have $\overset{\hat{}}{p}\geq0.562$
Out of the 10000 simulations, 272
resulted in $\overset{\hat{}}{p}\geq0.562$
$\textrm{p-value}=\frac{272}{10000}=0.0272$
=> We have strong evidence against the chance model
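A hedged sketch of the p-value computation from a simulated null distribution (same assumed numpy setup as above; the exact count will vary slightly with the random seed):

```python
import numpy as np

rng = np.random.default_rng(0)
# Null distribution of the sample proportion (as in the sketch above)
null_phats = rng.binomial(1, 0.5, size=(10000, 249)).mean(axis=1)

p_obs = 140 / 249                         # observed statistic, ~0.562
p_value = np.mean(null_phats >= p_obs)    # fraction of simulations at least as extreme

print(f"one-sided p-value = {p_value:.4f}")   # close to 272/10000 = 0.0272
```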
So far we've looked at a measure of the strength of evidence (p-value). However, we've not yet formally looked at what factors impact the strength of evidence.
In other words, why is the strength of evidence (measured by the p-value) sometimes strong and sometimes weak or non-existent?
Difference between the observed statistic and the null hypothesis parameter value ($\overset{\hat{}}{p}-\pi_0$)
What if instead of 140 wins out of the 249 games, UCLA had won 155 of those games? Or what if they had only won 135 of those games?
How would the number of UCLA wins in the sample impact our strength of evidence against the null?
=> Intuitively, the more extreme the observed statistic, the more evidence there is against the null hypothesis.
If UCLA had won 155 games, that is a success rate of
$\overset{\hat{}}{p}=\frac{155}{249}=0.622$.
None of the 10000 simulations resulted in $\overset{\hat{}}{p}\geq0.622$
(p-value < 0.0001).
If UCLA had won only 135 games, that is a success rate of
$\overset{\hat{}}{p}=\frac{135}{249}=0.542$.
About 1000 of the 10000 simulations resulted in $\overset{\hat{}}{p}\geq0.542$
(p-value $\approx$ 0.1).
The further away the observed statistic is from the mean of the null distribution, the more evidence there is against the null hypothesis.
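The same simulated null distribution can be reused to compare the three scenarios (135, 140, and 155 wins); a sketch assuming numpy, with counts again depending on the seed:

```python
import numpy as np

rng = np.random.default_rng(0)
null_phats = rng.binomial(1, 0.5, size=(10000, 249)).mean(axis=1)

# The further p_hat is from 0.5, the smaller the one-sided p-value
for wins in (135, 140, 155):
    p_obs = wins / 249
    p_value = np.mean(null_phats >= p_obs)
    print(f"{wins} wins: p_hat = {p_obs:.3f}, one-sided p-value = {p_value:.4f}")
```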
A second factor that impacts the strength of evidence, besides the difference between the observed statistic and the null hypothesis parameter value ($\overset{\hat{}}{p}-\pi_0$):
Sample size
What if the relative proportion of basketball wins of UCLA over USC (0.562) was the result of 124 games instead of 249? Or what if it was obtained from 498 games (twice as many)?
How would the sample size, for the same observed statistic, impact the strength of evidence against the null?
Do you think that increasing the sample size would increase, decrease, or not change the strength of evidence against the null?
=> Intuitively, it seems reasonable to think that as we increase the sample size, the strength of evidence against the null hypothesis will increase. If the same proportion of UCLA wins had been observed over more games, we would have more knowledge about the truth.
Effect of changing the sample size (number of UCLA-USC games played) on the null distribution, shown for a sample size decreased to 124 ($\simeq$ half as many), the original sample size of 249, and a sample size increased to 498 (twice as many):
The greater the sample size, the less variability in the sample statistic.
As the sample size increases (and the value of the observed statistic stays the same), the strength of evidence against the null hypothesis increases.
The bigger the sample size, the more reliable the observed statistic will be (less variability in the sample statistic from sample to sample).
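A sketch of how the spread of the null distribution, and the p-value for the same observed proportion of 0.562, change with the sample size (numpy assumed; the printed values are approximate and depend on the seed):

```python
import numpy as np

rng = np.random.default_rng(0)
p_obs = 0.562   # same observed proportion for all three sample sizes

for n_games in (124, 249, 498):
    null_phats = rng.binomial(1, 0.5, size=(10000, n_games)).mean(axis=1)
    spread = null_phats.std()                  # variability of the sample statistic
    p_value = np.mean(null_phats >= p_obs)     # one-sided p-value for p_hat = 0.562
    print(f"n = {n_games}: SD of null distribution = {spread:.3f}, p-value = {p_value:.4f}")
```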
If you are trying to pass a true/false test (let's say 60% or higher) but know NOTHING about what is going to be on the test, would you rather have more questions or fewer questions on the test?
As the sample size changes, the observed statistic will likely change as well
Importantly, we can't automatically assume that if we collect more data and have a bigger sample size the strength of evidence will increase (smaller p-value), because if we collect more data, our observed statistic will almost always change as well.
If UCLA and USC are playing more games, the proportion of wins by UCLA won't be exactly 0.562 forever.
A third factor that impacts the strength of evidence, besides the difference between the observed statistic and the null hypothesis parameter value ($\overset{\hat{}}{p}-\pi_0$) and the sample size:
Whether we do a one- or two-sided test
What if we were wrong and instead of UCLA being better than USC at basketball, it was USC that was better?
Currently, as we've stated our null and alternative hypotheses,
we haven't allowed for this possibility:
Null hypothesis: $\pi=0.5$
Alternative hypothesis: $\pi>0.5$ (for UCLA)
Not considered: $\pi<0.5$ (USC more likely to win)
This type of alternative hypothesis is called "one-sided" because it only looks at one of the two possible ways that the null hypothesis could be wrong.
If we only consider the possibility that UCLA is the better team, this way of formulating our alternative hypothesis could be considered too narrow and too biased towards assuming that we are correct ahead of time.
A more objective approach would be to conduct a "two-sided" test, which allows for all the ways the null hypothesis could be wrong.
In this case our hypotheses would be:
Null hypothesis: $\pi=0.5$
Alternative hypothesis: $\pi\neq0.5$
We create the randomization distribution
by assuming the null hypothesis is true.
The alternative hypothesis does not play any role in this process
as the randomization samples depend only on the null hypothesis.
However, the alternative hypothesis is important
in determining the p-value because it determines
which tail(s) to use to calculate the p-value.
If the alternative hypothesis specifies a particular direction, we refer to these as right-tailed or left-tailed tests, depending on whether the alternative hypothesis is greater than or less than, respectively.
Otherwise, we are only looking to see if there is a difference without specifying in advance in which direction it might lie. These are called two-tailed tests.
The definition of “more extreme” to compute a p-value depends on whether the alternative hypothesis yields a test that is right-, left-, or two-tailed.
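A sketch of how the one-sided (right-tailed) and two-sided p-values could be computed from the same simulated null distribution (numpy assumed; the two-sided count uses the distance from $\pi_0$ as the definition of "more extreme"):

```python
import numpy as np

rng = np.random.default_rng(0)
pi_0 = 0.5
null_phats = rng.binomial(1, pi_0, size=(10000, 249)).mean(axis=1)

p_obs = 140 / 249

# Right-tailed test (alternative: pi > 0.5)
p_one_sided = np.mean(null_phats >= p_obs)

# Two-tailed test (alternative: pi != 0.5): simulations at least as far
# from pi_0 as the observed statistic, in either direction
p_two_sided = np.mean(np.abs(null_phats - pi_0) >= abs(p_obs - pi_0))

print(f"one-sided p-value = {p_one_sided:.4f}, two-sided p-value = {p_two_sided:.4f}")
```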
Because the p-value for a two-sided test
is about twice as large
as that for a one-sided test, two-sided tests provide less evidence against
the null hypothesis.
However, note that two-sided tests are used more often in
scientific practice.
1) Lots of factors can influence the p-value
Difference between the observed statistic and the null hypothesis parameter value ($\overset{\hat{}}{p}-\pi_0$)
Sample size
Whether we do a one- or two-sided test
2) The p-value does not give information about how different from the null our observed statistic is
A small p-value provides evidence against the null hypothesis (strength of evidence), but does not tell us anything about the true value of the parameter of the population from which we obtained our sample
Statistical inference is the process of using data from a sample to gain information about the population.
What we really want to know, is how close to the true population parameter
the statistic we calculated from the sample is.
We want to know about the population parameter, not the sample statistic.
If we had strong evidence that the probability of UCLA to win a game against USC was larger than 0.5, what we would really want to know would be "how much larger than 0.5?", not just "is it different from 0.5?"
In our case, we did not find strong evidence that UCLA was more
likely than USC to win a game (two-tailed p-value >0.05).
But it would still be interesting to know
the true probability
for UCLA to win a game against USC.
In other words, if UCLA and USC
were to play an infinite number of basketball games,
what parameter $\pi$ would we observe in the long run?
How could we get an idea about the true value of the probability of UCLA to win a game against USC?
We know how to test our statistic against a specific parameter value for our null hypothesis (0.5).
What if we were testing against a different null-hypothesis parameter value? Or over a full range of null hypothesis parameters?
Observed statistic: $\overset{\hat{}}{p}=0.562$
Candidate null parameter values $\pi_0$: 0.48, 0.5, 0.55, 0.58, 0.6, 0.62, 0.65
| Simulation # | Random sample ($n=249$) | sample $\overset{\hat{}}{p}$ |
|---|---|---|
| 1 | 0 1 0 ... 1 0 0 1 | 0.476 |
| 2 | 1 1 1 ... 0 1 1 0 | 0.498 |
| 3 | 1 1 0 ... 0 1 1 1 | 0.482 |
| ... | ... | ... |
| 9999 | 0 1 0 ... 0 1 0 1 | 0.481 |
| 10000 | 0 1 0 ... 0 1 1 0 | 0.509 |
| Simulation # | Random sample ($n=249$) | sample $\overset{\hat{}}{p}$ |
|---|---|---|
| 1 | 1 0 1 ... 1 0 1 1 | 0.649 |
| 2 | 0 1 1 ... 0 1 0 0 | 0.682 |
| 3 | 1 1 1 ... 0 0 0 1 | 0.628 |
| ... | ... | ... |
| 9999 | 0 0 1 ... 0 0 1 1 | 0.617 |
| 10000 | 1 1 1 ... 0 1 1 0 | 0.696 |
Recall: observed statistic $\overset{\hat{}}{p}=0.562$; candidate null parameter values $\pi_0$ from 0.48 to 0.65.
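One way to carry out this scan is to rebuild the null distribution under each candidate $\pi_0$ and keep the values that a two-tailed test does not reject at the 0.05 level; a sketch assuming numpy (the 0.05 cutoff and the candidate grid are the ones used above):

```python
import numpy as np

rng = np.random.default_rng(0)
p_obs = 140 / 249   # 0.562
n_games, n_sims = 249, 10000

for pi_0 in (0.48, 0.50, 0.55, 0.58, 0.60, 0.62, 0.65):
    null_phats = rng.binomial(1, pi_0, size=(n_sims, n_games)).mean(axis=1)
    # Two-tailed p-value: simulations at least as far from pi_0 as the observed statistic
    p_two_sided = np.mean(np.abs(null_phats - pi_0) >= abs(p_obs - pi_0))
    verdict = "plausible" if p_two_sided >= 0.05 else "rejected"
    print(f"pi_0 = {pi_0:.2f}: two-tailed p-value = {p_two_sided:.3f} -> {verdict}")
```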
Our observed statistic
would be plausible if our sample had been drawn from
population distributions with parameter $\pi$
ranging at least from $0.5$ to $0.62$
=> null hypothesis not rejected
at the two-tailed 0.05 level.
This range is an interval estimate of the long-run probability $\pi$ for UCLA to win a game against USC.
Key idea:
This interval of plausible values we just
obtained is called a
95% confidence interval for $\pi$, the
long-run probability for UCLA to win a game against USC. The 95% value
is called the confidence level, and is a
measure of how confident we are about our interval containing the
true population parameter.
Note:
We can also actually use the 95%
confidence interval as a
measure of the strength of evidence!
=> if the parameter of the
null distribution is in the interval, then we cannot
reject the null hypothesis (we cannot reject that $\pi=\pi_0$)
Right now, we only have a rough idea of the boundaries of the 95% confidence interval (we only tested a limited number of candidate values).
The bootstrap is
a resampling method that uses only the data in the original sample
(the only data we know about).
It can be used for estimating the variability
of a statistic and to approximate
a sampling distribution using just the
information in that one sample.
Think about it:
The sample is the only information we know to be true about
the population it has been drawn from.
=> The best estimate for a population parameter ($\pi$) is the relevant sample
statistic ($\overset{\hat{}}{p}$).
=> We can expand this idea and consider the sample as our best
estimate for the underlying population.
Ideally, we’d like to sample repeatedly from the population to create a sampling distribution
=> repeatedly draw another random sample of 249
UCLA-USC games from the population and
calculate another value for our statistic
How can we make the sample data look like data from the entire population?
We assume that the population of all the UCLA-USC basketball games is basically just many, many copies of the games in our original sample (n=249).
In practice, instead of actually making many copies of the sample and
sampling from that, we use a sampling technique that is equivalent:
=> we sample with replacement from the original sample.
Once a UCLA-USC game has been selected from the sample,
it is still available to be selected again.
=> Each UCLA-USC game in the original sample actually represents
many other games with a similar outcome (win or loss).
Each sample selected in this way, with replacement from the original
sample, is called a bootstrap sample.
Note: Recall that the variability of a sample statistic depends on the size of the sample. Because we are trying to uncover the variability of the sample statistic, it is important that each bootstrap sample is the same size as the original sample (n=249).
Data sample ($n=249$): $140$ ones & $109$ zeros, so $P(1)=0.562$ and $P(0)=0.438$
Original sample: 1 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 1 | 1 | 0 | 1
$\overset{\hat{}}{p}=0.562$
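A minimal sketch of the bootstrap resampling, assuming numpy (each bootstrap sample is drawn with replacement from the original 249 outcomes and has the same size as the original sample):

```python
import numpy as np

rng = np.random.default_rng(0)

# Original sample: 140 UCLA wins (1) and 109 losses (0)
sample = np.array([1] * 140 + [0] * 109)
n = sample.size          # 249
n_boot = 10000

# Each bootstrap sample: n draws with replacement from the original sample
boot_samples = rng.choice(sample, size=(n_boot, n), replace=True)
boot_phats = boot_samples.mean(axis=1)   # bootstrap distribution of p_hat

print(boot_phats.mean())   # centered near the original statistic, ~0.562
```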
Now that we have our bootstrap sampling distribution, we can compute the confidence interval at the 95% level.
Where is the bootstrap distribution centered?
At $\overset{\hat{}}{p}$
(the original observed statistic)
How do we compute the 95% confidence interval for our population parameter?
We consider the central 95% of the bootstrap statistics
(discard the 2.5% most extreme cases on each side)
We are 95% confident that the true probability of the Bruins winning a game against the Trojans (parameter $\pi$) lies in the interval $[0.498, 0.622]$
A bootstrap confidence interval at the $\alpha$ significance level is computed by considering the central $(1-\alpha)$ proportion of the bootstrap statistics in the bootstrap sampling distribution we constructed.
In the sorted array of the $N$ bootstrap statistics, this corresponds to the statistics at positions $\frac{\alpha}{2}\times N$ and $\left(1-\frac{\alpha}{2}\right)\times N$.
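A sketch of this percentile computation, assuming numpy (the bootstrap distribution is regenerated as above; exact bounds will vary slightly from run to run):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = np.array([1] * 140 + [0] * 109)          # original sample (140 wins, 109 losses)
boot_phats = rng.choice(sample, size=(10000, sample.size), replace=True).mean(axis=1)

alpha = 0.05
# Central (1 - alpha) of the bootstrap statistics: cut alpha/2 off each tail
lower, upper = np.percentile(boot_phats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
print(f"{100 * (1 - alpha):.0f}% confidence interval: [{lower:.3f}, {upper:.3f}]")
# Roughly [0.50, 0.62], matching the interval reported above
```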
If we were able to get an unlimited number of original samples from the population of UCLA-USC basketball games, and were computing a $100(1-\alpha)\%$ confidence interval from each of them, the true probability of UCLA winning a game against USC would be in $100(1-\alpha)\%$ of these calculated intervals (e.g. 95% of them for $\alpha=0.05$).
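This interpretation can be illustrated with a small simulation: pretend the true $\pi$ is known, repeatedly draw original samples, build a bootstrap interval from each, and check how often the interval captures $\pi$. A sketch assuming numpy; the "true" value of 0.56 and the repetition counts are arbitrary choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
pi_true = 0.56        # hypothetical "true" long-run probability, for illustration only
n, n_boot, n_reps = 249, 2000, 500
alpha = 0.05

covered = 0
for _ in range(n_reps):
    sample = rng.binomial(1, pi_true, size=n)    # one "original" sample of 249 games
    # For 0/1 data, resampling n values with replacement is equivalent to
    # drawing the number of wins from Binomial(n, p_hat)
    boot_phats = rng.binomial(n, sample.mean(), size=n_boot) / n
    lower, upper = np.percentile(boot_phats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    covered += (lower <= pi_true <= upper)

print(f"coverage: {covered / n_reps:.2%}")   # should be close to 95%
```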
Confidence intervals provide information about both:
- strength of evidence
(is the parameter value we want to test against
inside the interval or not? e.g. using a 95% confidence interval)
- estimate of the true
parameter value
($100(1-\alpha)\%$ of the time our interval will include the true parameter)
In that sense, reporting a $100(1-\alpha)\%$ confidence interval is better than just reporting a p-value, since what we really want to know is the true value of our population parameter, not just whether it differs from a particular value.
Bootstrap regression analysis