Guillaume Calmettes
Confidence intervals ($1-\alpha$ confidence level)
Hypothesis testing ($\alpha$ level)
Confidence intervals are most useful when you want to estimate population parameters
Hypothesis tests and p-values are most useful when you want to test hypotheses about population parameters
Confidence intervals give you a range of plausible values
p-values quantify the strength of evidence against the null hypothesis
Are the following questions best assessed using a confidence interval or a hypothesis test?
How much do college students sleep, on average?
Do college students sleep more than the recommended 8 hours a night, on average?
What proportion of college students in the sleep study sample slept at least 8 hours?
Hypothesis testing is very powerful, as it helps shed light on whether an observed effect could be real or just due to random chance.
However, statistical significance is not foolproof, and it is possible to make the wrong decision: rejecting a true null hypothesis, or failing to reject a false null hypothesis.
We will discuss common pitfalls associated with formal hypothesis testing, along with factors that influence the chances of these errors occurring.
Paper published in Nature with p-value < 0.01
Concluded, based on the data, that mate choice improves offspring fitness.
This went against conventional wisdom.
Researchers at Penn State tried to replicate the results…
600 female fruit flies, randomly divided:
After mating, females were separated from the males and put in egg-laying chambers.
200 larvae from each chamber were taken and placed in a cage with 200 mutant flies (for competition).
Offspring survival rate was measured.
Experiment repeated 10 times/day for 5 days (50 runs, 10000 larvae tested in each group).
| Larvae survived? | Mate choice: Yes | Mate choice: No | Total |
|---|---|---|---|
| Yes | 6067 | 5976 | 12043 |
| No | 3933 | 4024 | 7957 |
| Total | 10000 | 10000 | 20000 |
Difference in survival rate:
$d=\hat{p}_\textrm{choice}-\hat{p}_\textrm{no choice}$
$d=0.6067-0.5976=0.009$
One-tail p-value = 0.102
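As a sketch of where a p-value like this comes from, here is a pooled two-proportion z-test on the table above. The published analysis may have used a different method (for example a randomization test), so this gives a one-tail p-value of roughly 0.09, in the same non-significant ballpark as the 0.102 reported rather than an exact match.

```python
import numpy as np
from scipy import stats

# Survival counts from the table above (mate choice vs. no mate choice)
survived = np.array([6067, 5976])
n = np.array([10000, 10000])

p_hat = survived / n                # 0.6067 and 0.5976
d = p_hat[0] - p_hat[1]             # observed difference in survival rates

# Pooled two-proportion z-test, H0: p_choice = p_no_choice, Ha: p_choice > p_no_choice
p_pool = survived.sum() / n.sum()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n[0] + 1 / n[1]))
z = d / se
p_one_tail = stats.norm.sf(z)       # upper-tail probability

print(f"d = {d:.4f}, z = {z:.2f}, one-tail p ≈ {p_one_tail:.3f}")
```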
Two studies investigated the same topic.
One study found a statistically significant result ($p<0.01$).
The other study did not ($p=0.102$).
Errors can happen!
| Truth \ Decision | Reject H$_0$ | Do not reject H$_0$ |
|---|---|---|
| H$_0$ true | Type I error | Correct decision |
| H$_0$ false | Correct decision | Type II error |
Errors can happen!
Option #1:
The original study (p-value < 0.01) made a Type I error, and H$_0$ is really true
Option #2:
The second study (p-value = 0.102) made a Type II error, and H$_a$ is really true
Option #3:
No errors were made; different experimental settings yielded different results
What is the probability of making a Type I or Type II error?
If the null hypothesis is true:
If $\alpha=0.05$, there is a 5% chance of a Type I error
If $\alpha=0.01$, there is a 1% chance of a Type I error
The probability of making a Type I error (rejecting a true null) is the significance level, $\alpha$.
How can we reduce the probability of making a Type II Error (not rejecting a false null)?
($=$ How could we increase the probability of rejecting the null?)
Increase the sample size
A larger sample size makes it easier to reject the null
(it decreases the variability of the sampling distribution)
Increase $n$ to decrease the chance of a Type II error
Set 1: Mean = 16, SD = 5
Set 2: Mean = 20, SD = 5
$\bar{x}_2-\bar{x}_1=4$

| $n$ (per group) | p-value |
|---|---|
| 9 | 0.1109 |
| 15 | 0.0376 |
| 21 | 0.0132 |
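These p-values can be reproduced (at least approximately) from the summary statistics alone. A minimal sketch using scipy, assuming the sample sizes shown are per group and a pooled two-sample t-test:

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics from the two sets above
mean1, sd1 = 16, 5   # Set 1
mean2, sd2 = 20, 5   # Set 2

for n in (9, 15, 21):
    t, p = ttest_ind_from_stats(mean1, sd1, n, mean2, sd2, n, equal_var=True)
    print(f"n = {n:2d} per group:  t = {t:5.2f},  two-tail p = {p:.4f}")
```

Only $n$ changes; the effect ($\bar{x}_2-\bar{x}_1=4$) and the SDs stay fixed, yet the p-value drops below 0.05.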
How can we reduce the probability of making a Type II Error (not rejecting a false null)?
($=$ How could we increase the probability of rejecting the null?)
Increase the sample size
Increase the significance level
Increase $\alpha$ to reduce the chance of Type II error
Decrease $\alpha$ to reduce the chance of Type I error
While we wish to avoid both types of errors, in practice we have to accept some trade-off between them:
This balance is set by how easy or hard it is to reject H$_0$, which is determined exactly by the significance level ($\alpha$)!
The significance level, $\alpha$, represents the tolerable probability of making a Type I error.
If the consequences of a Type I error are severe (for example, approving a new drug that is potentially dangerous) we might use a very small $\alpha$ (perhaps even $\alpha=0.005$).
If the consequences of a Type II error are severe (for example, failing to diagnose a treatable disease), we would want to make it easier to reject H$_0$, so we might use a relatively large $\alpha$.
However, remember that there is always a trade-off between the two types of errors, so we usually use the common significance levels of 5%, 10% or 1%.
Because the chance of a Type I error is $\alpha$ …
$\alpha$ of all tests with true null hypotheses will yield significant results just by chance.
If 100 tests are done with $\alpha=0.05$ and nothing is really going on (all the null hypotheses are true), about 5% of them will yield significant results (reject the null) just by chance.
This is known as the problem of multiple testing
Consider a topic that is being investigated by research teams all over the world:
Using $\alpha=0.05$, 5% of teams are going to find something significant, even if the null hypothesis is true.
Consider a research team/company doing many hypothesis tests:
Using $\alpha=0.05$, 5% of the tests are going to be significant, even if the null hypotheses are all true.
multiple hypothesis tests
When multiple tests are conducted, if the null hypotheses are all true, the proportion of all the tests that will yield statistically significant results just by random chance is about $\alpha$, the significance level.
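A quick simulation can illustrate this. The sketch below uses made-up normal data and arbitrary group sizes: it runs many two-group t-tests in which the null hypothesis is true and reports the fraction that come out "significant" at $\alpha=0.05$.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha = 0.05
n_tests, n_per_group = 1000, 50

# Run many two-group comparisons in which the null hypothesis is TRUE:
# both samples come from the same population.
p_values = []
for _ in range(n_tests):
    a = rng.normal(loc=0, scale=1, size=n_per_group)
    b = rng.normal(loc=0, scale=1, size=n_per_group)
    p_values.append(ttest_ind(a, b).pvalue)

frac = np.mean(np.array(p_values) < alpha)
print(f"Fraction of tests 'significant' at alpha = {alpha}: {frac:.3f}")
# Expect a value close to alpha (~0.05), just by chance.
```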
The experiment actually consisted of 50 smaller experiments (the experiment was repeated 10 times/day for 5 days: 50 runs, 10,000 larvae per group in total).
| Larvae survived? | Mate choice: Yes | Mate choice: No | Total |
|---|---|---|---|
| Yes | 6067 | 5976 | 12043 |
| No | 3933 | 4024 | 7957 |
| Total | 10000 | 10000 | 20000 |
What if we had calculated the p-value for each run?
The 50 p-values (one per run):
0.9570 | 0.8498 | 0.1376 | 0.5407 | 0.7640 | 0.9845 | 0.3334 | 0.8437 | 0.2080 | 0.8912 |
0.8879 | 0.6615 | 0.6695 | 0.8764 | 1.0000 | 0.0064 | 0.9982 | 0.7671 | 0.9512 | 0.2730 |
0.5812 | 0.1088 | 0.0181 | 0.0013 | 0.6242 | 0.0131 | 0.7882 | 0.0777 | 0.0001 | 0.9641 |
0.8851 | 0.1280 | 0.3421 | 0.1805 | 0.1121 | 0.6562 | 0.0133 | 0.3082 | 0.6923 | 0.1925 |
0.4207 | 0.0607 | 0.3059 | 0.2383 | 0.2391 | 0.1584 | 0.1735 | 0.0319 | 0.0171 | 0.1082 |
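As a quick tally (a minimal check; the values are simply copied from the table above), here is how many of the 50 runs would be flagged as "significant" at $\alpha = 0.05$:

```python
# The 50 run-level p-values from the table above
p_values = [
    0.9570, 0.8498, 0.1376, 0.5407, 0.7640, 0.9845, 0.3334, 0.8437, 0.2080, 0.8912,
    0.8879, 0.6615, 0.6695, 0.8764, 1.0000, 0.0064, 0.9982, 0.7671, 0.9512, 0.2730,
    0.5812, 0.1088, 0.0181, 0.0013, 0.6242, 0.0131, 0.7882, 0.0777, 0.0001, 0.9641,
    0.8851, 0.1280, 0.3421, 0.1805, 0.1121, 0.6562, 0.0133, 0.3082, 0.6923, 0.1925,
    0.4207, 0.0607, 0.3059, 0.2383, 0.2391, 0.1584, 0.1735, 0.0319, 0.0171, 0.1082,
]

significant = sorted(p for p in p_values if p < 0.05)
print(f"{len(significant)} of {len(p_values)} runs have p < 0.05: {significant}")
```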
What if we just reported the run that yielded a p-value of 0.0001?
Is that ethical?
Publication bias refers to the fact that usually only the significant results get published.
The one study that turns out significant gets published, and no one knows about all the insignificant results (also known as the file drawer problem)
Publication bias, combined with the problem of multiple testing, can yield very misleading results.
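To see how misleading this combination can be, suppose (as an illustration) that all 50 runs had true null hypotheses and were independent. The chance that at least one run comes out "significant" at the 5% level is then

$$P(\textrm{at least one } p<0.05) = 1-(1-0.05)^{50} = 1-0.95^{50} \approx 0.92$$

Publishing only that one "significant" run would make a near-guaranteed chance finding look like strong evidence.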
The study was 100 percent authentic. My colleagues and I recruited actual human subjects in Germany. We ran an actual clinical trial, with subjects randomly assigned to different diet regimes. And the statistically significant benefits of chocolate that we reported are based on the actual data.
It was, in fact, a fairly typical study for the field of diet research. Which is to say: It was terrible science. The results are meaningless, and the health claims that the media blasted out to millions of people around the world are utterly unfounded.
I know what you’re thinking. The study did show accelerated weight loss in the chocolate group—shouldn’t we trust it? Isn’t that how science works?
Here’s a dirty little science secret: If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. (One subject was dropped.) That study design is a recipe for false positives (Type I error).
Newspapers never covered initial studies reporting null findings and rarely reported subsequent null observations. Only 48.7% of the 156 studies reported by newspapers were confirmed by the corresponding meta-analyses.
The p-value was never meant to be used the way it's used today
It can be proven that most claimed research findings are false.
Many researchers believe that if scientists set out to reproduce preclinical work published over the past decade, a majority would fail. This, in short, is the reproducibility crisis.
A recent study tried to replicate 100 results published in psychology journals: 97% of the original results were statistically significant, but only 36% of the replications were.
(Figure: each is one medical study.)
Schoenfeld JD and Ioannidis JPA (2013). 'Is everything we eat associated with cancer? A systematic cookbook review.' Am J Clin Nutr 97(1):127-134
A single study is rarely the final answer.
And yet for science reporters, new studies are irresistible — a bold new finding makes a great headline.
Point #1:
Errors (type I and II) are possible
Point #2:
Multiple testing and publication bias are a huge problem
What can you do?
Replication (or reproducibility) of a study in another setting or by another researcher is extremely important!
Studies that have been replicated with similar conclusions gain credibility
Studies that have been replicated with different conclusions lose credibility
Replication helps guard against Type I errors AND helps with generalizability
J.B. Landers (CAUSEweb.org)
Actually, the research at Penn State included 3 different experiments, with two different species of fruit flies and three different mutant types.
Multiple possible outcomes (% surviving in each group, % of survivors that came from the experimental group rather than from the mutants)
Multiple ways to analyze the data: proportions, or a quantitative paired analysis
Original study:
Significant in favor of choice
(p < 0.01)
Follow-up study #1:
Not significant
$\frac{6067}{10000}-\frac{5976}{10000}= 0.6067 - 0.5976 = \fragindex{1}{\fraglight{highlight-current-bckg-red}{0.009}}$
($p = 0.102$)
Follow-up study #2:
Significant in favor of no choice
$\frac{4579}{10000}-\frac{4749}{10000}= 0.4579 - 0.4749 = \fragindex{1}{\fraglight{highlight-current-bckg-red}{-0.017}}$
($p= 0.008$)
Follow-up study #3:
Significant in favor of no choice
$\frac{1641}{5000}-\frac{1758}{5000}= 0.3282 - 0.3516 = \fragindex{1}{\fraglight{highlight-current-bckg-red}{-0.02}}$
($p=0.007$)
Suppose a weight loss program recruits 10,000 people for a randomized experiment.
Because of the huge sample size, a difference in average weight loss of only 0.5 lbs could be found to be statistically significant
Suppose the experiment lasted for a year.
Is a loss of ½ a pound practically significant?
A statistically significant result is not always practically significant, especially with large sample sizes
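A minimal sketch of this point, using hypothetical numbers (5,000 people per arm, mean weight losses of 10.5 lbs and 10.0 lbs, and an assumed SD of 10 lbs in each group; none of these values come from a real study):

```python
from scipy.stats import ttest_ind_from_stats

# Hypothetical numbers: 5,000 people per arm, a 0.5 lb difference in mean
# weight loss, and an assumed SD of 10 lbs per group.
t, p = ttest_ind_from_stats(mean1=10.5, std1=10, nobs1=5000,
                            mean2=10.0, std2=10, nobs2=5000)
print(f"t = {t:.2f}, two-tail p = {p:.4f}")  # p comes out well below 0.05
```

The half-pound difference is "statistically significant" here only because the sample is huge.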
What do we do then?
Set 1: Mean = 16, SD = 5
Set 2: Mean = 20, SD = 5
$\bar{x}_2-\bar{x}_1=4$

| $n$ (per group) | p-value |
|---|---|
| 9 | 0.1109 |
| 15 | 0.0376 |
| 21 | 0.0132 |
What is really important here is not so much whether the p-value suggests a statistically significant result, but the actual size of the effect of your treatment ($\bar{x}_2-\bar{x}_1=4$)
Conclusions based off p-values are not perfect
Type I and Type II errors can happen
A proportion $\alpha$ of all tests with true null hypotheses will be significant just by chance, and often only the significant results get published
Replication of results is important
Larger sample sizes make it easier to get significant results
Reporting a confidence interval alongside (or instead of) the p-value gives you an estimate of the true value of your effect (treatment).
This is what is of practical importance.
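For instance, for the Set 1 / Set 2 example above (with $n = 21$ per group), a rough sketch of a 95% confidence interval for the difference in means, computed from the summary statistics:

```python
import numpy as np
from scipy import stats

# Summary statistics from the Set 1 / Set 2 example, with n = 21 per group
mean1, mean2, sd, n = 16, 20, 5, 21

diff = mean2 - mean1                    # observed effect: 4
se = sd * np.sqrt(2 / n)                # standard error of the difference (equal SDs)
df = 2 * n - 2                          # pooled-variance degrees of freedom
t_crit = stats.t.ppf(0.975, df)         # ~2.02 for df = 40

lower, upper = diff - t_crit * se, diff + t_crit * se
print(f"Effect = {diff}, 95% CI = ({lower:.2f}, {upper:.2f})")
# The interval shows how large the effect plausibly is, not just whether p < 0.05.
```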
Note: for more details, see the 2016 American Statistical Association’s Statement on p-values