Stats 13

Lecture 3

Introduction to
hypothesis testing

Guillaume Calmettes

Last time

Importance of visualizing your data

Know/Describe your dataset

How to present categorical and quantitative variables

Shape

Central tendency

Spread

Statistical inference

A key step in the statistical investigation method is drawing conclusions beyond the observed data. Statisticians often call this statistical inference.

There are four main types of conclusions (inferences) that
we can draw from data:

  • Significance
  • Estimation
  • Generalization
  • Causation

Could the observed result be considered statistically significant?
Is our result unlikely to happen by random chance?

Can dolphins communicate?

Dr. Jarvins Bastian's experiment (1964)

Can dolphins communicate?

In one phase of the study, Dr. Bastian had Buzz attempt to push the correct button a total of 16 different times.
=> Buzz pushed the correct button 15 out of 16 times. ($\hat{p} = \frac{15}{16} = 0.938$)

Think about it:
Based on these data, do you think Buzz somehow knew which button to push?
Is 15 out of 16 correct pushes convincing to you?
Or do you think that Buzz could have just been guessing?

Can dolphins communicate?

2 possible explanations

Buzz is just guessing (his probability of choosing the correct button is 0.50) and he got really lucky in these 16 attempts. $\pi=0.5$

Buzz is doing something other than just guessing (his probability of choosing the correct button is more than 0.50). $\pi>0.5$

The key question here is to determine what results would occur in the long-run under the assumption that Buzz is just guessing.

=>

We call this assumption of random guessing by Buzz the null hypothesis (or null model)

How could we decide between our 2 hypotheses?

The chance model

Statisticians often employ chance models to generate data from random processes to help them investigate such processes.

What probability do we need to simulate to test our hypothesis?
=> 50/50 chance model

What would be a good simulation model for Buzz & Doris communication experiment?
=> coin flip

Model                      Experiment
Coin flip                  Button choice by Buzz
Heads                      Correct button
Tails                      Wrong button
Chance of heads            Probability of Buzz pressing the correct button
One set of 16 coin flips   One set of 16 experiments
Let's simulate it!

In-class simulation - Live data

If Buzz was just guessing which button to push each time, what would be the number of correct choices we would observe for 16 attempts?

https://goo.gl/cYUbFn

1- Flip a coin 16 times

2- Record the number of heads that you obtain

3- Enter this value in the spreadsheet in the cell assigned to your name (please only fill in the value for your name!)
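For those curious, the coin-flipping protocol above can be mimicked in a few lines of Python. This is a sketch using only the standard library; the function name `flip_16_coins` is ours, not from the slides:

```python
import random

def flip_16_coins(n_flips=16, seed=None):
    """Simulate n_flips fair coin flips and return the number of heads."""
    rng = random.Random(seed)  # optional seed for reproducibility
    # Each flip lands heads (1) or tails (0) with probability 0.5
    return sum(rng.randint(0, 1) for _ in range(n_flips))

# One simulated "session": the value you would enter in the spreadsheet
print(flip_16_coins())
```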

Analysis of the dotplot

What are the:
observational units?
variables?
Hint: what does each data point represent?

Each dot represents the
number of heads in
one set of 16 coin tosses

Does it seem that the number of correct button choices by Buzz would have been surprising if in fact he was just guessing?

Long-run pattern for a single coin-flip

We really need to simulate this random selection process hundreds, preferably thousands of times to obtain the long-run pattern of our simulation.

Long-run pattern for 16 coin-flips

Obtaining the long-run distribution of the number of heads in 16 coin flips would be very tedious and time-consuming with real coins, so we'll turn to technology to simulate it.

How many of these 1000 simulations produced 15 or more correct choices by Buzz?
1 out of 1000

What is the corresponding proportion of simulations that produced such an extreme result?
$\frac{1}{1000}=0.001$
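With technology, thousands of repetitions take a fraction of a second. Here is a sketch in Python (we use 10,000 repetitions and our own helper name `simulate_null`; the exact count of extreme results will vary from run to run):

```python
import random

def simulate_null(n_reps=10_000, n_flips=16, pi=0.5, seed=42):
    """Number of 'correct' choices in each simulated set, assuming pure guessing."""
    rng = random.Random(seed)
    return [sum(rng.random() < pi for _ in range(n_flips)) for _ in range(n_reps)]

counts = simulate_null()
# Proportion of simulated sets with 15 or more correct choices
p_value = sum(c >= 15 for c in counts) / len(counts)
print(p_value)  # a very small proportion, like the 1/1000 seen in class
```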

p-value

A p-value is the probability of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis is true.

A small p-value casts doubt on the null hypothesis/model used to perform the calculation (in this case, that Buzz was just guessing which button to push).

A p-value

$\leq{0.10}$
$\leq{0.05}$
$\leq{0.01}$
$\leq{0.001}$

is generally considered
to be

some
fairly strong
very strong
extremely strong

evidence against the null

The results ($\textrm{p-value}=\frac{1}{1000}=0.001$) mean our evidence is strong enough to be considered statistically significant. That is, we don't think our study result (15 out of 16 correct) happened by chance alone, but rather, something other than "random chance" was at play.

Follow up study

One goal of statistical significance testing is to rule out random chance as a plausible (believable) explanation for what we have observed. We still need to worry about how well the study was conducted.

- Could Buzz see the light through the curtain?
- Could Buzz have detected a pattern in the succession of light signals?
- We haven't completely ruled out random chance (but its probability is very small)

Follow up study

One option that Dr. Bastian pursued was to redo the study except now he replaced the curtain with a wooden barrier between the two sides of the tank in order to ensure more complete separation between Doris and Buzz.

=> In this case, Buzz pushed the correct button only 16 out of 28 times.

Follow up study

Buzz' successes: 16 out of 28 ($\hat{p} = \frac{16}{28} = 0.57$)

Simulations:
This time we need to do repetitions of 28 coin flips, not just 16.

Follow up study

Buzz' successes: 16 out of 28 ($\hat{p} = \frac{16}{28} = 0.57$)

Simulations:
This time we need to do repetitions of 28 coin flips, not just 16.

p-value:

$\frac{2795}{10000}=0.280$

Not enough evidence that the "by-chance-alone" model is wrong.

Follow up study

In fact, Dr. Bastian soon discovered that in this set of attempts the equipment had malfunctioned: the food dispenser for Doris did not operate, so Doris was not receiving her fish rewards during the study.

=>

It is not so surprising that removing the incentive hindered the communication between the dolphins, and we cannot rule out that Buzz was simply guessing.

Dr. Bastian fixed the equipment and ran the study again. This time he found convincing evidence that Buzz was not guessing.

Conclusion/Generalization:
Dolphins can communicate abstract concepts!

Resampling p-value

1- Collect your sample and calculate your statistic of interest (ex: $\hat{p}=\frac{17}{24}=0.708$)

2- State your null and alternative hypotheses
(ex: H$_0$: $\pi=0.5$ / H$_a$: $\pi>0.5$)

3- Simulate your null hypothesis distribution. (it should be centered at the stated H$_0$ parameter of interest)

4- Calculate the proportion of samples that resulted in cases at least as extreme as your initial observed statistic. This is your p-value.
(ex: $\textrm{p-value}=\frac{289}{10000}=0.029$)

5- Conclude about whether you have evidence in favor of or against the null hypothesis.
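The five steps above can be sketched in Python for the 17-out-of-24 example (`resampling_p_value` is a name of our choosing; the simulated p-value will land near, not exactly at, 0.029):

```python
import random

def resampling_p_value(successes, n, pi_null=0.5, n_reps=10_000, seed=7):
    """Steps 1-4: simulate the null distribution and return a one-sided p-value."""
    observed = successes / n  # step 1: observed statistic (p-hat)
    rng = random.Random(seed)
    # step 3: null distribution of p-hat, centered at pi_null
    null_stats = [sum(rng.random() < pi_null for _ in range(n)) / n
                  for _ in range(n_reps)]
    # step 4: proportion of simulated statistics at least as extreme (H_a: pi > pi_null)
    return sum(s >= observed for s in null_stats) / n_reps

print(resampling_p_value(17, 24))  # close to the slide's 0.029
```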

Theory: Standardized statistic

Another way (besides the p-value) to measure strength of evidence is to standardize the observed statistic by measuring how far it falls from the mean of the null distribution, in standard deviation units.
This measure is commonly noted $z$.

$z=\frac{\textrm{observed statistic}-\textrm{mean of null distribution}}{\textrm{standard deviation of null distribution}}$

Mean of null distribution:
$\mu = 0.5$ ($\pi$)

SD of null distribution:
???

Normal distribution

Theory: Central limit theorem

In the early 1900s (and even earlier), computers weren't available to run simulations, and as people didn't want to sit around flipping coins all day, they focused their attention on mathematical and probabilistic rules and theories that could predict what would happen if someone did simulate.

They proved the following result:

Central limit theorem:
If the sample size ($n$) is large enough, the distribution of the sample proportion will be bell-shaped (or normal), centered at the long-run proportion, with a standard deviation of $\sqrt{\frac{\pi(1-\pi)}{n}}$

Validity conditions:
The normal approximation can be thought of as a prediction of what would occur if simulation was done. Many times this prediction is valid, but not always. The prediction is considered valid when there are at least 10 successes and 10 failures in the sample.

Theory: one proportion z-test

$z=\frac{\textrm{observed statistic-mean of null distribution}}{\textrm{standard deviation of null distribution}}$

Applied to the coin flip model, this corresponds to:

$z=\frac{\overset{\hat{}}{p}-\pi}{\sqrt{\frac{\pi(1-\pi)}{n}}}$

In the second experiment, Buzz got 16 correct choices out of 28 attempts.

$z=\frac{\frac{16}{28}-0.5}{\sqrt{\frac{0.5(1-0.5)}{28}}}$

$z=0.756$
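The calculation above is easy to reproduce; a minimal Python sketch (the function name is ours):

```python
import math

def one_proportion_z(successes, n, pi=0.5):
    """Standardized statistic: z = (p-hat - pi) / sqrt(pi * (1 - pi) / n)."""
    p_hat = successes / n
    return (p_hat - pi) / math.sqrt(pi * (1 - pi) / n)

print(round(one_proportion_z(16, 28), 3))  # 0.756
```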

Theory: one proportion z-test

Buzz got 16 correct choices out of 28 attempts

$z=0.756$

the observed proportion of correct choices is 0.756 standard deviations away from the mean of the null normal distribution

Theory: Binomial distribution

Note:
Mathematically, it is possible to calculate the exact probability of getting $\geq16$ heads in 28 tosses using the binomial distribution.
p = 0.2858

Theory: one proportion z-test

Note:
Mathematically, it is possible to calculate the exact probability of getting $\geq16$ heads in 28 tosses using the binomial distribution.
p = 0.2858
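The exact calculation just sums binomial probabilities. A minimal sketch using Python's `math.comb` (Python 3.8+; `binom_tail` is our own name):

```python
from math import comb

def binom_tail(n, k_min, pi=0.5):
    """Exact P(X >= k_min) for X ~ Binomial(n, pi)."""
    return sum(comb(n, k) * pi**k * (1 - pi)**(n - k) for k in range(k_min, n + 1))

print(round(binom_tail(28, 16), 4))  # 0.2858
```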

Which method gives the p-value closest to the true probability? Why?

Hypothesis testing for one proportion

Lots of assumptions & required validity conditions!

Resampling simulation:
1- Collect your sample and calculate your statistic of interest ($\hat{p}$)
2- State your null (H$_0$) and alternative (H$_a$) hypotheses
3- Directly simulate your null hypothesis distribution (this results in a resampling distribution centered at the stated H$_0$ proportion $\pi$)
4- Calculate the proportion of resampled samples that are at least as extreme as your initial observed statistic. This is your p-value.
5- Conclude about whether you have evidence in favor of or against the null hypothesis.

Normal approximation (Z-test):
1- Collect your sample and calculate your statistic of interest ($\hat{p}$)
2- State your null (H$_0$) and alternative (H$_a$) hypotheses
3- Consider the normal distribution centered at the long-run H$_0$ proportion $\pi$, with a standard deviation of $\sqrt{\frac{\pi(1-\pi)}{n}}$
4- Determine the proportion of area under the curve that is at least as extreme as your initial observed statistic. This is your p-value.
5- Conclude about whether you have evidence in favor of or against the null hypothesis.