Guillaume Calmettes
Null Hypothesis Significance Testing (NHST) for
one proportion
- Resampling p-value
- One proportion z-test (p-value)
Limits of the p-value
Many factors can influence the p-value (sample size, one-sided/two-sided considerations, etc.)
Confidence Intervals (CI) as a more informative alternative to the p-value (95% CI)
It can be obtained directly using the characteristics of the data sample we have (one way: Bootstrap, the data is our population)
Two ways of looking at the same thing.
NHST
95% CIs
One of the most important ideas in statistics is that we can learn a lot about a large group (called a population) by studying a small piece of it (called a sample).
Statistical inference is the process of using data from a sample to gain information about the population. If we are using the sample to infer about the population, we have to take precautions to ensure our sample is representative of our population.
How can we obtain a statistic that we trust to be reasonably close to the actual (but unknown to us) value of the parameter?
Let's do a little experiment!
In most studies, we do not have access to the entire population and can only consider a sample from this population.
https://goo.gl/cYUbFn
1- Select a sample of 10 representative words
2- Record the number of letters in each of the ten words in your sample
3- Calculate the average (mean) number of letters in your ten words. Enter this value in the spreadsheet in the cell assigned to your name (please only fill in the value for your name!)
The day after the 1948 presidential election, the Chicago Tribune ran the headline “Dewey Defeats Truman”. However, Harry S Truman defeated Thomas E. Dewey to become the 33rd president of the United States.
What problem could have occurred?
The newspaper went to press before all the results had come in, and the headline was based partly on the results of a large telephone poll which showed Dewey sweeping Truman.
But when the dust settled, Truman easily defeated Dewey in the Electoral College, by a 303 to 189 margin. Truman also won the popular vote, 50 percent to 45 percent.
What is the sample and what is the population?
The sample is all the people who participated in the telephone poll. The population is all voting Americans.
What did the pollsters want to infer about the population based on the sample?
To estimate the percentage of all voting Americans who would vote for each candidate.
Why do you think the telephone poll yielded such inaccurate results?
People with telephones in 1948 were not representative of all American voters. People with telephones tended to be wealthier and to prefer Dewey, while people without phones tended to prefer Truman.
Sampling bias occurs when the method of selecting a sample causes the sample to differ from the population in some relevant way. If sampling bias exists, then we cannot trust generalizations from the sample to the population.
To avoid sampling bias, we try to obtain a sample that is representative of the population. A representative sample resembles the population, only in smaller numbers. The more representative a sample is, the more valuable the sample is for making inferences about the population.
Since a representative sample is essential for drawing valid inference to the population, you are probably wondering how to select such a sample! The key is random sampling.
A simple random sample gives every observational unit in the population the same chance of being selected.
Taking a simple random sample avoids sampling bias.
Although the principle of simple random sampling is probably clear, it is by no means simple to implement.
The key to obtaining a representative sample is using some type of random mechanism to select the observational units from the population rather than relying on convenience sample or any type of human judgment.
=> We must use a formal random sampling method, typically implemented with technology (e.g., a random number generator).
The first step is to obtain a sampling frame, in which each observational unit of the population is assigned a number.
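As a sketch of what "technology" means here, the few lines below (Python, with a hypothetical frame size and sample size) draw a simple random sample of unit numbers from a numbered sampling frame:

```python
import numpy as np

population_size = 32578   # e.g., one number per unit in the sampling frame (hypothetical)
sample_size = 100         # hypothetical target sample size

rng = np.random.default_rng(42)  # seeded only so the example is reproducible
# Simple random sample: each unit has the same chance of selection,
# drawn WITHOUT replacement so no unit can be picked twice
sampled_ids = rng.choice(population_size, size=sample_size, replace=False)
print(sorted(sampled_ids)[:10])
```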
One still gets the occasional "unlucky" sample whose results are not close to the population even with large sample sizes.
The sample size means little if the sampling method is not random. In 1936, the Literary Digest magazine had a huge sample of 2.4 million people, yet their predictions for the Presidential election did not come close to the truth about the population.
Although the role of sample size is crucial in assessing how close the sample results will be to the population results, the size of the population does not affect this. As long as the population is large relative to the sample size (at least 10 times as large), the precision of a sample statistic depends on the sample size but not on the population size.
While a random sample is ideal, often it may not be achievable:
- a list of the entire population may not exist
- it may be impossible to contact some members of the population
- or it may be too expensive or time consuming to do so
Often we must make do with whatever sample is convenient. The study can still be worth doing, but we have to be very careful when drawing inferences to the population and should at least try to avoid obvious sampling bias as much as possible.
When it is difficult to take a random sample from the population of interest, we may have to redefine the population to which we generalize.
Bias exists when the method of collecting data causes the sample data to inaccurately reflect the population.
Bias can occur when people we have selected to be in our sample choose not to participate.
If the people who choose to respond would answer differently than the people who choose not to respond, results will be biased.
The way questions are worded can also bias the results.
In 1941 Daniel Rugg asked people the same question in two different ways.
“Do you think that the United States should allow public speeches against democracy?“
=> 21% said the speeches should be allowed.
“Do you think that the United States should forbid public speeches against democracy?“
=> 39% said the speeches should not be forbidden.
=> Merely changing the wording of the question nearly doubled the percentage of people in favor of allowing (not forbidding) public speeches against democracy.
Bias exists when the method of collecting data causes the sample data to inaccurately reflect the population.
The most important message is to always think critically about the way data are collected and to recognize that not all methods of data collection lead to valid inferences. Recognizing sources of bias is often simply common sense.
UCLA salaries (full time employee, 2014)
Population data (n=32578)
We've seen that usually, we do not have the luxury of getting access to the full true population, and we have to rely on a sample (from this population) to make inferences about the characteristics of the full population.
Let's consider that we have obtained a random sample of 100 UCLA salaries.
We want to know how UCLA salaries in 2014 compared to the National average.
US salary data (Year: 2014)
Population: 15 years old and over, all races, both sexes
Mean: $62,931
Median: $46,480
Contrary to what we have done with categorical (proportion) data, in the case of a quantitative data sample there is not really any probabilistic model we can simulate to test our data against. We just know the mean & median values we want to compare our data to.
Would a theoretical model be good anyway?
=> Making inferences directly from the data sample we have is a more robust approach that does not require any distributional assumption.
Repeatedly drawing bootstrap samples from the data sample to compute the 95% confidence interval of the population mean ($\mu$) provides a simple and robust way to estimate the variability of $\bar{x}$ and to approximate the sampling distribution of $\bar{x}$ using just the information in that one sample.
Sample statistic: $\bar{x}$ (sample mean)
95% CI: central $95\%$ of the bootstrap $\bar{x}$ distribution
Sample: UCLA salaries ($n=100$)
Each bootstrap sample: $n=100$, drawn WITH replacement from the original sample (a given original observation may be selected $1$ time, $2$ times, or $\geq3$ times)
Statistic calculated: mean ($\bar{x}$)
Bootstrap distribution: 10000 bootstrap $\bar{x}$ values
95% CI: [69 280, 94 210]
Inference:
UCLA mean salary > US mean salary
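The procedure above can be written in a few lines of Python. The sketch below is illustrative only: the `salaries` array is a simulated stand-in for the real n=100 UCLA sample, so its interval will differ from the one reported above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-in for the real n=100 UCLA salary sample (illustration only)
salaries = rng.lognormal(mean=11, sigma=0.6, size=100)
n = len(salaries)

# 10000 bootstrap samples: each of size n, drawn WITH replacement from the original sample
boot_means = np.array([rng.choice(salaries, size=n, replace=True).mean()
                       for _ in range(10000)])

# 95% CI = central 95% of the bootstrap distribution of the sample mean
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: [{ci_low:.0f}, {ci_high:.0f}]")
```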
In a similar way to the calculation of a $z$ statistic for a proportion, it is possible to compute a standardized statistic for a sample mean (quantitative data), by measuring how far the sample mean is from a hypothesized population mean, in terms of the standard deviation of the sampling distribution of the mean.
$t=\frac{\textrm{sample mean}-\textrm{hypothesized population mean}}{\sigma\textrm{ of the sampling distribution of the mean}}$
Hypothesized mean: $\mu_0 = \textrm{value to test}$
$\sigma$ of the sampling distribution of the mean (Standard Error): $SE=\frac{\sigma}{\sqrt{n}}$
Central limit theorem for sample means:
If the sample size $n$ is large, the distribution of sample means (the sampling distribution) from a population with mean $\mu$ and standard deviation $\sigma$ is approximately normal, with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$ (the standard error of the sample means).
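A quick simulation can illustrate this result; the skewed "population" below is made up purely for illustration and is not the UCLA data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed population (illustration only)
population = rng.exponential(scale=50_000, size=1_000_000)
mu, sigma = population.mean(), population.std()

n = 100
# Draw many samples of size n and keep each sample mean
sample_means = np.array([rng.choice(population, size=n).mean() for _ in range(5_000)])

# The sample means center on mu, with spread roughly sigma / sqrt(n)
print(sample_means.mean(), mu)
print(sample_means.std(), sigma / np.sqrt(n))
```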
However, in practice we don't usually know the standard deviation $\sigma$ of the population of interest; we usually only have the information from a sample of that population. We can simply substitute $s$, the standard deviation of the sample, when estimating the standard error of the sample means.
$SE=\frac{s}{\sqrt{n}}$
When we use the $SE$ based on $\frac{s}{\sqrt{n}}$ to standardize the sample mean, the distribution is no longer the standard normal, but rather the $t$-distribution. The standardized statistic is then commonly denoted $t$.
The shape of a $t$-distribution looks a lot like that of a normal distribution, but it is a bit more spread out (more observations in the "tails", fewer in the middle).
A key fact about the $t$-distribution is that it depends on the size of the sample ($n$). The sample size is reflected in a parameter called the degrees of freedom ($df$) of the $t$-distribution. When working with $\bar{x}$ for a sample of size $n$, we use a $t$-distribution with $df=n-1$.
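As a quick check of these two points, the snippet below (scipy assumed available) compares the normal cutoff for the central 95% with the $t$ cutoff for two sample sizes:

```python
from scipy import stats

n = 100
print(stats.norm.ppf(0.975))          # ~1.96: normal cutoff for the central 95%
print(stats.t.ppf(0.975, df=n - 1))   # slightly larger: the t-distribution has heavier tails
print(stats.t.ppf(0.975, df=9))       # smaller n (fewer df) -> even more spread out
```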
Salary data
US average ($\mu_0$)
$\$62931$
UCLA sample ($n=100$):
$\bar{x}=\$80919$ / $s=\$63956$
$t$-statistic: $t=\frac{\bar{x}-\mu_0}{\frac{s}{\sqrt{n}}}$
$t=\frac{80919-62931}{\frac{63956}{\sqrt{100}}}$
$t=2.81$
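The same computation in code, using only the summary statistics above; the two-sided p-value line is an addition of ours, evaluated against a $t$-distribution with $df=n-1$:

```python
import numpy as np
from scipy import stats

x_bar, s, n = 80919, 63956, 100   # UCLA sample summaries from the slide
mu_0 = 62931                      # hypothesized mean (US average)

se = s / np.sqrt(n)               # estimated standard error of the sample mean
t = (x_bar - mu_0) / se
p_two_sided = 2 * stats.t.sf(abs(t), df=n - 1)

print(round(t, 2))                # 2.81
print(round(p_two_sided, 4))
```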
Because the theoretical sampling distribution of the mean follows a $t\textrm{-distribution}$ that behaves more or less like a normal distribution, we expect $95\%$ of the sample means to lie within $2$ standard errors on each side of the true mean (more precisely, $1.96$ standard errors).
The theoretical $95\%$ confidence interval for the population mean can then be approximated using the sample mean ($\bar{x}$) and the lower and upper limits below:
- lower limit: $\bar{x}-1.96\times SE$
- upper limit: $\bar{x}+1.96\times SE$
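Applied to the UCLA summaries, a minimal sketch (using the normal cutoff $1.96$; the exact $t$ cutoff for $df=99$ is very slightly larger):

```python
import numpy as np

x_bar, s, n = 80919, 63956, 100
se = s / np.sqrt(n)

lower = x_bar - 1.96 * se
upper = x_bar + 1.96 * se
print(round(lower), round(upper))   # approximate theoretical 95% CI for the population mean
```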
Validity condition for a one-sample $t$-test: the quantitative variable should have a symmetric distribution, or should have at least 20 observations and a sample distribution that is not strongly skewed. If the sample size is small and the data are heavily skewed or contain outliers, the $t$-distribution should not be used.
The median is defined by a procedure (ordering the data), and cannot be described with an equation.
WE CANNOT STUDY THE MEDIAN USING A THEORETICAL APPROACH
We can easily calculate the median of each of our bootstrap samples.
A RESAMPLING APPROACH CAN BE USED TO STUDY THE MEDIAN!
Sample statistic: $m$ (sample median)
95% CI: central $95\%$ of the bootstrap $m$ distribution
Sample: UCLA salaries ($n=100$)
Each bootstrap sample: $n=100$, drawn WITH replacement from the original sample (a given original observation may be selected $1$ time, $2$ times, or $\geq3$ times)
Statistic calculated: median ($m$)
Bootstrap distribution: 10000 bootstrap $m$ values
95% CI: [52 654, 67 276]
Inference:
UCLA median salary > US median salary
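Compared with the bootstrap sketch for the mean earlier, the only change is the statistic computed on each bootstrap sample; `salaries` is again a simulated placeholder for the real data:

```python
import numpy as np

rng = np.random.default_rng(1)
salaries = rng.lognormal(mean=11, sigma=0.6, size=100)  # simulated placeholder sample
n = len(salaries)

# Same resampling scheme as for the mean, but the statistic is now the median
boot_medians = np.array([np.median(rng.choice(salaries, size=n, replace=True))
                         for _ in range(10000)])

ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(f"95% bootstrap CI for the median: [{ci_low:.0f}, {ci_high:.0f}]")
```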
Sample statistic: [any fancy descriptor]
95% CI: central $95\%$ of the bootstrap [fancy descriptor] distribution
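A possible way to express this generalization in code: a small helper (our own naming, not from the slides) that accepts any statistic function and returns the central 95% of its bootstrap distribution:

```python
import numpy as np

def bootstrap_ci(sample, statistic, n_boot=10000, ci=95, rng=None):
    """Central `ci`% of the bootstrap distribution of `statistic` for `sample`."""
    rng = rng or np.random.default_rng()
    sample = np.asarray(sample)
    boots = np.array([statistic(rng.choice(sample, size=sample.size, replace=True))
                      for _ in range(n_boot)])
    alpha = (100 - ci) / 2
    return np.percentile(boots, [alpha, 100 - alpha])

# Any "fancy descriptor" works, as long as it maps a sample to a single number:
# bootstrap_ci(salaries, np.mean), bootstrap_ci(salaries, np.median),
# bootstrap_ci(salaries, lambda x: np.percentile(x, 90)), ...
```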