Guillaume Calmettes
To answer a question about a population (of data).
In a perfect world, we would have access to the full population so we would be able to answer the question based on data collected on all the observational units of the population. We would know the value of the true parameter we are interested in.
But usually we just have one sample.
Statistics is about using that single sample to make inferences about the population parameter.
Let's consider that we are in charge of running a new health campaign that targets the students of a particular school. This school has 500 students. We want to know if using TV ads would be an efficient way of making the students aware of this new campaign.
One parameter that could be of interest to answer this question is the average time a student spends per quarter watching TV.
If we could collect data on all the students, then the data would look like this:
We only have a limited budget ($\$$60) to investigate our question ("What is the value of $\mu$?"), and it costs $\$$10/student to collect information.
We randomly select 6 students in the school to obtain a representative sample we can work with. From this sample (n=6) we calculate the average time spent per student watching TV per quarter (if the sample is representative, $\bar{x}\approx\mu$).
If we had randomly selected another sample, the value of $\bar{x}$ obtained would have been a bit different (but still $\bar{x}\approx\mu$).
Many different samples could be drawn from the population, each giving a slightly different value of the statistic of interest ($\bar{x}\approx\mu$). All of these possible values of the statistic define the sampling distribution of the statistic.
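A small simulation can make the idea concrete. This sketch assumes a made-up population of 500 TV-watching times (the real school data are not given here) and repeatedly draws samples of n=6 to build the sampling distribution of $\bar{x}$:

```python
import random

random.seed(1)

# Hypothetical population: TV time (hours/quarter) for 500 students.
# The values are invented for illustration only.
population = [random.gauss(120, 30) for _ in range(500)]
mu = sum(population) / len(population)  # true parameter

# Draw many random samples of n=6 and record each sample mean x-bar.
sample_means = []
for _ in range(10_000):
    sample = random.sample(population, 6)
    sample_means.append(sum(sample) / len(sample))

# The collection of x-bar values approximates the sampling distribution;
# its center should sit close to the true parameter mu.
center = sum(sample_means) / len(sample_means)
print(round(mu, 1), round(center, 1))
```

Each entry in `sample_means` is one of the "slightly different" $\bar{x}$ values the text describes; together they form the sampling distribution.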
Any statistical test/procedure relies on the sampling distribution:
In the case of null hypothesis significance testing (NHST), we are interested in the sampling distribution of the statistic of interest under the null hypothesis we are testing.
When we want to determine an estimate (confidence intervals) of a parameter of interest, we are interested in the sampling distribution of the sample statistic.
What is the probability that the statistic I calculated from my sample(s) comes from the sampling distribution of the null hypothesis? (p-value)
What would be a range of possible values for the parameter of interest? (This estimate depends on how variable the sampling distribution of the statistic is)
When we perform a null hypothesis test on a one-sample dataset, we are usually interested in knowing if the parameter of the population our sample is coming from is different from a specific value.
Categorical sample: the sampling distribution we are comparing our statistic to is usually linked to a chance model with a specific probability of success.
Quantitative sample: unless we have a specific model for the null hypothesis we want to compare our sample to, we cannot generate a reliable null sampling distribution without making strong assumptions.
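For the categorical case, a chance model is straightforward to simulate. This sketch assumes a null probability of success $\pi_0 = 0.5$, a sample size of n = 50, and an observed proportion of 0.68 (all illustrative values):

```python
import random

random.seed(2)

# Chance model for a categorical sample: each trial succeeds with the
# null probability pi_0.  These numbers are assumptions for the sketch.
pi_0, n = 0.5, 50

# Simulate the null sampling distribution of the sample proportion p-hat.
null_props = []
for _ in range(10_000):
    successes = sum(random.random() < pi_0 for _ in range(n))
    null_props.append(successes / n)

# Two-sided p-value for an observed proportion of 0.68: how often does
# the chance model produce a p-hat at least this far from pi_0?
obs = 0.68
p_value = sum(abs(p - pi_0) >= abs(obs - pi_0) for p in null_props) / len(null_props)
print(p_value)
```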
Ex: we have a machine generating series of numbers, and we want to know if a particular series of numbers that we found could have come from this machine. Here we can generate the sampling distribution of $\bar{x}$ (from samples obtained from the machine) and assess the likelihood that our original sample's $\bar{x}$ comes from this distribution.
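The machine example could be sketched as follows, with a uniform random-number process standing in for the machine and a made-up series of six numbers (both are assumptions; in practice we would query the real machine):

```python
import random

random.seed(3)

# Stand-in for the machine: it emits numbers uniformly on [0, 10].
def machine(k):
    return [random.uniform(0, 10) for _ in range(k)]

# The series we found (invented numbers), with its mean x-bar.
found = [8.1, 7.4, 9.0, 6.8, 8.6, 7.9]
obs_mean = sum(found) / len(found)

# Build the sampling distribution of x-bar for series of the same length.
null_means = []
for _ in range(10_000):
    series = machine(len(found))
    null_means.append(sum(series) / len(series))

# How often does the machine produce a mean at least as large as ours?
p_value = sum(m >= obs_mean for m in null_means) / len(null_means)
print(round(obs_mean, 2), round(p_value, 4))
```

A small p-value here would suggest the found series is unlikely to have come from the machine.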
When performing a two-sample null hypothesis test, the question is usually whether we can consider the two samples as different (coming from different populations). We can characterize this difference using the sampling distribution of a single statistic of interest:
Categorical samples: the difference in conditional proportions ($\hat{p}_1-\hat{p}_2$), or the relative risk ($\frac{\hat{p}_1}{\hat{p}_2}$).
Quantitative samples: the difference in means/medians ($\bar{x}_1-\bar{x}_2$, $m_1-m_2$) or the ratio of the means/medians ($\frac{\bar{x}_1}{\bar{x}_2}$, $\frac{m_1}{m_2}$).
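For two quantitative samples, the null sampling distribution of $\bar{x}_1-\bar{x}_2$ can be built by shuffling the pooled data, a permutation approach that encodes "both samples come from the same population" (the sample values below are made up):

```python
import random

random.seed(4)

# Two hypothetical quantitative samples (illustrative values only).
group_a = [12.1, 9.8, 11.5, 13.0, 10.2, 12.7]
group_b = [9.1, 8.4, 10.0, 9.6, 8.9, 10.3]
obs_diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

# Under the null, group labels are exchangeable: shuffle the pooled data
# and re-split to build the null sampling distribution of x1-bar - x2-bar.
pooled = group_a + group_b
null_diffs = []
for _ in range(10_000):
    random.shuffle(pooled)
    null_diffs.append(sum(pooled[:6]) / 6 - sum(pooled[6:]) / 6)

# Two-sided p-value: how often is a shuffled difference this extreme?
p_value = sum(abs(d) >= abs(obs_diff) for d in null_diffs) / len(null_diffs)
print(round(obs_diff, 2), round(p_value, 4))
```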
Several factors influence the resulting p-value:
The difference between the observed statistic and the null hypothesis parameter value ($\hat{p}-\pi_0$)
The sample size
Whether we do a one- or two-sided test
How accurate your sampling distribution is
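The effect of the one- vs two-sided choice can be seen directly in a simulation (assumed chance model with $\pi_0 = 0.5$, n = 30, and an observed proportion of 0.7; all values illustrative):

```python
import random

random.seed(5)

# Assumed null chance model and observed proportion.
pi_0, n, obs = 0.5, 30, 0.7

# Simulate the null sampling distribution of p-hat.
null_props = []
for _ in range(10_000):
    null_props.append(sum(random.random() < pi_0 for _ in range(n)) / n)

# One-sided: only deviations in the observed direction count as extreme.
one_sided = sum(p >= obs for p in null_props) / len(null_props)
# Two-sided: deviations on either side count, so the p-value is larger
# (roughly doubled here, since this null distribution is symmetric).
two_sided = sum(abs(p - pi_0) >= abs(obs - pi_0) for p in null_props) / len(null_props)
print(one_sided, two_sided)
```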
NHST is all nice and sweet, but in the end we spend time answering the question from the "wrong angle": instead of focusing on what our data are, we focus on where our data would stand if the null hypothesis were true.
Determining confidence intervals is a way of looking at the same thing as NHST, but using what we know (the observed statistic), not a hypothesized situation ("what would [something] be if [something]").
But again, this involves looking directly at the sampling distribution of your statistic of interest! (obtained directly from your data, not from a hypothesized null distribution)
We want to determine the sampling distribution of our statistic of interest, to know how variable it is and get an estimate of the true value of the parameter.
We do not have the full population, so we are using the sample as a surrogate for the population (bootstrap resampling, with replacement).
The sampling distribution obtained will be centered at the statistic of interest calculated from the original sample(s).
The $\alpha$ level chosen will determine which interval to consider.
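A minimal sketch of the bootstrap procedure, assuming a made-up sample of six TV-watching times and $\alpha = 0.05$ (percentile method):

```python
import random

random.seed(6)

# Original sample of n=6 TV times (hours/quarter; invented numbers).
sample = [95, 210, 130, 60, 180, 150]
obs_mean = sum(sample) / len(sample)

# Bootstrap: treat the sample as a surrogate population and resample it
# with replacement to approximate the sampling distribution of x-bar.
boot_means = []
for _ in range(10_000):
    resample = random.choices(sample, k=len(sample))
    boot_means.append(sum(resample) / len(resample))

# alpha = 0.05 -> keep the middle 95% of the bootstrap distribution.
boot_means.sort()
lo = boot_means[int(0.025 * len(boot_means))]
hi = boot_means[int(0.975 * len(boot_means)) - 1]
print(round(obs_mean, 1), (round(lo, 1), round(hi, 1)))
```

As the text notes, the bootstrap distribution is centered at the observed statistic, and the chosen $\alpha$ fixes which central interval we report.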
Estimate of the true value of the parameter
Can be used for hypothesis testing
Is the parameter of the null distribution inside the confidence interval?
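Using the interval for a test then reduces to a containment check; a sketch with an assumed interval and null value (both hypothetical):

```python
# Hypothetical 95% confidence interval for mu and a null value to test.
ci_low, ci_high = 98.3, 176.7
mu_null = 200.0

# If the null parameter falls outside the interval, we reject it at the
# corresponding alpha level (here alpha = 0.05).
reject = not (ci_low <= mu_null <= ci_high)
print(reject)  # True: 200.0 lies outside (98.3, 176.7)
```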
The different possible study designs
Collection of data:
Observational unit: one word
Variable of interest: length of the word
Analysis of data:
Observational unit: one set of 10 words
Variable of interest: average word length in the set
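The two levels of observational unit can be illustrated with a short excerpt of text (the word list below is just an illustrative stand-in for the actual study material):

```python
import random

random.seed(8)

# A small stand-in corpus; the words are illustrative only.
text = ("four score and seven years ago our fathers brought forth "
        "on this continent a new nation conceived in liberty").split()

# Collection: the observational unit is one word, the variable its length.
lengths = [len(w) for w in text]

# Analysis: the observational unit becomes one set of 10 words, and the
# variable of interest the average word length in that set.
set_means = []
for _ in range(1000):
    word_set = random.sample(text, 10)
    set_means.append(sum(len(w) for w in word_set) / 10)

print(round(sum(lengths) / len(lengths), 2),
      round(min(set_means), 2), round(max(set_means), 2))
```

Note how the unit changes between the two stages: lengths are collected per word, but each analyzed data point is the mean of a 10-word set.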