Stats 13

Lecture 9

Recap & AMA

Guillaume Calmettes

Goal of statistics

To answer a question about a population (of data).

In a perfect world, we would have access to the full population so we would be able to answer the question based on data collected on all the observational units of the population. We would know the value of the true parameter we are interested in.

But usually we just have one sample.

Statistics is about using that single sample to make inferences about the population parameter.

Population data

Let's consider that we are in charge of running a new health campaign that targets the students of a particular school. This school has 500 students. We want to know if using TV ads would be an efficient way of making the students aware of this new campaign.

One parameter that could be of interest to answer this question is the average time a student spends per quarter watching TV.

If we could collect data on all the students, then the data would look like this:

Sampling distribution

We only have a limited budget ($\$$60) to investigate our question ("What is the value of $\mu$?"), and it costs $\$$10/student to collect information.

We randomly select 6 students in the school to obtain a representative sample we can work with. From this sample (n=6) we calculate the average time spent per student watching TV per quarter (if the sample is representative, $\bar{x}\approx\mu$).

If we had randomly selected another sample, the value of $\bar{x}$ obtained would have been a bit different (but still $\bar{x}\approx\mu$).

Lots of possible different samples from the population means lots of slightly different values for the statistic of interest (each with $\bar{x}\approx\mu$).
All of these possible values of the statistic define the sampling distribution of the statistic.
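The idea above can be simulated directly. This is a minimal sketch with a made-up population (the lecture does not give the real data): we invent TV-watching times for 500 students, repeatedly draw random samples of n=6, and look at how the sample means $\bar{x}$ spread around the true $\mu$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: TV-watching hours per quarter for 500 students.
# (Illustrative numbers only -- not the course's actual dataset.)
population = rng.gamma(shape=2.0, scale=30.0, size=500)
mu = population.mean()  # the true parameter, normally unknown to us

# Draw many random samples of n=6 and record each sample mean x-bar.
xbars = np.array([
    rng.choice(population, size=6, replace=False).mean()
    for _ in range(10_000)
])

print(f"true mu        = {mu:.1f}")
print(f"mean of x-bars = {xbars.mean():.1f}")  # centered near mu
print(f"SD of x-bars   = {xbars.std():.1f}")   # variability of the statistic
```

The collection `xbars` is (an approximation of) the sampling distribution of $\bar{x}$: centered at $\mu$, with a spread that shrinks as n grows.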

Statistical test/procedure

Any statistical test/procedure relies on the sampling distribution:

In the case of null hypothesis significance testing, we are interested in the sampling distribution of the statistic of interest under the null hypothesis we are testing.

When we want to determine an estimate (confidence intervals) of a parameter of interest, we are interested in the sampling distribution of the sample statistic.

What is the probability that the statistic I calculated from my sample(s) comes from the sampling distribution of the null hypothesis? (p-value)

What would be a range of possible values for the parameter of interest? (This estimate depends on how variable the sampling distribution of the statistic is)

NHST: One-sample test

When we perform a null hypothesis test on a one-sample dataset, we are usually interested in knowing if the parameter of the population our sample is coming from is different from a specific value.

Categorical sample: the sampling distribution we are comparing our statistic to is usually linked to a chance model with a specific probability of success.

Quantitative sample: unless we have a specific model for the null hypothesis to compare our sample to, we cannot generate a reliable null sampling distribution without making strong assumptions.

Ex: we have a machine generating series of numbers, and we want to know if a particular series of numbers we found could have come from this machine.
We generate the sampling distribution of $\bar{x}$ (from samples obtained from the machine) and assess the likelihood that the $\bar{x}$ of our original series comes from this distribution.
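A minimal sketch of this machine example, with made-up numbers: we assume (purely for illustration) that the machine emits uniform numbers in $[0, 10]$, build the null sampling distribution of $\bar{x}$ by running the machine many times, and compute the two-sided p-value for an observed series.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "machine": assumed here to emit uniform numbers in [0, 10].
def machine(n, rng):
    return rng.uniform(0, 10, size=n)

observed = np.array([7.1, 8.3, 6.9, 7.7, 9.0, 6.5])  # made-up series to test
obs_xbar = observed.mean()

# Null sampling distribution of x-bar: means of many machine-made series.
null_xbars = np.array([machine(observed.size, rng).mean()
                       for _ in range(10_000)])

# Two-sided p-value: how often does a machine-made x-bar land at least as
# far from the machine's mean (5.0) as our observed x-bar did?
p = np.mean(np.abs(null_xbars - 5.0) >= abs(obs_xbar - 5.0))
print(f"observed x-bar = {obs_xbar:.2f}, p-value = {p:.4f}")
```

A small p-value says our series' $\bar{x}$ sits in the tails of what the machine produces, which is evidence against the series coming from this machine.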

NHST: Two-sample test

When performing a two-sample null hypothesis test, the question is usually whether we can consider the two samples as different (coming from different populations). We can characterize this difference using the sampling distribution of a single statistic of interest:

Categorical samples: the difference in conditional proportions ($\hat{p}_1-\hat{p}_2$), or the relative risk ($\frac{\hat{p}_1}{\hat{p}_2}$).

Quantitative samples: the difference in mean/median ($\bar{x}_1-\bar{x}_2$, $m_1-m_2$) or the ratio of the mean/median ($\frac{\bar{x}_1}{\bar{x}_2}$, $\frac{m_1}{m_2}$)
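For the quantitative case, the null sampling distribution of $\bar{x}_1-\bar{x}_2$ can be built by permutation: under the null hypothesis the group labels are exchangeable, so we shuffle the pooled values and re-split. A minimal sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical quantitative samples from two groups (made-up values).
group1 = np.array([12.0, 15.5, 11.2, 14.8, 13.9, 16.1])
group2 = np.array([10.1, 11.4, 9.8, 12.0, 10.7, 11.9])
obs_diff = group1.mean() - group2.mean()

# Under H0, labels are exchangeable: shuffle the pool and re-split it.
pooled = np.concatenate([group1, group2])
n1 = group1.size
null_diffs = np.empty(10_000)
for i in range(10_000):
    perm = rng.permutation(pooled)
    null_diffs[i] = perm[:n1].mean() - perm[n1:].mean()

# Two-sided p-value for the difference in means.
p = np.mean(np.abs(null_diffs) >= abs(obs_diff))
print(f"observed difference = {obs_diff:.2f}, p-value = {p:.4f}")
```

The same loop works for a difference in medians or a ratio of means; only the statistic computed on each shuffled split changes.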

What influences the p-value?

Difference between the observed statistic and the null hypothesis parameter value ($\hat{p}-\pi_0$)

Sample size

Whether we do a one- or two-sided test

How accurate your sampling distribution is
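The sample-size effect is easy to see by simulation. This sketch (made-up numbers) keeps the observed proportion fixed at $\hat{p}=0.60$, tests it against $\pi_0=0.50$ with a simulated coin-flip chance model, and lets n grow:

```python
import numpy as np

rng = np.random.default_rng(3)

# Same observed proportion (p-hat = 0.60) tested against pi_0 = 0.50
# at several sample sizes: the p-value shrinks as n grows.
def simulated_p_value(n, phat=0.60, pi0=0.50, reps=10_000):
    successes = round(phat * n)
    # Null sampling distribution of the count: simulated coin flips.
    null_counts = rng.binomial(n, pi0, size=reps)
    # Two-sided p-value: counts at least as far from the expected n*pi0.
    return np.mean(np.abs(null_counts - n * pi0) >= abs(successes - n * pi0))

for n in (20, 100, 500):
    print(f"n = {n:4d}  ->  p-value ~ {simulated_p_value(n):.4f}")
```

The same $\hat{p}-\pi_0$ gap that is unremarkable at n=20 becomes strong evidence at n=500, because the null sampling distribution tightens around $\pi_0$.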

So what's the big deal with Confidence intervals?

NHST is all nice and sweet, but in the end we spend our time answering the question from the "wrong angle": instead of focusing on what our data are, we focus on where our data would stand if the null hypothesis were true.

Determining confidence intervals is a way of looking at the same thing as NHST, but using what we know (the observed statistic) rather than a hypothesized situation ("what would [something] be if [something]").

But again, this involves looking directly at the sampling distribution of your statistic of interest! (obtained directly from your own data, not from a hypothesized null distribution)

Confidence intervals

We want to determine the sampling distribution of our statistic of interest, to know how variable it is and get an estimate of the true value of the parameter.

We do not have the full population, so we are using the sample as a surrogate for the population (bootstrap resampling, with replacement).

The sampling distribution obtained will be centered at the statistic of interest calculated from the original sample(s).

The $\alpha$ level chosen will determine which interval to consider.

Estimate of the true value of the parameter

Can be used for hypothesis testing
Is the parameter of the null distribution inside the confidence interval?
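The whole procedure fits in a few lines. A minimal sketch with a made-up n=6 sample: resample it with replacement to approximate the sampling distribution of $\bar{x}$, take the middle 95% as the confidence interval ($\alpha=0.05$), and check whether a hypothetical null value sits inside it.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical sample of n=6 TV-watching hours per quarter (made-up values).
sample = np.array([35.0, 80.0, 52.0, 110.0, 64.0, 45.0])

# Bootstrap: resample the sample WITH replacement, many times,
# using it as a surrogate for the population.
boot_xbars = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# 95% percentile confidence interval (alpha = 0.05).
lo, hi = np.percentile(boot_xbars, [2.5, 97.5])
print(f"x-bar = {sample.mean():.1f}, 95% CI ~ ({lo:.1f}, {hi:.1f})")

# Hypothesis-testing use: is a null value (say mu_0 = 100, hypothetical)
# inside the confidence interval?
print("mu_0 = 100 inside CI:", lo <= 100 <= hi)
```

Note that the bootstrap distribution is centered at the observed $\bar{x}$, not at $\mu$: it estimates the variability of the statistic, which is what the interval's width reflects.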

Random sampling is king in StatisticsLand

The different possible study designs

Note on observational units & Variables

Collection of data:
Observational unit: one word
Variable of interest: length of the word

Analysis of data:
Observational unit: one set of 10 words
Variable of interest: average word length in the set

Ask me anything