Guillaume Calmettes
1) You have a question
Questions are always about population data.
2) Data analysis
You cannot always access the full population data and have to rely instead on a small part of it, the sample data, which you will analyze.
3) Answer the question
Using the new knowledge gained from the sample analysis, you can draw conclusions (under some conditions) about the population data. This is the process of making inference.
Collect data / Design experiment
Visualize data (important!). Distribution? Skewness? Outliers?
Choose a statistic to describe the sample (will depend on skewness, etc ...)
Formulate hypotheses
Analysis
We obtain information about (observational) units.
A variable is any characteristic that is recorded on each unit. It can be categorical or quantitative.
When selecting data, the goal is to select a sample that is representative of the population, only smaller.
Sampling bias occurs when the method of selecting a sample causes the sample to differ from the population in some relevant way.
Sampling bias:
when a sampling method systematically yields results that are either too high or too low.
can be avoided by using good sampling technique (random).
Sampling variation:
natural variation in results from one random sample to the next.
can be reduced by using a larger sample (see the sketch below).
If we want to show relationships (association) between several variables, we are analyzing the effects of an explanatory variable on a response variable.
A third variable that is associated with both the explanatory variable and the response variable is called a confounding variable.
Confounding variables can be present in both observational studies and randomized experiments, but:

| Observational study | Randomized experiment |
| --- | --- |
| The independent (explanatory) variable is not under the control of the researcher, because of ethical concerns or logistical constraints. | The explanatory variable for each unit is determined randomly, before the response variable is measured. |
| Because the explanatory variable assignment is not randomized, the effects of confounding variables are almost always present in observational studies. | Because the explanatory variable is randomly assigned, it is not associated with any other variables. The effects of confounding variables are most likely eliminated! |
| Observational studies can almost never be used to establish causation. | Randomized experiments make it possible to infer causation! |
In order to make sense of data, we need ways to summarize and visualize it.
Summarizing and visualizing variables and relationships between two variables is known as descriptive statistics (also called exploratory data analysis).
Type of summary statistics and visualization methods depend on the type of variable(s) being analyzed (categorical or quantitative).
| Variable(s) | Visualization | Summary statistics |
| --- | --- | --- |
| Categorical | Bar chart, pie chart | (Relative) frequency table, proportion |
| Quantitative | Dotplot, histogram, boxplot | Mean, median, max, min, standard deviation, range, IQR, five-number summary |
| Categorical vs categorical | Side-by-side bar chart, segmented bar chart | Two-way table, difference in proportions, ratio of the proportions (relative risk) |
| Quantitative vs categorical | Side-by-side boxplots, side-by-side dotplots | Statistics for each group, difference in means/medians, ratio of the means/medians |
| Quantitative vs quantitative | Scatterplot | Correlation, regression line |
A sampling distribution is the distribution of statistics computed for different samples of the same size taken from the same population.
The spread of the sampling distribution helps us assess the uncertainty in the sample statistic. (Note: the standard deviation of the sampling distribution is the standard error of the statistic.)
We rarely get to see the sampling distribution; we usually only have one sample.
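When we can simulate a population, we can build a sampling distribution directly. A minimal sketch, using a synthetic skewed population (all numbers are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=2.0, size=100_000)  # skewed "population"
n = 40

# Sampling distribution: the mean of each of 5000 random samples of size n
# (sampling with replacement here; indistinguishable from without
#  replacement when the population is this large)
sample_means = np.array([rng.choice(population, size=n).mean()
                         for _ in range(5000)])

# The SD of the sampling distribution is the standard error of the statistic
print("standard error (simulated):", sample_means.std())
print("theoretical sigma/sqrt(n): ", population.std() / np.sqrt(n))
```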
A bootstrap sample is a random sample taken with replacement from the original sample, of the same size as the original sample.
A bootstrap statistic is the statistic computed on the bootstrap sample.
A bootstrap distribution is the distribution of many bootstrap statistics.
The bootstrap distribution will be centered at the statistic of the original sample from which the bootstrap samples were drawn.
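A minimal bootstrap sketch (the sample values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
sample = np.array([12.1, 9.8, 14.3, 11.0, 10.5, 13.7, 9.2, 12.9])

# Each bootstrap sample: same size as the original, drawn WITH replacement;
# each bootstrap statistic: the mean of one bootstrap sample
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# The bootstrap distribution is centered at the original sample statistic
print("original sample mean:            ", sample.mean())
print("center of bootstrap distribution:", boot_means.mean())
```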
A confidence interval for a parameter is an interval computed from sample data by a method that will capture the parameter for a specified proportion of all samples.
The parameter is fixed (from population)
The statistic is random (depends on the sample)
The interval is random (depends on the statistic)
A 95% confidence interval will contain the true parameter for 95% of all samples
Drawing bootstrap samples from a sample is one method of computing a confidence interval.
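Continuing the sketch above, a 95% percentile bootstrap interval takes the middle 95% of the bootstrap distribution (data again illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
sample = np.array([12.1, 9.8, 14.3, 11.0, 10.5, 13.7, 9.2, 12.9])

boot_means = np.array([rng.choice(sample, size=sample.size, replace=True).mean()
                       for _ in range(10_000)])

# Middle 95% of the bootstrap distribution = percentile confidence interval
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% CI for the population mean: ({lo:.2f}, {hi:.2f})")
```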
All the statistical problems we have seen can be addressed using two main approaches:
Null Hypothesis Testing
How unusual would it be to get results as extreme as (or more extreme than) those observed, if the null hypothesis is true?
95% confidence intervals
What is an estimate of the range of possible values of the true parameter? Is the null hypothesis parameter contained in this interval?
Note: these two approaches look at the same thing, but from different angles.
A randomization distribution is the distribution of sample statistics we would observe, just by random chance, if the null hypothesis were true.
The p-value is the probability of getting a statistic as extreme as (or more extreme than) the one observed, just by random chance, if the null hypothesis is true.
The p-value measures evidence against the null hypothesis
The p-value is calculated by finding the proportion of statistics in the randomization distribution that fall beyond (in one or both directions) the observed statistic.
A small p-value casts doubt on the null hypothesis/model used to perform the calculation.
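A sketch for the simplest case, one proportion, testing $H_0: \pi=0.5$ against $H_a: \pi>0.5$ (the counts, 34 successes out of 50, are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, successes = 50, 34
p_hat = successes / n

# Randomization distribution: sample proportions simulated under the null
null_p_hats = rng.binomial(n, 0.5, size=10_000) / n

# p-value: proportion of null statistics as extreme as (or more than) observed
p_value = np.mean(null_p_hats >= p_hat)
print(f"p-hat = {p_hat:.2f}, p-value = {p_value:.4f}")
```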
A p-value is generally considered to be evidence against the null as follows:

| p-value | Evidence against the null |
| --- | --- |
| $\leq{0.10}$ | some |
| $\leq{0.05}$ | fairly strong |
| $\leq{0.01}$ | very strong |
| $\leq{0.001}$ | extremely strong |
Formal decisions:
p-value $<\alpha$: Reject $H_0$
p-value $\geq\alpha$: Do not reject $H_0$
Errors can happen!
| Truth \ Decision | Reject H$_0$ | Do not reject H$_0$ |
| --- | --- | --- |
| H$_0$ true | Type I error | correct decision |
| H$_0$ false | correct decision | Type II error |
| Situation | Statistic of interest | Null |
| --- | --- | --- |
| One categorical variable | Proportion ($\hat{p}$) | $\pi_0=x$ |
| One quantitative variable | Mean ($\bar{x}$), median ($m$), min, max, etc ... | $\mu_0=x$ |
| Comparing two categorical variables | Difference in proportions ($\hat{p}_1-\hat{p}_2$), ratio of the proportions (relative risk), $\chi^2$ | $\pi_1-\pi_2=0$ ($rr=1$) |
| Comparing two quantitative variables | Difference in means ($\bar{x}_1-\bar{x}_2$), difference in medians, ratio of the means, ratio of the medians, etc ... | $\mu_1-\mu_2=0$ (ratio $=1$) |
| Comparing more than 2 categorical variables | $\chi^2$, MAD (of the proportions) | $MAD=0$, $\chi^2=0$ |
| Comparing more than 2 quantitative variables | MAD (of the means, of the medians), $F$-statistic | $MAD=0$, $F=0$ |
| Association of 2 quantitative variables | Correlation coefficient ($r$), regression slope ($m$) | $\rho=0$, $\beta_1=0$ |
1. State hypotheses ($H_0$, $H_a$)
2. Calculate the statistic of interest from your sample(s) data
3. Construct the randomization distribution of the statistic of interest (the distribution of the statistic we would observe if the null hypothesis were true):
- One variable (categorical): simulate random samples under the null ($\pi_0=x$)
- $\geq$2 variables: pool the samples together ($H_0$: no difference), shuffle and split into random groups (same sizes as the original samples), and calculate the new statistic of interest from this randomization
4. Measure how extreme your test statistic from (2) is compared to the distribution generated in (3): this proportion is the p-value.
Examples: comparing proportions; comparing means (medians, etc ...). A minimal sketch of the two-group case is given below.
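This sketch uses a difference in means with invented group values: pool, shuffle, split into groups of the original sizes, and recompute the statistic many times.

```python
import numpy as np

rng = np.random.default_rng(3)
group1 = np.array([5.2, 6.1, 7.4, 5.9, 6.8, 7.0])
group2 = np.array([4.1, 5.0, 4.8, 5.5, 4.3, 5.1])

observed = group1.mean() - group2.mean()
pooled = np.concatenate([group1, group2])  # H0: no difference -> pool

null_stats = np.empty(10_000)
for i in range(null_stats.size):
    shuffled = rng.permutation(pooled)                 # shuffle ...
    null_stats[i] = (shuffled[:group1.size].mean()     # ... and split into
                     - shuffled[group1.size:].mean())  # same-size groups

# Two-sided p-value: how extreme is the observed difference under H0?
p_value = np.mean(np.abs(null_stats) >= abs(observed))
print(f"observed diff = {observed:.2f}, p-value = {p_value:.4f}")
```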
1. State hypotheses ($H_0$, $H_a$)
2. Calculate the statistic of interest from your sample(s) data
3. Construct the bootstrap distribution of the statistic of interest (the distribution of the statistic if more samples were acquired):
- One variable: draw bootstrap samples from the original sample (same size)
- $\geq$2 variables: draw independent bootstrap samples (same sizes as the original samples) and calculate the new statistic of interest from these bootstrap samples
4. Take the middle 95% of the bootstrap distribution: is the null hypothesis parameter contained in this interval?
Examples: comparing proportions; comparing means. A sketch of the two-group case is given below.
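A sketch of the $\geq$2-variables case (difference in means, same illustrative data as the randomization sketch): draw independent bootstrap samples from each group and take the middle 95% of the differences.

```python
import numpy as np

rng = np.random.default_rng(3)
group1 = np.array([5.2, 6.1, 7.4, 5.9, 6.8, 7.0])
group2 = np.array([4.1, 5.0, 4.8, 5.5, 4.3, 5.1])

# Independent bootstrap samples (same sizes as the original samples)
boot_diffs = np.array([
    rng.choice(group1, size=group1.size, replace=True).mean()
    - rng.choice(group2, size=group2.size, replace=True).mean()
    for _ in range(10_000)
])

lo, hi = np.percentile(boot_diffs, [2.5, 97.5])
# Is the null hypothesis value (mu1 - mu2 = 0) contained in the interval?
print(f"95% CI for mu1 - mu2: ({lo:.2f}, {hi:.2f})")
```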