Stats 13

Lecture 17

Essential Synthesis #1

Guillaume Calmettes

The big picture

1) You have a question
Questions are always about population data.

2) Data analysis
You cannot always access the full population data and often have to rely on a small part of it, the sample data, which you will analyze.

3) Answer the question
Using the new knowledge you gained from the sample analysis, you can draw conclusions (under some conditions) about the population data. This is the process of making inference.

The statistical analysis process

Collect data / Design experiment

Visualize data (important!). Distribution? Skewness? Outliers?

Choose a statistic to describe the sample (this will depend on skewness, outliers, etc.)

Formulate hypotheses

Analysis

Data collection

We obtain information about (observational) units.

A variable is any characteristic that is recorded on each unit. It can be categorical or quantitative.

When selecting data, the goal is to select a sample that is representative of the population, only smaller.

Sampling bias occurs when the method of selecting a sample causes the sample to differ from the population in some relevant way.

Sampling

Sampling bias: when a sampling method systematically yields results that are either too high or too low.
It can be avoided by using a good (random) sampling technique.

Sampling variation: the natural variation in results from one random sample to the next.
It can be reduced by using a larger sample.

Experimental design

If we want to show relationships (association) between several variables, we are analyzing the effects of an explanatory variable on a response variable.

A third variable that is associated with both the explanatory variable and the response variable is called a confounding variable.

Experimental design

| Observational study | Randomized experiment |
| --- | --- |
| The explanatory variable is not under the control of the researcher, because of ethical concerns or logistical constraints. | The explanatory variable for each unit is determined randomly, before the response variable is measured. |
| Because the explanatory variable is not randomly assigned, the effects of confounding variables are almost always present. | Because the explanatory variable is randomly assigned, it is not associated with any other variables; the effects of confounding variables are most likely eliminated! |
| Observational studies can almost never be used to establish causation. | Randomized experiments make it possible to infer causation! |

Confounding variables can be present in both observational studies and randomized experiments, but only random assignment of the explanatory variable neutralizes their effects.
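
To make random assignment concrete, here is a minimal Python sketch (not from the lecture; the unit labels and group sizes are hypothetical):

```python
import random

# Hypothetical experimental units (labels are illustrative)
units = ["u01", "u02", "u03", "u04", "u05", "u06", "u07", "u08"]

# Random assignment: shuffle the units, then split them into a
# treatment group and a control group. Because the assignment is
# random, it is not associated with any characteristic of the units,
# which is what balances out potential confounding variables.
shuffled = random.sample(units, k=len(units))  # random permutation
treatment = shuffled[:len(units) // 2]
control = shuffled[len(units) // 2:]
print("treatment:", treatment)
print("control:  ", control)
```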

Randomized experiment

Data collection / Study design

Descriptive statistics

In order to make sense of data, we need ways to summarize and visualize it.

Summarizing and visualizing variables and the relationships between variables is often known as descriptive statistics (also known as exploratory data analysis).

The appropriate summary statistics and visualization methods depend on the type of variable(s) being analyzed (categorical or quantitative).

Exploratory data analysis

| Variable(s) | Visualization | Summary statistics |
| --- | --- | --- |
| Categorical | Bar chart, pie chart | (Relative) frequency table, proportion |
| Quantitative | Dotplot, histogram, boxplot | Mean, median, max, min, standard deviation, range, IQR, five-number summary |
| Categorical vs categorical | Side-by-side bar chart, segmented bar chart | Two-way table, difference in proportions, ratio of the proportions (relative risk) |
| Quantitative vs categorical | Side-by-side boxplots, side-by-side dotplots | Statistics for each group, difference in means/medians, ratio of the means/medians |
| Quantitative vs quantitative | Scatterplot | Correlation, regression line |
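
For the quantitative row of this table, a minimal Python sketch of the summary statistics, using the standard library's statistics module on a hypothetical sample (note that quartiles, and hence the IQR, can be defined in slightly different ways):

```python
import statistics

# Hypothetical quantitative sample
x = [3.1, 4.7, 2.2, 5.8, 4.1, 3.9, 6.3, 2.8]

print("mean:  ", statistics.mean(x))
print("median:", statistics.median(x))
print("sd:    ", statistics.stdev(x))        # sample standard deviation
print("range: ", max(x) - min(x))

# Quartiles (one of several common definitions), then the IQR and
# the five-number summary
q1, q2, q3 = statistics.quantiles(x, n=4)
print("IQR:   ", q3 - q1)
print("five-number summary:", (min(x), q1, q2, q3, max(x)))
```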

Sampling distribution

A sampling distribution is the distribution of statistics computed for different samples of the same size taken from the same population.

The spread of the sampling distribution helps us to assess the uncertainty in the sample statistic. (Note: the standard deviation of the sampling distribution is the standard error of the statistic.)

We rarely get to see the sampling distribution; we usually only have one sample.
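
A simulation is the one place where we can see a sampling distribution, because we can pretend the population is known. A minimal Python sketch (the population parameters and sample size are hypothetical):

```python
import random
import statistics

random.seed(13)

# A pretend-known population (only possible in a simulation!)
population = [random.gauss(50, 10) for _ in range(100_000)]

# Many samples of the same size, with a statistic computed on each
n = 25
sample_means = [
    statistics.mean(random.sample(population, k=n))
    for _ in range(5_000)
]

# The spread (standard deviation) of the sampling distribution is the
# standard error of the statistic; compare with sigma/sqrt(n) = 10/5 = 2.
print("standard error of the mean:", statistics.stdev(sample_means))
```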

Bootstrap

A bootstrap sample is a random sample taken with replacement from the original sample, of the same size as the original sample.

A bootstrap statistic is the statistic computed on the bootstrap sample.

A bootstrap distribution is the distribution of many bootstrap statistics.

The bootstrap distribution will be centered at the statistic of the original sample from which the bootstrap samples were drawn.
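
A minimal Python sketch of these three definitions, using a hypothetical sample and the mean as the statistic of interest:

```python
import random
import statistics

random.seed(13)

# Hypothetical original sample
sample = [12.1, 9.8, 14.3, 11.0, 10.5, 13.2, 9.1, 12.7, 11.9, 10.2]

boot_stats = []
for _ in range(10_000):
    # A bootstrap sample: same size as the original, drawn WITH replacement
    boot_sample = random.choices(sample, k=len(sample))
    # The bootstrap statistic: here, the mean of the bootstrap sample
    boot_stats.append(statistics.mean(boot_sample))

# The bootstrap distribution (boot_stats) is centered near the
# statistic of the original sample:
print("original sample mean:            ", statistics.mean(sample))
print("center of bootstrap distribution:", statistics.mean(boot_stats))
```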

Confidence intervals

A confidence interval for a parameter is an interval computed from sample data by a method that will capture the parameter for a specified proportion of all samples.

  • The parameter is fixed (from population)
  • The statistic is random (depends on the sample)
  • The interval is random (depends on the statistic)
  • A 95% confidence interval will contain the true parameter for 95% of all samples

Drawing bootstrap samples from a sample is one method of computing confidence intervals.
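
One common bootstrap method is the percentile method: take the middle 95% of the bootstrap distribution. A minimal Python sketch, reusing the hypothetical sample from the sketch above:

```python
import random
import statistics

random.seed(13)

# Hypothetical original sample
sample = [12.1, 9.8, 14.3, 11.0, 10.5, 13.2, 9.1, 12.7, 11.9, 10.2]

# Bootstrap distribution of the mean (10,000 bootstrap statistics)
boot_means = sorted(
    statistics.mean(random.choices(sample, k=len(sample)))
    for _ in range(10_000)
)

# Percentile method: the middle 95% of the bootstrap distribution,
# i.e. (approximately) the 2.5th and 97.5th percentiles.
lo = boot_means[int(0.025 * len(boot_means))]
hi = boot_means[int(0.975 * len(boot_means))]
print(f"95% CI for the mean: ({lo:.2f}, {hi:.2f})")
```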

Statistical analysis

All the statistical problems we have seen can be solved using two main approaches:

Null Hypothesis Testing

How unusual would it be to get results as extreme as (or more extreme than) those observed, if the null hypothesis were true?

95% confidence intervals

What is an estimate of the range of plausible values for the true parameter? Is the null hypothesis parameter contained in this interval?

Note: these two approaches look at the same thing, but from different angles.

Randomization distribution & p-value

A randomization distribution is the distribution of sample statistics we would observe, just by random chance, if the null hypothesis were true.

The p-value is the probability of getting a statistic as extreme as (or more extreme than) the one observed, just by random chance, if the null hypothesis is true.

The p-value measures the evidence against the null hypothesis.

The p-value is calculated by finding the proportion of statistics in the randomization distribution that fall beyond the observed statistic (in one or both directions, depending on the alternative hypothesis).
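
As an illustration, a minimal Python sketch for one categorical variable (the counts are hypothetical, and $H_0$: $\pi_0 = 0.5$ is an assumed null): the randomization distribution is built by simulating samples under the null, and the p-value is the proportion of statistics at least as extreme as the observed one.

```python
import random

random.seed(13)

# Hypothetical data: 34 successes in 50 trials; H0: pi_0 = 0.5
n = 50
observed = 34 / n

# Randomization distribution: sample proportions simulated under the
# null hypothesis (each trial succeeds with probability 0.5).
rand_props = [
    sum(random.random() < 0.5 for _ in range(n)) / n
    for _ in range(10_000)
]

# Two-sided p-value: proportion of simulated statistics at least as
# far from the null value (0.5) as the observed statistic.
p_value = sum(
    abs(p - 0.5) >= abs(observed - 0.5) for p in rand_props
) / len(rand_props)
print("p-value:", p_value)
```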

p-value common thresholds

A small p-value casts doubt on the null hypothesis/model used to perform the calculation.

A p-value is generally considered to provide the following strength of evidence against the null:

| p-value | Evidence against the null |
| --- | --- |
| $\leq 0.10$ | some |
| $\leq 0.05$ | fairly strong |
| $\leq 0.01$ | very strong |
| $\leq 0.001$ | extremely strong |

Formal decisions:
  • p-value $<\alpha$: reject $H_0$
  • p-value $\geq\alpha$: do not reject $H_0$

Errors in significance testing

Errors can happen!

  • A Type I error is rejecting a true null (false positive)
  • A Type II error is not rejecting a false null (false negative)
| Truth | Decision: reject $H_0$ | Decision: do not reject $H_0$ |
| --- | --- | --- |
| $H_0$ true | Type I error | correct decision |
| $H_0$ false | correct decision | Type II error |

Statistics of interest & Null

| Situation | Statistic of interest | Null |
| --- | --- | --- |
| One categorical variable | Proportion ($\hat{p}$) | $\pi_0 = x$ |
| One quantitative variable | Mean ($\bar{x}$), median ($m$), min, max, etc. | $\mu_0 = x$ |
| Comparing two categorical variables | Difference in proportions ($\hat{p}_1-\hat{p}_2$), ratio of the proportions (relative risk), $\chi^2$ | $\pi_1-\pi_2=0$ ($rr=1$) |
| Comparing two quantitative variables | Difference in means ($\bar{x}_1-\bar{x}_2$), difference in medians, ratio of the means, ratio of the medians, etc. | $\mu_1-\mu_2=0$ (ratio $=1$) |
| Comparing more than 2 groups (categorical) | $\chi^2$, MAD (of the proportions) | $\chi^2=0$, $MAD=0$ |
| Comparing more than 2 groups (quantitative) | MAD (of the means, of the medians), $F$-statistic | $MAD=0$, $F=0$ |
| Association of 2 quantitative variables | Correlation coefficient ($r$), regression slope ($m$) | $\rho=0$, $\beta_1=0$ |
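
The MAD statistic in this table is course-specific; assuming it is defined as the mean of the absolute differences between all pairs of group statistics (an assumption worth checking against the lecture where it was introduced), a minimal Python sketch for three hypothetical groups:

```python
from itertools import combinations
from statistics import mean

# Hypothetical samples from three groups
groups = [
    [5.2, 6.1, 5.8, 6.4],
    [7.0, 6.8, 7.5, 7.2],
    [5.9, 6.3, 6.0, 6.6],
]

# MAD of the group means: the mean of the absolute differences
# between all pairs of group means (assumed definition; see above).
group_means = [mean(g) for g in groups]
mad = mean(abs(m1 - m2) for m1, m2 in combinations(group_means, 2))
print("MAD of the group means:", mad)
```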

Hypothesis testing

1. State hypotheses ($H_0$, $H_a$)

2. Calculate the statistic of interest from your sample(s) data

3. Construct the randomization distribution of the statistic of interest:
- One (categorical) variable: simulate random samples under the null ($\pi_0=x$)
- $\geq$2 variables: pool the samples together ($H_0$: no difference), shuffle and split into random groups (of the same sizes as the original samples), and calculate the new statistic of interest for each randomization
→ the distribution of the statistic of interest if the null hypothesis were true

4. Measure how extreme your test statistic from (2) is, compared to the distribution generated in (3)
→ p-value

Hypothesis testing

Comparing proportions:
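
A minimal Python sketch of the four steps for a difference in proportions (the counts are hypothetical, and coding the categorical variable as 0/1 is an illustrative choice):

```python
import random

random.seed(13)

# Hypothetical data, coded 1 = "success", 0 = "failure"
group1 = [1] * 18 + [0] * 12   # p_hat_1 = 18/30
group2 = [1] * 10 + [0] * 20   # p_hat_2 = 10/30

# Step 2: observed difference in proportions
observed = sum(group1) / len(group1) - sum(group2) / len(group2)

# Step 3: pool, shuffle, split into groups of the original sizes
pooled = group1 + group2
rand_stats = []
for _ in range(10_000):
    random.shuffle(pooled)
    g1, g2 = pooled[:len(group1)], pooled[len(group1):]
    rand_stats.append(sum(g1) / len(g1) - sum(g2) / len(g2))

# Step 4: two-sided p-value
p_value = sum(abs(s) >= abs(observed) for s in rand_stats) / len(rand_stats)
print(f"observed p1 - p2 = {observed:.3f}, p-value = {p_value:.4f}")
```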

Hypothesis testing

Comparing means (medians, etc ...):
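
The same shuffle-and-split procedure works for any comparison statistic. A minimal Python sketch using a difference in medians on hypothetical data:

```python
import random
import statistics

random.seed(13)

# Hypothetical quantitative samples for two groups
group1 = [14.2, 11.8, 16.5, 13.0, 15.1, 12.4, 17.3]
group2 = [10.9, 12.2, 9.8, 11.5, 13.1, 10.2, 12.8]

# Step 2: observed difference in medians (any statistic would work)
observed = statistics.median(group1) - statistics.median(group2)

# Step 3: pool, shuffle, split into groups of the original sizes
pooled = group1 + group2
rand_stats = []
for _ in range(10_000):
    random.shuffle(pooled)
    g1, g2 = pooled[:len(group1)], pooled[len(group1):]
    rand_stats.append(statistics.median(g1) - statistics.median(g2))

# Step 4: two-sided p-value
p_value = sum(abs(s) >= abs(observed) for s in rand_stats) / len(rand_stats)
print(f"observed difference in medians = {observed:.2f}, p-value = {p_value:.4f}")
```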

95% confidence intervals

1. State hypotheses ($H_0$, $H_a$)

2. Calculate the statistic of interest from your sample(s) data

3. Construct the bootstrap distribution of the statistic of interest:
- One variable: bootstrap samples (of the same size) drawn from the original sample
- $\geq$2 variables: independent bootstrap samples (of the same sizes as the original samples), calculating the new statistic of interest from each set of bootstrap samples
→ the distribution of the statistic of interest if more samples were acquired

4. Take the middle 95% of the bootstrap distribution.
→ Is the null hypothesis parameter in this interval?

95% confidence intervals

Comparing proportions:
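
A minimal Python sketch of a 95% bootstrap confidence interval for a difference in proportions (hypothetical counts, 0/1 coding as in the earlier sketch):

```python
import random

random.seed(13)

# Hypothetical data, coded 1 = "success", 0 = "failure"
group1 = [1] * 18 + [0] * 12
group2 = [1] * 10 + [0] * 20

# Independent bootstrap samples, each the size of its original sample
boot_stats = []
for _ in range(10_000):
    b1 = random.choices(group1, k=len(group1))
    b2 = random.choices(group2, k=len(group2))
    boot_stats.append(sum(b1) / len(b1) - sum(b2) / len(b2))

# Middle 95% of the bootstrap distribution
boot_stats.sort()
lo = boot_stats[int(0.025 * len(boot_stats))]
hi = boot_stats[int(0.975 * len(boot_stats))]
print(f"95% CI for pi1 - pi2: ({lo:.3f}, {hi:.3f})")
# Is the null hypothesis value (0) inside this interval?
```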

95% confidence intervals

Comparing means:
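
And the same procedure for a difference in means (hypothetical data):

```python
import random
import statistics

random.seed(13)

# Hypothetical quantitative samples for two groups
group1 = [26.4, 23.1, 28.0, 25.2, 27.5, 24.8]
group2 = [22.0, 24.3, 21.5, 23.8, 22.9, 20.7]

# Independent bootstrap samples, each the size of its original sample
boot_stats = []
for _ in range(10_000):
    b1 = random.choices(group1, k=len(group1))
    b2 = random.choices(group2, k=len(group2))
    boot_stats.append(statistics.mean(b1) - statistics.mean(b2))

# Middle 95% of the bootstrap distribution
boot_stats.sort()
lo = boot_stats[int(0.025 * len(boot_stats))]
hi = boot_stats[int(0.975 * len(boot_stats))]
print(f"95% CI for mu1 - mu2: ({lo:.2f}, {hi:.2f})")
# Is the null hypothesis value (0) inside this interval?
```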

Questions?