Stats 13

Lecture 5

Making inferences about
the population mean

Guillaume Calmettes

Last time

Null Hypothesis Significance Testing (NHST) for one proportion
- Resampling p-value
- One proportion z-test (p-value)

Limits of the p-value
Many factors can influence the p-value (sample size, one-sided vs. two-sided tests, etc.)

Confidence Intervals (CI) as a more informative alternative to the p-value (95%)
They can be obtained directly from the characteristics of the data sample we have (one approach: the bootstrap, which treats the data as our population)

Last time

Two ways of looking at the same thing.

NHST
95% CIs

From sample to population

One of the most important ideas in statistics is that we can learn a lot about a large group (called a population) by studying a small piece of it (called a sample).

Statistical inference is the process of using data from a sample to gain information about the population. If we are using the sample to infer about the population, we have to take precautions to ensure our sample is representative of our population.

How can we obtain a statistic that we trust to be reasonably close to the actual (but unknown to us) value of the parameter?

Let's do a little experiment!

In class activity (sampling words)

In most studies, we do not have access to the entire population and can only consider a sample from this population.

https://goo.gl/cYUbFn

1- Select a sample of 10 representative words

2- Record the number of letters in each of the ten words in your sample

3- Calculate the average (mean) number of letters in your ten words. Enter this value in the spreadsheet in the cell assigned to your name (please only fill in the value for your name!)

Human judgment is not very good at selecting representative samples from populations

The day after the 1948 presidential election, the Chicago Tribune ran the headline “Dewey Defeats Truman”. However, Harry S Truman defeated Thomas E. Dewey to become the 33rd president of the United States.

What problem could have occurred?

The 1948 presidential election

The newspaper went to press before all the results had come in, and the headline was based partly on the results of a large telephone poll which showed Dewey sweeping Truman.

But when the dust settled, Truman easily defeated Dewey in the Electoral College, by a 303 to 189 margin. Truman also won the popular vote, 50 to 45 percent.

What is the sample and what is the population?
The sample is all the people who participated in the telephone poll. The population is all voting Americans.

What did the pollsters want to infer about the population based on the sample?
To estimate the percentage of all voting Americans who would vote for each candidate.

Why do you think the telephone poll yielded such inaccurate results?
People with telephones in 1948 were not representative of all American voters.
People with telephones tended to be wealthier and prefer Dewey while people without phones tended to prefer Truman.

Sampling bias

Sampling bias occurs when the method of selecting a sample causes the sample to differ from the population in some relevant way. If sampling bias exists, then we cannot trust generalizations from the sample to the population.

To avoid sampling bias, we try to obtain a sample that is representative of the population. A representative sample resembles the population, only in smaller numbers. The more representative a sample is, the more valuable the sample is for making inferences about the population.

Simple random sample

Since a representative sample is essential for drawing valid inference to the population, you are probably wondering how to select such a sample! The key is random sampling.

A simple random sample gives every observational unit in the population the same chance of being selected.
Taking a simple random sample avoids sampling bias.

Although the principle of simple random sampling is probably clear, it is by no means simple to implement.

Randomness

The key to obtaining a representative sample is using some type of random mechanism to select the observational units from the population rather than relying on a convenience sample or any type of human judgment.

=> We must use a formal random sampling method, typically implemented with technology.
The first step is to obtain a sampling frame, in which each observational unit of the population is assigned a number.
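The two steps above can be sketched in Python with the standard library. The population below is hypothetical (a toy list standing in for a real sampling frame):

```python
import random

# Hypothetical sampling frame: every observational unit in the
# population is listed and can be identified by its index (a number).
population = [f"employee_{i}" for i in range(1, 501)]  # toy population of 500

random.seed(13)  # for reproducibility
# random.sample draws without replacement, giving every unit the same
# chance of being selected: a simple random sample.
srs = random.sample(population, k=10)
print(srs)
```

Because the selection is made by the computer rather than by human judgment, this avoids the sampling bias seen in the word-sampling activity.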

Sampling frame


Three important points about random sampling

One still gets the occasional "unlucky" sample whose results are not close to the population even with large sample sizes.

The sample size means little if the sampling method is not random.
In 1936, the Literary Digest magazine had a huge sample of 2.4 million people, yet their predictions for the Presidential election did not come close to the truth about the population.

Although the role of sample size is crucial in assessing how close the sample results will be to the population results, the size of the population does not affect this. As long as the population is large relative to the sample size (at least 10 times as large), the precision of a sample statistic depends on the sample size but not on the population size.
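This claim can be checked with a small simulation (a sketch with made-up normal populations, not real data): the standard deviation of the sampling distribution of the mean is about the same whether the population has 10,000 members or 1,000,000, as long as both are large relative to the sample size.

```python
import random
import statistics

random.seed(13)

def sd_of_sample_means(pop_size, n=100, reps=1000):
    """Simulate the sampling distribution of the mean and return its SD."""
    population = [random.gauss(50, 10) for _ in range(pop_size)]
    means = [statistics.mean(random.sample(population, n)) for _ in range(reps)]
    return statistics.stdev(means)

small = sd_of_sample_means(10_000)      # population 100x the sample size
large = sd_of_sample_means(1_000_000)   # population 10,000x the sample size
print(small, large)  # both close to sigma/sqrt(n) = 10/10 = 1
```

The precision depends on the sample size n, not on the population size.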

The sampling distribution


Realities of random sampling

While a random sample is ideal, often it may not be achievable:
- a list of the entire population may not exist
- it may be impossible to contact some members of the population
- or it may be too expensive or time consuming to do so

Often we must make do with whatever sample is convenient. The study can still be worth doing, but we have to be very careful when drawing inferences to the population and should at least try to avoid obvious sampling bias as much as possible.

When it is difficult to take a random sample from the population of interest, we may have to redefine the population to which we generalize.

Other sources of bias

Bias exists when the method of collecting data causes the sample data to inaccurately reflect the population.

Bias can occur when people we have selected to be in our sample choose not to participate.
If the people who choose to respond would answer differently than the people who choose not to respond, results will be biased.

The way questions are worded can also bias the results.
In 1941 Daniel Rugg asked people the same question in two different ways.
“Do you think that the United States should allow public speeches against democracy?” => 21% said the speeches should be allowed.
“Do you think that the United States should forbid public speeches against democracy?” => 39% said the speeches should not be forbidden.
=> Merely changing the wording of the question nearly doubled the percentage of people in favor of allowing (not forbidding) public speeches against democracy.

Other sources of bias

Bias exists when the method of collecting data causes the sample data to inaccurately reflect the population.

The most important message is to always think critically about the way data are collected and to recognize that not all methods of data collection lead to valid inferences. Recognizing sources of bias is often simply common sense.

Studying a quantitative variable

UCLA salaries (full time employee, 2014)
Population data (n=32578)

Studying a quantitative variable

UCLA salaries (full time employee, 2014)
Population data (n=32578)

UCLA salaries (2014) - sample

We have seen that we usually do not have the luxury of access to the full true population, and we have to rely on a sample from this population to make inferences about the characteristics of the full population.

Let's consider that we have obtained a random sample of 100 UCLA salaries.

We want to know how UCLA salaries in 2014 compared to the National average.

US Census Bureau data

Year: 2014
Population: 15 years old and over, all races, both sexes

Mean $62,931
Median $46,480

Unlike with categorical (proportion) data, for a quantitative data sample there is not really a probabilistic model we can simulate to test our data against. We just know the mean and median values we want to compare our data to.

Would a theoretical model be good anyways?

Bootstrap confidence intervals for a population mean

=> Making inferences using the data sample we have is the more robust approach, as it requires no distributional assumptions.

Repeatedly drawing bootstrap samples from the data sample to compute a 95% confidence interval for the population mean ($\mu$) provides a simple and robust way to estimate the variability of $\bar{x}$ and approximate the sampling distribution of $\bar{x}$ using just the information in that one sample.

Bootstrap confidence intervals for a population mean

Sample statistic:
$\bar{x}$ (sample mean)

95% CI:
central $95\%$ of the bootstrap $\bar{x}$ distribution

  1. Generate bootstrap samples by sampling with replacement from the original sample, using the same sample size
  2. Compute the statistic of interest ("bootstrap statistic") for each of the bootstrap samples
  3. Collect the statistics for many bootstrap samples to create a bootstrap distribution
  4. Rank the bootstrap distribution statistics and take the values at positions $\alpha/2$ (lower CI limit) and $1-\alpha/2$ (upper CI limit)
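The four steps above can be sketched in Python with the standard library. The salaries here are simulated stand-ins for the UCLA sample, not the actual data:

```python
import random
import statistics

random.seed(13)

# Hypothetical sample of 100 salaries (stand-in for the UCLA sample)
sample = [random.gauss(80000, 60000) for _ in range(100)]

n_boot = 10_000
boot_means = []
for _ in range(n_boot):
    # 1. Resample WITH replacement, same size as the original sample
    boot = random.choices(sample, k=len(sample))
    # 2. Compute the bootstrap statistic (here, the mean)
    boot_means.append(statistics.mean(boot))

# 3-4. Rank the bootstrap statistics and keep the central 95%
boot_means.sort()
lower = boot_means[int(0.025 * n_boot)]
upper = boot_means[int(0.975 * n_boot)]
print(f"95% CI: [{lower:.0f}, {upper:.0f}]")
```

Swapping `statistics.mean` for any other statistic gives a bootstrap CI for that statistic with no other change.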

Bootstrap confidence intervals for a population mean

Sample
UCLA Salaries
(n=100)

Bootstrap confidence intervals for a population mean

Each bootstrap sample:
n=100, drawing WITH replacement
from original sample
Statistic calculated: mean ($\bar{x}$)

Sample
UCLA Salaries
(n=100)

$1$ selection
$2$ selections
$\geq3$ selections

Bootstrap confidence intervals for a population mean

Bootstrap distribution:
10000 bootstrap $\bar{x}$

95% CIs:
[69,280, 94,210]

Inference:
UCLA mean salary > US mean salary

Theory: Standardized statistic

In a similar way to the calculation of a $z$ statistic for a proportion, it is possible to compute a standardized statistic for a sample mean (quantitative data), by measuring how far it is from a hypothesized population mean in terms of the standard deviation of the sampling distribution of the mean.

$t=\frac{\textrm{sample mean}-\textrm{hypothesized population mean}}{\sigma\textrm{ of sampling distribution of the mean}}$

Hypothesized mean:
$\mu = \textrm{value to test}$

$\sigma$ of the sampling distribution of the mean (Standard Error)
$SE=\frac{\sigma}{\sqrt{n}}$

Theory: central limit theorem

Central limit theorem for sample means:
If the sample size $n$ is large, the distribution of sample means (=sampling distribution) from a population with mean $\mu$ and standard deviation $\sigma$ is approximately normally distributed with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$ (=standard error of the sample means).

However in practice we don't usually know the standard deviation $\sigma$ for the population of interest, and we usually have only the information of a sample in the population.
We can just substitute in $s$, the standard deviation of the sample, when estimating the standard error of the sample means. $SE=\frac{s}{\sqrt{n}}$
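The central limit theorem can be illustrated with a small simulation (hypothetical normal data, not the salary sample): the sample means cluster around $\mu$ with spread close to $\frac{\sigma}{\sqrt{n}}$.

```python
import random
import statistics

random.seed(13)

mu, sigma, n = 50, 10, 64
# Draw many samples of size n and record each sample mean
means = [statistics.mean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(5000)]

print(statistics.mean(means))   # close to mu = 50
print(statistics.stdev(means))  # close to sigma/sqrt(n) = 10/8 = 1.25
```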

Theory: Standardized statistic (t)

When we use the $SE$ based on $\frac{s}{\sqrt{n}}$ to standardize the sample mean, the distribution is no longer the standard normal, but rather the $t$-distribution. The standardized statistic is then commonly denoted $t$.

The shape of a $t$-distribution looks a lot like a normal distribution but it is a bit more spread out than a normal distribution (more observations in the "tails", less in the middle).

A key fact about the $t$-distribution is that it depends on the size of the sample ($n$). The sample size is reflected in a parameter called the degrees of freedom ($df$) for the $t$-distribution. When working with $\bar{x}$ for a sample of size $n$, we use a $t$-distribution with $df=n-1$.

Theory: One-sample t-test

Salary data
US average ($\mu_0$)
$\$62931$
UCLA sample ($n=100$):
$\bar{x}=\$80919$ / $s=\$63956$

$t$-statistic: $t=\frac{\bar{x}-\mu_0}{\frac{s}{\sqrt{n}}}$

$t=\frac{80919-62931}{\frac{63956}{\sqrt{100}}}$

$t=2.81$
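The arithmetic above, using the sample values given on this slide, can be checked directly:

```python
import math

x_bar = 80919  # sample mean (UCLA sample, n = 100)
mu_0 = 62931   # hypothesized mean (US average)
s = 63956      # sample standard deviation
n = 100

se = s / math.sqrt(n)        # estimated standard error: 6395.6
t = (x_bar - mu_0) / se
print(round(t, 2))  # → 2.81
```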

Theory: 95% confidence intervals

Because the theoretical sampling distribution of the mean follows a $t$-distribution that behaves more or less like a normal distribution, we expect $95\%$ of the sample means to lie within $2$ standard errors on each side of the true mean (more precisely, $1.96$).

The theoretical $95\%$ confidence intervals of the population mean can then be approximated using the sample mean ($\bar{x}$) and the lower and upper limits below:
- lower limit: $\bar{x}-1.96\times SE$
- upper limit: $\bar{x}+1.96\times SE$
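Applying these limits to the UCLA sample values from the previous slide ($\bar{x}=\$80919$, $s=\$63956$, $n=100$):

```python
import math

x_bar = 80919
s = 63956
n = 100

se = s / math.sqrt(n)        # 6395.6
lower = x_bar - 1.96 * se    # lower limit
upper = x_bar + 1.96 * se    # upper limit
print(f"approximate 95% CI: [{lower:.0f}, {upper:.0f}]")
```

This theoretical interval is broadly consistent with the bootstrap interval obtained earlier.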

Theory vs resampling

Validity condition for a one-sample $t$-test: the quantitative variable should have a symmetric distribution, or should have at least 20 observations and the sample distribution should not be strongly skewed. If the sample size is small and the data are heavily skewed or contain outliers, the $t$-distribution should not be used.

What about the median?

The median is defined by a procedure and cannot be described with an equation.
WE CANNOT STUDY THE MEDIAN USING A THEORETICAL APPROACH

We can easily calculate the median of all of our bootstrap samples.
A RESAMPLING APPROACH CAN BE USED TO STUDY THE MEDIAN!

What about the median?

Sample statistic:
$m$ (sample median)

95% CI:
central $95\%$ of the bootstrap $m$ distribution

  1. Generate bootstrap samples by sampling with replacement from the original sample, using the same sample size
  2. Compute the statistic of interest ("bootstrap statistic") for each of the bootstrap samples
  3. Collect the statistics for many bootstrap samples to create a bootstrap distribution
  4. Rank the bootstrap distribution statistics and take the values at positions $\alpha/2$ (lower CI limit) and $1-\alpha/2$ (upper CI limit)
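The same procedure works for the median; only the statistic computed on each bootstrap sample changes. A sketch with hypothetical salary data (not the actual UCLA sample):

```python
import random
import statistics

random.seed(13)

# Hypothetical salary sample; any statistic works, including the median
sample = [random.gauss(60000, 20000) for _ in range(100)]

boot_medians = sorted(
    statistics.median(random.choices(sample, k=len(sample)))
    for _ in range(10_000)
)
lower = boot_medians[int(0.025 * 10_000)]   # position alpha/2
upper = boot_medians[int(0.975 * 10_000)]   # position 1 - alpha/2
print(f"95% CI for the median: [{lower:.0f}, {upper:.0f}]")
```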

What about the median?

Each bootstrap sample:
n=100, drawing WITH replacement
from original sample
Statistic calculated: median ($m$)

Sample
UCLA Salaries
(n=100)

$1$ selection
$2$ selections
$\geq3$ selections

What about the median?

Bootstrap distribution:
10000 bootstrap $m$

95% CIs:
[52,654, 67,276]

Inference:
UCLA median salary > US median salary

What about a fancy descriptor?

Sample statistic:
[fancy descriptor]

95% CI:
central $95\%$ of the bootstrap [fancy descriptor] distribution

  1. Generate bootstrap samples by sampling with replacement from the original sample, using the same sample size
  2. Compute the statistic of interest ("bootstrap statistic") for each of the bootstrap samples
  3. Collect the statistics for many bootstrap samples to create a bootstrap distribution
  4. Rank the bootstrap distribution statistics and take the values at positions $\alpha/2$ (lower CI limit) and $1-\alpha/2$ (upper CI limit)