Stats 13

Lecture 11

Association of two quantitative variables

Guillaume Calmettes

Last time

Paired data

Explanatory variable

So far, we have dealt with categorical and quantitative variables, whether for making inferences from one sample or for comparing two groups. When comparing two groups, whatever the nature of the response variable (categorical or quantitative), the explanatory variable has always been categorical.

Today, we will learn how to describe the relationship between two quantitative variables, meaning that the explanatory variable will also be quantitative.

Scatterplot

A scatterplot is the graph of the relationship between two quantitative variables.

If there are explanatory and response variables:
Explanatory variable on the x axis
Response variable on the y axis

The paired data for each case are plotted as a single point on the scatterplot.

Houses for sale in Santa Monica.
Zillow search (05/11/2017)

The numerical scales (and units) are independent, one for each variable.

Scatterplot

Note:
A scatterplot can encode more than two variables, which lets you build richer visualizations:
Size encoding
Color encoding
Symbol encoding

Ex:
- Point size: Population
- Color: Number of larcenies

(US data, 2005)

In this case, multiple relationships between the different variables can be visually appreciated at the same time.

Scatterplot

Note:
Multiple datasets can be displayed on the same scatterplot.
The different datasets can be differentiated by color/symbol encoding
All the datasets must have the same variables (displayed on the same scales)

Visual comparisons can be made.

(Teacher salary sample, $n$=100)

Scatterplot to show relationships

There are 4 major characteristics to consider when describing a scatterplot:

1. Direction (positive/negative)

A positive association means that values of one variable tend to be higher when values of the other variable are higher
A negative association means that values of one variable tend to be lower when values of the other variable are higher
Two variables are not associated if knowing the value of one variable does not give you any information about the value of the other variable

2. Form (linear or not)

An association is considered linear if the overall shape of the cloud of data points can be described by a line.
An association is considered non-linear if the overall shape of the data points shows obvious curvature or another clear departure from a line.

3. Strength (strong-moderate-weak, we will let correlation help us decide)

The strength of an association is reflected by how closely the data points cluster around the overall (linear) shape. More local variability in both x and y means a weaker association.

4. Unusual observations

Unusual observations (outliers) are data points for which the value of one of the variables is much lower or much higher than for the other data points. Such a point appears out of place compared to the others.

Cars data association

Make initial guesses for the strength and direction of association for each of the following car characteristics

Correlation

In statistics, dependence (or association) is any statistical relationship, whether causal or not, between two (random) variables.
When the association is linear, we often refer to such an association as correlation.

The numerical statistic to measure the strength and direction of linear association between two quantitative variables is the correlation coefficient (often simply called "correlation").

sample correlation coefficient: $r$
population correlation coefficient: $\rho$ ("rho")

Cars data correlation

Correlation coefficient ($r$)

What are the properties of $r$?

Correlation properties

The correlation coefficient:

Reflects the strength and direction of a linear relationship

Does not reflect the slope of the relationship

Does not reflect many aspects of nonlinear relationships

Correlation properties

Range of values: $-1\leq r\leq1$

The sign indicates the direction of association:

  • Positive association ($r>0$): with positive correlation, one variable increases, on average, as the other increases
  • Negative association ($r<0$): with negative correlation, one variable decreases, on average, as the other increases
  • No linear relationship ($r\approx 0$)

The closer $r$ is to $\pm 1$, the stronger the linear association (the closer the points fit to a line)

$r$ has no units and does not depend on the units of measurement

The correlation makes no distinction between explanatory and response variables (swap the axes and the correlation coefficient will be the same)
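The properties above can be checked numerically. The following is a minimal sketch using only the Python standard library; the height and foot length values are made up for illustration (they are not the lecture's dataset), and `pearson_r` is a helper name introduced here.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only
heights_cm = [160, 165, 170, 175, 180, 185]
foot_cm = [23.0, 24.5, 24.0, 26.0, 27.5, 27.0]

r_cm = pearson_r(heights_cm, foot_cm)

# Changing the units of one variable (cm to inches) leaves r unchanged
heights_in = [h / 2.54 for h in heights_cm]
r_in = pearson_r(heights_in, foot_cm)

# Swapping explanatory and response variables leaves r unchanged
r_swapped = pearson_r(foot_cm, heights_cm)

# Doubling every y value doubles the slope of the relationship but not r
r_steeper = pearson_r(heights_cm, [2 * f for f in foot_cm])

print(r_cm, r_in, r_swapped, r_steeper)
```

All four printed values agree up to floating-point rounding, which illustrates that $r$ has no units, is symmetric in the two variables, and does not reflect the slope.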

Correlation guidelines

The commonly accepted ranges for interpreting the correlation coefficient are as follows:

| Correlation value (absolute value) | Strength of association | What this means |
|---|---|---|
| 0.7 to 1.0 | Strong | The points will appear to be nearly a straight line |
| 0.3 to 0.7 | Moderate | When looking at the graph the increasing/decreasing pattern will be clear, but it won’t be nearly a line |
| 0.1 to 0.3 | Weak | With some effort you will be able to see a slightly increasing/decreasing pattern |
| 0 to 0.1 | None | No discernible increasing/decreasing pattern |
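As a small illustration, the guideline ranges above can be turned into a lookup. This is only a sketch: the function name `association_strength` is hypothetical, and the ranges above share their endpoints (0.3 appears in two rows, for instance), so assigning each boundary value to the stronger label is an assumption made here.

```python
def association_strength(r):
    """Label a correlation using the guideline ranges on |r|.

    Assumption: boundary values (0.1, 0.3, 0.7) go to the stronger label.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)  # the guidelines ignore the sign (direction)
    if size >= 0.7:
        return "strong"
    if size >= 0.3:
        return "moderate"
    if size >= 0.1:
        return "weak"
    return "none"

print(association_strength(0.711))  # strong
print(association_strength(-0.5))   # moderate
```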

Correlation guessing game

Go to http://istics.net/Correlations/

Enter group id:

Cal17

Formula for correlation

We routinely rely on technology to compute correlations (computing the correlation "by hand" can be tedious, especially with a very high number of data points).

$r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$

This correlation formula can be rewritten as:

$r=\frac{1}{n-1}\sum_{i=1}^{n} \left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$

with $s_x=\sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}$ (and $s_y$ defined in the same way from the $y_i$)

Essentially, the correlation calculation converts all values of both variables to z-scores (standard scores). The correlation is then the average of the products of the paired z-scores, using $n-1$ rather than $n$ as the divisor.
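To see that the two forms of the formula give the same number, here is a minimal sketch on a tiny made-up dataset. The function names `r_from_sums` and `r_from_z_scores` are introduced here for illustration only.

```python
import math

def r_from_sums(xs, ys):
    """r as the sum of cross-products over the product of root sums of squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (math.sqrt(sxx) * math.sqrt(syy))

def r_from_z_scores(xs, ys):
    """r as the sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    products = [((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Made-up data: cross-products sum to 8, both sums of squares equal 10
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

r1 = r_from_sums(xs, ys)
r2 = r_from_z_scores(xs, ys)
print(round(r1, 6), round(r2, 6))  # both round to 0.8
```

The $n-1$ factors inside $s_x$ and $s_y$ cancel against the $\frac{1}{n-1}$ in front of the sum, which is why the two expressions agree.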

Correlation caution #1

A strong positive or negative correlation does not (necessarily) imply a cause and effect relationship between the two variables.

Correlation caution #2

A correlation near zero does not (necessarily) mean that the two variables are not associated, since the correlation measures only the strength of a linear relationship.
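A quick way to convince yourself of this caution is a perfectly determined but non-linear relationship. In the sketch below (made-up values, helper name `pearson_r` introduced here), $y$ is exactly $x^2$, yet the correlation is zero because the relationship is symmetric rather than linear.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # y is completely determined by x

r_parabola = pearson_r(xs, ys)
print(r_parabola)  # 0.0: no linear association, despite a perfect relationship
```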

Correlation caution #3

Correlation can be heavily influenced by outliers. Always plot your data!
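The effect of a single outlier can be illustrated with a small sketch. The values below are made up for illustration, and `pearson_r` is a helper name introduced here: six points with essentially no pattern, then the same six points plus one extreme observation.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Six points with no clear increasing or decreasing pattern
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0]
r_without = pearson_r(xs, ys)

# Same data plus one point far from the rest in both x and y
r_with = pearson_r(xs + [20], ys + [10.0])

print(round(r_without, 2), round(r_with, 2))
```

One added point moves the correlation from weak and slightly negative to strongly positive, which a scatterplot would reveal immediately.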

Inference for correlation - NHST

If there were no association between height and foot length, what is the probability that we would get a correlation as high as 0.711 just by chance?

If there is no association between height and foot length, the pairing between the two variables carries no information: any height could just as well have been paired with any foot length.

Inference for correlation - NHST

Resampling methodology:
We can break apart the heights and their corresponding foot lengths by shuffling one of the variables while keeping the other one untouched.

We create hypothetical pairs that could have been observed if there were no association between height and foot length.

For each simulation, we calculate the new $r$ resulting from these shuffled pairs.

Inference for correlation - NHST

By repeating this process 10000 times, we obtain the distribution of the correlation coefficient that we would expect if there were no association between the two variables in our dataset. This is the sampling distribution of $r$ under the null hypothesis.

A p-value can be calculated by counting how many simulations resulted in an $r$ at least as extreme as our initial observation and dividing that count by the total number of simulations.
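The shuffling procedure described above can be sketched as follows, using only the Python standard library. The data are made up (not the lecture's height and foot length sample), the function names are introduced here, and counting simulations with $|r|$ at least as large as the observed $|r|$ (a two-sided test) is an assumption about what "as extreme" means.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample correlation coefficient r of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, n_sims=10000, seed=0):
    """Two-sided p-value for r, obtained by shuffling ys and keeping xs fixed."""
    rng = random.Random(seed)   # seeded so the result is reproducible
    r_obs = pearson_r(xs, ys)
    shuffled = list(ys)         # copy, so the original pairing is untouched
    count = 0
    for _ in range(n_sims):
        rng.shuffle(shuffled)   # break the pairing between x and y
        if abs(pearson_r(xs, shuffled)) >= abs(r_obs):
            count += 1
    return r_obs, count / n_sims

# Hypothetical heights (cm) and foot lengths (cm), for illustration only
heights = [152, 158, 163, 165, 170, 173, 178, 180, 185, 191]
feet = [22.0, 23.5, 23.0, 24.5, 25.0, 24.0, 26.5, 26.0, 27.5, 28.0]

r_obs, p_value = permutation_p_value(heights, feet)
print(round(r_obs, 3), p_value)
```

For a one-sided question ("as high as"), the comparison would instead count simulations whose $r$ is at least the observed $r$.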

Inference for correlation - NHST

$p=0.0003$

We have strong evidence against the null hypothesis: a correlation coefficient as extreme as $r=0.711$ would be observed only about 0.03% of the time if the null hypothesis were true.
We can conclude that foot length is positively correlated with height.

Inference for correlation - 95% CIs

Resampling methodology:
To construct the 95% confidence interval, we build the sampling distribution of our statistic of interest ($r$) by drawing random bootstrap samples of the pairs, with replacement, from our original data sample. Each (x, y) pair is kept together when resampling.

For each simulation, we calculate the new $r$ resulting from this bootstrap sample.

Inference for correlation - 95% CIs

By repeating this process 10000 times, we obtain the bootstrap sampling distribution of $r$, centered approximately at the value of $r$ observed in our original sample.

The 95% confidence interval can then be calculated from this distribution, for example by taking its 2.5th and 97.5th percentiles.
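A minimal sketch of this bootstrap procedure is given below, using only the Python standard library. The data are made up, the function names are introduced here, and reading the interval off the 2.5th and 97.5th percentiles of the bootstrap distribution is one common choice assumed for this example.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample correlation coefficient r of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_ci(xs, ys, n_sims=10000, level=0.95, seed=0):
    """Percentile bootstrap interval for r, resampling (x, y) pairs together."""
    rng = random.Random(seed)   # seeded so the result is reproducible
    n = len(xs)
    rs = []
    while len(rs) < n_sims:
        # Draw n indices with replacement; each index keeps its pair intact
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            rs.append(pearson_r([xs[i] for i in idx], [ys[i] for i in idx]))
        except ZeroDivisionError:
            continue            # degenerate resample (no spread): redraw
    rs.sort()
    tail = (1 - level) / 2
    # Approximate percentile positions in the sorted bootstrap values
    lower = rs[int(tail * n_sims)]
    upper = rs[int((1 - tail) * n_sims) - 1]
    return lower, upper

# Hypothetical heights (cm) and foot lengths (cm), for illustration only
heights = [152, 158, 163, 165, 170, 173, 178, 180, 185, 191]
feet = [22.0, 23.5, 23.0, 24.5, 25.0, 24.0, 26.5, 26.0, 27.5, 28.0]

lower, upper = bootstrap_ci(heights, feet)
print(round(lower, 2), round(upper, 2))
```

If 0 falls outside the resulting interval, the data are consistent with a non-zero correlation, in line with the interpretation used in this lecture.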

Inference for correlation - 95% CIs

The correlation coefficient between height and foot length is 0.711, with a 95% CI of [0.54, 0.86].

$0$ (the "no-association value") is not in this interval. We have strong evidence suggesting that height and foot length are positively correlated.