Guillaume Calmettes
Paired data
So far, we have dealt with categorical and quantitative variables, whether for making inferences from one sample or for comparing 2 groups. When comparing two groups, whatever the nature of our response variable (categorical or quantitative), the explanatory variable has always been categorical.
Today, we will learn how to describe the relationship between two quantitative variables, meaning that the explanatory variable will also be quantitative.
A scatterplot is the graph of the relationship between two quantitative variables.
If there are explanatory and response variables:
Explanatory variable on the x axis
Response variable on the y axis
The paired data for each case are plotted as a single point on the scatterplot.
The numerical scales (and units) are independent, one for each variable.
Note:
A scatterplot can encode more than 2 variables,
which allows you to build complex visualizations:
Size encoding
Color encoding
Symbol encoding
Ex:
- Point size: Population
- Color: Number of larcenies
In this case, multiple relationships between the different variables can be visually appreciated at the same time.
Note:
Multiple datasets can be displayed
on the same scatterplot.
The different datasets can be differentiated by
color/symbol encoding
All the datasets must have the same variables
(displayed on the same scales)
Visual comparisons can be made.
There are 4 major characteristics to consider when describing a scatterplot:
1. Direction (positive/negative)
A positive association means that values of one variable tend to be higher when values of the other variable are higher
A negative association means that values of one variable tend to be lower when values of the other variable are higher
Two variables are not associated if knowing the value of one variable does not give you any information about the value of the other variable
2. Form (linear or not)
An association is considered linear if the overall shape of the cloud of data points can be
described as a line.
An association is considered non-linear if there is obvious non-linearity in the
overall shape of the data points.
3. Strength (strong-moderate-weak, we will let correlation help us decide)
The strength of an association is reflected by how closely the data points cluster around the overall (linear) shape. More scatter around that shape, in both x and y, means a weaker association.
4. Unusual observations
Unusual observations (outliers) are data points for which the value of one of the variables is much lower or much greater than for the other data points. Such a data point will appear out of place compared to the others.
Make initial guesses for the strength and direction of association for each of the following car characteristics
In statistics, dependence (or association) is any statistical relationship,
whether causal or not, between two (random) variables.
When the association is linear, we often refer to such an association as correlation.
The numerical statistic to measure the strength and direction of linear association between two quantitative variables is the correlation coefficient (often simply called "correlation").
sample correlation coefficient: $r$
population correlation coefficient: $\rho$ ("rho")
Correlation coefficient ($r$)
What are the properties of $r$?
The correlation coefficient:
Reflects the strength and direction of a linear relationship
Does not reflect the slope of the relationship
Does not reflect many aspects of nonlinear relationships
Range of values: $-1\leq r\leq1$
The sign indicates the direction of association: positive $r$ for a positive association, negative $r$ for a negative association
The closer $r$ is to $\pm 1$, the stronger the linear association (the closer the points fit to a line)
$r$ has no units and does not depend on the units of measurement
The correlation makes no distinction between explanatory and response variables (swap the axes and the correlation coefficient will be the same)
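Several of these properties can be checked numerically. The sketch below is only an illustration: the `correlation` helper and all the data values are made up for this example and are not taken from the lecture.

```python
from math import sqrt

def correlation(x, y):
    """Sample correlation coefficient r for paired lists x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired measurements (made-up values)
heights_cm = [150, 160, 165, 172, 180, 188]
foot_cm = [22, 24, 23, 26, 27, 28]

r_cm = correlation(heights_cm, foot_cm)

# No units: converting heights to inches leaves r unchanged
heights_in = [h / 2.54 for h in heights_cm]
r_in = correlation(heights_in, foot_cm)

# No explanatory/response distinction: swapping the axes leaves r unchanged
r_swapped = correlation(foot_cm, heights_cm)

# r does not reflect the slope: a shallow and a steep exact line both give r = 1
xs = [1, 2, 3, 4]
r_shallow = correlation(xs, [0.1 * v for v in xs])
r_steep = correlation(xs, [10 * v for v in xs])

print(r_cm, r_in, r_swapped, r_shallow, r_steep)
```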
The commonly accepted ranges for interpreting the correlation coefficient are as follows:
Correlation value (absolute value) | Strength of association | What this means |
0.7 to 1.0 | Strong | The points will appear to be nearly a straight line |
0.3 to 0.7 | Moderate | When looking at the graph the increasing/decreasing pattern will be clear, but it won’t be nearly a line |
0.1 to 0.3 | Weak | With some effort you will be able to see a slightly increasing/decreasing pattern |
0 to 0.1 | None | No discernible increasing/decreasing pattern |
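The table can be turned into a small lookup function. This is a sketch only: the name `strength_of_association` is hypothetical, and because the table's ranges share their endpoints (0.7, 0.3, 0.1), the code assumes a boundary value goes to the stronger category.

```python
def strength_of_association(r):
    """Label the strength of a correlation using the ranges in the table above."""
    size = abs(r)  # the table is stated in terms of the absolute value of r
    if size > 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    if size >= 0.7:  # assumption: boundary values go to the stronger category
        return "strong"
    if size >= 0.3:
        return "moderate"
    if size >= 0.1:
        return "weak"
    return "none"

for value in (0.711, -0.5, 0.2, 0.05):
    print(value, strength_of_association(value))
```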
Go to http://istics.net/Correlations/
Enter group id: Cal17
We routinely rely on technology to compute correlations (computing the correlation "by hand" can be tedious, especially with a very high number of data points).
The correlation formula can be expressed in the form:
$r=\frac{1}{n-1}\sum_{i=1}^{n} \left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$
with $s_x=\sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}$ (and $s_y$ defined in the same way from the $y_i$)
Essentially, the correlation calculation converts all values of both variables to z-scores, and $r$ can be considered as the mean (computed with $n-1$ rather than $n$) of the products of these standard scores (local variation of the data in both directions).
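As an illustration of the z-score view of the formula, here is a minimal sketch in plain Python. The helper names `sample_sd` and `correlation` and the five data pairs are made up for this example; in practice a statistics library would do this computation.

```python
from math import sqrt

def sample_sd(values):
    """Sample standard deviation with the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def correlation(x, y):
    """Sample correlation r: sum of the products of z-scores, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x, s_y = sample_sd(x), sample_sd(y)
    z_products = [((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                  for xi, yi in zip(x, y)]
    return sum(z_products) / (n - 1)

# Made-up illustrative values, not data from the lecture
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = correlation(x, y)
print(round(r, 3))  # 0.775
```

Points whose x and y z-scores have the same sign contribute positively to the sum, and points whose z-scores have opposite signs contribute negatively.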
A strong positive or negative correlation does not (necessarily) imply a cause and effect relationship between the two variables.
A correlation near zero does not (necessarily) mean that the two variables are not associated, since the correlation measures only the strength of a linear relationship.
Correlation can be heavily influenced by outliers. Always plot your data!
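Two of these cautions can be seen on small made-up datasets. The sketch below uses a hypothetical `correlation` helper; the numbers are chosen only to make the effects obvious and do not come from the lecture.

```python
from math import sqrt

def correlation(x, y):
    """Sample correlation coefficient r for paired lists x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# A perfect but non-linear association (y = x^2) has a correlation of zero
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]
r_parabola = correlation(x, y)

# A single outlier can turn a weak negative correlation into a strong positive one
x_small = [1, 2, 3, 4, 5]
y_small = [3, 1, 2, 3, 1]
r_without = correlation(x_small, y_small)
r_with = correlation(x_small + [20], y_small + [20])

print(r_parabola, r_without, r_with)
```

In the first dataset, knowing x determines y exactly, yet $r$ is zero because the relationship is not linear. In the second, one added point dominates the calculation, which is only obvious once the data are plotted.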
If there was no association between height and foot length, what is the probability we would get a correlation as high as 0.711 just by chance?
If there is no association between height and foot length, it means that there is no pairing between the two variables.
Resampling methodology:
We can break apart the heights and their corresponding foot lengths
by shuffling one of the variables while
keeping the other one untouched.
We create hypothetical pairs that could have been observed if there was no association between height and foot length.
For each simulation, we calculate the new $r$ resulting from this possible association.
By repeating this process 10000 times, we obtain the sampling distribution of the correlation coefficient if there was no association between the two variables in our dataset. This is the sampling distribution of $r$ under the null hypothesis.
A p-value can be calculated by counting how many simulations resulted in an $r$ at least as extreme as our initial observation and dividing by the total number of simulations.
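The shuffling procedure described above can be sketched as follows. This is an illustration under stated assumptions: the function names, the seed, and the ten height/foot-length pairs are all hypothetical (the lecture's dataset is not reproduced here), and the p-value is one-sided, counting simulated correlations at least as high as the observed one.

```python
import random
from math import sqrt

def correlation(x, y):
    """Sample correlation coefficient r for paired lists x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

def permutation_p_value(x, y, n_sims=10000, seed=0):
    """One-sided p-value for r, obtained by shuffling y while x stays untouched."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    r_obs = correlation(x, y)
    shuffled = list(y)
    count = 0
    for _ in range(n_sims):
        rng.shuffle(shuffled)  # breaks the pairing between x and y
        if correlation(x, shuffled) >= r_obs:
            count += 1
    return r_obs, count / n_sims

# Hypothetical data (made-up values)
heights = [160, 165, 170, 172, 175, 178, 180, 183, 185, 190]
foot_lengths = [23.0, 24.5, 24.0, 25.5, 25.0, 26.5, 26.0, 27.5, 27.0, 28.5]

r_obs, p_value = permutation_p_value(heights, foot_lengths)
print(r_obs, p_value)
```

For a two-sided test, one would instead compare the absolute value of each simulated $r$ with the absolute value of the observed $r$.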
$p=0.0003$
We have strong evidence against the null hypothesis:
a correlation coefficient as extreme as $r=0.711$ would
only be observed 0.03% of the time if the null hypothesis
were true.
We can conclude that foot length is positively
correlated with height.
Resampling methodology:
To construct the 95% confidence interval, we construct
the sampling distribution of our statistic of interest ($r$)
by drawing (with replacement)
random bootstrap samples of the pairs from our original data sample.
For each simulation, we calculate the new $r$ resulting from this new obtained sample.
By repeating this process 10000 times, we obtain the sampling distribution of $r$, approximately centered at the value of $r$ observed in our original sample.
The 95% confidence interval can then be calculated from this distribution.
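The bootstrap procedure can be sketched as follows. The slides do not say how the interval is read off the bootstrap distribution, so this sketch assumes the percentile method (the 2.5th and 97.5th percentiles). The function names, the seed, and the data are hypothetical.

```python
import random
from math import sqrt

def correlation(x, y):
    """Sample correlation coefficient r for paired lists x and y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

def bootstrap_ci(x, y, n_sims=10000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for r, resampling whole pairs."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n = len(x)
    rs = []
    while len(rs) < n_sims:
        # Draw n indices with replacement so each (x, y) pair stays together
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [x[i] for i in idx]
        by = [y[i] for i in idx]
        if len(set(bx)) < 2 or len(set(by)) < 2:
            continue  # r is undefined when a resample has no variation
        rs.append(correlation(bx, by))
    rs.sort()
    lower = rs[int((1 - level) / 2 * n_sims)]
    upper = rs[int((1 + level) / 2 * n_sims) - 1]
    return lower, upper

# Hypothetical data (made-up values)
heights = [160, 165, 170, 172, 175, 178, 180, 183, 185, 190]
foot_lengths = [23.0, 24.5, 24.0, 25.5, 25.0, 26.5, 26.0, 27.5, 27.0, 28.5]

lower, upper = bootstrap_ci(heights, foot_lengths)
print(lower, upper)
```

Resampling indices rather than the two lists separately is what keeps each height attached to its own foot length; shuffling the variables independently would destroy the association instead of estimating its variability.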
The correlation coefficient between height and foot length is 0.711, with a 95% CI of [0.54, 0.86].
$0$ (the "no-association value") is not in this interval. We have strong evidence suggesting that height and foot length are positively correlated.