Objectives for Unit Seven
Correlation and Regression
1. Know the type of data required
to do a correlation analysis.
A correlation analysis requires
two variables whose values form matched (paired) cases: each case contributes
one value on each variable. The two paired variables are usually referred to
as X and Y. For correlational analysis either variable can be designated as X or Y.
2. Know the meaning of high,
moderate, low, positive, and negative correlations, and be able to recognize
each from a verbal description of data.
If one variable changes in a consistently
predictable manner as another variable changes, there is a high correlation
between the variables. If the change in one variable is not predictable
from changes in the other variable, there is a low correlation. A moderate
correlation would be somewhere between a high and a low correlation.
There are no absolute numbers differentiating high, moderate, and low correlations. A .70 correlation may be low in some circumstances and high in others. The meanings of high and low are relative.
A positive correlation occurs when the change in one variable is in the same direction as the change in the other variable. When the changes are in opposite directions, there is a negative correlation.
The highest correlations possible are +1.00 and -1.00, which are equally high; the sign shows only the direction of the relationship. The lowest correlation possible is .00.
The number used to describe relationships is called the correlation coefficient.
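The following is a minimal sketch of how a correlation coefficient can be computed from paired cases. It assumes the usual Pearson formula, and the variable names and data values are hypothetical, chosen only to give one positive and one negative example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two paired lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviation scores
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired cases: one X with two different Y variables
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 60, 61, 70, 77]   # tends to rise as X rises
errors_made = [9, 8, 6, 5, 2]       # tends to fall as X rises

r_pos = pearson_r(hours_studied, exam_score)
r_neg = pearson_r(hours_studied, errors_made)
print(round(r_pos, 2), round(r_neg, 2))
```

Both coefficients are close to 1.00 in absolute size, so both relationships are high; only the sign differs.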
3. Know the meaning of linear
and non-linear relationships and the relevance of each to correlation analysis.
A linear relationship is one where
a given change in one variable will have a consistent change in the other
variable at all values of the variables. A non-linear relationship will
have different changes of Y for a given change in X, depending on the value
of X.
Normal (Pearson) correlation analysis describes the linear relationship between X and Y. It is inappropriate to use it to describe a relationship that is not linear. If it is used anyway, the correlation coefficient will underestimate the strength of the true relationship between X and Y.
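As an illustration of that underestimate, the sketch below uses hypothetical data in which Y is completely determined by X (Y is X squared), yet the linear correlation coefficient comes out as zero. The function name and data are assumptions made for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two paired lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]   # perfect but non-linear (U-shaped) relationship

r_curved = pearson_r(xs, ys)
print(r_curved)
```

The change in Y for a one-unit change in X depends on where X is, so the best straight line through these points is flat and the coefficient misses the relationship entirely.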
4. Know how to interpret scatter
diagrams (scatterplots) and estimate correlation coefficients and linear/non-linear
relationships from them.
Scatter diagrams are conventionally
arranged with the Y variable on the vertical axis and the X variable on
the horizontal axis. The axes intersect at the smallest values plotted,
with higher values extending up and to the right. Each case
is represented on the plot by a point at the intersection of that
case's X and Y values.
If the points are clustered closely around a straight line, there is a high correlation. If the points form a circle, there is no correlation. If the points go up as you move to the right, there is a positive correlation. If the points go down as you move to the right, there is a negative correlation. If the points do not consistently go up or down as you move to the right, there is no correlation.
If the change in Y values was consistent as you moved to the right it would be a linear relationship. If the change in Y values was inconsistent as you moved to the right it would be a non-linear relationship.
5. Know the effect of changing
the units of X and/or Y on the correlation coefficient.
Adding or subtracting a constant,
or multiplying or dividing by a positive constant, for all of the numbers in
one or both variables does not change the correlation coefficient. (Multiplying
or dividing one variable by a negative constant reverses the sign of the
coefficient but not its size.) This is because the correlation
coefficient is, in effect, the relationship between the z-scores of the
two distributions. Adding or subtracting a constant changes the means, and
multiplying or dividing by a constant changes the means and the standard
deviations, but z-scores remove the values of the means
and standard deviations.
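The sketch below illustrates the z-score argument. It computes r as the average product of paired z-scores (using the population standard deviation, which is an assumption of this example), then recomputes it after a change of units. The height and weight values are hypothetical.

```python
import math

def z_scores(values):
    """Convert raw values to z-scores using the population standard deviation."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return [(v - m) / sd for v in values]

def r_from_z(xs, ys):
    """Correlation as the mean of the products of paired z-scores."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

height_in = [60, 62, 65, 68, 71]        # hypothetical heights in inches
weight_lb = [110, 120, 140, 155, 170]   # hypothetical weights in pounds

height_cm = [h * 2.54 for h in height_in]     # multiply by a positive constant
weight_shifted = [w - 100 for w in weight_lb]  # subtract a constant

r_original = r_from_z(height_in, weight_lb)
r_rescaled = r_from_z(height_cm, weight_shifted)
# Multiplying one variable by a negative constant flips the sign only
r_flipped = r_from_z([-h for h in height_in], weight_lb)
print(r_original, r_rescaled, r_flipped)
```

The first two values agree apart from floating-point rounding, and the third has the same size with the opposite sign.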
6. Know the type of scale required
for correlation analysis.
Correlation analysis requires that
both variables be measured at least at the interval level. There are other
procedures to measure relationships with nominal and ordinal data.
7. Know the effect of the unreliability
of the variables on the correlation coefficient.
If either of the variables is
unreliable (has measurement error), the correlation coefficient will be
spuriously low (underestimate the true relationship between the variables).
Scores with a large component of "randomness" cannot be correlated with
anything.
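A small sketch of this effect follows. The "true" Y scores are a perfect linear function of X, and a fixed pattern of measurement error is then added to them. The error values are hypothetical and were chosen to be unrelated to X so that the example gives the same result every time it runs; real measurement error would be random.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two paired lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
true_y = [2 * x for x in xs]              # error-free scores
error = [3, -3, -3, 3, 3, -3, -3, 3]      # hypothetical measurement error
observed_y = [y + e for y, e in zip(true_y, error)]

r_true = pearson_r(xs, true_y)            # relationship without error
r_observed = pearson_r(xs, observed_y)    # relationship with unreliable Y
print(round(r_true, 3), round(r_observed, 3))
```

The underlying relationship has not changed, but the coefficient computed from the unreliable scores is noticeably lower.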
8. Know the effect of a restricted
(truncated) range on the correlation coefficient.
If either of the variables has
a restricted range (not the full range of the population of interest),
the correlation will be spuriously low (underestimate the true relationship
between the variables). This is because error (lack of perfect correlation)
will be a larger proportion of the variance in a restricted range.
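The sketch below illustrates restriction of range with a small hypothetical data set. The same cases are correlated twice: once over the full range of X and once using only the cases in the middle of the range.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two paired lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired cases covering the full range of X
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5, 1, 3, 11, 13, 9, 11, 19]

# Keep only the cases whose X value falls in the middle of the range
restricted = [(x, y) for x, y in zip(xs, ys) if 3 <= x <= 6]
xs_restricted = [x for x, _ in restricted]
ys_restricted = [y for _, y in restricted]

r_full = pearson_r(xs, ys)
r_restricted = pearson_r(xs_restricted, ys_restricted)
print(round(r_full, 3), round(r_restricted, 3))
```

The scatter around the line is about the same in both cases, but in the restricted sample there is less systematic variation for it to be compared against, so the coefficient drops.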
9. Know the relationship between
correlation and causation.
If two variables are highly correlated
with each other, it should not be assumed that one variable causes the
other. It may be that a common variable causes both. A high correlation
just suggests that a causal relationship might be investigated. If no correlation
exists between two variables, it is usually assumed that no causal relation
exists, although the lack of correlation may be caused by poor measurement,
restriction of range, a non-linear relationship, or other extraneous factors
that mask the true relationship.
10. Know how to interpret a correlation
coefficient in terms of percent of variance accounted for.
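The square of the correlation coefficient
(the coefficient of determination) is the proportion of the variation
in one variable that is accounted for (predicted) by the other variable;
multiplied by 100, it is the percent of variance accounted for. For example,
a correlation of .50 accounts for .25, or 25 percent, of the variance.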
11. Know the purpose of a regression
equation.
Regression equations are used to
predict values of one variable, given values on another variable. Prediction
can be made from X to Y or from Y to X although the common terminology
is to use X to predict Y.
12. Know the meaning, functions
and symbols for each component of a regression equation.
A regression equation has a regression
coefficient (slope) and a constant (Y-intercept). The regression coefficient
is the change in Y that occurs for each one-unit change in X. The constant
is the value that is added to each predicted value; it is the predicted
value of Y when X is zero.
The regression coefficient is symbolized by (b), the constant by (a), and the predicted value by Y' or Y-hat (Y with a caret above it), so the equation is written Y' = a + bX. Each Y' can be considered to be the average Y value that can be predicted for all of the cases in the distribution with a corresponding X value.
Neither a nor b can be used to evaluate the value of the regression equation. The benefit of the prediction is evaluated by the correlation coefficient or the standard error of estimate. A regression equation is useful when it is associated with a high correlation coefficient and a low standard error of estimate.
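As a minimal sketch, the slope and constant can be computed from paired data with the standard least-squares formulas (b is the sum of cross-products divided by the sum of squares of X, and a is the Y mean minus b times the X mean). The data and variable names are hypothetical.

```python
xs = [1, 2, 3, 4, 5]   # hypothetical X values
ys = [2, 4, 5, 4, 5]   # hypothetical Y values

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Regression coefficient (slope): change in Y' per one-unit change in X
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
# Constant (Y-intercept): predicted Y when X is zero
a = mean_y - b * mean_x

def predict(x):
    """Predicted value Y' = a + bX."""
    return a + b * x

print(a, b, predict(4))
```

For these values the equation is Y' = 2.2 + 0.6X, apart from floating-point rounding. Knowing a and b alone says nothing about how good the predictions are.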
13. Know the meaning of residual.
Each predicted score has a corresponding
residual which is the difference between the predicted Y score (Y') and
the actual Y score.
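The sketch below computes a residual for each case in a small hypothetical data set. It assumes the common sign convention of actual minus predicted (Y - Y'); the text above does not specify a direction.

```python
xs = [1, 2, 3, 4, 5]   # hypothetical X values
ys = [2, 4, 5, 4, 5]   # hypothetical actual Y values

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x

predicted = [a + b * x for x in xs]                  # Y' for each case
residuals = [y - yp for y, yp in zip(ys, predicted)]  # assumed convention: Y - Y'

for x, y, yp, e in zip(xs, ys, predicted, residuals):
    print(x, y, round(yp, 2), round(e, 2))
```

With a least-squares line the positive and negative residuals balance out, so they sum to zero apart from rounding.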
14. Know the criteria used for
forming the regression equation.
The regression equation meets the
"Least Squares" criterion. The equation is that straight line for which
the sum of the squared vertical (Y) distances (deviations or residuals) of
the points from the line is a minimum.
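One way to see the criterion at work is to compare the sum of squared residuals for the fitted line with the same sum for a few other straight lines. The sketch below does this for a hypothetical data set; the alternative lines are arbitrary choices made for illustration.

```python
xs = [1, 2, 3, 4, 5]   # hypothetical X values
ys = [2, 4, 5, 4, 5]   # hypothetical Y values

def sum_squared_residuals(a, b):
    """Sum of squared vertical distances from the points to the line Y' = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b_fit = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
a_fit = mean_y - b_fit * mean_x

sse_fit = sum_squared_residuals(a_fit, b_fit)

# Arbitrary alternative lines given as (a, b) pairs
other_lines = [(2.2, 0.7), (2.0, 0.6), (2.5, 0.5), (4.0, 0.0)]
sse_others = [sum_squared_residuals(a, b) for a, b in other_lines]

print(round(sse_fit, 3), [round(s, 3) for s in sse_others])
```

Every alternative line gives a larger sum of squared residuals than the regression line does.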
15. Know how to predict using
the correlation coefficient and z scores.
A predicted z score (for Y) is
equal to the correlation coefficient times the corresponding z score for
X.
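The sketch below applies this rule to a hypothetical data set: it converts an X value to a z score, multiplies by r, and converts the predicted z score back to the Y scale. Population standard deviations are used for the z scores, which is an assumption of this example.

```python
import math

xs = [1, 2, 3, 4, 5]   # hypothetical X values
ys = [2, 4, 5, 4, 5]   # hypothetical Y values

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
r = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / (n * sd_x * sd_y))

def predict_y(x):
    """Predict Y from X by way of z scores."""
    z_x = (x - mean_x) / sd_x        # z score for X
    z_y_pred = r * z_x               # predicted z score for Y
    return mean_y + z_y_pred * sd_y  # back to raw Y units

y_pred_at_5 = predict_y(5)
print(round(r, 4), round(y_pred_at_5, 4))
```

The raw-score prediction obtained this way is the same value a regression equation fitted to these data would give for X = 5.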
16. Know the meaning of and how
to apply regression to the mean.
Regression to the mean refers to
the fact that when predicting with less than perfect prediction, the predicted
z score for Y is always closer to the mean than the actual z score for
X. With perfect prediction (r=1.00) the X and Y z scores are the same.
With no prediction (r=0.00) the predicted Y z score is 0.00 (the mean of
Y). For all predictions in between r=.00 and r=1.00, the predicted z scores
for Y are between the mean (z=0.00) and the z score for X.
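A short sketch of this pattern follows, using a hypothetical case that is two standard deviations above the mean on X and a few illustrative values of r.

```python
z_x = 2.0                                  # hypothetical z score for X
r_values = [0.0, 0.25, 0.5, 0.75, 1.0]     # illustrative correlations

# Predicted z score for Y is r times the z score for X
predicted_z = [r * z_x for r in r_values]

for r, z_pred in zip(r_values, predicted_z):
    print(f"r = {r:.2f}  predicted z for Y = {z_pred:.2f}")
```

As r falls from 1.00 to .00, the predicted z score for Y moves from the z score for X all the way back to the mean.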
17. Know the meaning of total
variation, unexplained variation, and explained variation.
Total variation is the sum of the
squared deviations of each Y score from the Y mean. It is the variation
that exists within the distribution of Y scores before prediction.
Unexplained variation is the sum of the squared deviations of each score from the predicted value. It is the amount of variation of the Y scores that remains after prediction.
Explained variation is the sum of the squared deviations of each predicted score from the Y mean. It is the amount of variation of the Y scores that can be predicted.
18. Know how r² can be computed
from total variation and explained variation.
r² is explained variation
divided by total variation. It is the proportion of the total variation
that can be explained.
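The three kinds of variation and their link to r² can be sketched as follows, using a hypothetical data set and a least-squares regression line fitted to it.

```python
xs = [1, 2, 3, 4, 5]   # hypothetical X values
ys = [2, 4, 5, 4, 5]   # hypothetical Y values

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x
predicted = [a + b * x for x in xs]

# Squared deviations of actual scores from the Y mean
total = sum((y - mean_y) ** 2 for y in ys)
# Squared deviations of actual scores from their predicted values
unexplained = sum((y - yp) ** 2 for y, yp in zip(ys, predicted))
# Squared deviations of predicted values from the Y mean
explained = sum((yp - mean_y) ** 2 for yp in predicted)

r_squared = explained / total
print(total, round(unexplained, 3), round(explained, 3), round(r_squared, 3))
```

For a least-squares line, explained and unexplained variation add up to the total variation, so r² can equally be found as one minus the unexplained proportion.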
19. Know the meaning of and how
to interpret the standard error of estimate.
The standard error of estimate
is an indicator of the accuracy of prediction. It is equivalent to the
standard deviation of the residuals. If there is perfect prediction all
of the residuals will be zero and the standard error of estimate will be
zero. If there is no prediction (zero correlation), the residuals will
be the same as the deviation scores and the standard error of estimate
will be the same as the standard deviation of the Y scores (with a slight
adjustment for one less degree of freedom).
If it can be assumed that the residuals
will be normally distributed with equal variance at each X value, the
standard error of estimate can be interpreted as a standard deviation of
Y values at each X value in terms of the normal curve (about 68% of the actual
Y values will be within one standard error of estimate of the predicted
Y value).
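A minimal sketch of the computation follows. The text above gives no formula, so this example assumes the common convention of dividing the sum of squared residuals by n - 2; the data are hypothetical.

```python
import math

xs = [1, 2, 3, 4, 5]   # hypothetical X values
ys = [2, 4, 5, 4, 5]   # hypothetical Y values

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Standard error of estimate (assumed convention: n - 2 degrees of freedom)
see = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
# Sample standard deviation of Y, for comparison
sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))

# Approximate 68% band around the predicted value at X = 3
y_pred = a + b * 3
band = (y_pred - see, y_pred + see)
print(round(see, 3), round(sd_y, 3), tuple(round(v, 3) for v in band))
```

Because the prediction accounts for some of the variation in Y, the standard error of estimate is smaller than the standard deviation of the Y scores.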