EDRM611 - Applied Statistics in Education and Psychology I

Objectives for Unit Seven
Correlation and Regression

1. Know the type of data required to do a correlation analysis.
In order to do a correlation analysis you must have two variables in which the data consists of matched or paired cases. The two paired variables are usually referred to as X and Y. For correlational analysis either variable can be designated as X or Y.

2. Know the meaning of high, moderate, low, positive, and negative correlations, and be able to recognize each from a verbal description of data.
If one variable changes in a consistently predictable manner as another variable changes, there is a high correlation between the variables. If the change in one variable is not predictable from changes in the other variable, there is a low correlation. A moderate correlation would be somewhere between a high and a low correlation.

There are no absolute numbers differentiating high, moderate, and low correlations. A .70 correlation may be low in some circumstances and high in others. The meanings of high and low are relative.

A positive correlation occurs when the change in one variable is in the same direction as the change in the other variable. When the changes are in opposite directions there is a negative correlation.

The highest correlations possible are +1.00 and -1.00, which are equally high; the sign shows only the direction of the relationship. The lowest correlation possible is .00.

The number used to describe relationships is called the correlation coefficient.
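As an illustration, the Pearson correlation coefficient can be computed directly from paired scores. This is a minimal sketch: the function name `pearson_r` and the three short lists of scores are made up for the example and are not course data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired scores (made up for illustration).
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]    # tends to rise as hours rise
errors = [5, 4, 5, 4, 2]   # tends to fall as hours rise

r_pos = pearson_r(hours, score)
r_neg = pearson_r(hours, errors)
print(round(r_pos, 3), round(r_neg, 3))
```

The two coefficients have the same size and opposite signs, so they describe equally strong relationships in opposite directions.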

3. Know the meaning of linear and non-linear relationships and the relevance of each to correlation analysis.
A linear relationship is one where a given change in one variable will have a consistent change in the other variable at all values of the variables. A non-linear relationship will have different changes of Y for a given change in X, depending on the value of X.

Normal correlation analysis describes the linear relationship between X and Y. It is inappropriate to use normal correlation analysis to describe a relationship that is not linear. If it is done, the correlation coefficient will underestimate the true relationship between X and Y.
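A small sketch of the underestimation described above, using made-up data in which Y is completely determined by X but the relationship is curved (Y = X squared). The helper `pearson_r` is a hypothetical name written for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: Y is perfectly predictable from X, but not along a straight line.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = pearson_r(xs, ys)
print(round(r_curved, 3))
```

Here the linear correlation is zero even though Y can be predicted exactly from X, which is why a scatter diagram should be checked before interpreting the coefficient.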

4. Know how to interpret scatter diagrams (scatterplots) and estimate correlation coefficients and linear/non-linear relationships from them.
Scatter diagrams are conventionally arranged with the Y variable on the vertical axis and the X variable on the horizontal axis. The axes intersect at the smallest values plotted, with higher values extending up and to the right. Each case is represented on the plot by a point at the intersection of that case's X and Y values.

If the points are clustered closely around a straight line there is a high correlation. If the points form a roughly circular cloud there is no correlation. If the points go up as you move to the right, there is a positive correlation. If the points go down as you move to the right, there is a negative correlation. If the points do not consistently go up or down as you move to the right there is no correlation.

If the change in Y values was consistent as you moved to the right it would be a linear relationship. If the change in Y values was inconsistent as you moved to the right it would be a non-linear relationship.

5. Know the effect of changing the units of X and/or Y on the correlation coefficient.
Adding or subtracting a constant, or multiplying or dividing by a positive constant, for all of the numbers in one or both variables does not change the correlation coefficient. (Multiplying or dividing by a negative constant reverses the sign of the coefficient but not its size.) This is because the correlation coefficient is, in effect, the relationship between the z-scores of the two distributions. Adding or subtracting a constant changes the mean, and multiplying or dividing by a constant changes the standard deviation (and the mean), but z-scores are unaffected by the values of the mean and standard deviation.
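The effect of a change of units can be checked numerically. The sketch below uses made-up temperature and rating data and a hypothetical helper `pearson_r`. Converting Celsius to Fahrenheit multiplies by 1.8 and adds 32, which is a change of units of the kind described above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired data (made up for illustration).
celsius = [10, 15, 20, 25, 30]
rating = [3, 4, 4, 6, 7]

# Change the units of X: multiply by a positive constant and add a constant.
fahrenheit = [1.8 * c + 32 for c in celsius]

r_celsius = pearson_r(celsius, rating)
r_fahrenheit = pearson_r(fahrenheit, rating)

# Multiplying by a negative constant keeps the size of r but reverses its sign.
r_flipped = pearson_r([-c for c in celsius], rating)

print(round(r_celsius, 4), round(r_fahrenheit, 4), round(r_flipped, 4))
```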

6. Know the type of scale required for correlation analysis.
Correlation analysis requires that both variables be measured at least at the interval level. There are other procedures to measure relationships with nominal and ordinal data.

7. Know the effect of the unreliability of the variables on the correlation coefficient.
If either of the variables is unreliable (has measurement error), the correlation coefficient will be spuriously low (it will underestimate the true relationship between the variables). Scores with a large component of "randomness" cannot be correlated highly with anything.
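A rough simulation can illustrate this lowering of the coefficient. Everything here is an assumption made for the example: the scores are randomly generated, measurement error is modeled as independent random noise added to X, and `pearson_r` is a hypothetical helper.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Simulated "true" X scores and Y scores that depend on them.
true_x = [rng.gauss(0, 1) for _ in range(n)]
y = [x + rng.gauss(0, 0.5) for x in true_x]

# Unreliable measurement of X: true score plus random error (assumed error model).
noisy_x = [x + rng.gauss(0, 1) for x in true_x]

r_reliable = pearson_r(true_x, y)
r_unreliable = pearson_r(noisy_x, y)
print(round(r_reliable, 2), round(r_unreliable, 2))
```

The correlation computed from the error-laden X scores comes out lower than the one computed from the true scores, even though the underlying relationship has not changed.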

8. Know the effect of a restricted (truncated) range on the correlation coefficient.
If either of the variables has a restricted range (not the full range of the population of interest), the correlation will be spuriously low (underestimate the true relationship between the variables). This is because error (lack of perfect correlation) will be a larger proportion of the variance in a restricted range.
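The effect of a restricted range can also be sketched with simulated data. The cutoff of 0.5, the sample size, and the helper `pearson_r` are all assumptions chosen for illustration, loosely like keeping only cases who scored above a selection cutoff on X.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 2000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]

# Full range of X.
r_full = pearson_r(x, y)

# Restricted range: keep only the cases with X above an arbitrary cutoff.
kept = [(xi, yi) for xi, yi in zip(x, y) if xi > 0.5]
r_restricted = pearson_r([p[0] for p in kept], [p[1] for p in kept])

print(len(kept), round(r_full, 2), round(r_restricted, 2))
```

The amount of error in Y is the same in both groups, but in the restricted group there is less variation in X for Y to follow, so the error makes up a larger share of the variance and the coefficient drops.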

9. Know the relationship between correlation and causation.
If two variables are highly correlated with each other, it should not be assumed that one variable causes the other. It may be that a common variable causes both. A high correlation just suggests that a causal relationship might be investigated. If no correlation exists between two variables, it can generally be assumed that no causal relation exists. However, the lack of correlation may instead be caused by poor measurement, restriction of range, a non-linear relationship, or other extraneous factors that mask the true relationship.

10. Know how to interpret a correlation coefficient in terms of percent of variance accounted for.
The square of the correlation coefficient (the coefficient of determination) is the proportion of the variation in one variable that is accounted for (predicted) by the other variable. Multiplied by 100, it is the percent of variance accounted for.
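As a brief worked example, take an assumed correlation of r = .70 (a value chosen only for illustration):

```latex
r = .70 \quad\Rightarrow\quad r^2 = (.70)^2 = .49
```

Under that assumption, 49% of the variation in one variable is accounted for by the other, and the remaining 51% is not.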

11. Know the purpose of a regression equation.
Regression equations are used to predict values of one variable, given values on another variable. Prediction can be made from X to Y or from Y to X, although the common terminology is to use X to predict Y.

12. Know the meaning, functions and symbols for each component of a regression equation.
A regression equation has a regression coefficient (slope) and a constant (Y-intercept). The regression coefficient is the change in predicted Y that occurs for each change of X of one unit. The constant is the value that is added to each predicted value; it is the predicted Y when X is zero.

The regression coefficient is symbolized by (b), the constant by (a), and the predicted value by Y' or Y-hat (Y with a caret above it), so the equation is Y' = a + bX. Each Y' can be considered to be the average Y value that can be predicted for all of the cases in the distribution with a corresponding X value.

Neither a nor b can be used to evaluate the value of the regression equation. The benefit of the prediction is evaluated by the correlation coefficient or the standard error of estimate. A regression equation is useful when it is associated with a high correlation coefficient and a low standard error of estimate.
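The components described above can be sketched in a few lines. The data are made up, and the function names `fit_line` and `predict` are hypothetical. The slope and constant are computed with the usual least-squares formulas, b as the sum of cross-products divided by the sum of squares of X, and a as the Y mean minus b times the X mean.

```python
def fit_line(xs, ys):
    """Return (a, b) for the regression equation Y' = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # regression coefficient (slope)
    a = mean_y - b * mean_x  # constant (Y-intercept)
    return a, b

def predict(a, b, x):
    """Predicted value Y' for a given X."""
    return a + b * x

# Hypothetical paired scores (made up for illustration).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
y_hat_at_6 = predict(a, b, 6)
print(round(a, 2), round(b, 2), round(y_hat_at_6, 2))
```

For these scores the equation works out to Y' = 2.2 + 0.6X, so each one-unit increase in X raises the predicted Y by 0.6.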

13. Know the meaning of residual.
Each predicted score has a corresponding residual, which is the difference between the actual Y score and the predicted Y score (Y - Y').

14. Know the criteria used for forming the regression equation.
The regression equation meets the "Least Squares" criterion. The equation is that straight line for which the sum of the squared vertical (Y) distances (deviations or residuals) from the points to the line is a minimum.
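The criterion can be checked numerically on a small made-up data set. The sketch below fits the line, computes the sum of squared residuals, and compares it with the sums for a few nearby lines. The function name `sum_squared_residuals` and the size of the nudges are arbitrary choices for illustration.

```python
def sum_squared_residuals(a, b, xs, ys):
    """Sum of squared vertical distances from each point to the line Y' = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical paired scores (made up for illustration).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least-squares slope and constant.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a = mean_y - b * mean_x

best = sum_squared_residuals(a, b, xs, ys)

# Nudge the constant and the slope; every other line should do worse.
others = [
    sum_squared_residuals(a + da, b + db, xs, ys)
    for da in (-0.5, 0.0, 0.5)
    for db in (-0.1, 0.0, 0.1)
    if (da, db) != (0.0, 0.0)
]

print(round(best, 2), round(min(others), 2))
```

Any line other than the regression line gives a larger sum of squared residuals for the same points.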

15. Know how to predict using the correlation coefficient and z scores.
A predicted z score (for Y) is equal to the correlation coefficient times the corresponding z score for X.
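In symbols, with a worked example that assumes r = .60 and a z score for X of 1.50 (values chosen only for illustration; the primed symbols mark predicted values):

```latex
z'_Y = r \, z_X
% Worked example with the assumed values r = .60 and z_X = 1.50:
z'_Y = (.60)(1.50) = .90
% Converting the predicted z score back to raw-score units of Y:
Y' = \bar{Y} + z'_Y \, s_Y
```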

16. Know the meaning of and how to apply regression to the mean.
Regression to the mean refers to the fact that when predicting with less than perfect prediction, the predicted z score for Y is always closer to the mean than the actual z score for X. With perfect prediction (r=1.00) the X and Y z scores are the same. With no prediction (r=0.00) the predicted Y z score is 0.00 (the mean of Y). For all predictions in between r=.00 and r=1.00, the predicted z scores for Y are between the mean (z=0.00) and the z score for X.
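A short sketch of how the predicted z score for Y moves toward the mean as the correlation drops. The z score of 2.00 for X and the list of correlations are assumed values, and `predicted_z_y` is a hypothetical helper that multiplies the correlation coefficient by the z score for X.

```python
def predicted_z_y(r, z_x):
    """Predicted z score for Y: the correlation times the z score for X."""
    return r * z_x

z_x = 2.0  # a case two standard deviations above the X mean (assumed value)
correlations = (0.0, 0.25, 0.5, 0.75, 1.0)

predictions = [predicted_z_y(r, z_x) for r in correlations]
for r, z_y in zip(correlations, predictions):
    print(f"r = {r:.2f}  predicted z for Y = {z_y:.2f}")
```

Only with r = 1.00 is the predicted z score for Y as extreme as the z score for X; for every lower correlation the prediction lies closer to the mean (z = 0.00).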

17. Know the meaning of total variation, unexplained variation, and explained variation.
Total variation is the sum of the squared deviations of each Y score from the Y mean. It is the variation that exists within the distribution of Y scores before prediction.

Unexplained variation is the sum of the squared deviations of each score from the predicted value. It is the amount of variation of the Y scores that remains after prediction.

Explained variation is the sum of the squared deviations of each predicted score from the Y mean. It is the amount of variation of the Y scores that can be predicted.

18. Know how r² can be computed from total variation and explained variation.
r² is explained variation divided by total variation. It is the proportion of the total variation that can be explained.
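The three kinds of variation, and r² as their ratio, can be computed on a small made-up data set. This is a sketch with hypothetical variable names; the regression line is fitted by least squares.

```python
# Hypothetical paired scores (made up for illustration).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares regression line Y' = a + b*X.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a = mean_y - b * mean_x
predicted = [a + b * x for x in xs]

# Total: squared deviations of each actual score from the Y mean.
total = sum((y - mean_y) ** 2 for y in ys)
# Unexplained: squared deviations of each actual score from its predicted value.
unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))
# Explained: squared deviations of each predicted score from the Y mean.
explained = sum((p - mean_y) ** 2 for p in predicted)

r_squared = explained / total
print(round(total, 2), round(explained, 2), round(unexplained, 2), round(r_squared, 2))
```

For these scores the total variation of 6.0 splits into 3.6 explained and 2.4 unexplained, so r² is 3.6 / 6.0 = .60.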

19. Know the meaning of and how to interpret the standard error of estimate.
The standard error of estimate is an indicator of the accuracy of prediction. It is equivalent to the standard deviation of the residuals. If there is perfect prediction all of the residuals will be zero and the standard error of estimate will be zero. If there is no prediction (zero correlation), the residuals will be the same as the deviation scores and the standard error of estimate will be the same as the standard deviation of the Y scores (with a slight adjustment for one less degree of freedom).

If it can be assumed that the residuals will be normally distributed with equal variance at each X value, the standard error of estimate can be interpreted as a standard deviation of Y values at each X value in terms of the normal curve (approximately 68% of the actual Y values will be within one standard error of estimate of the predicted Y value).
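A minimal sketch of the computation on made-up data. It assumes the common convention of dividing the unexplained variation by n - 2 degrees of freedom; the variable names are hypothetical.

```python
from math import sqrt

# Hypothetical paired scores (made up for illustration).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares regression line Y' = a + b*X.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
b = sxy / sxx
a = mean_y - b * mean_x

# Residuals: actual Y minus predicted Y.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Standard error of estimate, using n - 2 degrees of freedom (assumed convention).
std_error_est = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# The same value from the standard deviation of Y and the correlation coefficient.
r = sxy / sqrt(sxx * syy)
s_y = sqrt(syy / (n - 1))
via_r = s_y * sqrt(1 - r ** 2) * sqrt((n - 1) / (n - 2))

print(round(std_error_est, 4), round(via_r, 4))
```

Both routes give the same value. The second one shows why the standard error of estimate shrinks toward zero as r approaches 1.00 and approaches the standard deviation of Y as r approaches .00.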