The regression equation was of the form y = bx + a, where b is the slope or regression coefficient and a is the intercept or regression constant. The symbols used vary and are commonly reversed as well! The correlation strength, for quantitative (interval or ratio) data, was expressed in terms of r, the Pearson product moment correlation coefficient, with values of magnitude near 1 indicating a strong correlation and values near zero indicating little or no correlation. When the regression equation slope is positive, the regression is said to be positive and the correlation coefficient will be positive as well. When the regression equation slope is negative, the regression is said to be negative and the correlation coefficient will be negative as well. When the slope is zero (b = 0), the correlation coefficient will be 0 (r = 0). $r^2$ is called the coefficient of determination and measures the proportion of the variation in y explained by the variation in x.
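As a rough illustration of the quantities just reviewed, the following Python sketch computes the slope b, the intercept a, r, and r^2 from deviation sums. The data set and the function name `fit_line` are made up for this example and do not come from the course homework.

```python
import math

def fit_line(xs, ys):
    """Return (b, a, r) for the least-squares line y = b*x + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products about the means.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sp_xy / ss_x                      # slope (regression coefficient)
    a = mean_y - b * mean_x               # intercept (regression constant)
    r = sp_xy / math.sqrt(ss_x * ss_y)    # Pearson correlation coefficient
    return b, a, r

# Hypothetical data, chosen only to keep the arithmetic small.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b, a, r = fit_line(xs, ys)
print(f"slope b = {b:.3f}, intercept a = {a:.3f}")
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```

Because b and r share the numerator `sp_xy` and their denominators are positive, they always have the same sign, which is the point made above about positive and negative regressions.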
Unless there is perfect correlation (|r| = 1), there will be some scatter of the data about the best fit regression line. Ideally, the distribution of the data about the regression line, the conditional distributions, will conform to normal distributions with the mean on the regression line and standard error of estimate: $s_{yx} = s_y \sqrt{1 - r^2}\,\sqrt{(n - 1)/(n - 2)}$. In addition, ideally, these conditional distributions will have equal variances (hence standard deviations). This assumption is homoscedasticity.
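The standard error of estimate can be checked numerically. The sketch below, on made-up data, computes it from the formula above and again directly from the residuals as sqrt(SSE / (n - 2)); the two should agree. The helper name `std_error_of_estimate` is an assumption for illustration.

```python
import math
import statistics

def std_error_of_estimate(s_y, r, n):
    """s_yx = s_y * sqrt(1 - r^2) * sqrt((n - 1) / (n - 2))."""
    return s_y * math.sqrt(1 - r ** 2) * math.sqrt((n - 1) / (n - 2))

# Hypothetical data, not taken from the course materials.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
ss_x = sum((x - mean_x) ** 2 for x in xs)
ss_y = sum((y - mean_y) ** 2 for y in ys)
sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
b = sp_xy / ss_x
a = mean_y - b * mean_x
r = sp_xy / math.sqrt(ss_x * ss_y)

s_y = statistics.stdev(ys)  # sample standard deviation (n - 1 denominator)
s_yx = std_error_of_estimate(s_y, r, n)

# Cross-check: root mean squared residual with n - 2 degrees of freedom.
sse = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
s_yx_direct = math.sqrt(sse / (n - 2))

print(f"s_yx from formula:   {s_yx:.4f}")
print(f"s_yx from residuals: {s_yx_direct:.4f}")
```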
Against this background we can establish probabilities regarding scores away from the predicted score. We do this with z-scores formed by taking a test y value less the predicted y value, and dividing that difference by the standard error of estimate: $z = (y - \hat{y}) / s_{yx}$. We then use these z-scores and a normal distribution to establish probabilities.
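A minimal sketch of this z-score step, using the standard library's `statistics.NormalDist`. The function name `prob_at_or_above` and the input values (a test score of 80, a predicted score of 71.66, a standard error of estimate of 8.22) are chosen purely for illustration.

```python
from statistics import NormalDist

def prob_at_or_above(y, y_pred, s_yx):
    """P(score >= y) under a normal conditional distribution
    centered on the predicted score with standard deviation s_yx."""
    z = (y - y_pred) / s_yx
    return 1.0 - NormalDist().cdf(z)

# Illustrative inputs only.
p = prob_at_or_above(y=80.0, y_pred=71.66, s_yx=8.22)
print(f"P(y >= 80) is about {p:.3f}")
```

This relies on the normality and homoscedasticity assumptions described above; if the conditional distributions are skewed or have unequal spread, the probability is only a rough guide.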
When used for confidence intervals, we have to take into account how far we are from the centroid (the ordered pair (mean x, mean y)), so we use the standard error of the predicted score, $s_{\hat{y}} = s_{yx} \sqrt{1 + (1/n) + (x - \bar{x})^2 / SS_x}$, where $SS_x = (n - 1) s_x^2$. Since we typically have a small number of observations, the t-distribution should be used with n - 2 degrees of freedom. The confidence interval then is:
$CI = \hat{y} \pm t_{cv}\, s_{\hat{y}}$
Note that this increase in the standard deviation of the conditional distribution is also indicative of our being better able to predict a distribution mean than an individual score within that distribution.
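The interval calculation can be sketched in Python as follows. This is an illustration under stated assumptions: the data are made up, the function name `prediction_interval` is hypothetical, and the critical value is passed in by the caller (here 3.182, the two-tailed 95% t value for 3 degrees of freedom) rather than computed. The standard error of estimate is obtained from the residuals, which is algebraically equivalent to the formula in terms of r.

```python
import math

def prediction_interval(x0, xs, ys, t_cv):
    """Return (lower, upper) for the score predicted at x0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sp_xy / ss_x
    a = mean_y - b * mean_x
    # Standard error of estimate from the residuals, n - 2 df.
    sse = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
    s_yx = math.sqrt(sse / (n - 2))
    # Standard error of the predicted score grows with distance from mean x.
    s_pred = s_yx * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / ss_x)
    y_hat = b * x0 + a
    return y_hat - t_cv * s_pred, y_hat + t_cv * s_pred

# Hypothetical data; n = 5, so n - 2 = 3 degrees of freedom.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
t_cv = 3.182  # 95%, two-tailed, 3 df (from a t table)

lo_mid, hi_mid = prediction_interval(3, xs, ys, t_cv)  # at the centroid
lo_far, hi_far = prediction_interval(5, xs, ys, t_cv)  # away from it
print(f"x = 3: ({lo_mid:.2f}, {hi_mid:.2f})")
print(f"x = 5: ({lo_far:.2f}, {hi_far:.2f})")
```

The interval at x = 5 comes out wider than the one at x = 3, which reflects the distance-from-centroid term in the standard error of the predicted score.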
Example: Using homework 6, we can put a confidence interval around our predicted score.
Solution: Since n = 9, there are seven degrees of freedom and our $t_{cv} = 2.365$ for a 95%, two-tailed, confidence interval. We calculate our standard error for the predicted score of $48.8 + 2.54 \cdot 9 = 71.66$. We see that $s_{yx} = 20.92 \cdot 0.368 \cdot 1.069 = 8.22$ and $s_{\hat{y}} = 8.22 \cdot 1.086 = 8.93$. Thus $71.66 - 2.365 \cdot 8.93$ to $71.66 + 2.365 \cdot 8.93$, or (50.54, 92.78), is our 95% confidence interval. Not what you would call reassuring!
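The arithmetic in this solution can be retraced with a few lines of Python. The sketch below takes the intermediate values exactly as stated in the solution (8.22 and the factor 1.086) rather than recomputing them from the homework data, which is not reproduced here; the variable names are illustrative.

```python
# Values as stated in the worked solution.
t_cv = 2.365                 # 95%, two-tailed, 7 degrees of freedom
y_hat = 48.8 + 2.54 * 9      # predicted score
s_yx = 8.22                  # standard error of estimate
distance_factor = 1.086      # sqrt(1 + 1/n + (x - mean x)^2 / SS_x)

# Standard error of the predicted score, rounded as in the text.
s_pred = round(s_yx * distance_factor, 2)

margin = t_cv * s_pred
lower = y_hat - margin
upper = y_hat + margin
print(f"predicted score = {y_hat:.2f}, s_pred = {s_pred:.2f}")
print(f"95% interval: ({lower:.2f}, {upper:.2f})")
```

Carrying more decimal places in the intermediate steps would shift the endpoints by a few hundredths, so small discrepancies from hand calculation are expected.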
Rounding errors may slightly change the results, but that is the only difference between these two ways of checking for the significance of the regression. Note again the population correlation coefficient is often termed rho.