
Applied Statistics - Lesson 12

Testing/Using Linear Regression

Lesson Overview

Confidence Interval for Predicted Scores

We developed correlation and linear regression earlier as ways of expressing the strength of the linear relationship between two variables, commonly x and y. Although a specific value of either variable can be used to predict the other, we generally treat x as our independent or input variable and y as our dependent or output variable. One calls this the regression of y on x. Regression does not have to be linear, nor does it have to use only one input variable, but fitting a higher order polynomial (or sinusoid, etc.) or doing multiple regression (more than one input variable) is deferred to a later course.

The regression equation was of the form y = bx + a, where b is the slope or regression coefficient and a is the intercept or regression constant. The symbols used vary and are commonly reversed as well! The correlation strength, for quantitative (interval or ratio) data, was expressed in terms of r, the Pearson product moment correlation coefficient, with magnitudes near 1 indicating a strong correlation and values near zero indicating little or no correlation. When the regression equation slope is positive, the regression is said to be positive and the correlation coefficient will be positive as well. When the slope is negative, the regression is said to be negative and the correlation coefficient will be negative as well. When the slope is zero (b = 0), the correlation coefficient will be zero (r = 0). r² is called the coefficient of determination and is a measure of the variation in y explained by the variation in x.
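For concreteness, here is a minimal Python sketch (not part of the lesson; the data are hypothetical) showing how the slope b, intercept a, and Pearson r could be computed from paired data:

    import numpy as np

    # hypothetical paired data (x is the input variable, y the output)
    x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
    y = np.array([52, 55, 57, 60, 66, 64, 70, 71, 73], dtype=float)
    n = len(x)

    # slope b, intercept a, and Pearson r from the usual least-squares formulas
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    a = y.mean() - b * x.mean()
    r = np.corrcoef(x, y)[0, 1]
    print(b, a, r, r**2)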

Unless there is perfect correlation (|r| = 1), there will be some scatter of the data about the best fit regression line. Ideally, the distributions of the data about the regression line, the conditional distributions, will be normal distributions with their means on the regression line and standard error of estimate: s_y•x = s_y • sqrt(1 - r²) • sqrt((n - 1)/(n - 2)). In addition, ideally, these conditional distributions will have equal variances (hence equal standard deviations). This assumption is called homoscedasticity.
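Continuing the hypothetical sketch above, the standard error of estimate could be computed directly from s_y, r, and n:

    # standard error of estimate: s_y.x = s_y * sqrt(1 - r^2) * sqrt((n-1)/(n-2))
    s_y = y.std(ddof=1)                 # sample standard deviation of y
    s_yx = s_y * np.sqrt(1 - r**2) * np.sqrt((n - 1) / (n - 2))
    print(round(s_yx, 3))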

Against this background we can establish probabilities regarding scores away from the predicted score. We do this with z-scores formed by taking an observed y value less the predicted value ŷ, and dividing that difference by the standard error of the estimate: z = (y - ŷ)/s_y•x. We then use these z-scores and a normal distribution to establish probabilities.
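As an illustration (continuing the hypothetical sketch above, and treating the conditional distribution as normal with mean ŷ and standard deviation s_y•x), the probability of observing a score above some test value at a given x could be found as follows:

    from scipy.stats import norm

    x_new = 7.0                       # hypothetical input value
    y_hat = b * x_new + a             # predicted score at x_new
    y_test = 70.0                     # hypothetical test value
    z = (y_test - y_hat) / s_yx       # z-score within the conditional distribution
    print(round(1 - norm.cdf(z), 4))  # P(y > y_test | x = x_new)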

When used for confidence intervals, we have to take into account how far we are from the centroid (the ordered pair (x̄, ȳ), i.e., (mean x, mean y)), and the standard error of the predicted score is s_ŷ = s_y•x • sqrt(1 + 1/n + (x - x̄)²/SS_x), where SS_x = (n - 1)s_x². Since we typically have a small number of observations, the t-distribution should be used with n - 2 degrees of freedom. The confidence interval then is:
CI = ŷ ± t_cv•s_ŷ,
where ŷ is the predicted score and s_ŷ is the standard error of the predicted score given above.
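A sketch of this interval calculation, continuing the hypothetical data above (scipy's t distribution supplies the critical value):

    from scipy.stats import t as t_dist

    SS_x = (n - 1) * x.var(ddof=1)                  # SS_x = (n - 1) * s_x^2
    s_pred = s_yx * np.sqrt(1 + 1/n + (x_new - x.mean())**2 / SS_x)
    t_cv = t_dist.ppf(0.975, df=n - 2)              # 95%, two-tailed critical value
    print(y_hat - t_cv * s_pred, y_hat + t_cv * s_pred)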

Note that this increase in the standard deviation of the conditional distribution also indicates that we are better able to predict a distribution mean than an individual score within that distribution.

Example: Using homework 6 we can put a confidence interval around our predicted score.
Solution: Since n = 9, there are seven degrees of freedom and t_cv = 2.365 for a 95%, two-tailed confidence interval. The predicted score is 48.8 + 2.54•9 = 71.66. We see that s_y•x = 20.92•0.368•1.069 = 8.22 and s_ŷ = 8.22•1.086 = 8.93. Thus 71.66 - 2.365•8.93 to 71.66 + 2.365•8.93, or (50.54, 92.78), is our 95% confidence interval. Not what you would call reassuring!
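The arithmetic above can be checked with a short sketch; the values 48.8, 2.54, 20.92, 0.368, 1.069, and 1.086 are simply the homework-6 quantities quoted in the solution, taken as given:

    from scipy.stats import t as t_dist

    n = 9
    y_hat = 48.8 + 2.54 * 9                # predicted score: 71.66
    s_yx = 20.92 * 0.368 * 1.069           # standard error of estimate, about 8.2
    s_pred = s_yx * 1.086                  # standard error of the predicted score
    t_cv = t_dist.ppf(0.975, df=n - 2)     # 2.365 for 7 degrees of freedom
    print(y_hat - t_cv * s_pred, y_hat + t_cv * s_pred)   # roughly (50.5, 92.8)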

Testing Significance of r

When r = 0, b = 0, and the predicted y value is always the mean y value. It is thus common to test the correlation coefficient for how different from 0 it might be and thus how statistically significant it is. Step 1: The population slope or regression coefficient is often represented by β, so our null hypothesis is H0: β = 0 and the alternative hypothesis is Ha: β ≠ 0. The sampling distribution of the regression coefficient is a t-distribution with n - 2 degrees of freedom, and the standard deviation of this distribution, s_b = s_y•x/sqrt(SS_x), is called the standard error of the regression coefficient. Step 2: t = (b - β)/s_b. For steps 3 and 4 you would need a specific slope and degrees of freedom to compute the test statistic and interpret the results.
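A minimal sketch of this slope test with hypothetical data; scipy.stats.linregress reports the slope, its standard error, and the two-tailed p-value for H0: β = 0 directly:

    import numpy as np
    from scipy.stats import linregress

    x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)           # hypothetical data
    y = np.array([52, 55, 57, 60, 66, 64, 70, 71, 73], dtype=float)

    res = linregress(x, y)
    t_stat = res.slope / res.stderr        # t = (b - 0) / s_b, with n - 2 df
    print(res.slope, res.stderr, t_stat, res.pvalue)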

slope=0 vs. rho=0

Instead of testing the slope for zero (b = 0) one can test the correlation coefficient for zero (r = 0). In lesson 9 we introduced the test statistic
t = r•sqrt((n - 2)/(1 - r²)).

Rounding errors may slightly change the results, but that is the only difference between these two ways of checking for the significance of the regression. Note again that the population correlation coefficient is often denoted rho (ρ).
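A quick numerical check of that equivalence, again with hypothetical data:

    import numpy as np
    from scipy.stats import linregress

    x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)           # hypothetical data
    y = np.array([52, 55, 57, 60, 66, 64, 70, 71, 73], dtype=float)
    n, r = len(x), np.corrcoef(x, y)[0, 1]

    t_from_r = r * np.sqrt((n - 2) / (1 - r**2))      # test statistic based on r
    res = linregress(x, y)
    t_from_slope = res.slope / res.stderr             # test statistic based on b
    print(round(t_from_r, 4), round(t_from_slope, 4)) # agree apart from rounding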
