Back to the Table of Contents

Applied Statistics - Lesson 10

Hypothesis Testing: Two sample means

Lesson Overview

Often one wants to compare two treatments or populations and determine if there is a difference. This can be done either with or without matching. Matching produces dependence between the two samples and will be discussed after the unmatched/independent case.

The various assumptions made in this situation primarily affect the appropriate number of degrees of freedom to be used. This can range from one less than the smaller sample size to two less than the sum of the sample sizes, with various values in between possible. The choice reflects the conservativeness of the researcher and the care taken in evaluating the underlying assumptions. There seems to be conflict among various sources which I have yet to resolve.

Unmatched (independent) Two-sample t Test

Two assumptions are used: two independent simple random samples from two distinct populations (matching would negate independence); and both populations are normally distributed with unknown means and standard deviations. Our null hypothesis would look like H0: µ1 = µ2, or we might want to give a confidence interval for the difference µ1 - µ2. We use the sample means and standard deviations to estimate the unknown parameters. Although the statistic [x bar]1 - [x bar]2 has a normal distribution in terms of the combined population variance, when we substitute the sample variances we do not obtain an exact t distribution. Nonetheless, we do use the t distribution for hypothesis testing in this case. The two-sample t statistic is as follows:
t = (([x bar]1 - [x bar]2) - (µ1 - µ2)) ÷ sqrt(s1²/n1 + s2²/n2)

The expression in the denominator reflects the way variances sum (standard deviations do not sum). There are two options for obtaining a value for the degrees of freedom: calculate a fractional degrees of freedom as given below, or use the smaller of n1-1 and n2-1. The latter choice always gives a conservative result, and as sample size increases it also becomes more accurate. The two-sample t procedures are more robust than the one-sample methods, especially when the distributions are not symmetric. If the two sample sizes are equal and the two distributions have similar shapes, the procedure can be accurate down to sample sizes as small as n1 = n2 = 5. The two-sample t procedure is most robust against nonnormality when the two samples are of equal size, so when planning such a study you should make them equal.

The fractional degrees of freedom formula is as follows:
d.f. = (s1²/n1 + s2²/n2)² ÷ ((s1²/n1)² ÷ (n1-1) + (s2²/n2)² ÷ (n2-1))
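As a sketch of the two formulas above, the t statistic and the fractional degrees of freedom can be computed directly from summary statistics. The sample means, variances, and sizes here are made-up values for illustration only.

```python
import math

def welch_t_and_df(mean1, var1, n1, mean2, var2, n2):
    """Two-sample t statistic (for H0: mu1 = mu2) and the
    fractional (Satterthwaite) degrees of freedom."""
    se2_1 = var1 / n1                    # s1^2/n1
    se2_2 = var2 / n2                    # s2^2/n2
    t = (mean1 - mean2) / math.sqrt(se2_1 + se2_2)
    df = (se2_1 + se2_2) ** 2 / (se2_1 ** 2 / (n1 - 1) + se2_2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical summary data for two groups of 10 and 12 subjects.
t, df = welch_t_and_df(52.1, 25.0, 10, 48.3, 16.0, 12)
print(round(t, 2), round(df, 1))   # t ≈ 1.94, fractional df ≈ 17.2
```

Note that the fractional value (about 17.2 here) lies between the conservative choice, the smaller of n1-1 and n2-1 (9 here), and the maximum n1 + n2 - 2 (20 here).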

Suppose instead of two distinct populations we randomly select our sample and then randomly assign half the subjects to an experimental (treatment) group and the other half to a control group. In this case and others, the population variances are equal and the estimated standard error of the difference used in the formula above, sqrt(s1²/n1 + s2²/n2), simplifies to sqrt(s²(1/n1 + 1/n2)). Here s² is the pooled estimate of the population variance, which comes from the sum of the sums of squares for the two groups divided by n1 + n2 - 2, the degrees of freedom in this case. It can also be obtained from the two sample variances as s² = ((n1 - 1)s1² + (n2 - 1)s2²) ÷ (n1 + n2 - 2).
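Under the equal-variance assumption, the pooled calculation can be sketched the same way (again with made-up summary values):

```python
import math

def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Two-sample t using the pooled variance estimate;
    assumes equal population variances."""
    df = n1 + n2 - 2
    s2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / df    # pooled variance
    se = math.sqrt(s2 * (1 / n1 + 1 / n2))           # SE of the difference
    return (mean1 - mean2) / se, df

# Hypothetical treatment/control summary data.
t, df = pooled_t(52.1, 25.0, 10, 48.3, 16.0, 12)
print(round(t, 2), df)   # t ≈ 1.98 with 20 degrees of freedom
```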

Note: pooling assumes equal variances for the two populations. If in doubt, a statistical test should be performed.

Confidence intervals are constructed in the usual way, using the standard error of the difference between the means just as we used the standard error of the mean before.
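As a small sketch, a confidence interval for µ1 - µ2 just multiplies the appropriate t table value by the estimated standard error of the difference. The mean difference, standard error, and degrees of freedom below are made-up values; 2.120 is the standard t table entry for 95% confidence with 16 degrees of freedom.

```python
# Hypothetical example: observed difference in sample means and its
# estimated standard error, with n1 + n2 - 2 = 16 degrees of freedom.
mean_diff = 4.0     # [x bar]1 - [x bar]2
se_diff = 1.5       # estimated standard error of the difference
t_crit = 2.120      # t table value, 95% confidence, df = 16

lower = mean_diff - t_crit * se_diff
upper = mean_diff + t_crit * se_diff
print(f"95% CI for the difference: ({lower:.2f}, {upper:.2f})")
```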

Effect Size

The effect size is defined as the degree to which a phenomenon exists. Consideration of this value can help researchers differentiate between statistical significance and practical importance. Effect size is given in units of standard deviations, so it is calculated much like a z or t score (with the sample standard deviation, not the standard error of the mean). If the measured difference is over ¼[sigma] but less than ½[sigma], many would say the effect is small. If the measured difference is over 1[sigma], many would say the effect is large. Researchers have been known to go to great lengths to inflate their error bars (margin of error) so that the true value is almost certain to be within 1.96[sigma], instead of only 95% confident.
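A common standardized measure of this kind divides the difference in means by a pooled sample standard deviation (often called Cohen's d). A minimal sketch, with made-up summary values:

```python
import math

def effect_size(mean1, var1, n1, mean2, var2, n2):
    """Standardized mean difference: divide by the pooled sample
    standard deviation, not the standard error of the mean."""
    s2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(s2)

d = effect_size(52.1, 25.0, 10, 48.3, 16.0, 12)
print(round(d, 2))   # a difference of about 0.85 standard deviations
```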

Effect size gives an alternate indication of the magnitude of a difference to help distinguish between statistical significance and practical importance when the sample size can muddy the waters. Some groups have recommended the reporting of effect size for published research.

Testing Variance Homogeneity

Oftentimes a test for variance homogeneity between the two populations is performed, and if this assumption proves unreasonable, adjustments in the test procedure for equal means may be indicated. A newer test, Levene's test, has replaced the simpler F Max test in some packages. The F Max test is based on the F distribution usually associated with ANOVA and hence will be deferred to a course covering that material.

Note also that the estimated standard error of the difference and the fractional degrees of freedom given above are the same as those given by Hinkle for use when the population variances are unequal. (Hinkle gives a source (Satterthwaite, 1946) for the fractional degrees of freedom formula.)

Matched Pair Test

Comparative studies are more convincing than single sample investigations, so one-sample inference testing is less common. A common design compares two treatments, either before and after on the same subjects, or by randomly picking one member of each matched pair for treatment. In a before-and-after situation, one might say the test subject is his/her own control. We can also say the data are correlated. In such a matched pair design, we apply the one-sample t procedures to the observed differences. Our null hypothesis would be that the mean of these differences is zero, and our alternative hypothesis would be that it is not (two-tailed) or that it is positive/negative (one-tailed).

An example might be before and after SAT scores for a high-priced course of study. Or your typical freshman practice EXPO project where peas, corn, or other seeds are grown with and without (control) a treatment. Some Biology instructors and EXPO judges have expected our freshmen to perform these t tests!

Note: Hinkle uses different symbols for the dependent situation than for the independent situation to emphasize the difference.

Suppose a teacher wonders if there is a statistical difference between two pages of a test after noting similar means and standard deviations for the pages and decides to do a matched pair test.
Page 3   Page 4   diff     d²
  20       17       3       9
  23       21       2       4
  17       27     -10     100
  21       17       4      16
  17       17       0       0
  16       19      -3       9
   9       19     -10     100
  11       11       0       0
  13       13       0       0
 ---      ---     ---     ---
 147      161     -14     238
n=9, so 147/9 = 16.3 is the mean for page 3 and 161/9 = 17.9 is the mean for page 4, giving a mean difference of -14/9 = -1.56. The estimated standard error of the difference is sqrt((238 - 196/9)/8) ÷ 3 = 5.20/3 = 1.73, which results in a t statistic of -1.56/1.73 = -0.90. So we fail to reject the null hypothesis of no difference at any reasonable significance level. A difference might be caused by different content (Chapter 6 vs Chapter 7), different testing methods (essay and computational vs multiple choice and true/false), or student study patterns, but we can't identify any difference, let alone ascribe a cause to it.
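The arithmetic of this example can be checked with a short script; the differences are taken straight from the table above.

```python
import math

# Differences (page 3 minus page 4) from the table above.
diffs = [3, 2, -10, 4, 0, -3, -10, 0, 0]
n = len(diffs)                                         # 9 matched pairs

mean_d = sum(diffs) / n                                # -14/9, about -1.56
ss = sum(d * d for d in diffs) - sum(diffs) ** 2 / n   # 238 - 196/9
sd = math.sqrt(ss / (n - 1))                           # about 5.20
se = sd / math.sqrt(n)                                 # about 1.73
t = mean_d / se                                        # about -0.90
print(round(mean_d, 2), round(se, 2), round(t, 2))
```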

HT two-sample, other statistics

Hinkle Chapter 12 deals with testing for equal correlation coefficients in two independent samples. The Fisher z transformation is again necessary. It also deals with testing for equal proportions, both in two independent samples and in two dependent samples. It concludes with a follow-up on the assumption of homogeneity of variance for the test of no mean difference in two independent samples. This can be important if pooling for the variance is to be done.