Average most often refers to the arithmetic mean, but is actually ambiguous and may also be used to refer to the mode, median, or midrange.
You should always clarify which average is being used, preferably by using a more specific term. Averages give us information about a typical element of a data set. They are measures of central tendency.
Mean most often refers to the arithmetic mean, but is also ambiguous. Unless specified otherwise, we will assume arithmetic mean whenever the term mean is used.
The Arithmetic Mean is obtained by summing all elements of the data set and dividing by the number of elements.
Symbolically, the arithmetic mean is expressed as \bar{x} = \frac{\sum x_{i}}{n}, where \bar{x} (pronounced "x-bar") is the arithmetic mean for a sample and Σ is the capital Greek letter sigma and indicates summation. x_{i} refers to each element of the data set as i ranges from 1 to n. n is the number of elements in the data set. The equation is essentially the same for finding a population mean; however, the symbol for the population mean is the small Greek letter µ (mu). Roman letters usually represent sample statistics, whereas Greek letters usually represent population parameters.
When arithmetic means are combined for different groups, we must take into account the possibly disparate number of data elements in each group and weight the means accordingly.
Example: Suppose there are 10 freshmen boys and 20 freshmen girls. Suppose further that the boys' test average was 72.5 and the girls' test average was 73.7. Find the overall average (arithmetic mean). Solution: (10×72.5+20×73.7)/30=73.3.
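The weighting in this example can be sketched in a few lines of Python. The function name `combined_mean` is only an illustrative choice, not something from the course materials.

```python
def combined_mean(groups):
    """Combine group means, weighting each by its group size.

    groups is a list of (group size, group mean) pairs.
    """
    total_count = sum(size for size, _ in groups)
    weighted_sum = sum(size * mean for size, mean in groups)
    return weighted_sum / total_count


# 10 boys averaging 72.5 and 20 girls averaging 73.7
overall = combined_mean([(10, 72.5), (20, 73.7)])
print(round(overall, 1))  # 73.3
```

Simply averaging 72.5 and 73.7 would give 73.1, which ignores the fact that there are twice as many girls as boys.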
The arithmetic mean has two important properties which make it the most frequently used measure of central tendency: 1) the sum of all deviations from the mean is zero; and 2) the sum of the squares of the deviations from the mean is minimized. Deviation here refers to the directed distance (i.e. plus or minus sign included) a given score is from the mean.
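Both properties can be checked numerically. The sketch below uses a small hypothetical data set (not one from the lesson) and compares the sum of squared deviations about the mean with the same sum about two other candidate centers.

```python
def sum_of_squared_deviations(data, center):
    """Sum of squared directed distances from an arbitrary center."""
    return sum((x - center) ** 2 for x in data)


data = [2, 4, 4, 5, 10]          # hypothetical scores
mean = sum(data) / len(data)     # 5.0

# Property 1: directed deviations from the mean cancel out.
deviations = [x - mean for x in data]
print(sum(deviations))           # 0.0

# Property 2: the mean gives a smaller sum of squares than other centers.
for center in (4.0, mean, 6.0):
    print(center, sum_of_squared_deviations(data, center))
```

Checking two alternative centers is only an illustration of the second property, not a proof of it.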
Sample Size is the number of elements in a sample. It is referred to by the symbol n.
Be sure to use a lower case n for sample size. An upper case N refers to Population Size, unless being used in the context of a normally distributed population.
Mode is the data element which occurs most frequently.
A useful mnemonic is to alliterate the words mode and most. Alliterations start with the same sound like: "seven slippery slimy snakes...".
A data set with only one mode is termed unimodal. Some data sets contain no repeated elements. In this case, there is no mode (or the mode is the empty set). It is also possible for two or more [nonadjacent] elements to be repeated with the same frequency. In these cases, there are two or more modes and the data set is said to be bimodal or multimodal. In the rare instance of a uniform or nearly uniform distribution, one where each element is repeated the same or nearly the same number of times, one could term it multimodal, but some authors invoke subjectivity by specifying multimodality only when separate, distinct, and fairly high peaks (ignoring fluctuations due to randomness) occur.
For binned data, such as occurs with a frequency table, the interval which contains the most items is the modal interval and the midpoint of this interval is considered the mode. The mode is rather unsophisticated, tends to provide little information, and does not readily lend itself to mathematical manipulation. It thus has limited value except when there are a large number of scores and it can help describe the distribution or when used for nominal variables.
The Median is the middle element when the data set is arranged in order of magnitude.
A useful mnemonic is to remember that the median is the grassy strip (in the rural area of the midwest where I come from) that divides opposing lanes in a highway. It is in the middle.
If there are an odd number of data elements, the median is a member of the data set. If there are an even number of data elements, the median is computed as the arithmetic mean of the middle two.
The median has other names, such as P_{50}, which will be discussed below. The Hinkle textbook uses the symbol Mdn for median.
The Midrange is the arithmetic mean of the highest and lowest data elements.
Midrange is a type of average. Range is a measure of dispersion and will be discussed below. A common mistake is to confuse the two. Symbolically, midrange is computed as (x_{max}+x_{min})/2
Some basic facts regarding averages are as follows.
The midrange and possibly the median are the arithmetic mean of two data set elements. One additional significant digit may be necessary to accurately convey this information.
The number of significant digits for the mean should conform to one of the following rules.
Presenting more than five significant digits is probably a joke and points will be deducted!
Consider the question "What is the average of the data set 1, 1, 2, 4, 7?" As we have seen in this lecture, this is a rather ambiguous question and the answers 1 (mode), 2 (median), 3.0 (mean), and 4.0 (midrange) are all possible and correct!
A sample of size 5 (n=5) is taken of student quiz scores with the following results: 1, 7, 8, 9, 10.
The mean is (1+7+8+9+10)/5 = 35/5 = 7.0 (note one more decimal place is given).
All scores occur only once, hence there is no mode. The median score is 8 (not 8.0). The midrange is (10+1)/2 = 5.5 (note the extra decimal place is required).
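The four averages for this sample can be computed as sketched below. `mean` and `median` come from Python's standard `statistics` module; the helpers `modes` and `midrange` are illustrative names written for this sketch, and `modes` follows the convention above of reporting no mode when no element repeats.

```python
from collections import Counter
from statistics import mean, median


def modes(data):
    """Return every most-frequent element, or [] when no element repeats."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(value for value, count in counts.items() if count == highest)


def midrange(data):
    """Arithmetic mean of the highest and lowest data elements."""
    return (max(data) + min(data)) / 2


quiz = [1, 7, 8, 9, 10]
print(mean(quiz), median(quiz), modes(quiz), midrange(quiz))

# The data set 1, 1, 2, 4, 7 has four different "averages".
ambiguous = [1, 1, 2, 4, 7]
print(modes(ambiguous), median(ambiguous), mean(ambiguous), midrange(ambiguous))
```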
An extreme score (1) distorts the mean so perhaps the median is a better measure of central tendency. For a larger data set, this could be further defined in terms of skewness (median and generally mean to the left of (negatively skewed), right of (positively skewed), or same as (zero skewness) the mode) and symmetry of the data set. It is more common to be positively skewed, since exceptionally large values are easier to obtain due to lower limits.
Range is the difference between the highest and lowest data element.
Symbolically, range is computed as x_{max}-x_{min}. Although this is very similar to the formula for midrange, please do not make the common mistake of reversing the two. This is not a reliable measure of dispersion, since it only uses two values from the data set. Thus, extreme values can distort the range to be very large while most of the elements may actually be very close together. For example, the range for the data set 1, 1, 2, 4, 7 introduced earlier would be 7-1=6.
Recently it has come to my attention that a few books define statistical range the same as its more mathematical usage. I've seen this both in grade school and college textbooks. Thus instead of being a single number it is the interval over which the data occurs. Such books would state the range as [x_{min},x_{max}] or x_{min} to x_{max}. Thus for the example above, the range would be from 1 to 7 or [1,7]. Be sure you do not say 1-7 since this could be interpreted as -6.
Hinkle defines range as (Highest score - Lowest score) + 1, where the +1 ensures that both extreme values are included. Although he notes the definition given above, this +1 definition is the one used throughout his book. The appropriateness of this modification increases as the level of measurement decreases.
Sample standard deviation: s = \sqrt{\frac{\sum (\bar{x} - x_{i})^{2}}{n-1}}
Notice the difference between the sample and population standard deviations. The sample standard deviation uses n-1 in the denominator, hence is slightly larger than the population standard deviation, which uses N (which is often written as n).
Population standard deviation: \sigma = \sqrt{\frac{\sum (\mu - x_{i})^{2}}{N}}
It is much easier to remember and apply these formulae if you understand what all the parts are for. We have already discussed the use of Roman vs. Greek letters for sample statistics vs. population parameters. This is why s is used for the sample standard deviation and σ (sigma) is used for the population standard deviation. However, another sigma, the capital one (Σ), appears inside the formula. It serves to indicate that we are adding things up. What is added up are the deviations from the mean: \bar{x} - x_{i}. But the average deviation from the mean is actually zero, by definition of the mean! Occasionally the mean deviation, which averages the distances |\bar{x} - x_{i}| (the absolute values of the deviations), is used instead. However, a better measure of variation comes from squaring each deviation, summing those squares, then taking the square root after dividing by one less than the number of data elements. This is very similar to a quadratic mean. The n-1 can be understood in terms of degrees of freedom, a topic we will have to cover for inferential statistics.
Another formula for standard deviation is also commonly encountered. It is as follows.
Shortcut formula for standard deviation: s = \sqrt{\frac{n \sum x_{i}^{2} - (\sum x_{i})^{2}}{n(n-1)}}
This formula can be algebraically derived from the former and has two primary applications. First, calculators and computer programs often employ it because fewer intermediate results are necessary and it can be calculated in one pass through the data set. That is, you don't have to calculate the mean first and then find the deviations. Second, it is closely related to a formula which may be used to calculate the standard deviation for a frequency table. In general, the formulae are not used and we rely instead on calculators or computers.
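The difference between the two approaches can be sketched as follows, assuming the shortcut takes its common textbook form s = sqrt((n Σx² - (Σx)²) / (n(n-1))). The function names are illustrative only.

```python
import math


def stdev_two_pass(data):
    """Definition formula: find the mean first, then the deviations."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))


def stdev_one_pass(data):
    """Shortcut formula: accumulate n, sum of x, and sum of x squared in one pass."""
    n = 0
    total = 0.0
    total_of_squares = 0.0
    for x in data:
        n += 1
        total += x
        total_of_squares += x * x
    return math.sqrt((n * total_of_squares - total ** 2) / (n * (n - 1)))


data = [1, 1, 2, 4, 7]
print(stdev_two_pass(data), stdev_one_pass(data))  # both are sqrt(6.5)
```

One caution: in floating-point arithmetic the shortcut subtracts two large, nearly equal numbers when the data values are large relative to their spread, so it can lose precision where the two-pass version does not.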
Occasionally, the abbreviations SD for standard deviation and Var for variance will be seen.
A commonly given rule of thumb is that the range of a data set is approximately 4 standard deviations (4s). Thus the maximum data element will be about 2 standard deviations above the mean and the minimum data element about 2 standard deviations below the mean.
It is rather meaningless to calculate the standard deviation for a data set of two elements.
Three is considered the smallest sample size where standard deviation is meaningful.
It is not uncommon for an experiment to involve millions of events and associated data. If you examine the standard deviation formula above, you will note that it depends inversely on the square root of n. We could thus expect to reduce the standard deviation of our answer by perhaps a thousandfold. It is the goal of many experiments to obtain very precise values, so great care is exercised to reduce systematic errors and also reduce the effect of random errors by increasing the repetitions.
Example:
Consider a simple example of counting pennies where the outcomes
99, 100 and 101 are obtained. Find the mean and standard deviation.
Solution: We can easily calculate the mean as 100
and the standard deviation as 1.0.
Example:
Consider further if this exercise were repeated 1000 times and 100
was obtained 991 times, 99 5 times and 101 4 times.
Again, calculate the mean and standard deviation.
Solution:
The mean is now 99.999 and the standard deviation is now 0.095.
Here the additional precision is justified and the mean and
standard deviation are given to the same 3 decimal place precision.
It would be a mistake to report these results to only one more digit
than the original data set, as in 100.0 and 0.1.
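Both penny-counting examples can be reproduced from a table of outcomes and their frequencies. This is a minimal sketch using the sample (n-1) formula; the function name is an illustrative choice.

```python
import math


def mean_and_stdev_from_counts(counts):
    """Mean and sample standard deviation from {outcome: frequency}."""
    n = sum(counts.values())
    mean = sum(value * freq for value, freq in counts.items()) / n
    sum_sq = sum(freq * (value - mean) ** 2 for value, freq in counts.items())
    return mean, math.sqrt(sum_sq / (n - 1))


# Three counts: 99, 100 and 101 once each
m3, s3 = mean_and_stdev_from_counts({99: 1, 100: 1, 101: 1})
print(round(m3, 1), round(s3, 1))          # 100.0 1.0

# 1000 counts: 100 obtained 991 times, 99 five times, 101 four times
m1000, s1000 = mean_and_stdev_from_counts({99: 5, 100: 991, 101: 4})
print(round(m1000, 3), round(s1000, 3))    # 99.999 0.095
```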
DO NOT USE a rounded s to obtain s^{2}. Variance is the primary statistic; s is a derived quantity.
Standard deviation should be reported to at least one more decimal place than the data, or three significant digits.
A typical example might be ACT and SAT scores. ACT scores range from 1 to 36 with a national mean of about 21.0 and standard deviation of about 4.7. SAT scores range from 200 to 800 (for each subtest) with a national mean of about 508 and standard deviation of about 111. Both ACTs and SATs appear to be approximately normally distributed. High school students often take both, perhaps several times and those from a particular school would represent a sample. This sample would have its own mean and standard deviation, but of course, these would be statistics, not parameters. (Our Math and Science Center students average about 1050 (total) when they take the SAT their eighth grade year and average over 1300 (total) when they take it their junior year. Our average ACT score (junior) is about 29.) The formulae used for z-score appear in two virtually identical forms, recognizing the fact that we may be dealing with sample statistics or population parameters. These formulae are as follows.
z-score formulae: z = \frac{x - \bar{x}}{s} (sample) or z = \frac{x - \mu}{\sigma} (population)
Negative z-scores indicate a data element's position below the mean.
Positive z-scores indicate a data element's position above the mean.
z-scores should always be rounded to two decimal places.
IQs of 0 and 210 will be discussed in lesson 4 and z-scores of -6.67 and 7.33 should be obtained respectively, based on a population mean of 100 and a standard deviation of 15.
The population does not have to be normally distributed to calculate z-scores, but that is one of its primary applications.
In summary, z-scores provide a useful measurement for comparing data elements from different data sets.
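A z-score is the distance of a data element from the mean, measured in standard deviations: (x - mean) / standard deviation. The sketch below reproduces the IQ values quoted above and then compares two test scores from different scales. The individual scores of 29 (ACT) and 650 (SAT subtest) are hypothetical; the means and standard deviations are the national figures given earlier.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean, to two decimals."""
    return round((x - mean) / sd, 2)


# IQs of 0 and 210 with population mean 100 and standard deviation 15
print(z_score(0, 100, 15), z_score(210, 100, 15))   # -6.67 7.33

# Hypothetical student: ACT 29 versus SAT subtest 650
act_z = z_score(29, 21.0, 4.7)
sat_z = z_score(650, 508, 111)
print(act_z, sat_z)
```

The larger z-score marks the score that is further above its own national mean, which is how data elements from different data sets can be compared.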
Data elements more than 2 standard deviations away from the mean are termed unusual.
Data elements less than 2 standard deviations away from the mean are termed ordinary.
As you will recall, in a normally distributed population, 95% of the data
will then be ordinary, so only 5% can be unusual. Chebyshev's theorem guarantees
at least 75% of the data to be ordinary, so no more than 25% can be unusual.
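The 75% figure comes from Chebyshev's bound, 1 - 1/k^2, evaluated at k = 2. The sketch below computes that bound and labels a data element as unusual or ordinary. The definitions above do not say how to treat an element exactly 2 standard deviations from the mean; calling it ordinary here is an assumption of this sketch.

```python
def chebyshev_min_fraction(k):
    """Minimum fraction of any data set within k standard deviations (k > 1)."""
    return 1 - 1 / k ** 2


def classify(x, mean, sd):
    """Label x as 'unusual' if more than 2 standard deviations from the mean."""
    z = (x - mean) / sd
    return "unusual" if abs(z) > 2 else "ordinary"


print(chebyshev_min_fraction(2))    # 0.75
print(classify(110, 100, 15))       # ordinary
print(classify(210, 100, 15))       # unusual
```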
Note first how the median divides a population into two halves: a top half and a bottom half.
The top half consists of those data elements above the median, whereas the bottom half consists of those data elements below the median. If we subdivide each of these halves yet again, we have quartered the population and each of these division points is termed a quartile. Although one might occasionally speak of the bottom quartile, top quartile, etc., the term quartile technically refers to the three division points and not to the four divisions of the data.
Q_{1} is the term used for the median of the bottom half.
Q_{3} is the term used for the median of the top half.
Q_{2} is another term used for the median.
The precise definition specifies that at least 25% of the data will be less than or equal to Q_{1} and at least 75% of the data will be less than or equal to Q_{3}. The terms upper (right) and lower (left) hinge are noted below and some software packages may not clearly differentiate between hinges and quartiles. All these measures of position assume the data is quantitative and can be put in numeric order.
Data are ranked when arranged in [numeric] order.
Since range is sensitive to outliers (defined below), sometimes the interquartile range is calculated. This range is the difference between the third and first quartiles: Q_{3}-Q_{1}. It is another measure of dispersion. Other common terms include: semi-interquartile range, (Q_{3}-Q_{1})/2, another measure of dispersion, and midquartile or (Q_{1}+Q_{3})/2, which is a measure of central tendency (an average).
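These three quantities follow directly from the two quartiles. As a small sketch, the values Q_{1} = 1 and Q_{3} = 6 are borrowed from the outlier example later in this lesson; the function name is illustrative.

```python
def quartile_summaries(q1, q3):
    """Dispersion and central-tendency measures built from Q1 and Q3."""
    return {
        "interquartile_range": q3 - q1,
        "semi_interquartile_range": (q3 - q1) / 2,
        "midquartile": (q1 + q3) / 2,
    }


summary = quartile_summaries(1, 6)
print(summary)
```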
The upper hinge is the median of the upper half of all scores, including the median.
The lower hinge is the median of the lower half of all scores, including the median.
Outliers are extreme values in a data set. Sometimes the term outlier is applied to unusual values as defined above (Triola, 5th edition). More recently, outliers are defined in terms of the hinges or quartiles. Outliers are often differentiated as mild or extreme as defined below. The interquartile range or perhaps D = upper hinge - lower hinge is used. Generally, an outlier should be obvious and not borderline—right next to another element, but lying just outside some arbitrary line of demarcation.
Mild outliers are 1.5D to 3D beyond the corresponding hinge.
Hinkle terms these demarcation points 1.5 IQR beyond a hinge/quartile reasonable upper boundary (RUB) and reasonable lower boundary (RLB). In a modified box plot (discussed below), the RUB and RLB replace the max and min, respectively, and outliers are noted with dots.
Extreme outliers are beyond the corresponding hinge by more than 3D.
Consider as an example the data set: {0, 2, 4, 5, 6, 3, 6, 1, 1, 50}.
Obviously, 50 is a much larger number than any of the other elements.
This outlier will cause the mean and variance to be much higher.
Specifically, without 50, the mean is 3.1 and standard deviation 2.3,
whereas with 50, the mean is 7.8 and standard deviation 15.0.
Note that the quartiles are 1 and 6, whereas the hinges are 1.5 and 5.5
for the unmodified data set.
For any of these definitions, 50 is way away from the other data and is an outlier.
Outliers might be legitimate data values or errors.
This 50 might really have been 5.0 and was miscoded
(historically, punch card input was column sensitive) or poorly
recorded in a lab book, with the decimal point extremely light or missing.
50 may also represent extreme extra credit on a 5 point quiz!
It is not unusual to be tempted to omit such data values.
It is not considered a good practice, but if such are omitted,
be sure to clearly record that fact.
You will have just crossed the line between objective and subjective science.
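The figures in this example can be reproduced as sketched below. Two assumptions are made here: the quartiles are taken as the medians of the bottom and top halves (excluding the median itself when n is odd), and D is taken as Q_{3} - Q_{1} rather than the hinge spread. Software packages that interpolate differently may report other quartile values.

```python
from statistics import mean, median, stdev


def quartiles(data):
    """Q1 and Q3 as medians of the bottom and top halves of the sorted data."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    return median(lower), median(upper)


def classify_outlier(x, q1, q3):
    """Mild: 1.5D to 3D beyond a quartile; extreme: more than 3D beyond."""
    d = q3 - q1
    beyond = max(q1 - x, x - q3)   # distance outside the quartiles
    if beyond > 3 * d:
        return "extreme outlier"
    if beyond >= 1.5 * d:
        return "mild outlier"
    return "not an outlier"


data = [0, 2, 4, 5, 6, 3, 6, 1, 1, 50]
without_50 = [x for x in data if x != 50]

print(round(mean(without_50), 1), round(stdev(without_50), 1))  # 3.1 2.3
print(round(mean(data), 1), round(stdev(data), 1))              # 7.8 15.0

q1, q3 = quartiles(data)
print(q1, q3, classify_outlier(50, q1, q3))
```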
D_{5} is another name for the median.
Psychologists and counselors frequently provide norm-referenced interpretations of scores for personality inventories and achievement tests. This typically means correlating a given score with a given percentile.
P_{50} is yet another term for median.
Other equivalents, such as P_{25}=Q_{1},
P_{75}=Q_{3},
P_{10}=D_{1}, etc.,
should also be obvious.
Once again, the term percentile technically refers to the
99 division points, but is not uncommonly used to refer to the 100 divisions.
For large data sets, one can calculate the locator L to
help find a requested percentile. It is computed as follows.
Percentile Locator Formula: L = \frac{k}{100} \cdot n
k is the percentile being sought and n, of course, is the number of elements in our data set. Usual conventions dictate that once L is obtained, it must be checked to see if it is a whole number. If it is a whole number, the value of P_{k} is the mean of the Lth data element and the next higher data element. If it is not a whole number, L must be rounded up to the next larger whole number. The value of P_{k} is then the Lth data element, counting from the lowest. There is an essential difference between rounding up and rounding off. For an L between 3 and 3.5, rounding off gives 3, whereas rounding up gives 4. Hinkle gives a different formula which is applicable when the data is binned. Since percentiles are ordinal, a limited number of statistical operations are appropriate for them.
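The locator convention can be sketched as follows, assuming the locator is L = (k/100)·n with k a whole number from 1 to 99. The function name `percentile` is illustrative; integer arithmetic (`divmod`) is used so the whole-number check is exact.

```python
def percentile(data, k):
    """P_k by the locator convention: L = (k/100) * n, for integer k in 1..99."""
    if not 1 <= k <= 99:
        raise ValueError("k must be a whole number from 1 to 99")
    ordered = sorted(data)
    n = len(ordered)
    whole, remainder = divmod(k * n, 100)   # L = whole + remainder/100
    if remainder == 0:
        # L is a whole number: average the Lth and (L+1)th elements.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # L is not whole: round UP to whole + 1, which is index `whole` (0-based).
    return ordered[whole]


data = [0, 2, 4, 5, 6, 3, 6, 1, 1, 50]
print(percentile(data, 25), percentile(data, 50), percentile(data, 75))
```

For this ten-element data set, L is 2.5 for P_{25} (rounded up to 3), 5 for P_{50} (so the 5th and 6th elements are averaged), and 7.5 for P_{75} (rounded up to 8).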
The percentile rank of a score is a point on the percentile scale that gives the percent of scores at or below the specified score. When percentiles and scores are graphed in a cumulative frequency polygon or ogive, one can read a score on one axis and find its percentile rank on the other, or read a percentile on one axis and find the corresponding score on the other.
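For raw (unbinned) data, the "percent of scores at or below" definition translates directly into code. This is a minimal sketch; the function name is illustrative, and binned data would need a different formula.

```python
def percentile_rank(data, score):
    """Percent of data elements at or below the given score."""
    at_or_below = sum(1 for x in data if x <= score)
    return 100 * at_or_below / len(data)


quiz = [1, 7, 8, 9, 10]
print(percentile_rank(quiz, 8))    # 60.0
print(percentile_rank(quiz, 10))   # 100.0
```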
There is no such thing as P_{100}.
The whiskers extend either to 1.5 interquartile ranges above and below the quartiles or to the minimum and maximum values. The former is termed a modified box plot and will have outliers individually plotted via a symbol of your choice. Note that Hinkle presents box plots vertically while many other authors use a horizontal approach.