
An Introduction to Statistics - Lesson 8

Summarizing and Presenting Data

Lesson Overview

There is a wide variety of ways to summarize and present data. Most of the common methods are summarized here, along with the usual conventions and terms for each.

Frequency Tables

A frequency table lists in one column the data categories or classes and
in another column the corresponding frequencies.

A common way to summarize or present data is with a standard frequency table, as seen in the salary data in problem one in homework 6. Frequency refers to the number of times each category occurs in the original data. Another example, showing the current distribution of Center students by grade, follows (first table below).
Grade             Frequency
9 (freshmen)      30
10 (sophomores)   26
11 (juniors)      28
12 (seniors)      20

Test Score   Frequency
0 - 19       2
20 - 39      11
40 - 59      9
60 - 79      11
80 - 99      8
100 - 119    7
120 - 139    2

Often, the category column will have continuous data and hence be presented via a range of values. In such a case, the terms used to identify the class limits, class boundaries, class widths, and class marks must be well understood. For the following examples, use the test score table above (1998 Algebra Diagnostic score distribution).

Class limits are the smallest and largest numbers which can actually belong to each class.

For this example, the class limits are the values displayed in the left column of the test score table. For the highest class they are 120 and 139. Each class has a lower class limit and an upper class limit.

Class boundaries are the numbers which separate classes.
They are equally spaced halfway between neighboring class limits.

For this example, the boundaries would be -0.5, 19.5, 39.5, 59.5, 79.5, 99.5, 119.5, and 139.5. Note that 19.49999... is another name for and identical with 19.50000....

Class marks are the midpoints of the classes.

For this example, the class marks are 9.5, 29.5, 49.5, .... It may be necessary to utilize class marks to find the mean and standard deviation, etc. of data summarized in a frequency table.
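
The boundaries and class marks quoted above can be checked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and the standard deviation uses the sample (n - 1) formula, which is an assumption since the lesson does not say which version is intended.

```python
# Class limits and frequencies from the test score table.
lower_limits = [0, 20, 40, 60, 80, 100, 120]
upper_limits = [19, 39, 59, 79, 99, 119, 139]
frequencies = [2, 11, 9, 11, 8, 7, 2]

# Gap between one class's upper limit and the next class's lower limit
# (1 here, because the scores are whole numbers).
gap = lower_limits[1] - upper_limits[0]

# Boundaries sit halfway between neighboring class limits.
boundaries = [lower_limits[0] - gap / 2] + [hi + gap / 2 for hi in upper_limits]

# Class marks are the midpoints of the classes.
marks = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

# Estimate the mean and standard deviation by treating every score
# in a class as if it were equal to the class mark.
n = sum(frequencies)
grouped_mean = sum(f * x for f, x in zip(frequencies, marks)) / n
grouped_var = sum(f * (x - grouped_mean) ** 2 for f, x in zip(frequencies, marks)) / (n - 1)
grouped_sd = grouped_var ** 0.5  # sample formula (n - 1) is an assumption

print(boundaries)
print(marks)
print(grouped_mean, grouped_sd)
```

Because the original scores are replaced by class marks, the mean and standard deviation found this way are estimates rather than exact values.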

Class width is the difference between two consecutive class boundaries (or between corresponding class limits of neighboring classes).

For this example, the class width is 20.0. Following are guidelines for constructing frequency tables.

  1. The classes must be "mutually exclusive"—no element can belong to more than one class.
  2. Even if the frequency is zero, include each and every class.
  3. Make all classes the same width. (However, open ended classes may be inevitable.)
  4. Target between 5 and 20 classes, depending on the range and number of data points.
  5. Keep the limits as simple and as convenient as possible (for example, lower limits that are multiples of the class width).

If your limits are not immediately obvious based on the data, try to find an appropriate width by rounding up the range divided by the number of classes. Your lower limit should be either the lowest score, or a convenient value slightly less. Avoid irrelevant decimal places. Large data sets justify having more classes. One published guide (often called Sturges' rule) is: number of classes = 1 + log2(n), where n is the number of data elements and the result is rounded to the nearest whole number. This gives you 5 classes for small data sets of 12 to 22 elements and 10 classes for larger data sets of 363 to 724 elements. The seven classes used above for 50 elements are right on target. It is not uncommon to omit empty classes; be alert for such guideline violations! Omitted classes do not change the class width, but can be a real source of confusion!
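
The class-count guide and the "round up the range divided by the number of classes" step can be sketched in Python as follows. The function names are hypothetical, and the lowest and highest scores passed in (0 and 139) are assumed values chosen to match the table, since the raw scores are not given in this lesson.

```python
import math

def suggested_class_count(n):
    """Number of classes from the guide 1 + log2(n), rounded to the nearest integer."""
    return round(1 + math.log2(n))

def suggested_class_width(lowest, highest, class_count):
    """Range divided by the number of classes, rounded up to a whole number."""
    return math.ceil((highest - lowest) / class_count)

classes_for_50 = suggested_class_count(50)
# Assumed extremes of 0 and 139; the actual lowest and highest scores are not listed.
width_for_scores = suggested_class_width(0, 139, classes_for_50)

print(classes_for_50, width_for_scores)
```

Treat the result as a starting point only; a nearby width that gives simpler limits is usually the better choice.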

Relative frequency tables contain the relative frequency instead of absolute frequency.

Relative frequencies can be expressed either as percentages or their decimal fraction equivalents.

Cumulative frequency tables contain running totals: each entry is the sum of the frequencies for that class and all preceding classes.

In a cumulative frequency table, the words less than usually also appear in the left column.
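
As an illustration, both kinds of table can be built from the test score frequencies in a few lines of Python. This is a minimal sketch; the "less than" labels use the next class's lower limit, which is one common convention and an assumption here.

```python
from itertools import accumulate

lower_limits = [0, 20, 40, 60, 80, 100, 120]
frequencies = [2, 11, 9, 11, 8, 7, 2]
class_width = 20

total = sum(frequencies)

# Relative frequency: each frequency divided by the total count.
relative = [f / total for f in frequencies]

# Cumulative frequency: running total of the frequencies.
cumulative = list(accumulate(frequencies))

# Labels such as "less than 20", "less than 40", ...
labels = [f"less than {lo + class_width}" for lo in lower_limits]

for label, rel, cum in zip(labels, relative, cumulative):
    print(f"{label:<14} {rel:6.0%} {cum:4d}")
```

The relative frequencies should always add to 1 (or 100%), and the last cumulative frequency should equal the total number of data elements.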

Histograms

The term histogram comes from the Greek words meaning web and write. As such it is a way to untangle data. Another name for a histogram is a bar graph or bar chart, although some texts differentiate between the two, using a bar graph for categorical data and allowing gaps between its bars. In a histogram the vertical axis shows the frequency, while the horizontal axis shows the class intervals. No gaps are allowed between the bars. The shape of the distribution (normal, skewed left, or skewed right) should be fairly obvious from the graph. Histograms are quite commonly used to display frequency and relative frequency tables visually. Illustrated below are a bar graph and the accompanying TI-83+ settings for the US presidential inauguration data.

[bar graph of inauguration data]           [window settings 42-x-69]           [stat plot 1 on L1 barchart]

A relative frequency histogram has the same shape and horizontal scale as a histogram, but the vertical scale is now the relative frequency.
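
To see why the two histograms have the same shape, here is a rough text rendering of the test score data in Python. The helper `text_histogram` is a made-up name, and the bars are drawn sideways with `#` characters purely for convenience; a real histogram would have vertical bars that touch.

```python
classes = ["0-19", "20-39", "40-59", "60-79", "80-99", "100-119", "120-139"]
frequencies = [2, 11, 9, 11, 8, 7, 2]

def text_histogram(labels, heights, scale=1):
    """Return one text row per class; bar length is round(height * scale)."""
    label_width = max(len(label) for label in labels)
    return [
        f"{label:>{label_width}} | {'#' * round(height * scale)}"
        for label, height in zip(labels, heights)
    ]

total = sum(frequencies)
relative = [f / total for f in frequencies]

# Same bars either way: only the units on the frequency axis change.
frequency_rows = text_histogram(classes, frequencies)
relative_rows = text_histogram(classes, relative, scale=total)

print("\n".join(frequency_rows))
```

Dividing every frequency by the same total rescales the frequency axis without changing the relative heights of the bars.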

A Pareto chart is a bar graph for qualitative data.

The bars in a Pareto chart should be arranged in descending order of frequency, from left to right.
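
The ordering step is just a sort. The short Python sketch below uses the grade frequencies from the first table, treating grade as a category purely for illustration.

```python
grade_counts = {
    "9 (freshmen)": 30,
    "10 (sophomores)": 26,
    "11 (juniors)": 28,
    "12 (seniors)": 20,
}

# Arrange the categories from highest to lowest frequency.
pareto_order = sorted(grade_counts.items(), key=lambda item: item[1], reverse=True)

for grade, count in pareto_order:
    print(f"{grade:<16} {count}")
```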

Frequency polygons are similar to histograms, but use line segments to connect the points.

When constructing a frequency polygon, the class marks should be used on the horizontal scale. The graph should also be extended to the left and right so that it begins and ends with a frequency of 0.

Cumulative frequency polygons, also known as ogives, are also commonly encountered.
The line in an ogive (pronounced "oh-jive") will always have nonnegative slope.
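
The points to plot for both graphs can be listed with a little Python. This sketch uses the test score table; plotting the ogive at the upper class boundaries (and starting it at the lowest boundary with a count of 0) is a common convention assumed here rather than something stated in the lesson.

```python
from itertools import accumulate

lower_limits = [0, 20, 40, 60, 80, 100, 120]
upper_limits = [19, 39, 59, 79, 99, 119, 139]
frequencies = [2, 11, 9, 11, 8, 7, 2]
width = lower_limits[1] - lower_limits[0]

# Frequency polygon: (class mark, frequency), padded with one empty
# class on each side so the graph begins and ends at a frequency of 0.
marks = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]
polygon = [(marks[0] - width, 0)] + list(zip(marks, frequencies)) + [(marks[-1] + width, 0)]

# Ogive: (upper class boundary, cumulative frequency).
upper_boundaries = [hi + 0.5 for hi in upper_limits]  # 0.5 assumes whole-number scores
ogive = [(lower_limits[0] - 0.5, 0)] + list(zip(upper_boundaries, accumulate(frequencies)))

print(polygon)
print(ogive)
```

Since frequencies are never negative, the cumulative totals can only stay level or rise, which is why the ogive never slopes downward.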

Pie Charts, etc.

Pie charts (circle graphs) are a common way to understandably display the relative proportions of the various data elements. This is most commonly used on unranked or qualitative data. If this is done by hand, you should use a protractor to accurately measure your angles. Remember that there are 360° in a full circle. Use proportions to convert frequencies to angles: %/100 = degrees/360°.
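
The proportion above amounts to multiplying each relative frequency by 360°. A quick Python check using the grade table (variable names are illustrative only):

```python
grade_counts = {
    "9 (freshmen)": 30,
    "10 (sophomores)": 26,
    "11 (juniors)": 28,
    "12 (seniors)": 20,
}

total = sum(grade_counts.values())

# frequency / total = degrees / 360, so degrees = 360 * frequency / total.
angles = {grade: 360 * count / total for grade, count in grade_counts.items()}

for grade, angle in angles.items():
    print(f"{grade:<16} {angle:6.1f} degrees")
```

The angles should add to 360°, which is a handy check before reaching for the protractor.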

A pictograph depicts data by using pictures of an object, such as coins, money bags, airplanes, etc. Those which use multiple objects of the same size are fine. Those which use similar objects, scaled linearly to represent data, can easily distort things: doubling an object's height also doubles its width, so its area quadruples. There may be many other variations, but those listed above are most common.

Exploratory Data Analysis

A recent trend in statistics has been the use of exploratory data analysis. It is a fundamentally different approach. Historically, statistics were used to confirm final conclusions about data. Some very important assumptions were made, calculations were complex, and graphs were often unnecessary. The modern emphasis has been more on exploring data, trying to simplify the way the data are described, and gaining deeper insights into their nature. Few assumptions are made, and both the calculations and the graphs are simple. The following plot types are modern in their approach.

Stem-and-Leaf Diagrams

A stem-and-leaf diagram has the advantage of retaining the data in its original form, but providing a visual representation. Illustrated below is the US Presidential Inauguration data. In this case, the stem, the tens portion of the president's age, is given on the left, and the leaf, the units portion of the president's age, is given on the right.
4 | 23667899
5 | 0111112244444555566677778
6 | 0111244589

Please note that the separation line should be one continuous vertical line; it is only approximated here with separate | characters. The following rules should be observed when constructing stem-and-leaf diagrams.

  1. The leaves on the right should be in increasing (or decreasing) order, left to right.
  2. No commas should appear on the right.
  3. If the stem/leaf break occurs at a decimal point, put the decimal point to the left with the stem.
  4. If the leaf is double or triple digit, etc., leave a [half] space between each entry.
  5. There should be at least five but no more than twenty rows.
  6. If a range is used for the stem, an asterisk (*) may be used to separate the corresponding leaves.
Reformatting the diagram above with more rows (some books call this splitting the stem) makes its roughly normal shape even more apparent. Notice how the stem-and-leaf diagram is also somewhat like a histogram, but turned on its side. Normally, data are rounded before being put into such a diagram, but ages, for whatever reason, usually get truncated!
4 | 23
4 | 667899
5 | 0111112244444
5 | 555566677778
6 | 0111244
6 | 589
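
Both diagrams above can be reproduced with a short Python sketch. The ages are read back off the diagram itself, the function name `stem_and_leaf` is made up for illustration, and the code assumes two-digit whole numbers; it also simply skips any stem with no leaves.

```python
from collections import defaultdict

# Ages read off the stem-and-leaf diagram above.
ages = [42, 43, 46, 46, 47, 48, 49, 49,
        50, 51, 51, 51, 51, 51, 52, 52, 54, 54, 54, 54, 54,
        55, 55, 55, 55, 56, 56, 56, 57, 57, 57, 57, 58,
        60, 61, 61, 61, 62, 64, 64, 65, 68, 69]

def stem_and_leaf(values, split=False):
    """Return the rows of a stem-and-leaf diagram for two-digit integers.

    With split=True each stem gets two rows: leaves 0-4, then leaves 5-9.
    """
    rows = defaultdict(list)
    for value in sorted(values):          # sorting keeps the leaves in increasing order
        stem, leaf = divmod(value, 10)    # tens digit is the stem, units digit the leaf
        half = 1 if (split and leaf >= 5) else 0
        rows[(stem, half)].append(str(leaf))
    return [f"{stem} | {''.join(leaves)}" for (stem, half), leaves in sorted(rows.items())]

print("\n".join(stem_and_leaf(ages)))
print()
print("\n".join(stem_and_leaf(ages, split=True)))
```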

Boxplots

A boxplot or box and whiskers plot is a visual representation of the 5-number summary. The diagram is a quick way to spot skewed data. Illustrated below is a boxplot from the TI-83+ graphing calculator, along with the window and other settings for the US Presidential Inauguration data.

[box plot of inauguration data]           [window settings 35-x-99]           [stat plot 1 on L1 boxplot]

Please note that you can press TRACE and obtain the 5-number summary of: 42, 51?, 55.5, 58?, 69. The whiskers extend either to the most extreme data values lying within 1.5 times the interquartile range below the first quartile and above the third quartile, or from the minimum to the maximum values. The former is termed a modified box plot and will have outliers individually plotted via a symbol of your choice. They also can be traced. You may want to try the data given in the previous lesson illustrating outliers.
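
The 5-number summary and the 1.5 interquartile range rule can be sketched in Python as below. Two assumptions are worth flagging: the ages are reconstructed from the stem-and-leaf diagram in this lesson rather than taken from the calculator list, and the quartiles are computed as medians of the lower and upper halves with the overall median left out when the count is odd. Quartile conventions vary, so the results may differ slightly from the values quoted above.

```python
from statistics import median

# Ages rebuilt from the stem-and-leaf diagram (stem = tens, leaf = units).
leaves = {4: "23667899", 5: "0111112244444555566677778", 6: "0111244589"}
ages = sorted(10 * stem + int(leaf) for stem, row in leaves.items() for leaf in row)

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum).

    Q1 and Q3 are the medians of the lower and upper halves; when the
    count is odd, the overall median belongs to neither half (assumed convention).
    """
    data = sorted(data)
    half = len(data) // 2
    lower, upper = data[:half], data[-half:]
    return data[0], median(lower), median(data), median(upper), data[-1]

minimum, q1, med, q3, maximum = five_number_summary(ages)

# Modified boxplot: values beyond 1.5 IQR from the quartiles are outliers.
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in ages if x < low_fence or x > high_fence]

# Whiskers stop at the most extreme values that are not outliers.
inside = [x for x in ages if low_fence <= x <= high_fence]
whiskers = (inside[0], inside[-1])

print(five_number_summary(ages))
print(low_fence, high_fence, outliers, whiskers)
```

In a regular boxplot the whiskers would simply run from the minimum to the maximum; the modified version shortens them and plots any outliers as separate points.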
