On the Histograms

Introduction
As a teacher of junior maths and Maths in Society, I used to think that a histogram was a rather trivial statistical object, sort of a bar graph with the gaps removed to save space. I never realised that statisticians actually find histograms useful!

A modern data-centred approach to statistics starts with viewing the data in a variety of ways. What is meant by viewing the data? Features of interest to a statistician are the overall shape of the data, symmetry, the location and the spread, the existence of outliers, and evidence of clusters or gaps. A histogram with a scale on the horizontal axis is generally useful for showing all of these features, though for some distributions the features of a dataset can be disguised or distorted by a particular choice of bin width.

One application of the humble histogram is determining whether a set of data is approximately normally distributed, though a histogram is most effective for this purpose if the dataset is large. Normality is a pre-condition for certain analyses of data, including many hypothesis tests. While there are formal tests of normality, often a quick look at a histogram of the data is sufficient. And no statistician would rely strictly on formal tests without also viewing the data.

The STEPS modules are a collection of hypertext-based tutorials covering a wide range of statistics topics, including the graphical display of data. Visit the STEPS page for further information and a list of the modules available.

Histograms and Stemplots Compared

A histogram shows much the same information as a stemplot, though for a given dataset one or the other of these methods of displaying the data may be preferable. Some points to note:

Matching Histograms and Boxplots

Students will improve their ability to interpret the information given in a boxplot by matching boxplots of sample data drawn from different distributions with their associated histograms. Students will also improve their ability to visualise the shape of a distribution given the summary statistics.

Bin Width

Statistics computer programs and graphical calculators will generate a default histogram if the bin width or the number of bins is not specified. Interestingly, there is no clear winner among the algorithms used for choosing the number of bins or the bin width. The article How Wide Is Your Bin? contains an interesting thread (i.e. a discussion topic) from the Ed-Stats mailing list.

The Density Trace

The Histogram and Stemplot Compared

A histogram is an alternative to a stemplot for displaying data. A stemplot is restricted by our number system to certain bin widths; a histogram is under no such restriction. However, with a histogram you usually lose the actual data values, and constructing one by hand is a tedious process.

When constructing a histogram by hand, a decision about bin sizes and the number of bins has to be made when tabulating the data. A poor decision can result in a histogram that either gives misleading information about the data or fails to inform the viewer about some aspect of the data. A computer is of value here, as a variety of histograms, each with a different bin width, can be constructed. Which histogram is preferred depends upon which aspects of a dataset are to be featured.

Beware the Humble Histogram!

Ideally a histogram should show the shape of the distribution of the data. For some datasets, the choice of bin width can have a profound effect on how the histogram displays the data. To see this for yourself, have a look at the Histogram Applet, from R. Webster West, Dept. of Statistics, Univ. of South Carolina (you will need a Java-enabled browser to see the applet). It is a histogram of the interruption time (i.e. time between eruptions) of the Old Faithful Geyser in Wyoming, USA. Slide the bar to change the bin width, and watch how that affects the shape of the histogram. Will you ever trust a histogram again?

As most classrooms don't have Internet access on tap, the Word document Old Unfaithful contains a series of histograms of the interruption time of the Old Faithful geyser. The series nicely shows the effect of bin size on the appearance of the histogram.
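
For classrooms without the applet or the Word document, the same effect can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the waiting times below are made-up values with two clusters (they are not the Old Faithful measurements), and the function names `bin_counts` and `text_histogram`, along with the starting value of 40, are chosen here for the sketch.

```python
import math

# Made-up waiting times in minutes with two clusters.
# These are NOT the real Old Faithful data, just an illustration.
short_waits = [50, 52, 53, 54, 54, 55, 55, 56, 57, 58, 60]
long_waits = [72, 74, 75, 76, 77, 78, 78, 79,
              80, 80, 81, 82, 83, 84, 86, 88]
data = short_waits + long_waits


def bin_counts(values, width, start):
    """Count the values falling in each bin [start + k*width, start + (k+1)*width)."""
    n_bins = int(math.floor((max(values) - start) / width)) + 1
    counts = [0] * n_bins
    for x in values:
        counts[int((x - start) // width)] += 1
    return counts


def text_histogram(values, width, start):
    """Return a plain-text histogram, one row of asterisks per bin."""
    rows = []
    for k, count in enumerate(bin_counts(values, width, start)):
        low = start + k * width
        rows.append(f"{low:>3}-{low + width:<3} {'*' * count}")
    return "\n".join(rows)


# Same data, same starting point, three different bin widths.
for width in (10, 20, 50):
    print(f"Bin width {width}:")
    print(text_histogram(data, width, 40))
    print()
```

With a bin width of 10 the two clusters and the gap between them are easy to see. With a bin width of 20 the three bars are nearly the same height, so the data look roughly flat, and with a bin width of 50 everything lands in a single bar. Students could change the starting value as well as the width to see that both choices matter.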