From the
Exploring Data website - http://curriculum.qed.qld.gov.au/kla/eda/
© Education Queensland, 1997
Normally I Wouldn't Reveal the Plot...
I don't know why they pick on light bulbs. Somewhere in almost all of our traditional textbooks are exercises that start, "Assume the lifetime of light bulbs is normally distributed ...", and then go on to do some statistics using the formula for z-scores. There are a few things wrong with these exercises. First, they reduce statistics to plugging numbers into formulas, which is one of the most boring and meaningless things one can do in statistics. Possibly more important, the assumption is incorrect! The distribution of the lifetimes of light bulbs is NOT normal, or even approximately normal. The article Light Bulbs and Car Batteries follows an interesting thread on this topic from the EdStat-l mailing list.
Given a set of real data, the student (or the statistician) may want to know whether the dataset is at least approximately normal, since this informs the decision about which tests of inference are applicable. The NCSS 97 Help File nicely summarises why normality may be desired.
What are the procedures that help determine normality? A number of normality tests are commonly available in statistics packages, but they are outside the scope of an introductory course and hence won't be discussed in this article. And even with an understanding of these numerical tests, a statistician would still want to see the data. Often a histogram of the data is constructed, and normality is checked by eye. This gives a feel for whether the data is approximately bell-shaped, especially if the dataset is fairly large. But the most common display for testing for normality is the normal plot, sometimes called the quantile plot. Bob Hayden pointed out in an email to the ap-stat mailing list that the normal plot provides information that other plots do not:
"[Other graphical displays] do a good job of telling you whether the distribution is "bell shaped". Unfortunately, that is NOT important. ..... What IS important is
- symmetry
- lack of outliers
- weight of the tails
The theory is that the last of these is much more evident in a normal plot than in [other displays]."
The Normal Plot Revealed
The normal plot is essentially a scatterplot of the actual data values plotted against the ideal values from a normal distribution. The nearer the points of the scatterplot lie to a straight line, the more the distribution resembles a normal distribution. A normal plot constructed by a computer statistics program usually consists of the scatterplot of values, the line along which the points should ideally lie, and a graphical display of the confidence limits, the values between which almost all of the points should lie.
Here are two examples. The plot on the left is of a sample taken from a normal population while the plot on the right is from a sample drawn from a uniform distribution.
[Figure: two normal plots, as described above.]
The tails of the normal plot of the uniformly distributed data deviate markedly from the straight line. The NCSS 97 Help File succinctly summarises the different ways that the plots may deviate from a straight line. Since this display can be created for distributions other than the normal distribution, the help file uses the more general term 'probability plot' in place of 'normal plot'.
If the points in the probability plot all fall along a straight line, you can assume that the data follow the probability distribution. At least, the actual distribution is well approximated by the distribution you have plotted. We will briefly discuss the types of patterns that usually coincide with departures from the straightness of this line.
Outliers
Outliers are values that do not follow the pattern of the body of the data. They show up as extreme points at either end of a probability plot. Since large outliers will severely distort most statistical analyses, you should investigate them closely. If they are errors or one-time occurrences, they should be removed from your analysis. Once outliers have been removed, the probability plot should be redrawn without them.
Long Tails
Occasionally, a few points on both ends will stray from the line. These points appear to follow a pattern, just not the pattern of the rest of the data. Usually, the points at the top of the line will shoot up, while the points at the bottom of the line will fall below the line. This is caused by a data distribution with longer tails than would be expected under the theoretical distribution (e.g. normal) being considered. Data with longer tails may cause problems with some statistical procedures.
Asymmetry
If the probability plot has a convex or concave curve to it (rather than a straight line), the data are skewed to one side of the mean or the other. This can usually be corrected by using an appropriate power transformation.
Plateaus and Gaps
Clustering in the data shows up on the probability plot as gaps and plateaus (horizontal runs of points). This may be caused by the granularity of the data. For example, if the variable may only take on five values, the plot will exhibit these patterns. When these patterns occur, you should be sure you know the reason for them. Is it because of the discrete nature of the data, or are the clusters caused by a second variable that was not considered?
Warning / Caution
Probability plots are a very useful tool in data analysis. A few words of caution are in order:
- These plots emphasise problems that may occur in the tails of the distribution, not in the middle (since there are so many points clumped together there).
- The natural variation in the data will cause some departure from straightness.
- Since the plot only considers one variable at a time, any relationships it might have with other variables are ignored.
- Confidence limits displayed on the plot are only approximate. Also, they depend heavily on a reasonable sample size. For samples of under twenty points, these limits should be taken with a (large) grain of salt. Also, you can change the limits a great deal by changing the confidence level (the alpha value). Be sure that the value you are using is reasonable.
The Normal Plot and the Histogram Compared
This email by Robert Dawson provides a neat explanation as to why the normal plot is better than a histogram for judging normality:
It's not easy to tell if a distribution is normal (or even approximately so) by looking at a histogram. Try this: use a stats package to generate 30 (or more) "data" with a Cauchy distribution and plot them. You'll probably think they look pretty close to normal. [If you can't get Cauchy with what you're using, generate pairs of normally distributed random numbers & use the quotient of each pair.]
But the Cauchy distribution is very far from normal, to the point where most standard stats techniques don't work on it. Among other oddities, it hasn't even got a mean in the usual sense - so the central limit theorem doesn't apply, and sample means don't converge as sample size increases.
The Cauchy distribution is a little extreme perhaps. You could also try Student's T with two degrees of freedom. This has a mean, but no standard deviation - again, trouble looking for a spot to happen if you tried to use standard methods on such a distribution, as the sample standard deviation would be more or less garbage. It looks even more like a normal distribution in raw histogram form.
Now do a normal scores plot and bang! they're visibly non-normal. No question about it.
In the other direction, my feeling is that with small data sets the average person is more likely to conclude falsely that they *don't* come from a normal distribution. A smallish sample from a normal distribution is probably more likely to have a reasonably straight normal scores plot than to look 'normal' to the unaided eye.
The normal scores plot presents data in a way that makes it easier for the human eye to make a good judgement about (approximate) normality.
Constructing a Normal Plot
Bob King describes how to construct a normal plot:
Here's a simple data set that my notes say came from the Minitab Reference Manual, release 10.5. (I don't have the manual here.)
X: .1, .9, 1.1, 1.8, 2.3
Refer to these data points, in order, as xi with i = 1,2,3,4,5.
Now sketch a picture of the normal curve. If the X values are normally distributed, you might expect them to occur at, say, the 10th, 30th, 50th, 70th, and 90th percentiles (i.e., at the z-values with cumulative probabilities of .1, .3, .5, .7, and .9). Use a normal table or the TI-83 to look up the z-values for these percentiles. I get about -1.28, -.52, 0, .52, and 1.28.
Construct, by hand, an ordinary "x-y plot" of the five points (X,z). That's a normal probability plot.
One last matter... For a data set of size n = 5, as given above, think about a formula that produces the percentiles in this example:
i:   1    2    3    4    5
j:  .1   .3   .5   .7   .9
A little thought shows that j = (i - .5)/n. Various groups have chosen slightly different formulas for j, yielding slightly different normal scores for the plot. According to my notes, Minitab uses (i - 3/8)/(n + 1/4) and Data Desk uses (i - 1/3)/(n + 1/3). Each of these choices yields slightly different cumulative probabilities. Here are the three sets of normal scores:

                     i=1     i=2     i=3     i=4     i=5
my example above:  -1.28   -0.52    0.00    0.52    1.28
Minitab:           -1.18   -0.50    0.00    0.50    1.18
Data Desk:         -1.15   -0.49    0.00    0.49    1.15
But I think you will see that each of these three choices gives essentially the same plot. Hope I've got that right, and that it helps...
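King's three plotting-position formulas can be checked numerically; here is a sketch using Python's standard library (the helper function and its name are mine, and the scores are rounded to two places as in his table):

```python
from statistics import NormalDist

def normal_scores(n, offset, denom_shift):
    """Normal scores z_i for plotting positions
    j_i = (i - offset) / (n + denom_shift), rounded to 2 places."""
    nd = NormalDist()
    return [round(nd.inv_cdf((i - offset) / (n + denom_shift)), 2)
            for i in range(1, n + 1)]

basic = normal_scores(5, 0.5, 0)          # j = (i - 1/2) / n
minitab = normal_scores(5, 3 / 8, 1 / 4)  # j = (i - 3/8) / (n + 1/4)
datadesk = normal_scores(5, 1 / 3, 1 / 3) # j = (i - 1/3) / (n + 1/3)
# All three sets differ only slightly, so each yields essentially
# the same straight-line pattern when plotted against the data.
```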
A TI-82 Program to Construct a Normal Plot
The TI-83 has built-in normal probability plot capabilities; just read the manual to find out how to construct one. Peter Blaskiewicz has shared a program which produces a normal plot on the TI-82 graphing calculator:
It assumes that your data is in L1. The data is copied to L5 and sorted, just in case you still need to preserve the order of the information in L1. The normal quantile scores are then stored in L6. The last few lines turn on a scatterplot for L5 and L6. (You may use other list names instead, and for the scatterplot, you may use the mark of your choice, of course.) Desk-checking this, the median gets a normal probability score of about .000004 instead of 0 (not dreadfully far off) and for several trial data sets, the graph was not discernibly different from the one produced by the TI-83.
: L1 sto> L5
: SortA(L5)
: dim(L5) sto> N
: N sto> dim(L6)
: 1/(2*N) sto> S
: For(J,1,N,1)
: sqrt(-2*ln(.5-abs(J/N-S-.5))) sto> A
: .27061*A+2.30753 sto> B
: A*(A*.04481+.99229)+1 sto> C
: A-B/C sto> Z
: If J<=N/2
: Then
: -1*Z sto> Z
: End
: Z sto> L6(J)
: End
: Plot1(Scatter,L5,L6,+)
: ZoomStat
: Stop
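The A, B, C, Z steps of the program implement a rational approximation to the inverse normal CDF (Abramowitz and Stegun, formula 26.2.22, with a quoted accuracy of about ±0.003). For readers who want to check it off-calculator, here is a direct Python translation (the function names are mine):

```python
from math import log, sqrt
from statistics import NormalDist

def inv_norm_approx(p):
    """Inverse normal CDF via Abramowitz & Stegun 26.2.22,
    the same approximation the TI-82 program computes."""
    a = sqrt(-2 * log(min(p, 1 - p)))      # the program's A
    b = 2.30753 + 0.27061 * a              # B
    c = 1 + a * (0.99229 + 0.04481 * a)    # C
    z = a - b / c                          # Z (upper-tail value)
    return -z if p < 0.5 else z            # negate for the lower half

def ti82_normal_scores(n):
    """Normal scores with p_j = (j - 0.5) / n, as in the program."""
    return [inv_norm_approx((j - 0.5) / n) for j in range(1, n + 1)]

# Worst error against the exact inverse CDF at the five example points;
# it stays within the published +/-0.003 bound for this approximation.
worst = max(abs(inv_norm_approx(p) - NormalDist().inv_cdf(p))
            for p in (0.1, 0.3, 0.5, 0.7, 0.9))
```

At p = 0.5 the approximation returns a value on the order of 10⁻⁶ rather than exactly 0, consistent with the desk-check described above.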
The Confidence Limits, and Other Probability Plots
As part of the normal plot, many statistics programs provide a graphical display of the confidence limits, between which almost all of the points of a normal probability plot should lie. (See the above two diagrams for an example). Because they are there, students ask what they are, and how they are constructed.
The interpretation is straightforward - they are the limits between which almost all of the points of the distribution should lie, if in fact the distribution is normally distributed.
However, the construction of these confidence limits is beyond the scope of introductory statistics. For the curious, the NCSS 97 help file provides a clear though technical explanation of how these limits are constructed. It includes a discussion on how probability plots and confidence limits for distributions other than the normal distribution are constructed.