How to Make Statistics Boring
Rex Boggs
Glenmore High School
Rockhampton QLD
Ah! I'll bet some of you thought, "Hey, statistics is already boring, it doesn't need to be made boring." Given the sort of statistics to which we've been subjecting ourselves and our students over the years, such an attitude would not be surprising. Consider the following exercise on constructing a boxplot, which is from a popular Maths A text.
Construct a box-and-whisker graph for the following data which are the masses in kilograms of 9 Year-11 girls: 35 47 48 50 51 53 54 70 75
This was chosen only because it is a typical example of the statistics that many of us are teaching our students. I am not picking on this particular textbook. All of the Maths A and B texts that I have examined are loaded with similar examples. If this exercise doesn't convince you that statistics is boring, there are many more where this came from.
Actually, boring is not the major point of this article, despite the title. Being boring is not the only thing wrong with this exercise. There are four other things, in fact. It is fake. It is trivial. It is pointless. And the answer is wrong.
It is Fake
How do I know that it is fake? Well, I don't know for sure. But authors of statistics textbooks will usually acknowledge the source of a dataset, and this dataset was not acknowledged. My guess is that most of the datasets in our textbooks are fake, i.e. they have been made up by the authors. I would be happy to be proved wrong. What message about the value of statistics does this give our students?
There are many real datasets available. Real statisticians don't analyse fake data, at least not if they want to earn real money. We don't need to use fake data when teaching our students about statistical concepts. How can we hope to convince our students that statistics is a worthwhile subject when we don't even give them real problems to study?
Real datasets are accessible. My source of choice is the Internet. There are a number of websites devoted to datasets. My favourite is the Data and Story Library (http://lib.stat.cmu.edu/DASL/). This site contains a large number of datasets, each with a story about how and why the data was gathered. The site contains a useful search engine, so you can, for example, ask for all of the datasets useful for teaching linear regression. A list of other websites that contain datasets is available from the Links page of the SMARD website - http://cq-pan.cqu.edu.au/schools/smad/hotlinks.html
There are a number of excellent textbooks written in the US for introductory university statistics courses that contain numerous small datasets. In some cases, these datasets are available on disk, so they are easily accessible to any maths teacher with a computer. To illustrate what is available, here is a brief review of a text recommended by many members of the edstat-l mailing list on the Internet -
Introduction to the Practice of Statistics by Moore & McCabe -
" . two things about M&M stand out in my mind. (1) The data sets were chosen thoughtfully and with great care. There's a statistical "lesson" associated with almost all of them. (2) M&M is loaded with sound, practical advice--much of it easy-to-use "rules of thumb"--that you can find almost nowhere else (in elementary textbooks, that is) ."
(Bruce King, Department of Mathematics and Computer Science, Western Connecticut State University)
Not every dataset used to teach statistics needs to be real. For example, Mal Shield in the June 1996 issue of Teaching Mathematics used some fake data to illustrate the Central Limit Theorem. But Mal had a definite purpose for doing so. Similarly, I use a set of random normal data in Maths B and Maths C to illustrate the effects of sample size. I think a good rule of thumb is - use fake data at times, but only if there is a good reason for it.
It is Trivial
Who cares about the masses of a small sample of Year 11 girls? What message are we giving to our students about statistics, when this is supposedly an example of the type of problem that statistics can solve? Statistics is used by government, science, business and industry in making decisions about the allocation of funds, the environment, about medical treatments, about marketing strategies, about land use and in process control in manufacturing. These are the sorts of problems with which we should be confronting our students. We should be using datasets about things that actually matter to people.
It is Pointless
One reason we draw boxplots is to visually compare the distribution of two or more sets of data. In this exercise we are asked to draw a single boxplot. We are not comparing this distribution to anything. This boxplot, drawn by itself, is of limited value. Especially when the solution given in the text is wrong.
It is Wrong
Another reason we construct boxplots is to determine if there are any outliers. If outliers exist, we need to examine the data, and decide what to do with them. They can't just be ignored. This textbook incorrectly teaches the student to draw the whiskers right out to the maximum and minimum values. In doing so, the student is unable to determine from the boxplot whether any outliers exist, let alone how many.
Here is an extract from the Help file from a statistics program called NCSS Jr. 6.0 -
"Generally, outliers cause distortion in statistical tests. You must scan your data for outliers (the box plot is an excellent tool for doing this). If you have outliers, you have to decide if they are one-time occurrences or if they would occur in another sample. If they are one-time occurrences, you can remove them and proceed. If you know they represent a certain segment of the population, you have to decide between biasing your results (by removing them) or using a nonparametric test that can deal with them. Most would choose the nonparametric test."
There is general agreement among professional statisticians on how to draw the whiskers of a boxplot. The method described in every Maths A and Maths B text that I was able to check is wrong. We would all be jumping up and down if a similarly fundamental error was made about calculus. Why should we accept it in our statistics?
If you put the above data into a TI-81, TI-82 or even a TI-92 graphical calculator and draw a boxplot, it will draw the boxplot exactly the same way as the textbook does! What is going on here? Are these graphical calculators also wrong?
Well, yes. The reason is that TI listened to high school teachers about what statistics they wanted in a graphical calculator. This is a very good thing. Unfortunately, they didn't then go to a professional statistician to find out how to properly implement those features.
The new TI-83, which incorporates a good deal of statistics, is interesting. It does boxplots in two ways - correctly and incorrectly. Unfortunately the regular method is incorrect, while the modified method shows outliers. But at least you have the option of drawing a boxplot correctly.
How to Draw a Box Plot
There is a commonly accepted method of drawing the whiskers on a boxplot. However, a number of methods to determine the values of the first and third quartiles exist. The method recommended by Tukey, who invented the boxplot, is as follows:
Find the median. Then find the median of the data values whose ranks are LESS THAN OR EQUAL TO the rank of the median. This will be a data value or it will be half way between two data values.
With a dataset with an odd number of values, include the median in each of the two halves of the dataset and then find the median of each half. This gives the first and third quartiles. If the dataset has an even number of values, just split the data into two halves, and find the median of each half.
Here is an example using the above dataset, which contains an odd number of values:
35 47 48 50 51 53 54 70 75
Split the data into two halves, each including the median:
35 47 48 50 51 and 51 53 54 70 75
Find the median of each half. In this example, the first quartile is 48 and the third quartile is 54. Hence the interquartile range is 54-48 = 6.
I'll add a number to the above dataset to illustrate how to find the quartiles for an even number of values (what the heck, the data is bogus anyway):
35 47 48 50 51 53 54 60 70 75
Split the data into two halves:
35 47 48 50 51 and 53 54 60 70 75
Now find the median of each half. In this example, the first quartile is 48 and the third quartile is 60. Hence the IQR is 60-48 = 12.
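For readers who would like to check the two worked examples above on a computer, here is a minimal sketch in Python. The function name tukey_quartiles is my own invention for illustration, not part of any statistics package, and the code simply follows the split-and-find-the-median recipe as described in this article.

```python
from statistics import median


def tukey_quartiles(values):
    """Return (Q1, Q3) using the include-the-median rule described above.

    Hypothetical helper for illustration only.
    """
    data = sorted(values)
    n = len(data)
    # Size of each half. For an odd count this rounds up, so the
    # median itself lands in both the lower and the upper half.
    half = (n + 1) // 2
    lower_half = data[:half]
    upper_half = data[n - half:]
    return median(lower_half), median(upper_half)


odd = [35, 47, 48, 50, 51, 53, 54, 70, 75]
even = [35, 47, 48, 50, 51, 53, 54, 60, 70, 75]

for dataset in (odd, even):
    q1, q3 = tukey_quartiles(dataset)
    print(f"Q1 = {q1}, Q3 = {q3}, IQR = {q3 - q1}")
```

Run on the two datasets above, this reproduces the quartiles worked out by hand: 48 and 54 for the nine values, and 48 and 60 for the ten values.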
Alternative Method for Drawing the Box As Drawn by the TI-82 and TI-83
As I would like my students to get the same answer as the TI-82 and TI-83 graphical calculators, this is the method that I will be teaching in the future.
Find the median. Then find the median of the data values whose ranks are LESS THAN the rank of the median. This will be a data value or it will be half way between two data values.
Here is the same example using the above numbers, which contains an odd number of values:
35 47 48 50 51 53 54 70 75
Split the data into two halves, not including the median:
35 47 48 50 and 53 54 70 75
Find the median of each half. In this example, the first quartile is 47.5 and the third quartile is 62. Hence the interquartile range is 62-47.5 = 14.5.
Note the difference in the answers between the two methods! It is not really surprising when you consider that we are doing a five-number summary on a set of only 9 numbers.
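The alternative method can be sketched in the same way. Again this is only an illustration under my own assumptions: the name exclusive_quartiles is made up, and the code follows the description given in this article rather than anything taken from the calculators themselves.

```python
from statistics import median


def exclusive_quartiles(values):
    """Return (Q1, Q3), leaving the median out of both halves when
    the number of values is odd.

    Hypothetical helper for illustration only.
    """
    data = sorted(values)
    n = len(data)
    # Size of each half. For an odd count this rounds down, so the
    # middle value belongs to neither half.
    half = n // 2
    lower_half = data[:half]
    upper_half = data[n - half:]
    return median(lower_half), median(upper_half)


masses = [35, 47, 48, 50, 51, 53, 54, 70, 75]
q1, q3 = exclusive_quartiles(masses)
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {q3 - q1}")
```

For the nine values this gives quartiles of 47.5 and 62 and an interquartile range of 14.5, matching the hand calculation. For a dataset with an even number of values the two methods split the data identically, so they agree.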
Drawing the Whiskers
The correct method for drawing the whiskers is marginally more complicated than that described in our texts. The maximum length of each whisker is 1.5 times the interquartile range (IQR). To draw the whisker above the 3rd quartile, draw it to the largest datapoint that is less than or equal to the value that is 1.5 IQRs above the 3rd quartile. Any datapoint larger than that should be marked as an outlier. The whisker below the 1st quartile is drawn in the same way, to the smallest datapoint that is no more than 1.5 IQRs below the 1st quartile. Some statisticians differentiate between mild outliers and severe outliers. Mild outliers lie between 1.5 and 3 IQRs above the 3rd quartile (or below the 1st quartile), while severe outliers are more than 3 IQRs above the 3rd quartile (or below the 1st quartile). I don't believe we need to make this distinction in our courses.
Here is an example, using the first set of numbers above and Tukey's method of determining Q1 and Q3:
35 47 48 50 51 53 54 70 75
The IQR is 6. Now 1.5 times 6 equals 9. This is the maximum length of the whisker. Subtract 9 from the first quartile: 48 - 9 = 39. Note that 35 is an outlier, and the whisker should be drawn to 47, which is the smallest value that is not an outlier.
[diagram needed]
Add 9 to the third quartile: 54 + 9 = 63. Any value larger than 63 is an outlier, so in this instance both 70 and 75 are outliers. Draw the whisker to the largest value in the dataset that is not an outlier, in this case 54. Since this value is the 3rd quartile, we draw no whisker at all! Mark 70 and 75 as outliers. The boxplot is given below:
This gives a markedly different picture of the data than the answer in the text.
If we use the alternative method with the IQR of 14.5 then we have no outliers, and the boxplot looks like this:
[diagram needed]
Beware doing boxplots on a small set of numbers!
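The whisker rule is easy to check by computer as well. The sketch below is an illustration under stated assumptions: the function name whiskers_and_outliers and the parameter k are my own, and the quartiles are passed in as arguments so that either quartile method can be tried on the same data.

```python
def whiskers_and_outliers(values, q1, q3, k=1.5):
    """Return (lower whisker end, upper whisker end, outliers).

    Hypothetical helper for illustration only. A whisker may reach
    at most k IQRs beyond the box, and anything further out is
    reported as an outlier.
    """
    data = sorted(values)
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    outliers = [x for x in data if x < lower_limit or x > upper_limit]
    # Each whisker ends at the most extreme value that is not an outlier.
    return min(inside), max(inside), outliers


masses = [35, 47, 48, 50, 51, 53, 54, 70, 75]

# Quartiles from Tukey's method (IQR = 6)
print(whiskers_and_outliers(masses, 48, 54))

# Quartiles from the alternative method (IQR = 14.5)
print(whiskers_and_outliers(masses, 47.5, 62))
```

With Tukey's quartiles the whiskers end at 47 and 54, and 35, 70 and 75 are flagged as outliers. Since 54 is also the 3rd quartile, the upper whisker has no length. With the alternative quartiles the whiskers run all the way out to 35 and 75, and no outliers are reported.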
Some students ask "Why 1.5 IQRs?" Paul Velleman, from Cornell University, was fortunate enough to study under John Tukey, the inventor of box plots (and numerous other modern statistical tools). When he asked John Tukey this question, Tukey said, "Because 1 is too small and 2 is too large." According to Velleman, "... there is a paper [which] shows that the outlier rule is really quite good across a pretty wide array of distributions."
The Case for Using Computers to Teach Statistics
Another disagreement I have with the authors of our current collection of Maths A and Maths B textbooks is the almost total disregard for the power of computers in performing statistical calculations. In no textbook did I see any reference to using computers to assist in analysing data. Many of the interesting datasets are too large to analyse by hand. Without access to computers and statistical software, our students can only be given a watered down version of the power and value of statistics.
Real statisticians use real computers with real statistical software to solve real problems. Why shouldn't our students? Following this argument, we should not exclusively use -
· our textbooks, or
· graphical calculators, or
· spreadsheets
to do statistics. These all have their uses, but students should also use software that is designed for doing statistics. Fortunately there is an excellent statistics software program available for free, at least for Windows-based computers, called NCSS Jr. 6.0.
There are a number of reasons why graphical calculators are not sufficient for teaching statistics. The size of the datasets that can be analysed is restricted, not just by memory, but by the time it takes to input the data. And the data is not stored permanently, so a dataset typed in today will not be available a month later when you want to do further analysis on the same dataset.
As pointed out above, some of the statistics built into some of the graphical calculators are just plain wrong, and other statistics, e.g. dot plots and stem plots, are just not there. Tukey introduced the 4 Rs of statistics, the first of which is Revelation. By this he means that a statistician should view the data in as many ways as possible. By doing this, such features as normality, the existence of outliers and clumping of the data may be noted. NCSS Jr will produce histograms, box plots, normal probability plots, scatter plots, dotplots, probability density curves and stem and leaf diagrams in the twinkling of an eye. Some of these are not available with graphical calculators, and to produce those that are takes considerable effort on the part of the user.
There has recently been an animated discussion on the edstat-l mailing list on the Internet about using an Excel spreadsheet with a statistical add-on package in place of statistics software. After all, many schools and students already have Excel, and hence there is no additional expense.
The problem with Excel is that it is NOT a statistics program. Doing statistics with Excel is difficult, the documentation is minimal, and Excel does have errors. Bob Hayden, from the Department of Mathematics, Plymouth State College, puts the case against using spreadsheets for doing serious statistics very nicely -
"Using Excel for dealing with data is rather like using a $1.99 adjustable wrench from China for dealing with nuts and bolts. I don't mind if consenting adults do things like that in the privacy of their own home, but to offer a course in auto mechanics and actually TEACH students to handle nuts and bolts that way is another matter."
If a low cost alternative didn't exist, then we might be willing to put up with these limitations. But there is an excellent alternative for Windows, and it is free. The program is NCSS Jr. 6.0.
NCSS Jr. 6.0
NCSS 6.0 is a professional statistics package for Windows 3.1 or Windows 95. NCSS Jr. 6.0 is a cut-down version of this program, designed to be used with an introductory university statistics course. It includes every statistical tool that is needed for Mathematics A, B and C, as well as an excellent help file that explains many of the statistical tools. The program is free to educators and students, and can be downloaded from the Internet (http://WWW.NCSS.com/). This site can also be visited via SMARD's Great Web Sites page - http://cq-pan.cqu.edu.au/schools/smad/hotlinks.html SMARD also contains a number of datasets that have been downloaded from the Data and Story Library and converted to NCSS format.
Summary
The message is simple - to teach statistics as well as possible, we should use real datasets, we should use statistical tools correctly, and we should empower our students to use professional statistical software.