From the Exploring Data website - http://curriculum.qed.qld.gov.au/kla/eda/
© Education Queensland, 1997

Ticky-Tacky Boxes

Malvina Reynolds was one of my favourite songwriters in the 1960s, and her song about ‘little boxes, made of ticky-tacky, and they all look just the same’ was a classic. Well, in this context it would be nice if those little boxes in boxplots looked just the same, but they don’t. These emails from members of the apstats mailing list discuss some of the methods adopted by various statisticians, textbook authors and graphics calculator manufacturers.

Background Information

Minitab is a popular statistics program; SAS and Splus are statistics programs too. 'Moore and McCabe' (also referred to as 'M&M') refers to the popular text Introduction to the Practice of Statistics by Moore and McCabe. The TI-83 is a graphics calculator with a strong statistics component. QLP is the Quantitative Literacy Project.

From Timothy Brown

Forgive me if this has been discussed already, but I think it hasn't. My class and I discovered today that when MINITAB draws "modified" box plots it identifies outliers by some rule other than the 1.5 x IQR from Q1 and Q3 that Moore and McCabe and others use. (We got an outlier on the MINITAB plot that didn't show up on somebody's TI-83.) Does anybody know what criterion MINITAB uses to identify outliers? I couldn't find the answer in the documentation (this is Student Minitab for the Mac, v.8).

From Bob Hayden

I am not familiar with this particular version of Minitab, but traditionally Minitab has used the 1.5 x IQR rule for detecting outliers. However, it computes the IQR by subtracting the first quartile from the third. Despite what they say, this is not what Moore and McCabe do. Use "describe" on your data and see if you and Minitab agree on Q1 and Q3.

Minitab handles quartiles with a general approach that fits all kinds of quantiles -- deciles, percentiles, etc. In this system, Q1 has rank 0.25(n+1) and Q3 has rank 0.75(n+1). You can see that this could involve interpolating one-fourth or three-fourths of the way between adjacent data values, while this could never happen with the procedure used in Moore and McCabe.

When Tukey invented the boxplot he used approximate quartiles which he called "hinges". He wanted a simple technique that could be implemented quickly without any computing machinery. Minitab could use these hinges and get boxplots exactly like Tukey's, or it could take the position that hinges are only approximations to quartiles, and so quartiles are the thing to use and Tukey's handmade boxplots are only approximations to the "correct" ones Minitab produces. They seem to have taken the latter position.

Later, the QLP materials implemented a DIFFERENT approximation to the quartiles, and TI implemented this on the grounds that it was what teachers were familiar with. The confusion caused by QLP's introduction of non-standard definitions and conventions has been a pet peeve of mine for some time. It might be interesting to go back over the history of this list and see how many questions have their roots here.

From John Burnette

As we progress through M&M's first chapter we are noticing a difference in terminology between Minitab and the text. Not that it makes much difference, but if Q1 doesn't fall on a data point, Minitab interpolates between the two adjacent data points. This isn't a lot different from the way that Moore defines Q1.

To make things more confusing, the boxplots ARE the same as M&M defines them; however Minitab calls those values "upper and lower hinges". HL and HU, as they call the values, seem to be calculated by the process Moore uses to find Q1 and Q3.

From Bob Hayden

That's only the tip of the iceberg! I have seen more than a dozen different ways of defining these points which give slightly different results. In some cases, two methods give the same results for some values of n and different results for other values of n. What Minitab does (or did in the older versions I'm most familiar with) for Q1 and Q3 is part of a general approach to quantiles that is older than boxplots. It can involve some messy interpolation in the general case, so when Tukey invented the boxplot (remember, a lot of his inventions were meant to be done while flying on a plane in the days before laptops) he used approximate quartiles which he called "hinges". Elementary textbooks tend to blur this distinction and give one definition and call the result a "quartile". For example, Siegel uses Tukey's hinges and calls them "quartiles". Later a different approximate quartile was adopted for the QLP materials. It was subsequently adopted in the texts by David Moore and by the TI-82 and TI-83, and is pretty much standard in K-12 as a result. These materials also blur the distinction Minitab is (correctly) making.

Those are my general comments, and this sentence is a note to move on if you do not want more detail on

1. the problems of using non-standard terminology.

2. what the actual different definitions are.

I think the QLP materials are wonderful -- better than 99% of the stuff being used to teach statistics in the colleges. But I do wish they were a little less "creative" in their terminology. Hinges and quartiles were already well established before QLP came along, so I think that would be a good reason to go with one or the other. I'm not sure why they went with a third alternative (and called it a quartile). I also note that they used the term "lineplots" for the things that most people (and Minitab) call "dotplots". Whatever the reasons, using nonstandard terminology does have the price of confusion sooner or later. I've even been criticised or "corrected" by high school teachers because I used standard terminology for things rather than the nonstandard terminology they were accustomed to. (Was it McCauley who said, "Beware the man who's read but one book"?)

TI did their homework and asked around to find out which of the many definitions of quartiles/hinges they would implement on their calculators. However, they were more interested in usage in high school classrooms than in the statistics profession.

Speaking of TI and standard terminology, the TI-82 implemented what I would call "quick" boxplots. These are boxplots without any flagging of outliers. Since these lack one of the two main reasons for doing boxplots in the first place, I was disappointed. I was happier when I saw that the TI-83 does plain old real boxplots as well. I was not so happy when I saw that the manual called the quick boxplots "regular boxplots" and the regular boxplots "modified boxplots", as if the limitations of the 82 were the standard of regularity, and a real boxplot was some kind of aberration.

Here are three flavours of quartiles. Everybody agrees you need to sort the data first. I'll just talk about the first quartile. To get the third, sort your data in the wrong direction and then follow the steps below.

Minitab

The first quartile has rank (n+1)/4. Note that everyone agrees that the median has rank (n+1)/2 and Minitab is just extending the pattern. It does the same sort of thing to get deciles or percentiles. If the rank is not a whole number, Minitab uses linear interpolation between two adjacent data values. Note that in the case of quartiles this may put you one-fourth or three-fourths of the way between two data values.

Tukey/Siegel (Tukey's Hinge)

Find the median. Then find the median of the data values whose ranks are LESS THAN OR EQUAL TO the rank of the median. This will be a data value or it will be half way between two data values.

Moore/QLP

Find the median. Then find the median of the data values whose ranks are STRICTLY LESS THAN the rank of the median. This will be a data value or it will be half way between two data values.

Note that for SOME values of n, SOME of these methods give the same results. No two of them give the same results for all n.
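The three recipes above can be sketched in Python as follows. This is only an illustration of the rules as described in this message, not the actual code of Minitab or of any calculator. The function names are made up for the example, the data set is invented, and the interpolating version assumes at least three data values.

```python
from statistics import median

def minitab_q1(data):
    """First quartile at rank (n+1)/4, with linear interpolation.

    Illustrative only; assumes at least three data values.
    """
    xs = sorted(data)
    rank = (len(xs) + 1) / 4      # 1-based rank, possibly fractional
    whole = int(rank)             # rank of the data value at or just below
    frac = rank - whole           # 0, 0.25, 0.5 or 0.75
    if frac == 0:
        return xs[whole - 1]
    return xs[whole - 1] + frac * (xs[whole] - xs[whole - 1])

def tukey_hinge(data):
    """Median of the values with ranks <= the rank of the median."""
    xs = sorted(data)
    return median(xs[:(len(xs) + 1) // 2])

def moore_q1(data):
    """Median of the values with ranks strictly below the rank of the median."""
    xs = sorted(data)
    return median(xs[:len(xs) // 2])

def upper(rule, data):
    """Third quartile: apply the same rule to the data sorted the other way."""
    return -rule([-x for x in data])

data = [10, 20, 30, 40, 50, 60, 70, 80, 90]   # made-up data, n = 9
for rule in (minitab_q1, tukey_hinge, moore_q1):
    print(rule.__name__, rule(data), upper(rule, data))
```

Negating the data in `upper` is just one way of "sorting in the wrong direction"; any equivalent mirror-image calculation would do.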

From Dr. Sidney J. Kolpas

Microsoft Excel for Windows 95 occasionally gives different answers for the quartiles of a set of data (Q1, Q2, Q3) than does the TI-82 and most statistics books. What's going on here?

From Ron Bremer

There is no unique definition of a percentile (and quartiles in particular). The 1st quartile is any number with at least 25% of the data less than or equal to it and at least 75% of the data greater than or equal to it. In general this defines an interval of values, any of which is a valid quartile.
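The definition above can be turned into a small check. The sketch below is a hedged illustration: the function name and the data are invented for the example, and counts are compared as whole numbers to avoid rounding trouble.

```python
def is_first_quartile(data, q):
    """True if at least 25% of the data are <= q and at least 75% are >= q."""
    n = len(data)
    at_or_below = sum(1 for x in data if x <= q)
    at_or_above = sum(1 for x in data if x >= q)
    # Compare 4 * count with n (or 3n) so that no fractions are needed.
    return 4 * at_or_below >= n and 4 * at_or_above >= 3 * n

data = [1, 2, 3, 4, 5, 6, 7, 8]   # made-up data
for q in (1.9, 2, 2.25, 2.5, 2.75, 3, 3.1):
    print(q, is_first_quartile(data, q))
```

For this data set every value from 2 to 3 passes the check and nothing outside that range does, which is the "interval of values" the definition allows.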

From Alan Hutson

i) Population percentiles are uniquely defined for a continuous random variable.

ii) Estimators of the percentiles are not uniquely defined, but some are more optimal than others in terms of bias and variability. For example, suppose we want to estimate the median. We could average the three middle observations or just use the intro to stat definition and use the middle observation as the estimator of the median. If the population is symmetric both estimators will be unbiased estimators of the population median, yet by using the three middle observations we would reduce the variability of the estimator of the median substantially.
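A quick simulation can make the comparison in (ii) concrete. The sketch below assumes a standard normal population, samples of size 11, 2000 repetitions and a fixed random seed, all chosen here purely for illustration; the function names are invented, and the result says nothing about other populations or sample sizes.

```python
import random
from statistics import mean, pvariance

def middle_estimators(sample):
    """Return (middle observation, average of the three middle observations).

    Assumes an odd sample size of at least three.
    """
    xs = sorted(sample)
    mid = len(xs) // 2                      # index of the middle observation
    return xs[mid], mean(xs[mid - 1:mid + 2])

def simulate(n=11, reps=2000, seed=1):
    """Variance of each estimator over repeated standard normal samples."""
    rng = random.Random(seed)
    singles, triples = [], []
    for _ in range(reps):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        one, three = middle_estimators(sample)
        singles.append(one)
        triples.append(three)
    return pvariance(singles), pvariance(triples)

var_single, var_triple = simulate()
print("variance using the middle observation:", var_single)
print("variance using the three middle observations:", var_triple)
```

Under these assumptions the three-observation estimator should come out with the smaller variance; how large the gain is depends on the population and the sample size.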

From Bob Hayden

There are several ways of defining quartiles. There are even more ways of defining the hinges used in boxplots. Some books and software call the hinges quartiles. For the purpose of making boxplots, I prefer hinges as originally defined by Tukey.

From Terry Moore

To be more specific, any value for which at least a quarter of the values are less than or equal to the value, and at least three-quarters of the values are greater than or equal to the value, is a quartile. This generalises to other partition values such as thirds. But it doesn't always give a unique value which makes it difficult to represent graphically (I suppose we could draw a shaded area).

There is no unique way to get a unique value. But I prefer one that generalises to other partition values than the one chosen by Tukey. Tukey divides the data into two groups using the median. He counts the median in both groups when there are an odd number of data whereas Moore & McCabe leave it out of both groups. Both conventions then take the medians of the groups as the hinges.

I don't know why Tukey did this - it doesn't generalise to thirds (and Tukey uses thirds for his Rline method) or to other partition values.

One method is to use 0.25 (or 0.75 or other proportions) *(n + 1) where n is the sample size and count this number of observations, interpolating if it is not an integer. Another is to use 0.25 (etc.) n + 1/2.

Alternatively you can aim to be consistent with the first definition, and choose a point half way between two adjacent observations if the value isn't unique.

But to reiterate my main point, I believe it would be simpler if we stuck to one rule for all partition values, rather than making a special case of hinges.

From Bob Hayden

Generally, Tukey's EDA methods were designed to be done quickly without computing machinery, so interpolating 0.75 or 5/12 (as one paper recommends), would not be viable choices. That's why they are "hinges" rather than quartiles. His choice has for me the metaphysically satisfying property that the five number summary of a batch of five numbers consists of the numbers themselves. The other methods would summarise five numbers with artificial constructs.

Using such precise techniques, or worrying about consistency with other -tiles is very Platonic, while data analysis is profoundly Aristotelian.

From Terry Moore

One method avoids tricky interpolation and still has this nice metaphysical property. Take 0.25(n+1). If it's an integer add 1/2 and interpolate. If not round up to the next integer. The interpolation needed occasionally is also needed with hinges - and twice as often. I find that weak students have trouble when there are a number of special cases.

Tukey's approach is nice and simple, but we still need other quantiles so students have two things to learn instead of one.

From David Irvine, responding to:

I know that there are a variety of ways of calculating percentiles, but is it possible that Excel calculates interpolated values wrongly? It certainly disagrees with Minitab. If you take the integers 1, 2, 3, 4, 5, 6, 7, 8 and calculate the first Quartile Excel returns 2.75 whereas Minitab returns 2.25.

I'm not sure I like either the Minitab or the Excel result for the first quartile for the data you give. SAS 6.11 (Win95) gives the first quartile as 2.5, which I personally find more acceptable.

For the p-quantile (in your case p = 1/4), rules with interpolation that are symmetric (so that the (1-p)-quantile in the reverse ranking is the same) are of the form p(n+a) + b where b = (1-a)/2.

Minitab uses a = 1, b = 0. Splus uses a = 0, b = 1/2. Excel seems to be using a = -1, b = 1. I have never seen it recommended to take a outside the range 0 <= a <= 1. Some people recommend a = 1/4, b = 3/8 which apparently satisfies some sort of optimality criterion. I'm not sure what. But a = b = 1/3 is easy to remember.
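The family of rules described in this message can be sketched as a single function. This is an illustration of the formula rank = p(n+a) + b with b = (1-a)/2, not the code of any of the programs named; the function name is invented, and clamping the rank to the range 1 to n is an assumption added here so that the sketch never reads past the ends of the data.

```python
import math

def quantile_ab(data, p, a):
    """p-quantile at rank p*(n + a) + b, where b = (1 - a)/2.

    Illustrative sketch; the clamping of the rank is an added assumption.
    """
    xs = sorted(data)
    n = len(xs)
    b = (1 - a) / 2
    rank = p * (n + a) + b
    rank = min(max(rank, 1), n)       # keep the rank inside 1..n
    whole = math.floor(rank)
    frac = rank - whole
    if frac == 0:
        return xs[whole - 1]
    return xs[whole - 1] + frac * (xs[whole] - xs[whole - 1])

data = [1, 2, 3, 4, 5, 6, 7, 8]
for label, a in [("a = 1", 1), ("a = 0", 0), ("a = -1", -1),
                 ("a = 1/4", 1 / 4), ("a = 1/3", 1 / 3)]:
    print(label, quantile_ab(data, 0.25, a))
```

With the integers 1 to 8, a = 1 gives the 2.25 attributed to Minitab and a = -1 gives the 2.75 attributed to Excel, which is consistent with the values of a and b suggested above.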

***

And there is even more....

From Bob Hayden

Responding to this message from John Burnette:

For those of you using Moore/McCabe together with Minitab - you need to be aware of a possible disconnection for your students. Moore describes the creation of boxplots with the "box" bounded by Q1 and Q3. Q1 is defined as the median of the "lower half" of the data points. Q3 is likewise defined as the median of the "upper half" of the data.

The "describe" command in Minitab defines Q1 as the (n+1)/4 element in a sorted data list - using interpolation as necessary. Regrettably this results in SOMETIMES being the same as Moore's Q1, sometimes not. To round out the confusion, the boxplots created by Minitab use "box limits" called "hinges", which are defined by, you guessed it, Moore's definition for Q1 and Q3.

Oh, it's much worse than that!-) In the beginning there were quantiles, and a variety of ways of defining them for discrete distributions (which includes all finite data sets). With, say, five numbers, there is NO number with the property that exactly 25% of the numbers fall below that value. So various interpolation schemes were employed. Minitab uses one of these for the values of Q1 and Q3 reported by the DESCRIBE command. In this context it is advantageous to adopt a general definition that works for quartiles, deciles, dodecahedriles, etc.

Then John Tukey invented boxplots. These used hinges, which you could think of as rough-and-ready quartiles. The lower hinge is the median of the data with ranks less than or equal to the rank of the median. Minitab bases its boxplots on Tukey hinges. So does DataDesk. I prefer this because then the five number summary for five numbers is the numbers themselves.

Then David Moore and/or the QLP people (I'm not sure who started it) made boxplots with the box starting at the median of the data values with ranks less than (but not equal to) the rank of the median. This is neither the hinge nor the first quartile, but they called it the first quartile. The TI-83 follows this. However, virtually no one else does it this way.
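The choice of box ends also changes which points the 1.5 x IQR rule flags. The sketch below uses an invented data set of seven values and invented function names; it shows one way a boxplot based on Tukey hinges could flag a point that a boxplot based on the Moore/QLP quartiles does not. It is an illustration only and may or may not be what happened in the classroom example that started this discussion.

```python
from statistics import median

def box_ends(data, include_median):
    """Ends of the box: Tukey hinges if include_median, else Moore/QLP quartiles."""
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2 if include_median else n // 2
    return median(xs[:half]), median(xs[n - half:])

def outliers(data, include_median):
    """Values more than 1.5 box lengths beyond either end of the box."""
    low, high = box_ends(data, include_median)
    step = 1.5 * (high - low)
    return [x for x in data if x < low - step or x > high + step]

data = [10, 20, 30, 40, 50, 60, 115]   # made-up data, n = 7
print("Tukey hinges:", box_ends(data, True), outliers(data, True))
print("Moore/QLP:   ", box_ends(data, False), outliers(data, False))
```

For an odd number of values the hinges sit inside the Moore/QLP quartiles, so the box is shorter and the cut-offs are tighter; here 115 is flagged with hinges but not with the Moore/QLP quartiles.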

Finally, to really confuse matters, you can often find pairs of definitions that agree for some n but not others, so you THINK they are the same but they are not ALWAYS the same. You need to look at n's with each of the four possible remainders on division by 4.

EXAMPLES

12 20 28 36
Tukey gets 16 for the hinge, Minitab gets 14 for Q1, Moore gets 16 for the first quartile.

10 20 30 40 50
Tukey gets 20 for the hinge, Minitab gets 15 for Q1, Moore gets 15 for the first quartile.

10 20 30 40 50 60
Tukey gets 20 for the hinge, Minitab gets 17.5 for Q1, Moore gets 20 for the first quartile.

10 20 30 40 50 60 70
Tukey gets 25 for the hinge, Minitab gets 20 for Q1, Moore gets 20 for the first quartile.

(I trust someone will check my arithmetic!-)
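Taking up that invitation, here is a sketch that recomputes the four examples. The function name is invented. For the Minitab-style value it leans on Python's `statistics.quantiles` with `method="exclusive"`, on the assumption (based on its documentation) that this method places the quartile at rank (n+1)/4 with linear interpolation; it is a stand-in for Minitab, not Minitab itself.

```python
from statistics import median, quantiles

def lower_values(data):
    """Lower hinge, (n+1)/4-rank quartile and Moore/QLP quartile for one data set."""
    xs = sorted(data)
    n = len(xs)
    return {
        "tukey": median(xs[:(n + 1) // 2]),      # ranks <= rank of median
        "minitab": quantiles(xs, n=4, method="exclusive")[0],
        "moore": median(xs[:n // 2]),            # ranks < rank of median
    }

# Data sets and the values quoted in the examples (Tukey, Minitab, Moore).
quoted = [
    ([12, 20, 28, 36], 16, 14, 16),
    ([10, 20, 30, 40, 50], 20, 15, 15),
    ([10, 20, 30, 40, 50, 60], 20, 17.5, 20),
    ([10, 20, 30, 40, 50, 60, 70], 25, 20, 20),
]

mismatches = []
for data, tukey, minitab, moore in quoted:
    got = lower_values(data)
    if (got["tukey"], got["minitab"], got["moore"]) != (tukey, minitab, moore):
        mismatches.append((data, got))

print("mismatches:", mismatches)
```

The four data sets have 4, 5, 6 and 7 values, so they cover each possible remainder on division by 4.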

I'm guessing the Moore/QLP approach was based on the idea of ignoring the distinction between hinges and quartiles and using a compromise common definition for both. This simplifies matters, but only if you never read other books or use software!-).

***

There now, does that clear it all up?