From the Exploring Data website - http://curriculum.qed.qld.gov.au/kla/eda/
© Education Queensland, 1997

How Wide Is Your Bin?

Following a chat with Geoff McLaughlin, Reader in Statistics at the University of Queensland, I developed an interest in how the various statistics programs choose the 'automatic' settings for the bin widths of a histogram. So I sent an email to the EdStat-L mailing list.

Below is an edited version of the original email and some of the replies. By the way this is an example of the high level advice that is available from a mailing list populated by experts willing to share their knowledge. As an aside, note that there is some nice high school algebra in these algorithms. I’m sure there is a great assignment linking functions and statistics in this interesting topic.

Statistics programs exist that allow the user to change the bin width dynamically, just by dragging with the mouse. Data Desk, the statistics program underlying ActivStats, is probably the best known of these.

The Original Question

I have been asking a few questions about how computer statistics programs choose their automatic settings for histograms and have been given these two 'rules of thumb' -

A. The sample size, n, is related to the number of bins, N, by the formula n = 2^(N-1).

B. The bin width is given by w = 2 * IQR / n^(1/3).

Are these formulas used? Are they widespread? What is the underlying theory behind them?
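For concreteness, here is a minimal sketch of what the two rules compute (the function names are mine and numpy is assumed). Rule A rearranges to N = 1 + log2(n), which is Sturges' rule; rule B is the Freedman-Diaconis bin width.

```python
import numpy as np

def bins_rule_a(n):
    """Rule A: n = 2^(N-1), i.e. N = 1 + log2(n) (Sturges' rule)."""
    return int(np.ceil(1 + np.log2(n)))

def bin_width_rule_b(data):
    """Rule B: w = 2 * IQR / n^(1/3) (the Freedman-Diaconis width)."""
    q75, q25 = np.percentile(data, [75, 25])
    iqr = q75 - q25
    return 2 * iqr / len(data) ** (1 / 3)

rng = np.random.default_rng(0)
data = rng.normal(size=1000)
print(bins_rule_a(len(data)))   # 11 bins for n = 1000
print(bin_width_rule_b(data))
```

Note the different flavours: rule A fixes the number of bins from n alone, while rule B fixes the bin width from the spread of the data as well.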

From Paul Velleman

As is clear from the variety of postings about histogram scaling, and as is known by those of us who have wrestled with this problem over the years,

a) There are an almost unlimited number of algorithms.

b) They have only one thing in common - they don’t work.

By (b) I mean that any algorithm I know of fails to do what I want it to on some dataset or other. The reason for that is that what I want may depend on information not available to the algorithm, such as whether I am hunting outliers or whether I believe there is a smooth underlying density shape that I would like to see approximated.

The only workable solution is to allow statistics package users to rescale histograms dynamically once they are made. Data Desk and JMP both offer ways to do this by dragging some part of the display with the mouse. Other packages may do so as well, but I'm not aware of the details.

From Rasmus Tamstorf

I have an intro text to statistics (in Danish) stating (without reference) that

N = 1 + 3.3 * log10(n), which is approximately 1 + log2(n), since 1/log10(2) is about 3.32,

is a good choice. This seems to be the same rule of thumb as the one you cite, so apparently it has found its way around the world and it is being used, but I'll leave the question about the underlying theory to some of our gurus on this list.
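A quick check of how close the rounded-off coefficient 3.3 comes to the exact base-2 form (standard library only; the sample sizes are arbitrary):

```python
import math

# 3.3 * log10(n) is a rounded-off version of log2(n), since
# log2(n) = log10(n) / log10(2) = 3.3219... * log10(n)
for n in (30, 100, 1000, 10000):
    approx = 1 + 3.3 * math.log10(n)
    exact = 1 + math.log2(n)
    print(n, round(approx, 2), round(exact, 2))
```

The two versions differ by less than a tenth of a bin even at n = 10000, so rounding to 3.3 costs nothing in practice.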

From Peter ? (I couldn’t find his surname)

The notes are turning yellow, but I found the following in the notes I took in my stat methods course. No explanation accompanied it, other than the reference. Widespread? I can't say that I've seen or heard any rule, yours or these, more than once.

*Rough guidelines* for picking the number of cells [bins]:

1) Number of cells is approximately 1 + 3.3*log10(n) for data set size n, with n >= 15

(Sturges' Rule, JASA, 1926)

2) Number of cells >= (Range / Cell width) >= (2n)^(1/3)

(Terrell's Rule, Rice U, 1983)
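A hypothetical side-by-side of the two guidelines (the function names are mine; standard library only):

```python
import math

def sturges_bins(n):
    # Guideline 1: approximately 1 + 3.3 * log10(n), for n >= 15
    return math.ceil(1 + 3.3 * math.log10(n))

def terrell_min_bins(n):
    # Guideline 2: at least (2n)^(1/3) cells
    return math.ceil((2 * n) ** (1 / 3))

for n in (15, 100, 1000, 10000):
    print(n, sturges_bins(n), terrell_min_bins(n))
```

Note that the Terrell lower bound grows like n^(1/3) while the Sturges count grows like log n, so for large samples the two guidelines pull in different directions.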

From Donald F. Burrill

The rest of Rex's post, and the responses I've read so far, present formulae based on the sample size. This may be reasonable either (1) for constructing histograms "by hand", or (2) for automatic computer scaling of bin widths when the total width is not bounded, as for histograms where the bars are shown horizontally and the number of lines can be as many as one wants; in both cases, provided the distribution to be displayed is essentially continuous.

But for automatic scaling of histograms where the bars are shown vertically, there is a further constraint in the physical width of the line used: in the number of characters available if one is producing a character-plot version of a histogram, or in the default physical width of a bin if one is using high-resolution graphics. And this can have unanticipated consequences when the distribution is not continuous (for example, when the only legitimate values are integers within a certain range, as in scores derived from tests or survey instruments).

A couple of years ago a published paper contained histograms nicely drawn with vertical bars, showing an interesting set of "spikes" at regular intervals. The author spent a paragraph or so explaining how those spikes (about twice the height of the adjacent histogram bars) could have come about (having to do with presumed preferential patterns of responses to the instrument whose scores were being reported), evidently in the belief that each histogram bar represented a single score (a bin of width 1). But the automatic scaling had produced a histogram with eight (8) intervals for every ten (10) scores (which were integers); with the result that every 4th interval contained the combined frequency of two scores, while the other three intervals contained the frequency for only one score. The "spikes", which seemed to demand a theoretical explanation of some kind, were in fact artifacts of the default automatic scaling of the histogram-producing program.
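Burrill's artifact is easy to reproduce. The sketch below (numpy assumed, the data invented) bins uniformly distributed integer scores 0-9 into 8 equal-width intervals: each interval is 1.25 wide, so every 4th interval swallows two scores and shows roughly double the frequency of its neighbours.

```python
import numpy as np

rng = np.random.default_rng(1)
# 8000 integer scores drawn uniformly from 0..9: the true frequency
# of every score is flat, about 800 each.
scores = rng.integers(0, 10, size=8000)

# Force 8 equal-width bins over the range of 10 possible scores,
# mimicking the automatic scaling in the published histogram.
counts, edges = np.histogram(scores, bins=8, range=(0, 10))

# The bins [0, 1.25) and [5, 6.25) each contain two integer scores,
# so their counts spike to roughly twice those of the other bins.
print(edges)
print(counts)
```

With a plotting package, those two doubled counts appear as exactly the kind of regularly spaced "spikes" Burrill describes, demanding no theory at all.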

I conclude that the various formulae cited by other respondents are quite possibly inherently dysfunctional for Diophantine problems -- especially if one does not view the histogram(s) produced with adequate skepticism.

From Bob Hayden

Rex asked what statistics packages actually DO. Most responses have been about advice given for handmade plots. The issues in computer implementation are somewhat different. Andy Siegel's text considers stem and leaf plots to be a species of histogram. There is code and discussion of same for stem and leaf plots in ABCs of EDA by Velleman and Hoaglin (1981), Duxbury. The code ended up in Minitab. Obviously this is not the latest word, but if you just want a general idea of the issues involved, I have found it a helpful reference.

By explaining EDA to people who could not understand Tukey's books and putting EDA into Minitab, this book had a major impact on statistics.

Summary

As Paul Velleman wrote in a subsequent email, research is still being done on the optimal bin width for the default histogram generated by statistics programs! The message for the student, though, is clear: don't just automatically accept the default histogram your statistics program generates. Especially if the data are skewed, multi-modal, or have some unusual feature in their distribution, the student should generate a number of histograms and choose the one that seems to give the truest picture of the data, or the one that best displays the particular aspect of the data in which the student is interested.
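As a sketch of that advice (numpy assumed, the bimodal sample is invented; with a plotting package each row of counts would be drawn as a histogram):

```python
import numpy as np

rng = np.random.default_rng(2)
# A deliberately bimodal sample: a broad clump around 0 and a
# narrow clump around 6.
data = np.concatenate([rng.normal(0, 1, 700), rng.normal(6, 0.5, 300)])

# Don't accept one default: look at the shape under several bin
# counts and keep the picture that shows the structure of interest.
for bins in (5, 10, 20, 50):
    counts, _ = np.histogram(data, bins=bins)
    print(bins, counts)
```

With very few bins the narrow second clump can blur into the tail of the first; with very many, sampling noise starts to look like structure. The right choice depends on what the viewer wants to see, which is exactly Velleman's point.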

And of course, for a student to generate this collection of histograms efficiently, they must at a minimum have access to a graphing calculator.