Letter-value plots: Boxplots for large data

Heike Hofmann, Karen Kafadar, Hadley Wickham.
Letter-value plots: Boxplots for large data.

Conventional boxplots are useful displays for conveying rough information about the central 50% and the extent of data. For small data sets (n < 200), detailed estimates of tail behavior beyond the quartiles may not be trustworthy, so the information provided by boxplots is appropriately vague beyond the quartiles, and the expected number of "outliers" in a sample of size n is often less than 10. Larger data sets (n ≈ 10,000 to 100,000) afford more precise estimates of quantiles beyond the quartiles, but conventional boxplots do not show this information about the tails and, in addition, flag large numbers of extreme, but not unexpected, observations.
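The growth in flagged points with n can be illustrated with a rough calculation. The sketch below assumes Gaussian data and the usual 1.5 × IQR fence rule, neither of which is stated in the abstract, and it uses population quartiles rather than sample estimates. The function names are hypothetical.

```python
import math

# Upper quartile of the standard normal distribution (rounded constant).
Q3 = 0.6744897501960817


def normal_cdf(z: float) -> float:
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_boxplot_outliers(n: int, k: float = 1.5) -> float:
    """Approximate expected count of points beyond the boxplot fences.

    Assumption: data are Gaussian and the fences sit at the population
    quartiles plus or minus k times the interquartile range.
    """
    iqr = 2.0 * Q3                 # interquartile range of N(0, 1)
    fence = Q3 + k * iqr           # upper fence; the lower one is symmetric
    p_outside = 2.0 * (1.0 - normal_cdf(fence))
    return n * p_outside


if __name__ == "__main__":
    for n in (200, 10_000, 100_000):
        print(n, round(expected_boxplot_outliers(n), 1))
```

Under these assumptions the count is a fixed fraction of n (a little under 1%), so a sample of a few hundred yields only one or two flagged points while a sample of 100,000 yields several hundred.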

The letter-value plot addresses both these shortcomings: (1) it conveys more detailed information in the tails using letter values, but only to the depths where the letter values are reliable estimates of their corresponding quantiles, and (2) "outliers" are labeled as those observations beyond the most extreme letter value. All features shown on the letter-value plot are actual observations, thus remaining faithful to the principles that governed Tukey's original boxplot. We illustrate letter-value plots on real data (univariate and bivariate) that demonstrate their usefulness, particularly for large data sets. All graphics are created using R, and code and data are available in the supplementary materials.
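As a rough illustration of the idea, letter values can be computed from Tukey's depth recursion: the median has depth (n + 1) / 2, and each further depth is (floor(previous depth) + 1) / 2. The Python sketch below is not the authors' R code. The `min_depth` stopping rule is a hypothetical placeholder standing in for the reliability criterion mentioned above, and the function names are illustrative. Half-integer depths average two neighbouring order statistics, following Tukey's convention.

```python
import math

# Conventional letter labels: M (median), F (fourths), E (eighths), ...
LABELS = "MFEDCBAZYXWVUTS"


def letter_values(data, min_depth=5.0):
    """Return (label, depth, lower, upper) tuples, from the median outward.

    Assumption: stop once the depth drops below `min_depth`; this is a
    simple placeholder, not the stopping rule used in the paper.
    """
    xs = sorted(data)
    n = len(xs)
    if n == 0:
        raise ValueError("data must be non-empty")

    def at_depth(d):
        # Depth is counted from each end, starting at 1.
        lo_idx = math.floor(d) - 1
        hi_idx = math.ceil(d) - 1
        lower = (xs[lo_idx] + xs[hi_idx]) / 2
        upper = (xs[n - 1 - lo_idx] + xs[n - 1 - hi_idx]) / 2
        return lower, upper

    result = []
    depth = (n + 1) / 2
    for label in LABELS:
        if label != "M" and depth < min_depth:
            break
        result.append((label, depth, *at_depth(depth)))
        if depth <= 1:             # reached the extremes; nothing further out
            break
        depth = (math.floor(depth) + 1) / 2
    return result


def outliers(data, lvs):
    """Observations beyond the most extreme letter value that was kept."""
    _, _, low, high = lvs[-1]
    return [x for x in data if x < low or x > high]


if __name__ == "__main__":
    sample = list(range(1, 10))
    lvs = letter_values(sample, min_depth=2)
    print(lvs)
    print(outliers(sample, lvs))
```

Because each step roughly halves the depth, the number of letter values grows with log2(n), which is why larger samples can support more boxes in the tails.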

@TechReport{letter-value-plot,
  author = {Heike Hofmann and Karen Kafadar and Hadley Wickham},
  institution = {had.co.nz},
  journal = {The American Statistician},
  title = {Letter-value plots: Boxplots for large data},
  year = {2011},
}