A histogram is a graphical chart, such as a bar chart, representing a frequency distribution of data elements where the heights of objects in the chart represent observed frequencies of the data elements. The many possible histograms that can be produced from a single data sample often vary greatly.
Histograms have been in use for approximately 300 years, by some accounts since 1662, and perhaps were the first and now are the most widely used graphic for quantitative data. The histogram is the most common graph of the distribution of one quantitative variable. Every year millions of individuals look at and may be influenced by histograms.
However, just as a data sample does not necessarily represent a population, a histogram does not necessarily represent a data sample. The appearance of a histogram of a data sample can be misleading. To make informed use of histograms for a presentation, an analysis or a decision, a choice among many possible histograms is required.
When a histogram appearance is used and the choice matters, experts may consider all of the other possible appearances, knowing with certainty that the method and system provide the complete palette of possibilities. Selection and optimality criteria may be applied to the finite set of possible appearances. A clearer understanding is obtained than from simply allowing location and width to vary continuously or haphazardly, or according to a procedure unrelated to the location and width level sets for the different appearances. It may be of interest to consider issues of human cognition in the context of data grouped with uniformly wide intervals. In practice, of course, it is impossible to vary any parameter continuously.
For most samples of data with n data elements, many histogram appearances are possible and many are not. One problem is to determine well defined subsets of all histogram appearances that are possible for a given data sample and to display those histogram appearances and a typical or preferred histogram having an appearance.
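The idea that one sample admits several distinct appearances can be illustrated with a minimal sketch. It assumes, for illustration only, that an "appearance" is the sequence of uniform-width bin counts with empty end bins trimmed, and it samples bin origins and widths on a coarse grid rather than determining the exact level sets. The function names `appearance` and `enumerate_appearances` and the sample values are hypothetical.

```python
import math

def appearance(data, origin, width):
    """Bin counts for uniform-width bins, with empty end bins trimmed."""
    counts = {}
    for x in data:
        k = math.floor((x - origin) / width)  # index of the bin holding x
        counts[k] = counts.get(k, 0) + 1
    lo, hi = min(counts), max(counts)
    return tuple(counts.get(k, 0) for k in range(lo, hi + 1))

def enumerate_appearances(data, widths, steps=50):
    """Collect the distinct appearances seen on a grid of (origin, width)."""
    seen = set()
    base = min(data)
    for w in widths:
        for s in range(steps):
            origin = base - w * s / steps  # shift the origin within one bin width
            seen.add(appearance(data, origin, w))
    return seen

sample = [1.0, 1.2, 1.9, 2.4, 3.1, 3.3, 4.8]
shapes = enumerate_appearances(sample, widths=[0.5, 1.0, 2.0])
for shape in sorted(shapes):
    print(shape)
```

Because the grid is finite, this sketch may miss appearances that occur only on narrow ranges of origin and width. It shows only that the set of appearances is finite and can be listed.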
Another problem is that for small data samples an error in uniform bin width histograms arises from sampling error and from histogram appearance variability.
Another problem is that it is difficult to determine maximum likelihood histogram density estimators for data samples. In 1990, Professor James R. Thompson, presently of Rice University, and Professor Richard A. Tapia published a proof that the well known histogram density for a given sample and an arbitrary set of bins, not simply uniform width bins, is the maximum likelihood density function estimator for a true but unknown density, from among all other step function approximations based on the given set of bins.
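The histogram density referred to here assigns each bin the height n_j / (n w_j), where n_j is the bin count, n the sample size, and w_j the bin width. The following is a minimal numerical sketch, not a proof: it compares the log-likelihood of that density with one other step function on the same bins. The sample, the bin edges, and the function names are hypothetical, and all data are assumed to lie within the bin edges.

```python
import math

def histogram_density(data, edges):
    """Heights n_j / (n * w_j) for the bins defined by consecutive edges."""
    n = len(data)
    heights = []
    for left, right in zip(edges[:-1], edges[1:]):
        count = sum(1 for x in data if left <= x < right)
        heights.append(count / (n * (right - left)))
    return heights

def log_likelihood(data, edges, heights):
    """Log-likelihood of the sample under a step density on the given bins."""
    total = 0.0
    for x in data:
        for left, right, h in zip(edges[:-1], edges[1:], heights):
            if left <= x < right:
                total += math.log(h)
                break
    return total

sample = [0.2, 0.5, 0.9, 1.4, 2.2, 2.7, 3.5]
edges = [0.0, 1.0, 3.0, 4.0]        # deliberately non-uniform bin widths
mle_heights = histogram_density(sample, edges)
flat_heights = [0.25, 0.25, 0.25]   # another step density on the same bins

print(log_likelihood(sample, edges, mle_heights))
print(log_likelihood(sample, edges, flat_heights))
```

Both sets of heights integrate to one over the bins, so the comparison is between two valid step densities. The first printed value should be the larger of the two.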
Professors Thompson and Tapia did not present a global maximum likelihood among a subset of histograms, such as those with uniform bin widths, using a procedure similar to the procedure that Professor Scott uses to approximate MISE (mean integrated squared error) UCV (unbiased cross-validation) histograms.
Regarding the method of moments, see, for example, Lindgren, B. W., 1968, Statistical Theory, 2nd Ed., MacMillan Company, p 278. Essentially, to fit a density function or other distributional law to a sample, if the density or other distributional law involves k parameters, then the first k sample moments are equated to the first k density or other distributional law moments expressed in terms of the density or other distributional law parameters. This leads to k equations or constraints, in k unknowns. Usually these can be solved for the k parameter values. The density or other distributional law defined by these method of moments parameter values is the method of moments density or other distributional law estimate based on the sample.
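As a minimal sketch of the k = 2 case, the following fits a gamma distribution by equating the first two sample moments to the gamma moments. The gamma family is chosen here only for illustration; it is an assumption and is not named in the text above. With shape a and scale b, the first moment is ab and the second moment is ab^2 + (ab)^2, which gives two equations in two unknowns. The function name and the sample are hypothetical.

```python
def gamma_method_of_moments(data):
    """Solve m1 = a*b and m2 = a*b**2 + (a*b)**2 for shape a and scale b."""
    n = len(data)
    m1 = sum(data) / n                  # first sample moment
    m2 = sum(x * x for x in data) / n   # second sample moment
    variance = m2 - m1 * m1             # equals a * b**2 under the gamma law
    scale = variance / m1
    shape = m1 * m1 / variance
    return shape, scale

sample = [1.0, 2.0, 3.0, 4.0, 5.0]
shape, scale = gamma_method_of_moments(sample)
print(shape, scale)
```

The sketch assumes a sample with positive mean and nonzero variance. For other distributional laws the k equations may have no closed-form solution and would need a numerical solver.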
The use of uniform bin width histogram shape level sets to obtain chi-squared minimizing bin parameter values and normal distribution parameter values, and to make and implement the estimators, was proposed by K. Smith in her circa 1916 Ph.D. dissertation, which was supervised by Karl Pearson and commented upon by R. A. Fisher. However, the mathematical details are very demanding and not easily understood.
Thus, it is desirable to provide a new method and system to determine blended histogram shape identifiers for data samples.