Statistics have been used for centuries to quantify data. Today, specific statistical measures and characteristics of database and schema objects and other forms of data presentation, such as the data distribution and storage characteristics of tables, columns, indexes, and partitions, are valuable to users and analysts and can be presented in a plurality of forms. One such form is the histogram. In viewing information and characterizing a set of samples, a histogram can provide a more complete picture of the distribution of the data than summary measures such as the mean and standard deviation. This is done by partitioning the data into a collection of buckets and reporting the number or percentage of samples that fall into each bucket. This report can take on various forms. Commonly used forms include tables and line, bar, and pie charts.
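The partitioning step described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `bucket_report`, the sample values, and the bucket edges are hypothetical, and the buckets are assumed to be half-open intervals, a choice the text does not specify.

```python
from bisect import bisect_right

def bucket_report(samples, edges):
    """Count samples per bucket; bucket i is assumed to cover [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for x in samples:
        # Ignore samples that fall outside the overall range of the buckets.
        if edges[0] <= x < edges[-1]:
            counts[bisect_right(edges, x) - 1] += 1
    total = sum(counts)
    # Each row: (low edge, high edge, count, percentage of counted samples).
    return [
        (edges[i], edges[i + 1], c, 100.0 * c / total if total else 0.0)
        for i, c in enumerate(counts)
    ]

samples = [1, 2, 2, 3, 5, 7, 8, 8, 9, 12]  # illustrative data only
for low, high, count, pct in bucket_report(samples, [0, 5, 10, 15]):
    print(f"[{low}, {high}): {count} samples ({pct:.1f}%)")
```

The same rows could feed a table or a bar, line, or pie chart; only the presentation differs.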
The histogram has become a popular tool used in graphing data from databases and other data sources. The histogram is used to summarize discrete or continuous data that are measured on an interval scale. In a line or bar chart presentation of a histogram, an independent variable (usually a bucket or range of data) is plotted along the horizontal axis of the histogram, and the dependent variable (usually a percentage) is plotted along the vertical axis of the histogram. The independent variable is capable of attaining only a finite number of discrete values (for example, five) rather than a continuous range of values. However, the dependent variable can span a continuous range.
Histograms are also often used to illustrate the major features of the distribution of data in a convenient form. A histogram divides up the range of possible values in a data set into classes, groups, or buckets. In a bar chart histogram, for each class, group, or bucket a rectangle is constructed, the bases being of equal length and the height being proportional to the number of observations falling into that class, group, or bucket.
Generally, a bar chart histogram will have bars of equal width, although this is not the case when class, group, or bucket intervals vary in size. The intervals do not have to be equal. For example, one bucket could be 0-5 while a second bucket is 6-15. Histograms can have an appearance similar to a vertical or horizontal bar graph. When the variables are continuous (i.e., a variable which can assume an infinite number of real values . . . e.g., an individual can walk 2.456721 . . . miles), no gaps are present between the bars. However, when the variables are discrete (i.e., a variable that takes only a finite number of real values . . . e.g., X can equal only 1, 3, 5, and 1,000), gaps should be left between the bars. In general, FIG. 5 provides a good example of a histogram.
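The unequal buckets mentioned above (0-5 and 6-15) can be illustrated with a short Python sketch. The function name `summarize_buckets` and the sample values are hypothetical. The count-per-value column reflects a common convention for comparing buckets of different widths; it is an assumption added for illustration, not something the description above prescribes.

```python
def summarize_buckets(samples, buckets):
    """Summarize integer samples over inclusive (low, high) buckets."""
    rows = []
    for low, high in buckets:
        count = sum(1 for x in samples if low <= x <= high)
        width = high - low + 1  # number of integer values the bucket covers
        # count / width makes buckets of different widths comparable.
        rows.append((f"{low}-{high}", count, width, count / width))
    return rows

samples = [0, 3, 5, 6, 9, 14, 15]  # illustrative data only
for label, count, width, per_value in summarize_buckets(samples, [(0, 5), (6, 15)]):
    print(f"{label}: count={count}, width={width}, count per value={per_value:.2f}")
```

Here the wider bucket holds more samples overall but fewer per value covered, which is the kind of difference that unequal intervals can hide when only raw counts are plotted.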
To analysts, the strength of a histogram is that it provides an easy-to-read picture of the location and variation within a data set. There are, however, various weaknesses in histograms. The first is that histograms can be manipulated to show different pictures. In such manipulations if too few or too many bars are used, the histogram can be very misleading. This is an area which requires some judgment, and perhaps various levels of experimentation, all based on the analyst's experience.
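The effect of the number of bars can be seen by bucketing the same data several ways. The following Python sketch is illustrative only; the function name `equal_width_counts` and the data are hypothetical, and equal-width buckets spanning the minimum to the maximum sample are assumed.

```python
def equal_width_counts(samples, n_bins):
    """Count samples in n_bins equal-width buckets spanning min..max."""
    low, high = min(samples), max(samples)
    if high == low:
        raise ValueError("samples must not all be equal")
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in samples:
        # Clamp so that the maximum value lands in the last bucket.
        index = min(int((x - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Two separated groups of values (illustrative data only).
samples = [1, 2, 2, 3, 3, 3, 4, 11, 12, 12, 13, 13, 13, 14]
for n_bins in (1, 2, 13):
    print(n_bins, equal_width_counts(samples, n_bins))
```

With one bar the two groups are merged, with two bars they appear as two equal blocks, and with thirteen bars the empty region between the groups and the shape within each group become visible.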
Another weakness is that histograms can also obscure differences among data sets. For example, a histogram of the number of births per day in the United States in 2003 would not reveal certain variations within the data (e.g., births to single parents, twin births, mortality information, etc.). Likewise, in industry applications, a histogram of a particular process run can usually tell only one part of a long story. A need therefore arises to keep reviewing the histograms and control charts for consecutive similar process runs over an extended time to gain useful knowledge about the specific process.
The analysis of the shape or the clustering of statistical data within histograms also lends useful information to analysts. Clustering, in one definition, deals with finding a structure in a collection of unlabeled data. Clustering could also be further defined as the process of organizing objects into groups whose members are similar in some way. A cluster is, therefore, a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters. So, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data.
Cluster analysis is data analysis with an objective of sorting categories or cases (people, things, events, etc.) into groups, or clusters, so that the degree of association is strong between members of the same cluster and weak between members of different clusters. Each cluster thus describes, in terms of the data collected, the class to which its members belong; and this description may be abstracted through use from the particular to the general class or type.
Frequency information, as it relates to statistical data, is also an important analysis tool. The frequency of a particular observation is defined as the number of times the observation occurs in the data. The distribution of a variable is the pattern of frequencies of the observation. Frequency distributions can be portrayed as frequency tables, histograms, or polygons. Frequency distributions can show either the actual number of observations falling in each range or the percentage of observations. In the latter instance, the distribution is called a relative frequency distribution.
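The definitions above translate directly into a small frequency table. The following Python sketch is a minimal illustration; the function name `frequency_table` and the observations are hypothetical.

```python
from collections import Counter

def frequency_table(observations):
    """Return (value, frequency, relative frequency) rows sorted by value."""
    counts = Counter(observations)
    total = len(observations)
    return [(value, n, n / total) for value, n in sorted(counts.items())]

observations = [2, 3, 3, 1, 2, 3, 4, 3]  # illustrative data only
for value, freq, rel in frequency_table(observations):
    print(f"{value}: frequency={freq}, relative frequency={rel:.3f}")
```

The frequency column gives the actual number of observations, and the relative frequency column gives the same information as a fraction of the total.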
Frequency distribution tables can be used for both categorical and numeric variables. Numeric variables may be either continuous or discrete.
A variable is said to be continuous if it can assume an infinite number of real values. Examples of continuous variables are distance, age, and temperature. Continuous variables should only be used with class intervals, which will be explained below. The measurement of a continuous variable is restricted by the methods used, or by the accuracy of the measuring instruments. For example, the height of a student is a continuous variable because a student may be 5.5321748755 . . . feet tall. However, when the height of a person is measured, it is usually measured to the nearest half inch. Thus, this student's height would be recorded as 5 ½ feet.
Discrete variables can only take a finite number of real values. An example of a discrete variable would be the score given by a judge to a gymnast in competition: the range is 0 to 10 and the score is always given to one decimal (e.g., a score of 8.5). Discrete variables may also be grouped. Again, grouping variables makes them easier to handle.
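Grouping a discrete variable such as the gymnast's score can be sketched as follows. This Python example is illustrative only: the function name `group_scores`, the class width of half a point, and the scores themselves are assumptions, not values taken from the description above.

```python
from collections import Counter

def group_scores(scores, width_tenths=5, max_tenths=100):
    """Group one-decimal scores into class intervals of width_tenths tenths."""
    # Work in integer tenths so that a score such as 8.7 is handled exactly as 87.
    groups = Counter(round(s * 10) // width_tenths for s in scores)
    table = []
    for g in sorted(groups):
        low = g * width_tenths
        # Cap the upper label at the maximum possible score (10.0).
        high = min(low + width_tenths - 1, max_tenths)
        table.append((f"{low / 10:.1f}-{high / 10:.1f}", groups[g]))
    return table

scores = [8.5, 9.0, 8.7, 9.2, 8.5, 9.4, 7.9, 10.0]  # illustrative data only
for interval, count in group_scores(scores):
    print(f"{interval}: {count}")
```

Only intervals that contain at least one score are listed; a full table would also show the empty intervals.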
What follows is an explanation of how to construct a series of different types of frequency distribution tables. Each example is shown to depict the various, but unlimited, types of data that are compiled for use in histograms.