In the field of data mining, large amounts of discrete data must be analyzed. For example, in weather modeling, precipitation can be represented as an amount of water per unit of time (e.g., inches of rain per day). Researchers who study precipitation typically place multiple data sampling stations across the area of interest. Each discrete data point gathered at a data sampling station (i.e., a number representing the amount of water measured during a specific unit of time) is sent to a central computer at a configurable frequency. Based on all of the discrete data points, a researcher can forecast future precipitation. The more data points gathered, the greater the likelihood that the forecast is correct. Accordingly, researchers gather large amounts of data in the form of discrete data points before making a forecast.
Multiple tools exist to help individuals understand large amounts of data. One such tool is the histogram. A histogram shows the relative frequency of elements, or discrete data points, within a data set. Specifically, a histogram shows how many elements fall within each external bucket. For example, suppose the data set represents the ages of individuals visiting a theme park. To represent the ages, external buckets, such as 0-4 years old, 5-9 years old, etc., are created. The histogram for the theme park shows the number of individuals whose ages fall within each external bucket.
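The theme park example can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `age_histogram`, the five-year bucket width, and the sample ages are assumptions made for the example, not part of the description above.

```python
from collections import Counter


def age_histogram(ages, bucket_width=5):
    """Count how many ages fall within each fixed-width external bucket."""
    # Map every age to the lower bound of its bucket (e.g., 7 -> 5).
    counts = Counter((age // bucket_width) * bucket_width for age in ages)
    # Label each bucket by its inclusive range, such as "0-4" or "5-9".
    return {
        f"{low}-{low + bucket_width - 1}": counts[low]
        for low in sorted(counts)
    }


# Hypothetical visitor ages, used only for illustration.
ages = [3, 7, 8, 12, 4, 35, 36, 9, 2]
hist = age_histogram(ages)
for label, count in hist.items():
    print(label, count)
```

Each key is an external bucket and each value is the number of visitors whose age falls within that bucket.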
To construct a histogram, a complete data set is required. In particular, all elements are obtained before the histogram is constructed. Because the data set is complete, the histogram can give a more accurate representation of the data. For example, a histogram in which 99% of the elements fall within a single external bucket is typically not helpful. With the complete data set, however, the external buckets can be distributed across the range of the data set to provide a more useful representation of the data.
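One way to distribute external buckets across the range of a complete data set is to use equal-width buckets between the minimum and maximum values. The sketch below assumes that approach; the function name `equal_width_histogram` and the sample values are hypothetical, and other bucketing strategies (e.g., equal-count buckets) would serve the same purpose.

```python
def equal_width_histogram(values, num_buckets):
    """Build equal-width buckets spanning the range of a complete data set."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when every value is identical.
    width = (hi - lo) / num_buckets or 1.0
    counts = [0] * num_buckets
    for v in values:
        # Clamp the index so the maximum value lands in the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_buckets + 1)]
    return edges, counts


# Hypothetical values that would all land in one bucket of 0-49.
values = [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
edges, counts = equal_width_histogram(values, 5)
print(counts)
```

Because the minimum and maximum are known before construction, no bucket boundary is placed outside the data, and the elements spread across the buckets instead of collecting in a single one.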
Often, a large volume of data is submitted as a data stream. Specifically, a complete data set is often not known prior to construction of the histogram because data is constantly being sampled. In such scenarios, the complete data set is often stored in a manner that maintains all information about the value of each data element. When a histogram is requested, the histogram can be created using the complete data set as if the data set had arrived at one time.
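The store-everything approach described above might look like the following sketch. The class name `StoredStream`, its methods, and the sample values are assumptions made for illustration. Note that the stored list grows with every sample received, which is the cost of retaining all information about each element.

```python
from collections import Counter


class StoredStream:
    """Hypothetical store that keeps every element of a data stream."""

    def __init__(self):
        self._values = []  # grows with every sample received

    def append(self, value):
        """Record one element as it arrives from the stream."""
        self._values.append(value)

    def histogram(self, bucket_width):
        """Build a histogram on request from all values stored so far."""
        counts = Counter(
            (v // bucket_width) * bucket_width for v in self._values
        )
        # Keys are bucket lower bounds, returned in ascending order.
        return dict(sorted(counts.items()))


stream = StoredStream()
for sample in [1, 4, 6, 11, 13, 14]:  # hypothetical samples
    stream.append(sample)

coarse = stream.histogram(10)
fine = stream.histogram(5)
print(coarse, fine)
```

Because every value is retained, the same stored data can be bucketed differently on each request, exactly as if the data set had arrived at one time.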