In the field of data mining, large amounts of discrete data must be analyzed. For example, in the field of weather modeling, precipitation can be represented as an amount of water per unit of time (e.g., number of inches of rain per day). Researchers who study precipitation typically place multiple data sampling stations across the area in which the researcher is interested. Each discrete data point gathered at a data sampling station (i.e., a number representing the amount of water for a specific unit of time) is sent to a central computer at a configurable frequency. Based on all of the discrete data points, a researcher can forecast future precipitation. The more data points gathered, the greater the likelihood that the forecast is correct. Accordingly, researchers gather large amounts of data in the form of discrete data points before making the forecast.
Multiple tools exist to help individuals understand large amounts of data. One such tool is the histogram. A histogram shows the relative frequency of elements, or discrete data points, within a data set. Specifically, a histogram shows the distribution of elements (i.e., the number of elements that have values within a certain bucket). For example, suppose the data set represents the ages of individuals visiting a theme park. To represent the ages, age buckets, such as 0-4 years old, 5-9 years old, etc., are created. The histogram for the theme park shows the number of individuals whose ages fall within each bucket.
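The theme park example can be sketched in a few lines of code. The following is a minimal illustration only, assuming fixed five-year buckets and integer ages; the function name age_histogram and the sample ages are hypothetical and are not taken from any particular system.

```python
from collections import Counter


def age_histogram(ages, bucket_width=5):
    """Count how many ages fall into each fixed-width bucket.

    Assumes non-negative integer ages; bucket_width=5 gives 0-4, 5-9, etc.
    """
    # Map each age to the lower bound of its bucket, then count per bucket.
    counts = Counter((age // bucket_width) * bucket_width for age in ages)
    # Label each non-empty bucket with its inclusive range, in ascending order.
    return {
        f"{low}-{low + bucket_width - 1}": counts[low]
        for low in sorted(counts)
    }


# Hypothetical sample of visitor ages.
visitor_ages = [2, 3, 7, 8, 9, 12, 31, 34]
print(age_histogram(visitor_ages))
```

Only non-empty buckets are reported in this sketch; a display-oriented version might also list empty buckets so that gaps in the distribution are visible.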
To construct a histogram, a complete data set is required. In particular, all elements are obtained before construction. Because the data set is complete, the histogram can give a more accurate representation of the data. For example, a histogram in which 99% of the elements are within a single bucket is typically not helpful. However, with the complete data set, the buckets may be distributed across the range of the data set to provide a more useful representation of data.
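One way to distribute buckets across the range of a complete data set is to divide the span between the minimum and maximum values into equal-width buckets. The sketch below assumes that approach purely for illustration; other schemes (for example, buckets holding roughly equal numbers of elements) are equally possible, and the function names are hypothetical.

```python
def equal_width_edges(values, num_buckets):
    """Return num_buckets + 1 bucket edges spanning min(values)..max(values).

    Requires the complete data set, because min and max must be known.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical; the range has zero width")
    width = (hi - lo) / num_buckets
    return [lo + i * width for i in range(num_buckets + 1)]


def count_per_bucket(values, edges):
    """Count the values falling into each bucket defined by equal-width edges."""
    counts = [0] * (len(edges) - 1)
    width = edges[1] - edges[0]
    for v in values:
        # The maximum value lands exactly on the last edge; clamp it
        # into the final bucket instead of indexing past the end.
        idx = min(int((v - edges[0]) / width), len(counts) - 1)
        counts[idx] += 1
    return counts


# Hypothetical complete data set.
data = [1, 2, 2, 3, 9, 10, 11, 20, 21]
edges = equal_width_edges(data, 4)
print(edges)
print(count_per_bucket(data, edges))
```

Because the edges are derived from the minimum and maximum of the data, this calculation cannot be performed until every element has been obtained.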
Often, a large volume of data is submitted as a data stream. Specifically, the complete data set is often not known prior to construction of the histogram because data is constantly being sampled. In such scenarios, the complete data stream is stored. Upon receiving a request for a histogram, the histogram is statically calculated from a snapshot of the stored data stream.
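The store-then-calculate approach for a data stream can be sketched as follows. This is a minimal illustration under stated assumptions: the stream is held in memory, values are integers, and buckets have a fixed width. The class name StoredStream and its methods are hypothetical.

```python
class StoredStream:
    """Stores every element of a data stream and builds histograms on request."""

    def __init__(self):
        self._values = []

    def append(self, value):
        # Every incoming element is kept; storage grows with the stream.
        self._values.append(value)

    def histogram(self, bucket_width):
        """Statically calculate a histogram from a snapshot of stored data."""
        snapshot = list(self._values)  # copy of the data stored so far
        counts = {}
        for v in snapshot:
            low = (v // bucket_width) * bucket_width
            counts[low] = counts.get(low, 0) + 1
        # Keys are bucket lower bounds, in ascending order.
        return dict(sorted(counts.items()))


stream = StoredStream()
for reading in [3, 7, 12]:
    stream.append(reading)

first = stream.histogram(5)   # reflects only the data stored so far
stream.append(8)              # the stream keeps growing afterwards
second = stream.histogram(5)  # a new request recalculates from scratch
print(first, second)
```

Each request rescans all stored data, so a histogram returned earlier does not reflect elements that arrive after its snapshot was taken.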
Further, data sets often contain noise and/or outliers. Noise corresponds to faulty values in the data sets that, for example, are the product of faulty measuring. Outliers correspond to values that are at the extremes (i.e., values that fall outside the range of the generally collected data). For example, suppose that the data set represents the temperature in Houston during the summer months and a data element is received with a temperature value of 45 degrees Fahrenheit (° F.) because of a highly unusual cold front. In such a scenario, 45° F. is an outlier because the temperature during the summer months generally ranges from 75° F. to 106° F.
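The Houston example can be illustrated with a simple range check. The sketch below assumes the expected range is known in advance (75° F. to 106° F., as in the example); the function name and the sample readings are hypothetical, and a fixed range is only one of many possible ways to identify values outside the generally collected data.

```python
def split_by_expected_range(values, low, high):
    """Separate values inside the expected range from those outside it."""
    typical = [v for v in values if low <= v <= high]
    outside = [v for v in values if not (low <= v <= high)]
    return typical, outside


# Hypothetical summer temperature readings in degrees Fahrenheit.
readings = [88, 95, 45, 101, 79]
typical, outside = split_by_expected_range(readings, 75, 106)
print(typical)
print(outside)
```

A range check of this kind cannot tell whether a flagged value such as 45° F. is a genuine outlier caused by a cold front or noise caused by a faulty measurement; making that distinction requires information beyond the value itself.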