Some machine learning algorithms require all attributes to be discrete. Even where discrete attributes are not mandatory, such algorithms often perform poorly when the attributes are continuous. This is particularly true of non-parametric methods that explore non-linear relations between attributes. Given these limitations, most data scientists discretize all continuous attributes of a classified dataset before applying such algorithms to it. Discretizing a continuous attribute means finding a set of non-overlapping subintervals that together partition the range of that attribute, and mapping those subintervals to buckets, or discrete values. Different subintervals may be mapped to the same bucket, but one subinterval cannot be mapped to several buckets.
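As a minimal sketch of this mapping, the Python snippet below assigns a value to a bucket by locating its subinterval among sorted cut points. The function name `discretize`, the age cut points, and the bucket labels are illustrative assumptions and do not come from any particular discretization method.

```python
from bisect import bisect_right


def discretize(value, cut_points, bucket_labels):
    """Map a continuous value to the bucket of the subinterval containing it.

    cut_points must be sorted ascending; they split the real line into
    len(cut_points) + 1 non-overlapping subintervals. Each bounded
    subinterval includes its lower cut point and excludes its upper one.
    bucket_labels holds one label per subinterval, and labels may repeat.
    """
    if len(bucket_labels) != len(cut_points) + 1:
        raise ValueError("need exactly one bucket label per subinterval")
    # bisect_right counts the cut points that are <= value, which is the
    # index of the subinterval that contains the value.
    return bucket_labels[bisect_right(cut_points, value)]


# Hypothetical example: four age subintervals mapped to three buckets.
age_cut_points = [18.0, 40.0, 65.0]
age_buckets = ["minor", "adult", "adult", "senior"]
ages = [17.9, 18.0, 52.3, 65.0]
print([discretize(a, age_cut_points, age_buckets) for a in ages])
```

Here the subintervals [18, 40) and [40, 65) share the bucket "adult", while every value falls into exactly one subinterval and therefore into exactly one bucket.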
Discretization is also called cardinality reduction of continuous attributes, because it reduces the total number of unique values that the attribute takes. The performance of such machine learning algorithms depends on choosing a discretization method that minimizes the information lost when continuous values are grouped into buckets. There are two types of discretization methods: unsupervised, used when no classified dataset is available, and supervised, used when a classified dataset is available. Most machine learning algorithms are used to build class prediction models for a given dataset. Supervised discretization methods take the class variable into account while discretizing the continuous values, and can thus improve the predictive performance of the model compared with discretization methods that do not consider the class variable.
The methods available at present use the Chi-square test, or another statistical significance test based on contingency tables, to compare two consecutive subintervals (arranged in ascending or descending order) and, when they do not differ significantly, merge them into one subinterval. The merged interval is then tested in the same way against the next consecutive interval to decide whether to merge further. In this process, there is no guarantee that the merges minimize the loss of information. Existing methods also merge adjacent subintervals only, although there may be several subintervals that are not adjacent yet are mutually insignificant with respect to class distribution, that is, their class proportions are statistically equal.
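The adjacent-merge procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of any specific existing method: each subinterval is represented only by its per-class counts, there are two classes (so the test has one degree of freedom and the 5% critical value is 3.841), and the pair with the smallest statistic is merged first. The names `chi_square` and `merge_adjacent` are hypothetical.

```python
CRITICAL_1DF_05 = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05


def chi_square(counts_a, counts_b):
    """Pearson chi-square statistic for the 2 x k contingency table
    formed by the class counts of two subintervals."""
    rows = [counts_a, counts_b]
    row_totals = [sum(counts_a), sum(counts_b)]
    col_totals = [a + b for a, b in zip(counts_a, counts_b)]
    total = sum(row_totals)
    if total == 0:
        return 0.0
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # skip classes absent from both subintervals
                stat += (observed - expected) ** 2 / expected
    return stat


def merge_adjacent(intervals, threshold=CRITICAL_1DF_05):
    """Repeatedly merge the adjacent pair with the smallest statistic,
    stopping once every adjacent pair differs significantly."""
    intervals = [list(counts) for counts in intervals]
    while len(intervals) > 1:
        stats = [chi_square(intervals[i], intervals[i + 1])
                 for i in range(len(intervals) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] >= threshold:
            break
        # Merging two subintervals adds their class counts.
        intervals[i] = [a + b for a, b in zip(intervals[i], intervals[i + 1])]
        del intervals[i + 1]
    return intervals


# Hypothetical class counts [class 0, class 1] for four ordered subintervals.
merged = merge_adjacent([[10, 0], [9, 1], [1, 9], [0, 10]])
print(merged)
```

Only neighbouring entries of the list are ever compared, which is exactly the restriction described above: two non-adjacent subintervals with equal class proportions would never be tested against each other.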
At present, most methods initially create many subintervals, either by uniform scaling (fixed width) or by placing sequential values into one bucket until a minimum frequency is reached, and then compare these subintervals sequentially for statistical significance. Sequential buckets that do not differ significantly from one another are merged. This approach has a flaw.
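The two ways of creating the initial subintervals can be sketched as below. Both functions are hypothetical helpers written for illustration; the handling of tied values and of an undersized final bucket are assumptions, since the text does not specify them.

```python
def equal_width_edges(values, n_bins):
    """Interior cut points that split [min, max] into n_bins
    subintervals of equal width (uniform scaling)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def min_frequency_bins(values, min_count):
    """Group sorted values sequentially so that every bucket holds
    at least min_count values."""
    bins, current = [], []
    for v in sorted(values):
        # Close the current bucket once it is full, but never split
        # equal values across two buckets (an assumption).
        if len(current) >= min_count and v != current[-1]:
            bins.append(current)
            current = []
        current.append(v)
    if current:
        if bins and len(current) < min_count:
            # Fold an undersized final bucket into its predecessor.
            bins[-1].extend(current)
        else:
            bins.append(current)
    return bins


print(equal_width_edges([0, 10], 4))
print(min_frequency_bins([5, 1, 3, 2, 4, 6, 7], 3))
```

Neither function looks at the class variable; the class distribution only enters in the subsequent significance tests between consecutive buckets.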
For example, suppose three subintervals I1, I2 and I3 are such that the pairs (I1, I2) and (I2, I3) are each found statistically insignificant. There is no guarantee that the pair (I1, I3) is also statistically insignificant, because statistical insignificance is not transitive. In this case, present methods merge all three buckets I1, I2 and I3 even though I1 and I3 differ significantly. To overcome this flaw, one must check all the buckets mutually for statistical significance before merging them. For n subintervals to be merged together there are n(n - 1)/2 pairs, so up to O(n²) statistical significance tests must be done. Moreover, the system must scan the dataset several times if the data does not fit into local memory.
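The following Python sketch makes the I1, I2, I3 situation concrete. The class counts are invented for illustration, a two-class problem is assumed (one degree of freedom, 5% critical value 3.841), and `chi_square` and `mutually_insignificant` are hypothetical helper names.

```python
from itertools import combinations

CRITICAL_1DF_05 = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05


def chi_square(counts_a, counts_b):
    """Pearson chi-square statistic for two subintervals' class counts."""
    rows = [counts_a, counts_b]
    row_totals = [sum(counts_a), sum(counts_b)]
    col_totals = [a + b for a, b in zip(counts_a, counts_b)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def mutually_insignificant(group, threshold=CRITICAL_1DF_05):
    """True only if every pair in the group is statistically
    insignificant; this needs n * (n - 1) / 2 tests for n subintervals."""
    return all(chi_square(a, b) < threshold
               for a, b in combinations(group, 2))


# Invented class counts [class 0, class 1] for three ordered subintervals.
I1, I2, I3 = [30, 20], [25, 25], [20, 30]

print(chi_square(I1, I2) < CRITICAL_1DF_05)  # adjacent pair: insignificant
print(chi_square(I2, I3) < CRITICAL_1DF_05)  # adjacent pair: insignificant
print(chi_square(I1, I3) < CRITICAL_1DF_05)  # end points: significant
print(mutually_insignificant([I1, I2, I3]))
```

With these counts the two adjacent pairs each give a statistic of about 1.01, while (I1, I3) gives 4.0, so a purely sequential procedure would merge all three subintervals whereas the all-pairs check rejects the merge.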