Conventional data analysis techniques perform well when all of the data attributes or features are of numerical type, that is, when all the data points of the data set have only numerical (continuous) attributes. For example, most neural network algorithms and most regression algorithms are restricted to data points with numerical data attributes.
H. Ralambondrainy, “A Conceptual Version of the K-Means Algorithm,” Pattern Recognition Letters 16, 1995, pp. 1147–1157, discloses a data clustering technique for converting data having categorical attributes to 1-of-p representations that are then combined with data of the same data set having real attributes. The combined 1-of-p representations and real attributes are used directly in a clustering algorithm such as the k-means algorithm.
Typically, data records comprising both categorical data fields and numerical data fields are processed by one of two methods.
A given categorical data field has m different categories. The first approach recodes the m different categories by mapping them to m corresponding new binary columns. For each record, exactly one of these new binary columns has the code “1”, namely the column corresponding to the actual value of the category, and the others have the code “0”.
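The first approach can be illustrated with a minimal sketch. The function name `one_hot_encode`, the sample values, and the choice to order the columns alphabetically are assumptions made for illustration only; they are not taken from any cited reference.

```python
def one_hot_encode(values):
    """Recode categorical values into m binary columns (1-of-m coding).

    Illustrative helper: column order is alphabetical by assumption.
    """
    categories = sorted(set(values))                 # the m distinct categories
    column_of = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)                  # all columns start at 0
        row[column_of[value]] = 1                    # exactly one column is 1
        rows.append(row)
    return categories, rows


# Hypothetical categorical data field with m = 3 categories.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
```

In this sketch each record gains one new column per category, so a field with m categories always contributes m columns regardless of how many records contain each category.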
The second approach arbitrarily recodes the categorical values as numbers, e.g., by encoding the first category as “1”, the next category as “2”, and so on.
Both of these methods have severe disadvantages. Introducing m new binary columns for each categorical data field can create a very large number of additional data fields in the records, making it difficult for most analysis methods to process the records.
In addition, arbitrarily recoding the categorical values introduces spurious relations between the values. This arbitrary recoding introduces an implied small distance between the first occurring value and the second occurring value, and an implied large distance between the first occurring value and the last occurring value. Consequently, arbitrarily recoding the categorical values leads to an incorrect analysis, because the arbitrarily assigned numbers suggest similarities or dissimilarities between the categories where none exist.
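The implied distances of the second approach can be seen in a short sketch. The function name `ordinal_encode` and the sample values are hypothetical, and numbering the categories by order of first occurrence is one assumed way of assigning the arbitrary codes.

```python
def ordinal_encode(values):
    """Assign 1, 2, 3, ... to categories in order of first occurrence.

    Illustrative helper: the numbering carries no meaning of its own.
    """
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1
    return codes


# Hypothetical categorical data field; the categories have no natural order.
codes = ordinal_encode(["red", "green", "blue", "green"])

# Distances implied purely by the arbitrary numbering.
dist_first_second = abs(codes["red"] - codes["green"])
dist_first_last = abs(codes["red"] - codes["blue"])
```

Here the numbering makes “red” appear closer to “green” than to “blue”, although nothing in the data supports that ordering; presenting the same records in a different order would change the implied distances.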
There is therefore a need for a system and an associated method that efficiently and accurately determine numerical representations for categories of categorical data fields. The need for such a system and method has heretofore remained unsatisfied.