Decision tree generators (often simply called decision trees) are systems widely used in automated data analysis, in particular for analyzing large quantities of data. They build a tree of tests that describes how to determine one variable (called the target variable) as a function of a set of explanatory variables.
In a decision tree (also called a classification tree), each node is a test of the value of an attribute of the data to be classified. Each leaf of the decision tree indicates the value of the variable to be explained, known as a target variable.
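The structure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `Leaf`, `DecisionNode`, `classify` and `example_tree`, and the attributes "age" and "owner", are hypothetical and do not come from any of the systems cited here. Each internal node tests one attribute, and each leaf carries a value of the target variable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

Record = Dict[str, object]  # one row of data: attribute name -> value


@dataclass
class Leaf:
    value: object  # value of the target variable assigned at this leaf


@dataclass
class DecisionNode:
    attribute: str                  # attribute tested at this node
    test: Callable[[object], bool]  # test applied to the attribute value
    if_true: "Tree"                 # subtree followed when the test succeeds
    if_false: "Tree"                # subtree followed when the test fails


Tree = Union[Leaf, DecisionNode]


def classify(tree: Tree, record: Record) -> object:
    """Walk from the root to a leaf, applying one attribute test per node."""
    while isinstance(tree, DecisionNode):
        passed = tree.test(record[tree.attribute])
        tree = tree.if_true if passed else tree.if_false
    return tree.value


# Hypothetical tree with one continuous and one discrete attribute.
example_tree = DecisionNode(
    attribute="age",
    test=lambda v: v < 30,
    if_true=Leaf("low"),
    if_false=DecisionNode(
        attribute="owner",
        test=lambda v: v == "yes",
        if_true=Leaf("high"),
        if_false=Leaf("medium"),
    ),
)
```

For instance, `classify(example_tree, {"age": 40, "owner": "yes"})` follows the failing branch of the first test and the succeeding branch of the second, and ends at the leaf holding "high".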
The most widely known decision tree systems are CART, C4.5, C5 and CHAID.
FIG. 1 illustrates what is meant by data in this context. The data may be represented, for example, by the rows 1001 to 100N of a table, and the attributes (also called variables or properties) by the columns 1011 to 101N of the table.
A first known technique is that of the CART decision tree system (see Breiman L., Friedman J. H., Olshen R. A. and Stone C. J., “Classification and Regression Trees”, Chapman and Hall, 1984). This method of data classification, which is already well known in “data mining”, is used to generate a decision tree as a function of one and only one target variable. It is therefore a supervised method with a single target variable. The attributes of the data processed may be continuous or discrete. The target variable may be continuous or discrete.
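To illustrate what generating a tree as a function of one target variable involves, the following sketch selects a test threshold on one continuous attribute by minimizing the Gini impurity of a discrete target variable, a criterion commonly associated with CART-style classification trees. It is a simplified illustration under stated assumptions, not the CART system itself: the function names `gini` and `best_threshold` are hypothetical, and pruning, discrete attributes and continuous targets are not handled.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of target values (0.0 for a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best test 'value <= threshold'.

    Returns (None, impurity of the unsplit data) if no split improves on it.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate two equal attribute values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best
```

The single target variable is what drives the choice of test here: without the `labels` argument there is nothing to score a split against, which is why a method of this kind needs exactly one target variable.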
A second prior art technique is that of the unsupervised DIVOP system (see Chavent M., “A monothetic clustering method”, Pattern Recognition Letters 19, p. 989-996, 1998). This unsupervised classification method is used to generate a decision tree without a target variable, for continuous variables only. It offers an alternative to the very widely known unsupervised classification method called the “upward (or ascending) hierarchical classification method” by generating a decision tree instead of a dendrogram (a dendrogram is an upward grouping binary tree that uses no test nodes). It cannot be used to manage a target variable.
The first known technique (the CART technique) needs one (and only one) target variable to generate a decision tree. It therefore cannot be used for unsupervised classification (also known as clustering), in which there is no target variable. Furthermore, it cannot generate a decision tree that takes several target variables into account.
The drawback of the second prior art technique (the DIVOP technique) is that it is limited to entirely continuous data and to unsupervised classification. The DIVOP method therefore cannot be used for supervised classification to explain a target variable. Nor can it be used on data comprising discrete attributes (with qualitative values), which are nevertheless very common in real cases.
Furthermore, the two known techniques (the CART and DIVOP techniques) cannot be used to generate a decision tree based on multivariate rules, namely a tree whose test nodes use several explanatory attributes at a time. Other methods do allow such generation, but only within the restricted framework of supervised analysis for numerical explanatory attributes, performed chiefly by linear regression.
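The difference between a univariate test node and a multivariate one can be made concrete with a short sketch. The functions `univariate_test` and `multivariate_test`, the attribute names and the weights below are hypothetical, and the linear combination is only one possible form of multivariate rule over numerical attributes; it is not taken from any specific prior art method.

```python
def univariate_test(record, attribute, threshold):
    """Test node using a single explanatory attribute."""
    return record[attribute] <= threshold


def multivariate_test(record, weights, threshold):
    """Test node using several numerical attributes at a time.

    `weights` maps attribute names to coefficients; the test compares
    their weighted sum with a threshold (a linear-combination rule).
    """
    return sum(w * record[name] for name, w in weights.items()) <= threshold


# Hypothetical record with two numerical attributes.
record = {"height": 2.0, "width": 3.0}

single = univariate_test(record, "height", 2.5)  # 2.0 <= 2.5
combined = multivariate_test(record, {"height": 1.0, "width": 2.0}, 7.0)  # 8.0 <= 7.0
```

In this example the univariate test succeeds while the multivariate test fails, because the second rule depends on both attributes together rather than on either one alone.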
Furthermore, the modeling work of the statistician or data analyst always consists of generating, with statistical tools for example, either an unsupervised model (as with the DIVOP technique), for exploratory analysis for example, or a supervised model (as with the CART technique) for a single target variable. A statistician or data analyst wishing to perform both tasks has no single type of tool available and must use one type of tool for each. Furthermore, if several target variables are to be studied, it is generally necessary to build several supervised models, each dedicated to modeling one of the variables. This is particularly true for models built by means of decision trees.