Companies have a plethora of data linking objects and their measurements to an ultimate outcome, and they want to use this data to make better future decisions. For example, baseball scouts have information about bat speed, batting order, batting average, and many other statistics linked to players they would like to duplicate or avoid. They use this information to help them draft better players. Credit card companies gather information about potential customers to assess whether a potential customer is a credit risk, based on the credit history of customers with similar attributes. Medical researchers gather information for various settings of parameters in hopes of identifying combinations of the parameters that typically lead to a positive outcome.
A difficulty in making good decisions is that objects with similar attributes often produce different outcomes. With enough attributes, any two objects with different outcomes can be placed in separate groups, but the groups might be so specific that very little data exists for each group. A confident decision cannot be made on so little information.
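The tradeoff described above can be illustrated with a minimal sketch. The records, attribute names, and outcomes below are hypothetical and chosen only for illustration; they are not taken from any actual data set. The sketch groups six records by progressively more attributes and reports how many groups result, how many of them have a single outcome, and how large the largest group is.

```python
from collections import defaultdict

# Hypothetical records: (attributes, outcome). All names and values are made up.
records = [
    ({"income": "high", "age": "young", "region": "north"}, "paid"),
    ({"income": "high", "age": "young", "region": "south"}, "defaulted"),
    ({"income": "high", "age": "old", "region": "north"}, "paid"),
    ({"income": "low", "age": "young", "region": "north"}, "defaulted"),
    ({"income": "low", "age": "old", "region": "south"}, "paid"),
    ({"income": "low", "age": "old", "region": "north"}, "defaulted"),
]


def group_by(records, attribute_names):
    """Collect the outcomes of all records sharing the same attribute values."""
    groups = defaultdict(list)
    for attributes, outcome in records:
        key = tuple(attributes[name] for name in attribute_names)
        groups[key].append(outcome)
    return dict(groups)


def summarize(records, attribute_names):
    """Return (number of groups, groups with a single outcome, largest group size)."""
    groups = group_by(records, attribute_names)
    pure = sum(1 for outcomes in groups.values() if len(set(outcomes)) == 1)
    largest = max(len(outcomes) for outcomes in groups.values())
    return len(groups), pure, largest


print(summarize(records, ["income"]))                   # (2, 0, 3)
print(summarize(records, ["income", "age"]))            # (4, 2, 2)
print(summarize(records, ["income", "age", "region"]))  # (6, 6, 1)
```

In this toy data, using all three attributes separates every outcome, but each group then contains a single record, which is too little information to support a confident decision.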
Therefore, what is needed is a system and method for determining how to optimally separate objects in a data set into groups with a similar outcome.
One way in which this problem has been solved in the past is by examining all possible sets of attributes and their associated outcomes, assuming possession of a set of attributes from which at least some groups with a clear outcome can be identified. However, such a list is often infeasible to compile, so a tool known as a decision tree is often employed instead.
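A rough sense of why such a list is infeasible can be given by counting. The sketch below assumes that "all possible sets of attributes" means every non-empty subset of the available attributes; that reading, and the attribute names used, are assumptions made for illustration. With n attributes there are 2^n - 1 such subsets, and each subset in turn yields one group per combination of attribute values, so the number of groups is larger still.

```python
from itertools import combinations


def attribute_subsets(attribute_names):
    """Yield every non-empty subset of the given attribute names."""
    for size in range(1, len(attribute_names) + 1):
        yield from combinations(attribute_names, size)


def count_attribute_subsets(n):
    """Number of non-empty subsets of n attributes."""
    return 2 ** n - 1


# Hypothetical attribute names, borrowed from the baseball example.
names = ["bat_speed", "batting_order", "batting_average"]
print(len(list(attribute_subsets(names))))  # 7
print(count_attribute_subsets(30))          # 1073741823
```

Three attributes give only seven subsets to examine, but thirty attributes already give more than a billion.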
Decision trees divide objects into sets according to attributes. They cut down the list of groups mentioned above to a feasible size. In order to do so, they give preference to some attributes over others; that is, decision trees consider more groups for attributes examined early in the decision tree than for attributes examined later. Decision trees also cut down the list of potential groups by “pruning branches” with a “clear” outcome.
While pruning branches and giving preference to attributes cut down the list of groups that need to be considered to a manageable size, they also constitute the well-known flaws of decision trees. The order in which attributes are considered significantly affects the utility of a decision tree, and it is often not clear which order is best. Further, branches are often pruned before all information is available in order to save work, and so groups with different outcomes that might be separated are often lumped together.
Decision trees also have a flaw fundamental to their design. Traditionally, the branches of decision trees partition objects into disjoint sets. As soon as objects are split, they cannot be reunited. It is often the case that the unclassified objects, upon completion of a decision tree, are part of a group with a clear outcome, but were split off from the members of their group. The only way to rectify the issue is to build multiple decision trees, and the number of trees required quickly grows until building them becomes infeasible.
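The effect of disjoint partitioning can be sketched with a small example. Everything in it is a hypothetical assumption made for illustration: the attributes "a" and "b", the outcomes, the fixed split order (first "a", then "b"), and a minimum group size of three below which an outcome is not treated as clear. The fully grown tree has one leaf per pair of values of "a" and "b". The objects with b equal to "x" all share one outcome, but the split on "a" divides them between two leaves that are each too small to classify.

```python
from collections import defaultdict

MIN_SUPPORT = 3  # assumed minimum group size for a confident decision

# Hypothetical objects with two attributes and an outcome.
records = [
    {"a": 0, "b": "x", "outcome": "good"},
    {"a": 0, "b": "x", "outcome": "good"},
    {"a": 0, "b": "y", "outcome": "bad"},
    {"a": 0, "b": "y", "outcome": "good"},
    {"a": 0, "b": "y", "outcome": "bad"},
    {"a": 1, "b": "x", "outcome": "good"},
    {"a": 1, "b": "x", "outcome": "good"},
    {"a": 1, "b": "y", "outcome": "bad"},
    {"a": 1, "b": "y", "outcome": "good"},
    {"a": 1, "b": "y", "outcome": "bad"},
]


def leaves(records, split_order):
    """Leaves of a fully grown tree that splits on the attributes in the given order."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[name] for name in split_order)
        groups[key].append(record["outcome"])
    return dict(groups)


def is_clear(outcomes):
    """A group is clear if it is large enough and all its outcomes agree."""
    return len(outcomes) >= MIN_SUPPORT and len(set(outcomes)) == 1


# Disjoint leaves of the tree: none is both large enough and uniform.
tree_leaves = leaves(records, ["a", "b"])
clear_leaves = [key for key, outcomes in tree_leaves.items() if is_clear(outcomes)]
print(clear_leaves)  # []

# The same objects, regrouped across branches by b == "x" alone.
reunited = [r["outcome"] for r in records if r["b"] == "x"]
print(len(reunited), is_clear(reunited))  # 4 True
```

In this toy case a second tree that splits on "b" first would recover the group, which is the multiple-tree remedy noted above; with many attributes, the number of split orders to try grows quickly.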
Therefore, in addition to the foregoing, what is further needed is a system and method for grouping data that does not depend on the order in which attributes are considered, does not lose precision due to pruning, and does not require that objects be partitioned into disjoint sets, so that better decisions can be made.