1. Field of the Invention
This invention relates generally to data mining. More particularly, it relates to a binary tree-structured classification and prediction algorithm and methodology for supervised learning on complex datasets, and is especially useful in large-scale studies.
2. Description of the Related Art
In many real-world situations, multiple factors combine to determine an outcome. These factors can interact in complicated and influential ways even when each one contributes little on its own. Most of the individual contributions are so small that their effects could easily be masked by noise if each factor were considered separately. One good example of such a situation is the association between a polygenic disease and its many genetic and environmental risk factors.
The affected status of an individual with such a polygenic disease is the result of interactions among multiple genes and environmental risk factors such as smoking, social stress, and diet. Asthma, hypertension (high blood pressure), type II diabetes, and most cancers are polygenic diseases. They affect many people, yet their pathologies remain largely unknown. This limited knowledge hinders researchers in creating effective screening tests or treatments.
One reason for such limited knowledge of pathologies is that, until now, most research on polygenic diseases has focused on identifying and evaluating the effects of individual mutations. Each mutation is studied separately, ignoring the fact that the combined influence of multiple mutations can play a much larger role than the marginal contribution of any single mutation. The net result is a lack of understanding of the complicated gene-gene and gene-environment interactions that underlie most polygenic diseases. There is a continuing need in the art for a new and robust methodology for complex supervised learning where traditional approaches, which focus on studying the effects of individual factors, have proven to be inadequate.
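The gap between marginal and joint effects described above can be illustrated with a small calculation. The following Python sketch is purely hypothetical and is not drawn from any study: it assumes a toy population with two binary risk factors, `a` and `b`, in which an individual is affected only when exactly one factor is present. The names `risk_difference`, `rows`, `a`, `b`, and `affected` are illustrative assumptions.

```python
from itertools import product


def risk_difference(rows, has_factor):
    """Return P(affected | factor present) - P(affected | factor absent)."""
    def rate(group):
        return sum(r["affected"] for r in group) / len(group)

    present = [r for r in rows if has_factor(r)]
    absent = [r for r in rows if not has_factor(r)]
    return rate(present) - rate(absent)


# Hypothetical toy population: each combination of the two factors occurs
# equally often, and the outcome is a pure interaction (exclusive-or).
rows = [
    {"a": a, "b": b, "affected": a ^ b}
    for a, b in product([0, 1], repeat=2)
]

# Studied one at a time, neither factor shows any association with outcome.
marginal_a = risk_difference(rows, lambda r: r["a"] == 1)
marginal_b = risk_difference(rows, lambda r: r["b"] == 1)

# Studied jointly, the combination of factors separates the groups completely.
joint = risk_difference(rows, lambda r: r["a"] != r["b"])

print(marginal_a, marginal_b, joint)
```

In this deliberately extreme case, an analysis that examines one factor at a time would find nothing, while an analysis of the two factors together would explain the outcome entirely. Real polygenic data would be noisier and would involve many more factors.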