The present invention relates to a method and apparatus for modeling the variables of a data set by means of a probabilistic network including data nodes and causal links.
Probabilistic networks are graphical models of cause and effect relationships between variables in a data set. Such networks are referred to in the literature as Bayesian networks, belief networks, causal networks and knowledge maps. In this specification, the term "probabilistic networks" will be used generically to refer to all such networks and maps. The graphical model in each case includes data nodes, to represent the variables, and causal links or arcs, to represent the dependencies between the data nodes. A given set of nodes and arcs defines a network structure.
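By way of illustration only, a network structure of the kind described above might be held in memory as a list of data nodes and a set of directed arcs. The class name `NetworkStructure`, its methods, and the three example variables are hypothetical and are not taken from this specification; the acyclicity check assumes the structure is intended to be a directed acyclic graph, as is usual for Bayesian networks.

```python
from dataclasses import dataclass, field


@dataclass
class NetworkStructure:
    """Hypothetical container: data nodes plus directed causal links (arcs)."""
    nodes: list[str]
    arcs: set[tuple[str, str]] = field(default_factory=set)  # (cause, effect)

    def parents(self, node: str) -> list[str]:
        # The nodes with an arc pointing into `node`.
        return sorted(src for src, dst in self.arcs if dst == node)

    def is_acyclic(self) -> bool:
        # Kahn's algorithm: repeatedly remove nodes that have no incoming arcs.
        indegree = {n: 0 for n in self.nodes}
        for _, dst in self.arcs:
            indegree[dst] += 1
        ready = [n for n, d in indegree.items() if d == 0]
        visited = 0
        while ready:
            current = ready.pop()
            visited += 1
            for src, dst in self.arcs:
                if src == current:
                    indegree[dst] -= 1
                    if indegree[dst] == 0:
                        ready.append(dst)
        return visited == len(self.nodes)


# Illustrative variables only; they do not come from the specification.
net = NetworkStructure(
    nodes=["rain", "sprinkler", "wet_grass"],
    arcs={("rain", "wet_grass"), ("sprinkler", "wet_grass")},
)
print(net.parents("wet_grass"), net.is_acyclic())
```
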
Once a network structure has been found that accurately models a set of data, the model summarizes knowledge about possible causal relationships between the variables in the data set. Such a model allows knowledge about relationships between variables in a large set of data to be reduced to a concise and comprehensible form, which is the primary goal of data mining.
One of the difficulties in modeling a set of data using a probabilistic network is finding the most likely network structure to fit a given input data set. This is because the search space of possible network structures increases exponentially with the number of data nodes in the network structure. An exhaustive evaluation of all the possible networks to measure how well they fit the input data set has been regarded as impractical, even when limited to modest-sized network structures.
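To give a sense of the scale involved, the following sketch counts the candidate structures for small networks. It assumes that every candidate structure is a directed acyclic graph over labeled nodes, a constraint that this specification does not itself state, and it uses Robinson's recurrence for that count. The function name `count_dags` is chosen for illustration.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labeled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k nodes that have no incoming arcs.
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


for n in range(1, 11):
    print(n, count_dags(n))
```

Under that assumption, three nodes admit 25 structures and five nodes already admit 29281, which is consistent with exhaustive evaluation being regarded as impractical.
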
The present invention has the aim of more efficiently generating a representation of a probabilistic network which models the variables of an input data set.
According to the present invention, there is now provided a method of modeling the variables in an input data set by means of a probabilistic network including data nodes and causal links, the method comprising the steps of:
registering the input data set,
generating a population of genomes each individually modeling the input data set by means of chromosome data to represent the data nodes in a probabilistic network and the causal links between the data nodes,
performing a crossover operation between the chromosome data of parent genomes in the population to generate offspring genomes,
performing an addition operation to add the offspring genomes to the said population,
performing a scoring operation on genomes in the said population to derive scores representing the correspondence between the genomes and the input data set,
performing a selecting operation to select genomes from the population according to the scores,
repeating the crossover, scoring, addition and selecting operations for a plurality of generations of the genomes,
and selecting, as an output model, a genome from the last generation.
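The steps above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the variables are assumed to be binary; the chromosome is assumed to be one bit per ordered node pair (i, j) with i < j, so that every genome is acyclic by construction; the score is a BIC-style penalized log-likelihood chosen for illustration; crossover is single-point; and selection simply keeps the highest-scoring genomes. None of these choices, nor the names `make_scorer`, `crossover` and `evolve`, is specified in the text above.

```python
import math
import random
from collections import Counter
from itertools import combinations


def make_scorer(rows):
    """Register the data set; return a scoring function and the chromosome length."""
    n_vars = len(rows[0])
    pairs = list(combinations(range(n_vars), 2))  # candidate arcs i -> j, i < j

    def score(genome):
        # Assumed BIC-style score: log-likelihood minus a complexity penalty.
        total = 0.0
        for child in range(n_vars):
            parents = [i for (i, j), bit in zip(pairs, genome) if bit and j == child]
            joint = Counter((tuple(r[p] for p in parents), r[child]) for r in rows)
            marginal = Counter(tuple(r[p] for p in parents) for r in rows)
            total += sum(c * math.log(c / marginal[cfg]) for (cfg, _), c in joint.items())
            # One free parameter per parent configuration (binary child assumed).
            total -= 0.5 * math.log(len(rows)) * 2 ** len(parents)
        return total

    return score, len(pairs)


def crossover(a, b, rng):
    """Single-point crossover between two parent chromosomes."""
    if len(a) < 2:
        return a, b
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def evolve(rows, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    score, length = make_scorer(rows)
    # Generate the initial population of genomes.
    population = [tuple(rng.randint(0, 1) for _ in range(length)) for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        for _ in range(pop_size // 2):
            parent_a, parent_b = rng.sample(population, 2)
            offspring.extend(crossover(parent_a, parent_b, rng))
        population = population + offspring        # addition
        population.sort(key=score, reverse=True)   # scoring
        population = population[:pop_size]         # selecting
    return max(population, key=score)              # output model


# Synthetic data (illustrative): variable 1 copies variable 0, variable 2 is independent.
data_rng = random.Random(1)
rows = []
for _ in range(200):
    a = data_rng.randint(0, 1)
    rows.append((a, a, data_rng.randint(0, 1)))

best = evolve(rows)
print(best)  # bits for arcs 0->1, 0->2, 1->2
```

Fixing the node order keeps crossover trivially valid, at the cost of excluding structures that need the opposite arc direction; a representation that permits arbitrary arcs would need an additional acyclicity repair step after crossover.
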
Further according to the present invention there is provided apparatus for modeling the variables in an input data set by means of a probabilistic network including data nodes and causal links, the apparatus comprising:
data register means to register the input data set,
generating means for generating a population of genomes each individually modeling the input data set by means of chromosome data to represent the data nodes in a probabilistic network and the causal links between the data nodes,
crossover means for performing a crossover operation between the chromosome data of parent genomes in the population to generate offspring genomes,
adding means to perform an addition operation to add the offspring genomes to the said population,
scoring means for performing a scoring operation on genomes in the said population to derive scores representing the correspondence between the genomes and the input data set,
selecting means for performing a selecting operation to select genomes from the population according to the scores,
control means to control the crossover, scoring, addition and selecting means to repeat their operations for a plurality of generations of the genomes,
and output means to select, as an output model, a genome from the last generation.