In many environments, a large amount of data can be or has been collected that records experience over time within the environment. For example, a healthcare environment may record clinical data, diagnoses and treatment regimens for a large number of patients, as well as outcomes. A business environment may record customer information such as who the customers are and what they do, and their browsing and purchasing histories. A computer security environment may record a large number of software code examples that have been found to be malicious. Despite the large quantities of such data, or perhaps because of them, deriving useful knowledge from such data stores can be a daunting task.
The process of extracting patterns from such data sets is known as data mining. Many techniques have been applied to the problem, but the present discussion concerns a class of techniques known as genetic algorithms. Genetic algorithms have been applied to all of the above-mentioned environments.
Evolutionary algorithms, a class of techniques that includes genetic algorithms, are good at traversing chaotic search spaces. According to Koza, J. R., “Genetic Programming: On the Programming of Computers by Means of Natural Selection,” MIT Press (1992), incorporated by reference herein, an evolutionary algorithm can be used to evolve complete programs in declarative notation. The basic elements of an evolutionary algorithm are an environment, a model for a genotype (referred to herein as an “individual”), a fitness function, and a procreation function. An environment may be a model of any problem statement. An individual may be defined by a set of rules governing its behavior within the environment. A rule may be a list of conditions followed by an action to be performed in the environment. A fitness function may be defined by the degree to which an evolving rule set successfully negotiates the environment. A fitness function is thus used for evaluating the fitness of each individual in the environment. A procreation function generates new individuals by mixing rules among the fittest parent individuals. In each generation, a new population of individuals is created.
At the start of the evolutionary process, individuals constituting the initial population are created randomly, by putting together the building blocks, or alphabets, that form an individual. In genetic programming, the alphabets are a set of conditions and actions making up rules governing the behavior of the individual within the environment. Once a population is established, it is evaluated using the fitness function. Individuals with the highest fitness are then used to create the next generation in a process called procreation. Through procreation, rules of parent individuals are mixed, and sometimes mutated (i.e., a random change is made in a rule) to create a new rule set. This new rule set is then assigned to a child individual that will be a member of the new generation. In some incarnations, known as elitist methods, the fittest members of the previous generation, called elitists, are also preserved into the next generation.
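By way of illustration only, the generational process described above may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular system: the toy environment, the rule format (a feature, a threshold and an action), and all function names and parameter values are hypothetical and chosen solely for brevity.

```python
import random

# Hypothetical "alphabet": each rule is (feature, threshold, action).
FEATURES = ["x0", "x1"]
ACTIONS = [0, 1]

def make_samples(rng, n):
    # Toy environment (an assumption): the label depends only on x0.
    samples = []
    for _ in range(n):
        x0, x1 = rng.random(), rng.random()
        samples.append({"x0": x0, "x1": x1, "label": 1 if x0 > 0.5 else 0})
    return samples

def random_rule(rng):
    return (rng.choice(FEATURES), rng.random(), rng.choice(ACTIONS))

def random_individual(rng, n_rules=3):
    # Initial individuals are assembled randomly from the alphabet.
    return [random_rule(rng) for _ in range(n_rules)]

def act(individual, sample, default=0):
    # The first rule whose condition holds determines the action.
    for feature, threshold, action in individual:
        if sample[feature] > threshold:
            return action
    return default

def fitness(individual, samples):
    # Fraction of samples on which the rule set takes the correct action.
    return sum(act(individual, s) == s["label"] for s in samples) / len(samples)

def procreate(parent_a, parent_b, rng, mutation_rate=0.1):
    # Mix the parents' rules position by position, then occasionally mutate
    # a rule by replacing it with a random one.
    child = [rng.choice(pair) for pair in zip(parent_a, parent_b)]
    return [random_rule(rng) if rng.random() < mutation_rate else rule
            for rule in child]

def evolve(samples, rng, pop_size=20, n_elite=4, generations=10):
    population = [random_individual(rng) for _ in range(pop_size)]
    history = []  # best fitness observed at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=lambda ind: fitness(ind, samples),
                        reverse=True)
        history.append(fitness(ranked[0], samples))
        elite = ranked[:n_elite]  # elitists are preserved unchanged
        children = [procreate(rng.choice(elite), rng.choice(elite), rng)
                    for _ in range(pop_size - n_elite)]
        population = elite + children
    best = max(population, key=lambda ind: fitness(ind, samples))
    return best, history
```

Because the elitists are carried over unchanged and, in this sketch, every individual is evaluated on the same fixed sample set, the best fitness recorded in `history` cannot decrease from one generation to the next.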
A common problem with evolutionary algorithms is that of premature convergence: after some number of evaluations the population converges to local optima and no further improvements are made no matter how much longer the algorithm is run. In one of a number of solutions to this problem, known as the Age-Layered Population Structure (ALPS), an individual's age is used to restrict competition and breeding between individuals in the population. In the parlance of ALPS, “age” is a measure of the number of times that an individual's genetic material has survived a generation (i.e., the number of times it has been preserved due to being selected into the elitist pool).
When using genetic algorithms to mine a large database, it may not be practical to test each individual against the entire database. The system therefore rarely if ever knows the true fitness of any individual. Rather, it knows only an estimate of the true fitness, based on the particular subset of data samples on which it has actually been tested. The fitness estimate itself, therefore, varies over time as the individual is tested on an increasing number of samples.
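The manner in which a fitness estimate drifts as testing proceeds may be illustrated with a short sketch. It assumes, purely for illustration, that the estimate is the running mean of per-sample scores; the class name `FitnessEstimate` and its attributes are hypothetical, and other ways of aggregating scores are equally possible.

```python
class FitnessEstimate:
    """Running estimate of an individual's fitness (assumed to be a mean)."""

    def __init__(self):
        self.samples_tested = 0   # testing experience accumulated so far
        self.total_score = 0.0

    def update(self, scores):
        # Fold in the scores from one more batch of data samples.
        for score in scores:
            self.total_score += score
            self.samples_tested += 1

    @property
    def value(self):
        # No estimate exists until the individual has been tested at least once.
        if self.samples_tested == 0:
            return None
        return self.total_score / self.samples_tested


estimate = FitnessEstimate()
estimate.update([1, 1, 0, 1])    # first batch: 3 of 4 samples handled correctly
first = estimate.value           # 0.75
estimate.update([0, 0, 0, 0])    # second batch happens to be unfavorable
second = estimate.value          # 0.375
```

The same individual is thus credited with noticeably different estimates depending on which samples it has been tested on so far, even though its true fitness has not changed.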
In a data mining environment with multiple solution landscapes, the evolutionary data mining system might generate some stepping stone individuals. Stepping stone individuals are individuals that do not necessarily have a high fitness estimate, but may contain one or more critical parts of a future optimal individual. Despite their potential value, there is always a risk that, before stepping stone individuals can be effectively utilized during procreation to create better individuals, they will be displaced by other individuals that lack those critical parts but have marginally better fitness estimates. Considering only the fitness estimates of individuals during evolution therefore cannot ensure a diverse set of patterns or the emergence of new patterns.
For example, in a healthcare embodiment, an individual diagnosing low blood pressure will have a lower fitness score than individuals diagnosing high blood pressure when tested on a subset of high blood pressure data samples. Therefore, if high blood pressure data samples are used for testing early in the testing process, there is a possibility that the competition module may prematurely discard the individual diagnosing low blood pressure from the candidate individual pool based on its low fitness score.
Novelty search has shown promise in evolutionary data mining systems by collecting diverse stepping stones. Novelty search ranks individuals based on how different each individual is from the other individuals. The novelty of an individual is estimated in the space of behaviors, i.e., vectors containing semantic information about how an individual achieves its performance when it is evaluated. However, novelty search can become increasingly unfocused in an environment where a large number of possible behaviors can exist.
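A novelty ranking of this kind may be sketched as follows. The sketch assumes, for illustration only, that behaviors are numeric vectors of equal length and that novelty is scored as the mean Euclidean distance to the k nearest other behaviors, which is one common formulation; the function names and the value of k are hypothetical.

```python
import math

def novelty(behavior, others, k=2):
    # Mean distance from this behavior vector to its k nearest neighbors.
    nearest = sorted(math.dist(behavior, other) for other in others)[:k]
    return sum(nearest) / len(nearest) if nearest else 0.0

def rank_by_novelty(behaviors, k=2):
    # behaviors maps an individual's identifier to its behavior vector.
    scores = {}
    for name, behavior in behaviors.items():
        others = [b for other, b in behaviors.items() if other != name]
        scores[name] = novelty(behavior, others, k)
    # Most novel individuals first.
    return sorted(scores, key=scores.get, reverse=True)


behaviors = {
    "a": [0.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 0.0],
    "d": [5.0, 5.0],   # behaves very differently from the cluster a, b, c
}
ranking = rank_by_novelty(behaviors)
```

In this toy population the outlying individual "d" ranks first and "a", which sits in the middle of the cluster, ranks last, without any fitness estimate entering the ranking. That independence from fitness is also why, when the behavior space is large, such a ranking offers no pull toward useful behaviors.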
Therefore, a behavior-driven search is desired that promotes individual diversity without the search becoming unfocused in a large, complex data mining environment. It is in this kind of environment that embodiments of the present invention reside.