The invention relates generally to data mining, and more particularly, to the use of genetic algorithms to extract useful rules or relationships from a data set for use in controlling systems.
In many environments, a large amount of data can be or has been collected which records experience over time within the environment. For example, a healthcare environment may record clinical data, diagnoses and treatment regimens for a large number of patients, as well as outcomes. A business environment may record customer information such as who they are and what they do, and their browsing and purchasing histories. A computer security environment may record a large number of software code examples that have been found to be malicious. A financial asset trading environment may record historical price trends and related statistics about numerous financial assets (e.g., securities, indices, currencies) over a long period of time. Despite the large quantities of such data, or perhaps because of it, deriving useful knowledge from such data stores can be a daunting task.
The process of extracting patterns from such data sets is known as data mining. Many techniques have been applied to the problem, but the present discussion concerns a class of techniques known as genetic algorithms. Genetic algorithms have been applied to all of the above-mentioned environments. With respect to stock categorization, for example, according to one theory, at any given time, 5% of stocks follow a trend. Genetic algorithms are thus sometimes used, with some success, to categorize a stock as following or not following a trend.
Evolutionary algorithms, which are supersets of Genetic Algorithms, are good at traversing chaotic search spaces. According to Koza, J. R., “Genetic Programming: On the Programming of Computers by Means of Natural Selection”, MIT Press (1992), incorporated by reference herein, an evolutionary algorithm can be used to evolve complete programs in declarative notation. The basic elements of an evolutionary algorithm are an environment, a model for a genotype (referred to herein as an “individual”), a fitness function, and a procreation function. An environment may be a model of any problem statement. An individual may be defined by a set of rules governing its behavior within the environment. A rule may be a list of conditions followed by an action to be performed in the environment. A fitness function may be defined by the degree to which an evolving rule set is successfully negotiating the environment. A fitness function is thus used for evaluating the fitness of each individual in the environment. A procreation function generates new individuals by mixing rules with the fittest of the parent individuals. In each generation, a new population of individuals is created.
At the start of the evolutionary process, individuals constituting the initial population are created randomly, by putting together the building blocks, or alphabets, that form an individual. In genetic programming, the alphabets are a set of conditions and actions making up rules governing the behavior of the individual within the environment. Once a population is established, it is evaluated using the fitness function. Individuals with the highest fitness are then used to create the next generation in a process called procreation. Through procreation, rules of parent individuals are mixed, and sometimes mutated (i.e., a random change is made in a rule) to create a new rule set. This new rule set is then assigned to a child individual that will be a member of the new generation. In some incarnations, known as elitist methods, the fittest members of the previous generation, called elitists, are also preserved into the next generation.
A common problem with evolutionary algorithms is that of premature convergence: after some number of evaluations the population converges to local optima and no further improvements are made no matter how much longer the algorithm is run. A number of solutions to the problem have been proposed. In one solution, convergence is slowed by increasing the mutation rate, mutation size or population size. Other solutions involve modifying the replacement strategy, modifying the fitness of individuals based on similarity to each other, and by spatially distributing individuals and restricting them to interact only with spatial neighbors. In yet another solution, known as the Age-Layered Population Structure (ALPS), an individual's age is used to restrict competition and breeding between individuals in the population. In the parlance of ALPS, “age” is a measure of the number of times that an individual's genetic material has survived a generation (i.e., the number of times it has been preserved due to being selected into the elitist pool). All of these techniques have benefits and detriments, and may or may not work well in a data mining environment.