The invention relates generally to improved genetic algorithms to extract useful rules or relationships from a data set for use in controlling systems.
In many environments, a large amount of data can be or has been collected which records experience over time within the environment. For example, a healthcare environment may record clinical data, diagnoses and treatment regimens for a large number of patients, as well as outcomes. A business environment may record customer information such as who they are and what they do, and their browsing and purchasing histories. A computer security environment may record a large number of software code examples that have been found to be malicious. A financial asset trading environment may record historical price trends and related statistics about numerous financial assets (e.g., securities, indices, currencies) over a long period of time. Despite the large quantities of such data, or perhaps because of it, deriving useful knowledge from such data stores can be a daunting task.
The process of extracting patterns from such data sets is known as data mining. Many techniques have been applied to the problem, but the present discussion concerns a class of techniques known as genetic algorithms. Genetic algorithms have been applied to all of the above-mentioned environments. With respect to stock categorization, for example, according to one theory, at any given time, 5% of stocks follow a trend. Genetic algorithms are thus sometimes used, with some success, to categorize a stock as following or not following a trend.
Evolutionary algorithms, which are supersets of Genetic Algorithms, are classifiers which are good at traversing chaotic search spaces. According to Koza, J. R., “Genetic Programming: On the Programming of Computers by Means of Natural Selection”, MIT Press (1992), incorporated by reference herein, an evolutionary algorithm can be used to evolve complete programs in declarative notation. The basic elements of an evolutionary algorithm are an environment, a model for a genotype (referred to herein as an “individual”), a fitness function, and a procreation function. An environment may be a model of any problem statement. An individual may be defined by a set of rules governing its behavior within the environment. A rule may be a list of conditions followed by an action to be performed in the environment.
A fitness function may be defined by the degree to which an evolving rule set is successfully negotiating the environment. A fitness function is thus used for evaluating the fitness of each individual in the environment.
A procreation function generates new individuals by mixing rules of the parent individuals. In each generation, a new population of individuals is created.
At the start of the evolutionary process, individuals constituting the initial population are created randomly, by putting together the building blocks, or alphabets, that form an individual. In genetic programming, the alphabets are a set of conditions and actions making up rules governing the behavior of the individual within the environment. Once a population is established, a set of individuals within the environment are selected to create the next generation in a process called procreation. Through procreation, rules of parent individuals are mixed, and sometimes mutated (i.e., a random change is made in a rule) to create a new rule set. This new rule set is then assigned to a child individual that will be a member of the new generation. In some incarnations, known as elitist methods, the fittest members of the previous generation, called elitists, are also preserved into the next generation.
A homogeneous population can cause a genetic algorithm to converge prematurely to local optima. Once such convergence occurs, the search typically terminates without having found optima which may be more global. Aspects of the present invention address this problem.