1. The Field of the Invention
The field of the invention is computer-implemented genetic algorithms. More specifically, the field is genetic algorithms useful for problem solving. The field spans the range of problems wherein a fit composition of functions may be found as a solution to the problem.
2. The Prior Art
The Natural Selection Process in Nature
The natural selection process provides a powerful tool for problem solving. This is shown by nature and its various examples of biological entities that survive and evolve in various environments. In nature, complex combinations of traits give particular biological populations the ability to adapt, survive, and reproduce in their environments. Equally impressive is the complex, relatively rapid, and robust adaptation, and the relatively good interim performance, that occur among a population of individuals in nature in response to changes in the environment. Nature's methods for adapting biological populations to their environment and to successive changes in that environment (including survival and reproduction of the fittest) provide a useful model. This model can be used to develop methods to solve a wide variety of complex problems that are generally thought to require "intelligence" to solve.
In nature, a gene is the basic functional unit by which hereditary information is passed from parents to offspring. Genes appear at particular places (called gene "loci") along molecules of deoxyribonucleic acid (DNA). DNA is a long thread-like biological molecule that has the ability to carry hereditary information and the ability to serve as a model for the production of replicas of itself. All known life forms on this planet (including bacteria, fungi, plants, animals, and humans) are based on the DNA molecule.
The so-called "genetic code" involving the DNA molecule consists of long strings (sequences) of 4 possible gene values that can appear at the various gene loci along the DNA molecule. For DNA, the 4 possible gene values refer to 4 "bases" named adenine, guanine, cytosine, and thymine (usually abbreviated as A, G, C, and T, respectively). Thus, the "genetic code" in DNA consists of long strings such as CTCGACGGT.
A chromosome consists of numerous gene loci with a specific gene value (called an "allele") at each gene locus. The chromosome set for a human being consists of 23 pairs of chromosomes. The chromosomes together provide the information and the instructions necessary to construct and to describe one individual human being and contain about 3,000,000,000 genes. These 3,000,000,000 genes constitute the so-called "genome" for one particular human being. Complete genomes of the approximately 5,000,000,000 living human beings together constitute the entire pool of genetic information for the human species. It is known that certain gene values occurring at certain places in certain chromosomes control certain traits of the individual, including traits such as eye color, susceptibility to particular diseases, etc.
When living cells reproduce, the genetic code in DNA is read. Sub-sequences consisting of 3 DNA bases are used to specify one of 20 amino acids. Large biological protein molecules are, in turn, made up of anywhere between 50 and several thousand such amino acids. Thus, this genetic code is used to specify and control the building of new living cells from amino acids.
The organisms consisting of the living cells created in this manner spend their lives attempting to deal with their environment. Some organisms do better than others in grappling with (or opposing) their environment. In particular, some organisms survive to the age of reproduction and therefore pass on their genetic make-up (chromosome string) to their offspring. In nature, the process of Darwinian natural selection causes organisms with traits that facilitate survival to the age of reproduction to pass on all or part of their genetic make-up to offspring. Over a period of time and many generations, the population as a whole evolves so that the chromosome strings in the individuals in the surviving population perpetuate traits that contribute to survival of the organism in its environment.
3. Gene Duplication and Deletion in Nature
In nature, chromosomes (molecules of DNA) are linear strings of nucleotide bases. During reproduction, chromosomes are frequently modified by naturally occurring genetic operations, such as mutation and crossover (sexual recombination). Mutation, for example, occasionally alters the linear string of nucleotide bases that are translated and manufactured into work-performing proteins in the living cell. A small mutational change in one nucleotide base in a gene may lead to the manufacture of a slight variant of the original protein. The variant of the protein may then affect the structure and behavior of the living thing in some advantageous or disadvantageous way. If the slight change is advantageous, natural selection will tend to perpetuate the change.
As Charles Darwin stated in On the Origin of Species by Means of Natural Selection (1859),
I think it would be a most extraordinary fact if no variation ever had occurred useful to each being's own welfare . . . . But if variations useful to any organic being do occur, assuredly individuals thus characterised will have the best chance of being preserved in the struggle for life; and from the strong principle of inheritance they will tend to produce offspring similarly characterised. This principle of preservation, I have called, for the sake of brevity, Natural Selection.
In addition to mutational changes, chromosomes are also modified by other naturally occurring genetic operations, such as gene duplication and gene deletion. A description of gene duplication and gene deletion appears in the seminal book Evolution by Gene Duplication (1970) by Susumu Ohno.
In gene duplication, there is a duplication of a portion of the linear string of nucleotide bases of the DNA that would otherwise be translated and manufactured into work-performing proteins in the living cell. When such a gene duplication occurs, there is no immediate change in the proteins that are translated and manufactured. The effect of a gene duplication is merely to create two identical ways of manufacturing the same protein. However, over a period of time, some other genetic operation, such as mutation, may change one or the other of the identical genes. Over short periods of time, the changes accumulating in the changing gene may be of no practical effect or value. In fact, the changing linear string of nucleotide bases of the DNA may not even produce a viable protein. As long as one of the two genes remains unchanged, the original protein manufactured from the unchanged gene continues to be manufactured.
Natural selection exerts considerable force in favor of maintaining a gene that manufactures a protein that is important for the successful performance and survival of the living thing. However, after a gene duplication has occurred, there is no disadvantage associated with the loss of the second way of manufacturing the original protein. Consequently, natural selection usually exerts little or no pressure to maintain a second way of manufacturing the same protein. The second gene may, over a period of time, accumulate additional changes and diverge more and more from the original gene. Eventually the now-changed gene may lead to the manufacture of a new and different and viable protein that affects the structure and behavior of the living thing in some advantageous or disadvantageous way. If and when a changed gene eventually leads to the manufacture of a viable and advantageous protein, natural selection again begins to work to preserve that new gene.
Ohno (1970) points out the highly conservative role of natural selection in the evolutionary process:
. . . the true character of natural selection . . . is not so much an advocator or mediator of heritable changes, but rather it is an extremely efficient policeman which conserves the vital base sequence of each gene contained in the genome. As long as one vital function is assigned to a single gene locus within the genome, natural selection effectively forbids the perpetuation of mutation affecting the active sites of a molecule. (Emphasis in original).
Ohno continues,
. . . while allelic changes at already existing gene loci suffice for racial differentiation within species as well as for adaptive radiation from an immediate ancestor, they cannot account for large changes in evolution, because large changes are made possible by the acquisition of new gene loci with previously non-existent functions. Only by the accumulation of forbidden mutations at the active sites can the gene locus change its basic character and become a new gene locus. An escape from the ruthless pressure of natural selection is provided by the mechanism of gene duplication. By duplication, a redundant copy of a locus is created. Natural selection often ignores such a redundant copy, and, while being ignored, it accumulates formerly forbidden mutations and is reborn as a new gene locus with a hitherto non-existent function. (Emphasis in original).
Ohno concludes, "[t]hus, gene duplication emerges as the major force of evolution."
Ohno's provocative thesis is supported by the discovery of pairs of proteins with similar sequences of DNA and similar sequences of amino acids, but different functions. Examples include trypsin and chymotrypsin; the protein of microtubules and actin of the skeletal muscle; myoglobin and the monomeric hemoglobin of hagfish and lamprey; myoglobin used for storing oxygen in muscle cells and the subunits of hemoglobin for transporting oxygen in red blood cells of vertebrates; and the light and heavy immunoglobulin chains.
Gene deletion also occurs in nature. In gene deletion, there is a deletion of a portion of the linear string of nucleotide bases that would otherwise be translated and manufactured into work-performing proteins in the living cell. When a gene deletion occurs, some particular protein that was formerly manufactured is no longer being manufactured. Consequently, there is often some change in the structure or behavior of the biological entity.
4. Gene Duplication and Gene Deletion and Evolutionary Algorithms
Genetic algorithms provide a method of improving a given set of objects. The processes of natural selection and survival of the fittest provide the theoretical base. Genetic algorithms in their conventional form can solve many problems.
The Prisoner's Dilemma is a well-researched problem in game theory (with numerous psychological, sociological, and geopolitical interpretations) in which two players can either cooperate or not cooperate. The players make their moves simultaneously and without communication. Each player then receives a payoff that depends on his move and the move of the other player.
The payoffs in the Prisoner's Dilemma game are arranged so that a non-cooperative choice by one player always yields that player a greater payoff than a cooperative choice (regardless of what the other player does). However, if both players are selfishly non-cooperative, they are both worse off than if they had both cooperated. The game is not a "zero sum game" because, among other things, both players are better off if they both cooperate. For more information on the Prisoner's Dilemma game, see Robert Axelrod (1987).
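By way of illustration only, the payoff structure described above can be checked with a small sketch. The numeric payoffs (5, 3, 1, 0) and the names `PAYOFF`, `payoff`, `non_cooperation_dominates`, `mutual_cooperation_beats_mutual_non_cooperation`, and `is_constant_sum` are assumptions introduced here; the text specifies only the ordering of the payoffs, not their values.

```python
# Illustrative payoff values (assumed; the text gives no numbers).
# Moves: 1 = cooperate, 0 = do not cooperate.
# PAYOFF[(my_move, other_move)] is the payoff received by "me".
PAYOFF = {
    (1, 1): 3,  # both players cooperate
    (1, 0): 0,  # I cooperate, the other player does not
    (0, 1): 5,  # I do not cooperate, the other player does
    (0, 0): 1,  # neither player cooperates
}


def payoff(my_move, other_move):
    """Payoff to the player making my_move against other_move."""
    return PAYOFF[(my_move, other_move)]


def non_cooperation_dominates():
    """True if not cooperating pays more regardless of the other player's move."""
    return all(payoff(0, other) > payoff(1, other) for other in (0, 1))


def mutual_cooperation_beats_mutual_non_cooperation():
    """True if both players do better cooperating than both not cooperating."""
    return payoff(1, 1) > payoff(0, 0)


def is_constant_sum():
    """True only if the two players' payoffs sum to the same total in every outcome."""
    totals = {payoff(a, b) + payoff(b, a) for a in (0, 1) for b in (0, 1)}
    return len(totals) == 1
```

With these assumed values, non-cooperation dominates for each player individually, mutual cooperation pays each player more than mutual non-cooperation, and the combined payoff differs between outcomes, so the game is not zero-sum.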
Lindgren (1991) analyzed the Prisoner's Dilemma game using an evolutionary algorithm that employed an operation referred to as gene duplication. In this work, strategies for playing the game were expressed as fixed-length binary character strings of length 2, 4, 8, 16, or 32. Strings of length 2 were used to represent game-playing strategies that considered only one previous action by the opponent. Specifically, the string 01 instructed the player to make a non-cooperative move (indicated by a 0) if the opponent had made an uncooperative move on his previous move and to make a cooperative move (indicated by 1) if the opponent had made a cooperative move on his previous move. This particular strategy is called "tit-for-tat" in this game since the player mimics his opponent's previous move. The string 10 is called "anti-tit-for-tat" because it instructs the player to do the opposite of what the opponent did on the previous move.
The string 11 is the strategy that instructs a player to be cooperative regardless of what the opponent did on his previous move. The string 00 calls for unconditional non-cooperation.
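The four length-2 strategies can be read as lookup tables indexed by the opponent's previous move. The following is a minimal sketch of that reading; the function name `next_move` and the dictionary `STRATEGY_NAMES` are hypothetical, and the convention that the leftmost character is the response to non-cooperation is inferred from the description of the string 01 above.

```python
def next_move(strategy, opponent_previous_move):
    """Return the move (0 or 1) that a length-2 strategy string prescribes.

    The opponent's previous move indexes the string: position 0 holds the
    response to non-cooperation (0), position 1 the response to cooperation (1).
    """
    return int(strategy[opponent_previous_move])


# Descriptive names for the 4 possible strings of length 2.
STRATEGY_NAMES = {
    "01": "tit-for-tat",
    "10": "anti-tit-for-tat",
    "11": "unconditional cooperation",
    "00": "unconditional non-cooperation",
}
```

For example, `next_move("01", 0)` returns 0 and `next_move("01", 1)` returns 1, so the player mimics the opponent's previous move.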
The 16 strategies represented by strings of length 4 took both the opponent's previous move and the player's own previous move into account. Similarly, strings of length 8, 16, and 32 took additional previous moves of the opponent and/or the player himself into account.
Lindgren used an evolutionary algorithm to evolve a population of game-playing strategies. Lindgren started with a population of 1,000 consisting of 250 copies of each of the 4 possible strings of length 2. At each generational step of the process, strings were copied (reproduced) in proportion to the score they achieved when playing the Prisoner's Dilemma game interactively against all other strings in the population. A mutation operation that randomly altered a single bit in a string was occasionally applied to a string in the population. In addition, an operation called "gene duplication" was occasionally used to double the length of a string. In this operation, the existing string was copied, so, for example, the string 01 would become 0101. An operation called "split mutation" cut the length of a string in half by deleting the first or second half of the string. For example, when this operation was applied to the string 1100, the result was either the string 11 or 00 (with equal probability). Lindgren's "split mutation" operation applied to linear strings bears a resemblance to gene deletion applied to chromosome strings in nature.
Lindgren's evolutionary algorithm did not contain the crossover (recombination) operation usually found in the genetic algorithm.
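The three string operations and the initial population described above can be sketched as follows. This is an illustrative reading of the description only, not Lindgren's actual implementation; the function names are hypothetical, and the sketch does not enforce the length bounds of 2 to 32 or model the rates at which the operations are applied.

```python
import random


def point_mutation(s, rng=random):
    """Flip one randomly chosen bit of the binary string s."""
    i = rng.randrange(len(s))
    flipped = "1" if s[i] == "0" else "0"
    return s[:i] + flipped + s[i + 1:]


def gene_duplication(s):
    """Double the length of s by appending a copy of the string to itself."""
    return s + s


def split_mutation(s, rng=random):
    """Halve the length of s by deleting the first or the second half,
    each with equal probability."""
    half = len(s) // 2
    return s[:half] if rng.random() < 0.5 else s[half:]


def initial_population():
    """250 copies of each of the 4 possible strings of length 2 (1,000 in all)."""
    return [s for s in ("00", "01", "10", "11") for _ in range(250)]
```

For example, `gene_duplication("01")` yields `"0101"`, and `split_mutation("1100")` yields either `"11"` or `"00"`.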
5. Background on Genetic Programming
Genetic programming is capable of evolving computer programs that solve, or approximately solve, a variety of problems from a variety of fields. Genetic programming starts with a primordial ooze of randomly generated computer programs composed of available programmatic ingredients and then genetically breeds the population of programs using the Darwinian principle of survival of the fittest and an analog of the naturally occurring genetic operation of crossover (sexual recombination).
Genetic programming (also called the "non-linear" or "hierarchical" genetic algorithm) is described in the book Genetic Programming: On the Programming of Computers by Means of Natural Selection (Koza 1992) and in U.S. Pat. Nos. 4,935,877 and 5,136,686.
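As a rough illustration of the ideas in this section, the following sketch represents programs as nested tuples, generates random programs from a small set of ingredients, and recombines two parents by exchanging subtrees. The function set (`add`, `sub`, `mul`), the terminal set (`x`, 1.0, 2.0), the growth probability, and all function names are assumptions chosen for brevity; this is not the method of the cited book or patents, and fitness-based selection is omitted.

```python
import operator
import random

# Assumed programmatic ingredients for this sketch.
FUNCTIONS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}
TERMINALS = ["x", 1.0, 2.0]


def random_program(depth, rng):
    """Grow a random program tree; every leaf is a terminal."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    name = rng.choice(sorted(FUNCTIONS))
    return (name, random_program(depth - 1, rng), random_program(depth - 1, rng))


def evaluate(program, x):
    """Evaluate a program tree for a given value of the variable x."""
    if isinstance(program, tuple):
        name, left, right = program
        return FUNCTIONS[name](evaluate(left, x), evaluate(right, x))
    return x if program == "x" else program  # variable or numeric constant


def points(program, path=()):
    """Yield the path (tuple of child indices) to every node in the tree."""
    yield path
    if isinstance(program, tuple):
        for i in (1, 2):
            yield from points(program[i], path + (i,))


def subtree_at(program, path):
    """Return the subtree rooted at the given path."""
    for i in path:
        program = program[i]
    return program


def replace_at(program, path, new_subtree):
    """Return a copy of program with the subtree at path replaced."""
    if not path:
        return new_subtree
    children = list(program)
    children[path[0]] = replace_at(program[path[0]], path[1:], new_subtree)
    return tuple(children)


def crossover(parent_a, parent_b, rng):
    """Replace a random subtree of parent_a with a random subtree of parent_b."""
    point_a = rng.choice(list(points(parent_a)))
    point_b = rng.choice(list(points(parent_b)))
    return replace_at(parent_a, point_a, subtree_at(parent_b, point_b))
```

Because any subtree is itself a valid program in this representation, the offspring produced by `crossover` can always be evaluated.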
6. Problem of Determining the Architecture of the Overall Program
Before applying genetic programming to a problem, the user must determine the terminals and functions of the genetically evolving programs, a fitness measure by which to compare performance of multiple programs being evolved, parameters and variables for controlling each run (e.g., number of generations to be run), and a result designation method and termination criterion to terminate the process.
Before applying genetic programming to a problem where a multi-part program is to be evolved, it is also necessary to specify the architecture of the program. The architecture of a program consists of the number of result-producing branches (which is usually just one), the number of function-defining branches and the number of arguments possessed by each function-defining branch, and the number and structural characteristics of other specialized branches (if any) that belong to a program. Many programs consist of just one result-producing branch and no other branches. The choice of architecture for an overall program may facilitate or frustrate evolution of the solution to the problem. For example, a 6-dimensional problem may have a natural decomposition into 3-dimensional subproblems. If 3-dimensional subprograms are readily available during the evolutionary process, the problem may be relatively easy to solve by means of the evolutionary process; however, if they are not available, the problem may be difficult or impossible to solve. Thus, the question arises as to how to determine the architecture of the programs that participate in the evolutionary process. The present invention provides a means for determining a suitable architecture dynamically during the run of the evolutionary process by means of new architecture-altering operations.
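The notion of an architecture described above can be pictured as a small record. The following is a minimal sketch under stated assumptions: the class names `FunctionDefiningBranch` and `Architecture`, the branch name `f0`, and the `signature` method are hypothetical, and other specialized branches are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionDefiningBranch:
    name: str            # hypothetical label for the branch
    num_arguments: int   # number of arguments the branch possesses


@dataclass
class Architecture:
    num_result_producing_branches: int = 1
    function_defining_branches: list = field(default_factory=list)

    def signature(self):
        """Summarize as (result branches, argument count of each function branch)."""
        return (
            self.num_result_producing_branches,
            tuple(b.num_arguments for b in self.function_defining_branches),
        )


# A program with one result-producing branch and no other branches.
simple = Architecture()

# A program that also offers one 3-argument function-defining branch, as might
# suit a 6-dimensional problem that decomposes into 3-dimensional subproblems.
decomposed = Architecture(
    function_defining_branches=[FunctionDefiningBranch("f0", 3)]
)
```

Here `simple.signature()` is `(1, ())` and `decomposed.signature()` is `(1, (3,))`; two programs with different signatures have different architectures in the sense used above.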
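The existing methods for making these architectural choices include the methods of prospective analysis of the nature of the problem, seemingly sufficient capacity, affordable capacity, and retrospective analysis of the results of actual runs.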
Sometimes these architectural choices flow so directly from the nature of the problem that they are virtually mandated. However, in general, there is no way of knowing a priori the architecture of the program corresponding to the solution to the problem.
6.1. Method of Prospective Analysis
Some problems have a known decomposition involving sub-problems of known dimensionality. For example, some problems involve finding a computer program (e.g., mathematical expression, composition of primitive functions and terminals) that produces the observed value of a dependent variable as its output when given the values of a certain number of independent variables as input. Problems of this type are called problems of symbolic regression, system identification, or simply "black box" problems. In many instances, it may be known that a certain number of the independent variables represent a certain subsystem. In that event, the problem may be decomposable into subproblems based on the known lower dimensionality of the known subsystem.
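A symbolic regression problem of the kind described above can be sketched as scoring a candidate program against observed input/output pairs. The error measure used here (sum of absolute differences), the function name `error`, the two candidate programs, and the data points are all assumptions made for illustration; the data were generated from the hypothetical target x0*x1 + x2.

```python
def error(program, fitness_cases):
    """Sum of absolute differences between the program's output and the
    observed value of the dependent variable over all fitness cases."""
    return sum(abs(program(*inputs) - observed) for inputs, observed in fitness_cases)


# Hypothetical observations: (independent variables, dependent variable),
# generated from the assumed target y = x0 * x1 + x2.
cases = [
    ((1.0, 2.0, 3.0), 5.0),
    ((2.0, 2.0, 0.0), 4.0),
    ((0.0, 5.0, 1.0), 1.0),
]


def candidate_a(x0, x1, x2):
    return x0 * x1 + x2


def candidate_b(x0, x1, x2):
    return x0 + x1 + x2
```

A candidate that reproduces every observed value has an error of zero; under this measure, lower error corresponds to a fitter program.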
6.2. Method of Providing Seemingly Sufficient Capacity (Over-Specification)
For many problems, the architectural choices can be made on the basis of providing seemingly sufficient capacity by over-specifying the number of functions and terminals. Over-specification often works to provide the eventual architecture, at the expense of processing time and waste of resources.
6.3. Method of Using Affordable Capacity
Resources are required by each part of a program. The practical reality is that the amount of resources that one can afford to devote to a particular problem will strongly influence or dictate the architectural choice. Often the architectural choices are made on the basis of hoping that the resources that one could afford to devote to the problem will prove to be sufficient to solve the problem.
6.4. Method of Retrospective Analysis
A retrospective analysis of the results of sets of actual runs made with various architectural choices can determine the optimal architectural choice for a given problem. That is, in retrospective analysis, a number of runs of the problem are made with different combinations of the number of functions and terminals; the effort required to solve the problem with each such architecture is then computed retrospectively, and the optimal architecture is identified. If one is dealing with a number of related problems, a retrospective analysis of one problem may provide guidance for making the required architectural choice for a similar problem.
What is needed is a process that allows architectures to be created during the genetic process. In other words, it is desirable to create new architectures while the genetic process is underway, so that the population of programs being evolved need not initially contain a program having the architecture of the program eventually designated as the solution to the problem, yet is capable of producing such an architecture during the run.