1. Field of the Invention
The present invention is generally related to systems and methods implementing computationally and data set intensive genetic algorithms and, in particular, to a computationally efficient genetic algorithm capable of processing substantially sized populations.
2. Description of the Related Art
Genetic algorithms (GAs) are increasingly, if not already widely, used to solve a variety of computational problems of a scale that is not otherwise readily solvable, at least as a practical matter. Such problems typically occur in the field of multi-variate analysis as applied to, for example, discovering complex drug interactions in massed clinical trial data and trend-spotting in broad-based, high-volume economic data. Alternate known methods, such as stochastic and bivariate analysis methods, tend to identify localized, rather than optimal, solutions. In many cases, the data sets are so large and the cross-correlations between variate fields so uncertain that it is impractical to consider application of any conventional methodology other than those based on genetic algorithms.
There are, however, a number of known limitations in current implementations of conventional genetic algorithms. These limitations are particularly significant in that they directly constrain the number of variate data fields that can be considered simultaneously, the size of the data population that can be processed, and the overall throughput of the computer systems implementing the genetic algorithms.
A known limitation of conventional genetic algorithms is frequently described as convergence or selection pressure stall. Where the population is large in relation to the variation of parameters of interest, conventional genetic algorithms will encounter difficulties in reliably distinguishing variations of significance. The genetic algorithm will tend to focus unduly on insignificant distinctions in the population data set and fail to make meaningful progress towards identifying a population-wide optimal solution. In effect, the genetic algorithm will prematurely identify and hold to a nearly arbitrary local maximum as a final problem solution. Although stalling can occur with any population size whenever the data set features of interest are nearly homogeneous, the stalling phenomenon is most significantly encountered whenever a conventional genetic algorithm is applied to an overly large population data set. Real world applications unfortunately tend to require analysis of extremely large populations and correspondingly large population data sets. Subdivision of the population for purposes of GA analysis results in the loss of significant information in the form of unanalyzed cross-correlations between the subpopulations. Therefore, conventional GA implementations will require many independent GA runs over arbitrarily cross-cut subpopulations and a statistical analysis of the resulting family of potentially optimal solutions. This approach is very time consuming and does not preclude the loss of epistasis or other cross-correlation dependent information among the subpopulations.
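The loss of selection pressure described above can be illustrated with a minimal sketch. It assumes fitness-proportionate (roulette-wheel) selection, which is only one common selection scheme and is not specified in this description; the function names and the fitness values are hypothetical and chosen purely for illustration.

```python
def selection_probabilities(fitness_values):
    """Fitness-proportionate probability of each individual being chosen as a parent."""
    total = sum(fitness_values)
    return [f / total for f in fitness_values]


def selection_advantage(fitness_values):
    """Ratio of the best individual's selection probability to a uniform 1/N draw.

    A value near 1.0 means selection is barely better than picking at random.
    """
    probs = selection_probabilities(fitness_values)
    return max(probs) * len(fitness_values)


# Hypothetical fitness values: one clearly differentiated population and one
# nearly homogeneous population (assumed data, not taken from the source).
distinct = [1.0, 2.0, 3.0, 4.0]
homogeneous = [100.0, 100.1, 100.2, 100.3]

print(selection_advantage(distinct))
print(selection_advantage(homogeneous))
```

Under these assumptions the best individual in the differentiated population is favored by a factor of about 1.6 over a random draw, while in the nearly homogeneous population the factor is about 1.0015, so selection is close to arbitrary.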
Another limitation of conventional genetic algorithms is a fundamental difficulty in scaling computer implementations to concurrently process larger population data sets or to increase the throughput processing of a given population data set. As a practical matter, genetic algorithms progressively carry forward knowledge about potential optimal solutions to a problem in the evolving composition of the population data being processed. While this is an effective mechanism for storing the knowledge in an efficiently processable manner, there is little ability to share the knowledge in a manner that does not fundamentally disrupt the operation of the GA or lose significant information.
A conventional approach to performance scaling relies on a shared population data space, in effect a shared memory representation of the current population data set, accessible by multiple GA processors. The data and computationally intensive nature of GAs, however, typically results in significant contention for memory access. The intended benefits of parallelization are substantially lost. Alternately, full parallel processing architectures are used, though with the necessity of subdividing the population data set. As before, population subdivision inherently results in the undesirable loss of cross-correlation information.
Relatively recent developments in GA theory, specifically the advancement of competent genetic algorithms, have produced substantial performance improvements by evolving the implementation of qualified linkage learning. A linkage learning GA attempts to concurrently perform genetic pattern search and allele or attribute evaluation. Competency imposes a necessary constraint that pattern search complete first. One approach to delaying final attribute selection involves a complex, cyclic chromosomal system used to implement a probabilistic expression and preservation of attributes that would otherwise be eliminated under normal competition. Preserved attributes are expressed in probabilistically determined locations, resulting in reordered chromosomal patterns. The reordering function thus permits linkages between fields to be effectively searched with the most fit linkages being retained through competition.
GA systems modeled on cyclic chromosomes coupled with probabilistic expression operators represent, at best, artificial genetic systems. While such artificial systems have been experimentally validated against known population sets, including population sets seeded with known problematic data patterns, the algorithms largely exist without a guiding biological model. Current GA theory may not yet be adequate to permit reliance on such artificial algorithms, or at least to determine the degrees of uncertainty, when analyzing real population data sets for practical ends.
Still another known limitation of conventional genetic algorithms is the deficient recapture of knowledge through use of the mutation operator. The fundamental operation of the selection and cross-over GA operators serves to drive innovation, or knowledge discovery. Even using a probabilistic expression or equivalent operator that tends to preserve attribute value knowledge, knowledge potentially significant to a final optimal solution can be prematurely lost from the current population data set through the progression of competition. Excessive knowledge loss, typically arising from use of an overly aggressive cross-over rate, leads to GA instability.
To maintain stability, standard GA mutation operators are used to progressively prompt the rediscovery of potentially lost prior knowledge. The mutation rate must be sufficient to assure that any prematurely lost fields and values are reintroduced into a current population data set to permit inclusion, as appropriate, into the eventual optimal solution data set. Single point mutations, as represented by a single instance of a field, are rather inefficient at reintroducing lost knowledge. The likelihood that a single point mutation will survive and propagate sufficiently to affect the eventual optimal solution is rather low. Mutation rates must therefore be sufficient not only to reintroduce single instances of fields, but also to reintroduce enough instances to present a sufficient variety of values that may be determined significant in the determination of fitness and thus participate in the final optimal data set.
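As a rough illustration of why a single point mutation is a weak vehicle for knowledge recapture, the following sketch shows a single-point mutation operator together with a simple survival estimate. The estimate assumes that the next generation is formed by drawing parents uniformly at random, with replacement, from a population of N individuals, so that the lone mutant has no fitness advantage; this neutral-selection model, the function names, and the chromosome encoding are all illustrative assumptions rather than part of this description.

```python
import random


def single_point_mutation(chromosome, alphabet, rng):
    """Return a copy of `chromosome` with exactly one field set to a different value."""
    pos = rng.randrange(len(chromosome))
    # Exclude the current value so the mutation always changes the field.
    choices = [v for v in alphabet if v != chromosome[pos]]
    mutated = list(chromosome)
    mutated[pos] = rng.choice(choices)
    return mutated


def prob_never_selected(pop_size):
    """Chance that one given individual is never drawn in `pop_size` uniform draws.

    Assumes neutral selection with replacement; approaches 1/e as pop_size grows.
    """
    return (1.0 - 1.0 / pop_size) ** pop_size


rng = random.Random(0)
parent = [0, 1, 0, 1, 1, 0]
child = single_point_mutation(parent, alphabet=[0, 1], rng=rng)
print(parent, child)
print(prob_never_selected(1000))
```

Under this simplified model, a lone mutant in a population of 1000 has roughly a 37% chance of leaving no copies at all in the very next generation, before any disruption by cross-over is considered.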
Conventionally, the GA cross-over rate, set high enough to achieve the desired innovation, must be suitably balanced by the mutation rate to maintain stability. Conversely, the mutation rate, desirably set higher to assure an adequate recapture of lost knowledge before closure, cannot be set too high due to the generally randomizing effect of mutation on convergence. Thus, conventional cross-over and mutation rates are limited, thereby limiting the rate of convergence on a reliably obtained optimal solution as a practical matter, in significant part due to the limited knowledge recapture possible through single-point mutations.
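The interplay between the cross-over rate and the mutation rate can be seen in a minimal sketch of one generation of a conventional GA. The sketch assumes binary chromosomes, binary tournament selection, one-point cross-over, and per-field bit-flip mutation; these operator choices, the function names, and the rate values are hypothetical and stand in for whatever conventional operators a given implementation uses.

```python
import random


def one_point_crossover(a, b, rng):
    """Swap the tails of two equal-length chromosomes at a random cut point."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(chromosome, mutation_rate, rng):
    """Flip each binary field independently with probability `mutation_rate`."""
    return [1 - g if rng.random() < mutation_rate else g for g in chromosome]


def tournament(population, fitness, rng, k=2):
    """Pick the fittest of `k` randomly sampled individuals."""
    return max(rng.sample(population, k), key=fitness)


def next_generation(population, fitness, crossover_rate, mutation_rate, rng):
    """Produce one new generation of the same size as `population`."""
    children = []
    while len(children) < len(population):
        p1 = tournament(population, fitness, rng)
        p2 = tournament(population, fitness, rng)
        if rng.random() < crossover_rate:
            c1, c2 = one_point_crossover(p1, p2, rng)  # innovation
        else:
            c1, c2 = list(p1), list(p2)  # parents copied unchanged
        children.append(mutate(c1, mutation_rate, rng))  # knowledge recapture
        children.append(mutate(c2, mutation_rate, rng))
    return children[:len(population)]


rng = random.Random(1)
population = [[rng.randint(0, 1) for _ in range(8)] for _ in range(10)]
# Assumed rates for illustration only; fitness here is simply the count of 1s.
population = next_generation(population, sum, crossover_rate=0.7,
                             mutation_rate=0.01, rng=rng)
print(max(sum(c) for c in population))
```

In this sketch, raising `crossover_rate` increases how often parent patterns are recombined, while raising `mutation_rate` increases how often field values are randomly altered, which is the source of the randomizing effect on convergence noted above.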
Consequently, there is a clear need for an improved GA system capable of handling large, high-order multi-variate populations, achieving high throughput, facilitating parallelization, and ensuring the effective retention and recapture of relevant knowledge throughout the GA processing cycles.