Algorithms for fitting experimental data to linear equations or to other predetermined functions of one or more variables are widely used in applied science and engineering. In fitting data to a predetermined function, the parameters (e.g., coefficients) of the function, which are a priori unknown, are determined. These parameters may represent theoretical constants (e.g., the mass of an electron) or merely empirical values that characterize a phenomenon. In such situations, the appropriate function to fit to the data is selected by a person based on technical knowledge or preexisting evidence. For example, certain types of data may be known by experts in the relevant field to be described by certain mathematical functions. The discovery of which mathematical functions describe which types of data comes through the painstaking progress of science and engineering.
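By way of illustration only, fitting data to a predetermined linear function y = a*x + b may be sketched as an ordinary least-squares calculation. The function name fit_line and the sample measurements below are hypothetical and are chosen solely for this example; the form of the function (a straight line) is fixed in advance, and only the parameters a and b are determined from the data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx              # slope
    b = mean_y - a * mean_x    # intercept
    return a, b


# Hypothetical measurements that roughly follow y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]
slope, intercept = fit_line(xs, ys)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

The choice of a straight line here stands in for the human decision described above; the calculation itself discovers nothing about the form of the function.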
Similarly, in the field of statistics, statistical data may be fit to an appropriate distribution function, such as the Gaussian distribution or the binomial distribution, in order to determine a mean and variance of measured data. The selection of an appropriate distribution function to fit to any given set of data is based on consideration of whether the type of random variation associated with each type of distribution corresponds to the random variations that characterize the collected data. In other words, selection is ordinarily the work of a person skilled in statistics.
Certain statistical packages attempt to assist the statistician by automatically trying to fit a set of data to a predetermined set of distribution functions, and selecting the distribution function which best fits the data.
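The kind of automatic selection described above may be sketched as follows. This is a minimal illustration and does not reproduce any particular statistical package: the candidate set (Gaussian and exponential), the use of maximum-likelihood fitting, the use of the Akaike information criterion (AIC = 2k - 2 ln L) as the measure of fit, and all names and sample data are assumptions made for the example.

```python
import math


def gaussian_fit(data):
    """Return (log-likelihood, parameter count) for a maximum-likelihood Gaussian."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    if var <= 0.0:
        return float("-inf"), 2  # degenerate data: treat as a failed fit
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0), 2


def exponential_fit(data):
    """Return (log-likelihood, parameter count) for a maximum-likelihood exponential."""
    n = len(data)
    mean = sum(data) / n
    if min(data) < 0.0 or mean <= 0.0:
        return float("-inf"), 1  # exponential requires non-negative data
    rate = 1.0 / mean
    return n * math.log(rate) - rate * sum(data), 1


# The predetermined set of candidate distributions (an assumption of this sketch).
CANDIDATES = {"gaussian": gaussian_fit, "exponential": exponential_fit}


def select_distribution(data):
    """Fit every candidate and return the name with the lowest AIC, plus all scores."""
    scores = {}
    for name, fit in CANDIDATES.items():
        loglik, k = fit(data)
        scores[name] = 2 * k - 2 * loglik
    return min(scores, key=scores.get), scores


# Hypothetical, strongly right-skewed measurements.
data = [0.1, 0.2, 0.3, 0.5, 0.7, 1.0, 1.4, 2.1, 3.5, 6.0]
best, scores = select_distribution(data)
print(best, scores)
```

Note that the selection is still confined to the predetermined candidate set; no new distribution function is discovered.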
In the cases mentioned above, the functions to which data are fit are predetermined, and it remains the task of the scientist or engineer to discover, through conjecture or ab initio derivation, entirely new functions that may apply to new types of data. In other words, the work of discovering mathematical functions is left to human intellect.
The field of artificial intelligence includes the sub-field of genetic algorithms. In the field of genetic algorithms, an attempt is made to mimic the role of genetics in evolutionary biology, in computing the solution of engineering or other problems. In genetic algorithms a population of postulated solutions is ‘evolved’ in a way that mimics Darwinian theories of evolution.
The field of genetic algorithms includes an area of study known as genetic programming. In genetic programming the population being evolved includes individuals that are themselves programs. In genetic programming the fitness of each individual program is judged based on its ability to solve a certain problem when it is executed.
Genetic programming has been used to perform what is known as ‘symbolic regression’. In symbolic regression, an effort is made to supplant human intellect by using genetic programming to discover a mathematical expression that best describes a data set. The individual programs that are evolved in symbolic regression based on genetic programming represent mathematical equations that give the value of a dependent variable based on the input values of one or more independent variables.
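For illustration, one way an individual in symbolic regression might be represented and scored is sketched below. The nested-tuple representation, the restriction to a single independent variable named x, the use of mean squared error as the fitness measure, and all names and sample values are assumptions of this sketch rather than features of any particular prior art system.

```python
import operator

# Binary operators available to candidate expressions (an assumed, minimal set).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(expr, env):
    """Evaluate a nested-tuple expression such as ("+", ("*", "x", "x"), 1)."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return OPS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(expr, str):
        return env[expr]  # independent variable looked up by name
    return expr           # numeric constant


def mean_squared_error(expr, samples):
    """Fitness measure: lower values mean the expression describes the data better."""
    return sum((evaluate(expr, {"x": x}) - y) ** 2 for x, y in samples) / len(samples)


# Hypothetical data generated by y = x*x + 1.
samples = [(0, 1), (1, 2), (2, 5), (3, 10)]

candidate_a = ("+", ("*", "x", "x"), 1)  # x*x + 1
candidate_b = ("+", "x", 1)              # x + 1

error_a = mean_squared_error(candidate_a, samples)
error_b = mean_squared_error(candidate_b, samples)
print(error_a, error_b)
```

An evolutionary search would generate and vary many such candidates, retaining those with lower error; only the scoring step is shown here.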
Predominant prior art genetic programming algorithms were implemented in the LISP programming language, which was judged by the implementers to be especially suited to the task. In such algorithms, the S-expression construct of the LISP programming language was used to represent mathematical expressions. These S-expressions, which played the role of members of a population being evolved, were directly manipulated in the course of performing the evolution. A drawback of such prior art approaches is that the size of the mathematical expressions in the population was not limited, which led to so-called ‘expression bloating’, in which the mathematical expressions in the population become unduly large. Another drawback of such prior art approaches is that such bloated expressions tend to overfit the data that the genetic programming algorithm is using to check the correctness of mathematical expressions. By overfit, it is meant that the expression conforms very closely to the data, including measurement errors in the data, and does not conform to additional data from the same source that is later used to test the correctness of the expression. A further drawback is that such S-expression constructs are not available in modern programming languages, such as Java or C++, that are currently preferred for use in scientific and engineering programming.
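Expression bloating can be made concrete with a small sketch. Here nested tuples stand in for LISP S-expressions (an assumption of this example, as are the function name node_count and the two sample expressions), and the size of an expression is measured as its number of nodes.

```python
def node_count(expr):
    """Count operators and operands in a nested-tuple expression."""
    if isinstance(expr, tuple):
        # expr[0] is the operator; the remaining items are its arguments.
        return 1 + sum(node_count(arg) for arg in expr[1:])
    return 1


# x + 1 written compactly.
compact = ("+", "x", 1)

# x*((1 + 1) - 1) + (0*x + 1), which simplifies algebraically to x + 1.
bloated = ("+",
           ("*", "x", ("-", ("+", 1, 1), 1)),
           ("+", ("*", 0, "x"), 1))

print(node_count(compact), node_count(bloated))
```

Both expressions compute the same value for every x, yet the second occupies more than four times as many nodes. With no limit on expression size, nothing prevents such growth from compounding over successive generations.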
Another type of genetic algorithm used for symbolic regression is Gene Expression Programming (GEP). In gene expression programming, expressions are represented by strings of symbols in which each symbol represents a token (e.g., operand, operator) of a mathematical expression. In using gene expression programming, the values of constants that are to appear in an expression that the algorithm is seeking may not be known ahead of time. Therefore, the GEP algorithm may have to create a program that performs an inordinate number of operations on a limited set of constants that it has been given to work with (e.g., zero and one). The latter necessity increases the time required for the gene expression programming algorithm to converge and also unduly increases the size of solution programs that are found. Moreover, inasmuch as the expressions produced by gene expression programming algorithms are limited to a finite size, the operations required to obtain needed constants may consume a substantial portion of a maximum expression size and limit what is available for other needed operators and variables.
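The cost of synthesizing constants can be illustrated with a sketch that builds an integer from the single constant 1 using only addition and multiplication, written as a string of prefix-notation tokens. The construction below (repeated doubling) is a hypothetical scheme chosen for the example; it is not claimed to be minimal, nor to be what any particular gene expression programming algorithm would evolve.

```python
def build_constant(n):
    """Return prefix-notation tokens that evaluate to n using only "1", "+" and "*"."""
    if n < 1:
        raise ValueError("this sketch only handles positive integers")
    if n == 1:
        return ["1"]
    if n == 2:
        return ["+", "1", "1"]
    if n % 2 == 0:
        # n = 2 * (n // 2)
        return ["*"] + build_constant(2) + build_constant(n // 2)
    # n = 1 + (n - 1)
    return ["+", "1"] + build_constant(n - 1)


def eval_prefix(tokens):
    """Evaluate a prefix-notation token list built from "1", "+" and "*"."""
    def walk(pos):
        tok = tokens[pos]
        if tok in ("+", "*"):
            left, pos = walk(pos + 1)
            right, pos = walk(pos)
            return (left + right if tok == "+" else left * right), pos
        return int(tok), pos + 1

    value, end = walk(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after a complete expression")
    return value


seven = build_constant(7)
print(" ".join(seven), "->", eval_prefix(seven), f"({len(seven)} tokens)")
```

Under this scheme the single constant 7 occupies eleven tokens. If the maximum expression size were, say, a few dozen tokens, two or three such constants would leave little room for the operators and variables that carry the actual structure of the expression.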
In gene expression programming, a variety of actions that mimic the natural processes involved in the evolution of a population are performed. These include one-point and two-point crossover and mutation. These processes involve random selection of crossover points and random selection of new tokens to replace pre-existing tokens (operands or operators) in a representation of an expression (chromosome). Due to their random nature, these operations, which are important in adaptation through evolution, may unfortunately, in the case of gene expression programming, lead to syntactically incorrect expressions (programs). Such syntactically incorrect expressions are unsuitable as solution candidates and have the potential to generate a program execution error in the gene expression programming algorithm.
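How crossover can break syntax may be seen in a simplified sketch. The sketch assumes that each chromosome is a plain list of prefix-notation tokens with binary operators only; this encoding, the function names, and the parent chromosomes are illustrative assumptions and do not reproduce the encoding of any specific gene expression programming system. The crossover point is fixed here so that the example is repeatable, whereas in the algorithm it would be chosen at random.

```python
# Number of arguments each operator consumes; any other token is an operand.
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2}


def is_valid_prefix(tokens):
    """True if tokens form exactly one complete prefix-notation expression."""
    needed = 1  # number of sub-expressions still required
    for tok in tokens:
        if needed == 0:
            return False  # tokens left over after a complete expression
        needed += ARITY.get(tok, 0) - 1
    return needed == 0


def one_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length chromosomes at the given point."""
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2


parent_a = ["+", "x", "*", "x", "1"]  # x + (x * 1)
parent_b = ["*", "+", "x", "1", "x"]  # (x + 1) * x

child_1, child_2 = one_point_crossover(parent_a, parent_b, point=2)

for name, chrom in [("parent_a", parent_a), ("parent_b", parent_b),
                    ("child_1", child_1), ("child_2", child_2)]:
    print(name, " ".join(chrom), "valid" if is_valid_prefix(chrom) else "INVALID")
```

Both parents are well formed, yet the first child carries surplus operands after a complete expression and the second child ends with operators whose arguments are missing. An algorithm that evaluated either child without a check of this kind could raise an execution error.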