A recurring problem in the computer field is the implementation of computer systems which may substitute for often scarce human workers, in particular workers having particular knowledge and skills or expertise, ranging from well trained and experienced clerical workers to highly skilled professionals, such as medical experts. The essential problem is to implement the knowledge and skills of the human expert into a computer system in such a manner that the computer system, when provided with the same fact pattern as the human expert, will reach the same conclusion or decision as the expert.
The first implementations of such systems used conventional, sequential computers characterized in that such systems are data sequential, that is, they perform a sequence of operations on a very limited number of data elements at a time, such as an add or compare of two data elements. A sequential system which was to work with extremely large numbers of data elements would thereby require prohibitively long computation times, even for very fast systems.
In order to reduce the number of data items to be dealt with, the first forms of "expert" systems implemented human "expertise" as rule based systems wherein "knowledge engineers" would attempt to elicit from the experts a set of rules by which the experts reached decisions given a fact pattern. Such a set of "rules" would attempt to codify, for example, the knowledge, methodology and reasoning process used by a medical diagnostician to reach a diagnosis given a set of symptoms, the results of medical tests, and so on. The "rules" would then be programmed as a sequence of decision steps and, given a fact pattern, the system would execute the programmed sequence of rule decisions to reach the same conclusion as the expert.
Rule based "expert" systems, however, require a very substantial investment of knowledge engineer and expert time, and programmer time, in determining the appropriate set of rules and encoding those rules. For example, a rule based system for performing the census return classification operations used as an illustration in describing the present invention required four person-years to construct. Also, much expertise is based upon knowledge of a very large number of individual fact patterns and the codification of this expertise into a set of rules, which are of necessity and purpose more general than individual cases, results in the loss of much information. For example, a medical diagnostician may recognize a second case of legionnaire's disease after learning of only a single previous case. A rule based system generally will not, since a single case is not sufficient to require rewriting the rules of the system to accommodate the new case; the rewriting of the rules in a rule based system is an extremely difficult and time consuming task because the rules interact, and a change in one rule typically requires corresponding changes in many related rules. In addition, it is very difficult to determine whether the correct set of rules has been implemented; many experts do not consciously know and understand their own methodology and reasoning processes and may unconsciously create "rules" that do not in fact reflect their methodology. Lastly, it is very difficult to change a rule based system once it has been designed and implemented.
The development of data parallel systems has led to the development of memory based reasoning systems to perform functions analogous to rule based systems, but very different in actual implementation and operation. Data parallel systems exploit the computational parallelism of many data intensive operations by performing single operations on thousands of data elements in parallel and are comprised of a single instruction engine controlling a very large number of data processing units, with one processor unit being associated with each data element to be operated upon.
Memory based reasoning systems differ from rule based systems by operating directly from historic data to reach a conclusion or decision with regard to a new fact pattern by directly comparing the new data to the historic data and the decisions or conclusions reached for each of the previous sets of data. Such systems include a training database comprised of historic records, wherein each record contains a set of related data fields containing data values, that is, information, relating to a previous fact pattern. The data fields in a given record of the training database are in turn comprised of a number of "predictor" data fields, containing the originally known information pertaining to the previous fact patterns, and one or more "target" data fields containing the results, conclusions or decisions reached from the originally known information. It should be noted that the "data values" appearing in the data fields may be comprised of numeric data or text or both, depending on the particular database.
A new record is similarly comprised of predictor data fields and target data fields but, while the new record predictor data fields contain data, the contents of the new record target data fields are determined by comparison of the new record predictor fields with the historic record predictor fields. To illustrate by reference to the previous example, a medical database will contain historical patient records wherein the predictor data fields contain information such as symptoms, test results and patient characteristics and the target data fields contain the diagnosis and treatment plan for each patient. A new patient record will contain information in the predictor data fields, but the diagnosis and treatment plans for the new patient will be determined by comparison of the new patient record predictor data fields with the historic patient record predictor data fields. It should be noted that it is possible to make an identification in a new case from a single previous case because all of the data in each of the historic patient records is retained in the database, rather than being abstracted into rules with consequent loss of the data particular to individual cases.
In general, memory based reasoning systems of the prior art determine the contents of the target fields of a new record, that is, the values to appear in the target fields, by first identifying the predictor data fields of the training database which are more relevant or significant in determining the target data field values and assigning "weights" to the predictor fields according to their relative significance in performing the determination. This step is generally performed by determining, for each target field and each predictor field in the training database, the probability that a given target field will have a specified value given that a predictor field has a given value.
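The probability determination described above may be illustrated by the following minimal sketch, which estimates the probability of each target field value given a predictor field value by counting co-occurrences in a training database. The records, the field names ("fever", "cough", "diagnosis") and the function name are hypothetical and are chosen only for illustration; the prior art systems are not stated to use this particular estimate.

```python
from collections import Counter, defaultdict

# Hypothetical training records: two predictor fields and one target field.
training = [
    {"fever": "yes", "cough": "yes", "diagnosis": "flu"},
    {"fever": "yes", "cough": "no", "diagnosis": "flu"},
    {"fever": "no", "cough": "yes", "diagnosis": "cold"},
    {"fever": "no", "cough": "no", "diagnosis": "healthy"},
]


def conditional_probabilities(records, predictor, target):
    """Estimate P(target = t | predictor = v) by counting co-occurrences."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[predictor]][record[target]] += 1
    return {
        value: {t: n / sum(targets.values()) for t, n in targets.items()}
        for value, targets in counts.items()
    }


fever_probs = conditional_probabilities(training, "fever", "diagnosis")
cough_probs = conditional_probabilities(training, "cough", "diagnosis")
```

In this toy data the "fever" field constrains the diagnosis more sharply than the "cough" field does, which is the kind of difference in significance that the weighting step is intended to capture.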
This general approach to determining the probabilities that a target field has a specified value given that a predictor field has a given value is based upon the assumption that all occurrences of that predictor field in the training database have the same data value, which is rarely the case. Because of this, the "probability" for each predictor field is modified according to the differences, or "distances", between the values of each predictor data field which may be used to predict the value of a given target field. The distances between predictor field values are determined over the range of possible values for the predictor field and used to generate a difference measurement. The probability assigned a predictor field is then modified by the predictor field's difference measurement to determine the final "weight" assigned to the predictor field. Because these "weights" are based upon the distances between field values, these weights will hereafter be referred to as "distance weights". It should be noted that a high "distance weight" corresponds to a small "distance" between field values, that is, the field values are near one another, and that these weights are assigned to the predictor fields themselves, and not to the values in the fields.
It is apparent that the various predictor fields will differ greatly in their importance in determining target field values and it is further noted that the predictor fields often do not operate independently in constraining the values appearing in the target fields, that is, the combined effect of two predictor fields is often quite different than their individual effects. Accordingly, the relative distance weights assigned to the predictor data fields are used to select, or limit, the training database predictor data fields which will actually be used in comparison with the data fields of the new sample, thereby restricting the training database to a subset of the fields in the training database.
The predictor data field values of a new sample are then identified and compared to the selected predictor data fields of the examples in the training database. The matches between new sample data field values and the corresponding values of the selected training database data fields are identified and their distance weights are accumulated according to a "metric", or measure. The values appearing in the target data fields of the nearest matching training database record, as determined by the metric, are then used as the contents of the target data fields of the new sample.
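The matching step described above may be sketched as follows, assuming the simplest possible metric, namely the sum of the distance weights of the fields on which a training record and the new sample agree exactly. The field names, the weights and the function names are hypothetical, and an actual system may accumulate the weights according to a different metric.

```python
# Hypothetical training records and per-field distance weights.
training = [
    {"fever": "yes", "cough": "yes", "diagnosis": "flu"},
    {"fever": "yes", "cough": "no", "diagnosis": "flu"},
    {"fever": "no", "cough": "yes", "diagnosis": "cold"},
    {"fever": "no", "cough": "no", "diagnosis": "healthy"},
]
weights = {"fever": 2.0, "cough": 1.0}  # assumed values, for illustration only


def match_score(new_sample, record, field_weights):
    """Accumulate the weights of the selected fields whose values match."""
    return sum(
        weight
        for field, weight in field_weights.items()
        if new_sample.get(field) == record.get(field)
    )


def predict(new_sample, records, field_weights, target):
    """Copy the target value of the best matching training record."""
    best = max(records, key=lambda r: match_score(new_sample, r, field_weights))
    return best[target]


prediction_a = predict({"fever": "yes", "cough": "no"}, training, weights, "diagnosis")
prediction_b = predict({"fever": "no", "cough": "yes"}, training, weights, "diagnosis")
```

Only the fields named in the weights table take part in the comparison, which corresponds to restricting the training database to the selected predictor fields.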
It is apparent from the above discussion that the memory based reasoning systems of the prior art depend upon determinations of the relative "distances", between the data in the fields in each step of the above described process, such as "57" is closer to "100" than to "3", or "violet" is closer to "red" than to "green". In an exact match system, the required "distance" is, of course, zero.
This in turn requires that the data values be well behaved. That is, there must be a defined set or range of possible values for the data, the data values must inherently have some relative order or ranking, and the data values must be comparable across at least the fields which are to be compared to one another.
Conversely, memory based reasoning systems of the prior art are not suitable for use with data which is not well behaved, that is, where there is an open-ended range of possible values for each data field or where the possible data values do not have some relative order or ranking and, consequently, where the data contained in the various fields of a database are not comparable across the fields. The most common example of such data is information expressed in "natural language" terms, also referred to as "free text", wherein there are a very large number of possible ways in which to express the information. Certain databases, such as the census return database discussed herein as an example of an implementation of the present invention, may contain up to 50,000 or more different words.
Because of the very large number of possible ways in which the same data may be expressed in "natural language", the likelihood of finding a match between the natural language data fields of a new sample and the natural language data fields of the training database, or even a match between the data fields of the training database, is much reduced. This factor alone significantly reduces the ability of the memory based reasoning systems of the prior art to deal with natural language data.
For the same reason, the natural language data contained in the various fields of the training database and the new samples are not comparable across the fields, so that it is very difficult, if possible at all, to compare the fields of the training database with one another or with the fields of a new example, that is, to determine the "distance" and accordingly the "distance weights" between any training database example and a new sample. To illustrate by reference to the example which will be used below to illustrate the present invention, the free text data that may appear in a given single word data field of the industry description fields of different census returns may include the words "car", "big", "retail", and "factory". It is apparent that there is no order or ranking that can be placed on these examples that would aid in classification of the terms, nor is the "distance" between these terms ascertainable, that is, for example, it cannot be determined whether "big" is closer to "retail" than to "factory", or whether "retail" or "factory" is closer to "car". In further example, a census return text field may contain the response "The Computer Industry" and two examples in the training database may contain the responses "A Computer Business" and "The Automobile Industry". Based upon a distance comparison of these three text fields, "The Automobile Industry" is closer to "The Computer Industry", with a match between two words, than is "A Computer Business", which provides a match for only one word, even though the latter response is the closer of the two in meaning.
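The word match counts in the census example above may be reproduced with the following minimal sketch, which assumes that a "match" means a word shared by two responses without regard to case or position. The function name is hypothetical.

```python
def shared_words(response_a, response_b):
    """Count the distinct words the two responses have in common."""
    words_a = set(response_a.lower().split())
    words_b = set(response_b.lower().split())
    return len(words_a & words_b)


new_response = "The Computer Industry"
automobile_matches = shared_words(new_response, "The Automobile Industry")  # 2
business_matches = shared_words(new_response, "A Computer Business")  # 1
```

The higher count goes to the automobile response because "the" and "industry" match, while the only informative word, "computer", contributes a single match to the other response.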
The above example also illustrates that the relative significance of predictor data fields per se in determining matches between training database examples and a new sample is significantly reduced in the case of natural language data. That is, it is difficult to assign a predictive distance weight to a predictor data field in itself based upon the probability of a target field having a certain value given that a predictor field has a given value because the range of values that may appear in that predictor data field is open ended. Similarly, a distance weight based upon the "difference" between the values of two fields is meaningless when the values contained in the two fields are not comparable. In the example used just above, a single predictor data field may, in four different returns, contain the values "car", "big", "retail", and "factory" and this single data field may more accurately be represented as four different data fields.
It should be noted that data in the form of text does not necessarily comprise "natural language" data, or "free text". Referring to the previous example of a medical database, many of the data fields therein will contain text. If the database records are constructed by medically trained personnel, however, the terms used will be drawn from a limited vocabulary of well defined technical medical terms. That is, the textual data will not be open-ended as to the values appearing therein, the terms will have certain well-defined values and an order or ranking relative to one another across corresponding data fields, and the data will be comparable across the fields. If, however, certain of the data fields were provided by the patients, for example, in responding to a medical questionnaire, those data fields would contain "natural language" text, as a patient would not normally use well defined technical medical terms and two patients would not necessarily express their symptoms in the same way, even if the symptoms were in fact identical.
Conversely, and although the terms natural language data or free text will be used herein for convenience, it is to be understood that other forms of data exhibit the same characteristics as free text and that the present invention applies equally to other forms of data wherein there is an open-ended range of possible values for the data or the data values are not ordered or comparable across the data fields.
The present invention provides a solution to these and other problems of the prior art memory based reasoning systems.