Exponential increases in computer power, including processing speed and memory capacity, since the mid twentieth century have dramatically increased the usefulness of computing in every sector of society and indeed in our daily lives. One of the main uses of computers is the generation and storage of ever increasing volumes of data. However, by itself, raw data has only limited value. In most instances, its true value can only be obtained once it has been interpreted by someone with the requisite understandings and insights. This interpretation process is a value-adding process converting “data” to “knowledge” and then often to “judgements”. This knowledge or judgement is often expressed in a textual report.
While computer-driven processes are useful to extract, collate and store both numeric and textual data, the ability to effectively interpret this data, either by a human or a computer, may be limited by the large data volumes and associated complexity.
For a human, the ability to make a judgement so as to correctly interpret a body of data in a timely fashion will require that the data be pre-processed and reduced sufficiently so that the significant features are evident.
For a rules-based expert system, there is a further but related requirement that each rule be as general as possible in order to avoid a proliferation of rules needed to take into account all the specificities of large and complex data sets. More general rules are built using higher-level abstractions from the data set, so that variations in the specifics of the underlying data do not necessarily invalidate those rules. These higher-level abstractions are precisely the significant features that a human expert building the rules-based expert system will use.
That is, just like a human expert, an expert system needs complex data to be reduced to a form where the inferencing can be based on a smaller set of significant features, rather than the large set of original data values.
The task is therefore to find ways to reduce the complexity of the data to be interpreted by pre-processing it into a smaller, less complex set of significant values which can then be presented to the human or computer for subsequent interpretation.
There are two key factors contributing to data complexity.
The first is the sheer number of data item values that may need to be interpreted—that is, when there are a large number of elements in a given system that need to be analysed.
For example, in order to generate a patient test report for a referring physician, the laboratory pathologist may have to interpret the results of hundreds of protein biomarkers used in the diagnostic instrument that has analysed the patient's blood sample.
The second factor driving complexity is the size and possibly unstructured ('freeform') format of individual data values themselves. A single numeric or enumerated value (i.e. a text code), by itself, may be relatively simple to interpret as there is a clear association of this ‘atomic’ value with its corresponding data item, e.g. a troponin value of 3.4 mmol/L.
However, a large freeform piece of text may contain ambiguities, misspellings, abbreviations, more than one data value, or one of many different possible representations of the same data value, making it much harder to interpret.
For example, in order to generate a patient test report for a referring physician, the laboratory pathologist may have to interpret the machine-generated test results in the context of a lengthy textual clinical history of the patient provided by the referring physician. The clinical history is complex because it is a large and unstructured data item and relatively minor variations in the text can completely change the resulting interpretation. For example, the shorthand phrases “DM” (known diabetes mellitus), “FH DM” (family history of diabetes mellitus), “? DM” (query diabetes mellitus) and “not DM” (not diabetic) will each change the pathologist's interpretation of a given set of glucose test results. Note also that synonyms (“DM”, “Diabetic”, “Diab”, “Diabetes noted”), misspellings (“Diabetes Mellitis”) and variations in word ordering (“? DM”, “DM ?”) in the clinical notes all need to be understood by the pathologist when making an interpretation.
A clinical history may also contain the phrase “on Zocor” or “on lipid lower treatment”, both phrases representing a second concept which indicates to the pathologist whether the patient is on some heart medication. This sort of phrase will likewise affect the pathologist's interpretation of the test results and the resulting report to the referring physician.
Taking a specific example “DM, on Zocor”, there is no clear association between the ‘clinical history’ data item and an atomic value. Rather, the clinical history as a complex data item implicitly contains two simpler, atomic data items, e.g. Diabetic (yes) and On Treatment (yes).
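The reduction of a freeform clinical history to atomic data items can be illustrated with a minimal Python sketch. It is not a method described in this document: the function names, the feature names and the matching patterns are assumptions made for illustration, and only the example phrases are taken from the text above.

```python
import re
from typing import Optional

# Hypothetical patterns built only from the example phrases in the text.
DM_PATTERN = re.compile(
    r"\b(?:dm|diabetic|diabetes(?:\s+mellit[iu]s)?|diab)\b", re.IGNORECASE
)
TREATMENT_PATTERN = re.compile(
    r"\bon\s+(?:zocor|lipid\s+lower(?:ing)?\s+treatment)\b", re.IGNORECASE
)


def classify_diabetes(note: str) -> Optional[str]:
    """Return a status for the diabetes concept, or None if it is not mentioned."""
    match = DM_PATTERN.search(note)
    if match is None:
        return None
    before = note[: match.start()].rstrip().lower()
    after = note[match.end():].lstrip()
    if before.endswith("fh"):
        return "family_history"
    if before.endswith("not"):
        return "absent"
    if before.endswith("?") or after.startswith("?"):
        return "query"
    return "known"


def extract_features(note: str) -> dict:
    """Reduce one freeform note to two atomic data items."""
    return {
        "diabetic": classify_diabetes(note),
        "on_treatment": TREATMENT_PATTERN.search(note) is not None,
    }


print(extract_features("DM, on Zocor"))
```

Even this small sketch needs separate handling for qualifiers, synonyms, misspellings and word order, which suggests how quickly hand-written text handling grows for real clinical notes.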
Another example of this second type of complexity due to the size and lack of structure of a data item value is where the primary laboratory performs some of the patient tests ‘in house’, but sends away the blood sample to a second laboratory for some more specialised tests. The second laboratory will return its findings in a textual report. From the perspective of the pathologist at the primary laboratory, the report received from the second laboratory is a complex data item. The pathologist will have to interpret both this report and the results of the tests performed at the primary laboratory in order to make the final report to the referring physician.
Another example of a clinical domain with complex data is the allergy domain, in which potential allergens need to be tested in a blood sample, then matched against symptoms in a potentially lengthy and free-form patient history in order to identify the relevant allergen(s). Infectious diseases (identification of a pathogen) and multisystem illnesses (e.g. identification of an underlying cause in neurology, endocrinology, oncology) are other examples.
Similar difficulties in the interpretation of complex data arise in non-medical fields such as fraud detection (e.g. in re-issuing airline tickets, driver's licences and passports, credit card purchases, and electronic commercial transactions), auditing in logistics, inventory management, serial numbering (e.g. in detection of counterfeiting, or for product recall purposes), or IT support services.
In the example of airline fraud detection, a large number of events containing unstructured or semi-structured data on ticket sales and passenger flights need to be recorded then matched against pricing faresheets and other criteria for airline ticket re-issue to identify whether the correct pricing has been applied for a specified airline ticket.
This is a laborious task since information contained in faresheets and airline tickets is either unstructured or only semi-structured, and each set must be individually interpreted by human experts to determine if the conditions expressed in the faresheet have, in fact, been followed.
To enable efficient and accurate interpretation by a human expert, complex data on a faresheet needs to be reduced to a set of conditions that are applicable to the specific ticket (in this example). The relevant characteristics of that ticket (start and destination cities, date of travel, class of travel, price) also need to be extracted. Once the data on the faresheet and ticket has been pre-processed into these significant features, a human expert can make the judgement as to whether there has been a fraudulent or incorrect ticketing event.
The task of real estate valuation is another area where interpretation of complex data is required. In this domain, the interpretation required is a valuation comprising a dollar amount with a supporting narrative. The data on which the interpretation is made consists of a variety of complex and disparate data including house and land size, house orientation, postcode and recent valuations of nearby or other comparable properties. Freeform textual notes describing various characteristics of the property (e.g. a view blocked by an adjacent high-rise apartment block) may contain important factors impacting the valuation, and so need interpretation.
Another example of a non-clinical domain requiring the interpretation of complex data is the field of IT support services. Consider an online-transaction processing system where a company provides regular value-added outputs to its subscribing customers such as news feeds or other reports.
The reliability of the company's online-transaction processing system is critical to the performance of this service. To achieve a very high level of reliability, the system must be continuously monitored for all factors that could impact on its reliability.
These factors include transaction rates, user activity, resource usage such as memory, disk, and CPU, as well as operating system generated alerts and warnings, and alerts and warnings generated by the transaction-processing application itself. A standard way of recording these factors is to continuously log all this information to a central facility, e.g. a log file, where it can be analysed by the company's IT support staff on a regular basis. The goal is for IT support staff to act upon any serious alerts or concerning trends recorded in the log file before the online transaction system fails.
As the log entries are generated by various operating system or application system components, often from different vendor products, they are not formatted according to a universal coding system but are essentially free text. For a large online-transaction processing system, the log file can be very large, e.g. tens of Mbytes per day, which is beyond the scope of IT support staff to examine manually. Furthermore, certain classes of alerts may require immediate action, in which case the alert and the corresponding remedial action need to be identified promptly.
As in the previous examples, to enable efficient and accurate interpretation by a human expert, complex data in a log file needs to be pre-processed into a set of significant features such as alert or trend status conditions from which a human expert can make the judgement as to whether any remedial action needs to be taken.
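As an illustration only, the following Python sketch reduces free-text log lines to a few significant features. The severity keywords, the feature names and the sample log lines are all assumptions made for the example; real log formats vary by vendor, as noted above.

```python
import re
from collections import Counter

# Hypothetical severity keywords; actual logs may use different vocabularies.
SEVERITY = re.compile(r"\b(warning|error|critical|fatal)\b", re.IGNORECASE)


def summarise_log(lines):
    """Reduce raw log lines to alert counts and an 'act now' flag."""
    counts = Counter()
    for line in lines:
        match = SEVERITY.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return {
        "counts": dict(counts),
        "needs_immediate_action": counts["critical"] + counts["fatal"] > 0,
    }


# Made-up sample entries, for illustration only.
sample = [
    "10:00 db: WARNING disk usage 91%",
    "10:01 app: transaction ok",
    "10:02 kernel: CRITICAL out of memory",
    "10:03 db: warning disk usage 93%",
]
print(summarise_log(sample))
```

A human expert or expert system would then reason over the summary rather than over each raw log entry.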
A computer-based expert system attempts to mimic the human interpretive process. For example, RippleDown is a computer-based expert system (decision engine) that is taught by a domain expert how to make highly specific interpretations on a case-by-case basis, as described in U.S. Pat. No. 6,553,361.
Like a human expert, a rules-based expert system needs to have the data presented to it in terms of the relevant significant features so that it can inference from these features. If it were to inference from the complex raw data (e.g. data in the fare sheets and tickets themselves), the number of specific rules required would not only be unmanageable, but the rule set, once built, would fail to interpret any newly encountered variations in the fare sheets or tickets.
In a high transaction environment, expert systems can perform an essential role in leveraging human expertise to provide rapid interpretations of raw data. For example, a pathology laboratory may need to provide interpretive reports for tens of thousands of patients per day, far beyond the manual capability of the few pathologists who might be employed at that laboratory.
However, the ability of an expert system to interpret data is limited by the same factor that limits a human expert—data complexity. Complex data needs to be pre-processed into a form so that rules can be built using the higher-level concepts that a human expert would use, and to avoid the proliferation of rules and report definitions that would otherwise result.
Two more detailed and specific examples of the data complexity problem are now given.
The first more specific example is in the field of medical pathology where complicated investigations commonly performed by professionals, such as medical pathologists, often require a large number of tests. The interpretation of the test results is often difficult and requires the skill of an expert or expert system. The expert or expert system will generate text for inclusion in a report containing a useful analysis and interpretation of the test results, sometimes in a highly condensed form, to be forwarded to the referring doctor (e.g. the family physician) who may not have the expertise to interpret the raw test results themselves. To date, the knowledge bases of expert systems have been built in domains in which tests are relatively independent of each other. For example, a knowledge base for thyroid reporting principally considers results of thyroid function testing (namely, TSH, FT3 and FT4). Other patient demographic data such as age and sex also generally need to be taken into account, as well as the observations recorded in clinical notes from a physical examination or from an oral history. Reports generated using these knowledge bases refer to these individual tests and their values, as well as providing a diagnosis and often a recommendation for treatment and follow-up testing. Typically in these domains, there are fewer than 20 tests to consider, plus patient demographic data like age and sex, plus observations in clinical notes provided by the medical practitioner. While test results may interact and so be related to some extent (e.g. if one test is abnormal, another is also likely to be abnormal), the low number of tests and test interactions to be considered means that the rules in the knowledge base can refer to the individual test results themselves and still maintain their generality. That is, the test results do not have to be reduced by some pre-processing step to a smaller set of significant features before interpretation.
Specific rules comprising a textual comment given under certain conditions can be written by considering each individual test result, or by considering the relatively few significant combinations of test results. For example, for a thyroid panel of tests, the comment “Consistent with primary hypothyroidism” may be generated if the TSH test result is elevated.
Traditional clinical domains such as the thyroid example above have just a few Attributes. However, for newer clinical domains with potentially hundreds or even thousands of possible investigations, the application of specific rules to each type of investigation becomes infeasible. For example, the medical practitioner may request a number of food allergy tests such as peanut, soya, milk, wheat and egg. If soya and milk return very high positive values (e.g. 24.3 and 30.1 respectively) and the other tests are negative, the pathologist will want the report sent back to the doctor to include a comment like:
    “Very high results were detected for milk (30.1) and soya (24.3)”
The rule that allows the interpretation of the test data to give this comment is along the lines of:
    10<=milk<=50, indicating a very high result, and
    10<=soya<=50, indicating another very high result, and
    milk>soya, indicating that the milk value should be before the soya in the report, and
    peanut=0, and
    wheat=0, and
    egg=0
In this simple example with just 5 allergens tested, the number of combinations of the above comment is 2^5 = 32 (neglecting order of importance). Corresponding to each combination of test results there needs to be a different rule.
It is clearly not practical to separately define each of the 32 possible combinations of this comment and corresponding rules even for this simple comment—and real-world examples are far more complex than this.
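To make the contrast concrete, the following Python sketch derives the comment from a single reduced feature (the ordered list of very high results) instead of one rule per combination. It is an illustrative assumption about how such pre-processing might look, not a method described in this document; the function name is hypothetical and the 10 to 50 range is taken from the example rule above.

```python
def very_high_comment(results, low=10.0, high=50.0):
    """Build one comment from whichever allergens fall in the very high range."""
    very_high = sorted(
        ((name, value) for name, value in results.items() if low <= value <= high),
        key=lambda item: item[1],
        reverse=True,  # highest value is reported first
    )
    if not very_high:
        return ""
    items = [f"{name} ({value})" for name, value in very_high]
    if len(items) == 1:
        listed = items[0]
    else:
        listed = ", ".join(items[:-1]) + " and " + items[-1]
    return f"Very high results were detected for {listed}"


panel = {"peanut": 0, "soya": 24.3, "milk": 30.1, "wheat": 0, "egg": 0}
print(very_high_comment(panel))
# Very high results were detected for milk (30.1) and soya (24.3)
```

The same function covers every combination of the five allergens, whereas rules written against the individual test values would need a separate rule for each combination.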
In the case of an allergy knowledge base there are literally hundreds of possible tests that can be performed in an investigation, each measuring the same chemical (IgE), with the value of each test indicating the patient's response to a particular allergen. In cases where there are hundreds of tests in an investigation it would be impossible for an expert to define all the possible interactions between the test results and provide the multitude of comment variations that an accurate report would require. Before an interpretive knowledge base could be defined, the data complexity of this domain would have to be substantially reduced.
However, the computational challenge of generating a report that takes into account highly complex data is beyond the capability of traditional expert systems. For example, if there were four hundred tests and each test had only a binary output, such as “positive” or “negative”, then there would be 2^400 possible combinations of test results, each combination requiring a unique reporting text conclusion that had been previously generated and stored on a computer system. This does not even account for possible interactions between the test data or other relevant inputs such as clinical notes, which greatly complicate the situation. The traditional approach of attempting to interpret complex data is not feasible when there are hundreds or more observations. In the clinical setting, the variety of cases and their corresponding reports even with a modest number of tests can be huge, and even more so when the patient's historical information and clinical notes are also taken into account.
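The scale of the problem can be checked with a short calculation: n tests with binary outcomes give 2^n combinations. The Python sketch below simply evaluates this for the five-allergen and four-hundred-test examples; the variable names are illustrative.

```python
# Number of result combinations for n binary tests is 2 ** n.
combos_5 = 2 ** 5
combos_400 = 2 ** 400

print(combos_5)              # 32
print(len(str(combos_400)))  # 121 decimal digits, i.e. roughly 2.6e120
```

A number with 121 digits is far too large for the corresponding conclusions to be enumerated and stored in advance.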
The second more specific example is an airline ticketing application where tickets may be issued directly by the airline, or indirectly through travel agents, airline consolidators or online travel websites. If a ticket needs to be re-issued (e.g. due to a change in the itinerary, or to replace a lost or destroyed ticket), the details of the original transaction need to be verified against faresheets (a document of terms and conditions governing airline tickets) and against the original transaction details (e.g. amount paid, number of tickets purchased, currency of transaction, names of passenger(s), date and location of purchase). A particular difficulty is that airline faresheets are complex textual data items. They do not follow any definite format but nevertheless contain certain important information—often expressed as a number of Key Terms, such as “cancellation”, “before travel”, “lost ticket”, and so on, plus monetary values and dates. Within a single faresheet, and between faresheets, each Key Term can appear in a variety of forms. For example, “free of charge”, “foc”, and “no penalty” all mean the same thing.
As well as containing Key Terms, each of the faresheets specifies certain information, such as the penalty for cancellation before travel, the penalty for a lost ticket, and so on. Each of these Key Concepts is expressed in a variety of different ways using the Key Terms.
Therefore, it is necessary in the above example to analyse blocks of free text containing relevant information expressed in a variety of ways, then to analyse information from the free text along with other data to reach a conclusion. An analogous problem arises in the context of medical diagnosis, where clinical notes may contain important information expressed in free text and must be interpreted in conjunction with pathology tests and demographic data.
The difficulties in interpreting blocks of free text include:
    (a) the difficulty in extracting one or more significant features from a block of free text so that rules can be built using these significant features;
    (b) the difficulty for a knowledge base to deal with minor variants of the block of free text. If the textual data in a block of free text is not quite the same as the text on which the rules were built, those rules may not be sufficiently general to still apply to the new free text block;
    (c) the difficulty for a knowledge base to deal with different representations of the significant features themselves, both within the one free text block or between free text blocks; and
    (d) the need to build rules based on a block of free text containing multiple Key Terms and encapsulating possibly several higher-level Key Concepts. A ‘Key Concept’ is a significant feature embedded in the free text that will be used by the expert or expert system when making an interpretation. A Key Concept is a unique higher-level code referring to a sequence of Key Terms. Several variants of Key Term sequences may map to a single Key Concept.
In summary, traditional computer-enabled expert systems that are used to mimic the human interpretive process in interpreting data suffer a number of limitations when used to interpret complex data, including:
    (a) difficulty in interpreting very large volumes of data values, since the rules that drive the interpretive process become overly complex and unwieldy when very large numbers of data values need to be taken into account in order to reach a conclusion or express a judgement (e.g. a definitive diagnosis); and
    (b) difficulty in dealing with large and unstructured data item values, resulting in the inability to interpret such complex data. Reducing complex data items to a canonical form where simpler, atomic data items and values can be extracted and used in rules and conclusions is an unwieldy process and poses long term difficulties in maintaining a knowledge base.
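The relationship between Key Terms and Key Concepts can be pictured with a minimal Python sketch. The variant phrases come from the faresheet example above; the canonical codes, the Key Concept names and the lookup-table approach are assumptions made for illustration and are not taken from this document.

```python
import re

# Variant spellings mapped to a canonical Key Term code (codes are hypothetical).
KEY_TERM_VARIANTS = {
    "free of charge": "NO_PENALTY",
    "foc": "NO_PENALTY",
    "no penalty": "NO_PENALTY",
    "cancellation": "CANCELLATION",
    "before travel": "BEFORE_TRAVEL",
    "lost ticket": "LOST_TICKET",
}

# Longest variants first so multi-word phrases are preferred.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(v) for v in sorted(KEY_TERM_VARIANTS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

# A Key Concept is a code for a sequence of Key Terms (names are hypothetical).
KEY_CONCEPTS = {
    ("CANCELLATION", "BEFORE_TRAVEL", "NO_PENALTY"): "CANCEL_BEFORE_TRAVEL_FREE",
    ("LOST_TICKET", "NO_PENALTY"): "LOST_TICKET_FREE",
}


def key_terms(text):
    """Return the canonical Key Terms found in the text, in order of appearance."""
    return [KEY_TERM_VARIANTS[m.group(1).lower()] for m in _PATTERN.finditer(text)]


def key_concept(text):
    """Map the Key Term sequence in the text to a Key Concept, if one is known."""
    return KEY_CONCEPTS.get(tuple(key_terms(text)))


print(key_concept("Cancellation before travel: FOC"))
print(key_concept("Cancellation before travel - no penalty"))
```

Both inputs reduce to the same Key Concept, so a rule written against that concept would not need to change when a faresheet uses a different wording. A fixed lookup table like this one still fails on any variant it has not seen, which is the generality problem described in item (b).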
Therefore, traditional expert systems suffer limitations in interpreting ever-increasing volumes of complex data and in converting such data to knowledge or a judgement (the knowledge or judgement being expressed in a textual report). There is a need for a computer-enabled method and system for generating text (such as a textual report) that is capable of interpreting large volumes of complex data, including numeric and textual data obtained from disparate sources and presented in various forms, including as freeform text or, alternatively, as structured text as in a ‘synoptic’ report.
It is an object of the present invention to provide a method and system for overcoming the described limitations of traditional expert systems in interpreting complex data and in converting such data to knowledge or a judgement expressed in a textual report.