This application contains a microfiche appendix consisting of three microfiche comprising 127 frames.
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent disclosure, as it appears in the U.S. Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
The present invention is directed to a computer-based method and apparatus for classifying textual data. One application of the invention is a computer-based method and apparatus for classifying clinical trial adverse event reports. In the field of pharmaceuticals intended for the treatment of humans, clinical trials are used to validate the efficacy and safety of new drugs. These clinical trials are conducted with the participation of physicians, who monitor the health of persons involved in the trial.
Any symptom of a disease or malady, or degradation in the health of a patient participating in a clinical trial is termed an adverse event. Once such adverse events are observed or reported to the physician responsible for monitoring the health of a patient, an adverse event report is generated. These adverse event reports typically include short descriptions of the symptoms or health effect that resulted in the report. The reports generally omit all but those terms that are significant to the description of the adverse event being reported. However, given the nature of language, it is possible to describe one event in a very large number of ways. Accordingly, one patient who experiences headaches during a trial may have their symptoms described by a physician as “headache, migraine”, while another patient who experiences headaches may have their symptoms described as “migraine headache” or simply as “headache.” In addition to the variations in describing adverse events due to differing word combinations, the physicians who prepare the adverse event reports may resort to synonyms (e.g. describing pain in the abdomen as “pain in the stomach” or “pain in the belly”) or to abbreviations. Additionally, the reports are abbreviated in their syntax (e.g. “allergy, arms and legs” rather than “skin allergy on the arms and legs”). Adverse event reports are also often collected from all over the world. Therefore, adverse event reports can be in a number of languages, in addition to English.
The text that comprises an individual adverse event report is known as a verbatim. The verbatims must be collected and the information they contain must be sorted, so that the significance of the various symptoms and health effects reported in the verbatims can be considered. Traditionally this work has been carried out by humans who read the verbatims and assign them to predefined categories of adverse events. A number of systems exist for categorizing verbatims. These include WHOART, COSTART, and MedDRA. However, coding verbatims is tedious and human coders introduce a certain amount of error. Furthermore, a physician is often required to interpret verbatims and to put them into their proper classification. For reasons such as cost, however, physicians are generally not employed for such work.
Computer programs that exist for assisting human coders in properly and easily classifying verbatims in accordance with the above-mentioned systems suffer from a number of drawbacks. In particular, existing systems are often incapable of automatically coding verbatims that do not conform to a verbatim that has been coded before. Therefore, existing automated systems generally cannot code a verbatim that varies from previously coded verbatims in the significant terms it employs. Although existing systems may sometimes include the capability of memorizing new verbatims, so that coding will be automatic if a previously coded verbatim occurs again, such a capability has limited use. This is because, as described above, similar adverse events can be described in an almost infinite number of ways. The possible combinations of words in a language number in the billions, even when the length of the combinations is restricted to five words or fewer.
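The scale of this combinatorial growth can be illustrated with a short calculation. The sketch below counts the ordered word sequences of up to a given length that can be formed from a vocabulary. The function name and the 100-word vocabulary are assumptions chosen for illustration and are not figures taken from this disclosure.

```python
def count_sequences(vocab_size, max_len):
    """Count ordered word sequences of length 1..max_len over a vocabulary.

    Each position may hold any of vocab_size words, so there are
    vocab_size ** k sequences of length k.
    """
    return sum(vocab_size ** k for k in range(1, max_len + 1))


# Hypothetical example: a deliberately tiny 100-word vocabulary.
total = count_sequences(100, 5)
print(total)  # 10101010100
```

Even with a vocabulary this small, the count already exceeds ten billion, so memorizing previously coded verbatims cannot cover the space of possible descriptions.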
Another impediment to producing reliable automated coding systems is that it is difficult to obtain real world verbatims that can be used as the basis for the automatic coding provisions of existing systems. This is because such data is usually proprietary, and even if not proprietary, is difficult to obtain. As a result, existing automated systems have typically been developed using the English language definitions of the categories set forth in the various classification schemes, rather than actual verbatims produced in the course of clinical trials.
As a result of the above-described limitations and difficulties, existing automated systems are rarely successful in identifying and classifying a verbatim that has not been seen before. Where a verbatim cannot be automatically coded, existing automated systems provide machine assistance in hand coding the verbatim. This is done by means of weak pattern matching functions, such as spelling normalization and stemming. Following pattern matching, the system typically offers the coder categories that the program has determined the verbatim may properly fall into.
Another difficulty in the field of clinical trial adverse event reporting is the translation of study results coded according to one classification scheme to another classification scheme. A system and method for performing such a translation would be useful because there is only limited correspondence between the categories of the various classification schemes. Translation is often desirable to compare the results obtained by different trials. However, at present, no program exists that can perform this function effectively and without large amounts of human assistance.
The present invention is capable of automatically coding the majority of the verbatims it encounters. In addition, the present invention is capable of reliably flagging those verbatims that still require coding by a human. Experiments have shown that existing auto coding systems are capable of coding only about one quarter of the verbatims in a study with high confidence. In contrast, the present invention is capable of auto coding approximately two thirds of the verbatims in a study with an error rate that is comparable to the error rate encountered using human coders. For those verbatims that the system of the present invention is incapable of auto coding, the human coder is presented with a list containing up to ten categories from which to choose the proper code. These categories are ordered, with the most likely ones appearing at the top of the list. By intelligently limiting the codes presented, the present invention is capable of improving the error rates encountered when coding verbatims. In addition, the automated system of the present invention is capable of coding a large number of verbatims in a short period of time, and of reducing the amount of time and level of intervention required of human coders. Furthermore, the present invention allows verbatims classified in one coding scheme to be translated to another coding scheme in a highly automated process.
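The division of work between automatic coding and human review described above might be sketched as follows. This is a minimal illustration only: the function name, the confidence scale of 0 to 1, and the threshold value are assumptions, since the passage states only that low-confidence verbatims are flagged and that the coder sees up to ten ranked categories.

```python
def triage(confidences, threshold=0.9, max_candidates=10):
    """Decide whether a verbatim is auto coded or sent to a human coder.

    confidences: dict mapping category name -> confidence (assumed 0..1).
    threshold:   hypothetical cut-off for accepting the top category.
    Returns ("auto", [category]) or ("manual", ranked candidate list).
    """
    # Rank categories so that the most likely one comes first.
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    if ranked and ranked[0][1] >= threshold:
        return ("auto", [ranked[0][0]])
    # Otherwise present at most max_candidates categories, best first.
    return ("manual", [category for category, _ in ranked[:max_candidates]])


print(triage({"HEADACHE": 0.95, "MIGRAINE": 0.03}))
print(triage({"HEADACHE": 0.55, "MIGRAINE": 0.40}))
```

Raising the threshold trades a lower automatic coding rate for a lower error rate on the verbatims that are coded without review.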
The present invention uses a sparse vector framework to enable the system to, in effect, employ vectors of sizes up to 10^300. This is done by performing the multiplication steps necessary to evaluate information-containing dimensions in the matrix only on those dimensions that have a non-zero value. This method allows the program to effectively use a large amount of knowledge gained from training data to evaluate natural language text that has not been seen by the system before.
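The idea of multiplying only the non-zero dimensions can be sketched with dictionaries keyed by n-gram. The representation and the function name are assumptions made for illustration, not the data structures of the disclosed program.

```python
def sparse_dot(counts, weights):
    """Dot product of two sparse vectors stored as {dimension: value} dicts.

    Dimensions absent from a dict are treated as zero, so only dimensions
    that are non-zero in BOTH vectors contribute to the result.
    """
    # Iterate over the smaller vector to minimize the number of lookups.
    if len(counts) > len(weights):
        counts, weights = weights, counts
    return sum(value * weights[dim] for dim, value in counts.items() if dim in weights)


# Hypothetical values for illustration.
counts = {"headache": 1, "migraine": 1, "headache migraine": 1}
weights = {"headache": 2.0, "migraine": 0.5, "nausea": -1.0}
print(sparse_dot(counts, weights))  # 2.5
```

The cost of the multiplication depends on the number of stored entries rather than on the nominal size of the vector space.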
According to an embodiment of the present invention, a count vector is constructed for each verbatim that has been input into the system. The size of the matrix comprising this count vector is typically about 10^32, but can be much larger if desired. The count vector contains a dimension for each n-gram, or combination of n words, of up to a certain length that could occur in the verbatim. Only those dimensions that correspond to n-grams found in the verbatim will contain a non-zero numerical value. The count vector for each verbatim is then saved by the system. However, in the step of saving, the present invention rejects those dimensions that equal zero. Therefore, only dimensions that correspond to the n-grams found in the verbatim for which the count vector has been constructed are stored. This step eliminates the storage problem that such a large matrix would pose for a conventional system or program.
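A count vector that stores only its non-zero dimensions might be built as in the sketch below. It assumes that an n-gram is a contiguous sequence of words and uses a simple lowercase alphanumeric tokenizer; both are illustrative assumptions, as are the function name and the `max_n` parameter.

```python
import re
from collections import Counter


def count_vector(verbatim, max_n=3):
    """Return a sparse count vector: {n-gram: occurrences in the verbatim}.

    Only n-grams that actually occur are stored, so every dimension that
    would be zero is simply absent from the dictionary.
    """
    # Assumed tokenizer: lowercase runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", verbatim.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return dict(counts)


print(count_vector("Headache, migraine", max_n=2))
```

For the two-word verbatim above, the stored vector holds three entries ("headache", "migraine", and "headache migraine"), regardless of how many dimensions the full vector space nominally contains.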
A similar step is taken by this embodiment of the present invention with respect to weight vectors. Like count vectors, weight vectors are very large because they ideally contain dimensions for all possible combinations of words up to a determined length. Weight vectors are generated for each classification within a classification system. Using prior art techniques, even one weight vector would be incapable of being used in a computation, much less the hundreds of weight vectors needed to properly code adverse event reports.
Weight vectors collect the numerical weights that have been associated with each n-gram for a particular class. An n-gram is insignificant to a classification if it does not tend to direct a verbatim towards or away from being coded within the subject classification. Therefore, a significant n-gram will have a non-zero value, which is assigned to the dimension of the vector that corresponds to the particular n-gram. Although the size of each vector is very large, the system and method of the present invention can store and perform calculations using such a vector because it stores only the significant dimensions. This not only eliminates the storage problems encountered by prior art systems, but also enables existing computer hardware running a computer program written according to the present invention to perform calculations using the data.
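Storing only the significant dimensions of a weight vector can be sketched as a filtering step. The function name, the `tolerance` parameter, and the example weights are hypothetical; how the weights themselves are derived is not shown here.

```python
def prune_weights(weights, tolerance=0.0):
    """Keep only the significant dimensions of a class weight vector.

    weights:   dict mapping n-gram -> weight for one classification.
    tolerance: hypothetical magnitude at or below which a weight is
               treated as zero (insignificant) and dropped.
    """
    return {ngram: w for ngram, w in weights.items() if abs(w) > tolerance}


# Hypothetical weights for a "headache" classification.
raw = {"headache": 1.2, "the": 0.0, "no headache": -0.8, "arm": 0.0}
stored = prune_weights(raw)
print(stored)  # {'headache': 1.2, 'no headache': -0.8}
```

Negative weights are kept as well as positive ones, since an n-gram that directs a verbatim away from a classification is as significant as one that directs it towards the classification.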
The apparatus and method of the present invention generally include a system that is trained to correctly code verbatims. The trained system is then used to code verbatims collected during real world clinical trials. The present invention, therefore, describes a computer-based system and method for classifying natural language text data.
The method and apparatus of the present invention feature two modes, training and classification. In training mode, input consisting of verbatims and associated truth values is introduced to the system to produce an output consisting of weight vectors. Truth values are simply the classifications to which the verbatims correspond. Training generally includes the steps of normalization, parsing, initialization, n-gram analysis, addition of frequent n-grams, and iterative refinement. In classification mode, the weight vectors produced during training are used to generate a list of truth values and confidences for each verbatim that is read into the system. The classification mode generally includes the steps of reading the weight vectors, normalization, multiplying the test vectors by the weight vectors, confidence ranking the resulting output, and producing the output.
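The classification mode steps listed above can be sketched end to end as follows. This is an illustration under stated assumptions, not the disclosed implementation: the weight vectors are hand-written rather than produced by training, normalization is reduced to lowercasing and tokenizing on ASCII letters and digits, and the raw products are ranked directly because the passage does not say how confidences are computed. All function names are hypothetical.

```python
import re


def normalize(verbatim):
    """Assumed normalization: lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", verbatim.lower())


def build_test_vector(tokens, max_n=2):
    """Sparse n-gram counts for the verbatim being classified."""
    counts = {}
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            gram = " ".join(tokens[start:start + n])
            counts[gram] = counts.get(gram, 0) + 1
    return counts


def classify(verbatim, weight_vectors, max_n=2):
    """Multiply the test vector by each class weight vector and rank classes."""
    counts = build_test_vector(normalize(verbatim), max_n)
    scores = {
        label: sum(c * weights.get(gram, 0.0) for gram, c in counts.items())
        for label, weights in weight_vectors.items()
    }
    # Highest score first; these scores stand in for confidences here.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hand-written, hypothetical weight vectors (training is not shown).
weight_vectors = {
    "HEADACHE": {"headache": 1.0, "migraine": 0.5},
    "ABDOMINAL PAIN": {"pain": 0.5, "stomach": 1.0, "belly": 1.0},
}
print(classify("Headache, migraine", weight_vectors))
print(classify("pain in the belly", weight_vectors))
```

Because each weight vector is a dictionary of significant n-grams only, adding a classification adds one more sparse product per verbatim rather than another full-size vector.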