The field of the invention is computer methods for determining precursors of health abnormalities from processing patient medical records.
In mammography, much effort has been expended to characterize findings in the radiology reports prepared by radiologists. Various computer-assisted technologies have been developed to assist radiologists in detecting cancer; however, the algorithms still lack high degrees of sensitivity and specificity, and must undergo machine learning against a training set with known pathologies in order to be refined further. In a large database of reports and corresponding images, automated tools are needed just to determine which data to include in the training set. Radiologists disagree with each other over the characteristics and features that constitute a normal mammogram and over the terminology to use in the associated radiology report. Abnormal reports follow the lexicon established by the American College of Radiology Breast Imaging Reporting and Data System (BI-RADS), but even within these reports, there is a high degree of variability in the text and in the interpretation of its semantics. The focus has been on classifying abnormal or suspicious reports, but even this process needs further layers of clustering and gradation, so that individual lesions can be more effectively classified.
The knowledge to be gained by extracting and integrating meaningful information from radiology reports will have a far-reaching benefit in refining the classifications of various findings within the reports. In the near term, the overall goal is to accurately identify abnormalities reported in radiology reports amid a massive collection of reports. The challenge in achieving this goal lies in the use of natural language to describe the patient's condition.
Therefore, an automated method is needed for learning the characteristic cue phrase patterns of the natural language used in the radiology reports and using those learned patterns as a basis for automatically categorizing, clustering, or retrieving relevant data for the user.
In the paper entitled “Analysis of Mammography Reports using Maximum Variation Sampling,” by Patton, R. M., Beckerman, B., and Potok, T. E., presented at the 4th GECCO Workshop on Medical Applications of Genetic and Evolutionary Computation (MedGEC), Atlanta, Ga., July 2008, ACM Press, New York, N.Y., 2061-2064, the maximum variation sampling algorithm (MVSA) for analyzing radiological medical reports was described.
The test data set comprised approximately 120,000 reports. Within this data are numerous reports that simply state that the patient canceled an appointment. These reports are very short and are exceptionally distinct from all other reports (similarity values approaching zero). Unfortunately, the MVSA as proposed in that paper gravitated toward these cancellation reports as the best solution for a maximum variation sample. To characterize the phrase patterns of the mammography reports effectively, it is necessary to examine longer reports, so that more language can be examined for patterns. In addition, abnormal reports tend to be longer than normal reports, since the radiologist describes the abnormalities in more detail. Consequently, the MVSA described in that paper needed to be improved upon.
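The tendency of very short cancellation reports to appear maximally distinct can be illustrated with a minimal sketch. The sketch assumes a cosine similarity over simple term-count vectors; the similarity measure, the function names, and the three sample report texts are all hypothetical and are not taken from the cited paper.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a term-count vector from whitespace-separated, lowercased tokens."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity of two term-count vectors (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_similarity_to_others(index, vectors):
    """Average similarity of one report to every other report."""
    others = [
        cosine_similarity(vectors[index], v)
        for i, v in enumerate(vectors)
        if i != index
    ]
    return sum(others) / len(others)


# Hypothetical report texts, for illustration only.
reports = [
    "bilateral mammogram shows no suspicious mass or calcification",
    "bilateral mammogram shows a suspicious mass in the left breast",
    "patient canceled appointment",
]

vectors = [term_vector(r) for r in reports]
scores = [mean_similarity_to_others(i, vectors) for i in range(len(reports))]

# The report with the lowest average similarity is the one a sampler that
# maximizes variation would favor.
most_distinct = min(range(len(reports)), key=lambda i: scores[i])
```

Under these assumptions the cancellation report shares no terms with the other two reports, so its average similarity is zero and it is selected as the most distinct report, even though it carries no clinical language.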
Other techniques for reducing the number of reports and better evaluating their significance over time were also sought.