As integrated information becomes available through large patient repositories, newer decision support systems are emerging that let physicians benefit from the consensus opinions of other physicians who have looked at similar patients. These systems rely on content-based retrieval, using the underlying similarity in patients' diagnostic data to infer similarity in their diagnosed diseases. Measurement reports are an important source of this diagnostic information. These documents summarize the findings in many diagnostic settings and record important measurements taken from devices under various tests. Such reports may also contain written descriptions of the various structures and document concrete findings that point to diagnostic conclusions.
While complete natural language understanding of such reports is a challenging problem, what often needs to be extracted from them is just enough clinical information to complete a longitudinal clinical record of the patient. Although electronic medical record systems capture clinical data, the information useful for diagnosis often lies in other systems and in unstructured form, so completing a full longitudinal record of a patient can require analysis of the unstructured data. Clinical reports, particularly those available in transcription systems, radiology systems, and cardiology systems, form an important source of clinical data, such as demographic information (immunization, allergies), family history (relatives who had a disease), diagnostic exam measurements (e.g., area of the left ventricle), medications, procedures, and other treatments and their outcomes. Extracting these types of information can be reduced to two basic operations: finding textual phrases that are indicative of the type of clinical information being extracted, and finding name-value pairs that pair measurements with their values.
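The second operation, pairing measurement names with their values, can be illustrated with a minimal sketch. It assumes measurements appear as a name, a separator (`:`, `=`, or "is"), a number, and an optional unit. The pattern, the unit list, the function name `extract_pairs`, and the sample sentence are all hypothetical and chosen only for illustration; real reports would need a far richer grammar.

```python
import re

# Hypothetical pattern: measurement name, separator, numeric value, optional unit.
# The unit list is illustrative only; longer units are listed before their prefixes.
PAIR_RE = re.compile(
    r"(?P<name>[A-Za-z][A-Za-z ]*?)\s*(?::|=|\bis\b)\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>cm2|cm|mmHg|mm|%)?"
)


def extract_pairs(text):
    """Return (name, value, unit) tuples found in a report fragment."""
    pairs = []
    for match in PAIR_RE.finditer(text):
        name = match.group("name").strip().lower()
        value = float(match.group("value"))
        unit = match.group("unit")  # None when no known unit follows the value
        pairs.append((name, value, unit))
    return pairs


# Made-up report fragment, not taken from any real patient record.
report = (
    "Area of left ventricle: 18.5 cm2. "
    "Ejection fraction = 55 %. "
    "Aortic root diameter is 32 mm."
)

for name, value, unit in extract_pairs(report):
    print(name, value, unit)
```

A regular expression is used here only because it keeps the sketch short; it says nothing about how a production system would normalize measurement names or units.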
One example is inferring diagnosis labels from reports, which is an important preprocessing step for many evidence generation activities in healthcare. Knowing the diagnosis label helps classify the data and allows straightforward lookup of patients with specific diseases. It also allows patients with similar diseases to be grouped for decision support and enables consistency checking of the diagnoses recorded in electronic medical record (EMR) systems. It can also have implications for quality control and revenue cycle management, because missing or incorrect diagnosis codes can lead to lost revenue from inadequate billing, as well as liabilities and quality-of-care issues due to missed diagnoses.
Inferring diagnosis labels from reports can be quite challenging, since doctors rarely use the exact phrase that defines a diagnosis code (ICD-9). For example, a diagnosis code of mitral stenosis (394.0) may have to be inferred from a description in text such as ‘There is evidence of a stenosis of the mitral valve in the patient.’
Although free-text search engines can find exact matches to the words of a phrase in such reports, they cannot easily handle variations in how the phrase is formed, such as the one above, that still preserve the overall meaning. Finding textual phrases corresponding to a desired piece of information (such as a diagnosis label) requires (a) knowing the relevant vocabulary terms, (b) pre-cataloging possible variations in their appearance in medical text, (c) reliably spotting negations that reverse the meaning, and (d) robust algorithms for finding matching phrases that tolerate variations in the usage of terms.
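The four requirements can be made concrete with a minimal sketch. It is not the matching algorithm of any particular system: it uses plain order-insensitive term matching within a sentence, plus a naive check for negation cues that precede the matched terms. The vocabulary, the variant table, the cue list, and the function name `find_labels` are hypothetical and deliberately tiny.

```python
import re

# (a) Hypothetical vocabulary: each label lists the terms that must all appear.
VOCABULARY = {"394.0 mitral stenosis": ["mitral", "stenosis"]}

# (b) Hypothetical catalog of surface variants mapped to vocabulary terms.
VARIANTS = {"stenotic": "stenosis"}

# (c) Hypothetical negation cues; a real cue list would be much longer.
NEGATION_CUES = {"no", "not", "without", "denies"}


def find_labels(report):
    """Return (label, negated) pairs, matching terms in any order per sentence."""
    results = []
    # Split into sentences at whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", report):
        words = re.findall(r"[a-z]+", sentence.lower())
        tokens = [VARIANTS.get(word, word) for word in words]
        for label, terms in VOCABULARY.items():
            # (d) Tolerate reordering and intervening words within the sentence.
            if all(term in tokens for term in terms):
                first = min(tokens.index(term) for term in terms)
                # Naive rule: any cue before the first matched term negates it.
                negated = any(tok in NEGATION_CUES for tok in tokens[:first])
                results.append((label, negated))
    return results


print(find_labels(
    "There is evidence of a stenosis of the mitral valve in the patient."
))
print(find_labels("No evidence of mitral stenosis."))
```

The first call matches the label even though the words appear in a different order from the code definition, and the second call reports the same label as negated. Treating every preceding cue as a negation is a simplification; it would, for instance, wrongly negate a finding that follows an unrelated "no" earlier in the same sentence.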