Document analysis and retrieval have become exceedingly difficult due to the large number of available documents and a lack of uniformity in the way documents are prepared. Even if a set of documents conforms to a standard document format, it may still be difficult to compare those documents to each other because document preparers may use different words and/or terminology during document preparation. This may be the case for specialized documents such as, for example, clinical documents using the Health Level 7 (HL7) Clinical Document Architecture (CDA).
Clinical documents typically summarize the care and services given to a patient and the health conditions of that patient. For instance, a discharge summary may summarize a specific hospitalization event and a report note may summarize a surgery a patient has undergone. Considering the large number and numerous types of clinical documents available, a practitioner cannot efficiently compare one clinical document to a database of clinical documents. Under conventional techniques, a practitioner interested in comparing a given clinical document to a database of clinical documents must: (1) manually and systematically compare the clinical documents of the practitioner's patient to a large database of clinical cases, which is unrealistic; and/or (2) rely on conventional document comparison techniques.
Conventional techniques evaluate documents inefficiently because they simply compare text strings (e.g., basic keyword search). Searching for documents by keyword may be especially ineffective when analyzing medical documents because medical practitioners may use different words and/or terms to describe similar events. As a result, conventional comparison systems may not recognize a relationship between two documents that use different words but are substantively the same. For example, generic medications that are based on the same drug formula may have different names. Another example is the names of diseases. A disease like Hepatitis B may be written as initials (e.g., “HBV”), a full name (e.g., “Hepatitis B Virus”), and/or any other variation (e.g., “Hep B”). Additionally, beyond the precise identification of a term (e.g., medication, disease, symptom, etc.), conventional techniques are unable to measure the overall similarity between two documents.
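The limitation of string-based comparison described above may be illustrated with a minimal sketch. The function name `keyword_match` and the sample document snippets are hypothetical and are provided for illustration only. They do not represent any particular conventional system.

```python
def keyword_match(query: str, document: str) -> bool:
    """Return True if the query appears verbatim (ignoring case) in the document."""
    return query.lower() in document.lower()


# Hypothetical snippets from three clinical documents that refer to the same disease.
documents = [
    "Patient tested positive for HBV.",
    "Diagnosis: Hepatitis B Virus infection.",
    "History of Hep B, treated in 2010.",
]

# A basic keyword search for the full name finds only the document that
# happens to use that exact wording.
matches = [keyword_match("Hepatitis B Virus", doc) for doc in documents]
print(matches)  # [False, True, False]
```

In this sketch, two of the three snippets are missed even though all three refer to the same disease, because the comparison operates on character strings rather than on the underlying term.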