Many medical records have been converted into electronic medical records (EMRs), and the EMRs of cooperating hospitals are available. Information in traditional paper medical records can be extracted through image character recognition techniques such as optical character recognition (OCR). Conventional OCR techniques can produce errors when the positions and lengths of strokes are misinterpreted or when the handwriting in an original manual document, such as a medical diagnosis, is imperfect.
Traditional medical record analysis relies on doctors using their experience to manually interpret and analyze information in the medical records. In some simple cases, preliminary analysis of the medical records can be done using artificial intelligence (AI) technology, automated intelligent operations, or input provided by doctors. For example, the analysis can determine that a term in the medical records such as “rectal” is associated with an anatomical site and that a term such as “tumor” is a symptom description. These types of associations, applied to a medical diagnosis description in a medical record, can be used to identify corresponding entities (for example, a sigmoid colon) and to categorize or classify the entities (for example, as an anatomical site). Medical entity identification and categorization (or classification) can be part of an entity identification process used on medical record data. However, medical record data often contains problems such as typographical errors (typos), new terms, or unknown words.
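The term-to-category associations described above can be illustrated with a minimal sketch. It assumes a simple dictionary lookup, which is not a method stated in this description; the lexicon contents, the `LEXICON` name, and the `tag_entities` function are hypothetical and chosen only to mirror the examples in the text. The second call shows how a typo causes an exact-match lookup to miss an entity.

```python
# Hypothetical lexicon: the terms and category labels follow the examples in the
# text; the dictionary structure itself is an illustrative assumption.
LEXICON = {
    "rectal": "anatomical site",
    "sigmoid colon": "anatomical site",
    "tumor": "symptom description",
}


def tag_entities(text, lexicon=LEXICON):
    """Return (term, category) pairs for lexicon terms found in text, in order of appearance."""
    lowered = text.lower()
    hits = []
    for term, category in lexicon.items():
        # Collect every occurrence of the term, not just the first one.
        start = lowered.find(term)
        while start != -1:
            hits.append((start, term, category))
            start = lowered.find(term, start + len(term))
    hits.sort()  # order by position in the text
    return [(term, category) for _, term, category in hits]


# A clean diagnosis description: all three entities are identified.
print(tag_entities("Rectal tumor near the sigmoid colon"))

# A typo ("tumro") is not in the lexicon, so that entity is silently missed.
print(tag_entities("Rectal tumro"))
```

Exact matching of this kind handles only terms already present in the lexicon, which is why typos, new terms, and unknown words are a problem for entity identification.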
It would be desirable to identify associations in languages that are stroke-based, such as Chinese, particularly in cases where conventional OCR techniques produce erroneous results.