Large-scale electronic medical record systems offer a wealth of data for comparative effectiveness research and clinical decision support. The data in such systems include diagnoses, demographics, vitals, diagnostic exams and tests, treatment histories, and patient outcomes, all of which can be used to explore many scientific questions.
One such question is how to find statistically meaningful co-morbidity relations between diseases using population-level statistics. Traditionally, co-morbidities have been studied through targeted research on candidate disease pairs, such as heart disease and diabetes, within populations of patients diagnosed with those diseases, where the nature of the relationship is at least known. Extending such studies to a large scale raises several difficulties. First, obtaining meaningful results from large-scale disease-association studies usually depends on the accuracy and completeness of the information recorded in an electronic medical record (EMR). Diagnosis codes often subsume many conditions (e.g., congestive heart failure) or are generic in nature (e.g., relating generally to mitral valve disorders), so on their own they are not reliable indicators of the actual underlying disease. Additional diagnosis inferences must therefore be made from the clinical information recorded in the EMR, such as textual reports, or from multiple sources of evidence for a disease, such as the medications prescribed. Extracting such information from the free text of reports using natural language processing techniques has met with only limited success. Finally, the choice of data mining algorithm can also affect which co-morbidities are discovered.
Popular association mining methods, such as Apriori-based methods, can generate many spurious associations because they search exhaustively through combinations of items. Methods that rely primarily on the frequency of disease co-occurrence can lead to incorrect causal inferences, particularly for diseases that persist over time and are common across many patients, e.g., hypertension. Attempts to handle the time-varying nature of relationships in data mining methods have also been limited to using discrete interval combinations.
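The frequency problem noted above can be illustrated with a minimal sketch. It is not the method of any particular system. The patient records are invented toy data, and the function name `pair_stats` and the disease labels are hypothetical. The sketch compares raw co-occurrence counts with lift, a standard association measure that divides the observed co-occurrence rate by the rate expected if the two diseases were independent.

```python
from collections import Counter
from itertools import combinations


def pair_stats(records):
    """Return {(a, b): (co_occurrence_count, lift)} for every disease pair.

    lift = P(a and b) / (P(a) * P(b)); a value near 1 means the pair
    co-occurs about as often as independence would predict.
    """
    n = len(records)
    single = Counter()  # number of patients with each disease
    pair = Counter()    # number of patients with both diseases of a pair
    for diagnoses in records:
        single.update(diagnoses)
        pair.update(combinations(sorted(diagnoses), 2))
    return {
        (a, b): (co, (co * n) / (single[a] * single[b]))
        for (a, b), co in pair.items()
    }


# Invented toy population: hypertension is common (8 of 10 patients),
# so it co-occurs often with diabetes without being associated with it.
records = [
    {"hypertension", "diabetes"},
    {"hypertension", "diabetes"},
    {"hypertension", "diabetes"},
    {"hypertension", "diabetes"},
    {"diabetes"},
    {"hypertension", "mitral_valve_disorder", "atrial_fibrillation"},
    {"hypertension", "mitral_valve_disorder", "atrial_fibrillation"},
    {"hypertension"},
    {"hypertension"},
    {"asthma"},
]

stats = pair_stats(records)
most_frequent = max(stats, key=lambda k: stats[k][0])
highest_lift = max(stats, key=lambda k: stats[k][1])
print("most frequent pair:", most_frequent, stats[most_frequent])
print("highest-lift pair: ", highest_lift, stats[highest_lift])
```

In this toy population the most frequent pair is diabetes with hypertension, yet its lift is 1.0, so the count reflects only how common hypertension is. The rarer pair of atrial fibrillation with mitral valve disorder has the highest lift. Lift alone does not establish causation or account for timing, so it addresses only the frequency issue and not the temporal one.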