I. Field of the Invention
The present invention relates generally to a data analysis computer program and, more particularly, to a data analysis program for analyzing sets of temporal data such as temporal health care surveillance data, and especially epidemiological data.
II. Description of the Prior Art
There are many health care databases, e.g. epidemiology databases, containing temporal data, i.e. data that are collected at periodic time intervals. Such databases, furthermore, typically include bacterial antimicrobial data, resistance data and the like at hospital, regional and national levels. Domain experts in epidemiology and laboratory medicine currently review the antimicrobial susceptibility data at half-yearly, yearly or even longer intervals in an effort to discover significant new patterns, information and trends in the data. The resulting delay in discovering such trends increases inefficiency and the cost of treatment in the medical field.
Additionally, at present domain experts perform only manual analysis of the data in an effort to discover trends and patterns of health care or epidemiological data. Such manual analysis includes database queries and confirmatory statistics to specific questions in an effort to test specific hypotheses. These traditional methods of data analysis, however, offer no way to discover patterns and trends that are not suspected by the investigators of the data. Consequently, such unsuspected trends and patterns are simply ignored and remain undiscovered even though such trends and patterns may be significant.
The present invention provides a method for analyzing sets of temporal data, especially epidemiological data, which automatically identifies significant trends and patterns in the data and does so in a timely fashion.
In brief, the method of the present invention analyzes sets of temporal data wherein each set of temporal data comprises a plurality of records collected at a time period unique to each such set. Each record has a plurality of data items including, for example, patient characteristics, the organism isolated, source of the sample, date reported, location of patient and one or more antimicrobials against which the sample was tested.
The method of the present invention includes the first step of creating data association rules for at least a plurality of sequential data sets, i.e. sequential temporal data sets, wherein each such data set includes at least some common data items. Each association rule is only considered if it has precondition support in some predetermined number of records. Otherwise, the association rule is discarded as statistically insignificant.
After determining the data association rules, the confidence factor for each such association rule is determined, where the confidence factor for the association rule A⇒B represents the likelihood or probability of B given A.
For example, given a set of data items A and a set of data items B where the intersection of A and B is empty, the confidence factor Conf(R, P) for rule R=(A⇒B) in partition P is S(A∪B)/S(A), where S(X) is the support of X in P. Such association rules, together with the confidence factor for each such rule, are stored in a history.
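By way of illustration only, the confidence computation described above may be sketched as follows. The sketch assumes that each record is represented as a set of item strings, that the support S(X) is the count of records in the partition containing every item of X, and that a rule is discarded when the support of its precondition falls below a threshold. The names `support`, `confidence` and `min_support`, and the item strings used in the usage note, are hypothetical and do not appear in the method as described.

```python
from typing import FrozenSet, Optional, Sequence

Record = FrozenSet[str]  # assumed representation: one record = a set of item strings


def support(itemset: FrozenSet[str], records: Sequence[Record]) -> int:
    """S(X): number of records in the partition that contain every item of X."""
    return sum(1 for record in records if itemset <= record)


def confidence(
    a: FrozenSet[str],
    b: FrozenSet[str],
    records: Sequence[Record],
    min_support: int = 1,  # hypothetical threshold on precondition support
) -> Optional[float]:
    """Conf(R, P) = S(A ∪ B) / S(A) for the rule with precondition A and consequent B.

    Returns None when the precondition support is below min_support,
    i.e. when the rule would be discarded as statistically insignificant.
    """
    if a & b:
        raise ValueError("A and B must be disjoint")
    s_a = support(a, records)
    if s_a == 0 or s_a < min_support:
        return None
    return support(a | b, records) / s_a
```

For instance, in a partition of four records in which three contain the hypothetical item `org=SA` and two of those also contain `oxa=R`, the sketch yields a confidence of 2/3 for the rule from `org=SA` to `oxa=R`.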
In order to determine significant patterns or trends over time, the association rules and confidence factors for the current data set are first determined. The confidence factor for each association rule is then compared with the confidence factors for the corresponding association rule, if present in the history, from previous data partitions. A change in confidence of a particular association rule, such that the probability that the change occurred by chance is less than some predefined percentage (e.g. 5%) as determined by a chi-square test of two proportions or some other applicable statistical test, generates an alert signal to the operator. Following analysis of all of the data in the current data set, the alert signals are displayed or otherwise conveyed to the operator, who then takes whatever action is appropriate.
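As a non-limiting sketch, the comparison of a rule's confidence between a previous partition and the current partition may be carried out with a chi-square test of two proportions on a 2x2 table. The sketch assumes that, for each partition, the precondition support S(A) and the joint support S(A∪B) are available from the history, that no continuity correction is applied, and that the threshold is 5%. The function names and the parameter `alpha` are hypothetical.

```python
import math
from typing import Tuple


def two_proportion_chi_square(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Chi-square test (1 degree of freedom, no continuity correction).

    x1, n1: S(A ∪ B) and S(A) in the previous partition.
    x2, n2: S(A ∪ B) and S(A) in the current partition.
    Returns (chi-square statistic, p-value).
    """
    a, b = x1, n1 - x1  # previous partition: rule holds / rule does not hold
    c, d = x2, n2 - x2  # current partition:  rule holds / rule does not hold
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A zero row or column total: no evidence of a change can be computed.
        return 0.0, 1.0
    chi2 = (n1 + n2) * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def should_alert(x1: int, n1: int, x2: int, n2: int, alpha: float = 0.05) -> bool:
    """True when the change in confidence is unlikely to have occurred by chance."""
    _, p_value = two_proportion_chi_square(x1, n1, x2, n2)
    return p_value < alpha
```

Under these assumptions, a rule whose confidence moves from 20/100 to 40/100 between partitions gives a statistic of approximately 9.52 and would generate an alert, whereas an unchanged confidence of 20/100 gives a statistic of zero and would not.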
In the preferred embodiment of the invention, the alerts are clustered into events prior to displaying such alerts to the operator. Such alert clustering groups descendant association rules with the parent association rule into an event. An association rule A1⇒B1 is defined as a descendant of association rule A2⇒B2 if the set of items in A2 is contained in A1 and, likewise, B2⊂B1. Descendancy also requires that the descendant association rule account for the change detected in the parent association rule.
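The grouping of alerts into events may be sketched, purely for illustration, as follows. The sketch implements only the set-containment part of the descendant relation: it treats containment as non-strict while excluding a rule from being its own descendant, and it does not attempt to verify that a descendant accounts for the change in its parent, since the manner of that check is not detailed here. The names `Rule`, `is_descendant` and `cluster_alerts` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumed representation: a rule is a (precondition items, consequent items) pair.
Rule = Tuple[FrozenSet[str], FrozenSet[str]]


def is_descendant(child: Rule, parent: Rule) -> bool:
    """True when the parent's precondition and consequent are contained in the child's."""
    (a1, b1), (a2, b2) = child, parent
    return child != parent and a2 <= a1 and b2 <= b1


def cluster_alerts(alerts: List[Rule]) -> Dict[Rule, List[Rule]]:
    """Group alerted rules into events keyed by their most general (root) rule."""
    # A root is an alerted rule that is not a descendant of any other alerted rule.
    roots = [r for r in alerts if not any(is_descendant(r, other) for other in alerts)]
    events: Dict[Rule, List[Rule]] = {root: [] for root in roots}
    for rule in alerts:
        for root in roots:
            if is_descendant(rule, root):
                events[root].append(rule)
    return events
```

In this sketch a rule that descends from more than one root is listed under each of them; an embodiment could equally assign it to a single event.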
A primary advantage of the present invention is that it rapidly identifies related clusters of high support association rules whose confidences change significantly over time. Using traditional methods, these clusters might be overlooked.