The present invention is directed to a method and system for evaluating biological or physical data. More particularly, the present invention is directed to a system and method for evaluating biological or physical data for detecting and/or predicting biological anomalies.
The recording of electrophysiological potentials has been available to the field of medicine since the invention of the string galvanometer. Since the 1930's, electrophysiology has been useful in diagnosing cardiac injury and cerebral epilepsy.
The state-of-the-art in modern medicine shows that analysis of R-R intervals observed in the electrocardiogram or of spikes seen in the electroencephalogram can predict future clinical outcomes, such as sudden cardiac death or epileptic seizures. Such analyses and predictions are statistically significant when used to discriminate outcomes between large groups of patients who either do or do not manifest the predicted outcome, but known analytic methods are not very accurate when used for individual patients. This general failure of known analytic measures is attributed to the large numbers of false predictions; i.e., the measures have low statistical sensitivity and specificity in their predictions.
It is usually known that something “pathological” is going on in the biological system under study, but currently available analytic methods are not sensitive and specific enough to permit utility in the individual patient.
The inaccuracy problems prevalent in the art are due to current analytic measures (1) being stochastic (i.e., based on random variation in the data), (2) requiring stationarity (i.e., the system generating the data cannot change during the recording), and (3) being linear (i.e., insensitive to nonlinearities in the data which are referred to in the art as “chaos”).
Many theoretical descriptions of dimensions are known, such as “D0” (Hausdorff dimension), “D1” (information dimension), and “D2” (correlation dimension).
D2 enables the estimation of the dimension of a system or its number of degrees of freedom from an evaluation of a sample of data generated. Several investigators have used D2 on biological data. However, it has been shown that the presumption of data stationarity cannot be met.
Another theoretical description, the Pointwise Scaling Dimension or “D2i”, has been developed that is less sensitive to the non-stationarities inherent in data from the brain, heart or skeletal muscle. This is perhaps a more useful estimate of dimension for biological data than the D2. However, D2i still has considerable errors of estimation that might be related to data non-stationarities.
A Point Correlation Dimension algorithm (PD2) has been developed that is superior to both the D2 and D2i in detecting changes in dimension in non-stationary data (i.e., data made by linking subepochs from different chaotic generators).
An improved PD2 algorithm, labeled the “PD2i” to emphasize its time-dependency, has been developed. It uses an analytic measure that is deterministic and based on caused variation in the data. The algorithm does not require data stationarity and actually tracks non-stationary changes in the data. Also, the PD2i is sensitive to chaotic as well as non-chaotic, linear data. The PD2i is based on previous analytic measures that are, collectively, the algorithms for estimating the correlation dimension, but unlike them it is insensitive to data non-stationarities. Because of this feature, the PD2i can predict clinical outcomes with a high sensitivity and specificity that the other measures cannot achieve.
The PD2i algorithm is described in detail in U.S. Pat. Nos. 5,709,214 and 5,720,294, hereby incorporated by reference. For ease of understanding, a brief description of the PD2i and a comparison of this measure with others are provided below.
The model for the PD2i is C(r,n,ref*) ~ r^D2 (that is, r raised to the power D2), where ref* is an acceptable reference point from which to make the various m-dimensional reference vectors, because these will have a scaling region of maximum length PL that meets the linearity (LC) and convergence (CC) criteria. Because each ref* begins with a new coordinate in each of the m-dimensional reference vectors and because this new coordinate could be of any value, the PD2i's may be independent of each other for statistical purposes.
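The scaling model above can be illustrated with a minimal sketch of a correlation integral computed from a single reference vector. This is an illustration under stated assumptions and not the patented implementation: the function names `embed` and `pointwise_correlation_integral`, the use of Euclidean distance, and the delay parameter `tau` are all choices made for this example.

```python
import math

def embed(series, m, tau=1):
    """Return the m-dimensional delay vectors of a scalar series (assumed embedding)."""
    n_vectors = len(series) - (m - 1) * tau
    return [[series[i + k * tau] for k in range(m)] for i in range(n_vectors)]

def pointwise_correlation_integral(series, ref, m, radii, tau=1):
    """For each r in radii, return the fraction of j-vectors lying within
    distance r of the single reference vector at index ref."""
    vectors = embed(series, m, tau)
    ref_vec = vectors[ref]
    # Vector-difference lengths between the reference vector and every other vector.
    dists = [math.dist(ref_vec, v) for j, v in enumerate(vectors) if j != ref]
    return [sum(d <= r for d in dists) / len(dists) for r in radii]
```

For a linear ramp embedded with m=1, the fraction of neighbours doubles whenever r doubles, so the slope of log C versus log r is 1, consistent with a one-dimensional data set.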
The PD2i algorithm limits the range of the small log-r values over which linear scaling and convergence are judged by the use of a parameter called Plot Length. The value of this entry determines for each log-log plot, beginning at the small log-r end, the percentage of points over which the linear scaling region is sought.
In non-stationary data, the small log-r values between a fixed reference vector (i-vector) in a subepoch that is, say, a sine wave, when subtracted from multiple j-vectors in, say, a Lorenz subepoch, will not make many small vector-difference lengths, especially at the higher embedding dimensions. That is, there will not be abundant small log-r vector-difference lengths relative to those that would be made if the j-vector for the Lorenz subepoch was instead in a sine wave subepoch. When all of the vector-difference lengths from the non-stationary data are mixed together and rank ordered, only those small log-r values between subepochs that are stationary with respect to the one containing the reference vector will contribute to the scaling region, that is, to the region that will be examined for linearity and convergence. If there is significant contamination of this small log-r region by other non-stationary subepochs, then the linearity or convergence criterion will fail, and that estimate will be rejected from the PD2i mean.
The PD2i algorithm introduced to the art the idea that only the smallest initial part of the linear scaling region should be considered if data non-stationarities exist (as they always do in biological data). This is because when the j-vectors lie in a subepoch of data that is the same species as the one containing the i-vector (reference vector), then and only then will the smallest log-r vectors be made abundantly, that is, in the limit or as data length becomes large. Thus, to avoid contamination in the correlation integral by species of data that are non-stationary with respect to the species the reference vector is in, one skilled in the art must look only at the slopes in the correlation integral that lie just a short distance beyond the “floppy tail”.
The “floppy tail” is the very smallest log-r range in which linear scaling does not occur due to the lack of points in this part of the correlation integral resulting from finite data length. Thus, by restricting the PD2i scaling to the smallest part of the log-r range above the “floppy tail,” the PD2i algorithm becomes insensitive to data non-stationarities. Note that the D2i always uses the whole linear scaling region, which always will be contaminated if non-stationarities exist in the data.
FIG. 1A shows a plot of log C(r,n,nref*) versus log r. This illustrates a crucial idea behind the PD2i algorithm. It is only the smallest initial part of the linear scaling region that should be considered if data non-stationarities exist. In this case the data were made by concatenating 1,200-point data subepochs from a sine wave, Lorenz data, a sine wave, Henon data, a sine wave, and random noise. The reference vector was in the Lorenz subepoch. For the correlation integral where the embedding dimension m=1, the segment for the floppy tail (“FT”) is avoided by a linearity criterion of LC=0.30; the linear scaling region for the entire interval (D2i) is determined by plot length PL=1.00, convergence criterion CC=0.40 and minimum scaling MS=10 points. The species-specific scaling region where the i- and j-vectors are both in the Lorenz data (PD2i) is set by changing plot length to PL=0.15 or lower. Note that at the higher embedding dimensions (e.g. m=12) after convergence of slope vs embedding dimension has occurred, the slope for the PD2i segment is different from that of D2i. This is because the upper part of the D2i segment (D2i-PD2i) is contaminated by non-stationary i-j vector differences where the j-vector is in a non-stationary species of data with respect to the species the i-vector is in.
This short-distance slope estimate for PD2i is perfectly valid, for any log-log plot of a linear region; it does not matter whether or not one uses all data points or only the initial segment to determine the slope. Thus, by empirically setting Plot Length to a small interval above the “floppy tail” (the latter of which is avoided by setting the linearity criterion, LC), non-stationarities can be tracked in the data with only a small error, an error which is due entirely to finite data length, and not to contamination by non-stationarities.
Thus, by appropriate adjustments in the algorithm to examine only that part of the scaling region just above the “floppy tail”, which is determined by (1) the Linearity Criterion, LC, (2) the Minimum Scaling criterion, MS, and (3) the Plot Length criterion, PL, one skilled in the art can eliminate the sensitivity of the measure to data non-stationarities.
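The restriction of the slope estimate to the small log-r segment can be sketched as follows. This is a simplified illustration, not the algorithm of the referenced patents: the function name `restricted_slope` is hypothetical, the parameter `tail_skip` is an assumed stand-in for the floppy-tail exclusion that the text attributes to the linearity criterion, and the convergence criterion is not modeled at all. Only the roles of Plot Length and Minimum Scaling follow the description above.

```python
import math

def restricted_slope(distances, plot_length=0.15, min_scaling=10, tail_skip=0):
    """Least-squares slope of log C versus log r over the initial segment of the
    rank-ordered vector-difference lengths, or None if the segment is rejected."""
    d = sorted(x for x in distances if x > 0)
    n = len(d)
    # Rank ordering gives the cumulative fraction: C(d[k]) = (k + 1) / n.
    points = [(math.log(r), math.log((k + 1) / n)) for k, r in enumerate(d)]
    start = tail_skip                       # assumed floppy-tail cut-off (hypothetical)
    stop = start + int(plot_length * n)     # Plot Length as a fraction of the points
    segment = points[start:stop]
    if len(segment) < min_scaling:
        return None                         # fewer than MS points: estimate rejected
    mean_x = sum(x for x, _ in segment) / len(segment)
    mean_y = sum(y for _, y in segment) / len(segment)
    sxx = sum((x - mean_x) ** 2 for x, _ in segment)
    if sxx == 0:
        return None                         # all distances equal: slope undefined
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in segment)
    return sxy / sxx
```

With `plot_length=1.0` the same function fits the whole rank-ordered range, which corresponds to the D2i-style use of the entire scaling region described above, so the two settings can be compared on the same set of distances.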
This is the “trick” of how to make the j-vectors come from the same data species that the i-vector is in, and this can be proven empirically by placing a graphics marker on the i- and j-vectors and observing the markers in the correlation integral. This initial part of the scaling region is seen mathematically to be uncontaminated only in the limit, but practically speaking it works very well for finite data. This can be proven computationally with concatenated data. When the PD2i is used on concatenated subepochs of data made by sine-, Lorenz-, Henon-, and other types of known linear and nonlinear data-generators, the short scaling segment will have vector-difference lengths made only by i- and j-vector differences that are stationary with respect to each other; that is, the errors for 1,200-point subepochs are found to be less than 5.0% from their values at the limit, and these errors are due to the finite data length, not scaling contamination.
FIG. 1B illustrates a comparison of the calculation of the degrees of freedom of a data series by two nonlinear algorithms, the Point Correlation Dimension (PD2i) and the Pointwise Scaling Dimension (D2i). Both of these algorithms are time-dependent and are more accurate than the classical D2 algorithm when used on non-stationary data. Most physiological data are nonlinear because of the way the system is organized (the mechanism is nonlinear). The physiological systems are inherently non-stationary because of uncontrolled neural regulations (e.g., suddenly thinking about something “fearful” while sitting quietly generating heartbeat data).
Non-stationary data can be made noise-free by linking separate data series generated by mathematical generators having different statistical properties. Physical generators will always have some low-level noise. The data shown in FIG. 1B (DATA) were made of sub-epochs of sine (S), Lorenz (L), Henon (H) and random (R) mathematical generators. The data series is non-stationary by definition, as each sub-epoch (S, L, H, R) has different stochastic properties, i.e., different standard deviations, but similar mean values. The PD2i and D2i results calculated for the data are seen in the two traces below it and are very different. The D2i algorithm is the closest comparison algorithm to PD2i, but it does not restrict the small log-r scaling region in the correlation integral, as does the PD2i. This scaling restriction is what makes the PD2i work well on non-stationary data.
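A test series of this kind can be sketched as below. The generator parameters are assumptions chosen for illustration (the classic Henon values a=1.4 and b=0.3, the classic Lorenz values sigma=10, rho=28 and beta=8/3, a simple Euler step for the Lorenz system, a sine period of 50 samples, and a seeded uniform random source); the parameters used for FIG. 1B are not stated in the text, and no attempt is made here to match the means of the sub-epochs.

```python
import math
import random

def sine_epoch(n, period=50.0):
    """n samples of a sine wave; the period is an assumed value."""
    return [math.sin(2 * math.pi * i / period) for i in range(n)]

def henon_epoch(n, a=1.4, b=0.3):
    """x-coordinate of the Henon map, started at the origin."""
    x, y, out = 0.0, 0.0, []
    for _ in range(n):
        x, y = 1.0 - a * x * x + y, b * x   # right-hand side uses the old x
        out.append(x)
    return out

def lorenz_epoch(n, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """x-coordinate of the Lorenz system, integrated with a plain Euler step."""
    x, y, z, out = 0.0, 1.0, 1.05, []
    for _ in range(n):
        dx, dy, dz = sigma * (y - x), x * (rho - z) - y, x * y - beta * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        out.append(x)
    return out

def random_epoch(n, seed=0):
    """Uniform random values in [-1, 1] from a seeded, reproducible source."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

def concatenated_series(n=1200):
    """Link S, L, H and R sub-epochs of n points each into one series."""
    epochs = [sine_epoch(n), lorenz_epoch(n), henon_epoch(n), random_epoch(n)]
    return [v for epoch in epochs for v in epoch]
```

Because every sub-epoch comes from a mathematical generator, the boundaries between data species are known exactly, which is what allows a dimension estimate to be checked sub-epoch by sub-epoch.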
The PD2i results shown in FIG. 1B, using default parameters (LC=0.3, CC=0.4, Tau=1, PL=0.15), are for 1,200 data-point sub-epochs. Each sub-epoch PD2i mean is within 4% of the known value of D2 calculated for each data type alone (using long data lengths). The known D2 values for S, L, H, and R data are, respectively, 1.00, 2.06, 1.26, and infinity. Looking at the D2i values, one sees quite different results (i.e., spurious results). Note that the D2i is the closest algorithm to PD2i, because it too is time-dependent. However, the D2i requires data stationarity, as does the D2 value itself. For stationary data, D2=D2i=PD2i. Only the PD2i tracks the correct number of degrees of freedom for non-stationary data. The single value of D2 calculated for the same non-stationary data is approximated by the mean of the D2i values shown.
For analysis by the PD2i, an electrophysiological signal is amplified (gain of 1,000) and digitized (1,000 Hz). The digitized signal may be further reduced (e.g. conversion of ECG data to RR interval data) prior to processing. Analysis of RR-interval data has been repeatedly found to enable risk-prediction between large groups of subjects with different pathological outcomes (e.g. ventricular fibrillation “VF” or ventricular tachycardia “VT”). It has been shown that, using sampled RR data from high risk patients, PD2i could discriminate those that later went into VF from those that did not.
For RR-interval data made from a digital ECG that is acquired with the best low-noise preamps and fast 1,000-Hz digitizers, there is still a low level of noise that can cause problems for nonlinear algorithms. The algorithm used to make the RR-intervals can also lead to increased noise. The most accurate of all RR-interval detectors uses a 3-point running “convexity operator.” For example, 3 points in a running window that goes through the entire data can be adjusted to maximize its output when it exactly straddles an R-wave peak; point 1 is on the pre R-wave baseline, point 2 is atop the R-wave, point 3 is again on the baseline. The location of point 2 in the data stream correctly identifies each R-wave peak as the window goes through the data. This algorithm will produce considerably more noise-free RR data than an algorithm which measures the point in time when an R-wave goes above a certain level or is detected when the dV/dt of each R-wave is maximum.
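One plausible reading of the 3-point convexity operator is sketched below. The operator form (twice the center sample minus the two outer samples), the `half_width` spacing between the three points, and the `threshold` used to accept a peak are assumptions made for this example; the text states only that the window output is maximal when the window straddles an R-wave peak.

```python
def convexity(signal, half_width):
    """Return 2*center - left - right at every position where the window fits."""
    w = half_width
    return [2 * signal[i] - signal[i - w] - signal[i + w]
            for i in range(w, len(signal) - w)]

def detect_r_peaks(signal, half_width, threshold):
    """Return sample indices where the convexity output is a local maximum
    at or above threshold (point 2 of the window sits on the R-wave peak)."""
    w = half_width
    conv = convexity(signal, w)
    peaks = []
    for k in range(1, len(conv) - 1):
        if conv[k] >= threshold and conv[k] > conv[k - 1] and conv[k] >= conv[k + 1]:
            peaks.append(k + w)   # shift back to an index into signal
    return peaks

def rr_intervals_ms(peaks, sample_rate_hz=1000):
    """Convert successive peak indices into RR intervals in milliseconds."""
    return [(b - a) * 1000.0 / sample_rate_hz for a, b in zip(peaks, peaks[1:])]
```

At the 1,000-Hz digitizing rate mentioned above, one sample corresponds to 1 msec, so the difference between successive peak indices is the RR interval directly.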
The best algorithmically calculated RR-intervals still will have a low level of noise that is observed to be approximately +/−5 integers, peak-to-peak. This 10-integer range is out of 1,000 integers for an average R-wave peak (i.e., 1% noise). With poor electrode preparation, strong ambient electromagnetic fields, the use of moderately noisy preamps, or the use of lower digitizing rates, the low-level noise can easily increase. For example, at a gain where 1 integer=1 msec (i.e., a gain of 25% of a full-scale 12-bit digitizer), this best noise level of 1% can easily double or triple, if the user is not careful with the data acquisition. This increase in noise often happens in a busy clinical setting, and thus post-acquisition consideration of the noise level must be made.
There is thus a need for an improved analytic measure that takes noise into consideration.