1. Field of the Invention
The present invention relates generally to the field of assessing hearing capacity in humans. More particularly, the invention is a method for quantifying the probability that an auditory brainstem response (ABR) is present in an electrophysiologic recording from a human infant.
2. Background of the Invention
The ABR is a waveform of fluctuating electrical potential over time, which may occur in response to a brief, transient acoustic stimulus such as a click. The ABR originates in the neurons of the auditory nerve and its higher connections in the brain stem. When recorded from electrodes on the scalp or neck, it is less than one microvolt in size, and is obscured by much larger ongoing random potentials that arise elsewhere in the brain and the musculature of the head and neck. Computer summation or averaging of the responses to several thousand stimuli presented at rates typically in the range of 20-50 per second is required to enhance the ABR "signal" relative to the background electrical "noise", and to render it visually detectable in the summed or averaged response.
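The effect of averaging described above can be illustrated with a short simulation. This is only a sketch under stated assumptions: the toy sinusoidal "response", the Gaussian noise model, the 5 microvolt noise level, and the function names are all hypothetical and are not taken from any ABR instrument. It shows that the noise remaining in the average shrinks roughly as one over the square root of the number of sweeps.

```python
import math
import random
import statistics

def average_sweeps(signal, noise_sd, n_sweeps, rng):
    """Average n_sweeps simulated recordings of `signal` plus Gaussian noise."""
    total = [0.0] * len(signal)
    for _ in range(n_sweeps):
        for i, s in enumerate(signal):
            total[i] += s + rng.gauss(0.0, noise_sd)
    return [t / n_sweeps for t in total]

def residual_noise_sd(signal, averaged):
    """Spread of the noise left in the average once the known signal is removed."""
    return statistics.pstdev([a - s for a, s in zip(averaged, signal)])

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 0.5 microvolt toy waveform, 50 samples long (an assumption).
    toy = [0.5 * math.sin(2 * math.pi * i / 50) for i in range(50)]
    for n in (20, 2000):
        avg = average_sweeps(toy, 5.0, n, rng)
        print(n, "sweeps: residual noise SD", round(residual_noise_sd(toy, avg), 3))
```

With noise ten times larger than the toy response, a few tens of sweeps leave the response buried, whereas a few thousand sweeps bring the residual noise well below the response amplitude.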
The presence or absence of an ABR for a specific type and intensity of stimulus can be used as a proxy for overt behavioral response, indicating whether or not the stimulus was audible. This is the basis for an electrophysiologic hearing screening test, of particular value in subjects such as infants who are unable to give reliable, behavioral responses to sound.
In the newborn population, it is widely acknowledged that it is important to detect and manage hearing loss as early as possible, and preferably in the first six months, to facilitate development of speech, language and cognitive skills. In 1993, the National Institute on Deafness and Other Communication Disorders sponsored a Consensus Conference on Early Identification of Hearing Impairment in Infants and Young Children. That conference recommended screening for identification of hearing impairment in the newborn period for all infants regardless of the presence or absence of risk factors for hearing loss, that is, universal infant hearing screening. These recommendations were endorsed and reiterated soon after by the American Academy of Audiology and the Joint Committee on Infant Hearing, 1994. Many states have recently implemented, or are in the process of implementing, such screening programs. This widespread endorsement of mass hearing screening of neonates and infants has created a challenge for scientists and clinicians to have fast and accurate tools ready for evaluating potential hearing loss in infants.
ABR testing is well established as a core part of most screening protocols. The clinical utility of ABR-based hearing screening tests depends critically on the accuracy of the ABR detection decisions. Such decisions are intrinsically prone to error, because they involve the detection of a signal in random noise that may obscure a genuine signal or masquerade as a signal when none is present. False-positive ABR detection leads to a false-negative screening test: the hearing-impaired child passes the screening test and receives no intervention. Other manifestations of disorder may be ignored, given that the test was passed, so the screening does active harm. False-negative ABR detection tests cause false-positive screening tests; this precipitates needless follow-up diagnostic assessment costs, as well as indirect costs of mislabeling a normal child.
A distinction must be drawn between detection tests that are empirical and those that are analytic. Empirical tests are based upon experimental studies of the distributions of a given test statistic when response is thought to be present or absent. Usually the determination of response presence or absence is based on expert subjective assessment of the average records obtained in a set of subjects. There are two major difficulties with this approach. First, the expert judgments may be wrong, which clearly confounds the assessment of the accuracy of the test statistic. Second, there is no proof that the results observed in one set of subjects will necessarily apply to a different set of subjects or to a situation in which any feature of the data recording or analysis is changed. This is a failure of generalizability of the empirical validation process.
Analytic methods, in contrast, do not appeal to experimental validation datasets. They are based upon known properties of known statistical distributions relating to the chosen test statistic. Thus, rather than relying on empirical experimental data, analytic methods capitalize upon the vast body of statistical distribution theory and statistical tables of distributions. It is necessary to show that real data satisfy certain assumptions that are required for certain distributions to pertain, but these assumptions may be weak, easily satisfied, and easily proven to hold. Such methods are both highly quantitative, yielding known and specifiable rates of decision error, and highly generalizable across datasets and measurement conditions.
A crucial characteristic of a good statistical response detection test is that it has the highest possible statistical power. Power is the probability that the test will correctly detect a response that is genuinely present. Less than optimal test power is very disadvantageous in practical terms. First, a loss of power translates directly to a longer test time than necessary to reach the statistical criterion for response detection. This is a major practical disadvantage because some babies yield satisfactory measurement conditions for only brief periods of time, so they may be untestable due to test inefficiency. Second, testable babies will take longer to test than necessary, which increases costs and decreases throughput. This factor will be especially crucial in light of the implementation of universal newborn hearing evaluation protocols now mandated in many states. Third, the use of a test that is less powerful than necessary will result in larger rates of detection decision error than would be possible with a more powerful test.
Prior Art Detection Systems
Current approaches to automated detection of ABRs include techniques that evaluate the time-domain waveform and those that assess spectral characteristics (frequency domain). Automated detection of neonatal click-evoked ABR to low-level stimuli for mass screenings has primarily involved analysis in the time domain, although one known system includes both time and frequency domain analysis. At present, four systems have been used or sold as "automated infant ABR screening" devices. By that we refer to those devices in which decisions regarding ABR presence or absence, or test "Pass" or "Fail" (sometimes called "Refer"), are made by the system itself (not by the examiner) based on some predetermined criteria that are discussed below.
The general approach of the detection algorithm employed by the most commonly used system for automated ABR detection in infants appears to be as follows: A set of sample points is weighted according to their relative magnitude in the standard infant ABR waveform. It is not clear how the position or number of the data points is selected. The polarity of the amplitude at each point in a standard or template waveform is compared with that observed at the corresponding latency in each sweep during averaging. Each time a sweep is sampled, a correspondence of polarity between the data and the template at one of the selected time points yields a count of +1. After every 500 sweeps, the template points are shifted in increments of 0.25 ms over a 3 ms range to locate the position of maximum polarity correspondence. Presumably, this is done using an accumulated average of some kind, but this is not clear. Each sample in each sweep constitutes a trial, and running counts of the numbers of polarity matches and trials are accumulated. Because the probability of a polarity match for each point is 0.5 if the response is absent, a quantitative hypothesis test can be constructed based on a binomial model. This technique appears to be a combination of template cross-correlation with a multi-point amplitude-based detection criterion after a one-bit conversion.
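The polarity-counting idea and its binomial hypothesis test can be sketched as follows. This is a minimal illustration, not the commercial algorithm, whose details are not public: the template layout (a mapping from sample index to expected sign), the function names, and the omission of the weighting and template-shifting steps are all assumptions. The binomial tail computed here is valid only if every trial is independent.

```python
import math

def count_polarity_matches(sweep, template_signs):
    """Count selected points whose polarity agrees with the template.

    template_signs maps a sample index to +1 or -1 (a hypothetical layout).
    """
    matches = 0
    for index, sign in template_signs.items():
        if sweep[index] * sign > 0:
            matches += 1
    return matches

def binomial_upper_tail(matches, trials):
    """Exact P(X >= matches) for X ~ Binomial(trials, 0.5).

    Integer arithmetic is used so that large trial counts do not underflow.
    """
    favourable = sum(math.comb(trials, k) for k in range(matches, trials + 1))
    return favourable / 2 ** trials

if __name__ == "__main__":
    template = {0: +1, 1: -1, 3: +1}      # hypothetical three-point template
    sweep = [0.3, -0.2, 0.1, -0.4]        # hypothetical single sweep
    print(count_polarity_matches(sweep, template), "matches out of", len(template))
    print("p-value for 2600 matches in 5000 trials:", binomial_upper_tail(2600, 5000))
```

A small tail probability would be taken as evidence that a response is present, under the independence assumption just noted.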
The detection algorithm used in this system is statistically based but is far from analytic. Specific disadvantages are as follows:
Lack of validity: The algorithm effectively counts the number of times the polarity of the recorded activity matches the expected polarity of an ABR template waveform. Several points are tested per sweep and the number of polarity coincidences is also summed over many sweeps.
Because the successive data points within each sweep are not statistically independent, the sampling distribution of the number of coincidences will not have the binomial distribution that is assumed. This means that the actual error rates of the test are not represented accurately in statistical tables. Therefore, the method is substantially empirical. Actual error rates may only be determined by experiment, with the limitations on quantitative accuracy and generalizability noted earlier.
Power Sub-optimality: This detection algorithm counts correspondence events between observed and expected polarity of activity at specific times. The actual amplitude of the observed signal is not fully utilized, only the polarity. An analogy can be drawn to the use of the Ordinary Sign Test instead of the Student t-test to examine the hypothesis that the true mean of a sample of n observations is zero. The sign test uses only the polarity of the data, whereas the t-test, which is the most powerful test possible under the assumption of normal error distributions, uses all of the amplitude information. The asymptotic relative efficiency of the sign test is 2/pi, or 64%. This implies that any sign-based detection method will sustain a substantial loss of power.
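The practical meaning of the 2/pi efficiency figure can be shown with a small calculation. The sketch below assumes normally distributed errors and treats each sweep as one independent observation, which is a simplification; the function name is hypothetical. An asymptotic relative efficiency of 2/pi means that a sign-based test needs roughly pi/2 times as many observations as the t-test to reach the same power.

```python
import math

# Asymptotic relative efficiency of the sign test versus the t-test,
# valid under the assumption of normally distributed errors.
SIGN_TEST_ARE = 2 / math.pi

def sweeps_needed_by_sign_test(sweeps_needed_by_t_test):
    """Approximate sweep count a sign-based test needs to match the t-test's power."""
    return math.ceil(sweeps_needed_by_t_test / SIGN_TEST_ARE)

if __name__ == "__main__":
    print("ARE:", round(SIGN_TEST_ARE, 3))
    print("Sweeps needed instead of 2000:", sweeps_needed_by_sign_test(2000))
```

Because the figure is asymptotic, the ratio for any finite number of sweeps is only approximate.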
Another commercially available instrument for ABR hearing screening includes automatic detection that is based on the following algorithm: For any particular stimulus level, the system acquires two ABRs with a fixed stimulus level. The averages are stopped if the estimated signal-to-noise ratio (SNR) exceeds one or after 1,024 sweeps. If both averages have SNR > 1, the response is deemed present. If not, a cross correlation analysis is performed. The latency region of 5 to 12.5 ms post-stimulus is sectioned into seven overlapping "windows", each of 2 ms duration. For each window position, a Pearson correlation coefficient is calculated between the data values of the two averages at the successive time points in the given window. The test variable is the maximum absolute value among the seven correlations covering all window positions. If the test variable exceeds 0.9, the ABR is deemed to be present. This approach to automatic detection is an adaptation of a simple, correlation-based detection method first reported for the ABR by Weber, B. A. and Fletcher, G. L. (1980). A Computerized Scoring Procedure for Auditory Brainstem Response Audiometry. Ear and Hearing, 1, 233-236.
The detection algorithm of this device is highly empirical. The primary detection statistic is a cross correlation coefficient between two independent averages using the region of anticipated response. The test statistic is the absolute maximum of the observed correlations. Because of the extensive correlation (autocorrelation) between successive data values in each of the averages, the statistical distribution of the test statistic is unknown, and detection error rates cannot be derived from statistical tables. The critical values for the test statistic, and the error rates, can only be estimated by experiment. Indeed, they were selected using empirical data with expert subjective judgment as the gold standard for response presence or absence. The serious limitations of this method were described earlier.
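The windowed correlation test described above for this second instrument can be sketched as follows. The even spacing of the seven window start times, the sample period argument, and all function names are assumptions made for illustration; the source gives only the latency region, the window count and duration, and the 0.9 criterion.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def max_windowed_correlation(avg_a, avg_b, sample_period_ms,
                             start_ms=5.0, end_ms=12.5,
                             window_ms=2.0, n_windows=7):
    """Largest absolute correlation over evenly spaced, overlapping windows."""
    # Even spacing of window starts is an assumption, not stated in the source.
    step = (end_ms - window_ms - start_ms) / (n_windows - 1)
    best = 0.0
    for w in range(n_windows):
        t0 = start_ms + w * step
        i0 = round(t0 / sample_period_ms)
        i1 = round((t0 + window_ms) / sample_period_ms)
        best = max(best, abs(pearson(avg_a[i0:i1], avg_b[i0:i1])))
    return best

def abr_present(avg_a, avg_b, sample_period_ms, criterion=0.9):
    """Apply the 0.9 criterion to the maximum absolute windowed correlation."""
    return max_windowed_correlation(avg_a, avg_b, sample_period_ms) > criterion
```

Each average is expected to cover at least 12.5 ms of post-stimulus time at the given sample period. Taking the maximum over several overlapping windows of autocorrelated data is what makes the null distribution of this statistic hard to derive.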
Details of response detection in a third prior art system are proprietary, although the manufacturer has released a non-detailed description of the decision-making system. Briefly, the system evaluates three aspects of the infant ABR in the decision process. First, the system determines presence or absence of an ABR in a record by evaluating (a) the presence of a predetermined spectral component of the response, using a multivariate analysis simultaneously assessing both real and imaginary components of a specified Fourier component and (b) an F.sub.SP-like signal-to-noise estimate. If those criteria are met, the waveform morphology is checked with a type of template match that evaluates certain features of the waveform (peak number and placement). It appears that all three aspects of the response detection algorithm must be satisfied for an infant to receive a "pass" from this system.
Limited information is available about this proprietary detection algorithm. It is based on a combinatorial approach using four types of measure: template (wave shape) and non-template features in both the time and frequency domains. The frequency domain algorithm involves examining the distribution of sine and cosine parts of several harmonics of the Fourier spectrum of the recorded activity, and a comparison with the expected values for both noise and ABR signals. This is combined in an unknown manner with a "modification of the so-called F.sub.SP technique". The detection stage is followed by a verification stage that examines the extent to which the detected and estimated waveform matches expected waveshape characteristics.
The performance of this approach is not known and not derivable from statistical distribution theory. The multi-component nature of the method virtually guarantees that it is not of analytic strength, but that it will be empirical. An alleged advantage is its exploitation of both time-domain and frequency-domain features. This is highly questionable, because the time-history and Fourier spectrum of any activity are linear transformations of each other and contain identical underlying information.
F.sub.SP
A fourth prior art technique that has been applied to infant ABR detection for screening is F.sub.SP. This technique is described in Elberling, C. & Don, M. (1984). Quality Estimation of Averaged Auditory Brainstem Responses. Scand Audiol, 13, 187-197 and Don, M., Elberling, C. & Waring, M. (1984). Objective Detection of Averaged Auditory Brainstem Responses. Scand Audiol, 13, 219-228. This technique is not applied commercially for specific use in infant screening but is available on some commercial evoked potential systems for general use (Neuroscan and Nicolet "Spirit") and was applied to automated newborn hearing screening by the first named inventor of the present invention in a multi-center study funded by the National Institute on Deafness and Other Communication Disorders. F.sub.SP involves calculation of a variance ratio (hence the F), the numerator of which is essentially the sample variance of the average and the denominator of which is the variance of the set of data values at a fixed single point (hence the "SP") in the time window across a group of sweeps.
F.sub.SP is used to estimate the "quality" or the signal-to-noise ratio of an auditory evoked potential. Calculation of F.sub.SP is based on the fact that any ABR recording contains background noise (random brain and muscle activity not related to the auditory signal) and, if the signal is audible to the subject, each recording also contains neural activity from the auditory system that is systematic in scalp recorded morphology and time-locked to the onset of the eliciting auditory signal. For any given single, digitized time point in the averaged ABR waveform, the neural contribution to the amplitude measured at that point is constant from sweep to sweep whereas the noise contribution to amplitude should be random. Consequently, the neural response will contribute nothing to the variance of the amplitude at any single point, and the sweep to sweep variance of a single point in the analysis window can be used as an accurate estimator of the variance of the background noise in the recording. This is referred to as VAR(sp).
Calculation of F.sub.SP is illustrated in FIG. 1. The magnitude of the averaged response can be characterized by the point to point variance of the digitized amplitude measures for a specified window of the average. In the standard F.sub.SP calculation, each point across a specified time window is used in a standard variance calculation referred to as VAR(s). This value comprises the energy of the ABR (if present) as well as the energy of the averaged noise. Every 256 sweeps the averaging process is halted momentarily and VAR(s), VAR(sp) and the ratio of the two (F.sub.SP) are calculated. The numerator, VAR(s), includes signal and noise, and the denominator, VAR(sp), estimates noise. When no signal (ABR) is present the expected value of the ratio is close to 1. The ratio of variances has the known statistical F distribution, indexed by parameters known as the degrees of freedom (dof) of the numerator and the denominator. Consequently, when the degrees of freedom are known, the probability of false positive detection for any F.sub.SP value associated with an evoked potential recording can be determined by look up on an F table.
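The variance ratio described above can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation: the single-point variance is divided by the number of sweeps so that it estimates the noise variance remaining in the average (which is what makes the ratio close to 1 when no response is present), the simulated noise is white, unlike real ABR data, and all names and amplitudes are hypothetical.

```python
import math
import random
import statistics

def fsp(sweeps, window, single_point):
    """Ratio VAR(s) / VAR(sp) for a list of equal-length sweeps.

    window is a (start, stop) pair of sample indices; single_point is one index.
    """
    n_sweeps = len(sweeps)
    start, stop = window
    average = [sum(s[i] for s in sweeps) / n_sweeps for i in range(start, stop)]
    # Numerator: point-to-point variance of the averaged waveform in the window.
    var_s = statistics.variance(average)
    # Denominator: sweep-to-sweep variance at one point, scaled to the average
    # (assumption: division by the number of sweeps).
    var_sp = statistics.variance([s[single_point] for s in sweeps]) / n_sweeps
    return var_s / var_sp

def simulate_sweeps(signal, noise_sd, n_sweeps, rng):
    """Hypothetical recordings: a fixed signal plus independent Gaussian noise."""
    return [[v + rng.gauss(0.0, noise_sd) for v in signal] for _ in range(n_sweeps)]

if __name__ == "__main__":
    rng = random.Random(0)
    n_points = 50
    silent = [0.0] * n_points
    response = [math.sin(2 * math.pi * i / n_points) for i in range(n_points)]
    for label, sig in (("no response", silent), ("response", response)):
        print(label, round(fsp(simulate_sweeps(sig, 5.0, 512, rng), (0, n_points), 25), 2))
```

In a sequential test the same function would be re-evaluated on the accumulated sweeps at each checkpoint and compared with a critical value taken from the F distribution for the assumed degrees of freedom.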
In a standard paradigm, F.sub.SP values are updated after each 256 sweeps. As the averaging process reduces background noise, the F.sub.SP value associated with a recording containing a true ABR will grow. A priori rules can be established for halting the averaging process based on a comparison of achieved and desired probability of true response detection. For example, in the article cited above, Elberling and Don (1984) used a conservative estimation of degrees of freedom and determined that an F.sub.SP of 3.1 would correspond to true-positive detection confidence of 99%. In that case, the F.sub.SP value was used as the stopping criterion for the averaging process, indicating that the desired signal-to-noise ratio had been achieved. Because recordings and subjects vary dramatically in the level of the background noise and the amplitude of the evoked potential, using a target F.sub.SP as a stopping rule optimizes the use of averaging time, averaging for shorter periods in good SNR conditions and for longer periods in poor SNR conditions.
The disadvantages of the F.sub.SP technique include:
Excessive window length: The standard response analysis window has length 1000/HPF ms where HPF is the high-pass cutoff frequency of the recording amplifier. For a typical case of HPF of 100 Hz, the length is 10 ms. This is generally greater than the length of the region of significant response amplitude. Thus, time regions that contribute little or nothing to the numerator variance estimate are included. This reduces the expected value of the numerator, resulting in a less sensitive test (a test with lower statistical power) than if the window were delimited to regions of substantial response amplitude.
Sub-optimal test points: Even given a response-focused window, some time points within the window contribute more to the response variance than do others. In general, there will exist some subset of all the points in the window that develops maximum variance for a given response waveform, and there will be many other subsets that develop variance substantially greater than the variance of the entire window. It follows that even for a focused window, to select all points in the window as is done in the standard F.sub.SP is sub-optimal with respect to statistical power. Both of these disadvantages result in a detection test that is less powerful than necessary.
Conventional F.sub.SP can be classed as semi-empirical or semi-analytic. The approach is vastly more quantitative and reproducible than is subjective judgment of response presence or absence. The limitation arises from the fact that the statistical degrees of freedom in the numerator variance estimate are known only approximately. This is due to the fact that the effective degrees of freedom in a time series that has correlation between successive data points, as is the case for ABR data, are not equal to the number of data points used in calculating the variance estimate. For example, a time window containing 100 data points is normally assumed to have 99 degrees of freedom, but may actually have only 10. This means that the distribution of the sample variance of such a set of points will follow chi-square with 10 dof, not chi-square with 99 dof. The distribution of the F.sub.SP statistic will change accordingly. Experimental studies have shown that the effective dof in, say, a 10 ms window of ABR data vary slightly across subjects and measurement conditions. Since the dof in an individual subject are not known exactly, but rather only approximately, the Type I error rate (alpha, the significance level of the response detection test) will be only approximately correct.
Thus a great strength of F.sub.SP is that the F-distribution is valid. The qualification is that the decision error rates are not known exactly, only approximately.
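The loss of effective degrees of freedom in correlated data, discussed above for the F.sub.SP numerator, can be demonstrated by simulation. The sketch below is illustrative only: a first-order autoregressive process stands in for ABR background noise, the correlation value of 0.9 is arbitrary, the function names are hypothetical, and the effective dof are estimated by moment matching (for a scaled chi-square variable, twice the squared mean divided by the variance equals its dof).

```python
import random
import statistics

def ar1_series(n_points, rho, rng):
    """Unit-variance AR(1) noise: x[t] = rho * x[t-1] + scaled innovation."""
    scale = (1.0 - rho * rho) ** 0.5
    x = rng.gauss(0.0, 1.0)
    series = [x]
    for _ in range(n_points - 1):
        x = rho * x + scale * rng.gauss(0.0, 1.0)
        series.append(x)
    return series

def effective_dof(n_points, rho, n_trials, rng):
    """Moment-matched dof of the sample variance of an n_points window."""
    variances = [statistics.variance(ar1_series(n_points, rho, rng))
                 for _ in range(n_trials)]
    mean = statistics.fmean(variances)
    return 2.0 * mean * mean / statistics.variance(variances)

if __name__ == "__main__":
    rng = random.Random(0)
    print("uncorrelated:", round(effective_dof(100, 0.0, 1000, rng), 1))
    print("rho = 0.9   :", round(effective_dof(100, 0.9, 1000, rng), 1))
```

For uncorrelated noise the estimate should come out near the nominal 99, while strong correlation between successive points reduces it by a large factor, which is why the critical value read from an F table depends on how the dof are estimated.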