Pattern and signal discrimination problems arise in numerous applications. If the nature of observed patterns or signals is well understood, then selection of an appropriate analysis method is straightforward. However, if the process that generates a pattern or signal is poorly understood, then discriminations and comparisons between instances of observed data are frequently ad hoc and yield weak results. In many cases, each observed pattern or signal is known to lie in one of a plurality of distinct classes, but the inherent characteristics that define each class and differentiate between classes are unknown. A means to “bootstrap” the analysis and empirically discover discriminating characteristics is therefore critical.
Signature detection is one example of a target problem. A “signature” is a pattern within a signal or a data stream that can be associated with a condition of interest in the signal generating system. The goal is to discover and characterize signatures of specific conditions by examining groups of data collected under conditions with and without the signature present. By comparing the two groups of data one hopes to extract a representation of the signature.
There is a need for classifying and discriminating, for example, messy biometric signals. One specific signature detection problem targeted by the instant invention is identifying specific cognitive processes in electroencephalographic (EEG) and electro-cortical (EcoG) signals. The signals are electrical voltages measured by one or more electrodes placed either on the scalp (EEG) or on the surface of the brain itself (EcoG). (Sometimes, in fact, especially in experiments with laboratory animals, electrodes are placed interiorly in the brain.) The relationship between underlying cognitive activity and a measured signal is at best very poorly understood. Superficially, EEG/EcoG voltage patterns generally look like “colored” noise.
An empirical approach to understanding signals from the brain is to put the brain into a known condition and then sample the patterns that are correlated with that condition. In some example tasks, a subject might be asked to push a switch, to distinguish tones, to read words or to name pictures.
By analyzing the collected data, one would like to discover a signature that is indicative of the experimental condition. Ultimately, one might hope to identify signature patterns associated with very specific activities. For example, by understanding the signature brain activity preceding the act of pushing a switch, it may be possible to design a system that detects when a person merely thinks about the action. It might likewise be possible to design a system that detects the signature relating to thinking about specific words or phrases. These systems have obvious application in machine/human interfaces. There are also medical applications, including pre-seizure or mid-seizure detection of epileptic seizures, mapping brain areas prior to surgery, and so forth.
EEG and EcoG signature detection has been attempted using many techniques, including time-series averaging, Fourier and Wavelet analysis, and Matching Pursuits methods. Research is widespread and, while certain interesting foci have emerged (e.g., 40-hertz binding, alpha energy suppression, etc.), the existing methods have not yielded a satisfactory description of the underlying signature patterns. In part this is due to limitations of the methods. As will be discussed, the existing analysis methods generally rely on comparing the signals to certain standardized, mathematically “nice” prototype signals. The existing methods do not accommodate nonconforming signal dynamics very well and at best they present a blurred average picture of the situation.
Finally, in practice it may be necessary not only to distinguish the absence or presence of certain signatures (e.g., subject sees an image), but also to clearly distinguish one signature from another (e.g., subject sees a dog, not a cat). It is important to understand both what is common in similar subject signals and what is distinctive in different subject signal groups.
Another specific signature detection problem occurs in engine health monitoring. The problem in this case is to predict failures of the engine, transmission, or other key components in a mechanical system from data that is periodically recorded. Often oil particulates, mechanical vibration levels, and other physical data are utilized. There is increasing interest in using acoustic analysis to predict failures.
It is very difficult to model mechanical interactions a priori in sufficient detail, especially if a system is exposed to unpredictable environmental factors. Here too an empirical approach is applied. One or more acoustic sensors mounted on or near the apparatus record signals. Frequencies of interest may range from subsonic to ultrasonic, depending on the monitored system. In this case the signals represent time-varying acoustic pressure patterns, i.e., sound. When components of the engine fail, the time of failure is recorded. By examining the acoustic signals prior to failure in a collection of different units or in the same unit on different occasions, one hopes to extract a universal signature signal that precedes the event. If such signatures are identified, then a system could be deployed to monitor engine health and warn users of pending failure in time to take corrective actions.
The idea of signature detection is not limited to classic signals, like sound or EEG, but is potentially applicable to latent patterns in any kind of data set. In engine health analysis, one might equally well look for signature patterns of variation in oil particulate counts prior to a mechanical failure. If oil is sampled regularly enough, then different patterns of increase or decrease in ferrous or organic contaminants might be associated with incremental failure of components. By identifying a signature in the data, a system can be developed to warn users of maintenance issues or pending failure.
Using either example measure, the engine health problem is complicated by familiar factors. The signals are not easily modeled by mathematically nice prototype patterns, existing analysis methods do not accommodate nonconforming signal dynamics very well, and the environment introduces additional unpredictable variations. Fine points again arise: it is very important to predict pending catastrophic failure, but even more useful to predict whether a particular bearing or cylinder head is the likely culprit.
Both these example problems can be broadened in various ways. An EEG signature corresponding to a particular person's brain activity might be used as a security key device. An audio signature corresponding to a particular class of mechanical engine might be used to remotely identify aircraft or naval vessels in defense applications.
Moreover, signature recognition and detection is important in other fields. Signature patterns may help computer systems recognize images or detect motion. Signature patterns in seismic data may predict earthquake and/or volcanic activity. Signature patterns in acoustic sounding returns may predict the presence or absence of minerals. Signature patterns in radar and sonar returns may be used for target identification and classification. Signature patterns in sound may be used to enhance speech recognition and machine translation. Signature patterns in DNA structures may be useful in genomic classification problems and in relating phenotype to genotype. Signature patterns in medical data may be used to diagnose disease. Many other well-known data mining or auto-classification problems share characteristic difficulties with the expanded examples, and could potentially be better addressed with a more adaptable analysis algorithm.
In general, problem data sets may arise whenever similar information is collected under two or more distinct conditions, or can be otherwise sorted into two or more distinct groups that must be compared. In typical cases, data groups are believed to be different from each other, but the characteristic differences between them are either poorly understood or completely unknown. Likewise, the data within each data group is typically expected to be similar; however, the characteristic similarities may be poorly understood or may be completely unknown.
Sorted data sets naturally arise in controlled experimentation. In such cases, an experimental designer first defines two or more sets of conditions. Then, each experimental condition is manifested and information is recorded by some means. Each controlled period or situation is often termed a “trial”, and an experiment consists of one or more trials under each of a plurality of conditions. The data set comprises a trial-by-trial collection of information, consisting of the observations for each trial together with some means of distinguishing the relevant conditions for each trial.
Sorted data sets also arise in less controlled situations. Data may be collected continuously or periodically in any circumstance and tagged to indicate which of a plurality of possible conditions each datum is associated with. Tagging and sorting may occur during recording, or it may occur after the fact. Sorting may be automatic, or it may require a skilled individual, and may occur by any means so long as it establishes two or more groups of trial data. Here, we apply the term “trial” to each unit of sorted data.
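By way of non-limiting illustration, a sorted data set of the kind described above can be sketched as a collection of tagged trials. The sketch is an assumption made for exposition only: the names Trial and group_by_condition, the condition labels, and the sample values are hypothetical and are not drawn from any particular embodiment.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trial:
    """One unit of sorted data: the observations plus the condition tag."""
    condition: str        # hypothetical label, e.g. "tone_a"
    samples: List[float]  # ordered observations recorded for this trial


def group_by_condition(trials: List[Trial]) -> Dict[str, List[Trial]]:
    """Sort tagged trials into one group per condition."""
    groups: Dict[str, List[Trial]] = defaultdict(list)
    for trial in trials:
        groups[trial.condition].append(trial)
    return dict(groups)


# Made-up values, used only to exercise the structure.
trials = [
    Trial("tone_a", [0.1, 0.3]),
    Trial("tone_b", [0.2, 0.0]),
    Trial("tone_a", [0.0, 0.4]),
]
groups = group_by_condition(trials)
```

Tagging may equally be applied after recording; the grouping step needs only that each trial carries some means of identifying its condition.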
Finally, some problem data sets may not have any a priori divisions. In this case, data is sampled or otherwise divided into identically sized units, each unit comprising a data “vector” {x1, x2, . . . , xn}. Each data vector may be termed a trial and the goal becomes to discover structure or similarities within the collection.
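As a minimal sketch of one way such a division might be performed, a recording can be cut into consecutive, non-overlapping vectors of equal length. The function name split_into_trials is hypothetical, and the choice to discard trailing samples that do not fill a whole vector is an assumption of the sketch; overlapping or otherwise sampled units are equally possible.

```python
from typing import List, Sequence


def split_into_trials(samples: Sequence[float], trial_len: int) -> List[List[float]]:
    """Divide a recording into consecutive, non-overlapping vectors of
    trial_len samples each. Leftover samples at the end are dropped
    (an assumption of this sketch)."""
    if trial_len <= 0:
        raise ValueError("trial_len must be positive")
    n_trials = len(samples) // trial_len
    return [
        list(samples[i * trial_len:(i + 1) * trial_len])
        for i in range(n_trials)
    ]


# A made-up ten-sample recording cut into vectors of four samples.
recording = [0.1 * k for k in range(10)]
trials = split_into_trials(recording, 4)
```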
The recorded data for each trial is often described as a “signal”, particularly if it represents a time-varying pattern of information. However, the recorded data may be variously termed an image, pattern, vector, epoch, echo, or any other term of art that denotes an ordered set of observations. Many equivalent descriptive terms will be specific to various fields of application and obvious to those skilled in the art. For simplicity all such data will be described herein as a “signal”, without limiting the invention. We will term a collection of signals a “signal data set.”
Signal data sets arise in many areas and may be derived from any time- or space-varying quantity. For example: In medicine they include but are not limited to records of EEG, EKG, MEG, skin-resistance, blood pressure, heart rate, breath rate, blood chemistry, blood gas concentrations, lung volume, muscle force, any of a number of common image rendering methods, DNA sequences, infection rates, and so on. In defense engineering applications they may include but are not limited to, radar echoes, sonar echoes, passive RF, audio or optical recording, magnetic anomaly detection, etc. In communications they occur in areas including but not limited to, speech recognition, optical recognition, data compression, etc. Other signal data sets arise in areas including machine health analysis, geographic information systems, credit risk assessment, financial trends analysis, bio-informatics, seismic and mineral discovery analysis, reliability studies, scientific investigations and so on. Appropriate data sets are common; the example list is not exhaustive and many similar and related applications will be obvious to those skilled in the subject art.
When analyzing poorly understood data sets, a priori analysis methods often result in analyses with no significant statistical difference between groups and/or little or no statistical similarity within groups. Trial and error may eventually lead to discovering satisfactory discrimination criteria, or criteria may eventually be established and refined based upon improving theoretical descriptions of the data. Generally the process is laborious and chancy. Both theoretical development and empirical investigations would benefit from an analysis method that automatically adapts to the data set in order to highlight important inherent characteristics of each signal group.
The characteristics that are inherently important are those that maximize our ability to either discriminate between groups or to define similarities within groups. Statistical power is often dependent upon how the data is represented, and different theoretically equivalent data representations may tend to conceal or emphasize different characteristics.
A signal, X, is typically represented by a “vector” of coefficients, {x1, x2, . . . , xn}. Such a vector may be transformed by any of a host of means, known to those skilled in the art, into another vector that is representative of the original. If no information is lost, the transformation is reversible so that the original data can be recovered; such transformations are termed “non-destructive”. If information is lost, the transformation is termed “destructive”; however, such a representation may nonetheless be of use, because the characteristics that are highlighted under such a transformation are those that are concentrated into a few coefficients. In the example of a Fourier transform, the energy occurring at a particular sinusoidal frequency is concentrated in a single coefficient. Thus, strong sinusoidal patterns stand out clearly because they are represented by only a few large numbers within the new vector. In the equivalent time-series vector these same characteristics are obscured because they are distributed as small values over a large number of coefficients.
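The concentration of energy under a Fourier transform can be illustrated with a short sketch. The sketch assumes a pure sinusoid that completes a whole number of cycles within the vector, and it uses a naive discrete Fourier transform written out directly; the function name dft and the values of n and cycles are choices made for illustration. For a real-valued signal the full transform carries the energy in a pair of conjugate coefficients, so the sketch examines the one-sided spectrum, in which a single coefficient holds essentially all of the energy.

```python
import cmath
import math
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive discrete Fourier transform, O(n^2); adequate for a short example."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


n = 64       # vector length (illustrative)
cycles = 5   # whole cycles per vector (illustrative)
signal = [math.sin(2 * math.pi * cycles * t / n) for t in range(n)]

# One-sided energy spectrum: bins 0 .. n/2.
energy = [abs(c) ** 2 for c in dft(signal)[: n // 2 + 1]]
peak_bin = max(range(len(energy)), key=energy.__getitem__)
peak_share = energy[peak_bin] / sum(energy)

# In the time series, the same energy is spread thinly over many samples.
time_energy = [s * s for s in signal]
largest_sample_share = max(time_energy) / sum(time_energy)
```

Here peak_share is close to one, whereas largest_sample_share is only a few percent, which is the contrast between the two representations described above.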
Statistical comparisons are frequently stronger, quicker and more straightforward when they are based on a few largely varying coefficients than when they are based on many minutely varying coefficients. Likewise, signal characteristics are more easily visualized when they are compactly represented.
Data transmission and storage situations suffer from similar problems. Small dynamic variations may be lost in channel noise. Concentrating important information into a few large data values allows more robust transmission. Furthermore, it is well known that such transformations can be used to compress data: after transforming data so that important information is concentrated into a few large data values, one may truncate smaller values and still recover a close approximation of the signal from the smaller data set. Moreover, under certain transforms the small, truncated coefficients will represent noise; hence, the reconstruction process may actually improve the signal to noise ratio.
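The truncation and reconstruction just described can be sketched as follows. This is an illustration under stated assumptions, not a description of the inventive method: the test signal is a sum of two sinusoids with added Gaussian noise, the transform is a naive discrete Fourier transform, and the names dft, keep_largest and mse, the noise level, and the number of retained coefficients are all hypothetical choices.

```python
import cmath
import math
import random
from typing import List, Sequence


def dft(x: Sequence[complex], inverse: bool = False) -> List[complex]:
    """Naive discrete Fourier transform; inverse=True applies the 1/n-scaled inverse."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def keep_largest(coeffs: Sequence[complex], keep: int) -> List[complex]:
    """Zero every coefficient except the `keep` largest in magnitude."""
    ranked = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    kept = set(ranked[:keep])
    return [c if k in kept else 0j for k, c in enumerate(coeffs)]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


random.seed(0)  # fixed seed so the example is repeatable
n = 64
clean = [
    math.sin(2 * math.pi * 5 * t / n) + 0.5 * math.sin(2 * math.pi * 12 * t / n)
    for t in range(n)
]
noisy = [c + random.gauss(0.0, 0.2) for c in clean]

# Keep four coefficients: one conjugate pair for each of the two sinusoids.
truncated = keep_largest(dft(noisy), keep=4)
restored = [v.real for v in dft(truncated, inverse=True)]

err_noisy = mse(noisy, clean)
err_restored = mse(restored, clean)
```

In this sketch only four of sixty-four coefficients are retained, yet the reconstruction lies closer to the noise-free signal than the noisy input does, because most of the discarded coefficients carry only noise.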
In general, for any given signal data set, one would like to construct a data set-specific transform that concentrates important differences (and/or similarities) into a few coefficients. The resulting representation addresses a host of discussed needs. Moreover, if these coefficients correspond to well-understood characteristics (e.g., frequency, time, scale and others, known to those skilled in the art) then an analyst can readily interpret the results in a meaningful way. The present invention is directed toward discovering an approximately optimal representation of any signal data set based upon minimal a priori assumptions.
One object of this invention is to minimize assumptions as to the nature of similarities and differences within the data groups and automatically discover a useable set of criteria on which to discriminate. The practical aim is to find a relatively small set of coefficients and an appropriate representation form in order to compactly and robustly describe key characteristics of each signal and group of signals. Another object of the invention is representing data and classes of data in such a way that the descriptive coefficients are meaningful to the analyst, or are otherwise useful in further processing of the data. Yet a further object of the invention is representing data or classes of data compactly.
A further object of this invention is elimination of noise from a collection of data, whether the noise is only additive noise, or temporal or spatial jitter and frequency instabilities.
Yet a further object of the invention is to facilitate the identification and analysis of characteristics of interest, facilitate compact representation of patterns, signals or groups thereof, facilitate removal of noise therefrom, and facilitate rapid sorting of new data based on characteristics discovered in prior data. And yet a further object of the invention is to provide methods of comparing signal representations after the GAD algorithm is complete.