(1) Field of the Invention
This invention generally relates to a signal classification system for classifying an incoming data stream. More particularly, the invention relates to an improvement to the M-ary classifier known in the prior art resulting in a higher probability of correct classification.
(2) Description of the Prior Art
In order to determine the nature of an incoming signal, the signal type must be determined. A classifier attempts to classify a signal into one of M signal classes based on features in the data. M-ary classifiers utilize neural networks for extracting these features from the data. In a training stage, the neural networks incorporated in the classifier are trained with labeled data, allowing the neural networks to learn the patterns associated with each of the M classes. In a testing stage, the classifier is tested against unlabeled data based on the learned patterns. The performance of the classifier is defined as the probability that a signal is correctly classified, herein referred to as "PCC".
A prior art classifier is shown in FIG. 1. The classifier 2 receives data from a data source 4. Data source 4 is joined to a feature transformation module 6 for developing a feature set. The feature set is provided to pattern match processors 8 which correspond to each data class. Pattern match processors 8 provide an output measuring the developed feature set against trained data. The pattern match processor 8 outputs are compared in a comparison 9 and the highest output is selected.
The basis of most M-ary classifiers is the maximum a posteriori probability (MAP) classifier or Bayesian classifier
$$\arg\max_{j=1}^{M} \, p(H_j \mid X) = \arg\max_{j=1}^{M} \, p(X \mid H_j)\, p(H_j). \tag{1}$$
However, if the likelihood functions p(X|Hj) are not known, it is necessary to estimate them from training data. Dimensionality dictates that this is impractical or impossible unless X is reduced to a smaller set of statistics, or features Z=T(X).
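The decision rule (1) can be illustrated with a minimal sketch, assuming the likelihood functions are known. The univariate Gaussian form, the function names `gaussian_log_pdf` and `map_classify`, and all numerical values are assumptions made for this illustration and do not come from the prior art described here.

```python
import numpy as np

def gaussian_log_pdf(x, mean, std):
    """Log of the univariate normal density N(mean, std^2) evaluated at x."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def map_classify(x, means, stds, priors):
    """Return the index j that maximizes p(x | H_j) p(H_j), as in (1).

    The product is evaluated in the log domain for numerical stability;
    the argmax is unchanged because the logarithm is monotonic.
    """
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    priors = np.asarray(priors, dtype=float)
    log_posterior = gaussian_log_pdf(x, means, stds) + np.log(priors)
    return int(np.argmax(log_posterior))

# Hypothetical three-class problem (values chosen only for illustration).
means = [-2.0, 0.0, 3.0]
stds = [1.0, 1.0, 1.0]
priors = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]

label = map_classify(0.2, means, stds, priors)
```

With equal priors and equal variances the rule reduces to choosing the nearest class mean. Unequal priors shift the decision boundaries toward the less probable classes.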
While many methods exist for choosing the features, this invention concentrates on class-specific strategies. Class specific architectures are taught in the prior art in patents such as Watanabe et al., U.S. Pat. No. 5,754,681.
One possible class-specific strategy is to identify a set of statistics zj, corresponding to each class Hj, that is sufficient or approximately sufficient to estimate the unknown state of the class. Sufficiency in this context will be defined more precisely in the theorem that follows. Because some classes may be similar to each other, it is possible that the M feature sets are not all distinct. Let
$$Z = \bigcup_{i=1}^{M} z_i \tag{2}$$
where set union notation is used to indicate that there are no redundant or duplicate features in Z. However, removing redundant or duplicate features is not restrictive enough. A more restrictive, but necessary, requirement is that p(Z|Hj) exists for all j. The classifier based on Z becomes
$$\arg\max_{j=1}^{M} \, p(Z \mid H_j)\, p(H_j). \tag{3}$$
The object of the feature selection process is to make (3) equivalent to (1). When this holds, the features Z are sufficient for the problem at hand.
In spite of the fact that the feature sets zj are chosen in a class-specific manner and are possibly each of low dimension, implementation of (3) requires that the features be grouped together into a super-set Z. Dimensionality issues dictate that Z must be of low dimension (less than about 5 or 6) so that a good estimate of p(Z|Hj) may be obtained with a reasonable amount of training data and effort. The complexity of the high dimensional space is such that it becomes impossible to estimate the probability density function (PDF) with a reasonable amount of training data and computational burden. In complex problems, Z may need to contain as many as a hundred features to retain all necessary information. This dimensionality is entirely unmanageable. It is recognized by a number of researchers that attempting to estimate PDFs nonparametrically above five dimensions is difficult and above twenty dimensions is futile. Dimensionality reduction is the subject of much research currently and over the past decades. Various approaches include feature selection, projection pursuit, and independence grouping. Several other methods are based on projection of the feature vectors onto lower dimensional subspaces. A significant improvement on this is the subspace method in which the assumption is less strict in that each class may occupy a different subspace. Improvements on this allow optimization of error performance directly.
All these methods involve various approximations. In feature selection, the approximation is that most of the information concerning all data classes is contained in a few of the features. In projection-based methods, the assumption is that information is confined to linear subspaces. A simple example that illustrates a situation where this assumption fails is when the classes are distributed in a 3-dimensional volume and arranged in concentric spheres. The classes are not separated when projected on any 1 or 2-dimensional linear subspace. However, statistics based on the radius of the data samples would constitute a simple 1-dimensional space in which the data is perfectly separated.
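The concentric-sphere example can be checked numerically. The following is a minimal sketch, assuming two classes lying on thin spherical shells of radius 1 and 3; the shell thickness, sample counts, random seed, and the helper name `sample_shell` are assumptions made for this illustration. Only one linear projection (the x-axis) is tested; by spherical symmetry the same overlap occurs along any other direction.

```python
import numpy as np

def sample_shell(rng, n, radius, thickness=0.1):
    """Draw n points on a 3-D spherical shell centered at the origin.

    Directions are uniform on the unit sphere (normalized Gaussian vectors);
    radii are uniform in [radius - thickness, radius + thickness].
    """
    directions = rng.normal(size=(n, 3))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = radius + rng.uniform(-thickness, thickness, size=(n, 1))
    return directions * radii

rng = np.random.default_rng(0)
inner = sample_shell(rng, 500, radius=1.0)   # hypothetical class A
outer = sample_shell(rng, 500, radius=3.0)   # hypothetical class B

# Linear 1-D projection onto the x-axis: the two class ranges overlap,
# so no threshold on this projection separates the classes.
proj_overlap = bool(
    inner[:, 0].min() < outer[:, 0].max()
    and outer[:, 0].min() < inner[:, 0].max()
)

# Nonlinear 1-D feature: the radius of each sample. The largest inner
# radius is below the smallest outer radius, so a single threshold works.
r_inner = np.linalg.norm(inner, axis=1)
r_outer = np.linalg.norm(outer, axis=1)
radius_separates = bool(r_inner.max() < r_outer.min())
```

Here any threshold on the radius between 1.1 and 2.9 separates the two classes, while the x-axis projection of the inner shell lies entirely inside the range spanned by the outer shell.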
Whatever approach one uses, if Z has a large dimension, and no low-dimensional linear or nonlinear function of the data can be found in which most of the useful information lies, either much of the useful information must be discarded in an attempt to reduce the dimension or a crude PDF estimate in the high-dimensional space must be obtained. In either case, poor performance may result.
Therefore, it is one purpose of this invention to provide an improvement on the M-ary classifier.
Another purpose of this invention is to drastically reduce the maximum PDF dimension while at the same time retaining theoretical equivalence to the classifier constructed from the full feature set and to the optimum MAP classifier.
Yet another purpose is to provide a classifier that gives this performance using a priori information concerning data and classes that is discarded when the combined feature set is created.
Accordingly there is provided a class specific classifier for classifying data received from a data source. The classifier has a feature transformation section associated with each class of data which receives the data and provides a feature set for the associated data class. Each feature transformation section is joined to a pattern matching processor which receives the associated data class feature set. The pattern matching processors calculate likelihood functions for the associated data class. One normalization processor is joined in parallel with each pattern matching processor for calculating an inverse likelihood function from the data, the associated class feature set and a common data class set. The common data class set can be either calculated in a common data class calculator or incorporated in the normalization calculation. Preferably, the common data class set will be calculated before processing the received data. The inverse likelihood function is then multiplied with the likelihood function for each associated data class. A comparator provides a signal indicating the appropriate class for the input data based upon the highest multiplied result. The invention may be implemented either as a device or a method operating on a computer.
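One possible reading of this arrangement can be sketched as follows. The sketch assumes that the inverse likelihood function for class j is 1/p(zj|H0), where H0 is the common data class, so that each class score is the ratio p(zj|Hj)/p(zj|H0) weighted by the prior. This interpretation, the two hypothetical classes (a mean shift and a variance change relative to unit-variance Gaussian noise), the constants `MU` and `S2`, and all function names are assumptions made for illustration only.

```python
import math
import numpy as np

MU = 1.0   # hypothetical mean under H_1 (samples i.i.d. N(MU, 1))
S2 = 9.0   # hypothetical variance under H_2 (samples i.i.d. N(0, S2))
# Assumed common class H_0: samples i.i.d. N(0, 1).

def log_normal_pdf(z, mean, var):
    """Log density of N(mean, var) at z."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (z - mean) ** 2 / var

def log_chi2_pdf(z, k):
    """Log density of a chi-square variable with k degrees of freedom (z > 0)."""
    return ((k / 2.0 - 1.0) * math.log(z) - z / 2.0
            - (k / 2.0) * math.log(2.0) - math.lgamma(k / 2.0))

def score_h1(x):
    """Class-1 branch: feature z_1 = sample mean, log p(z_1|H_1) - log p(z_1|H_0)."""
    n = len(x)
    z = float(np.mean(x))
    return log_normal_pdf(z, MU, 1.0 / n) - log_normal_pdf(z, 0.0, 1.0 / n)

def score_h2(x):
    """Class-2 branch: feature z_2 = sum of squares (must be nonzero)."""
    n = len(x)
    z = float(np.sum(np.square(x)))
    # Under H_2, z / S2 is chi-square with n degrees of freedom.
    log_p_h2 = log_chi2_pdf(z / S2, n) - math.log(S2)
    # Under H_0, z itself is chi-square with n degrees of freedom.
    log_p_h0 = log_chi2_pdf(z, n)
    return log_p_h2 - log_p_h0

def classify(x, priors=(0.5, 0.5)):
    """Comparator: return 0 for H_1 or 1 for H_2, whichever score is highest."""
    scores = [score_h1(x) + math.log(priors[0]),
              score_h2(x) + math.log(priors[1])]
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
decision = classify(rng.normal(loc=MU, scale=1.0, size=100))
```

In this sketch each density is estimated over a single one-dimensional feature rather than over the combined feature set, and the division by the common-class density places the class scores on a comparable scale before the comparison.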