1. Field of the Invention
This invention relates to a computer automated method for creating image recognition engines optimized for a given language and/or source; and more particularly to a computationally efficient method for selecting, from a large universe of features, subsets of features that optimize recognition accuracy against a target data set. While not limited thereto, the invention will be explained in its preferred context for use in recognition of hand printed or machine printed characters.
2. Description of the Prior Art
As will be appreciated by those skilled in the art, each new language studied (and research in character recognition generally), proposes new character features to address specific problems or attributes of the source data set. However, the addition of new features is not necessarily sufficient for improved performance of a character recognition engine. If too many features are used, the rate of successful character discrimination actually drops as more features are added, a phenomena known in the art as peaking.
Each language (Japanese, Thai, English, et al.) has unique characteristics. Traditionally image character recognition (ICR) engines have been developed that make use of a feature set which was hand developed to support the language. The quality of the ICR is directly related to the amount of research effort applied to the language. High quality ICR engines exist where large markets support a large investment in research and development. An automated method that selects feature subsets would allow inexpensive development of ICR engines that perform well against languages that economically would not support a large investment.
Then, too, mixed languages can present unique problems as an OCR engine may need different features to distinguish between the two (or more) languages than the feature set used to best recognize either language individually. An automated feature selection tool would generate a feature set that is tailored to handle the particular mix of languages involved.
The following is a definition of certain of the terms used herein:
Class: All samples of a given codepoint (character) within the training data set.
Subclass: The set of samples within a single class which share some common attribute(s).
Codepoint: The representation to a machine of what people would recognize as a character.
Feature Vector: The set of measurements for the feature universe for a single sample ({right arrow over (v)})
S+: The class under consideration.
Sxe2x88x92: The set of all classes within the problem space other than the codepoint under consideration (S+).
Exclusive: Used to describe a binary vector. Exclusive indicates that the binary vector is distinct from all vectors within Sxe2x88x92 (eq. 1).
{right arrow over (v)}xcex9{right arrow over (S)}xe2x89xa0{right arrow over (v)}∀{right arrow over (S)}xcex5Sxe2x88x92xe2x80x83xe2x80x83(1)
xe2x80x83where:
xcex9 is the xe2x80x9candxe2x80x9d operator,
∀ is the xe2x80x9callxe2x80x9d operator, and
xcex5 means exists.
G: A subset of S+, also known as a xe2x80x9cviewxe2x80x9d
xcex1(G): The binary vector resulting from the logical conjunction of the samples represented by G.
"THgr"(S+,Sxe2x88x92): The collection of all G subsets in S+ such that xcex1(G) is exclusive against all vectors in Sxe2x88x92.
xcexa9(S+,Sxe2x88x92): The collection of all maximal subsets in "THgr"(S+,Sxe2x88x92) where maximal is defined as every H in "THgr"(S+,Sxe2x88x92) such that if xcex1(H) is not exclusive against some xcex1(G) in "THgr"(S+,Sxe2x88x92) then H=G.
MAX: Number of samples within the largest subclass occurring in xcexa9(S+,Sxe2x88x92) (eq. 2).                               Max          ⁡                      (            Ω            )                          =                                                            max                                                                                      G                  ∈                  Ω                                                              ⁢                                    "LeftBracketingBar"              G              "RightBracketingBar"                                      "LeftBracketingBar"                              S                +                            "RightBracketingBar"                                                          (        2        )            
xe2x80x83where: |G| is the number of elements in set G.
AVE: The average number of samples within the subclasses occurring within xcexa9(S+,Sxe2x88x92) (eq. 3).                               Ave          ⁡                      (            Ω            )                          =                                            1              M                        ⁢                                          ∑                                  G                  ∈                  Ω                                                  xe2x80x83                                            ⁢                                                                    "LeftBracketingBar"                    G                    "RightBracketingBar"                                                        "LeftBracketingBar"                                          S                      +                                        "RightBracketingBar"                                                  ⁢                                  xe2x80x83                                ⁢                where                ⁢                                  xe2x80x83                                ⁢                M                                              =                      "LeftBracketingBar"            Ω            "RightBracketingBar"                                              (        3        )            
An object of this invention is the provision of a computer automated method to select a subset of features in order to limit the number of features used by the character recognition engine, while optimizing its ability to discriminate among characters. That is, a method that narrows the number of features to those features that provide best within class consistency of recognition and the best cross-class discrimination.
Another object of the invention is the provision of a computer automated method of feature selection that is suitable for use with a large number of classes (i.e. 1000+characters) and a large number of features (i.e. 100+features).
A further object of this invention is the provision of a computer automated feature selection method in which the selection program is executed in a distributed operation on multiple processors.
Briefly, this invention contemplates the provision of a computer automated feature selection method based upon the evaluation of hyper-rectangles and the ability of these rectangles to discriminate between classes. The boundaries of the hyper-rectangles are established upon a binary feature space where each bit indicates the relationship of a real feature value to a boundary within the minimum and maximum values for the feature across all samples. Data reduction combines the binary vector spaces so that the number of samples within a single class is within a range which is computationally feasible. Identification of subclasses identifies maximal subsets of S+ which are exclusive against Sxe2x88x92. Feature evaluation determines within a single subclass the contribution of each feature towards the ability to discriminate the subclass from Sxe2x88x92. The base algorithm examines each feature, dropping any feature which does not contribute towards discrimination. A pair of statistics are generated for each remaining feature. The statistics represent a measure of how many samples from the class are within the subclass and a measure of how important each feature is to discriminating the subclass from Sxe2x88x92. The values for each subclass are then combined to generate a set of values for the class. These class feature metrics are further merged into metrics evaluating the features contribution across the entire set of classes. Feature reduction determines which features contribute the least across the entire set of classes. These features will be removed from consideration. The algorithm drops features if they fail to reach a predetermined significance level. If all features are found to be significant then the feature with the lowest contribution metric is discarded. Finally, peaking determination is used to determine if the process should be repeated against the reduced feature space. The peaking determination is done by examining the rate of change within the significance metrics.
The basic algorithm is set forth in two articles by M. Kudo and M. Shimbo: xe2x80x9cFeature Selection Based on the Structural Indices of Categories,xe2x80x9d Pattern Recognition 26 (1993) 891-901, and xe2x80x9cOptimal Subclasses with Dichotomous Variables for Feature Selection and Discrimination,xe2x80x9d IEEE Trans. Syst. Man Cybern. 19 (1989) 1194-1199, which are incorporated herein by reference.