The need for pattern classification by machine or device exists in a number of technological areas, such as image classification, target recognition and tracking, speech and character recognition, signal processing (including sonar classification and process monitoring), robotics, electronic and radar surveillance, medical, scientific, and engineering diagnosis, and like applications.
A number of prior art approaches to solving pattern classification problems have been described. One such approach, known as "neural networks", divides up the computational work to be done among a set of parallel computational elements in order to speed up the work. Since pattern classification is computation-intensive, dividing up the work (parallelizing) among a number of computational devices makes sense if one is to expect real-time responses from such devices. Pattern classification involves two phases. The first is a learning or training phase (from a set of training examples) in which the system learns or adjusts the parameters of the given mathematical formulae it will use for event identification. The second is a classification phase in which it actually uses the learned formulae to classify speech (words), objects and the like based on the given inputs. An algorithm (method) is usually used for the training of a neural network device. The basic weakness of neural network devices is that they take a long time to train and usually involve an extensive trial-and-error procedure.
The input to a pattern classification machine is a set of N measurements, and the output is the classification. The input is represented by the N-dimensional vector $x = (X_1, X_2, \ldots, X_N)$, called the pattern vector, with its components being the N measurements. Let $\Omega_x$ be the pattern space, which is the set of all possible values that x may assume. Suppose there are K classes. A pattern classification machine will divide $\Omega_x$ into K disjoint decision regions $\Omega_1, \Omega_2, \ldots, \Omega_K$, each corresponding to one class. Sometimes, as will be hereinafter discussed, a class may consist of more than one disjoint region. Thus, the design of a pattern classification machine may be described as finding a rule that divides the pattern space $\Omega_x$ into K decision regions. The parameters of a pattern classification machine are estimated from a set of pattern samples with known classifications, $x_1, x_2, \ldots, x_n$, which is called the training set. As n becomes large, the estimation generally leads to a machine of near-optimum performance. This is machine learning. There are two types of machine learning, supervised and unsupervised. In supervised learning, the classification of each sample of the training set $x_1, x_2, \ldots, x_n$ is known. This is also called learning with a teacher. In unsupervised learning, or learning without a teacher, the classifications of the training samples are unknown. The present invention deals with supervised learning only.
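By way of illustration only, the two phases of supervised learning described above may be sketched in Python as follows. The nearest-class-mean rule, the function names `train_nearest_mean` and `classify`, and the sample data are assumptions chosen for brevity; they are not the method of the present invention.

```python
from math import dist

def train_nearest_mean(samples, labels):
    """Training phase: estimate one mean vector per class from labelled samples."""
    sums, counts = {}, {}
    for x, k in zip(samples, labels):
        acc = sums.setdefault(k, [0.0] * len(x))
        for i, xi in enumerate(x):
            acc[i] += xi
        counts[k] = counts.get(k, 0) + 1
    return {k: [s / counts[k] for s in acc] for k, acc in sums.items()}

def classify(means, x):
    """Classification phase: assign x to the class whose mean is nearest."""
    return min(means, key=lambda k: dist(means[k], x))

# Hypothetical training set with K = 2 classes and N = 2 measurements per pattern.
training_set = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
                (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
classes = ["A", "A", "A", "B", "B", "B"]
means = train_nearest_mean(training_set, classes)
```

Under this assumed rule, the decision region of each class is the set of pattern vectors lying closer to that class mean than to any other, so the pattern space is divided into K regions as described above.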
The fundamental idea in pattern classification is to draw "proper" boundaries for each class region based on the given information (i.e., the training set). Classical statistical techniques (e.g., the Gaussian classifier) use certain standard mathematical functions (i.e., specific probability distribution functions) to draw these boundaries. The limitation of these statistical techniques is that they are not flexible enough to form complex decision boundaries and thus can lead to classification errors. Neural networks, by contrast, are capable of forming complex decision boundaries.
The present disclosure involves "masking" or "covering" a class region. Any complex nonconvex region can be covered by a set of elementary convex forms of varying size, such as circles and ellipses in the 2-dimensional case and spheroids and ellipsoids in the N-dimensional case. As is well known to practitioners of the art, the terms "convex form", "convex cover", or "convex mask" refer to a set of points congruent with a geometric shape wherein each set of points is a "convex set." Therefore, by definition, geometric shapes such as circles, ellipses, spheres, cubes and the like are convex "forms", "covers", and/or "masks". Thus a complex class region can be approximately covered by elementary convex forms, although overlap of these elementary convex "covers" may occur in order to provide complete and adequate coverage of the region. Sometimes, a cover may extend beyond the actual class region, if there is no conflict in doing so. Nonconvex covers may be used when there is no conflict in doing so.
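The covering idea described above may be sketched, in hedged form, as a membership test against a union of elementary convex covers. In the Python fragment below, the three circles, their centers and radii, and the name `in_class_region` are hypothetical values chosen to approximately cover an L-shaped (nonconvex) region in the 2-dimensional case; they are not taken from the present disclosure.

```python
from math import dist

# Hypothetical cover of an L-shaped (nonconvex) class region by three
# overlapping circles. Each entry is (center, radius); values are illustrative.
COVERS = [((1.0, 1.0), 1.2), ((1.0, 3.0), 1.2), ((3.0, 1.0), 1.2)]

def in_class_region(x, covers=COVERS):
    """A point belongs to the class region if at least one convex cover contains it."""
    return any(dist(center, x) <= radius for center, radius in covers)

inside = in_class_region((1.0, 2.0))    # lies in the overlap of the first two circles
outside = in_class_region((3.0, 3.0))   # lies in the notch of the L, outside every circle
```

Each individual test is a simple convex-set check; the nonconvex shape arises only from taking the union of the covers.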
The idea of "elementary convex covers" is not new in pattern classification. The neural network concept is the same and the proof (by construction) of its ability to handle arbitrarily complex regions is based on partitioning the desired decision region into small hypercubes or some other arbitrarily shaped convex regions. However, as will appear, this basic idea is used quite differently in the present invention.
The back propagation algorithm for training of multi-layer perceptrons (see: Rumelhart, D. E., G. E. Hinton, and R. J. Williams, "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press [1986]) is the most widely used algorithm for supervised learning in neural networks. Its objective is to set up class boundaries such that the mean square error is minimized. The steepest descent direction is used for the minimization, but no line search is done in the descent direction. It generally uses the nonlinear logistic (sigmoid) activation function at the nodes of the network. In essence, the back propagation algorithm formulates the pattern classification problem as an unconstrained nonlinear optimization problem and then uses a poor technique for minimization. Just a few of its problems include: (a) slow convergence; (b) getting stuck at a local minimum point; (c) oscillations associated with bigger step sizes (learning rates); and (d) many trials required with different starting points (connection weights), fixed step sizes in the descent direction (learning rates, momentums), and numbers of hidden layers and nodes to be put in the network structure (i.e., experimenting with the nature of the nonlinear function required to draw the boundaries).
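Problems (a) and (c) above can be observed even on a toy problem. The following Python sketch applies steepest descent with a fixed step size and no line search to the one-variable error function E(w) = w², whose minimum is at w = 0. The error function, the step sizes, and the name `fixed_step_descent` are assumptions made for illustration; the fragment is not a neural network and is not the back propagation algorithm itself.

```python
def fixed_step_descent(grad, w0, learning_rate, steps):
    """Steepest descent with a fixed step size and no line search."""
    w = w0
    path = [w]
    for _ in range(steps):
        w = w - learning_rate * grad(w)
        path.append(w)
    return path

def grad(w):
    """Gradient of the toy error function E(w) = w**2."""
    return 2.0 * w

# Small step size: every step shrinks w by only 2%, so progress is slow.
slow = fixed_step_descent(grad, 1.0, 0.01, 50)

# Large step size: every step overshoots the minimum, so w alternates in sign.
oscillating = fixed_step_descent(grad, 1.0, 0.9, 50)
```

With the small step size the iterate is still well away from the minimum after 50 steps, whereas with the large step size successive iterates jump from one side of the minimum to the other. A line search along the descent direction would avoid both behaviors on this function.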
The present invention is predicated upon a solution of these prior art problems through the use of linear masking functions and linear programming ("LP") formulations and provides a very fast and reliable solution technique while avoiding the difficulties incurred by the nonlinear optimization of functions.