An artificial neural network is a mathematical model of the biological behavior of neurons, which classifies the patterns input to it. For an artificial neuron to classify input patterns correctly, the adjustable weights and threshold of each neuron, or unit, of the artificial neural network must be set appropriately. This adjusting process is commonly referred to as training or learning, reflecting the traditionally iterative nature of biological learning processes. It follows the perceptron correction rule, as described, for example, in Rosenblatt, Principles of Neurodynamics, New York: Spartan Books (1959), which states that a trained artificial neural network reflects a mathematical structure of an input-data set recursively selected in an on-line fashion. This view has motivated researchers to develop iterative training schemes for artificial neural networks, which are time-consuming and computationally intensive. Moreover, such schemes often position the classification boundaries of the artificial neural network improperly, which can result in prematurely-trained artificial neurons. These classification boundaries refer to the individual settings of the weights and threshold of each artificial neuron that allow it to distinguish properly between input patterns.
The following describes the mathematical formulations and existing approaches for building artificial neural networks. An artificial neural network comprises multiple processing units, each of which is a simple mathematical model of a neuron, as shown in FIG. 1, acting as a classifier, and is often referred to as a perceptron or Adaline. The data set 10 consists of K input vectors x.sup.i (i=1, 2, . . . , K), each of N entries; each vector is represented by a point and labeled by a scalar-valued class-indicator d.sup.i. The input vectors x.sup.i (i=1, 2, . . . , K) are fed through N input-nodes 20 to units 12. Each unit 12 consists of a hard limiter 13, a linear summer 14 and a linear combiner 15. The linear summer 14 computes the weighted linear sum s of the input vector x.sup.i with an N-dimensional weight vector w. A threshold w.sub.0 is added to the weighted sum s in the linear combiner 15, resulting in a scalar variable u. The hard limiter 13, f(u), then produces a high-or-low output (usually bipolar or binary), denoted by a scalar variable y, which is connected to an output node 22.
The weights and the threshold together define a boundary 16, and they must be appropriately adjusted so that the boundary is properly positioned between clusters 17 of data points. A boundary margin 18, denoted by z, exists between the boundary and each data point in the data set, and is obtained for each point by z.sup.i =u.sup.i d.sup.i, i=1, 2, . . . , K. For correct classification, this boundary margin 18 should be positive. The weights set the directional vector of the boundary, and the specified threshold positions the boundary between the pattern clusters in the data set.
In FIG. 1, the data set 10 of K input vectors is expressed by a matrix X whose dimension is N.times.K, i.e., X=[x.sup.1, x.sup.2, . . . , x.sup.K ], and correspondingly a K-dimensional vector d is formed as the class indicator, i.e., d=[d.sup.1, d.sup.2, . . . , d.sup.K ].sup.T where the superscript T indicates the transpose. With the matrix X, the outputs from the linear summer 14 for all the members of the data set are denoted by a K-dimensional vector s and expressed as EQU s=X.sup.T w. (1)
A bias term w.sub.0 is added to each element of the vector s, and the linear combiner's outputs for the data set, denoted by a K-dimensional vector u, can be represented by EQU u=s+w.sub.0 1 (2)
where 1 is the K-dimensional column vector of ones, i.e., 1=[1, 1, . . . , 1].sup.T. The unit's outputs can be expressed as y=f(u), where f(u) denotes the K-dimensional vector obtained by applying the limiter function f to each element of u, i.e., f(u)=[f(u.sup.1), f(u.sup.2), . . . , f(u.sup.K)].sup.T. When the class indicator d is bipolar, the boundary margins in the data set are denoted by a K-dimensional vector z, (z=[d.sup.1 u.sup.1, d.sup.2 u.sup.2, . . . , d.sup.K u.sup.K ].sup.T), and can be computed by the equation z=Du, where D denotes the diagonal matrix whose diagonal entries are the elements of d.
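By way of illustration only, the computation performed by a single unit (Equations 1 and 2, the limiter output y=f(u) and the boundary margins z=Du) may be sketched as follows. The function names, the sample data and the choice of a bipolar limiter that outputs +1 at u=0 are assumptions made for this sketch and are not taken from the description above.

```python
import numpy as np

def unit_forward(X, w, w0):
    """Return (s, u, y) for an N x K data matrix X, weight vector w, threshold w0."""
    K = X.shape[1]
    s = X.T @ w                  # Equation 1: weighted sums, one per input vector
    u = s + w0 * np.ones(K)      # Equation 2: add the threshold to every sum
    y = np.where(u >= 0, 1, -1)  # bipolar hard limiter (tie at u = 0 is an assumption)
    return s, u, y

def boundary_margins(u, d):
    """Return z = D u, where D is the diagonal matrix built from d."""
    return np.diag(d) @ u

# Hypothetical data set: N = 2 entries per vector, K = 4 vectors (the columns of X).
X = np.array([[2.0, 1.0, -1.0, -2.0],
              [1.0, 2.0, -2.0, -1.0]])
d = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])
w0 = 0.5

s, u, y = unit_forward(X, w, w0)
z = boundary_margins(u, d)
print(u)  # linear combiner outputs
print(y)  # limiter outputs
print(z)  # boundary margins, all positive when every point is correctly classified
```

In this sketch every margin is positive, so the limiter outputs coincide with the class indicators.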
The classifier's errors are defined as the discrepancies between the unit's outputs and the corresponding class-indicators. For correct classification, the output-error for each element of the data set 10 is required to be zero, i.e., d.sup.i -y.sup.i =0, i=1, 2, . . . , K, when the boundary is placed between the clusters (referred to hereinafter as the zero-error requirement). As the dual representation of this zero-error requirement, the boundary-margin is to be positive at each member of the data set, i.e., z.sup.i =u.sup.i d.sup.i >0, i=1, 2, . . . , K, which is regarded as the positive boundary-margin requirement.
The training performance for the zero-error output requirement is traditionally measured by a scalar-valued function, the sum of the squared errors J.sub.P over the data set, which is given by EQU J.sub.P =[d-y].sup.T [d-y]. (3)
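As a minimal sketch, the error criterion of Equation 3 and its relation to the positive boundary-margin requirement may be checked numerically as follows. The combiner outputs used here are hypothetical values chosen for illustration, and the bipolar limiter is assumed to output +1 at u=0.

```python
import numpy as np

def hard_limiter(u):
    """Bipolar hard limiter; returning +1 at u = 0 is an assumption of this sketch."""
    return np.where(u >= 0, 1, -1)

def squared_error(d, y):
    """Equation 3: J_P = [d - y]^T [d - y]."""
    e = d - y
    return float(e @ e)

d = np.array([1, 1, -1, -1])
u_good = np.array([0.7, 2.0, -0.3, -1.5])  # every margin d*u is positive
u_bad = np.array([0.7, -0.2, -0.3, 0.4])   # two margins are negative

J_good = squared_error(d, hard_limiter(u_good))
J_bad = squared_error(d, hard_limiter(u_bad))
misclassified = int(np.sum(d * u_bad < 0))

print(J_good)                # zero error when all margins are positive
print(J_bad, misclassified)  # each bipolar misclassification contributes 2**2 = 4
```

With bipolar indicators each misclassified point adds 4 to J.sub.P, so the criterion records only how many points are wrong and not which ones.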
With conventional approaches for building artificial neural networks, the realized neural networks have a feed-forward structure of three or more layers of the aforementioned artificial neurons. A feed-forward structure is defined as a neural network in which the neurons are not connected within one layer, but are usually fully connected between layers, and in which information flows one way toward the network's outputs without any feedback loops.
For correct classification in a feed-forward structured artificial neural network, the weights and bias of every unit in the network must be appropriately determined through a training or learning process. The majority of the training schemes conventionally used are back-propagation methods, as described in Rumelhart, D. E., Hinton, G. E. & Williams, R. J. (1986), Learning internal representations by error propagation, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations, D. E. Rumelhart and J. L. McClelland, Eds. Cambridge, Mass.: M.I.T. Press. In back-propagation, the output-error criterion, Equation 3, is used directly as an objective function, and a numerical optimization method is applied to minimize it, leading to iterative adjustment of the network's weights and thresholds, as explained in Rumelhart et al. However, because the nonlinearity of the unit's limiter f(u) is implicitly present in the vector y, the limiter is softened, as indicated by the smooth function profile. This compromise induces massive iterations in the search for the solution, since the gradient information of the limiter is effectively available only within a narrow band of the softened limiter. Thus, back-propagation methods for training a feed-forward artificial neural network are computationally intensive and time-consuming.
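The narrow band of usable gradient information may be illustrated with a short sketch. The hyperbolic tangent is assumed here as the softened limiter purely for illustration; the description above does not specify which smooth function is used.

```python
import numpy as np

def soft_limiter(u):
    """Softened bipolar limiter; tanh is an assumed choice for this sketch."""
    return np.tanh(u)

def soft_limiter_grad(u):
    """Derivative of tanh(u), i.e. 1 - tanh(u)**2."""
    return 1.0 - np.tanh(u) ** 2

u = np.array([0.0, 1.0, 3.0, 10.0])
grad = soft_limiter_grad(u)
for ui, gi in zip(u, grad):
    print(f"u = {ui:5.1f}   f'(u) = {gi:.3e}")
```

The gradient is largest at u=0 and decays rapidly away from it, so a unit whose combiner output lies far from the boundary contributes almost nothing to a gradient-based weight update.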
Moreover, the quadratic form of the output-errors, Equation 3, to be minimized as a single-valued objective function, is not able to fully embody the classification requirement of zero-error, since it aggregates the errors over the entire data set. This scalar representation of the errors possesses multiple local solutions for its minimum due to the limiter's nonlinearity, and the majority of these solutions only partially satisfy the classification requirement. Therefore, the minimum solution obtained for the weights and thresholds often only partially fulfills the training requirements, which results in ill-positioning of the boundaries, leading to prematurely-trained artificial neural networks.
The fundamental drawback of the back-propagation methods stems from the structural assumption made about the network, concerning the number of neurons and their connection pattern, especially for the network's middle layer(s). The necessary number of neurons is assumed, and all of them are fully connected between the layers. In back-propagation, the role of the middle layers, i.e., the artificial neurons connected between the input and output layers, is so poorly understood that they are often referred to as the hidden layers.
In back-propagation, the training problem is defined as nonlinear parameter optimization on the assumed network structure. Thus, even if the global minimum solution is reached on the assumed structure, the trained neural network does not necessarily lead to correct classification.
Another approach to building an artificial neural network is the optimal associative mapping/linear least square method. In this method, the limiter's nonlinearity, f, is omitted from the neuron model to generate linear outputs, and the error criterion, Equation 3, is altered as EQU J.sub.C =[d-u].sup.T [d-u]. (4)
Differentiation of this error criterion J.sub.C with respect to w and w.sub.0 yields the necessary condition for the least square-errors between u and d, ##EQU1## where w* and w*.sub.0 denote the optimized weights and threshold, respectively.
By solving the above linear equation, an analytical expression for the optimized weights and threshold can be obtained. This approach is fundamental to statistical analysis and is described, for example, in Campbell, S. L. & Meyer, C. D. (1979) Generalized Inverses of Linear Transformations, London: Pitman Publishing. Interpreted in terms of pattern recognition, it is known as optimal associative mapping, as described in Kohonen, T., (1988) Self-Organization and Associative Memory. 2nd Ed., New York: Springer-Verlag.
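The following is a minimal sketch of such a non-iterative solution. It assumes that the necessary condition takes the usual normal-equation form obtained by setting the derivatives of J.sub.C to zero, namely XX.sup.T w*+X1w*.sub.0 =Xd and 1.sup.T X.sup.T w*+Kw*.sub.0 =1.sup.T d; this form is a reconstruction from the definitions above and not a reproduction of the original equation. The function name and the sample data are hypothetical.

```python
import numpy as np

def least_squares_unit(X, d):
    """Solve the stacked normal equations for (w*, w0*).

    X is the N x K data matrix and d the K-vector of class indicators.
    """
    N, K = X.shape
    ones = np.ones(K)
    Xo = X @ ones                                   # X 1, an N-vector
    A = np.block([[X @ X.T, Xo[:, None]],
                  [Xo[None, :], np.array([[float(K)]])]])
    b = np.concatenate([X @ d, [ones @ d]])
    sol = np.linalg.solve(A, b)
    return sol[:N], sol[N]

# Hypothetical data set: N = 2, K = 5.
X = np.array([[2.0, 1.0, -1.0, -2.0, 0.5],
              [1.0, 2.0, -2.0, -1.0, -0.5]])
d = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
w_star, w0_star = least_squares_unit(X, d)

# Cross-check against a generic least-squares solver on the design matrix [X^T, 1].
design = np.column_stack([X.T, np.ones(X.shape[1])])
reference, *_ = np.linalg.lstsq(design, d, rcond=None)
print(w_star, w0_star)
print(np.allclose(np.append(w_star, w0_star), reference))
```

The solution is obtained in a single linear solve, which is the computational advantage of this approach over iterative training.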
However, the omission of the nonlinearity in the optimal associative mapping/linear least square approach impairs the performance of the artificial neural network as a pattern-classifier. Although the omission leads to a computationally advantageous non-recursive algorithm, the bias term of the linearized limiter can shift the classification boundary into an unintended pattern cluster, resulting in incorrect pattern classification.
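This effect may be illustrated with a small sketch on a hypothetical one-dimensional data set (N=1, K=4) in which one correctly labeled point lies far from the others. The data values and the reference boundary are assumptions chosen for illustration only.

```python
import numpy as np

# Hypothetical data: one point of class -1 and three of class +1,
# one of which lies far from the boundary. The set is linearly separable.
X = np.array([[-1.0, 1.0, 2.0, 50.0]])
d = np.array([-1.0, 1.0, 1.0, 1.0])
K = X.shape[1]

# Linear least-squares fit of u = X^T w + w0 to d (limiter omitted).
design = np.column_stack([X.T, np.ones(K)])
sol, *_ = np.linalg.lstsq(design, d, rcond=None)
w_ls, w0_ls = sol[:-1], sol[-1]
z_ls = d * (X.T @ w_ls + w0_ls)     # boundary margins of the least-squares unit

# A reference boundary at the origin separates the same data correctly.
w_ref, w0_ref = np.array([1.0]), 0.0
z_ref = d * (X.T @ w_ref + w0_ref)

print(z_ls)   # the first margin is negative: the class -1 point is misclassified
print(z_ref)  # all margins positive
```

The distant point pulls the fitted line so that its bias places the boundary beyond the point of class -1, even though a correctly separating boundary exists.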
Still other approaches to building artificial neural networks have involved either a centered data matrix, a potential function, or a ridge estimator. To build an artificial neural network by creating a centered data matrix, linear Equation 5 is divided into two parts, as shown in Campbell, S. L. & Meyer, C. D., (1979) Generalized Inverses of Linear Transformations. London: Pitman Publishing: (a) the bias optimized with the error criterion of Equation 4 is given by EQU w*.sub.0 =1.sup.T d/K-{overscore (w)}*.sub.0
where {overscore (w)}*.sub.0 ={overscore (x)}.sup.T w* and {overscore (x)}.sup.T =(1/K)1.sup.T X.sup.T, the overscore denoting an average over the data set, and (b) the optimal weight vector must satisfy the following equation. EQU CX.sup.T w*=Cd.
The K.times.K matrix C is known as the centering matrix, which is described, for example, in Wetherill, G. (1986), Regression Analysis with Applications, Chapman and Hall: New York, and is defined as EQU C=[I-(1/K)11.sup.T ] (6)
where I denotes a K.times.K identity matrix. It shifts the coordinate origin of the data set to its data center, where the unit's weights are to be optimized. When a K.times.N matrix .XI. denotes the centered input matrix, .XI.=CX.sup.T, the relation between .XI. and X is given by EQU .XI.=X.sup.T -1{overscore (x)}.sup.T. (7)
The above relation indicates that the input data matrix must be centered when the weight vector is optimized separately from the bias term.
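As a hedged sketch, the centering relations may be verified numerically as follows. The sample data are hypothetical, the variable x_bar stands for the data-set average of the input vectors, and the equation for the weight vector is solved here in the least-squares sense, which is an assumption of this sketch since the system has K equations in N unknowns.

```python
import numpy as np

def centering_matrix(K):
    """Equation 6: C = I - (1/K) 1 1^T."""
    return np.eye(K) - np.ones((K, K)) / K

# Hypothetical data set: N = 2, K = 5.
X = np.array([[2.0, 1.0, -1.0, -2.0, 0.5],
              [1.0, 2.0, -2.0, -1.0, -0.5]])
d = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
K = X.shape[1]

C = centering_matrix(K)
x_bar = X.mean(axis=1)                    # average input vector
Xi = C @ X.T                              # centered K x N input matrix
relation_holds = np.allclose(Xi, X.T - np.outer(np.ones(K), x_bar))  # Equation 7

# (b) weights from the centered system, (a) bias from the averages.
w_c, *_ = np.linalg.lstsq(Xi, C @ d, rcond=None)
w0_c = d.mean() - x_bar @ w_c

# Joint optimization of weights and bias for comparison.
design = np.column_stack([X.T, np.ones(K)])
joint, *_ = np.linalg.lstsq(design, d, rcond=None)

print(relation_holds)
print(np.allclose(np.append(w_c, w0_c), joint))
```

Under these assumptions the two-part solution coincides with the jointly optimized weights and bias.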
Another approach to building an artificial neural network involves creating a potential function. Excluding the constant term d.sup.T d from the expanded error criterion J.sub.C, Equation 4, and reversing the sign of the remaining terms gives EQU J.sub.D =2d.sup.T u-u.sup.T u,
which is regarded as the correlation learning potential, one of the learning potential functions for neurons in the neural network, as described, for example, in Amari, S. (1991), Mathematics in Neural Networks. Tokyo: San-Gyoh Tosho. This learning potential is used to represent the iterative nature of biological learning/training processes, in which the dynamics of the weights are formulated with a differential equation in a statistical sense. The weights are shown to converge statistically to the averaged equilibrium at which the correlation between d and u is maximized and, simultaneously, the magnitude of u is minimized. Maximizing the correlation learning potential J.sub.D yields the same weights as those that minimize the square errors between u and d.
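The equivalence stated above follows from the identity J.sub.C +J.sub.D =d.sup.T d, which may be checked numerically with the following sketch. The sample data, the random trial weights and the function names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical data set: N = 2, K = 5.
X = np.array([[2.0, 1.0, -1.0, -2.0, 0.5],
              [1.0, 2.0, -2.0, -1.0, -0.5]])
d = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
K = X.shape[1]

def J_C(w, w0):
    """Equation 4: squared error between d and the linear output u."""
    u = X.T @ w + w0
    return float((d - u) @ (d - u))

def J_D(w, w0):
    """Correlation learning potential: 2 d^T u - u^T u."""
    u = X.T @ w + w0
    return float(2 * d @ u - u @ u)

# The two criteria differ only by the constant d^T d, for any weights.
rng = np.random.default_rng(0)
w_rand, w0_rand = rng.normal(size=2), rng.normal()
identity_holds = np.isclose(J_C(w_rand, w0_rand) + J_D(w_rand, w0_rand), d @ d)

# Hence the least-squares weights also maximize J_D.
design = np.column_stack([X.T, np.ones(K)])
sol, *_ = np.linalg.lstsq(design, d, rcond=None)
w_ls, w0_ls = sol[:2], sol[2]
ls_maximizes = J_D(w_ls, w0_ls) >= J_D(w_rand, w0_rand)

print(identity_holds, ls_maximizes)
```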
A still further approach to building an artificial neural network uses a ridge estimator in combination with the above optimal associative mapping/linear least square approach. The data matrix X often becomes singular or near singular as its dimension increases, which leads to computational difficulty. To accommodate this, a term in the norm of the weight vector with a parameter k, that is, kw.sup.T w, is added as an auxiliary term to the output square errors, Equation 4. EQU J.sub.ridge =[d-u].sup.T [d-u]+kw.sup.T w.
The linear equation derived by differentiating the above J.sub.ridge is known as the ridge estimator, as described in Bibby, J. & Toutenburg, H., (1977), Prediction and Improved Estimation in Linear Models, New York: John Wiley & Sons. Although the parameter k distorts the optimized solution, its presence gives computational stability. The larger the value of k, the more the solution is skewed and the greater the numerical stability.
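The trade-off controlled by k may be illustrated with the following sketch, which minimizes J.sub.ridge on centered data with the bias left unpenalized. The nearly collinear sample data, the values of k and the function name are hypothetical, and the condition number of the regularized matrix is used here as an assumed measure of numerical stability.

```python
import numpy as np

def ridge_unit(X, d, k):
    """Minimize [d-u]^T [d-u] + k w^T w with u = X^T w + w0 1.

    Returns (w, w0, cond), where cond is the condition number of the
    regularized matrix that is inverted to obtain w.
    """
    N = X.shape[0]
    x_bar = X.mean(axis=1)
    Xi = X.T - x_bar                      # centered K x N input matrix
    G = Xi.T @ Xi + k * np.eye(N)
    w = np.linalg.solve(G, Xi.T @ d)
    w0 = d.mean() - x_bar @ w
    return w, w0, np.linalg.cond(G)

# Hypothetical data: the second input is almost a copy of the first,
# so the unregularized problem is nearly singular.
x1 = np.array([2.0, 1.0, -1.0, -2.0, 0.5, -0.5])
x2 = x1 + 1e-3 * np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
X = np.vstack([x1, x2])
d = np.array([1.0, 1.0, -1.0, -1.0, 1.0, -1.0])

ks = [0.0, 0.01, 1.0, 100.0]
results = [ridge_unit(X, d, k) for k in ks]
norms = [float(np.linalg.norm(w)) for w, _, _ in results]
conds = [c for _, _, c in results]
for k, n, c in zip(ks, norms, conds):
    print(f"k = {k:7.2f}   |w| = {n:10.4f}   cond = {c:.3e}")
```

As k grows, the condition number falls while the weight vector is pulled further toward zero and away from the least-squares solution.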