Single-hidden-layer neural networks (SHLNNs) with least-squares-error training are commonly used in data processing applications such as pattern classification, including but not limited to image classification, text classification, handwriting recognition, and speech recognition, owing partly to their powerful modeling ability and partly to the existence of efficient learning algorithms.
In a single-hidden-layer neural network, given the set of input vectors X=[x1, . . . , xi, . . . , xN], each vector is denoted by xi=[x1i, . . . , xji, . . . , xDi]T, where D is the dimension of the input vector and N is the total number of training samples. Further, L is the number of hidden units and C is the dimension of the output vector. The output of the SHLNN is yi=UThi, where hi=σ(WTxi) is the hidden-layer output, U is an L×C weight matrix at the upper layer, W is a D×L weight matrix at the lower layer, and σ(·) is the sigmoid function. Bias terms are implicitly represented in the above formulation if xi and hi are augmented with 1's.
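As a minimal sketch of the forward pass just described (not from the source; the dimensions D, L, C, N and random weights are illustrative assumptions), the computation yi=UThi with hi=σ(WTxi) can be written in NumPy with samples stored as columns:

```python
import numpy as np

def sigmoid(z):
    # elementwise logistic sigmoid sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: D input dims, L hidden units, C output dims, N samples.
D, L, C, N = 4, 8, 3, 10
rng = np.random.default_rng(0)

X = rng.standard_normal((D, N))   # input vectors as columns: X = [x1, ..., xN]
W = rng.standard_normal((D, L))   # lower-layer weight matrix (D x L)
U = rng.standard_normal((L, C))   # upper-layer weight matrix (L x C)

H = sigmoid(W.T @ X)              # hidden-layer outputs hi = sigma(W^T xi), shape (L, N)
Y = U.T @ H                       # network outputs yi = U^T hi, shape (C, N)
```

Augmenting each xi and hi with a constant 1 (an extra row of ones in X and H) would absorb the bias terms into W and U, as the text notes.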
Given the target vectors T=[t1, . . . , ti, . . . , tN], where each target ti=[t1i, . . . , tji, . . . , tCi]T, the parameters U and W are learned to minimize the square error E=∥Y−T∥2=Tr[(Y−T)(Y−T)T], where Y=[y1, . . . , yi, . . . , yN]. Once the lower-layer weights W are fixed, the hidden-layer values H=[h1, . . . , hi, . . . , hN] are also determined uniquely. The upper-layer weights U can then be determined by setting the gradient
∂E/∂U = ∂Tr[(UTH−T)(UTH−T)T]/∂U = 2H(UTH−T)T
to zero, leading to the closed-form solution U=(HHT)−1HTT, which defines an implicit constraint between the two sets of weights, U and W, via the hidden-layer output H, in the SHLNN.
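The closed-form solution above can be sketched numerically (a NumPy illustration under assumed sizes, not an implementation from the source): with W fixed, U=(HHT)−1HTT is obtained by solving the linear system (HHT)U = HTT, after which the gradient 2H(UTH−T)T vanishes up to numerical precision.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes and random data (assumptions for the sketch).
D, L, C, N = 5, 20, 2, 100
rng = np.random.default_rng(1)
X = rng.standard_normal((D, N))   # inputs, columns are samples
T = rng.standard_normal((C, N))   # targets T = [t1, ..., tN]
W = rng.standard_normal((D, L))   # fixed lower-layer weights

H = sigmoid(W.T @ X)              # hidden-layer values, shape (L, N)

# Closed-form least-squares solution: U = (H H^T)^{-1} H T^T,
# computed by solving (H H^T) U = H T^T rather than forming the inverse.
U = np.linalg.solve(H @ H.T, H @ T.T)   # shape (L, C)

# The gradient 2 H (U^T H - T)^T should be (numerically) zero at this U.
grad = 2 * H @ (U.T @ H - T).T
max_grad = float(np.max(np.abs(grad)))
```

In practice `np.linalg.lstsq` or a pseudo-inverse is often preferred when HHT is ill-conditioned, e.g. when the number of hidden units L approaches or exceeds N.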
In some systems, achieving good classification accuracy requires a large number of hidden units, which increases both the model size and the test time. Tradeoffs designed to reduce the model size are often inefficient at finding good model parameters.