Origin of the Invention
The invention described herein was made in the performance of work under a NASA contract, and is subject to the provisions of Public Law 96-517 (35 USC 202) in which the contractor has elected not to retain title.
Microfiche Appendix
A computer program (microfiche, 26 pages) embodying the invention is listed in the microfiche appendix filed with this specification. The microfiche appendix contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyrights whatsoever.
Technical Field
The invention relates to methods for training neural networks and in particular to neural network training methods using adjoint systems of equations corresponding to the forward sensitivity equations of the neural network.
Background Art
The following publications represent the state of the art in neural network training techniques, and are referred to in the specification below by author name and year:
Barhen, J., Toomarian, N. and Gulati, S. (1990a). "Adjoint operator algorithms for faster learning in dynamical neural networks". In David S. Touretzky (Ed.), Advances in Neural Information Processing Systems. Vol. 2, 498-508, San Mateo, Calif.: Morgan Kaufmann.
Barhen, J., Toomarian, N. and Gulati, S. (1990b). "Application of adjoint operators to neural learning". Applied Mathematics Letters, 3 (3), 13-18.
Cacuci, D. G. (1981). "Sensitivity theory for nonlinear systems". Journal of Mathematical Physics, 22 (12), 2794-2802.
Grossberg, S. (1987). The Adaptive Brain. Vol. 2, North-Holland.
Hirsch, M. W. (1989). "Convergent activation dynamics in continuous time networks". Neural Networks, 2 (5), 331-349.
Maudlin, P. J., Parks, C. V. and Weber, C. F. (1980). "Thermal-hydraulic differential sensitivity theory". American Society of Mechanical Engineers paper WA/HT-56.
Narendra, K. S. and Parthasarathy, K. (1990). "Identification and control of dynamical systems using neural networks". IEEE Transactions on Neural Networks, 1 (1), 4-27.
Oblow, E. M. (1978). "Sensitivity theory for reactor thermal-hydraulic problems". Nuclear Science and Engineering, 68, 322-357.
Parlos, A. G., et al. (1991). "Dynamic learning in recurrent neural networks for nonlinear system identification", preprint.
Pearlmutter, B. A. (1989). "Learning state space trajectories in recurrent neural networks". Neural Computation, 1 (2), 263-269.
Pearlmutter, B. A. (1990). "Dynamic recurrent neural networks". Technical Report CMU-CS-90-196, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa.
Pineda, F. (1990). "Time dependent adaptive neural networks". In David S. Touretzky (Ed.), Advances in Neural Information Processing Systems. Vol. 2, 710-718, San Mateo, Calif.: Morgan Kaufmann.
Rumelhart, D. E., Hinton, G. E. and Williams, R. J. (1986). "Learning internal representations by error propagation". In D. E. Rumelhart, J. L. McClelland and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, Foundations, Cambridge: MIT Press/Bradford Books.
Sato, M. (1990). "A real time learning algorithm for recurrent analog neural networks". Biological Cybernetics, 62 (3), 237-242.
Toomarian, N., Wacholder, E., and Kaizerman, S. (1987). "Sensitivity analysis of two-phase flow problems". Nuclear Science and Engineering, 99 (1), 53-81.
Toomarian, N. and Barhen, J. (1991). "Adjoint operators and non-adiabatic algorithms in neural networks". Applied Mathematics Letters, 4 (2), 69-73.
Werbos, P. J. (1990). "Backpropagation through time: what it does and how to do it". Proceedings of the IEEE, 78 (10), 1550-1560.
Williams, R. J., and Zipser, D. (1988). "A learning algorithm for continually running fully recurrent neural networks". Technical Report ICS Report 8805, UCSD, La Jolla, Calif. 92093.
Williams, R. J., and Zipser, D. (1989). "A learning algorithm for continually running fully recurrent neural networks". Neural Computation, 1 (2), 270-280.
Zak, M. (1989). "Terminal attractors in neural networks". Neural Networks, 2 (4), 259-274.
1. INTRODUCTION
Recently, there has been tremendous interest in developing learning algorithms capable of modeling time-dependent phenomena (Grossberg, 1987; Hirsch, 1989). In particular, considerable attention has been devoted to capturing the dynamics embedded in observed temporal sequences (e.g., Narendra and Parthasarathy, 1990; Parlos et al., 1991).
In general, the neural architectures under consideration may be classified into two categories:
Feedforward networks, in which backpropagation through time (Werbos, 1990) can be implemented. This architecture has been extensively analyzed, and is widely used in simple applications due, in particular, to the straightforward nature of its formalism.
Recurrent networks, also referred to as feedback or fully connected networks, which are currently receiving increased attention. A key advantage of recurrent networks lies in their ability to use information about past events for current computations. Thus, they can provide time-dependent outputs for both time-dependent and time-independent inputs.
One may argue that, for many real world applications, feedforward networks suffice. Furthermore, a recurrent network can, in principle, be unfolded into a multilayer feedforward network (Rumelhart et al., 1986). A detailed analysis of the merits and demerits of these two architectures is beyond the scope of this paper. Here, we will focus only on recurrent networks.
The problem of temporal learning can typically be formulated as a minimization, over an arbitrary but finite time interval, of an appropriate error functional. The gradients of the functional with respect to the various parameters of the neural architecture (e.g., synaptic weights and neural gains) are essential elements of the minimization process and, in the past, major efforts have been devoted to the efficacy of their computation. Calculating the gradients of a system's output with respect to different parameters of the system is, in general, of relevance to several disciplines. Hence, a variety of methods have been proposed in the literature for computing such gradients. A recent survey of techniques which have been considered specifically for temporal learning can be found in Pearlmutter (1990). We will briefly mention only those which are relevant to our work.
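As an illustration only, one common choice of error functional is a time-integrated squared error minimized by gradient descent. The symbols below are assumptions introduced for this sketch and are not taken from the specification: u_n(t; p) denotes the output of neuron n, a_n(t) its desired value, p a generic network parameter, [t_0, t_f] the learning interval, and eta a step size.

```latex
E(p) = \frac{1}{2} \int_{t_0}^{t_f} \sum_{n} \left[ u_n(t; p) - a_n(t) \right]^2 \, dt,
\qquad
p \leftarrow p - \eta \, \frac{dE}{dp}
```

In this form, each update requires dE/dp, which is why the cost of computing the gradient dominates the cost of learning.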
Sato (1990) proposed, at the conceptual level, an algorithm based upon Lagrange multipliers. However, his algorithm has not yet been validated by numerical simulations, nor has its computational complexity been analyzed. Williams and Zipser (1989) presented a scheme in which the gradients of an error functional with respect to network parameters are calculated by direct differentiation of the neural activation dynamics. This approach is computationally very expensive and scales poorly to large systems. The inherent advantage of the scheme is the small storage capacity required, which scales as $O(N^3)$, where N denotes the size of the network.
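The direct-differentiation idea can be sketched in a few lines. The following is a minimal illustration, not the cited authors' formulation: it assumes a hypothetical network du/dt = -u + tanh(W u) + I, a simple Euler discretization, constant targets a, and arbitrary numerical values. The sensitivities p[n][i][j] = du_n/dW_ij are integrated forward alongside the state, so no trajectory is stored, but N^3 sensitivity values must be carried and updated at every step.

```python
import math

def simulate(W, I, a, dt=0.05, steps=100, with_sensitivity=False):
    """Euler-integrate du/dt = -u + tanh(W u) + I from u(0) = 0.

    Returns (E, grad), where E = sum_k dt * 0.5 * sum_n (u_n(t_k) - a_n)^2
    and grad is dE/dW (None when with_sensitivity is False).
    """
    N = len(W)
    u = [0.0] * N
    # p[n][i][j] = d u_n / d W_ij: N^3 numbers carried forward in time.
    p = [[[0.0] * N for _ in range(N)] for _ in range(N)]
    grad = [[0.0] * N for _ in range(N)]
    E = 0.0
    for _ in range(steps):
        s = [sum(W[n][m] * u[m] for m in range(N)) for n in range(N)]
        g = [1.0 - math.tanh(s[n]) ** 2 for n in range(N)]  # tanh'(s_n)
        if with_sensitivity:
            # Differentiate the Euler step itself; this quadruple loop
            # costs on the order of N^4 operations per time step.
            p_new = [[[0.0] * N for _ in range(N)] for _ in range(N)]
            for n in range(N):
                for i in range(N):
                    for j in range(N):
                        coupling = sum(W[n][m] * p[m][i][j] for m in range(N))
                        direct = u[j] if n == i else 0.0
                        p_new[n][i][j] = p[n][i][j] + dt * (
                            -p[n][i][j] + g[n] * (coupling + direct))
            p = p_new
        u = [u[n] + dt * (-u[n] + math.tanh(s[n]) + I[n]) for n in range(N)]
        err = [u[n] - a[n] for n in range(N)]
        E += dt * 0.5 * sum(e * e for e in err)
        if with_sensitivity:
            for i in range(N):
                for j in range(N):
                    grad[i][j] += dt * sum(err[n] * p[n][i][j] for n in range(N))
    return E, (grad if with_sensitivity else None)

def finite_difference(W, I, a, i, j, eps=1e-6):
    """Central-difference estimate of dE/dW_ij, used only as a check."""
    Wp = [row[:] for row in W]
    Wm = [row[:] for row in W]
    Wp[i][j] += eps
    Wm[i][j] -= eps
    return (simulate(Wp, I, a)[0] - simulate(Wm, I, a)[0]) / (2 * eps)

# Hypothetical 3-neuron example; all values are arbitrary.
W = [[0.2, -0.5, 0.1], [0.4, 0.1, -0.3], [-0.2, 0.3, 0.0]]
I = [0.5, -0.1, 0.2]
a = [0.3, 0.0, -0.2]
E, grad = simulate(W, I, a, with_sensitivity=True)
max_diff = max(abs(grad[i][j] - finite_difference(W, I, a, i, j))
               for i in range(3) for j in range(3))
print("E =", E, " max |sensitivity - finite difference| =", max_diff)
```

The finite-difference comparison is included only to check that the sensitivity recursion is consistent with the discretized dynamics.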
Pearlmutter (1989), on the other hand, described a variational method which yields a set of linear ordinary differential equations for backpropagating the error through the system. These equations, however, need to be solved backwards in time, and require temporal storage of variables from the network activation dynamics, thereby reducing the attractiveness of the algorithm. Recently, Toomarian and Barhen (1991) suggested a framework which, in contradistinction to Pearlmutter's formalism, enables the error propagation system of equations to be solved forward in time, concomitantly with the neural activation dynamics. A drawback of this novel approach came from the fact that their equations had to be analyzed in terms of distributions, which precluded straightforward numerical implementation. Finally, Pineda (1990) proposed combining the existence of disparate time scales with a heuristic gradient computation. The underlying adiabatic assumptions and highly approximate gradient evaluation technique, however, placed severe limits on the applicability of his method.
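The trade-off of a backward-in-time scheme can be illustrated with a small discrete adjoint sketch. This is not Pearlmutter's exact formulation: it assumes a hypothetical network du/dt = -u + tanh(W u) + I, Euler time stepping, constant targets a, and arbitrary numerical values. The adjoint variable has only N components and obeys a linear recursion, but the sweep runs backwards and reads the state trajectory saved during the forward pass.

```python
import math

def forward(W, I, dt, steps):
    """Euler-integrate du/dt = -u + tanh(W u) + I from u(0) = 0, keeping every state."""
    N = len(W)
    traj = [[0.0] * N]
    for _ in range(steps):
        u = traj[-1]
        s = [sum(W[n][m] * u[m] for m in range(N)) for n in range(N)]
        traj.append([u[n] + dt * (-u[n] + math.tanh(s[n]) + I[n]) for n in range(N)])
    return traj

def error(traj, a, dt):
    """Discretized error functional: sum_k dt * 0.5 * sum_n (u_n(t_k) - a_n)^2."""
    return sum(dt * 0.5 * sum((u[n] - a[n]) ** 2 for n in range(len(a)))
               for u in traj[1:])

def adjoint_gradient(W, I, a, dt=0.05, steps=100):
    """Return (E, dE/dW) using one forward pass and one backward adjoint pass."""
    N = len(W)
    traj = forward(W, I, dt, steps)   # stored trajectory: (steps + 1) * N numbers
    lam = [0.0] * N                   # adjoint variable: only N numbers
    grad = [[0.0] * N for _ in range(N)]
    for k in range(steps - 1, -1, -1):            # sweep backwards in time
        u_k, u_next = traj[k], traj[k + 1]
        # After this line lam = dE/du at step k+1 (direct error plus later steps).
        lam = [lam[n] + dt * (u_next[n] - a[n]) for n in range(N)]
        s = [sum(W[n][m] * u_k[m] for m in range(N)) for n in range(N)]
        g = [1.0 - math.tanh(s[n]) ** 2 for n in range(N)]   # tanh'(s_n)
        for i in range(N):
            for j in range(N):
                grad[i][j] += lam[i] * dt * g[i] * u_k[j]
        # Linear backward step: multiply by the transposed Jacobian of the Euler map.
        lam = [(1.0 - dt) * lam[m]
               + dt * sum(g[n] * W[n][m] * lam[n] for n in range(N))
               for m in range(N)]
    return error(traj, a, dt), grad

def finite_difference(W, I, a, i, j, dt=0.05, steps=100, eps=1e-6):
    """Central-difference estimate of dE/dW_ij, used only as a check."""
    Wp = [row[:] for row in W]
    Wm = [row[:] for row in W]
    Wp[i][j] += eps
    Wm[i][j] -= eps
    return (error(forward(Wp, I, dt, steps), a, dt)
            - error(forward(Wm, I, dt, steps), a, dt)) / (2 * eps)

# Hypothetical 3-neuron example; all values are arbitrary.
W = [[0.2, -0.5, 0.1], [0.4, 0.1, -0.3], [-0.2, 0.3, 0.0]]
I = [0.5, -0.1, 0.2]
a = [0.3, 0.0, -0.2]
E, grad = adjoint_gradient(W, I, a)
max_diff = max(abs(grad[i][j] - finite_difference(W, I, a, i, j))
               for i in range(3) for j in range(3))
print("E =", E, " max |adjoint - finite difference| =", max_diff)
```

In this sketch the stored trajectory grows with the number of time steps, which is the storage burden the paragraph above attributes to backward-in-time schemes.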