The present invention relates generally to modular networks and processes, and more particularly to a modular process in which each module receives data and outputs data that is structured as graphs.
Systems designers build complex systems for performing tasks, such as document processing, by first partitioning the task into manageable subtasks. In the case of document processing systems, some of these subtasks include field detection, word segmentation, and character recognition. After the task is partitioned, separate modules are built for each subtask. To build the system, one needs to specify the inputs and outputs of each module. Typically, each module is then trained, or manually optimized, outside of its context, i.e., without being connected to the remaining modules in the system. This often requires manually creating the input and output "data" for each module, in particular the intermediate modules, so that the module can be trained appropriately. After the complete system is assembled, a subset of the parameters of each of the modules is manually adjusted to maximize the overall performance. Determining the appropriate subset can be difficult, and often this final step can be extremely tedious, and time-consuming. Consequently, this design process often leads to suboptimal results. In fact, large trainable networks are often not practical for many real world applications due to the severe complexity in training them.
An example of the relationship between an intermediate module and the rest of the system can be understood with reference to a character recognition system. For example, a character recognition module can be trained to recognize well-formed individual characters. However, the role of the recognizer in the context of the entire system is usually quite different than simply recognizing the characters. Very often, character recognizers are expected to also reject badly segmented characters and other non-characters. They are also expected to produce reliable scores that can be used by a post-processor, such as a statistical language model. For example, if the task is to read the amount on a bank check, the overall or global objective function that must be minimized is the percentage of checks that must be rejected in order to attain a given error rate on the accepted checks. Merely training the character recognizer module to minimize its classification error on individual characters will not minimize the global objective function. Ideally it is desirable to find a good minimum of the global objective function with respect to all of the parameters in the system.
Furthermore, creating the intermediate data on which the module is to learn can be quite a task in and of itself Often these modules perform relatively simple functions but do so on a large amount of data, which requires a large amount of data on which to learn. In some cases, the intermediate data is not easily determined. Consequently, in most practical cases, this optimization problem appears intractable, which to date has limited the types of problems to which large trainable networks have been applied.
Another problem facing network designers consists of the ability of modules to appropriately structure data and represent the state of the module in a way that best represents the problem being solved. Traditional multi-layer neural networks can be viewed as cascades of trainable modules (e.g., the layers) that communicate their states and gradients in the form of fixed-size vectors. The limited flexibility of fixed-size vectors for data representation is a serious deficiency for many applications, notably for tasks that deal with variable length inputs (e.g. speech and handwriting recognition), or for tasks that require encoding relationships between objects or features whose number and nature can vary (such as invariant perception, scene analysis, reasoning, etc.).
Convolutional network architectures, such as Time Delay Neural Networks (TDNN) and Space Displacement Neural Networks (SDNN) as well as recurrent networks, have been proposed to deal with variable-length sequences of vectors. They have been applied with great success to optical character recognition, on-line handwriting recognition, spoken word recognition, and time-series prediction. However, these architectures still lack flexibility for tasks in which the state must encode more than simple vector sequences. A notable example is when the state is used to encode probability distributions over sequences of vectors (e.g., stochastic grammars).
In such cases, the data are best represented using a graph in which each arc contains a vector. Each path in the graph represents a different sequence of vectors. Distribution over sequences can be represented by interpreting parts of the data associated with each arc as a score or likelihood. Distributions over sequences are particularly handy for modeling linguistic knowledge in speech or handwriting recognition systems: each sequence, i.e., each path in the graph, represents an alternative interpretation of the input. Successive processing modules progressively refine the interpretation. For example, a speech recognition system starts with a single sequence of acoustic vectors, transforms it into a lattice of phonemes (i.e., a distribution over phoneme sequences), then into a lattice of words (i.e., a distribution over word sequences), then into a single sequence of words representing the best interpretation. While graphs are useful, systems designers have not been able to employ them in modular format to solve complex problems due to the difficulty in training these types of systems. It is particularly difficult to create the intermediate forms of data for the intermediate modules when using data structured as graphs.
The present invention is therefore directed to the problem of developing a modular building block for complex processes that can input and output data in a wide variety of forms, but when interconnected with other similar modular building blocks can be easily trained.