Pattern recognition systems typically receive digital inputs containing patterns (e.g., digital images containing real-world objects), extract “features” from the digital inputs (i.e., numeric or symbolic information from the digital input) using a “feature extraction” component, and ultimately classify or identify various patterns found in the digital input based upon the extracted features using a “classifier” component. Frequently the features are hand-crafted or pre-specified, but this limits the richness of features and the portability of the features from one domain to another (e.g., features crafted for handwritten digits are unlikely to be useful for 3-d object recognition, let alone for speech recognition). Much more sophisticated and flexible features can be learned from observations of digital input examples (a collection of such digital input examples sometimes referred to as a “training set”); in this context, the term “feature” can be arbitrarily high-level, and may even map directly to the class of the object (e.g., in handwritten digit recognition, each feature value may correspond to one of the digits).
Effective pattern recognition systems need to be able to identify a particular pattern despite the various transformations that the pattern may potentially undergo. For example, in the context of computer vision, the same real-world object can be depicted quite differently across a number of digital images. The same real-world object could be scaled, rotated in-plane, rotated out-of-plane and illuminated from different angles with different lights across various digital images. Non-rigid real-world objects could be further stretched, bent, or skewed. Each of these transformations could potentially lead to a change of pixel values in a digital image that is greater than if the object itself were replaced by an entirely different object. An effective real-world pattern recognition computer system needs to be “invariant” to (i.e., able to recognize the object despite) the vast number of transformations an object can undergo.
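As a toy illustration of the point that a transformation can change pixel values more than swapping in a different object, the following sketch compares squared pixel distances on one-dimensional "images". The arrays and the helper name `squared_distance` are hypothetical and chosen only for illustration; they do not come from any particular system.

```python
def squared_distance(image_a, image_b):
    """Sum of squared per-pixel differences between two equal-length images."""
    return sum((a - b) ** 2 for a, b in zip(image_a, image_b))

# Hypothetical 1-D "images": a narrow bar, the same bar translated,
# and a different (wider) bar at the original location.
narrow_bar         = [0, 1, 1, 0, 0, 0, 0, 0]
narrow_bar_shifted = [0, 0, 0, 0, 0, 1, 1, 0]
wide_bar           = [0, 1, 1, 1, 0, 0, 0, 0]

same_object_moved = squared_distance(narrow_bar, narrow_bar_shifted)  # 4
different_object  = squared_distance(narrow_bar, wide_bar)            # 1

# In raw pixel space, the translated copy of the same object is farther
# from the original than a different object is.
print(same_object_moved, different_object)
```

In this example, a system that compared raw pixel values would judge the wide bar to be a closer match to the narrow bar than the narrow bar's own shifted copy.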
Many pattern recognition systems focus on one type of transformation: translation (where the “translation” of an image feature or an object in a digital image refers to the shifting of that image feature or object from one location in the digital image to a different location). Typically, the mechanism for translation invariance is embedded into the architecture of the pattern recognition system itself, and such architecture is specific to translations, e.g., pooling the responses of identical spatial feature detectors across different windows (where “pooling” is a disjunctive function of inputs or an approximation thereof). Compared to translations, other types of transformations can have far more complex influences on the digital inputs, and it is therefore impracticable to encode all such relevant transformations into the architecture. It is thus preferable for invariant pattern recognition to utilize an architecture and learning objectives that are flexible enough to be invariant to a wide range of transformations.
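The architectural mechanism described above can be sketched as follows. This is a minimal illustration, assuming a one-dimensional input, a dot-product feature detector, and `max` as the approximation of a disjunctive pooling function; the names `detector_responses` and `pooled_response` and the sample values are hypothetical.

```python
def detector_responses(signal, template):
    """Apply the same detector (a dot product with `template`) at every window."""
    width = len(template)
    return [
        sum(s * t for s, t in zip(signal[start:start + width], template))
        for start in range(len(signal) - width + 1)
    ]

def pooled_response(signal, template):
    """Pool over windows with max, a common approximation of a logical OR."""
    return max(detector_responses(signal, template))

template = [1, 2, 1]
original = [0, 1, 2, 1, 0, 0, 0, 0]
shifted  = [0, 0, 0, 0, 1, 2, 1, 0]  # same pattern, translated by three positions

# The per-window responses differ, but the pooled response does not.
print(detector_responses(original, template))  # [4, 6, 4, 1, 0, 0]
print(detector_responses(shifted, template))   # [0, 0, 1, 4, 6, 4]
print(pooled_response(original, template))     # 6
print(pooled_response(shifted, template))      # 6
```

The invariance here comes entirely from replicating one detector across positions, which is why the same construction does not extend naturally to rotations, lighting changes, or non-rigid deformations.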
While features learned by a pattern recognition system should be invariant to transformations of an object, they should simultaneously be specific and distinct enough to distinguish between patterns corresponding to different objects, yet broad enough to represent the full richness of digital input possibilities. While various rules and heuristics may encourage such properties in features, it remains challenging to consistently learn a pattern recognition system that achieves such properties, and to make proper inferences on the system, without a unifying set of mathematical objectives. For example, the pattern recognition system may assign too many features to one area of the input space, leaving much of the rest of the space under-represented. Alternatively, it may improperly infer that competing features for describing an input characteristic are both true with much higher probability than the input statistics would imply, because the inference rules inherently assume that the features are conditionally independent, while the learning rules make intrinsically causal assumptions. With regard to mathematical objectives that alleviate such concerns, one or more “input fidelity” objectives may be defined, wherein “input fidelity” refers to the preservation of relevant information about the digital input. For example, the pattern recognition system may contain a component for stochastically producing digital inputs, and optimize the probability of such component creating the digital inputs in the training set (such an objective referred to as a “likelihood objective” or simply “maximum likelihood”). Alternatively, the pattern recognition system may contain a component for reconstructing a digital input from the features, and seek reconstructions that are as similar as possible to the original digital input (such an objective referred to as a “reconstruction error objective”).
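The two input fidelity objectives named above can be sketched in a few lines. This is only an illustration under stated assumptions: inputs are binary vectors, the generative component is assumed to emit each pixel independently with a fixed probability (a Bernoulli model), and similarity is assumed to be measured by squared error. The function names and sample values are hypothetical.

```python
import math

def log_likelihood(training_set, pixel_probs):
    """Likelihood objective: log-probability that an independent-Bernoulli
    generator with per-pixel "on" probabilities `pixel_probs` produces the
    training set. Each probability must lie strictly between 0 and 1."""
    total = 0.0
    for example in training_set:
        for pixel, p in zip(example, pixel_probs):
            total += math.log(p if pixel == 1 else 1.0 - p)
    return total

def reconstruction_error(original, reconstruction):
    """Reconstruction error objective: squared error between an input and
    its reconstruction (lower is better)."""
    return sum((x - r) ** 2 for x, r in zip(original, reconstruction))

training_set = [[1, 1, 0], [1, 0, 0], [1, 1, 0]]

# Two hypothetical generators: one matched to the data, one uninformative.
matched_generator = [0.9, 0.6, 0.1]
uniform_generator = [0.5, 0.5, 0.5]

# The matched generator assigns the training set a higher log-likelihood.
print(log_likelihood(training_set, matched_generator) >
      log_likelihood(training_set, uniform_generator))  # True

print(reconstruction_error([1, 1, 0], [1, 1, 0]))        # 0
print(reconstruction_error([1, 1, 0], [0.5, 0.5, 0.5]))  # 0.75
```

Learning under the likelihood objective would adjust the generator's parameters to increase the first quantity, and learning under the reconstruction error objective would adjust the feature extractor and reconstructor to decrease the second.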
One challenge with combining invariance objectives with input fidelity objectives in existing systems is that the two objectives may have conflicting effects. Given a typical mechanism for generating or reconstructing digital inputs, a feature output with a particular state or value may only generate or reconstruct a limited set of digital inputs. If, for a transformation of interest, two different transformed states of a digital input differ by a range outside the limited generative/reconstructive range of a feature output, then such feature output may not be able to accurately generate/reconstruct the transformed states of a digital input. Given this, the input fidelity objective may work to prevent such feature from responding similarly to different transformed states of a digital input, in direct opposition to the invariance objective.
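The conflict described above can be made concrete with a small numeric sketch. It assumes a deterministic reconstructor that maps each feature value to exactly one output, and squared error as the fidelity measure; both are simplifying assumptions, and all names and values are hypothetical.

```python
def squared_error(original, reconstruction):
    """Squared error between an input and its reconstruction."""
    return sum((x - r) ** 2 for x, r in zip(original, reconstruction))

def best_single_reconstruction(inputs):
    """The one output that minimizes total squared error over `inputs`
    (their per-pixel mean)."""
    count = len(inputs)
    return [sum(column) / count for column in zip(*inputs)]

# Two transformed states of the same pattern (a translation).
view_a = [1, 1, 0, 0]
view_b = [0, 0, 1, 1]

# Invariant feature: one feature value for both views, so the reconstructor
# can only produce one output for both of them.
shared = best_single_reconstruction([view_a, view_b])
invariant_error = squared_error(view_a, shared) + squared_error(view_b, shared)

# Non-invariant features: a distinct feature value, and thus a distinct
# reconstruction, for each view.
specific_error = (
    squared_error(view_a, best_single_reconstruction([view_a]))
    + squared_error(view_b, best_single_reconstruction([view_b]))
)

print(shared)           # [0.5, 0.5, 0.5, 0.5]
print(invariant_error)  # 2.0
print(specific_error)   # 0.0
```

Under these assumptions, minimizing reconstruction error alone favors the non-invariant features, which is the pressure against invariance that the paragraph above describes.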