Pattern recognition, or classification of data, generally consists of obtaining raw input data and classifying the data based on a set of predefined categories or data patterns. Data classification can be used in a wide variety of contexts, including handwriting recognition, voice recognition and interpretation of scanned documents. While humans are very skillful at automatically resolving variations in raw data to classify data into standardized categories, machine interpretation and pattern recognition remain complex tasks. For example, an individual reading a handwritten note can usually identify words and characters despite variations and idiosyncrasies in the author's handwriting, while a computing machine may be unable to correctly interpret the input data.
Correct classification of data can be particularly important for computers, processors and the like. Frequently, raw input data must be classified using predefined categories, such as data types, before additional processing can take place. For example, English alphabet characters can be encoded using the American Standard Code for Information Interchange (ASCII) or the Unicode standard prior to processing.
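As a minimal illustration of the encoding step mentioned above, the following Python sketch maps already-recognized characters to their ASCII byte values and Unicode code points. The function name `encode_characters` is hypothetical and introduced only for this example; the sketch assumes recognition has already produced a text string.

```python
def encode_characters(text):
    """Return (character, ASCII byte value, Unicode code point) for each character.

    Hypothetical helper for illustration only. Assumes `text` contains
    only ASCII characters; str.encode raises UnicodeEncodeError otherwise.
    """
    ascii_bytes = text.encode("ascii")
    # For ASCII characters, the byte value and the Unicode code point coincide.
    return [(ch, byte, ord(ch)) for ch, byte in zip(text, ascii_bytes)]


if __name__ == "__main__":
    for ch, byte, code_point in encode_characters("Hi"):
        print(ch, byte, code_point)
```

Characters outside the ASCII range cannot be represented this way and would instead require a Unicode encoding such as UTF-8.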
Certain types of raw data vary both with the entity providing the data and across the instances of data provided by a single entity. For example, each instance of a handwritten character provided by the same user will have small variations. Handwritten characters provided by different users will have larger variations based upon the varying styles of the authors. The larger variations between samples provided by different entities can make pattern recognition difficult even for humans.
One approach to pattern recognition has been to use machine learning systems to recognize data patterns such as alphanumeric characters. Typically, machine learning systems are trained using a large number of samples. If the raw data to be classified varies based upon the entity providing the samples (e.g., handwriting), samples can be provided by multiple users. This training set of samples can be used to generalize the machine learning system, such that the system can recognize input from new users from whom the system has received no training data.
However, if the variations caused by the entity providing the samples are too large, recognition can be difficult for a machine learning system, however large and varied the set of training data. For example, each person's handwriting is so distinct that it can be used for identification purposes. As a result, learning from training data obtained from one set of users, even a large set of users, does not necessarily produce models that generalize well to new handwriting styles. Recognition using a generic, writer-independent recognizer can perform especially poorly for users with rare writing styles. Similarly, large variations in samples provided by different entities can make data classification difficult for generically trained machine learning systems in other contexts. For example, variations in the accent, dialect and enunciation of users complicate voice recognition.