Training and classification of data is a fundamental problem shared by many application domains such as natural language, images, audio or genome data.
Traditional training and classification methods represent data as a set of features. For instance, to classify a document one might use word frequencies as features. A common method of incorporating some sequential information is the use of word n-grams. For example, instead of using only single words as features, one may additionally use pairs or triples of adjacent words. While this works well when a large amount of data is available (e.g. when classifying a whole document), it is not effective on shorter pieces of data such as sentences or paragraphs. Further, most of the sequential information and large-scale structure is discarded when data is represented as a set of n-gram features.
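By way of illustration only, the n-gram feature representation described above can be sketched as follows. This is a minimal sketch that assumes whitespace tokenization and lowercasing; the function name `ngram_features` and the parameter `max_n` are hypothetical and are not taken from any particular system.

```python
from collections import Counter


def ngram_features(text, max_n=2):
    """Count word n-grams of length 1..max_n in a text.

    Assumption: tokens are obtained by lowercasing and splitting on whitespace.
    """
    tokens = text.lower().split()
    features = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n adjacent words across the token list.
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features


features = ngram_features("the cat sat on the mat")
```

The resulting counts record which words and adjacent word pairs occur, but not where in the text they occur, which is why large-scale structure is lost in this representation.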
A recent example of this feature-based technique represents each rule as a set of words. A rule is said to pass, and thereby to imply a class, if all of its words are present in a data instance. Rules are learned using one of the many algorithms developed to solve a mathematical problem known as the frequent itemset problem. Each learned frequent itemset is a rule. Rule-based systems are often combined with other types of classifiers.
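By way of illustration only, word-set rules of the kind described above can be sketched as follows. The brute-force enumeration here is a stand-in for the dedicated frequent itemset algorithms mentioned above, not an implementation of any of them; the names `rule_passes` and `frequent_itemsets`, the parameters `min_support` and `max_size`, and the sample texts are all hypothetical.

```python
from itertools import combinations


def rule_passes(rule, text):
    """A rule passes if every word in the rule is present in the text."""
    return set(rule) <= set(text.lower().split())


def frequent_itemsets(texts, min_support, max_size=2):
    """Return word sets that occur in at least min_support texts.

    Assumption: exhaustive enumeration up to max_size words, which is
    only practical for very small vocabularies.
    """
    docs = [set(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*docs))
    result = {}
    for size in range(1, max_size + 1):
        for items in combinations(vocab, size):
            # Support is the number of texts containing all the words.
            support = sum(1 for d in docs if set(items) <= d)
            if support >= min_support:
                result[frozenset(items)] = support
    return result


# Hypothetical training texts assumed to belong to a single class.
texts = ["refund my order", "refund the order please", "track my order"]
rules = frequent_itemsets(texts, min_support=2)
```

Each key of `rules` can then be used as a rule, for example `rule_passes({"refund", "order"}, "Please refund this order")`. Note that the rule ignores word order and any words between the matched words.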
The training and classification methods described above do not model the large-scale structure of the sequence, and this can lead to ambiguities and misclassification. One such ambiguity may occur when the text of a data instance is conceptually similar to another text but is not phrased in exactly the same way. Another ambiguity may occur when only specific words in a text are identified, without considering the transition words between the identified words. Thus, there is a need in the art for an improved system and method for training and classifying data.
More recently developed learning methods treat data as sequences. These include hidden Markov models (HMMs) and conditional random fields (CRFs). Such methods typically focus on extracting information from a sequence rather than classifying the sequence as a whole. Variations on these techniques (e.g. hidden CRFs) exist for classifying sequences. Some common non-sequential classification methods are equivalent to extremely simple HMM or CRF models. A typical application of these methods as classifiers is image classification, although they can also be used to classify other types of data.
Existing methods that are designed for sequential data require a rough model of the sequence structure to be supplied or learned. For text data, it is not obvious how one should build models for the purpose of classification, and it is difficult to learn good models automatically.
It is to be understood that the attached drawings are for purposes of illustrating the concepts of the invention and may not be to scale.