The present invention relates to a system for analyzing data in order to classify content of unknown data or to recreate missing content of data. More particularly, it relates to analysis of data which can be represented as matrices of multiple factors.
Many learning problems require recognition, classification or synthesis of data generated through the interaction of multiple independent factors. For example, an optical character recognition system may be used to recognize characters in an unfamiliar font. Speech analysis requires recognition of words by different speakers having different tonal characteristics. An adaptive controller may need to produce state-space trajectories with new payload conditions. Each of these types of problems can be decomposed into two factors, each having multiple elements, which interact to create the actual data. For ease of discussion, the two factors will be called “content” and “style”. For example, in typography used for optical character recognition, each character includes as content a letter (A, B, C, etc.) and as style a font (Times, Helvetica, Courier, etc.). In both printing and handwriting, people can generally recognize letters independent of the font or handwriting. However, optical character recognition systems generally are based upon template comparisons. Thus, they do not operate well with unknown fonts, and are extremely poor with the variations in handwriting. Such systems therefore cannot classify the elements of one factor (letter) independent of the other factor (font or handwriting).
Similarly, in speech analysis, the sounds of spoken words (content) are greatly affected by the speaker (style). Thus, systems which analyze the sounds to determine patterns have difficulty with new speakers. This is also true for people, particularly when the speaker has a strong accent. However, after exposure to someone with a strong accent for a period of time, a listener can much more easily determine the words being spoken. The listener has learned to distinguish the content from the style in the speech. On the other hand, speech recognition systems must have specific training for each speaker. They do not generally recognize new speakers or accents, and cannot learn these over time.
Therefore, a need exists for a system which easily separates the content and style of data in order to recognize the content with new styles. A need exists for a system which can also create new content in a known style.
Theoretical work has been performed by others on modeling of data which is formed from a mixture of factors through Cooperative Vector Quantization (CVQ). G. E. Hinton and R. Zemel disclose theories relating to factorial mixtures in “Autoencoders, Minimum Description Length and Helmholtz Free Energy,” NIPS 6, (1994). Z. Ghahramani discloses a system which applies mixture models to data analysis in “Factorial Learning and the EM Algorithm,” NIPS 7, 657-674 (1995). In CVQ, as used in these systems, each element of each factor is assigned a code vector. Each data point is modeled as a linear combination of one code vector from each factor. Thus, the factors interact only additively. The linear nature of the models suggested by these researchers severely limits the modeling capability of their theories. Often, factors do not interact only additively. For example, in typography, the letter and font are not additively combined to form each character. Instead, the font can significantly modify certain characteristics of each letter.
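The limitation of a purely additive model can be written out in a short worked form. The notation here is assumed for illustration and does not come from the cited papers: u^{(s)} is a code vector for style s, v^{(c)} is a code vector for content c, and y^{(s,c)} is the modeled data point.

```latex
% Additive (CVQ-style) model, illustrative notation:
y^{(s,c)} \approx u^{(s)} + v^{(c)}

% Consequence: the difference between two contents is the same in every style.
y^{(s,c)} - y^{(s,c')} \approx v^{(c)} - v^{(c')} \quad \text{for all } s
```

Under this sketch, the difference between two letters would be identical in every font, which is the sense in which the factors cannot modify one another.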
Therefore, a need exists for a system which models complex interactions between factors and yet which allows for simple processing of the model in analyzing data.
The deficiencies of existing systems and of theoretical approaches previously made on multiple factor problems are substantially overcome by the present invention which provides a computer-based system for analyzing multiple factor data.
According to one aspect of the invention, data is modeled as a product of two linear forms corresponding to parameters of each factor. The data need not result from physical processes that actually have the bilinear form used to model the data. However, by increasing the dimensionality of the bilinear forms sufficiently, the model can represent known training data to an arbitrary accuracy.
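The bilinear form described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the disclosed implementation: the names `bilinear`, `W`, `a` and `b`, the array shapes, and the random toy data are all assumptions made for the example. The second half illustrates the dimensionality remark in the simplest possible case, where the parameter vectors are as long as the number of styles and contents.

```python
import numpy as np

def bilinear(W, a, b):
    """Return y with y[k] = sum_i sum_j W[i, j, k] * a[i] * b[j].

    W : (I, J, K) combination tensor (hypothetical name)
    a : (I,) style parameter vector
    b : (J,) content parameter vector
    """
    return np.einsum("ijk,i,j->k", W, a, b)

# Toy training data: S styles, C contents, K-dimensional observations.
rng = np.random.default_rng(0)
S, C, K = 3, 4, 5
Y = rng.normal(size=(S, C, K))

# With I = S and J = C, one-hot parameter vectors and W = Y
# reproduce every training observation exactly.
A = np.eye(S)  # row s is the style vector for style s
B = np.eye(C)  # row c is the content vector for content c
recon = np.array([[bilinear(Y, A[s], B[c]) for c in range(C)]
                  for s in range(S)])
max_err = np.abs(recon - Y).max()
```

The output is linear in `a` when `b` is held fixed, and linear in `b` when `a` is held fixed, which is what makes the model simple to process despite the multiplicative interaction. Lower-dimensional vectors trade exactness for a more compact model.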
According to another aspect of the invention, the system determines the model parameters from training data having multiple factor interaction. Typically, the training data is a complete matrix of observations as to each content and style type. However, for some analyses, the training data may be fully labeled as to content and style, unlabeled, or partially labeled. Furthermore, the system can reasonably determine parameters from training data having unknown observations within the matrix. To determine the parameters, the system creates a model based upon parameter vectors for each of the factors and a combination matrix for combining the parameter vectors for each factor. The values of the elements in the parameter vectors and the combination matrix are iteratively determined based upon the training data, using Expectation-Maximization (EM) techniques.
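The iterative determination of the parameter vectors and the combination matrix can be illustrated for the simplest case, a complete and fully labeled training matrix. The sketch below deliberately swaps in alternating least squares for the EM procedure named above, since with full labels and no missing entries each update reduces to an ordinary least-squares problem. The function name `fit_bilinear_als`, the array layout `Y[s, c, k]`, and the dimensions `I` and `J` are assumptions for the example.

```python
import numpy as np

def fit_bilinear_als(Y, I, J, n_iter=50, seed=0):
    """Fit Y[s, c, k] ~ sum_ij A[s, i] * B[c, j] * W[i, j, k].

    Alternating least squares (a stand-in for EM): each step solves
    for one block of parameters with the other two held fixed.
    Returns A, B, W and the squared error after each sweep.
    """
    S, C, K = Y.shape
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(S, I))     # style parameter vectors
    B = rng.normal(size=(C, J))     # content parameter vectors
    W = rng.normal(size=(I, J, K))  # combination tensor
    history = []
    for _ in range(n_iter):
        # Update style vectors: rows of M are indexed by (c, k).
        M = np.einsum("ijk,cj->cki", W, B).reshape(C * K, I)
        A = np.linalg.lstsq(M, Y.reshape(S, C * K).T, rcond=None)[0].T
        # Update content vectors: rows of N are indexed by (s, k).
        N = np.einsum("ijk,si->skj", W, A).reshape(S * K, J)
        Yc = Y.transpose(1, 0, 2).reshape(C, S * K)
        B = np.linalg.lstsq(N, Yc.T, rcond=None)[0].T
        # Update the combination tensor: rows of D are indexed by (s, c).
        D = np.einsum("si,cj->scij", A, B).reshape(S * C, I * J)
        W = np.linalg.lstsq(D, Y.reshape(S * C, K), rcond=None)[0]
        W = W.reshape(I, J, K)
        resid = Y - np.einsum("si,cj,ijk->sck", A, B, W)
        history.append(float(np.sum(resid ** 2)))
    return A, B, W, history
```

Because each step minimizes the same squared error over one block of parameters, the error cannot increase from one sweep to the next. Handling unlabeled, partially labeled, or incomplete training data would require the probabilistic treatment that EM provides, which this sketch does not attempt.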
According to another aspect of the invention, once parameter vectors are obtained, the system can be used to analyze unknown data. The analysis can be used to categorize content of data in an unknown style. In this manner, the system can be used to recognize letters in new styles in connection with optical character recognition, or words with new speakers. The analysis can also be used to create new content in a known style. Thus, the system can complete missing data, such as generating missing characters for a given font.
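Both uses of a trained model can be sketched under one simplifying assumption that is not taken from the text: a few observations in the new style are available with known content labels. The new style vector is then a least-squares fit, the missing contents are synthesized from the bilinear form, and an unlabeled observation is classified by its nearest synthesized template. The function names and array shapes are hypothetical.

```python
import numpy as np

def estimate_style(W, B, observed, labels):
    """Least-squares style vector for a new style.

    W        : (I, J, K) learned combination tensor
    B        : (C, J) learned content vectors
    observed : (n, K) observations in the new style
    labels   : length-n list of content indices for those observations
    """
    # Stack one (K, I) block per labeled observation.
    M = np.einsum("ijk,cj->cki", W, B[labels]).reshape(-1, W.shape[0])
    a, *_ = np.linalg.lstsq(M, observed.reshape(-1), rcond=None)
    return a

def synthesize_all(W, a, B):
    """Render every known content in the style described by a."""
    return np.einsum("ijk,i,cj->ck", W, a, B)

def classify(y, templates):
    """Index of the synthesized template nearest to observation y."""
    return int(np.argmin(np.linalg.norm(templates - y, axis=1)))
```

For a font example, `observed` would hold a handful of labeled characters in the new font, `synthesize_all` would generate the remaining characters, and `classify` would assign a letter to a further sample in that font. When no labels are available for the new style, the style vector and the content assignments would have to be estimated jointly, which this sketch does not cover.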