Text categorization addresses the general purpose of organizing electronic information by filing documents according to an existing structure (taxonomy) and by filtering relevant from irrelevant information, so that documents can be searched for simply by browsing the categories.
In many contexts, people are confronted with documents available in more than one language. This is a typical situation in many multilingual regions of the world, including many regions of Europe, and it applies, for example, to most legal and regulatory documents in Canada. However, document categorization models are mostly developed in a monolingual context, typically for a resource-rich language such as English.
Currently, when categorization is needed for documents which are available in two (or more) languages and share the same set of categories, the available techniques train monolingual categorizers on each part of the corpus independently. This approach ignores the potentially richer information available from the other language, and produces widely different results on the different parts of the corpus. Furthermore, this approach is impractical when the number of available documents in the different languages is uneven.
In multiview learning for text categorization, there are two important classes of known techniques: the multiple kernel learning approach, and techniques relying on (kernel) Canonical Correlation Analysis (CCA). Multiple kernel learning typically assumes that all views of an example are available during training and testing, as is the case, for example, when the same object is viewed from different angles in object recognition. CCA identifies matching, maximally correlated subspaces in the two views; these subspaces may be used to project data before learning is performed, or the projection may be integrated with learning. Other multiview learning techniques also exist, but concerns arise due to their computational complexity and their scalability to large document collections.
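For reference, the standard linear CCA objective mentioned above can be sketched as follows. The notation is an assumption introduced here for illustration only and is not taken from any particular prior art technique: C_aa and C_bb denote the within-view covariance matrices of the document feature vectors in the two views (for example, two language versions of the same documents), C_ab denotes the cross-view covariance matrix, and w_a and w_b denote the first pair of projection directions.

```latex
% Assumed notation (illustrative, not from the source):
%   C_{aa}, C_{bb} : within-view covariance matrices of views a and b
%   C_{ab}         : cross-view covariance matrix
%   w_a, w_b       : projection directions for views a and b
\[
(w_a^{*},\, w_b^{*}) \;=\; \arg\max_{w_a,\, w_b}\;
\frac{w_a^{\top} C_{ab}\, w_b}
     {\sqrt{w_a^{\top} C_{aa}\, w_a}\;\sqrt{w_b^{\top} C_{bb}\, w_b}}
\]
```

Further pairs of directions are typically obtained by repeating the maximization under the constraint that the new projections be uncorrelated with the earlier ones. Kernel CCA replaces the linear projections with functions in a reproducing kernel Hilbert space.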
None of the prior art techniques makes use of the classification information available from one language to improve the classification of another language.
There is thus a need for a system and method able to leverage the multilingual data provided in the different languages of the corpus in order to produce a text classification with a higher accuracy than can be obtained from the independent monolingual categorizers typical of prior art methods, in which different language versions of the same document are categorized separately and independently.