Due to the increased availability of documents in digital form, text classification is now being applied in a variety of contexts, ranging from document indexing based on a controlled vocabulary, to document filtering, automated metadata generation, word sense disambiguation, and population of hierarchical catalogs of web resources. Text classification is also widely used in modern information retrieval systems, where it performs tasks such as spam detection and adult content filtering. Text classification is also often an important feature in a search engine's relevance ranking operations.
One key difficulty with previous text classification algorithms lies in the fact that they require a large, often prohibitive, number of labeled training examples in order to learn to classify text accurately. Labeling must often be done by a human being; this is one of the most costly and time-consuming processes involved in the development of text classifiers.
Many information systems involve a document corpus that contains documents written in a variety of different languages. Some documents may be written in one language, while other documents may be written in other languages. Internet search engines, which discover, index, and retrieve web pages from all over the world, are likely to access such a multi-lingual document corpus. Text classification becomes even more daunting when used in such multi-lingual environments. In such environments, when “bag of words” classification models are used, a separate classifier is typically used for each language, because the features of such a model are the words themselves and vocabularies differ from language to language. To develop these classifiers, a separate set of training data is prepared in each language and used to train a language-specific classification model for the corresponding classifier. When a separate classification model needs to be trained for each language, the cost and time involved in labeling the training data become many times greater than the cost and time involved in labeling a single set of training data in just one language. For many languages, training data in that language is very hard to acquire, or entirely unavailable.
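The per-language arrangement described above can be illustrated with a minimal sketch. It is not taken from any particular system: the `NaiveBayesBoW` class, the `classify` helper, the whitespace tokenizer, and the toy English and Spanish training sets are all hypothetical, and multinomial naive Bayes is used only as one common example of a “bag of words” model. The sketch shows one classifier being trained per language, each from its own labeled set.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lowercased, whitespace-tokenized word counts (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


class NaiveBayesBoW:
    """Hypothetical multinomial naive Bayes over word counts, with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(bag_of_words(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word, n in bag_of_words(text).items():
                if word in self.vocab:  # words never seen in training are ignored
                    score += n * math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Assumed toy data: a separate labeled training set must exist for every language.
TRAINING_DATA = {
    "en": (["win free money now", "meeting agenda for monday"], ["spam", "ham"]),
    "es": (["gana dinero gratis ahora", "agenda de la reunion del lunes"], ["spam", "ham"]),
}

# One independently trained classifier per language.
classifiers = {
    lang: NaiveBayesBoW().fit(texts, labels)
    for lang, (texts, labels) in TRAINING_DATA.items()
}


def classify(text, lang):
    """Route a document to the classifier trained for its language."""
    return classifiers[lang].predict(text)
```

In this sketch, the Spanish text "gana dinero gratis" shares no vocabulary with the English model, so the English classifier has no features to score it with. Under these assumptions, supporting an additional language means labeling and training on an additional data set.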
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.