Field
Implementations of the present invention relate to natural language processing. In particular, implementations of the present invention relate to classifying text documents written in one or more languages.
Related Art
Many natural language processing systems involve classifying texts into predefined categories. For example, a text classification method may be applied to sort the large volume of news available online into meaningful categories, e.g., politics, cultural events, sports, etc.
There is a growing need to analyze multi-language data. However, existing text processing systems are usually language-dependent, i.e., they are able to analyze text written in only one particular language.
The very few existing cross-language systems are based on machine translation techniques: they choose a so-called target language, translate all documents into that language, and then construct document representations and apply classification. Machine translation introduces additional errors. Moreover, the analysis is usually based on low-level properties of documents, so the meanings of the documents are not reflected in the representation used.
Thus, there is a need for systems that improve cross-language document classification by taking into account not only the symbolic information but also the semantics, i.e., the meaning, of documents.