1. Field
Implementations of the present invention relate to natural language processing. In particular, implementations relate to classifying, clustering and filtering text-centric documents written in one or more languages.
2. Description of the Related Art
A modern person has to deal every day with a huge volume of new information. Likewise, corporations, agencies and libraries must receive and process large amounts of text and text resources. Information is presented in many forms, including texts, resources and references, print (e.g., newspapers and magazines), Internet sources (e.g., videos, audio programs), etc. Selecting, cataloguing and filtering information is an important task in managing information overload. Sometimes texts must be selected based on a single feature or on a plurality of features from a tightly defined set. Other times there is a need to find texts that are similar to a given text. Yet other times, there is a need to form groups or classes of texts according to a set of criteria. Text-based information that a person or organization must use may originate from many countries and may be written in different languages. Known mathematical methods of classifying and clustering objects that have been adopted for solving these tasks are insufficient to adequately cope with information overload.
Many natural language processing systems involve classifying texts into predefined categories. For example, in order to sort the huge amount of news available online into some meaningful categories, e.g., politics, cultural events, sporting events, etc., a text classification method may be applied. Other tasks related to text processing include clustering and filtering.
Nowadays, there is a great desire to be able to analyze multi-language data. However, existing text processing systems are usually language-dependent, i.e., they are able to analyze only text written in one particular language and cannot readily be ported to another language.
The very few existing cross-language systems are based on machine translation techniques. These systems generally choose a so-called target language, translate all documents into that language using machine translation, and then construct document representations and apply classification. Such machine translation introduces additional errors not found in the source material. Moreover, the analysis is usually based on low-level properties of documents, and the meanings of documents are not reflected in the utilized representation or translation.
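The translate-then-classify pipeline described above can be illustrated with a minimal sketch. It is not taken from any particular existing system. The word-for-word lexicon `TOY_LEXICON`, the function names, and the count-overlap classifier are all hypothetical, chosen only to show the three steps: pick a target language (English here), translate every document into it, then build a low-level bag-of-words representation and classify.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for a real machine translation system.
TOY_LEXICON = {
    "de": {"wahl": "election", "regierung": "government",
           "spiel": "match", "tor": "goal"},
    "fr": {"élection": "election", "gouvernement": "government",
           "match": "match", "but": "goal"},
}


def translate_to_target(text, source_lang):
    """Word-for-word 'translation' into English (the assumed target language).

    Words missing from the lexicon pass through unchanged, which mimics
    how translation gaps propagate into the later steps.
    """
    words = text.lower().split()
    if source_lang == "en":
        return " ".join(words)
    lexicon = TOY_LEXICON[source_lang]
    return " ".join(lexicon.get(word, word) for word in words)


def bag_of_words(text):
    """Low-level document representation: raw word counts."""
    return Counter(text.split())


def train(labeled_docs):
    """Build one summed word-count profile per category."""
    profiles = {}
    for text, lang, label in labeled_docs:
        counts = bag_of_words(translate_to_target(text, lang))
        profiles.setdefault(label, Counter()).update(counts)
    return profiles


def classify(text, lang, profiles):
    """Return the category whose profile shares the most word counts."""
    counts = bag_of_words(translate_to_target(text, lang))

    def overlap(label):
        return sum(min(n, profiles[label][w]) for w, n in counts.items())

    # Sorting the labels makes tie-breaking deterministic.
    return max(sorted(profiles), key=overlap)


training = [
    ("the government called an election", "en", "politics"),
    ("the match ended with a late goal", "en", "sports"),
]
profiles = train(training)
print(classify("die regierung plant eine wahl", "de", profiles))
print(classify("un but dans le match", "fr", profiles))
```

In this sketch, the classifier only sees surface word counts of the translated text. Any word the toy lexicon mistranslates or fails to cover simply contributes nothing or the wrong thing, which is the kind of limitation the paragraph above attributes to translation-based approaches.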
Thus, it is possible to create systems that improve cross-language document processing, including classification, clustering and filtering. Such systems can take into account not only the symbolic information found in sources but also the semantics, i.e., the meaning, of documents.