The exemplary embodiment relates to natural language processing of text. It finds particular application in connection with processing of mixed language text and will be described with particular reference thereto.
It is quite common for a text document, written in a given language, to include some phrases, sentences, or paragraphs which are written in another language. This is particularly the case in informal communication media, such as blogs, social networks and the like, but can occur in a wide range of document types. Mixed language text, as used herein, is text which follows the syntax and grammar of a first (main) language but includes, within it, one or more sequences of words in one or more secondary languages. As examples of mixed language text, consider the following, where the secondary language text is shown in bold for ease of illustration:
1. A blog comment mixing French and some English, extracted from “Overblog”, a French site dedicated to blogs and discussion forums: Bienvenue à tous dans les Charts du Vendredi, avec le classement made in Japan des meilleures ventes de jeux et de consoles sur le sol nippon pour la période du 15 au 21 février derniers . . . [ ] La PSP n'est qu'un brin au dessus de sa grande sœur aussi, tandis que la DS tient toujours tout le monde a bonne distance, of course . . . .
2. In a scientific article mixing Spanish, English and Quechua: Maldesarrollo: entre el “American way of life” y el “sumak kawsay”.
3. In the reference section of an English scientific article, a French reference: [1] K. R. Beesley and L. Karttunen. Finite State Morphology. CSLI Studies in Computational Linguistics, 2003. [2] G. G. Bes. La phrase verbale noyau en français. Recherches sur le français parlé, 15:273-358, 1999.
As can be seen from these examples, in some cases, the secondary language sequences are delimited, e.g., by structural delimiters, such as the quotes in Example 2, whereas in other cases, such as Examples 1 and 3, there is no indication that these are not ordinary main language words. A reader fluent in the main language is usually capable of recognizing that these are probably words of a different language, and of understanding their use in the sentence, even if the reader is unable to translate them exactly. However, computer-implemented systems for processing text, e.g., for opinion mining, machine translation, information extraction, grammar and spelling checking, and the like, are unable to process such sequences effectively, for example, to assign parts of speech to them or to perform syntactic analysis of the sentence containing them.