The present invention is related to natural language processing. More particularly, the present invention is related to natural language systems and methods for processing words and associated word-variant forms, such as verb-clitic forms, in one of a range of languages, for example Spanish, when they are encountered in a textual input.
Numerous natural language processing applications rely upon a lexicon for operation. Such applications include word breaking (for search engines), grammar checking, spell checking, handwriting recognition, speech recognition, machine translation, and text mining. In some languages, the large number of possible word-variant forms makes it difficult for these natural language applications to accurately analyze the words in their various forms. Word-variant forms (or simply word forms) include compound words, clitics, and inflections. The large number of word forms makes it difficult to include all of them in the lexicon. A lexicon which contains all of the possible variations, inflections, compounds, etc. of a language is referred to as a full form lexicon.
One example of a large number of word-variant forms in a language is verb-clitic attachment in Spanish. In Spanish, clitics (unstressed forms of the personal pronouns) have different behavior depending on whether they appear before or after the verb. Clitics can precede many verbal forms. In this case, they are detached: e.g. lo vi ‘I saw it’; se la entregué ‘I gave it to him/her’. However, when one or more clitics appear after the verb, they are attached to it forming a single word (e.g. verlo ‘to see it’; entregándosela ‘giving it to him/her’). These strings are referred to herein as “verb-clitic” forms. Other languages use hyphens to attach clitics (French: donnez-moi ‘give me’; Portuguese: mandá-las ‘to send them’) or hyphens and apostrophes (Catalan: enviar-te'l ‘to send it to you’). In Spanish, clitics are attached directly to the verbal form.
Clitic attachment may or may not cause changes in the verbal form. Here are examples of the possible cases:
    no change:
        comer -> comerlo ‘to eat -> to eat it’
        decid -> decidme ‘tell -> tell me’
    addition of accent mark:
        toma -> tómalo ‘take -> take it’
        pon -> póntela ‘put -> put it on’
    deletion of final consonant:
        mirad -> miraos ‘look -> look at yourselves’
        freid -> freíos ‘fry -> fry yourselves’
    addition of accent mark and deletion of final consonant:
        presentad -> presentáosles ‘introduce -> introduce yourselves to them’
        digamos -> digámoselo ‘let's say -> let's say it to him/her’
        peinemos -> peinémonos ‘let's comb -> let's comb ourselves’
    deletion of accent mark:
        propón -> proponlo ‘propose -> propose it’
As used herein, the original verbal form to which clitics attach is referred to as the “clitic host,” and the resulting form, after the orthographic/phonological transformations, as the “clitic host variant.”
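The relationship between a clitic host and its clitic host variant can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not a method described in this document: the function name `classify_change` and its labels are hypothetical, the clitics are assumed to have been separated from the host variant already, and only the acute accent on vowels and the loss of one final consonant are considered.

```python
import unicodedata

# Maps acute-accented vowels to their plain counterparts (illustrative helper).
ACUTE_TO_PLAIN = str.maketrans("áéíóú", "aeiou")


def has_acute(text: str) -> bool:
    """Return True if the text contains an acute-accented vowel."""
    return text != text.translate(ACUTE_TO_PLAIN)


def classify_change(host: str, variant: str) -> str:
    """Label the spelling change between a clitic host and its host variant.

    Hypothetical helper: 'variant' is the verbal part of a verb-clitic
    form with the clitics already removed (e.g. 'tóma' from 'tómalo').
    """
    host = unicodedata.normalize("NFC", host.lower())
    variant = unicodedata.normalize("NFC", variant.lower())

    changes = []
    if has_acute(variant) and not has_acute(host):
        changes.append("accent added")
    if has_acute(host) and not has_acute(variant):
        changes.append("accent deleted")

    plain_host = host.translate(ACUTE_TO_PLAIN)
    plain_variant = variant.translate(ACUTE_TO_PLAIN)
    if plain_host != plain_variant:
        # The only letter change handled here is the loss of one final consonant.
        if plain_host[:-1] == plain_variant and plain_host[-1] not in "aeiou":
            changes.append("final consonant deleted")
        else:
            raise ValueError(f"unsupported change: {host!r} -> {variant!r}")

    return ", ".join(changes) or "no change"
```

For instance, under these assumptions the pair toma / tóma (from tómalo) is labeled as an accent addition, and presentad / presentá (from presentáosles) as an accent addition combined with deletion of the final consonant.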
For simple word breaking, if the desired output is only to break the different segments, dealing with clitics would appear to be quite easy: checking whether a word ends in one or more clitics would seem sufficient to determine whether a string is a verb-clitic form. However, this approach would mark as verb-clitic forms words that look like verb-clitic forms but are not. For example, the word ángeles could very well be taken for the Person 2 Singular Imperative Polite form of a nonexistent verb *anger (with the needed accent added in the right position), but it is actually the plural of the noun ángel ‘angel’. To solve this issue, a list of exceptions with all the known cases of false verb-clitic forms could be used. However, this solution would not prevent other words with similar characteristics that are unknown to the system from being treated as verb-clitic forms.
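The weakness of the suffix-only approach can be seen in a minimal Python sketch. The clitic list, the regular expression, and the function name `looks_like_verb_clitic` are assumptions introduced for illustration only; the sketch is deliberately naive and is not a proposed solution.

```python
import re

# Spanish clitic pronouns, listed here as an assumption for the sketch.
CLITICS = ("me", "te", "se", "lo", "la", "le", "los", "las", "les", "nos", "os")

# One or more clitics at the very end of the word (longer alternatives first).
_CLITIC_TAIL = re.compile(
    "(?:%s)+$" % "|".join(sorted(CLITICS, key=len, reverse=True))
)


def looks_like_verb_clitic(word: str) -> bool:
    """Naive test: the word ends in clitics and something precedes them.

    No lexicon is consulted, so nothing checks that the remaining prefix
    is actually a verbal form.
    """
    match = _CLITIC_TAIL.search(word.lower())
    return match is not None and match.start() > 0
```

With this sketch, verlo and entregándosela are accepted as expected, but so is ángeles, because its ending -les is indistinguishable from a clitic when only the final letters are inspected.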
In a word breaker scenario where the clitic host form has to be emitted, the computation needs to be more refined. In the spell checking scenario, it is clear that low-level solutions are not sufficient for the verb-clitic problem. Verbs with clitics are very common in the Spanish language, and users expect spell checkers to be able to recognize them and offer appropriate suggestions when they contain an error.
Some spell checkers are based on full form lexica: they list all the possible inflected forms for a given language. If a string is not found in the list, it is considered an error, and the entries from the lexicon with the smallest edit distance are offered as suggestions. Even though this is a good solution for a language with limited inflectional morphology, such as English, it poses significant problems for a highly inflectional language, such as Spanish.
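The full form approach can be sketched in a few lines of Python. The toy lexicon, the function names, and the choice of Levenshtein distance as the edit distance are assumptions made for illustration; an actual spell checker would use a far larger list and a more efficient search.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Toy full form lexicon (hypothetical): every accepted string is listed explicitly.
TOY_LEXICON = {"comer", "comerlo", "toma", "tómalo", "decid", "decidme"}


def is_known(word: str, lexicon=TOY_LEXICON) -> bool:
    """A word is correct only if it appears verbatim in the list."""
    return word in lexicon


def suggest(word: str, lexicon=TOY_LEXICON) -> list:
    """Return the lexicon entries at the smallest edit distance from the word."""
    distances = {entry: edit_distance(word, entry) for entry in lexicon}
    best = min(distances.values())
    return sorted(entry for entry, d in distances.items() if d == best)
```

In this sketch a verb-clitic form such as comerlo is recognized only because it was listed explicitly, which is why the size of the list becomes the limiting factor for a highly inflectional language.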
As an example, one particular Spanish lexicon was found to contain around 13,400 verbs in their Infinitive base form, with about 55 inflected forms for each verb. Therefore, without counting verb-clitic forms, a full form lexicon already needs to include around 737,000 inflected verbal forms. Recognizing 240 verb-clitic forms per verb would add over 3.2 million entries, bringing the number of verbal forms alone to almost 4 million, which is too high to be handled effectively by these applications.
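The figures above follow from simple multiplication, as the short Python check below shows. The variable names are illustrative; the counts are the approximate ones quoted in the preceding paragraph.

```python
# Approximate counts quoted in the text.
verbs = 13_400
inflected_forms_per_verb = 55
verb_clitic_forms_per_verb = 240

inflected_total = verbs * inflected_forms_per_verb       # 737,000
verb_clitic_total = verbs * verb_clitic_forms_per_verb   # 3,216,000
all_verbal_forms = inflected_total + verb_clitic_total   # 3,953,000

print(f"inflected forms:   {inflected_total:,}")
print(f"verb-clitic forms: {verb_clitic_total:,}")
print(f"all verbal forms:  {all_verbal_forms:,}")
```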
A variation of this methodology is to harvest the verb-clitic forms, i.e. to extract all the occurrences of verb-clitic forms from corpora and add only those to the list (instead of productively generating all the possible forms). This option has the shortcoming of providing little coverage for spell checking. Because clitics are very frequent in Spanish, users would not understand why some verb-clitic forms are recognized while others are not.
The opposite solution to the use of full form lexica is to have a lexicon with only the lemmas. The lemmas can then be annotated with the morphological information needed to recognize all their inflected forms (including all the irregularities and orthographical changes for verbs, nouns, and adjectives). Note that clitics do not attach only to the Infinitive or base forms of the verbs, but also to inflected forms, which can present many irregularities. For instance, the 1st Person Imperative Plural form of subir ‘to raise, to get on’ is subamos, producing the verb-clitic form subámonos, whereas the corresponding form of decir ‘to say’ is digamos, producing the verb-clitic form digámonos. This approach carries a significant runtime performance cost, especially for spell checking, where the goal is not only to recognize correct verb-clitic forms, but also to offer good suggestions for misspelled ones.
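A lemma-only lexicon of this kind might be sketched as follows in Python. Everything here is a hypothetical illustration: the `LEMMAS` table, the `subjunctive_stem` annotation, and both function names are invented for the sketch, which covers only the 1st Person Imperative Plural form and the clitic nos.

```python
# Hypothetical lemma lexicon: only irregular information is annotated.
LEMMAS = {
    "subir": {},
    "peinar": {},
    "decir": {"subjunctive_stem": "dig"},  # irregular stem, as in digamos
}

ACCENTED = {"a": "á", "e": "é"}


def first_plural_imperative(lemma: str) -> str:
    """Build the 1st Person Imperative Plural form from the lemma entry."""
    info = LEMMAS[lemma]
    stem = info.get("subjunctive_stem", lemma[:-2])
    ending = "emos" if lemma.endswith("ar") else "amos"
    return stem + ending


def attach_nos(form: str) -> str:
    """Attach the clitic 'nos' to a form ending in -amos or -emos.

    The final 's' of the host is dropped and an accent mark is added,
    as in subamos -> subámonos.
    """
    if not form.endswith(("amos", "emos")):
        raise ValueError(f"unsupported host form: {form!r}")
    return form[:-4] + ACCENTED[form[-4]] + "mo" + "nos"
```

Even in this tiny sketch, every recognized or suggested form has to be computed from the lemma and its annotations rather than looked up directly, which gives a sense of where the runtime cost comes from.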