1.1. Field of the Invention
The present invention relates to the field of computer-aided text and speech processing, and in particular to a method and respective system for converting an input text given in an incomplete language, into speech, wherein a computer-aided grapheme-phoneme conversion is used.
1.2. Description and Disadvantages of Prior Art
The term “incomplete” language used herein shall be understood to be a language which does not necessarily contain a complete syntactic description of phrases in its textual representation. Good examples are the “natural” Semitic languages (such as Arabic and Hebrew), in which written text often lacks vowels. Other examples are “artificial” languages, which may be used to abbreviate complete text.
The present invention will thus be delineated against the prior art by way of the Arabic language, as it can be applied very advantageously to the processing of Arabic, a member of the family of Semitic languages that only occasionally makes use of vowels when written.
1.2.1 Introduction to Arabic Language
Arabic is one of the Semitic languages and is an important language to religion and literature. Arabic is spoken by almost 200 million people in more than twenty-two countries.
Arabic Text
The most striking difference between Arabic and other languages is that Arabic text is usually presented without vowels and other diacritical marks. Vowels, when used, are presented by diacritical marks, placed above or below the character. The process of adding vowels and other diacritical marks to Arabic text can be called Diacritization or, for simplicity, Vowelization. Vowelization defines the sense of each word, and how it will be pronounced. However, the use of vowels and other diacritics has lapsed in modern Arabic writing. It is, therefore, the norm for an entire formal Arabic newspaper to have only a dozen or so thoughtfully-used vowels and diacritics placed only where there is not enough context to know what is intended.
These zero-width optional elements are used occasionally to disambiguate homographs when there is insufficient context for the reader to do so. A good reader anticipates these potential ambiguities and inserts short vowels and diacritics as needed, such as to disambiguate the Arabic for “Amman” and “Oman”, or to indicate the passive voice. Occasionally one hears professional news announcers pause and backtrack to re-read a passage with a different “vocalization” of a word.
Vocalization
In any vocalized language, vowels play an important role since they are the most prominent and central sound of a syllable. The vowels help us to join consonants to achieve a full sound. In English, a, e, i, o and u (also y) are the vowels, which are clearly spelled out in a text, whereas in Arabic they are not.
Arabic Vowels and other Diacritics
Arabic has three short vowels:
1. The Fathah sign represents the “a” sound and is an oblique dash over a consonant.
2. The Kasra sign represents the “i” sound and is an oblique dash under a consonant.
3. The Damma sign represents the “u” sound and is a loop over a consonant that resembles the shape of a comma.
In addition there are three kinds of diacritics:
1. “Sukun”, written as a small circle above the Arabic consonant, is used to indicate that the letter is not vowelized. In this patent application we will refer to sukun as “0”.
2. “Shadda” is a gemination mark that is placed above the Arabic letters and results in a repetition of the letter at the phonemic level. We will be referring to it here as “˜”.
3. “Nunation” is expressed by one of three different diacritics (Fathatan, Dammatan, Kasratan). These are placed above the last letter of the word and have the phonetic effect of placing an “N” at the end of the word.
In the remainder of this patent application we will distinguish between vowel signs and other diacritical marks only if it is required for purposes of illustration or in cases of exception. In general, we will refer to both groups of marks as vowels, and will refer to written text that makes use of any vowel signs and/or diacritical marks as vowelized text. In contrast, all other written text is referred to as un-vowelized text or simply as text, which is often used as input to the inventive method.
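The distinction between vowelized and un-vowelized text can be illustrated with a minimal Python sketch. It assumes that the text is encoded in Unicode, where the eight marks discussed above occupy the code points U+064B to U+0652. The names `ARABIC_MARKS`, `strip_vowels` and `is_vowelized` are hypothetical and serve illustration only; they are not part of any system described herein.

```python
# Illustrative sketch only, assuming Unicode-encoded input.
# The set below lists the Unicode code points of the three short vowels,
# sukun, shadda and the three nunation signs.
ARABIC_MARKS = {
    "\u064b",  # fathatan (nunation)
    "\u064c",  # dammatan (nunation)
    "\u064d",  # kasratan (nunation)
    "\u064e",  # fathah, the "a" sound
    "\u064f",  # damma, the "u" sound
    "\u0650",  # kasra, the "i" sound
    "\u0651",  # shadda (gemination)
    "\u0652",  # sukun (letter carries no vowel)
}


def strip_vowels(text: str) -> str:
    """Return the un-vowelized form of `text` by dropping all marks."""
    return "".join(ch for ch in text if ch not in ARABIC_MARKS)


def is_vowelized(text: str) -> bool:
    """Return True if `text` contains at least one vowel or diacritic."""
    return any(ch in ARABIC_MARKS for ch in text)
```

Applied to the fully vowelized three-letter word kaf, ta, ba with a fathah on each letter, `strip_vowels` returns the bare consonant string. This is the easy direction; the inverse mapping, from bare consonants back to vowelized text, is ambiguous.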
Problems in Automatic Speech and Natural Language Processing
As mentioned above, almost all written Arabic text—for example all newspaper text—is un-vowelized, which may lead to ambiguity in meaning and different possibilities of pronunciation. A normal Arabic speaker can put vowels “on the fly” while reading to get the intended meaning; readers usually apply their linguistic and semantic knowledge to resolve ambiguities.
While humans perform quite well on this task, the omission of vowels in written text leads to some serious problems for automatic speech and language processing systems. For example, the language model component of an automatic speech recognition system requires vowelized text in order to resolve ambiguities and to achieve very high transcription accuracy.
Even more obvious is the fact that on-line vowelization of written text is indispensable for a text-to-speech (TTS) system, in order to correctly pronounce the input text.
For the construction of such speech technology components, current state-of-the-art speech recognition applications usually use manually vowelized text, which is tedious and error prone to create and results in less reliable components.
State-of-the-art TTS systems, as shown in simplified form in FIG. 1, may be used for a large variety of applications 10A, 10B, etc. Examples are telephony applications used for call-centres and elsewhere. A text written in a Semitic language is input into a vowelization tool 13, which is used and developed by a plurality of highly skilled developers 14, who apply an ensemble of morpho-syntactical rules, usually including a determination of etymological roots of text components, the determination of casus, numerus, and genus, and part-of-speech tagging, i.e. the identification of word types (nomen, verbum, adjective, etc.). Usually, a semantic analysis is also performed in order to determine the meaning of a word in its context. By applying this ensemble of rules, the missing vowels and other diacritical marks are inserted into the original input text.
Those rules are depicted by reference sign 15, and an exemplary plurality of exceptions is depicted with reference sign 16 in order to illustrate the empiric character of this rule collection. Rules and large exception dictionaries are often stored electronically as part of the front-end component 18 of a text-to-speech (TTS) system. As depicted in FIG. 3, which is a schematic block diagram representation illustrating some more details of the prior art TTS conversion, the TTS-front-end also generates a phonetic description (also known as “baseform”) and a prosodic description (aka intonation contour) of the input text.
The TTS back-end component 19 generates synthetic speech signals 11 from the above-mentioned phonetic and prosodic description for outputting via a speaker system.
The above-mentioned TTS engine, including its front-end 18 and back-end 19, is usually implemented in prior art as a component of a voice server, which is schematically illustrated in FIG. 2. As FIG. 2 illustrates, such a prior art voice server further comprises a speech recognition engine 22, as most of the applications operate fully bidirectionally, i.e. they convert speech to text and vice versa. Further, the voice server 22 comprises a voice browser 20 for rendering the acoustic or textual information, connected to a telephony stack 26, which handles the incoming calls. A JAVA speech application programming interface (JSAPI) component 25 receives audio signals, words and grammars corresponding to a respective telephone call structure from the voice browser component. The JSAPI component co-operates with the speech recognition engine and the text-to-speech engine 22 as it is known from prior art. The present invention basically improves the TTS engine 18, 19 and improves the training on which the speech recognition engine 22 is based.
Further, according to FIG. 2, the prior art environment comprises an interface to a co-operating web application server 24 which may be implemented on a PC system, either desktop or portable, possibly together with the before-mentioned applications or the respective application components and the voice server itself, or which may run in a client-server environment.
As mentioned above, the use of un-vowelized text is common in written Arabic. An average Arabic speaker or reader will add vowels on the fly while reading to get the intended meaning. In contrast, the use of un-vowelized text by a computer program that performs any kind of natural language processing is almost impossible, because such text is highly ambiguous: without a mechanism for vowelization, the system would simply behave unpredictably.
For an illustration of the problem, consider the following example that, for the purpose of explanation, is given in English: Imagine that English in its written form uses only consonants, but no vowels. In this case, the two words “write” and “wrote” will both be written as “wrt”. When “wrt” appears in a sentence, a reader will have at least two choices:
(1) add “i” and “e” and pronounce it as “write”
(2) add “o” and “e” and pronounce it as “wrote”
A prior art morphological analyzer can only propose these two solutions, and more information is needed for disambiguation. For example, the consideration of the syntactic sentence structure can be used to obtain the correct vowelizations (“I will write a letter.” vs. “Yesterday I wrote a letter.”).
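The “wrt” example can be sketched in a few lines of Python. The sketch below is not the prior art analyzer itself; it only assumes a small, hypothetical word list (`TOY_LEXICON`) and shows why a lookup on the consonant skeleton alone can propose candidates but cannot choose between them.

```python
from collections import defaultdict

# Assumption for illustration: English written without a, e, i, o, u.
VOWELS = set("aeiou")

# Hypothetical toy lexicon; a real analyzer would need full coverage.
TOY_LEXICON = ["write", "wrote", "letter", "yesterday"]


def skeleton(word: str) -> str:
    """Drop all vowels, keeping only the consonant skeleton."""
    return "".join(ch for ch in word.lower() if ch not in VOWELS)


def build_index(lexicon):
    """Map each consonant skeleton to every word that shares it."""
    index = defaultdict(list)
    for word in lexicon:
        index[skeleton(word)].append(word)
    return index


INDEX = build_index(TOY_LEXICON)


def candidates(unvowelized: str):
    """Return all known vowelizations of an un-vowelized token."""
    return list(INDEX.get(unvowelized, []))
```

Here `candidates("wrt")` yields both “write” and “wrote”. Selecting one of them requires information from outside the token, such as the tense implied by “will” or “Yesterday” in the two example sentences.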
While this simple example illustrates the ambiguity problem of non-vowelized text, Arabic has the additional problem of non-diacritized text, where the same vowelization can lead to different meanings depending on additional diacritical marks. For example, in Arabic a triliteral word “K T B” could have any of the following semantics:
(1) Using the vowel pattern “a a a” will result in “KaTaBa”, which has the meaning “He wrote”.
(2) An additional gemination mark (shadda) will result in the pattern “a ˜a a” and the vowelized text is “KaT˜aBa”, which has the meaning “He has forced someone to write”.
(3) The “o e a” vowel pattern results in “KoTeBa” (“It has been written”).
(4) Similar to (2), the use of a gemination mark will result in the pattern “o ˜e a”, and the vowelized text is “KoT˜eBa” (“He has been forced to write”).
(5) Finally, the vowel/diacritics pattern “o o 0” transforms the consonant cluster into “KoToB0” (“the books”).
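The five readings of “K T B” listed above can be reproduced mechanically by interleaving the consonants with a pattern, as the following Python sketch shows. It uses the transliteration of this application, with an ASCII tilde “~” standing in for the gemination marker and “0” for sukun. The names `apply_pattern`, `PATTERNS` and `readings` are hypothetical and chosen for illustration only.

```python
# Illustrative sketch only. Each pattern holds one slot per consonant;
# a slot is the mark (or marks) written after that consonant.
# "~" stands in for shadda (gemination) and "0" for sukun.
PATTERNS = {
    "a a a": "He wrote",
    "a ~a a": "He has forced someone to write",
    "o e a": "It has been written",
    "o ~e a": "He has been forced to write",
    "o o 0": "the books",
}


def apply_pattern(root: str, pattern: str) -> str:
    """Interleave the consonants of `root` with the slots of `pattern`."""
    consonants = root.split()
    slots = pattern.split()
    if len(consonants) != len(slots):
        raise ValueError("pattern must have one slot per consonant")
    return "".join(c + s for c, s in zip(consonants, slots))


def readings(root: str):
    """Return every vowelized form of `root` with its gloss."""
    return {apply_pattern(root, p): gloss for p, gloss in PATTERNS.items()}
```

Calling `readings("K T B")` returns all five vowelized forms with their glosses. Generating the candidates is the simple part; nothing in the consonant string itself indicates which of the five is intended.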
In prior art, a morphological analyzer analyzes the word “K T B” and will offer the 5 different vowelization patterns and solutions above; ambiguity in this case must be resolved from a combination of syntax and semantics according to the pre-established collection of rules 15, mentioned above with reference to FIG. 1. These rules 15, including the empirically found exceptions 16 are, however, disadvantageously difficult to maintain up-to-date, or to extend to a new application due to the productive nature of any “living” spoken language.
Further disadvantages of prior art are due to the fact that the development of morphological and syntactical analyzers requires a lexicon database. The lexicon should cover the entire language, and therefore its collection is not a trivial task and requires the expertise of computational linguists. In previous efforts by the IBM Cairo Scientific Center (1987), morphological, syntactical, and some semantical features for about 5700 Arabic roots have been collected.
Morphology is important because a “living” language is “productive”: In any given text one will encounter words and word forms that have not been seen before and that are not in any precompiled dictionary. Many of these new words are morphologically related to known words. It is important to be able to handle morphology in almost all languages, but it is absolutely essential for highly inflected languages.
The major types of morphological processes are inflection, derivation, and compounding: Inflections are the systematic modifications of a root form by means of prefixes and suffixes to indicate grammatical distinctions like singular and plural. Derivation is less systematic. It usually results in a more radical change of syntactic category, and it often involves a change in meaning. Compounding refers to the merging of two or more words into a new word. Further, words are organized into phrases, groupings of words that are clumped as a unit. Syntax is the study of the regularities and constraints of word order and phrase structure.
The above-mentioned prior art morpho-syntactical analyzer is able to handle only two types of Arabic sentences, namely the Arabic verbal and nominal sentences, and to generate the corresponding parse trees. These two sentence types can also be vowelized completely with a certain degree of ambiguity that needs to be resolved through a semantical analyzer.
Sakhr (www.sakhr.com), a Middle East based company, has developed a system for automatic diacritization of Arabic that depends on various levels of language processing and analysis. Starting from a morphological level and ending with disambiguation of word meanings, the method relies on extensive basic research in the area of Natural Language Processing (NLP) and on large linguistic databases that Sakhr has developed over many years. Disadvantageously, in this approach the databases can be kept up-to-date only with a large amount of manual work and highly skilled staff, due to the “productive” nature of any language, as described above, and due to the even more problematic fact that Arabic is a highly inflected language.
1.3. Objectives of the Invention
It is thus an objective of the present invention to help overcome the above-mentioned disadvantages.