1. Field of the Invention
This invention relates to a method and apparatus for converting text to a waveform. More specifically, it relates to the production of an output in the form of an acoustic wave, namely synthetic speech, from an input in the form of signals representing a conventional text.
2. Related Art
This overall conversion is very complicated and it is sometimes carried out in several modules wherein the output of one module constitutes the input for the next. The first module receives signals representing a conventional text and the final module produces synthetic speech as its output. This synthetic speech may be a digital representation of the waveform followed by conventional digital-to-analogue conversion in order to produce the audible output. In many cases it is desired to provide the audible output over a telephone system. In this case it may be convenient to carry out the digital-to-analogue conversion after transmission so that transmission takes place in digital form.
There are advantages in the modular structure, e.g. each module is separately designed and any one of the modules can be replaced or altered in order to provide flexibility or improvements, or to cope with changing circumstances.
Some procedures utilise a sequence of three modules, namely
(A) pre-editing,
(B) conversion of graphemes to phonemes, and
(C) conversion of phonemes to (digital) waveform.
A brief description of these modules will now be given.
Module (A) receives signals representing a conventional text, e.g. the text of this specification, and it modifies selected features. Thus module (A) may specify how numbers are processed. For example, it will decide if
"1345"
becomes
One three four five
Thirteen forty-five or
One thousand three hundred and forty-five.
It will be apparent that it is relatively easy to provide different forms of module (A), each of which is compatible with the subsequent modules so that different forms of output result.
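The choice that module (A) makes between the three readings of "1345" can be illustrated with a short sketch. The function names, the word tables and the restriction to numbers below 10000 are assumptions made for illustration; they are not taken from this specification.

```python
# Hypothetical pre-editing helpers for module (A); all names are illustrative.
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy", "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell a number in the range 0..99."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def as_digits(text: str) -> str:
    """Read every digit separately."""
    return " ".join(ONES[int(ch)] for ch in text)


def as_pairs(text: str) -> str:
    """Read a four-digit string as two two-digit numbers."""
    return f"{two_digits(int(text[:2]))} {two_digits(int(text[2:]))}"


def as_cardinal(text: str) -> str:
    """Read a string as one cardinal number (assumed range 0..9999)."""
    thousands, rest = divmod(int(text), 1000)
    hundreds, tail = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(f"{ONES[thousands]} thousand")
    if hundreds:
        parts.append(f"{ONES[hundreds]} hundred")
    if tail or not parts:
        if parts:
            parts.append("and")
        parts.append(two_digits(tail))
    return " ".join(parts)


for reading in (as_digits, as_pairs, as_cardinal):
    print(reading("1345"))
```

Each function is one possible form of the number-handling part of module (A). Because all three return ordinary text, any of them could feed the same subsequent modules.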
Module (B) converts graphemes to phonemes. "Grapheme" denotes data representations corresponding to the symbols of the conventional alphabet used in the conventional manner. The text of this specification is a good example of "graphemes". It is a problem of synthetic speech that the graphemes may have little relationship to the way in which the words are pronounced, especially in languages such as English. Therefore, in order to produce waveforms, it is appropriate to convert the graphemes into a different alphabet, called "phonemes" in this specification, which has a very close correlation with the sound of the words. In other words, it is the purpose of module (B) to deal with the problem that the conventional alphabet is not phonetic.
Module (C) converts the phonemes into a digital waveform which, as mentioned above, can be converted into an analogue format and thence into an audible waveform.
This invention relates to a method and apparatus for use in module (B) and this module will now be described in more detail.
Module (B) utilises linked databases which are formed of a large number of independent entries. Each entry includes access data, in the form of representations, e.g. bytes, of a sequence of graphemes, and an output string, which contains representations, e.g. bytes, of the phonemes equivalent to the graphemes contained in the access section. A major problem of grapheme/phoneme conversion resides in the size of database necessary to cope with a language. One simple, and theoretically ideal, solution would be to provide a database so large that it has an individual entry for every possible word in the language, including all possible inflections of every possible word in the language. Clearly, given a complete database, every word in the input text would be individually recognised and an excellent phoneme equivalent would be output. It should be apparent that it is not possible to provide such a complete database. In the first place, it is not possible to list every word in a language, and even if such a list were available it would be too large for computational purposes.
Although the complete database is not possible, it is possible to provide a database of usable size which contains, for example, common words and words whose pronunciation is not simply related to the spelling. Such a database will give excellent grapheme/phoneme conversion for the words included therein but it will fail, i.e. give no output at all, for the missing words. In any practical implementation this would mean an unacceptably high proportion of failures.
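The behaviour of such a whole-word database, including its failure mode, can be sketched as a simple mapping from access data to output string. The entries, the informal SAMPA-like phoneme notation and the function name are assumptions made for illustration only.

```python
from typing import Optional

# Hypothetical entries: access data (graphemes) -> output string (phonemes).
# The phoneme symbols are an informal SAMPA-like notation chosen for this sketch.
WHOLE_WORD_DB = {
    "one": "w V n",
    "two": "t u:",
    "colonel": "k 3: n @ l",
}


def lookup_whole_word(word: str) -> Optional[str]:
    """Return the stored phoneme string, or None when the word is missing."""
    return WHOLE_WORD_DB.get(word.lower())


print(lookup_whole_word("Colonel"))  # a stored word yields its output string
print(lookup_whole_word("sheep"))    # a missing word yields no output at all
```

Returning `None` for a missing word mirrors the failure described above: the database either has an entry or produces nothing.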
Another possibility uses a database in which the access data corresponds to short strings of graphemes each of which is linked to its equivalent string of phonemes. This alternative utilises a manageable size of database but it depends upon analysis of the input text to match strings contained therein with the access data in the database. Systems of this nature can provide a high proportion of excellent pronunciations with occurrences of slight and severe mispronunciation. There will also be a proportion of failures wherein no output at all is produced either because the analysis fails or a needed string of graphemes is missing from the access section of the database.
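By way of illustration, one possible form of analysis is a greedy longest-match scan over the input word. This specification does not state which analysis is used, so the scan, the fragment table and the informal SAMPA-like phoneme notation in the following sketch are all assumptions.

```python
from typing import List, Optional

# Hypothetical short-string entries: grapheme string -> phoneme string.
FRAGMENT_DB = {
    "sh": "S", "ch": "tS", "ee": "i:",
    "s": "s", "h": "h", "p": "p", "t": "t", "i": "I", "n": "n",
}
MAX_FRAGMENT = max(len(key) for key in FRAGMENT_DB)


def analyse(word: str) -> Optional[List[str]]:
    """Greedy longest-match scan; returns None if any position cannot be matched."""
    word = word.lower()
    phonemes: List[str] = []
    pos = 0
    while pos < len(word):
        # Try the longest candidate fragment first, then shorter ones.
        for size in range(min(MAX_FRAGMENT, len(word) - pos), 0, -1):
            fragment = word[pos:pos + size]
            if fragment in FRAGMENT_DB:
                phonemes.append(FRAGMENT_DB[fragment])
                pos += size
                break
        else:
            # No entry covers this position: a needed string is missing.
            return None
    return phonemes


print(analyse("sheep"))
print(analyse("box"))
```

The second call fails because "b" has no entry, which corresponds to the case where a needed string of graphemes is missing from the access section of the database.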
A final possibility is conveniently known as a "default" procedure because it is only used when preferred techniques fail. A "default" procedure conveniently takes the form of "pronouncing" the symbols of the input text. Since the range of input symbols is not only known but limited (usually fewer than 100 and in many cases fewer than 50), it is not only possible to produce the database but its size is very small in relation to the capacity of modern data storage systems. This default procedure therefore guarantees an output even though that output may not be the most appropriate solution. In some cases, however, it is the most appropriate solution. Examples include names in which initials are used, degrees and honours, and some abbreviations for units. It will be appreciated that, in these circumstances, it is usual to "pronounce" the letters and on these occasions the default procedure provides the best results.
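A minimal sketch of such a default procedure follows. It assumes a table covering only the 26 letters, transcribed in an informal SAMPA-like notation, and it silently skips any other symbol; a practical table would hold an entry for every input symbol. The table contents and the function name are illustrative and are not taken from this specification.

```python
from typing import List

# Hypothetical letter-name table in an informal SAMPA-like notation.
LETTER_NAMES = {
    "a": "eI", "b": "b i:", "c": "s i:", "d": "d i:", "e": "i:",
    "f": "e f", "g": "dZ i:", "h": "eI tS", "i": "aI", "j": "dZ eI",
    "k": "k eI", "l": "e l", "m": "e m", "n": "e n", "o": "@U",
    "p": "p i:", "q": "k j u:", "r": "A:", "s": "e s", "t": "t i:",
    "u": "j u:", "v": "v i:", "w": "d V b @ l j u:", "x": "e k s",
    "y": "w aI", "z": "z e d",
}


def spell_out(text: str) -> List[str]:
    """Pronounce each letter by name; symbols outside the table are skipped."""
    return [LETTER_NAMES[ch] for ch in text.lower() if ch in LETTER_NAMES]


print(spell_out("BSc"))   # a degree, read letter by letter
print(spell_out("J.R."))  # initials; the full stops are skipped in this sketch
```

Because every letter has an entry, the function returns an output for any alphabetic input, which is the guarantee the default procedure is intended to provide.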
Three different strategies for converting graphemes to phonemes have just been identified and it is important to realise that these alternatives are not mutually exclusive. In fact it is desirable to use all three alternatives according to a strict order of precedence. Thus the "whole word" database is used first and, if it gives an output, that output will be excellent. When it fails, the "analysis" technique is used, which may involve a small but acceptable number of mispronunciations. Finally, if the "analysis" fails, the default option of pronouncing the "letters" is utilised and this can be guaranteed to give an output. Although this may not be completely satisfactory, it will, in a proportion of cases as explained above, give the most appropriate result.
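The strict order of precedence can be sketched as a list of strategies tried in turn, the first output obtained being used. The three functions here are deliberately tiny stand-ins: the table contents, the informal phoneme notation and the bracketed placeholder used for letter names (for example "[b]" standing for the pronunciation of the letter B) are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

WHOLE_WORDS = {"colonel": "k 3: n @ l"}   # hypothetical whole-word entries
FRAGMENTS = {"sh": "S", "ee": "i:", "p": "p"}  # hypothetical short strings


def whole_word(word: str) -> Optional[str]:
    return WHOLE_WORDS.get(word)


def analysis(word: str) -> Optional[str]:
    """Stand-in analysis: greedy match of two-symbol, then one-symbol fragments."""
    out: List[str] = []
    pos = 0
    while pos < len(word):
        for size in (2, 1):
            piece = word[pos:pos + size]
            if piece in FRAGMENTS:
                out.append(FRAGMENTS[piece])
                pos += len(piece)
                break
        else:
            return None  # analysis fails: no entry covers this position
    return " ".join(out)


def default(word: str) -> Optional[str]:
    """Stand-in default: always succeeds by naming each symbol."""
    return " ".join(f"[{ch}]" for ch in word)


# Order of precedence: whole word, then analysis, then default.
STRATEGIES: List[Tuple[str, Callable[[str], Optional[str]]]] = [
    ("whole word", whole_word),
    ("analysis", analysis),
    ("default", default),
]


def convert(word: str) -> Tuple[str, str]:
    """Return (name of the strategy that succeeded, phoneme string)."""
    word = word.lower()
    for name, strategy in STRATEGIES:
        result = strategy(word)
        if result is not None:
            return name, result
    raise AssertionError("unreachable: the default strategy always gives an output")


for text in ("Colonel", "sheep", "BSc"):
    print(text, "->", convert(text))
```

Keeping the strategies in an ordered list makes the precedence explicit and allows any one strategy to be replaced without altering the others.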