This invention relates generally to a speech synthesizer that reproduces speech by joining together a plurality of phonemes in sequence, and more particularly to a speech synthesizer in which the phonemes and the control instructions for outputting the phonemes in the proper sequence are stored in separate memories.
In a speech synthesizing system, typical speech elements are selected from natural human speech and stored as waveform data. For voiced sounds, which have periodicity, the elements are stored as voiced phonemes in units of pitch, that is, intervals or periods of repetition. Voiceless sounds, which have no periodicity, are likewise selected from human speech and stored as voiceless phonemes. Alternatively, portions of the voiceless sounds are used repetitively as voiceless phonemes. The voiced and voiceless phonemes are stored in separate voiced phoneme and voiceless phoneme memories, respectively, and are then read out and coupled together in accordance with externally provided control information, whereby speech is synthesized. The externally provided control information comprises instructions as to whether a phoneme is voiced or voiceless, phoneme numbers, amplitudes, pitches, repetition numbers, and the like. In such a speech synthesizing system, the typical voiced and voiceless phonemes of a language are all recorded as representative phonemes. Those phonemes which are most analogous to the natural speech to be reproduced are successively selected from this inventory and coupled together to generate a desired word.
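The read-out and coupling operation described above can be pictured with a short Python sketch. It is an illustration under stated assumptions only, not the circuitry of any actual synthesizer: the names `ControlRecord` and `synthesize` are hypothetical, the two memories are modeled as lists of sample lists, and the way a stored pitch period is fitted to the requested pitch (truncation or zero-padding) is assumed for simplicity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ControlRecord:
    """One unit of externally provided control information (hypothetical layout)."""
    voiced: bool          # selects the voiced or the voiceless phoneme memory
    phoneme_number: int   # index of the phoneme within the selected memory
    amplitude: float      # gain applied to the stored waveform
    pitch: int            # output period in samples (used for voiced phonemes only)
    repetitions: int      # number of times the phoneme is output in succession


def synthesize(records: List[ControlRecord],
               voiced_memory: List[List[float]],
               voiceless_memory: List[List[float]]) -> List[float]:
    """Read phonemes out of two separate memories and couple them in sequence."""
    output: List[float] = []
    for rec in records:
        memory = voiced_memory if rec.voiced else voiceless_memory
        waveform = memory[rec.phoneme_number]
        if rec.voiced:
            # Assumption: fit the stored period to the requested pitch by
            # truncating it or padding it with zeros.
            segment = (waveform + [0.0] * rec.pitch)[:rec.pitch]
        else:
            # Voiceless phonemes have no periodicity, so pitch is ignored.
            segment = waveform
        for _ in range(rec.repetitions):
            output.extend(rec.amplitude * sample for sample in segment)
    return output
```

In this sketch a word is simply a list of `ControlRecord` entries, and the quality of the result depends entirely on how well the stored representative phonemes match the word being assembled.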
However, the quality of speech synthesized by such a system has proved unsatisfactory, because the representative phonemes which constitute the synthesized speech are, in most cases, extracted from words different from the actual words which are to be produced by joining the phonemes together. In actual applications where a synthesized message is produced, words are usually generated in groups ranging from several words to slightly more than ten words, and the time required to deliver such a message is on the order of ten seconds. Thus, the ability to generate any and all words on demand is not always necessary, as a fixed message is frequently all that is required.
Also, storing the voiced phonemes and the voiceless phonemes in separate memories is undesirable from the standpoint of assembling a speech synthesizer as a one-chip integrated circuit. Because the ratio between the amounts of voiced and voiceless phoneme memory actually used varies with the words to be stored, an unused area frequently remains in one of the memories, which makes it impossible to put the memories to truly efficient use. Further, the use of two phoneme memories complicates the control circuitry.
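The memory inefficiency noted above can be illustrated with a small calculation. The following Python sketch uses hypothetical capacities and storage requirements (arbitrary units, not figures from this disclosure) and hypothetical function names; it only shows why a fixed split between two memories can leave area unused in one memory while the other overflows.

```python
def fits_separate(voiced_need: int, voiceless_need: int,
                  voiced_capacity: int, voiceless_capacity: int) -> bool:
    """A message fits only if each kind of phoneme fits in its own memory."""
    return voiced_need <= voiced_capacity and voiceless_need <= voiceless_capacity


def fits_shared(voiced_need: int, voiceless_need: int, total_capacity: int) -> bool:
    """With one shared memory, only the combined requirement matters."""
    return voiced_need + voiceless_need <= total_capacity


def unused_separate(voiced_need: int, voiceless_need: int,
                    voiced_capacity: int, voiceless_capacity: int) -> tuple:
    """Unused area left in each of the two memories (assumes the message fits)."""
    return (voiced_capacity - voiced_need, voiceless_capacity - voiceless_need)


# Hypothetical message: mostly voiced phonemes, few voiceless ones.
print(fits_separate(1200, 300, 1000, 1000))  # the voiced memory overflows
print(fits_shared(1200, 300, 2000))          # the same total area suffices when shared
print(unused_separate(900, 300, 1000, 1000)) # most of the voiceless memory is idle
```

With the same total area, the shared arrangement accommodates any ratio of voiced to voiceless phonemes, whereas the fixed split accommodates only ratios close to the one chosen at design time.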
What is needed is a speech synthesizer which uses a single memory to store both voiced and voiceless phonemes in an efficient manner. It is also desirable that the synthesized speech accurately represent the words and language which are spoken.