There are various methods available to solve the speech synthesis problem in general. The most successful methods use an inventory of prerecorded speech units, such as diphones, and concatenate the units (with or without some prosodic modifications) to synthesize fluent speech with correct prosody. Prosody relates to the pitch, rhythm, stress, tempo and intonation used in expressing words i.e. how the words are spoken. Through employing unit selection methods described in U.S. Pat. No. 6,266,637, one can achieve a reasonable quality of synthesized speech and avoid the prosodic modification of speech units by recording a very large inventory of units and searching for optimal units to be concatenated at the synthesis stage.
However, these techniques require a large amount of volatile and nonvolatile memory to store the unit inventory, and search results. Also, the search for optimal units at the synthesis stage is complicated and increases the computation load significantly.
An alternative form of Text-to-Speech (TTS) synthesizers is the class of small-unit concatenation systems that use less than a few thousands of speech units. Amongst the various versions of these systems proposed in the literature, the Time-Domain Pitch-Synchronous Overlap and Add (TD-PSOLA) method is very simple and offers a reasonable speech quality if the problems of pitch, phase and spectral discontinuities are properly addressed. Details of TD-PSOLA is described in Diphone Synthesis Using an Overlap-Add Technique for Speech Waveforms Concatenation, F. Charpentier and M. G. Stella, Proceedings of the ICASSP, 1986 pp. 2015 to 2018 and Pitch-Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis Using Diphones, E. Moulines, and F. Charpentier, Speech Communication, vol. 9, No. 5–6, 1990 and U.S. Pat. No. 5,369,730.
In PC-based synthesis systems, synthesized speech is stored in temporary files that are played back when a part of the text (such as a complete phrase, sentence or paragraph) has been processed. In contrast, in a typical real-time system, the text has to be processed while synthesis is taking place. Synthesis cannot be interrupted once it has started. Also, synthesis is not a straight-through process in which the input data can be simply synthesized as it is made available to the processor. The processor has to buffer enough data to account for variations in prosody. It also has to work on several frames at a time in order to perform interpolation between such frames while synthesis is taking place.
Therefore, it is desirable to provide a real-time audio synthesis method and system that offers a high quality audio in real-time, and that can meet requirements of low resource usages (i.e. lower memory usage, low power consumption, low computation load and complexity, low processing delay).