1. Field of Invention
The invention relates to methods and apparatus for synthesizing speech.
2. Description of Related Art
Rule-based speech synthesis is used in a variety of applications, including Text-To-Speech (TTS) and voice response systems. Typical rule-based speech synthesis techniques involve concatenating pre-recorded phonemes to form new words and sentences.
Previous concatenative speech synthesis systems create synthesized speech by using single stored samples for each phoneme in order to synthesize a phonetic sequence. A phoneme, or phone, is a small unit of speech sound that serves to distinguish one utterance from another. For example, in the English language, the phoneme /r/ corresponds to the letter "R" while the phoneme /t/ corresponds to the letter "T". Synthesized speech created by this technique sounds unnatural and is usually characterized as "robotic" or "mechanical."
More recently, speech synthesis systems started using large inventories of acoustic units, with many acoustic units representing variations of each phoneme. An acoustic unit is a particular instance, or realization, of a phoneme. Large numbers of acoustic units can all correspond to a single phoneme, each differing from the others in pitch, duration, and stress as well as various other qualities. While such systems produce a more natural sounding voice quality, they require a great deal of computational resources during operation to do so. Accordingly, there is a need for new methods and apparatus to provide natural voice quality in synthetic speech while reducing the computational requirements.
The invention provides methods and apparatus for speech synthesis by selecting recorded speech fragments, or acoustic units, from an acoustic unit database. To aid acoustic unit selection, a measure of the mismatch between pairs of acoustic units, or concatenation cost, is pre-computed and stored in a database. Looking up costs in this concatenation cost database greatly reduces the computational load compared to computing concatenation costs at run-time.
The concatenation cost database can contain the concatenation costs for a subset of all possible acoustic unit sequential pairs. Given that only a fraction of all possible concatenation costs are provided in the database, the situation can arise where the concatenation cost for a particular sequential pair of acoustic units is not found in the concatenation cost database. In such instances, either a default value is assigned to the sequential pair of acoustic units or the actual concatenation cost is derived.
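The lookup behavior described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the class name `ConcatCostCache`, the integer unit identifiers, and the `compute` callback are all hypothetical names chosen for illustration. The sketch assumes the pre-computed costs are held in an in-memory mapping keyed by the ordered pair of units.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical identifier scheme: each acoustic unit is an integer id,
# and a sequential pair is (left unit id, right unit id).
UnitPair = Tuple[int, int]


class ConcatCostCache:
    """Pre-computed concatenation costs for a subset of unit pairs."""

    def __init__(
        self,
        costs: Dict[UnitPair, float],
        default_cost: Optional[float] = None,
        compute: Optional[Callable[[int, int], float]] = None,
    ) -> None:
        if default_cost is None and compute is None:
            raise ValueError("provide a default cost or a compute function")
        self._costs = dict(costs)
        self._default = default_cost
        self._compute = compute

    def cost(self, left: int, right: int) -> float:
        """Return the stored cost, or fall back when the pair is missing."""
        key = (left, right)
        if key in self._costs:
            return self._costs[key]          # cheap table lookup
        if self._compute is not None:
            return self._compute(left, right)  # derive the actual cost
        return self._default                 # assign a default value
```

The pair is ordered because the mismatch at a join generally depends on which unit comes first. Whether a missing pair receives a default or a computed cost is a policy choice; this sketch prefers the computed cost when a compute function is supplied.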
The concatenation cost database can be derived using statistical techniques which predict the acoustic unit sequential pairs most likely to occur in common speech. The invention provides a method for constructing a medium with an efficient concatenation cost database by synthesizing a large body of speech, identifying the acoustic unit sequential pairs generated and their respective concatenation costs, and storing the concatenation cost values on the medium.
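As a rough illustration of this construction, the following Python sketch counts the sequential unit pairs that occur in a body of synthesized speech and stores costs for the pairs seen most often. The function name `build_cost_database`, the `cost_fn` callback, and the `max_entries` cap are assumptions made for the example, and an in-memory dictionary stands in for the storage medium.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Optional, Sequence, Tuple

UnitPair = Tuple[int, int]  # hypothetical (left unit id, right unit id)


def build_cost_database(
    unit_sequences: Iterable[Sequence[int]],
    cost_fn: Callable[[int, int], float],
    max_entries: Optional[int] = None,
) -> Dict[UnitPair, float]:
    """Store costs for the unit pairs observed in synthesized speech.

    unit_sequences: one sequence of selected unit ids per synthesized utterance.
    cost_fn: computes the concatenation cost of a pair (assumed to be expensive).
    max_entries: optional cap; only the most frequent pairs are kept.
    """
    pair_counts: Counter = Counter()
    for seq in unit_sequences:
        # Each adjacent pair in an utterance is one sequential pair.
        pair_counts.update(zip(seq, seq[1:]))

    # most_common(None) returns every pair, ordered by descending count.
    kept = pair_counts.most_common(max_entries)
    return {pair: cost_fn(*pair) for pair, _count in kept}
```

Each distinct pair is costed once regardless of how often it occurs, so the expensive computation is paid at construction time rather than during synthesis.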
Other features and advantages of the present invention will be described below or will become apparent from the accompanying drawings and from the detailed description which follows.