1. Field of the Invention
The present invention relates generally to speech synthesis and more specifically to caching join costs for commonly used phoneme sequences for use in speech synthesis.
2. Introduction
Currently, unit selection speech synthesis is performed by selecting and concatenating appropriate acoustic units from a large audio database. Unit selection speech synthesis can be computationally expensive because there are so many possible combinations to consider in real-time calculations. Join cost calculations are among the most frequently performed operations. In order to solve the problem of expensive join cost calculations, many in the art have tried to cache join cost calculations, but combinatorics (specifically permutations with repetition) make the number of join cost calculations prohibitively large. As a reminder, the phrase permutation with repetition represents mathematical combinations where order matters and an item can be used more than once. Permutation with repetition is mathematically represented by the equation NR where N is the number of objects you can choose from and R is the number to be chosen. As an example, consider a modest estimate of roughly 60 possible phonemes for N. R is the number of phonemes in a given word. The possible permutations are immense. For synthesis of a particular word consisting of a sequence of 5 sounds, if we consider that there are 30 examples of each required sound in the database that could potentially be chosen, then 305, or approximately 24 million, possible outcomes exist. For a word consisting of a sequence of 6 sounds, just one sound more, then 306 possible outcomes exist, skyrocketing the possible outcomes to over 700 million.
The BMR approach, as represented in U.S. Pat. No. 7,082,396, tries to minimize the cache of join cost calculations by only caching “winning” joins which represent the best path through a network for at least one sentence in a text database. The BMR approach is generally successful, but is limited because it requires a lengthy training process and as the number of units in the cache increases, the yield from the process decreases. If the front end changes, substantial retraining may be necessary to add the new material in the front end. Accordingly, what is needed in the art is a method of performing speech synthesis by making a synthesis-independent way to generate a manageable cache of join costs for phoneme sequences.