Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
There has been interest shown in “adaptation” or building synthetic voices from spontaneous or “found” speech, such as voicemails or audio-video clips. To build a synthetic voice, the audio is typically transcribed to text. The easiest transcription method is to use automatic speech recognition [ASR] algorithms on the spontaneous/found speech to generate the text. However, ASR can be error-prone, leading to a large number of transcription errors in the output text, with a subsequent degradation in the quality of the synthetic voice.
Training for speech recognition and generation of synthetic voices can use a “source” database of spoken speech, which in some cases, can involve a large amount (e.g., 50+ hours) of spoken speech. During training, a Hidden Markov Model (HMM) can be generated to model the database of spoken speech, and decision trees can be generated based on the HMM.
For example, suppose the input sentence to the HMM were “Butter wouldn't melt.” Then, the complete or “full context” for the first phoneme, or basic element of speech in this sentence, can include features such as:
phoneme id: /b/
consonant/vowel: consonant
plosive (stop consonant)
position of word in sentence: 1
words in sentence: 3
syllable position in word: 1
syllables in word: 2
phoneme to the right: /u/
phoneme to the left: N/A
vowel to the right: yes
type of vowel to right: mid-back
. . . .
Each feature can be modeled as the answer to a yes/no question in a decision tree, with a most significant feature for distinguishing data at a top node of the tree and progressively less significant features in nodes below the top node. At the top of the tree, errors in input text have little or no import, as there is little or no classification of text-related features, while at the leaf node level, text errors have greater import. That is, top-level questions often have little or no relation to the input text, such as “Is this the start of a sentence?” or “Is the input sound a consonant?” At the leaf nodes, the questions are detailed so to identify specific acoustic features.
As a full context can be very detailed and many contexts can be generated for a corpus of input speech, linguistic features or “contexts” of the HMM can be “clustered”, or grouped together to form the decision trees. Clustering can simplify the decision trees by finding the distinctions that readily group the input speech. In some cases, a “tied” or “clustered” decision tree can be generated that does not distinguish all features that make up full contexts for all phonemes; rather, a clustered decision tree stops when most important features in the contexts can be identified.
A group of decision trees, perhaps including clustered decision trees, can form a “source acoustic model” or “speaker-independent acoustic model” that uses likelihoods of the training data in the source database of spoken speech to cluster and split the spoken speech data based on features in the contexts of the spoken speech. As shown in FIG. 1A, each stream of information (fundamental frequency (F0), duration, spectral, and aperiodicity) can have a separately trained decision tree in the source acoustic model. FIG. 1B shows the top four layers of an example spectral decision tree.
In current adaptation systems, audible words or “speech data” from a designated speaker can be transcribed to generate “input text.” The speech data can be broken down into frames of small amounts, e.g., 10 to 50 milliseconds of speech. During adaptation, each frame of the speech data and/or the input text can be applied at any level of the tree to propagate down to the physical models at the leaf (bottom) nodes.
FIG. 1C shows an example of prior art adaptation of source acoustic model 100, where speech data 110 and input text 120 generated by ASR are propagated down decision trees DT1, DT2, DT3, . . . DTn, n>0, of source acoustic model 100. As an example, for adaptation of the acoustic model of FIG. 1A, n=4, with DT1 corresponding to the Aperiodicity DT, DT2 corresponding to the Spectral DT, DT3 corresponding to the Duration DT, and DT4 corresponding to the F0 DT.
The decision trees DT1, DT2, DT3, . . . DTn can be generated based on an HMM constructed during a training session utilizing part or all of the speech in the database of spoken speech. FIG. 1C also shows that frames of speech data 110 can be provided to an automatic speech recognition (ASR) unit to determine input text for the speech data. In a typical adaptation system, each feature tree is only traversed to a fixed depth, such as two or four nodes deep, to avoid errors in classification of features and/or errors in the input text. Source acoustic model 100 can then be said to be adapted by a prior system when all decision trees DT1, DT2, DT3, . . . DTn, have been traversed to their respective fixed depths.