The present invention relates to models of speech. In particular, the present invention relates to vocal tract resonance (VTR) models of structured speech.
In recent years, much research in spoken language technology has been devoted to incorporating structures of human speech and language into statistical speech recognition systems. Researchers have explored approaches that exploit, either implicitly or explicitly, the hidden structure of the human speech generation process. One key component of these hidden dynamic modeling approaches is a target-filtering operation in some non-observable (i.e., hidden) domain.
Human speech contains spectral prominences or VTRs. These VTRs carry a significant amount of the information contained in human speech. In the past, attempts have been made to model the VTRs associated with particular phonetic units, such as phonemes, using discrete state models such as a Hidden Markov Model. Such models have been less than ideal, however, because they do not perform well when the speaking rate increases or the articulation effort of the speaker decreases. Research into the behavior of VTRs during speech suggests one possible reason why conventional Hidden Markov Model based systems have difficulty handling fluent speech: as the speaking rate increases or the articulation effort decreases, the static VTR values, and hence the static acoustic information, for different classes of phonetic units become very similar. Although this phenomenon, known as reduction, has been observed in human speech, an adequate and quantitative model for predicting such behavior in VTR trajectories has been needed.
Recently, a bi-directional target filtering approach to modeling speech coarticulation and context-assimilated reduction has been developed. This hidden trajectory model functionally achieves both anticipatory and regressive coarticulation, while keeping the phonological units as a linear phonemic sequence and bypassing the use of more elaborate nonlinear phonological constructs. One key set of parameters in the hidden trajectory model is the VTR targets, which are specific to each phone but are context independent.
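The effect of bi-directional target filtering can be illustrated with a minimal sketch. The filter shape used here (a symmetric window with weights proportional to gamma raised to the absolute frame offset), the function name, and the parameter values are assumptions made for illustration only; the text above does not specify the form of the filter.

```python
import numpy as np


def bidirectional_target_filter(targets, gamma=0.6, half_width=3):
    """Smooth a per-frame target sequence with a symmetric, non-causal FIR filter.

    Assumed form (not taken from the text): weights proportional to
    gamma ** |offset| over a window of +/- half_width frames, normalized
    to sum to one so that a constant target sequence is left unchanged.
    """
    targets = np.asarray(targets, dtype=float)
    offsets = np.arange(-half_width, half_width + 1)
    weights = gamma ** np.abs(offsets)
    # Repeat the edge targets so the output has the same length as the input.
    padded = np.pad(targets, half_width, mode="edge")
    # The kernel is symmetric, so convolution and correlation coincide.
    smoothed = np.convolve(padded, weights, mode="valid")
    return smoothed / weights.sum()


# Hypothetical single-resonance targets in Hz: a long phone, a short phone, a long phone.
frame_targets = [500.0] * 10 + [1500.0] * 3 + [500.0] * 10
trajectory = bidirectional_target_filter(frame_targets)
```

Because the window looks both backward and forward in time, the trajectory starts moving toward the 1500 Hz target before the short phone begins (anticipatory coarticulation) and is still displaced after it ends (regressive coarticulation). The short phone never reaches its target, which is a simple picture of reduction at fast speaking rates.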
Determining the values of these parameters is important to the success of applying the model to speech recognition. The simplest way is to train a single set of VTR targets for all speakers, i.e., in a speaker-independent manner. In this case, the training averages out the variability of the VTR targets over all speakers in the training set. However, VTRs and their targets are related to the vocal tract length of the speaker, and hence they vary among speakers. A single set of VTR targets can produce VTR trajectories that match the data well for some speakers, but not for others. An improved method of determining the values of these resonance targets is therefore needed.
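The dependence of resonance frequencies on vocal tract length can be made concrete with the idealized uniform-tube approximation from acoustic phonetics, in which a tube closed at one end and open at the other resonates at odd multiples of c / (4L). This is a textbook simplification offered as an illustration, not part of the model described above; the function name, the speed-of-sound constant, and the example lengths are assumptions.

```python
SPEED_OF_SOUND_CM_PER_S = 35000.0  # assumed approximate value for warm, moist air


def uniform_tube_resonances(tract_length_cm, count=3):
    """Return the first `count` resonances (Hz) of a uniform quarter-wavelength tube.

    Idealized approximation: F_n = (2n - 1) * c / (4 * L), for a tube that is
    closed at one end (glottis) and open at the other (lips).
    """
    return [
        (2 * n - 1) * SPEED_OF_SOUND_CM_PER_S / (4.0 * tract_length_cm)
        for n in range(1, count + 1)
    ]


# Hypothetical tract lengths for two speakers.
longer_tract = uniform_tube_resonances(17.5)   # 500, 1500, 2500 Hz
shorter_tract = uniform_tube_resonances(14.0)  # 625, 1875, 3125 Hz
```

Under this approximation, shortening the tract from 17.5 cm to 14 cm raises every resonance by 25 percent, which suggests why a single speaker-independent set of targets cannot fit all speakers equally well.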