The invention relates generally to the field of speech signal processing and, more particularly, to formant tracking based on phoneme information in speech analysis.
Various speech analysis methods are available in the field of speech signal processing. One method in the art is to analyze the spectrograms of particular segments of input speech. The spectrogram of a speech signal is a two-dimensional representation (time vs. frequency), where the color or darkness of each point indicates the amplitude of the corresponding frequency component. At a given time point, a cross section of the spectrogram along the frequency axis (the spectrum) generally has a profile that is characteristic of the sound in question. In particular, each voiced sound, such as a vowel or vowel-like sound, has characteristic frequency values for several spectral peaks in the spectrum. For example, the vowel in the word “beak” is signified by spectral peaks at around 200 Hz and 2300 Hz. The spectral peaks are called the formants of the vowel, and the corresponding frequency values are called the formant frequencies of the vowel.
A “phoneme” corresponds to the smallest unit of speech sound that serves to distinguish one utterance from another. For instance, in the English language, the phoneme /i/ corresponds to the sound for the “ea” in “beat.” It is widely accepted that the first two or three formant frequencies characterize the corresponding phoneme of the speech segment.
A “formant trajectory” is the variation or path of a particular formant frequency as a function of time. When the formant frequencies are plotted as a function of time, their trajectories usually change smoothly inside phonemes corresponding to a vowel sound and between phonemes corresponding to such vowel sounds. This data is useful for applications such as text-to-speech generation (“TTS”), where formant trajectories are used to determine the best speech fragments to assemble to produce speech from text input.
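As a rough illustration of the spectrogram described above, the following sketch computes one magnitude spectrum per time frame with a direct discrete Fourier transform. The function names, the Hann window, and the frame and hop parameters are assumptions made for this example and do not come from the conventional systems discussed here.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Magnitude of the DFT (bins 0..N/2) of one Hann-windowed frame."""
    n = len(frame)
    # Hann window (an assumed choice) to reduce spectral leakage.
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    return [abs(sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                    for i in range(n)))
            for k in range(n // 2 + 1)]

def spectrogram(signal, frame_length, hop):
    """One magnitude spectrum per frame; time runs along the outer list."""
    return [magnitude_spectrum(signal[start:start + frame_length])
            for start in range(0, len(signal) - frame_length + 1, hop)]
```

Each inner list is one cross section along the frequency axis; bin k corresponds to the frequency k * sample_rate / frame_length, so spectral peaks can be read off as the bins with locally largest magnitude.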
FIG. 1 is a diagram illustrating a conventional formant tracking method in which input speech 102 is processed to generate formant trajectories for subsequent use in applications such as TTS. First, a spectral analysis is performed on input speech 102 (Step 104) using a technique such as linear predictive coding (LPC) to extract formant candidates 106 by solving for the roots of a linear prediction polynomial. A candidate selection process 108 is then used to choose which of the formant candidates are retained as the final formant trajectories 110. Candidate selection 108 is based on various criteria, such as formant frequency continuity.
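The candidate extraction step can be sketched as follows for a single frame. This is a minimal illustration under stated assumptions, not the method of any particular system: it assumes the autocorrelation form of LPC with a Levinson-Durbin recursion, finds the polynomial roots with a Durand-Kerner iteration, and converts each root to a frequency and bandwidth using the usual pole-angle and pole-radius relations. All function names and parameters are hypothetical.

```python
import cmath
import math

def autocorrelation(frame, max_lag):
    """Autocorrelation of one speech frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def lpc_coefficients(r, order):
    """Levinson-Durbin recursion; returns [1, a1, ..., ap] of A(z)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def polynomial_roots(coeffs, iterations=200):
    """Roots of a monic polynomial (highest power first), Durand-Kerner."""
    degree = len(coeffs) - 1
    roots = [(0.4 + 0.9j) ** k for k in range(degree)]  # distinct start points
    for _ in range(iterations):
        for i in range(degree):
            z = roots[i]
            value = 0j
            for c in coeffs:                # Horner evaluation at z
                value = value * z + c
            denom = 1 + 0j
            for j in range(degree):
                if j != i:
                    denom *= z - roots[j]
            roots[i] = z - value / denom
    return roots

def formant_candidates(frame, sample_rate, order):
    """Return (frequency_hz, bandwidth_hz) pairs sorted by frequency."""
    a = lpc_coefficients(autocorrelation(frame, order), order)
    candidates = []
    for z in polynomial_roots(a):
        if z.imag > 1e-9:                   # one root per conjugate pair
            freq = cmath.phase(z) * sample_rate / (2 * math.pi)
            bandwidth = -math.log(abs(z)) * sample_rate / math.pi
            candidates.append((freq, bandwidth))
    return sorted(candidates)
```

An order-p analysis yields at most p/2 candidates per frame, which is typically more than the two or three formants of interest; this is why a separate selection step is needed.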
Regardless of the particular criteria, conventional selection processes operate without reference to text data associated with the input speech. Only after candidate selection is complete are the final formant trajectories 110 correlated with input text 112 and processed (formant data processing step 114) to generate, e.g., an acoustic database that associates the final formant data with text phoneme information for later use in another application, such as TTS or voice recognition.
Conventional formant tracking techniques are prone to tracking errors and are not sufficiently reliable for unsupervised and automatic usage. Thus, human supervision is needed to monitor the tracking performance of the system by viewing the formant tracks in a larger time context with the aid of a spectrogram. Nonetheless, when only limited information is provided, even human-supervised systems can be as unreliable as conventional automatic formant tracking.
Accordingly, it would be advantageous to provide an improved formant tracking method that significantly reduces tracking errors and can operate reliably without the need for human intervention.
The invention provides an improved formant tracking method and system that selects formant trajectories by using information derived from the text data corresponding to the processed speech before the final formant trajectories are selected. According to the invention, the input speech is analyzed in a plurality of time frames to obtain formant candidates for each time frame. The text data corresponding to the input speech is converted into a sequence of phonemes. The input speech is segmented by inserting temporal boundaries, and the sequence of phonemes is aligned with the corresponding segments of the input speech. Predefined nominal formant frequencies are then assigned to a center point of each phoneme, and this data is interpolated to provide target formant trajectories for each time frame. For each time frame, the formant candidates are compared with the target formant trajectories, and candidates are selected according to one or more cost factors. The selected formant candidates are then output for storage or further processing in subsequent speech applications.
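The interpolation and selection steps summarized above might be sketched as follows. This is only an illustration under several assumptions that the summary does not specify: the interpolation is taken to be linear between phoneme center points, the single cost factor is the absolute distance between a candidate and its target frequency, and the nominal formant table holds placeholder values. All names are hypothetical.

```python
from itertools import combinations

# Hypothetical nominal formant table (Hz); placeholder values for illustration.
NOMINAL_FORMANTS = {
    "i": (300.0, 2300.0, 3000.0),
    "a": (700.0, 1100.0, 2500.0),
}

def target_trajectories(segments, frame_times):
    """segments: (phoneme, start_s, end_s) tuples from the alignment step.
    Returns one tuple of target formant frequencies per frame time."""
    centers = [((start + end) / 2.0, NOMINAL_FORMANTS[ph])
               for ph, start, end in segments]
    targets = []
    for t in frame_times:
        if t <= centers[0][0]:              # before the first center: hold
            targets.append(centers[0][1])
        elif t >= centers[-1][0]:           # after the last center: hold
            targets.append(centers[-1][1])
        else:
            for (t0, f0), (t1, f1) in zip(centers, centers[1:]):
                if t0 <= t <= t1:           # linear interpolation (assumed)
                    w = (t - t0) / (t1 - t0)
                    targets.append(tuple(a + w * (b - a)
                                         for a, b in zip(f0, f1)))
                    break
    return targets

def select_candidates(candidates, target):
    """Pick the frequency-ordered subset of candidates closest to the target.
    Assumes at least as many candidates as target formants."""
    ordered = sorted(candidates)
    return min(combinations(ordered, len(target)),
               key=lambda combo: sum(abs(c - t)
                                     for c, t in zip(combo, target)))
```

For a frame with candidates at 450, 1650, 2100, 2800 and 3600 Hz and a target of (500, 1700, 2750) Hz, this cost selects 450, 1650 and 2800 Hz. Additional cost factors, such as continuity with the previous frame, could be added to the same sum.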