1. Field of the Invention
The present invention relates to a speech processing apparatus, method, and computer program product for synthesizing speech.
2. Description of the Related Art
A speech synthesizing device, which synthesizes speech from text, includes three main processing units: a text analyzing unit, a prosody generating unit, and a speech signal generating unit. The text analyzing unit analyzes an input text (containing Latin characters, kanji (Chinese characters), kana (Japanese characters), or any other type of characters) by using a dictionary or the like, and outputs linguistic information defining how to pronounce the text, where to place stress, how to segment the sentence (into accentual phrases), and the like. Based on the linguistic information, the prosody generating unit outputs phonetic and prosodic information, such as a voice pitch (fundamental frequency) pattern (hereinafter, “pitch contour”) and the length of each phoneme. The speech signal generating unit selects speech units in accordance with the arrangement of phonemes, connects the units together while modifying them in accordance with the prosodic information, and thereby outputs synthesized speech. It is well known that, among these three processing units, the prosody generating unit, which generates the pitch contour, has a significant influence on the quality and naturalness of the synthesized speech.
Various techniques for generating a pitch contour have been suggested, such as classification and regression trees (CART), linear models, and hidden Markov models (HMMs). These techniques can be classified into two types:
(1) Outputting a definitive value for each segment of the utterance (usually for each unit of the utterance at a given linguistic level): Techniques based on a code book and on a linear model belong to this type.
(2) Outputting multiple possible values for each segment of the utterance (usually for each unit of the utterance at a given linguistic level): In general, an output vector is modeled in accordance with a probability distribution function, and a pitch contour is formed in such a manner that an objective function consisting of multiple subcosts, such as likelihoods, is maximized. Examples of this type are the HMM-based techniques proposed in “Speech parameter generation from HMM using dynamic features” by Tokuda, K., Masuko, T., Imai, S., 1995, Proc. ICASSP, Detroit, USA, pp. 660-663; and “Hidden Markov models based on multi-space probability distribution for pitch pattern modeling” by Tokuda, K., Masuko, T., Miyazaki, N., and Kobayashi, T., 1999, Proc. ICASSP, Phoenix, Ariz., USA, pp. 229-232.
For techniques of type (1), where a definitive value is generated for each unit at the considered linguistic level, it is difficult to produce a smoothly changing pitch contour. The reason is that the pitch pattern generated for one unit may not match the pitch patterns generated for the adjacent units at the points where they connect. This creates an abnormal sound or a sudden change in intonation, which prevents the speech from sounding natural. Hence, the challenge for this type of technique is how to connect individually generated pitch segments to one another so that the final speech does not sound discontinuous or abnormal.
A common attempt to solve the above problem is to apply a filtering process to the sequence of generated pitch segments so as to smooth the gaps. However, even if the gaps between pitch segments at the connection points are reduced to some extent, it is still difficult to make the pitch contour evolve continuously so that smooth speech is obtained. In addition, if the filtering is applied too strongly, the pitch contour becomes blunt, which, again, makes the speech sound unnatural. Furthermore, the parameters of the filtering process need to be adjusted by trial and error while checking the sound quality. This requires considerable time and labor.
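The trade-off described above can be illustrated with a minimal sketch. The pitch values, the segment lengths, and the choice of a centered moving average as the filter are all assumptions made for illustration only; they are not taken from any particular prior-art system.

```python
def moving_average(values, window):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def largest_jump(contour):
    """Largest absolute difference between neighboring pitch values."""
    return max(abs(b - a) for a, b in zip(contour, contour[1:]))


def pitch_range(contour):
    """Difference between the highest and lowest pitch value."""
    return max(contour) - min(contour)


# Two hypothetical, independently generated pitch segments (Hz).
segment_a = [120.0, 130.0, 140.0, 130.0, 120.0]
segment_b = [150.0, 160.0, 170.0, 160.0, 150.0]
contour = segment_a + segment_b  # gap at the connection point

light = moving_average(contour, 3)  # mild filtering
heavy = moving_average(contour, 7)  # strong filtering

for name, c in (("raw", contour), ("light", light), ("heavy", heavy)):
    print(name, round(largest_jump(c), 2), round(pitch_range(c), 2))
```

In this sketch the window length is the parameter that would have to be tuned by trial and error: a wider window reduces the jump at the connection point further, but it also narrows the overall pitch range, which corresponds to the blunting effect.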
The above problem regarding the pitch connection may be mitigated by the method of outputting multiple possible values represented by a statistical distribution, as in type (2). However, this method tends to smooth the generated pitch contour excessively and thus make it blunt, resulting in unnatural-sounding speech. The blunt pitch pattern may be corrected by artificially widening the variance of the generated pitches, as proposed in “Speech parameter generation algorithm considering global variance for HMM-Based speech synthesis” by Toda, T. and Tokuda, K., 2005, Proc. Interspeech 2005, Lisbon, Portugal, pp. 2801-2804. However, the problem still remains, because widening small local differences in the pitch contour can make the global pitch contour unstable. An additional problem of the standard HMM-based method is that, in order to model the spectral and pitch information together, the basic linguistic units are defined at a segmental level, i.e., frame by frame. However, pitch is basically a supra-segmental signal. In the standard HMM-based method, supra-segmental information is introduced through model clustering and selection. This lack of explicit modeling at the supra-segmental level makes it difficult to control certain speech characteristics such as emphasis and excitation. Moreover, in such a framework it is not clear how to create and integrate models for other linguistic levels, such as the syllable or the breath group, which have a different dimension for each unit and, consequently, a different range of effect over the surrounding pitch segments.
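The instability caused by variance widening can be illustrated with a minimal sketch. Note that this is not the algorithm of the cited Toda and Tokuda paper; it substitutes a simple scaling of each pitch value about the contour mean, and the pitch values and the scaling factor are hypothetical. It only shows that widening the variance also magnifies small local differences.

```python
from statistics import mean, pvariance


def widen_variance(contour, factor):
    """Scale each value's deviation from the contour mean by `factor`.

    Illustrative stand-in only: the variance grows by factor**2 and
    the mean is unchanged.
    """
    center = mean(contour)
    return [center + factor * (value - center) for value in contour]


def largest_step(contour):
    """Largest absolute difference between neighboring pitch values."""
    return max(abs(b - a) for a, b in zip(contour, contour[1:]))


# Hypothetical over-smoothed ("blunt") pitch contour in Hz.
blunt = [148.0, 150.0, 149.0, 152.0, 151.0]
widened = widen_variance(blunt, 2.0)

print("variance:", pvariance(blunt), "->", pvariance(widened))
print("largest local step:", largest_step(blunt), "->", largest_step(widened))
```

Because every deviation is scaled by the same factor, the small fluctuations between neighboring values grow along with the overall variance, which is one way to picture the loss of stability described above.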