The present invention relates to a speech information processing apparatus and a method to generate a natural pitch pattern used for text-to-speech synthesis.
Text-to-synthesis represents the artificial generation of a speech signal from an arbitrary sentence. An ordinary text-to-speech system consists of a language processing section, a control parameter generation section, and a speech signal generation section. The language processing section executes morpheme analysis and syntax analysis for an input text. The control parameter generation section processes accent and intonation, and outputs phoneme signs, pitch pattern, and the duration of phoneme. The speech signal generation section synthesizes the speech signal.
In the text-to-speech system, an element related to the naturalness of synthesized speech is the prosody processing of the control parameter generation section. In particular, pitch pattern influences the naturalness of synthesized speech. In known text-to-speech systems, pitch pattern is generated by a simple model. Accordingly, the synthesized speech is generated as mechanical speech whose intonation is unnatural.
Recently, a method to generate the pitch pattern by using a pitch pattern extracted from natural speech has been considered. For example, in Japanese Patent Disclosure (Kokai) xe2x80x9cPH6-236197xe2x80x9d, unit patterns extracted from the pitch pattern of natural speech or vector-quantized unit patterns are previously memorized. The unit pattern is retrieved from a memory by input attribute or input language information. By locating and transforming the retrieved unit pattern on a time axis, the pitch pattern is generated.
In the above-mentioned text-to-speech synthesis, it is impossible to store the unit patterns suitable for all input attributes or all input language informations. Therefore, transformation of the unit pattern is necessary. For example, elasticity of the unit pattern in proportion to the duration is necessary. However, even if the unit pattern is extracted from the pitch pattern of the natural speech, the naturalness of the synthesized speech falls because of this transformation processing.
It is one object of the present invention to provide a speech information processing apparatus and a method to improve the naturalness of synthesized speech in text-to-speech synthesis.
The above and other objects are achieved according to the present invention by providing a novel apparatus, method and computer program product for generating clustered patterns for text-to-speech synthesis. In the apparatus, a representative pattern memory stores a plurality of initial representative patterns as a noise pattern. Different attribute is previously affixed to each initial representative pattern. A pitch pattern memory stores a large number of natural pitch patterns as an accent phrase. A clustering unit classifies each natural pitch pattern to the initial representative pattern based on the attribute of the accent phrase. A transformation parameter generation unit evaluates an error between a transformed representative pattern and each natural pitch pattern classified to the initial representative pattern, and generates a transformation parameter for each natural pitch pattern based on the evaluation result. A representative pattern generation unit calculates an evaluation function of the sum of the error between the transformed representative pattern an each natural pitch pattern classified to the initial representative pattern, and updates each initial representative pattern based on a result of the evaluation function. The representative pattern memory stores each updated representative pattern as a clustered pattern of the attribute affixed to the corresponding initial representative pattern.