(1) Field of the Invention
The present invention relates to a speech analyzer and a speech analysis method which extract a vocal tract feature and a sound source feature by analyzing an input speech.
(2) Description of the Related Art
In recent years, the development of speech synthesis techniques has enabled generation of very high-quality synthesized speech.
However, the use of such synthesized speech has conventionally centered on uniform purposes, such as reading news texts in an announcer style.
Meanwhile, speech having distinctive features (synthesized speech highly representative of personal speech, or synthesized speech having a distinct prosody and voice quality, such as the speech style of a high-school girl or speech with the distinct intonation of the Kansai region in Japan) has started to be distributed as a kind of content. For example, there is a service for mobile phones which uses a speech message recorded by a celebrity as a ringtone. Thus, as a way of adding further amusement to interpersonal communication, the demand for creating distinctive speech to be heard by the other party is expected to grow.
Speech synthesis methods are broadly classified into two types. The first is the waveform concatenation speech synthesis method, in which appropriate speech elements are selected from a previously provided speech element database (DB) and concatenated. The second is the analysis-synthesis speech synthesis method, in which speech is analyzed and synthesized speech is generated based on the analyzed parameters.
In terms of converting the voice quality of the above-mentioned synthesized speech in many different ways, the waveform concatenation speech synthesis method requires preparing as many speech element DBs as there are required voice quality types, and concatenating the speech elements while switching between the speech element DBs. Thus, generating synthesized speech having various voice qualities entails enormous cost.
On the other hand, in the analysis-synthesis speech synthesis method, the analyzed speech parameters are transformed, which allows the voice quality of the synthesized speech to be converted. Generally, a model known as a sound source vocal tract model (a source-filter model) is used for the parameter analysis.
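As an illustration of the kind of analysis such a sound source vocal tract model enables, the following sketch separates one speech frame into vocal tract filter coefficients and a sound source (residual) signal using linear predictive coding (LPC). The use of LPC, the predictor order, and all parameter values are assumptions chosen for illustration, not details taken from this description.

```python
import numpy as np

def lpc_analyze(frame, order=10):
    """Estimate vocal tract filter coefficients (LPC) and the
    sound source (prediction residual) for one speech frame.

    Illustrative sketch: autocorrelation method with a small
    diagonal regularization for numerical stability.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation values r[0..order].
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    # Toeplitz normal equations R a = r[1..order].
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    R += 1e-6 * r[0] * np.eye(order)  # regularization (assumption)
    a = np.linalg.solve(R, r[1:order + 1])
    # Residual (source estimate): frame minus its LPC prediction.
    pred = np.zeros(n)
    for t in range(order, n):
        # Predict x[t] from x[t-1] .. x[t-order].
        pred[t] = a @ frame[t - order:t][::-1]
    residual = frame - pred
    return a, residual
```

Transforming the estimated filter coefficients (the vocal tract feature) while reusing the residual (the sound source feature) is one way such a parametric representation permits voice quality conversion.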
However, it must be assumed that various noises are mixed into input speech in a real environment. Accordingly, it is necessary to take measures against the mixed noise. For example, as a method for suppressing noise, there is a technique disclosed in Patent Literature 1: Japanese Unexamined Patent Application Publication No. 2002-169599 (pages 3 to 4, FIG. 2).
FIG. 11 shows the structure of the noise suppressing method disclosed in Patent Literature 1.
The noise suppressing method according to Patent Literature 1 sets, for a band within a frame determined as a speech frame that is assumed not to include a speech component (or to include only a small speech component), a gain smaller than the gain set for each band in a noise frame, and thereby aims to achieve high audibility by relatively enhancing the speech-bearing bands in the speech frame.
More specifically, in this noise suppressing method, the input signal is divided into frames of a predetermined time period, each frame is divided into predetermined frequency bands, and noise is suppressed for each of the bands. The method includes: determining whether each frame is a noise frame or a speech frame; setting a band gain value for each band in each frame based on the result of the determination; and generating a noise-suppressed output signal by reconstructing each frame, after the noise suppression for each band, using the band gain values. In setting the band gain values, a band gain value is set such that the value applied when the frame subject to the determination is determined as a speech frame is smaller than the value applied when that frame is determined as a noise frame.
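The frame-wise, band-wise procedure described above can be sketched as follows. The frame length, band layout, energy-based speech/noise decision, and all gain and threshold values are illustrative assumptions, not details taken from Patent Literature 1.

```python
import numpy as np

def suppress_noise(signal, frame_len=256, n_bands=8,
                   low_gain=0.3, noise_gain=0.7,
                   energy_thresh=0.01, band_thresh=1.0):
    """Sketch of frame/band-wise noise suppression.

    Each frame is classified as a speech frame or a noise frame by
    a simple energy threshold (an assumption; the actual criterion
    is not specified here). In a speech frame, bands assumed not to
    carry speech receive a gain smaller than the noise-frame gain,
    while speech-bearing bands are left unattenuated.
    """
    out = np.zeros(len(signal), dtype=float)
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = np.asarray(signal[start:start + frame_len], dtype=float)
        spectrum = np.fft.rfft(frame)
        # Speech-frame determination by mean energy (illustrative).
        is_speech = np.mean(frame ** 2) > energy_thresh
        band_edges = np.linspace(0, len(spectrum), n_bands + 1, dtype=int)
        for b in range(n_bands):
            lo, hi = band_edges[b], band_edges[b + 1]
            band_energy = np.mean(np.abs(spectrum[lo:hi]) ** 2)
            if is_speech:
                # Low-energy bands in a speech frame get a gain
                # smaller than the noise-frame gain.
                g = low_gain if band_energy < band_thresh else 1.0
            else:
                g = noise_gain
            spectrum[lo:hi] *= g
        # Reconstruct the frame from its noise-suppressed bands.
        out[start:start + frame_len] = np.fft.irfft(spectrum, n=frame_len)
    return out
```

Because the per-band gain in a speech frame's non-speech bands is smaller than the gain applied in noise frames, the speech-bearing bands stand out relative to the residual noise, which is the audibility effect the method aims for.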