1. Field of the Invention
The present invention relates to a technology that is effectively applied to an apparatus, a method and a program for changing the reproducing speed of a voice without changing its pitch.
2. Description of the Related Art
There have hitherto been proposed technologies for making the content of a conversation easier to hear by slowing down the speed of the conversation (hereinafter called the “voice speed”) without changing the pitch of the voice of the conversing partner. If the voice speed is simply slowed down, however, a delay corresponding to the slowdown occurs. To solve this problem, technologies for eliminating the delay have been proposed, which diminish a non-utterance section (a section in which no sound such as a human voice is uttered) existing in the middle (an intermission or pause) of the conversation and which increase the voice speed in the non-utterance section (refer to Patent documents 1 and 2).
FIG. 25 is a diagram showing an example of function blocks of a conventional voice speed control apparatus P1. In the conventional voice speed control apparatus P1, for a section judged to be non-utterance (i.e., the non-utterance section) by an utterance/non-utterance judging unit P2, a continuation time calculating unit P3 calculates the length of the continuation time of this non-utterance section. Then, a voice speed determining unit P4 determines whether or not the voice speed should be increased according to the continuation time of the non-utterance section, and a voice speed control unit P5 controls the voice speed in the non-utterance section.
FIG. 26 is a graphic chart for explaining a conventional mechanism of how the voice speed is controlled. In FIG. 26, “t1” represents a non-utterance continuation time threshold value. A section ranging from the start of the non-utterance section up to “t1” is called a protection section. In the protection section, as shown in FIG. 26, the voice speed is in most cases not increased and is set to, e.g., a 1-fold speed. If the continuation time (the non-utterance continuation time) of the non-utterance section, which is acquired by the continuation time calculating unit P3, exceeds “t1”, the voice speed determining unit P4 determines that the voice speed is to be doubled. Then, the voice speed control unit P5 controls the voice speed according to this value (the 2-fold value). The specific value of 2-fold is merely an example, and other values (3-fold, 5-fold, etc.) may also be used. The delay is eliminated by such a process.
Patent document 1: Japanese Patent Application Laid-Open Publication No. 2003-216200
Patent document 2: Japanese Patent Application Laid-Open Publication No. 08-292796
Patent document 3: Japanese Patent Application Laid-Open Publication No. 200-244972
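As a minimal sketch of the conventional control explained with reference to FIG. 26, the following Python fragment returns the speed factor for a given non-utterance continuation time. The function name `conventional_voice_speed`, its parameter names, and the example threshold of 0.2 seconds are assumptions made purely for illustration; they are not taken from Patent documents 1 and 2.

```python
def conventional_voice_speed(continuation_time: float, t1: float,
                             fast_factor: float = 2.0) -> float:
    """Speed factor applied at a point inside a non-utterance section.

    continuation_time: time elapsed since the start of the non-utterance
                       section, in seconds.
    t1:                non-utterance continuation time threshold, in seconds.
    fast_factor:       factor used once t1 is exceeded (2-fold in the
                       example of FIG. 26; other values are possible).
    """
    if continuation_time <= t1:
        # Protection section: the speed is kept at 1-fold.
        return 1.0
    # The continuation time exceeds t1: the speed is increased.
    return fast_factor


if __name__ == "__main__":
    T1 = 0.2  # hypothetical threshold value, chosen only for this example
    for t in (0.0, 0.1, 0.2, 0.3, 1.0):
        print(f"{t:.1f} s -> {conventional_voice_speed(t, T1):.0f}-fold")
```

In this sketch the speed changes stepwise at t1, which corresponds to the case in which the protection section is held at the 1-fold speed.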
When executing the process of diminishing the non-utterance section and the process of increasing the voice speed in the non-utterance section, however, it is necessary to take the accuracy of the utterance/non-utterance judgment into account. For instance, misjudgment may occur in the utterance/non-utterance judgment under a noisy environment. FIG. 27 is a graph showing an example of an inputted voice under an environment with no noise. FIG. 28 is a graph showing an example of the inputted voice under an environment with noise. In each of FIGS. 27 and 28, the upper graph shows a power value, while the lower graph shows an example of a result of the utterance/non-utterance judgment. Under the environment with no noise, the utterance/non-utterance judgment is made precisely, including at the utterance starting points and the utterance endpoints. Under the noisy environment, however, the noise level may take a value close to or exceeding the power value at the utterance starting points and the utterance endpoints, in which case the utterance starting points and the utterance endpoints are buried in the noise. Hence, under the noisy environment, it is difficult to achieve a precise utterance/non-utterance judgment. For example, under the noisy environment, there is a high possibility that voice elements exhibiting small voice power, such as those at the utterance starting points and the utterance endpoints, are misjudged to be non-utterance in spite of being uttered (e.g., the voice elements depicted by dotted lines in the lower graph in FIG. 28). In addition to the utterance starting points and the utterance endpoints, the voice elements with small voice power are exemplified by unvoiced consonants.
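The effect of noise on the judgment can be illustrated with a simple sketch. It assumes, purely for illustration, that the utterance/non-utterance judgment compares a per-frame power value with the noise level plus a margin; the actual judging method of the unit P2 is not specified here, and the function name, the margin and all power values below are hypothetical.

```python
def judge_utterance(frame_powers, noise_level, margin=3.0):
    """Judge each frame as utterance (True) or non-utterance (False).

    Assumed rule: a frame counts as utterance only if its power exceeds
    the noise level by more than `margin` (all values in dB).
    """
    return [power > noise_level + margin for power in frame_powers]


# Hypothetical voice power per frame (dB): a weak utterance starting point,
# a strong middle part, and a weak utterance endpoint.
powers_db = [12.0, 25.0, 40.0, 42.0, 38.0, 22.0, 10.0]

quiet = judge_utterance(powers_db, noise_level=5.0)   # little noise
noisy = judge_utterance(powers_db, noise_level=25.0)  # noisy environment

print(quiet)  # every frame is judged to be utterance
print(noisy)  # only the three strong middle frames are judged to be utterance
```

With the higher noise level, the two weak frames at each end fall below the assumed threshold and are judged to be non-utterance, which corresponds to the voice elements drawn by dotted lines in FIG. 28.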
If the process of diminishing the non-utterance section and the process of increasing the voice speed are executed on the basis of the misjudgment described above, problems arise in that voice elements are lost and the non-utterance continuation length is excessively reduced. FIG. 29 is an explanatory graph showing the problems caused when the process of diminishing the non-utterance section and the process of increasing the voice speed in the non-utterance section are executed on the basis of the misjudgment. In FIG. 29A, the utterance starting points and the utterance endpoints are accurately judged because there is no noise. Hence, the process of diminishing the non-utterance section existing between the utterance starting point and the utterance endpoint and the process of increasing the voice speed are properly carried out. In FIG. 29B, on the other hand, the utterance starting point(s) and the utterance endpoint(s) are misjudged due to the noise. Therefore, in the case of FIG. 29B, the process of diminishing the non-utterance section is executed without taking account of the utterance endpoint judged to be the non-utterance section (the waveform of the utterance endpoint depicted by the dotted line) and the utterance starting points judged to be the non-utterance section (the two waveforms of the utterance starting points drawn by the dotted lines, illustrated in superposition on the waveform of the utterance endpoint). As a result, the non-utterance section between the utterance starting point and the utterance endpoint depicted by the dotted lines becomes excessively short, and in the exemplified case a voice element is lost because either one or both of the utterance starting point and the utterance endpoint are cut off.
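The loss of voice elements can be sketched as follows. The fragment assumes, for illustration only, that the process of diminishing the non-utterance section keeps at most `keep` consecutive frames judged to be non-utterance and discards the rest; this rule, the function name and the frame labels are hypothetical and are not the method of any cited document.

```python
def shorten_pauses(frames, is_utterance, keep):
    """Keep every utterance frame, but at most `keep` consecutive
    frames that were judged to be non-utterance."""
    output = []
    run = 0  # length of the current run of non-utterance frames
    for frame, voiced in zip(frames, is_utterance):
        if voiced:
            run = 0
            output.append(frame)
        else:
            run += 1
            if run <= keep:
                output.append(frame)
    return output


# "a3" is a weak utterance endpoint, "b1" a weak utterance starting point,
# and "-" is a silent frame.
frames = ["A1", "A2", "a3", "-", "-", "-", "-", "b1", "B2", "B3"]

correct   = [True, True, True, False, False, False, False, True, True, True]
misjudged = [True, True, False, False, False, False, False, False, True, True]

print(shorten_pauses(frames, correct, keep=2))
# ['A1', 'A2', 'a3', '-', '-', 'b1', 'B2', 'B3']
print(shorten_pauses(frames, misjudged, keep=2))
# ['A1', 'A2', 'a3', '-', 'B2', 'B3']
```

With the misjudged labels, the weak starting point "b1" is discarded and only one silent frame remains between the two utterances, so both problems described above appear in the same example.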
Further, when the voice speed is increased in the non-utterance section, the problem that the utterance starting points and the utterance endpoints are lost is prevented, in contrast to the case of diminishing the non-utterance section. However, the problem that the utterance starting points and the utterance endpoints become hard to hear still remains unsolved.
This problem, especially with respect to the utterance endpoints, can be alleviated to some extent by providing a protection section. FIG. 30 is a graph showing an example of how the voice speed is controlled in the case of providing the protection section. If, however, the misjudged portion of an utterance endpoint extends beyond the protection section, the problem that the utterance endpoint becomes hard to hear is not eliminated. In this case, setting the protection section comparatively long is conceivable. In the protection section, however, the voice speed is basically not increased, and hence excessive elongation of the protection section hinders the elimination of the delay and is therefore undesirable.
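The trade-off associated with the length of the protection section can be made concrete with a short calculation. The sketch assumes that the protection section is reproduced at the 1-fold speed and the remainder of the non-utterance section at a constant faster speed; the function name and all numerical values are hypothetical.

```python
def recovered_delay(pause_length: float, protection_length: float,
                    fast_factor: float = 2.0) -> float:
    """Time (seconds) recovered within one non-utterance section.

    Only the part of the section beyond the protection section is
    accelerated; reproducing it at `fast_factor` saves the fraction
    (1 - 1/fast_factor) of its original duration.
    """
    accelerated = max(pause_length - protection_length, 0.0)
    return accelerated * (1.0 - 1.0 / fast_factor)


if __name__ == "__main__":
    # A 1.0 s non-utterance section with three hypothetical protection lengths.
    for protection in (0.2, 0.6, 1.0):
        saved = recovered_delay(1.0, protection)
        print(f"protection {protection:.1f} s -> recovered {saved:.2f} s")
```

Under these assumptions, lengthening the protection section from 0.2 seconds to 0.6 seconds halves the recovered time, and a protection section as long as the non-utterance section itself recovers nothing.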