Conventionally, a moving image playback apparatus that also plays back audio data, such as a video tape recorder, provides a multiple-speed playback function, a fast-forward function, and the like so that the user can preview the entire moving image (i.e., the full contents to be played back) within a short period of time.
For a video tape recorder as a typical moving image playback apparatus, the following technique has been proposed in recent years. During multiple-speed playback of a recording medium, first voice periods, in which voice energy is equal to or higher than a predetermined threshold value, and second voice periods, in which voice energy is lower than the predetermined threshold value, are detected. Audio signal components in the first voice periods successively undergo pitch conversion and are played back, while the second voice periods are shortened. In this way, the contents of the recording medium can be played back audibly at double speed, and the user can still understand the playback voice, which sounds only slightly rapid.
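The threshold-based division into first and second voice periods described above can be illustrated with a minimal sketch. The function names, the frame length, the threshold value, and the synthetic test signal are all assumptions introduced for illustration; the cited technique does not specify them, and mean squared amplitude per frame is used here only as one plausible definition of "voice energy".

```python
import math

def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame (assumed energy measure)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies

def split_voice_periods(samples, frame_len, threshold):
    """Group consecutive frames into periods.

    Returns (first_periods, second_periods), each a list of half-open
    (start_frame, end_frame) ranges. First periods have energy >= threshold,
    second periods have energy < threshold.
    """
    energies = frame_energies(samples, frame_len)
    first, second = [], []
    run_start = 0
    for i in range(1, len(energies) + 1):
        run_is_loud = energies[run_start] >= threshold
        # Close the current run at the end of the data or when the class changes.
        if i == len(energies) or (energies[i] >= threshold) != run_is_loud:
            (first if run_is_loud else second).append((run_start, i))
            run_start = i
    return first, second

# Hypothetical test signal: 4 loud frames, 2 quiet frames, 3 loud frames.
FRAME_LEN = 100
def tone(amplitude, n_samples):
    return [amplitude * math.sin(2 * math.pi * 5 * i / FRAME_LEN) for i in range(n_samples)]

signal = tone(0.5, 400) + tone(0.01, 200) + tone(0.5, 300)
first_periods, second_periods = split_voice_periods(signal, FRAME_LEN, threshold=0.01)
print(first_periods, second_periods)
```

Note that such a sketch classifies any sufficiently loud sound as belonging to a first voice period, whether or not it was uttered by a person.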
However, when the audio signal locally undergoes a pitch conversion process, synchronization between voice and video data cannot always be maintained upon moving image playback (moving image quick-preview playback). Hence, since the video image of a person who is speaking in the playback video cannot be synchronized with his or her playback voice, the playback result appears unnatural to human perception, and the user may find it unsatisfactory.
For example, Japanese Patent Laid-Open Nos. 10-32776 and 9-214879, among others, have proposed techniques which detect silent states based on voice energy, and recognize voice other than the detected silent states as voice periods uttered by persons so as to summarize a moving image. In a moving image such as a news program, throughout which voices uttered by persons are dominant, voice periods uttered by persons can be detected to some extent on the basis of voice energy. However, this method is infeasible in an environment where background noise or background music is present.
Furthermore, many prior-art techniques that detect voice and play back a moving image in consideration of the detected voice were proposed even before the aforementioned patent publications. Most of these techniques detect voice by applying a threshold value process to voice energy. Behind these techniques lies a problem caused by the ambiguity of the Japanese language: “human voice” such as speech is called “音声 (/onsei/)” in Japanese, and general sounds including human voice are also called “音声 (/onsei/)”. Therefore, it is inappropriate to generically describe the threshold value processes of sound energy in such prior art as true “voice detection”.
On the other hand, Japanese Patent Laid-Open No. 9-247617 has proposed a technique that obtains “feature points of voice information or the like” by computing the FFT (Fast Fourier Transform) spectrum of an audio signal and analyzing its tone volume. However, with the method using the FFT spectrum, when an audio signal to be played back contains so-called background music or the like, which forms a spectrum distribution over a broad range, it becomes difficult to detect voice uttered by a person from such a signal.
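The difficulty caused by a broad spectrum distribution can be illustrated with a small sketch. This is not the feature-point analysis of the cited publication, whose details are not given here. The `peak_fraction` measure, the 64-sample window, and the synthetic "voice-like" and "music-like" signals are hypothetical, and a direct DFT is substituted for an FFT so that only the standard library is needed (both yield the same spectrum).

```python
import cmath
import math

def power_spectrum(samples):
    """Power at each non-negative frequency bin, computed with a direct DFT."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def peak_fraction(samples):
    """Share of total spectral power held by the strongest bin (hypothetical measure)."""
    power = power_spectrum(samples)
    total = sum(power)
    return max(power) / total if total > 0 else 0.0

N = 64  # assumed analysis window length

def sine(bin_index, amplitude):
    return [amplitude * math.sin(2 * math.pi * bin_index * i / N) for i in range(N)]

# A single tone stands in for a narrow-band, voice-like component.
voice_like = sine(4, 1.0)

# Several weaker tones spread over many bins stand in for background music.
music_like = [sum(vals) for vals in zip(*(sine(b, 0.5) for b in (2, 7, 11, 15, 20, 25)))]

mixed = [v + m for v, m in zip(voice_like, music_like)]

clean_fraction = peak_fraction(voice_like)
mixed_fraction = peak_fraction(mixed)
print(round(clean_fraction, 3), round(mixed_fraction, 3))
```

In this toy case the strongest component holds nearly all of the spectral power when it is alone, but well under half once the broadband background is mixed in, so a decision based only on the spectrum becomes less reliable.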
In this way, conventional moving image playback that involves voice suffers from the problem that detection of voice periods is too technical and inaccurate, as described above. Furthermore, when a moving image summary is generated, or a moving image undergoes multiple-speed playback, using the detection result, synchronization between video and audio data cannot be maintained upon playback.
In recent years, media have become available in which information on utterance contents is multiplexed on moving image data and an audio signal, or is inserted in another region or band, by means of a caption, closed caption, or the like. When such media are played back and a moving image summary is generated, or multiple-speed playback is performed, using the detection result of voice periods, synchronization between video and audio data cannot be maintained upon playback.
In general, it is not easy for some users, such as elderly persons and children, to make full use of various apparatuses. In addition, such users cannot readily understand voice that is uttered rapidly. Hence, upon executing quick preview (clipped playback) of contents, such as multiple-speed playback, in the aforementioned moving image playback apparatus such as a video tape recorder, the optimal playback conditions for such users differ from those for normal users.
Furthermore, upon executing quick preview (clipped playback) of contents, such as multiple-speed playback, in the aforementioned moving image playback apparatus, the optimal playback conditions for users with poor dynamic visual acuity, users who have difficulty hearing rapid utterance, users who are not native speakers of the language of the voice to be played back, and the like differ from those for normal users.