Cameras such as life-log cameras and action cameras have come into wide use in recent years in sports and similar fields. Because such cameras are mostly used for long periods of continuous shooting and tend to produce a monotonous composition, the captured moving images are hard to fully appreciate without image processing. For this reason, technology for generating a fast reproduction moving image, in which a captured moving image is reproduced at high speed and thereby condensed into a shorter one, has gained attention. Such fast reproduction moving images include, for example, so-called time-lapse moving images and hyper-lapse moving images, which are first-person time-lapse moving images (captured from the photographer's own point of view).
With respect to video, for example, technologies have been developed that suppress the significant camera shake that appears in fast reproduction. With respect to sound, for example, technologies have been developed that perform fast reproduction while suppressing distortion of pitch and tone by means of speed control in which waveforms are stretched, compressed, or thinned out. At the reproduction speeds applied to fast reproduction moving images (e.g., quadruple speed and higher), however, such speed control can severely distort pitch and tone, or fragment a person's speech so that the result is an unnatural sound that is hard to understand. For this reason, there is demand for a technology for reproducing natural sound in fast reproduction moving images.
For example, Patent Literature 1 described below discloses a technology that divides the input sound of a person into utterance sections and non-utterance sections and reproduces the non-utterance sections at a higher speed than the utterance sections.
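The general idea of reproducing non-utterance sections faster than utterance sections can be illustrated with a minimal sketch. This is not the method of Patent Literature 1: the RMS energy threshold used to detect utterances, the simple thinning of samples used to speed up each section, and all names and parameter values (`split_sections`, `fast_reproduce`, `frame_len`, `threshold`, `utter_speed`, `silence_speed`) are assumptions made purely for illustration. Thinning samples in this naive way also shifts pitch, so the sketch only shows how different speeds might be assigned per section.

```python
import math


def split_sections(samples, frame_len, threshold):
    """Split samples into alternating utterance / non-utterance sections.

    Each frame is labeled by its RMS energy (an assumed, simplistic detector);
    consecutive frames with the same label are merged into one section.
    Returns a list of (is_utterance, section_samples) tuples.
    """
    sections = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        is_utterance = rms >= threshold
        if sections and sections[-1][0] == is_utterance:
            sections[-1][1].extend(frame)  # same label: grow the current section
        else:
            sections.append((is_utterance, list(frame)))
    return sections


def fast_reproduce(samples, frame_len=4, threshold=0.1,
                   utter_speed=2, silence_speed=8):
    """Thin out each section, keeping every n-th sample.

    Non-utterance sections use the larger step (silence_speed), so they are
    shortened more than utterance sections (utter_speed).
    """
    out = []
    for is_utterance, section in split_sections(samples, frame_len, threshold):
        step = utter_speed if is_utterance else silence_speed
        out.extend(section[::step])
    return out
```

For instance, given 16 silent samples, 16 samples of a loud signal, and 16 more silent samples, `fast_reproduce` with the default values above keeps 2 samples from each silent section and 8 from the loud one, so the silent parts shrink to one eighth of their length while the loud part shrinks only to one half.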