In recent years, with the widespread use of mobile phones having a moving image shooting function, digital video cameras, and the like, individual users have become able to hold an enormous number of multimedia contents (moving images with audio, hereinafter referred to simply as “moving images”). This leads to a demand for a means for efficiently searching for a moving image.
Conventional methods of searching for a moving image include a method of applying a title to each moving image in advance and searching by title, a method of classifying each moving image into one of a plurality of categories in advance and searching by the category to which the moving image belongs, and so on.
There has also been used a method of creating a thumbnail of each moving image in advance and displaying the created thumbnails side by side. This allows users to easily search for a moving image visually.
However, these methods require users to perform an operation of applying an appropriate title to each moving image, classifying each moving image into a category, creating a thumbnail of each moving image, or the like. This is troublesome.
Meanwhile, as an art relating to classification of moving images, there has been disclosed an art of extracting a highlight part from a moving image of sports by focusing on audio (see Patent Literature 1). According to the art of Patent Literature 1, a feature is extracted from each of short time sections (of approximately 30 ms) of audio contained in a moving image, and a period in which audio having a specific feature (such as applause or cheering) lasts for a predetermined period or longer is classified as a highlight part.
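The duration-based classification described above can be illustrated with a minimal sketch. It is not the implementation of Patent Literature 1. It assumes that each short time section (taken here as 30 ms) has already been given a label by some feature extractor that is not shown. The function name `find_highlights`, the label strings, and the 3000 ms threshold are hypothetical values chosen only for illustration.

```python
from typing import List, Tuple

FRAME_MS = 30  # assumed length of one short time section, per the "approximately 30 ms" above


def find_highlights(frame_labels: List[str],
                    target: str = "applause",
                    min_duration_ms: int = 3000) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) spans in which `target` lasts at least min_duration_ms.

    frame_labels holds one label per short time section; how the labels are
    obtained (feature extraction) is outside the scope of this sketch.
    """
    min_frames = -(-min_duration_ms // FRAME_MS)  # ceiling division
    spans = []
    run_start = None
    # A trailing None acts as a sentinel so that a run reaching the end is closed.
    for i, label in enumerate(list(frame_labels) + [None]):
        if label == target:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_frames:
                spans.append((run_start * FRAME_MS, i * FRAME_MS))
            run_start = None
    return spans


# Hypothetical labels: a 3 s run of applause and a shorter 0.6 s run.
labels = ["speech"] * 10 + ["applause"] * 100 + ["speech"] * 5 + ["applause"] * 20
print(find_highlights(labels))  # [(300, 3300)]
```

In this sketch only the first run is reported, because the second run of applause is shorter than the assumed predetermined period.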
Furthermore, as an art relating to audio classification, there has been disclosed an art of classifying the voices of unknown speakers in discussions (see Non-Patent Literature 1). According to the art of Non-Patent Literature 1, a piece of feature data is prepared in advance for each of many speakers, and clustering is performed based on the similarity between the audio and each piece of the feature data, thereby identifying the sections in which each speaker's voice is output.
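The similarity-based classification described above can be illustrated with a simplified sketch. It does not reproduce the method of Non-Patent Literature 1. It assumes that each audio section and each piece of prepared feature data is represented as a numeric vector, and it uses cosine similarity with a nearest-reference assignment in place of the clustering actually used. The names `cosine` and `assign_speakers` and the two-dimensional vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_speakers(sections: List[Sequence[float]],
                    references: Dict[str, Sequence[float]]) -> List[str]:
    """Label each audio section with the reference speaker it is most similar to."""
    return [
        max(references, key=lambda name: cosine(section, references[name]))
        for section in sections
    ]


# Hypothetical prepared feature data for two speakers, and three audio sections.
references = {"speaker_A": [1.0, 0.0], "speaker_B": [0.0, 1.0]}
sections = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.05]]
print(assign_speakers(sections, references))  # ['speaker_A', 'speaker_B', 'speaker_A']
```

Consecutive sections that receive the same label can then be merged to obtain the sections in which each speaker's voice is output.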
With the above two arts, it is possible to determine what type of sound each part (of approximately several milliseconds to several seconds) of the audio contained in a moving image represents. For example, with the art of Patent Literature 1, it is possible to classify a part in which loud applause lasts for a predetermined period or longer as a highlight part in which an event comes to a climax. Also, with the art of Non-Patent Literature 1, it is possible to determine which speaker is speaking in which part of a discussion.