The present invention relates to a video retrieval apparatus and method capable of retrieving a desired scene (video and/or voice) using a key word.
Computer networks, typified by multi-channel broadcasting and the internet, have recently spread rapidly and now distribute a huge amount of video to society at large, including homes. Meanwhile, increased recording medium capacity enables a large amount of video signals to be stored in the home. These developments call for techniques for retrieving a video scene that a user desires from a large number of video signals easily and with high accuracy.
Conventionally considered retrieval systems include a method that detects a changing point of video signals from a variation in the video signals and displays a video scene according to that point, and a method that uses an image recognition technique to detect and display a particular scene composed of particular objects. However, these retrieval systems have a problem in that the user's retrieval purpose is not always accurately reflected in the retrieved scene.
Further, there is a retrieval system that reads, by character recognition, the subtitle information and closed caption information adopted in American broadcasting from videos in order to retrieve a particular scene. For scenes that make good use of subtitle information and closed captions, this system enables a user to acquire a scene that accurately reflects the user's retrieval purpose. However, because such information needs to be inserted manually, it is available for only part of broadcast programs, and it is therefore difficult to apply this approach widely to general videos.
On the other hand, it is expected that using voice information accompanying videos as a key word will achieve a retrieval system that accurately reflects the retrieval purpose. Unexamined Japanese Patent Publication HEI6-68168 discloses a video retrieval system that retrieves a desired scene using a voice key word.
FIG. 1 illustrates a functional block diagram of the retrieval system disclosed in the above-mentioned Unexamined Japanese Patent Publication HEI6-68168. Voice/video input section 201 receives a voice signal and a video signal, voice signal storage section 202 stores the received voice signal, and video signal storage section 203 stores the received video signal. Voice analysis section 204 analyzes the voice signal to generate a sequence of characteristic parameters representative of characteristics of the voice. Voice characteristic storage section 205 stores the generated sequence of characteristic parameters.
Meanwhile, a key word that a user will use later in a scene retrieval is provided in the form of a voice to key word characteristic analysis section 206. Key word characteristic analysis section 206 analyzes the voice serving as the key word to generate a sequence of characteristic parameters representative of characteristics of the key word. Key word characteristic parameter storage section 207 stores the generated sequence of characteristic parameters.
Key word interval extraction section 208 compares the sequence of characteristic parameters of the voice signal stored in the storage section 202 with the sequence of characteristic parameters of the key word voice, and extracts a key word interval in the voice signal. Index addition section 209 generates index position data 210 that relates the extracted key word interval to a frame number of the video signal corresponding to the voice signal.
When a retrieval is performed using index position data 210, it is possible to designate, using the voice signal, the frame number of the video signal in which the key word appears. This enables video/voice output section 211 to output the corresponding video and voice, and consequently to present the user with the desired video and voice.
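The relation between an extracted key word interval and a video frame number, as described above for index addition section 209, can be illustrated with a minimal sketch. The names `IndexEntry` and `interval_to_index`, the use of seconds for the interval boundaries, and the frame rate of 30 frames per second are assumptions made only for this illustration; the cited publication does not specify them.

```python
import math
from dataclasses import dataclass


@dataclass
class IndexEntry:
    """One hypothetical record of index position data."""
    keyword: str
    start_frame: int
    end_frame: int


def interval_to_index(keyword, start_sec, end_sec, fps=30.0):
    """Relate a key word interval in the voice signal to video frame numbers.

    Assumes the voice and video signals share a time origin and that the
    video runs at a constant frame rate (fps), which is an assumption here.
    """
    return IndexEntry(
        keyword=keyword,
        start_frame=math.floor(start_sec * fps),
        end_frame=math.floor(end_sec * fps),
    )


# A key word spoken from 12.5 s to 13.5 s of the voice signal.
entry = interval_to_index("goal", 12.5, 13.5)
print(entry.start_frame, entry.end_frame)
```

A retrieval then only needs to look up the entry for the registered key word and start output from `start_frame`.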
However, this system has a problem in that a voice key word to be used in a retrieval must be registered in advance, and retrieval using other key words is not possible. In particular, an uncertain key word input by a user results in a retrieval error, so that it is not possible to retrieve a scene that accurately reflects the retrieval purpose.
The present invention has been made in view of the foregoing. It is an object of the present invention to provide an apparatus and method capable of retrieving a scene that a user desires, when retrieving a video and/or voice, using an out-of-vocabulary word other than the words and key words registered in advance in, for example, a dictionary, or using an uncertain key word that the user inputs.
The present invention provides a scene retrieval system that applies a series of voice recognition processing procedures separately to the generation of retrieval data and to retrieval processing, and that is thereby capable of retrieving a video/voice scene that a user desires at high speed and reproducing the scene at high speed.
Further, in generating retrieval data, the system generates a sequence of subword scores, which is an intermediate result of the voice recognition processing, as a retrieval index. In retrieval processing, it converts an input key word into a time series of subwords and collates that time series with the retrieval index.
Therefore, it is not necessary to collate with a word dictionary or with retrieval key words registered in advance, which solves the so-called out-of-vocabulary word problem, namely that an unregistered key word cannot be handled. Further, it is possible to retrieve the video/voice scene with the highest reliability even when a user inputs an uncertain key word.
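The collation of a key word's subword series with a retrieval index of subword scores can be sketched as follows. This is only an illustration under strong assumptions: the index is modeled as a list of segments, each a mapping from subword to score; each subword of the key word is assumed to occupy exactly one segment; and the names `collate` and `FLOOR` are hypothetical. A practical system would normally allow each subword to span a variable number of segments.

```python
FLOOR = 0.0  # assumed score for a subword that is absent from a segment


def collate(index, subwords):
    """Slide the key word's subword series over the retrieval index.

    Returns (best_start, best_score): the segment position at which the
    summed subword scores are highest, and that sum. No word dictionary
    is consulted, so any key word that can be written as subwords works.
    """
    n = len(subwords)
    best_start, best_score = -1, float("-inf")
    for start in range(len(index) - n + 1):
        score = sum(
            index[start + i].get(sw, FLOOR) for i, sw in enumerate(subwords)
        )
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Toy retrieval index: one dict of subword scores per segment.
retrieval_index = [
    {"s": 0.9, "k": 0.1},
    {"o": 0.8, "a": 0.2},
    {"k": 0.7, "t": 0.1},
    {"a": 0.9, "o": 0.1},
    {"t": 0.6, "s": 0.2},
]

best_start, best_score = collate(retrieval_index, ["k", "a", "t"])
print(best_start, round(best_score, 2))
```

Because the index keeps scores for several candidate subwords per segment rather than a single recognized word, a key word that was never registered can still be matched after the retrieval data has been generated.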
Moreover, the sequence of subword scores serving as the retrieval index is multiplexed into a data stream along with the video signal and voice signal, which makes it possible to transmit the retrieval index through broadcast networks and communication networks such as the internet.
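As a rough illustration of carrying the retrieval index in the same data stream as the video and voice signals, the following sketch interleaves three timestamped packet lists and separates them again on the receiving side. The packet layout (a stream tag, a timestamp, and a payload) and the names `multiplex` and `demultiplex` are assumptions for this example only and do not correspond to any particular broadcast or container format.

```python
import heapq


def multiplex(video, voice, index):
    """Merge three timestamp-ordered packet lists into one stream.

    Each input is a list of (timestamp, payload) pairs already sorted by
    timestamp. Output packets are (kind, timestamp, payload) tuples.
    """
    tagged = (
        [("video", ts, p) for ts, p in video],
        [("voice", ts, p) for ts, p in voice],
        [("index", ts, p) for ts, p in index],
    )
    return list(heapq.merge(*tagged, key=lambda pkt: pkt[1]))


def demultiplex(stream, kind):
    """Recover the (timestamp, payload) pairs of one elementary stream."""
    return [(ts, p) for k, ts, p in stream if k == kind]


video = [(0, b"v0"), (33, b"v1")]
voice = [(0, b"a0"), (20, b"a1")]
index = [(0, {"k": 0.7}), (10, {"a": 0.9})]

stream = multiplex(video, voice, index)
recovered_index = demultiplex(stream, "index")
print(recovered_index == index)
```

A receiver that demultiplexes the index packets can run retrieval locally without repeating the voice analysis.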
The subword is a basic unit of an acoustic model that is smaller than a single word. Examples of the subword are a phoneme, a syllable such as consonant-vowel or vowel-consonant-vowel, and a demisyllable. Each word is represented as a sequence of subwords.
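To make the idea of representing a word as a sequence of subwords concrete, the sketch below splits a romanized word into consonant-vowel units. The splitting rule, the vowel set, and the name `to_subwords` are simplifying assumptions for illustration; an actual system would derive subwords from its acoustic model and pronunciation rules.

```python
VOWELS = set("aeiou")  # assumed vowel inventory for romanized input


def to_subwords(word):
    """Split a romanized word into consonant-vowel style units.

    Letters are accumulated until a vowel closes the unit; any trailing
    consonants are emitted as a final unit of their own.
    """
    units, pending = [], ""
    for ch in word.lower():
        pending += ch
        if ch in VOWELS:
            units.append(pending)
            pending = ""
    if pending:
        units.append(pending)
    return units


subwords = to_subwords("sakura")
print(subwords)
```

The same conversion applies to any input key word, whether or not it appears in a dictionary.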