An exemplary device for extracting and matching features of moving images is described in Non-Patent Document 1. FIG. 14 is a block diagram showing the device described in Non-Patent Document 1.
A block unit feature extraction unit 1000 extracts features in block units from an input first video, and outputs a first feature to a matching unit 1030. Another block unit feature extraction unit 1010 extracts features in block units from an input second video, and outputs a second feature to the matching unit 1030. A weighting coefficient calculation unit 1020 calculates a weighting coefficient for each of the blocks based on an input learning video, and outputs the weighting coefficients to the matching unit 1030. The matching unit 1030 compares the first feature output from the block unit feature extraction unit 1000 with the second feature output from the block unit feature extraction unit 1010 using the weighting coefficients output from the weighting coefficient calculation unit 1020, and outputs a matching result.
Next, operation of the device shown in FIG. 14 will be described.
The block unit feature extraction unit 1000 divides each frame of the input first video into blocks, and calculates, from each block, a feature for identifying the video. Specifically, the block unit feature extraction unit 1000 determines the edge type of each block, and uses that type as the feature of the block. Then, for each frame, the block unit feature extraction unit 1000 forms a feature vector composed of the edge types of the respective blocks. The feature vectors calculated for the respective frames are output to the matching unit 1030 as the first feature.
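The block-unit extraction described above can be illustrated by the following minimal sketch. The description here does not give the edge classification rule used in Non-Patent Document 1, so the four directional filters, the threshold `NO_EDGE_THRESHOLD`, and all function names are assumptions introduced for illustration only. A frame is assumed to be a two-dimensional list of luminance values.

```python
import math

# Assumed directional filters, applied to the mean luminance of the four
# quadrants (a, b, c, d) of a block laid out as:  a b
#                                                 c d
EDGE_FILTERS = {
    "vertical":   (1.0, -1.0, 1.0, -1.0),
    "horizontal": (1.0, 1.0, -1.0, -1.0),
    "diag_45":    (math.sqrt(2), 0.0, 0.0, -math.sqrt(2)),
    "diag_135":   (0.0, math.sqrt(2), -math.sqrt(2), 0.0),
}
NO_EDGE_THRESHOLD = 10.0  # assumed value; weaker responses count as "none"


def region_mean(frame, top, left, height, width):
    """Mean luminance of a rectangular region of the frame."""
    total = sum(frame[y][x]
                for y in range(top, top + height)
                for x in range(left, left + width))
    return total / (height * width)


def block_edge_type(frame, top, left, size):
    """Return the edge type of one square block (the block's feature)."""
    half = size // 2
    quadrants = (
        region_mean(frame, top, left, half, half),
        region_mean(frame, top, left + half, half, half),
        region_mean(frame, top + half, left, half, half),
        region_mean(frame, top + half, left + half, half, half),
    )
    best_type, best_strength = "none", NO_EDGE_THRESHOLD
    for name, coeffs in EDGE_FILTERS.items():
        strength = abs(sum(q * c for q, c in zip(quadrants, coeffs)))
        if strength > best_strength:
            best_type, best_strength = name, strength
    return best_type


def frame_feature_vector(frame, block_size):
    """Feature vector of a frame: edge types of its blocks in raster order."""
    rows = len(frame) // block_size
    cols = len(frame[0]) // block_size
    return [block_edge_type(frame, r * block_size, c * block_size, block_size)
            for r in range(rows) for c in range(cols)]
```

For example, calling `frame_feature_vector(frame, 4)` on a frame of 4 rows and 8 columns yields a two-element vector, one edge type per 4x4 block.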
Operation of the block unit feature extraction unit 1010 is similar to that of the block unit feature extraction unit 1000. The block unit feature extraction unit 1010 calculates the second feature from the input second video, and outputs the acquired second feature to the matching unit 1030.
On the other hand, the weighting coefficient calculation unit 1020 calculates beforehand, using a learning video, the probability that a caption is inserted in each block of a frame. Then, based on the calculated probabilities, the weighting coefficient calculation unit 1020 calculates a weighting coefficient for each block. Specifically, in order to improve robustness to caption superposition, the weighting coefficient is calculated such that the weight becomes higher as the probability of a caption being superposed on the block becomes lower. The acquired weighting coefficients are output to the matching unit 1030.
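One possible way of deriving such weighting coefficients is sketched below. It assumes that the learning video has already been reduced to one binary caption mask per frame (1 where a caption covers the block, 0 otherwise), and it maps a probability p to a weight proportional to 1 - p. Both the mask representation and this particular mapping are assumptions made for illustration; the description above states only that the weight is higher where the caption probability is lower.

```python
def caption_probabilities(caption_masks):
    """Per-block caption probability estimated from a learning video.

    caption_masks: one list per learning frame, holding 1 for each block
    covered by a caption and 0 otherwise (assumed input format).
    """
    num_frames = len(caption_masks)
    num_blocks = len(caption_masks[0])
    return [sum(mask[b] for mask in caption_masks) / num_frames
            for b in range(num_blocks)]


def weighting_coefficients(probabilities):
    """Weights that decrease as the caption probability increases.

    Assumed mapping: weight proportional to (1 - p), normalized to sum to 1.
    """
    raw = [1.0 - p for p in probabilities]
    total = sum(raw)
    if total == 0:
        # Every block always carries a caption: fall back to uniform weights.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]
```

With this mapping, a block that never carries a caption in the learning video receives the largest weight, and a block that always carries one receives a weight of zero.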
The matching unit 1030 compares the first feature output from the block unit feature extraction unit 1000 with the second feature output from the block unit feature extraction unit 1010, using the weighting coefficients output from the weighting coefficient calculation unit 1020. Specifically, the matching unit 1030 compares the features of the blocks at the same position in the two frames, and calculates a score for each block such that the score is 1 if the features are the same and 0 if they are not. The matching unit 1030 then calculates a weighted sum of the acquired block scores using the weighting coefficients, and thereby obtains a matching score of the frame (similarity in frame units). The matching unit 1030 performs these processes on the respective frames to thereby acquire a matching result between the first video and the second video.
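The frame-level score described above is a weighted sum of per-block agreement scores, which can be sketched as follows. The function names are hypothetical, and pairing the frames of the two videos by index is an assumption; the description does not state how frames are aligned or how the per-frame scores are combined into the final matching result.

```python
def frame_matching_score(features_a, features_b, weights):
    """Weighted sum of block scores (1 if the edge types match, else 0)."""
    return sum(w * (1 if fa == fb else 0)
               for fa, fb, w in zip(features_a, features_b, weights))


def match_videos(video_a, video_b, weights):
    """Per-frame matching scores for two videos.

    Each video is a list of frame feature vectors. Frames are paired by
    index, which is an assumption made for this sketch.
    """
    return [frame_matching_score(fa, fb, weights)
            for fa, fb in zip(video_a, video_b)]
```

If the weights sum to 1, each frame score lies between 0 (no block matches) and 1 (all blocks match), and blocks with a low weight, such as those likely to carry captions, contribute little to the score.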
Through these processes, moving images can be matched while the influence of caption superposition is reduced in the portions where that influence may be large, so that high matching accuracy can be achieved even when captions are superposed.
Patent Document 1 describes a device for retrieving moving images, using features of images such as mean values in block units or DCT coefficients and motion vector information obtained between previous and next frames. In the moving image retrieval device of Patent Document 1, first, at least one of the following is extracted from the input image with respect to each frame: values of physical moving image feature information including luminance, color difference information, and color information of the frame, a mean value thereof, a sum of the values, or a difference value thereof. Then, the extracted values are aligned on a time axis, and either all values in the alignment or values taken from the alignment at regular or irregular intervals are extracted as moving image feature information. Alternatively, it is also possible to extract DCT coefficients and motion compensation information of a frame from compressed moving image data. In that case, a mean value of the DCT coefficients, a sum value thereof, or a difference value of the values is obtained, and from the motion compensation information, at least one of a motion vector, an average motion vector between previous and next frames, a sum motion vector, a difference vector, a motion vector of the frame as a whole, and the like is obtained. Then, the obtained values are aligned on a time axis, and either all values in the alignment or values taken from the alignment at regular or irregular intervals are extracted as moving image feature information.
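The first extraction path of Patent Document 1 can be illustrated by the sketch below, which uses the mean luminance of each frame as the per-frame value, aligns the values in frame order, and samples them at a fixed interval. The choice of mean luminance, the fixed sampling interval, and the function names are assumptions for illustration only; the DCT and motion vector path is not covered.

```python
def frame_mean_luminance(frame):
    """Mean luminance of one frame given as a 2-D list of pixel values."""
    pixels = [p for row in frame for p in row]
    return sum(pixels) / len(pixels)


def temporal_feature(frames, interval=1):
    """Per-frame means aligned on the time axis, sampled every `interval` frames."""
    series = [frame_mean_luminance(f) for f in frames]
    return series[::interval]


def temporal_differences(series):
    """Difference values between consecutive entries of a time series."""
    return [later - earlier for earlier, later in zip(series, series[1:])]
```

With `interval=1` all aligned values are kept, which corresponds to using every value in the alignment; a larger interval corresponds to extraction at regular intervals.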