With the rapid development of integrated circuit technology, mobile terminals have acquired powerful processing capacity and are turning from simple communication tools into comprehensive information processing platforms. With powerful processing capacity, memory, a consolidated storage medium, and a computer-like operating system, a modern mobile terminal is in effect a complete miniature computer system capable of performing complex processing tasks. With the rapid increase in the capacity of the expandable memory of mobile terminals, a user can store videos of most formats on a mobile terminal such as a mobile phone, a palmtop, and the like. However, since a video file is fairly large, when facing a video file of long duration the user cannot know the content of the video unless he watches it from beginning to end.
Given that the content of a video file often contains much redundancy, the user could focus on the part of the video content he is interested in if the main scenes of the video file could be acquired. Alternatively, if the user has obtained a video image frame and wants to quickly locate the position of that frame and watch the video from that position, he can usually find the position only by manually dragging the video progress bar. This is not only inefficient; because the screen of a mobile terminal such as a mobile phone is too small for dragging to be controlled easily, minor shakes are also very likely to cause the user to skip many scenes he wants to watch. It is therefore hard to achieve accurate relocation with the above method.
In the related art, the method for retrieving content of a video file includes: first capturing video frames and converting the captured video into multiple pictures, a step generally implemented by a third-party video decoder or by DirectX provided by Microsoft; during the capture of these pictures, comparing the differences between the images of multiple adjacent frames and taking the frames with larger differences as key frames, or obtaining the key frames by another, more complex determination method such as a spatio-temporal attention model; and finally, performing complex matching and comparison of the key frames against a target image to be retrieved, for example by a strategy such as an expert system or a neural network. Furthermore, in the related art the result of the retrieval is obtained only after all shots of the video have been processed, so too much memory is taken up when all shots of the video are processed without a plan; such a method is therefore not suitable for the mobile terminal. In addition, such a method adopts complex, computationally intensive analysis in most stages, such as acquisition of the key frames and matching determination, which makes it suitable only for current computers of increasingly powerful processing capacity; the heavy resource consumption and computation of the method are unbearable for the mobile terminal with relatively limited processing capacity and resources.
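For illustration only, the simplest variant of the related-art flow described above (adjacent-frame differencing followed by matching against a target image) may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular product: the frame representation, the function names `mean_abs_diff`, `select_key_frames`, and `best_match`, and the `threshold` parameter are all hypothetical, and a plain pixel difference stands in for the more complex matching strategies (expert system, neural network) mentioned above.

```python
from typing import List, Sequence

# Assumption: a decoded frame is a grayscale image stored as rows of 0-255 values.
Frame = Sequence[Sequence[int]]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute pixel difference between two equally sized frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count if count else 0.0


def select_key_frames(frames: Sequence[Frame], threshold: float) -> List[int]:
    """Return indices of frames that differ strongly from the preceding frame.

    The first frame is always kept so a non-empty video yields a key frame.
    """
    if not frames:
        return []
    keys = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            keys.append(i)
    return keys


def best_match(frames: Sequence[Frame], key_indices: Sequence[int],
               target: Frame) -> int:
    """Return the index of the key frame closest to the target image."""
    if not key_indices:
        raise ValueError("no key frames to match against")
    return min(key_indices, key=lambda i: mean_abs_diff(frames[i], target))
```

Note that this sketch holds every decoded frame in memory and yields a result only after the whole sequence has been scanned, which mirrors the memory concern raised above for the mobile terminal.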