The embodiments described in the disclosure relate to the field of image processing and, specifically, to systems and methods for video and image processing, image recognition, and video annotation using sensor measurements.
Recently, action videos have become very popular due to the wide availability of portable video cameras. At the same time, professional and semi-professional videos of sporting events have become more common and more sophisticated. To achieve near-professional quality in mass-market sport video and to make sport video more interesting and appealing to the viewer, multiple special effects are employed. It is often very desirable to annotate video with on-screen comments and data, e.g., velocity, altitude, etc. These parameter values are usually obtained from sources that are not internal or connected to the camera device. It may also be desirable to analyze activities captured in a video, compare them with other videos, or select specific parts of the video for zooming, annotation, or enhancement. To achieve these effects, multiple conditions must be satisfied.
To apply an effect to a selected event correctly, the video frames capturing that event must be determined exactly. Since the most interesting events are often very fast motions, the time synchronization between the video and the data describing the event must be very accurate to provide the desired visual effect. For example, to slow down only the frames showing a skier's jump, time synchronization must be accurate to tenths of a second to create the appropriate visual effect.
Likewise, to select a particular part of a frame for enhancement (e.g., of a basketball player performing a dunk), the camera frame must be well calibrated to real-world three-dimensional coordinates. While camera calibration is well known (e.g., Tsai, Roger Y. (1987) “A Versatile Camera Calibration Technique for High Accuracy 3D Machine Vision Metrology Using Off-the-Shelf TV Cameras and Lenses,” IEEE Journal of Robotics and Automation, Vol. RA-3, No. 4, August 1987, pp. 323-344), for mass-market adaptation such procedures must be highly automated, possibly using image recognition of a sample target in the video frame.
There are methods that synchronize camera time and sensor time by using a common time source such as GPS or network time (e.g., commonly owned U.S. Pat. No. 8,929,709). Such methods require an accurate time source in both the camera and the sensor. Unfortunately, some cameras do not provide very accurate sub-second timestamps. Therefore, additional synchronization tuning is required. Image recognition methods can determine the video frame where a particular action starts or ends and, therefore, allow synchronization up to the time resolution of a frame.
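The frame-level tuning described above can be illustrated with a minimal Python sketch. It is not the method of the disclosure or of the cited patent; it only assumes that image recognition has already identified the frame in which an action starts and that the sensor reported the start of the same action on its own clock. The function name and all parameters are hypothetical.

```python
def estimate_clock_offset(event_sensor_time_s, event_frame_index, fps,
                          video_start_camera_time_s):
    """Estimate the offset between the sensor clock and the camera clock.

    Hypothetical helper for illustration only.
    event_sensor_time_s:       start of the action on the sensor clock (seconds)
    event_frame_index:         frame in which image recognition saw the action start
    fps:                       video frame rate (frames per second)
    video_start_camera_time_s: camera-clock timestamp of frame 0 (seconds)

    Returns (offset_s, resolution_s), where sensor_time = camera_time + offset_s
    and resolution_s is the duration of one frame, the best accuracy this
    approach can reach.
    """
    # Camera-clock time of the frame in which the action was recognized.
    frame_camera_time_s = video_start_camera_time_s + event_frame_index / fps
    offset_s = event_sensor_time_s - frame_camera_time_s
    resolution_s = 1.0 / fps
    return offset_s, resolution_s


if __name__ == "__main__":
    offset, resolution = estimate_clock_offset(
        event_sensor_time_s=105.25,
        event_frame_index=60,
        fps=30.0,
        video_start_camera_time_s=100.0,
    )
    print(f"offset: {offset:.3f} s, resolution: {resolution * 1000:.1f} ms")
```

At 30 frames per second, one frame lasts about 33 ms, so an offset estimated this way is already finer than the tenths-of-a-second accuracy mentioned earlier.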
A separate requirement may be the graphical enhancement of the video by adding graphics to particular images in the frame, such as a person's face.
Image recognition has become a common part of video and image processing. It is used to recognize particular images, such as faces, cars, or animals, or to recognize and track particular objects or activities, such as an athlete jumping or moving.
In all of the above applications, image recognition methods are very CPU intensive. To make video image analysis efficient, one needs to know what kind of motion or image to search for. Modern automatic cameras and drones that can work in autonomous or “start and forget” modes produce gigabytes of video data that must be analyzed for image recognition. Therefore, for efficient image recognition, it is very advantageous to know the range of frames in which to search for the desired images and the area of the screen where such images should appear.
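As a rough illustration of narrowing the search to a range of frames, the following Python sketch converts an event time window reported by a sensor into the first and last frame indices worth passing to an image recognition step. It is a sketch under stated assumptions, not the disclosed method: the function name, the `offset_s` convention (sensor-clock time of frame 0), and the safety margin are all hypothetical, and the screen-area restriction is not covered.

```python
import math


def frame_search_range(event_start_s, event_end_s, fps,
                       offset_s=0.0, margin_s=0.0, total_frames=None):
    """Return (first_frame, last_frame) to search for a sensor-detected event.

    Hypothetical helper for illustration only.
    event_start_s, event_end_s: event window on the sensor clock (seconds)
    fps:          video frame rate (frames per second)
    offset_s:     sensor-clock time of video frame 0 (assumed known)
    margin_s:     extra time added on both sides to absorb sync error
    total_frames: optional video length, used to clamp the upper bound
    """
    # Convert sensor-clock times to positions in the video, in frames.
    start = (event_start_s - offset_s - margin_s) * fps
    end = (event_end_s - offset_s + margin_s) * fps

    # Round outward so the window is never narrower than the event.
    first_frame = max(0, math.floor(start))
    last_frame = math.ceil(end)
    if total_frames is not None:
        last_frame = min(last_frame, total_frames - 1)
    return first_frame, last_frame


if __name__ == "__main__":
    first, last = frame_search_range(12.0, 13.5, fps=30.0,
                                     offset_s=2.0, margin_s=0.5)
    print(f"search frames {first} to {last}")
```

Only the frames inside the returned range would then be handed to the image recognition step, rather than the whole recording.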