FIG. 1a presents a prior art example of the generic architecture of a passive auto-focus system arranged in connection with a digital imaging system 10′. Generally speaking, auto-focus 12 is the procedure of moving the lens 14 in and out until the sharpest possible image I of the subject T is obtained. Depending on the distance of the subject T from the camera 10′, the lens 14 has to be at a certain distance from an image sensor 11 to form a clear image I.
More particularly, the image sensor 11 of the device 10′ produces (image) data I for the auto-focus unit 12. On the basis of this data, the auto-focus unit 12 calculates parameters for the motor 13, which adjusts the position of the lens 14. As a result of this adjustment, the captured image I is more accurately focused. Of course, other architectures are also possible. There are two main approaches to auto-focus in the prior art: active auto-focus (AAF) and passive auto-focus (PAF).
In active auto-focus, which is the more expensive approach, the camera emits a signal in the direction of the object (or scene) to be captured in order to detect the distance to the subject. The signal could be a sound wave, as is the case with submarine sonar under water, or an infrared wave. The round-trip time of the reflected wave is then used to calculate the distance. Basically, this is similar to the echo-ranging principle used in radar and sonar. Based on the distance, the auto-focus unit then tells the focus motor which way to move the lens and how far to move it.
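The distance calculation from the reflection time described above can be illustrated with a minimal sketch. The function name and the constants are assumptions introduced here for illustration only; an actual AAF unit may measure distance differently.

```python
# Hypothetical helper and constants, not taken from any particular camera.
SPEED_OF_SOUND_AIR_M_S = 343.0        # approximate value for air at about 20 degrees C
SPEED_OF_LIGHT_M_S = 299_792_458.0    # speed of light in vacuum


def distance_from_echo(round_trip_s: float, wave_speed_m_s: float) -> float:
    """Return the one-way distance to the subject in metres.

    The emitted wave travels to the subject and back, so the measured
    round-trip time covers twice the distance.
    """
    if round_trip_s < 0 or wave_speed_m_s <= 0:
        raise ValueError("time must be non-negative and speed positive")
    return wave_speed_m_s * round_trip_s / 2.0


if __name__ == "__main__":
    # An ultrasonic echo received 20 ms after emission.
    print(distance_from_echo(0.020, SPEED_OF_SOUND_AIR_M_S))
```

The division by two is the only non-obvious step: the measured time corresponds to the path to the subject and back.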
However, there are several problems with the AAF approach besides cost. When infrared is used, the subject has to be in the middle of the frame. Otherwise the auto-focus will be fooled by reflected waves received from other objects. The reflected beams could also come from objects in front of the subject if the user is taking a photo from behind a barrier, for example in a stadium. Any bright objects in the scene will also make it difficult for the camera to receive the reflected waves.
In passive auto-focus (PAF), the camera determines the distance to the subject by analyzing the image. The image is first captured and then analyzed by means of a sensor dedicated to auto-focus. The sensor dedicated to auto-focus usually has a limited number of pixels. Thus, often only a portion of the image is utilized.
A typical auto-focus sensor applied in PAF is a charge-coupled device (CCD). It provides input to algorithms that compute the contrast of the actual picture elements. The CCD is typically a single strip of, for example, 100 or 200 pixels. Light from the scene hits this strip and a microprocessor looks at the value of each pixel. This data is then analyzed by checking its sharpness, horizontally or vertically, and/or its contrast. The obtained results are then sent as feedback to the lens, which is adjusted to improve the sharpness and the contrast. So, for example, if the image is very blurred, it is understood that the lens needs to be moved in order to adjust the focus.
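The feedback loop described above can be sketched as follows, assuming a sum of squared differences between neighbouring pixels as the contrast measure and a simple hill-climbing search over lens positions. Both choices, as well as the names `strip_contrast`, `hill_climb_focus` and the simulated sensor `simulated_strip`, are assumptions made for illustration and are not taken from any particular prior art device.

```python
from typing import Callable, Sequence


def strip_contrast(pixels: Sequence[float]) -> float:
    """Sum of squared differences between neighbouring pixels of the strip."""
    return sum((b - a) ** 2 for a, b in zip(pixels, pixels[1:]))


def hill_climb_focus(read_strip: Callable[[int], Sequence[float]],
                     start: int, lo: int, hi: int) -> int:
    """Step the lens one position at a time while the contrast keeps rising."""
    pos = start
    best = strip_contrast(read_strip(pos))
    while True:
        scored = [(strip_contrast(read_strip(p)), p)
                  for p in (pos - 1, pos + 1) if lo <= p <= hi]
        if not scored:
            return pos
        score, candidate = max(scored)
        if score <= best:          # no neighbour is sharper: stop here
            return pos
        pos, best = candidate, score


def simulated_strip(pos: int, focus_pos: int = 7, n: int = 100) -> list:
    """Hypothetical sensor: a step edge blurred more the further the lens
    is from focus_pos (box blur whose radius equals the defocus)."""
    edge = [0.0] * (n // 2) + [255.0] * (n - n // 2)
    radius = abs(pos - focus_pos)
    out = []
    for i in range(n):
        window = edge[max(0, i - radius): min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


if __name__ == "__main__":
    print(hill_climb_focus(simulated_strip, start=0, lo=0, hi=15))
```

Hill climbing only finds the nearest contrast peak, which is sufficient for the single-peak simulation used here but may stop at a local maximum on real scenes.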
The same technique could be used for videos as well. However, in capturing videos, there are several use cases where an object of interest needs to be tracked and kept in focus during the recording process. In videos, objects of interest are often moving. The movement may happen either within the plane of the captured scene, or backward or forward with respect to the camera position. The challenge is how to maintain the focus on an object while shooting a video.
In the literature, different techniques have been used for object identification and for initializing the tracking of a region of interest (ROI). The identification problem arises in many applications, such as the initialization of ROI tracking in a video sequence. In the particular case where only human faces are targeted, skin color information is used to detect the foreground. Some approaches apply feature-based refinement to distinguish faces from other parts of the body with similar color characteristics, e.g. hands. Such methods are not applicable in the general case where the target can be of any type, such as, for example, a car, an airplane, an athlete, an animal, etc. In the case where the target is not necessarily of a particular type, e.g. a human face, but can rather be any object, the identification cannot be done automatically and user input becomes inevitable.
As a matter of fact, in order to perform any ROI-based editing or tracking, the target has to be differentiated from the background. Semi-automatic alternatives require quite complex input from the end-user. The user is expected to select points on the boundary of the ROI, which are then used as input for automatic segmentation. The user can still supervise the result of the segmentation and provide feedback, if necessary. This type of user interaction is not trivial and might be considered tedious by the typical mobile device user. Such an approach might make the experience of any related application unpleasant. Besides, the problem of image/frame segmentation has no simple solution yet, and the most reliable methods are of high computational complexity.
In addition to the above, once the ROI has been identified, a tracking process is performed in order to localize the ROI in each frame. Generally speaking, visual tracking of non-rigid objects in a video sequence can be seen as a recursive matching problem for which a certain reliability is required. At each time step, a region in the current frame is matched to a previous instance, or a model, of the target.
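One matching step of the kind described above can be sketched as follows, assuming grayscale frames stored as lists of rows and the sum of absolute differences (SAD) as the matching cost. The cost function, the search radius and the names `sad` and `track_step` are assumptions chosen for illustration; the prior art methods use a variety of other matching criteria.

```python
from typing import List, Tuple

Frame = List[List[int]]   # grayscale image as a list of pixel rows


def sad(frame: Frame, template: Frame, top: int, left: int) -> int:
    """Sum of absolute differences between the template and the frame
    region whose upper-left corner is at (top, left)."""
    return sum(abs(frame[top + r][left + c] - template[r][c])
               for r in range(len(template))
               for c in range(len(template[0])))


def track_step(frame: Frame, template: Frame,
               prev: Tuple[int, int], search: int = 2) -> Tuple[int, int]:
    """Return the position near `prev` where the template matches best."""
    h, w = len(template), len(template[0])
    rows, cols = len(frame), len(frame[0])
    best_cost, best_pos = None, prev
    for top in range(max(0, prev[0] - search),
                     min(rows - h, prev[0] + search) + 1):
        for left in range(max(0, prev[1] - search),
                          min(cols - w, prev[1] + search) + 1):
            cost = sad(frame, template, top, left)
            if best_cost is None or cost < best_cost:
                best_cost, best_pos = cost, (top, left)
    return best_pos


if __name__ == "__main__":
    template = [[9, 8], [7, 6]]
    frame = [[0] * 8 for _ in range(8)]
    for r in range(2):                 # place the target at row 3, column 4
        for c in range(2):
            frame[3 + r][4 + c] = template[r][c]
    print(track_step(frame, template, prev=(2, 3)))
```

The recursion arises when the returned position is fed back as `prev` for the next frame, optionally together with an updated template.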
The tracking problem is difficult in several respects. A 2D video frame of real-life 3D objects does not capture the depth of the objects. Changes in the object that occur over time, such as translation, rotation, deformation, etc., are not captured faithfully on the 2D screen. Tracking the object while it undergoes these changes is a challenging task.
Firstly, variability in the target's location, shape and texture, together with changes in the surrounding environment, makes target recognition a very challenging task. More precisely, the object may be affected by its surroundings and by other external factors. Some examples of these are interference with other objects, occlusions, changes in background and lighting conditions, capturing conditions, camera motion, etc. All of these factors impede robust and reliable tracking of the object of interest.
Secondly, the matching can be done based on colour, shape or other features. Methods based on one of these aspects usually provide robustness in one sense but show weaknesses in other scenarios. Finally, the tracking of shapes and features involves a significant computational load. Therefore, algorithms that consider more than one visual aspect of the target can enhance tracking performance, but at the expense of a higher computational load.
Existing tracking methods can be classified as colour-based, shape-based, motion-based or model-based. Model-based methods usually rely on multiple features, i.e. colour, motion, shape and texture details. Methods using colour content are usually computationally simple. However, their main drawback is the false alarms generated when the target shares some colour content with a neighbouring background region. In such a scenario, the tracked region can falsely expand or split into two distinct regions, and the tracking starts to degrade. In the literature, shape-based and feature-based methods are usually computationally complex and memory-inefficient. The main sources of complexity are the modelling of such features and the matching of the model, e.g. edge matching, contour representation and matching, etc.
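The false-alarm behaviour of colour-based matching can be illustrated with a minimal sketch, assuming a coarsely quantized RGB histogram and histogram intersection as the similarity measure. The bin count and the names `colour_histogram` and `histogram_intersection` are assumptions made for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Pixel = Tuple[int, int, int]              # (R, G, B), each in 0..255
Histogram = Dict[Tuple[int, int, int], float]


def colour_histogram(pixels: Iterable[Pixel], bins: int = 4) -> Histogram:
    """Normalized histogram over coarsely quantized RGB values."""
    step = 256 // bins
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: n / total for key, n in counts.items()}


def histogram_intersection(h1: Histogram, h2: Histogram) -> float:
    """Similarity in [0, 1]; 1 means identical colour distributions."""
    return sum(min(v, h2.get(key, 0.0)) for key, v in h1.items())


if __name__ == "__main__":
    target = colour_histogram([(250, 10, 10)] * 4)        # red target
    red_background = colour_histogram([(240, 20, 5)] * 4)  # similar red
    blue_background = colour_histogram([(10, 10, 250)] * 4)
    print(histogram_intersection(target, red_background))
    print(histogram_intersection(target, blue_background))
```

Because the histogram discards all spatial information, a background region with a similar red scores as highly as the target itself, which is the situation in which a colour-based tracker may expand into the background.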
Examining most of the proposed tracking methods, one can conclude that tracking robustness is usually associated with very high computational complexity. This is not feasible for devices with limited computational power, such as mobile devices. Hence, there is a need for a new solution that provides robust tracking with low-complexity algorithms.