The invention relates to the general field of processing video stream images.
The invention relates more particularly to a method of tracking characters that appear in images (i.e. frames) of a video stream of a document that contains text made up of one or more lines of characters. Each image of the video stream represents all or part of the text. No limit is associated with the document medium under consideration nor with the form of the text, nor indeed with the type of sensors used for acquiring the images. Thus, by way of example, such a document may be a page of an identity document having marked thereon lines of characters in a machine readable zone (MRZ), a number plate, etc. The sensor may be a camera, a contactless sensor, such as a contactless biometric sensor, a sensor embedded in a mobile telephone such as a smartphone, or it may be constituted by a plurality of sensors, etc., with the document traveling past the sensor(s) and with the sensor(s) being suitable for remotely acquiring partial or complete images of the document in question.
In known manner, the images constituting a video stream differ from static images of the kind that can be taken by a still camera, for example, in that the images of a video stream possess time redundancy: a given line of text appears in a plurality of contiguous video frames. Advantage may be taken of this time redundancy in order to improve the chances of locating a text and of recognizing the characters of the text, for instance when portions of the text appear under conditions that vary from one frame (image) to another. The purpose of tracking characters that appear in a plurality of images of a video stream is thus to determine the positions of the characters in a continuous and accurate manner in the dynamic scenes conveyed by the images of the video stream.
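By way of purely illustrative example, the benefit of time redundancy can be sketched as a per-position majority vote over the readings obtained from several frames. This is a minimal sketch and not a method described herein: the function name `vote_per_position` and the sample readings are hypothetical, and the sketch assumes that the characters have already been put into correspondence from one frame to another, which is precisely what tracking is intended to provide.

```python
from collections import Counter


def vote_per_position(readings):
    """Fuse equally long per-frame readings by majority vote at each position.

    Assumption: reading i, position j refers to the same physical character
    in every frame (i.e. the characters have already been tracked).
    """
    if len({len(r) for r in readings}) != 1:
        raise ValueError("readings must be non-empty and share one length")
    # zip(*readings) yields, for each position, the characters seen across frames.
    # On a tie, Counter.most_common keeps the character encountered first.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*readings)
    )


# Hypothetical readings of the same text in four successive frames;
# two frames each contain one isolated recognition error.
frames = ["L898902C3", "L8989O2C3", "L898902C3", "LB98902C3"]
fused = vote_per_position(frames)
print(fused)
```

In this sketch an error confined to a single frame is outvoted by the other frames, whereas a single static image would offer no such opportunity for correction.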
In the present state of the art, most tracking methods rely on forming a complete image of the document by aligning partial images, a technique that is also known as “mosaicing”. Such alignment can be performed in particular by correlating images with one another (known as “template matching”), or by extracting points of interest, for example so-called “scale invariant feature transform” (SIFT) points as described in the document by D. G. Lowe entitled “Object recognition from local scale-invariant features”, Proceedings of the International Conference on Computer Vision, Vol. 2, pp. 1150-1157, 1999. Thereafter, reconstruction of the complete image makes it possible to perform conventional reading of the document by recognizing the characters in the image.
Depending on the document in question, the image that is reconstructed by mosaicing may present reconstruction defects. For example, for an identity document and for reading the lines of the MRZ, putting partial images into alignment is made difficult by the repetition or the quasi-periodicity of certain character patterns (e.g. chevrons) that may give rise to matching ambiguities that are complicated to solve.
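The matching ambiguity caused by repeated patterns can be illustrated with a toy alignment, given here as a minimal sketch under stated assumptions. A line of text is modeled as a string of symbols rather than as pixels, and a simple count of matching symbols stands in for an image correlation score; the names `match_scores` and `best_offsets` and the sample line are hypothetical and are not taken from any of the documents cited.

```python
def match_scores(line, patch):
    """Score every offset of `patch` inside `line` by counting equal symbols."""
    span = len(line) - len(patch) + 1
    return [
        sum(a == b for a, b in zip(line[k:k + len(patch)], patch))
        for k in range(span)
    ]


def best_offsets(line, patch):
    """Return every offset reaching the maximal score.

    More than one offset means the alignment is ambiguous.
    """
    scores = match_scores(line, patch)
    top = max(scores)
    return [k for k, s in enumerate(scores) if s == top]


# Illustrative MRZ-like line, padded with chevrons to 44 symbols.
MRZ_LINE = "P<UTOERIKSSON<<ANNA<MARIA".ljust(44, "<")

distinct = best_offsets(MRZ_LINE, "ERIKS")   # partial view over distinctive characters
filler = best_offsets(MRZ_LINE, "<<<<<")     # partial view over repeated chevrons

print("distinctive patch:", distinct)
print("chevron patch:", filler)
```

In this sketch the distinctive patch has a single best offset, whereas the chevron patch ties at many offsets, so a score of this kind cannot on its own decide where the partial view belongs.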
The document by S. Uchida et al., entitled “Mosaicing-by-recognition for video-based text recognition”, Pattern Recognition 41.4, 2008, pp. 1230-1240, proposes a method relying on aligning images by character recognition: the problems of mosaicing and of recognition are formulated as a single optimization problem, thereby making it possible to handle both of these aspects simultaneously and collaboratively, and thus to obtain greater accuracy. Characters are aligned on the basis of their recognition and by relying on similarity between images at successive instants.
Nevertheless, as indeed emphasized by Uchida et al., that method is relatively complex.
Furthermore, the recognition of characters in the manner proposed by Uchida et al. is particularly sensitive to variations in image acquisition conditions (e.g. the existence of a reflection between two images, sampling, changes in lighting, the presence of blurring, etc.). Thus, a small variation in the image can result in erroneous detection: a typical example is a small variation of appearance in a character “2” that might lead to it being recognized as a “Z”.
There thus exists a need for a method of tracking characters in video images that does not present such drawbacks.