The present invention relates to the extraction of textual information from a sequence of video frames in which each frame includes an image portion containing the textual information.
It has become important to be able to detect and recognize textual information from images of that information. One application is tracking the identity of automobiles through their license plates, for example, for automatic traffic violation control or automatic parking lot billing. Another application is tracking the content and identity of boxes and other containers through labels attached to them, for example, for production tracking of factory material and routing of finished goods or outgoing packages. Many other applications exist for such technology. In these applications, a camera scans an area of interest. When textual information of interest passes through the visual field of the camera, an image containing the textual information is temporarily stored. That image is analyzed to locate the textual information within it. The portion of the full image containing the textual information is then extracted. Finally, the text is recognized from the extracted textual information image. For example, a camera might be located at an exit of a parking lot to take a picture of departing cars. When a car leaves the lot, the picture containing the image of the car is stored in a memory. From this image the license plate within the image of the car is located. Then the image of the characters on the license plate is extracted. Finally, the actual text on the license plate is recognized from the character image for billing purposes.
Much work has been done in the area of recognizing the textual information from the image of that information; for example, recognizing the letter “A” from the image of an “A”. This is termed optical character recognition (OCR). However, before the OCR operation can occur, the character images must be extracted. The present application is related to the character image extraction operation. Various approaches exist in the prior art.
In general, the textual information is assumed to be an image with the characters being one color on a background of a contrasting color. For example, on license plates, it may be assumed that dark or black characters are placed on a light or white background. The previously located area containing the textual information within the image (i.e. the license plate) is converted to an array of pixels, each pixel having a value representing the brightness of the pixel. One approach to character image extraction has been to use a global threshold. In this approach, a single threshold is established for the entire array. If the value of a pixel is on one side of the threshold (for example, greater than the threshold), that pixel is assumed to be a character image pixel, and if it is on the other side of the threshold (i.e. less than the threshold), that pixel is assumed to be a background image pixel. Prior art approaches also apply global contrast enhancement prior to character extraction. These global approaches do not work well in real life applications. First, the resolution of the textual information is usually low because the original image in which the textual information resides contains much more information than the textual information alone. For example, the parking lot image described above contains an image of the entire car, and the license plate is a small part of the whole scene, containing a small percentage of the pixels in that scene. Second, global thresholding and contrast enhancement operate accurately only when the scene being processed is uniformly illuminated and not too noisy. This is seldom the case in real life applications.
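By way of illustration only, the global threshold approach described above can be sketched as follows. The function name `global_threshold`, the `dark_text` flag, the sample brightness values and the threshold of 110 are all assumptions chosen for this sketch and are not taken from any cited prior art system.

```python
def global_threshold(pixels, threshold, dark_text=True):
    """Label each pixel as character (1) or background (0) using one threshold.

    pixels    -- list of rows of brightness values (hypothetical 0-255 scale)
    dark_text -- assumption: True when dark characters lie on a light background
    """
    mask = []
    for row in pixels:
        if dark_text:
            # Dark characters: values below the threshold are character pixels.
            mask.append([1 if value < threshold else 0 for value in row])
        else:
            # Light characters: values above the threshold are character pixels.
            mask.append([1 if value > threshold else 0 for value in row])
    return mask


# Hypothetical single row: character strokes at columns 1 and 4, with the
# illumination falling off toward the right-hand side of the array.
row = [[200, 40, 190, 120, 20, 100]]
mask = global_threshold(row, 110)
print(mask)  # [[0, 1, 0, 0, 1, 1]]
```

In this sketch the last pixel is a dimly lit background pixel, yet it falls below the single threshold and is labeled as a character pixel, which is the kind of error that non-uniform illumination causes under global thresholding.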
In the paper “Morphology Based Thresholding for Character Extraction”, IEICE Transactions on Inf. and Syst., E76-D(10):1208-1215, 1993, a method is described for extracting character images in which characters are considered as “ditches” formed of two edges of opposite directions. Morphological operators enhance the area within the ditch. This method works when the contrast between characters and the background is high, but not when the contrast is low, which can occur in real life applications.
Other approaches to character extraction utilize adaptive thresholding, in which thresholds are derived from local regions, instead of globally. Such methods can deal with images which are not illuminated uniformly. However, the accuracy of such methods does depend on the selection of the local regions. If the local regions are selected such that the image of a single character spans two regions, a broken character might result if the thresholds selected for the two regions are different. One solution to this problem is to select and then grow a region in an attempt to ensure that the image of a single character is contained within a single region.
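As a minimal sketch of the adaptive thresholding idea described above, the array can be divided into vertical strips, with a separate threshold derived for each strip. The fixed strip width, the use of the midpoint between the darkest and brightest pixel as the local threshold, and the assumption of dark characters on a light background are all choices made for this illustration only; practical prior art methods select regions and thresholds in other ways.

```python
def adaptive_threshold(pixels, block_width):
    """Label pixels as character (1) or background (0) using per-strip thresholds.

    Assumptions for this sketch: dark characters on a light background, local
    regions are fixed-width vertical strips, and each local threshold is the
    midpoint between the darkest and brightest pixel in the strip.
    """
    height = len(pixels)
    width = len(pixels[0])
    mask = [[0] * width for _ in range(height)]
    for start in range(0, width, block_width):
        cols = range(start, min(start + block_width, width))
        values = [pixels[r][c] for r in range(height) for c in cols]
        local_t = (min(values) + max(values)) / 2
        for r in range(height):
            for c in cols:
                mask[r][c] = 1 if pixels[r][c] < local_t else 0
    return mask


# Same hypothetical unevenly illuminated row as might defeat a global threshold.
row = [[200, 40, 190, 120, 20, 100]]
local_mask = adaptive_threshold(row, 3)
print(local_mask)  # [[0, 1, 0, 0, 1, 0]]
```

Here each strip receives its own threshold (120 for the bright strip, 70 for the dim strip), so the dimly lit background pixel at the right is labeled correctly. Because the strip boundaries are fixed, a character straddling two strips would be thresholded with two different values, which is how the broken characters mentioned above can arise.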
All of the above prior art character image extraction approaches analyze a single frame of image information to extract character image information. However, the inventors have realized that additional information is available in successive video frames containing the same textual information. The information in multiple video frames can be used to improve the performance of the character image extraction function.
In accordance with principles of the present invention, a method for extracting an image representing textual information from a video sequence includes the following steps. First, receiving a sequence of video frames, each including an image of textual information. Then, locating the textual information in each frame of the video sequence to form a stack of text arrays, each array containing data representing substantially only the textual information. Finally, extracting a single textual image array representing the image of the textual information from the stack of text arrays.
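The final step above, deriving a single array from a stack of text arrays, can be illustrated with a deliberately simple sketch. This summary does not specify how the single array is computed, so the per-pixel median used here is purely an assumption for illustration and should not be read as the method of the invention. The sketch further assumes that the text arrays have already been located, aligned and scaled to the same size; the name `combine_stack` and the sample values are hypothetical.

```python
from statistics import median


def combine_stack(stack):
    """Combine a stack of equally sized text arrays into one array.

    Assumption for this sketch: the output pixel is the median of the
    corresponding pixels across all frames in the stack.
    """
    height = len(stack[0])
    width = len(stack[0][0])
    return [
        [median([frame[r][c] for frame in stack]) for c in range(width)]
        for r in range(height)
    ]


# Three hypothetical 1x3 text arrays of the same text; the middle pixel of
# the second frame is corrupted by a bright noise spike.
stack = [
    [[200, 40, 190]],
    [[210, 250, 180]],
    [[190, 50, 200]],
]
combined = combine_stack(stack)
print(combined)  # [[200, 50, 190]]
```

The noise spike present in one frame does not survive in the combined array, which suggests how information from several frames may compensate for defects in any single frame.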
In accordance with another aspect of the invention, apparatus for extracting an image representing textual information from a video sequence includes a source of a video sequence having a plurality of frames, each containing an image of the textual information; and a processor, coupled to the video sequence source, responsive to all of the plurality of frames, for generating a single array representing an image of the textual information.