The amount of video data stored in multimedia and other archives is growing very rapidly, which makes searching a time-consuming task. Text is an important component of video data and conveys a major portion of its information. In several types of programming, such as sports and news, graphic overlays that include text and symbols (e.g., logos) are superimposed on the video picture content. Such superimposing is generally done by video character generators, such as those manufactured by Chyron Corporation of 5 Hub Drive, Melville, N.Y. 11747. Such text is known as “overlayed text”. For example, in broadcasting, an overlay such as “Breaking News” is superimposed by the producer over the main display.
The text information may also be present as part of the picture content, which we will term “in-scene text”. Examples of text that is present as part of the scene are road signs, billboards with visual text information, and ad campaign titles on players' shirts on the field. The in-scene text information is captured by the camera as it films the scene.
Text information in video may additionally be characterized as “static text”, in which case the text is static with respect to the picture information, or as “scrolling text”, in which case the text scrolls on the screen at a rate that is independent of camera motion or object motion in the scene.
While graphic overlays are generally displayed at a constant image location and exhibit only temporal variations (namely appearance and disappearance), in other cases the overlay may be moving (e.g., scrolling). The term “text” is used in the present application to indicate “static text”, while “scrolling text” refers to a specific, separate scenario.
Reference is now made to FIGS. 1A and 1B, which illustrate four examples of video text information, as follows:
    Overlayed text examples of FIG. 1A (static and scrolling, left to right, respectively). The static text is shown in picture 134, and the scrolling text, left to right, is shown in pictures 135 and 136, respectively, of FIG. 1B.
    In-scene text examples (bottom row of FIG. 1A), with two scenarios depicted. In one example, the text is on the main moving object in the scene (picture bottom left 137), and in the second example, the text is part of the background (picture bottom right 138/139).
Two classes of applications for text localization are known, as follows:
    document conversion; and
    searching purposes (such as Web searching) and image and video indexing purposes.
The first class of applications (document conversion) mostly involves binary images and requires a very high accuracy in locating all the text in the input image. This necessitates a high image resolution.
On the other hand, the most important requirement for the second class of applications is high speed of localization and content extraction, and only the most important text in the image or frame needs to be extracted. For example, only text with a font size above a certain threshold may be of interest.
Automatic text location without character recognition capabilities refers to locating regions that contain only text, without a prior need to recognize the characters in the text. Two primary methods are used for locating text:
    i) The first method regards text as textured objects and uses well-known methods of texture analysis, such as Gabor filtering and spatial variance, to automatically locate text regions.
    ii) The second method of text location uses connected component analysis. This method is very fast and achieves high localization accuracy. It has mainly been used in binary images and has recently been extended to multi-valued images, such as color documents and video frames.
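By way of illustration only, the spatial variance idea of the first method may be sketched as follows. The sketch assumes a grayscale image given as a list of rows, a one-dimensional horizontal window, and an arbitrary threshold; the function names, window size and threshold are hypothetical and are not taken from any particular prior art system.

```python
def horizontal_variance(gray, half_width=2):
    """Variance of the pixel values in a horizontal window around each pixel.

    The window spans up to 2 * half_width + 1 pixels and is clipped at the
    image borders. The window shape is an assumption of this sketch.
    """
    height, width = len(gray), len(gray[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        row = gray[y]
        for x in range(width):
            lo = max(0, x - half_width)
            hi = min(width, x + half_width + 1)
            window = row[lo:hi]
            mean = sum(window) / len(window)
            out[y][x] = sum((v - mean) ** 2 for v in window) / len(window)
    return out


def text_like_mask(gray, half_width=2, threshold=1000.0):
    """Mark pixels whose local variance exceeds a (hypothetical) threshold."""
    variance = horizontal_variance(gray, half_width)
    return [[v > threshold for v in row] for row in variance]


# A flat row, a row of alternating dark and bright strokes, and another flat row.
flat = [100] * 12
strokes = [0, 255] * 6
image = [flat, strokes, flat]
mask = text_like_mask(image)
```

In this toy image only the middle row, which alternates between dark and bright values in the manner of character strokes, is marked as text-like; the flat rows have zero variance.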
FIG. 2 illustrates an example of the latter method. The input image is decomposed into multiple foreground images. Each segmented image passes through a connected component module and a text identification module. These modules may be implemented in parallel. The outputs from all channels are combined to identify the locations of text in the input image. Each text location is represented by the coordinates of its bounding box.
In dealing with multi-valued images, the image is decomposed into a set of “real foreground” images and a “background-complementary foreground image”. A binary image has two element images, the given image and its inverse, each being a real foreground image. In pseudo-color images, real foreground images are extracted by histogramming the pixel values and retaining those foreground images in which the number of pixels is greater than a given threshold. The color with the largest number of pixels is regarded as the background, from which the background-complementary foreground image can be generated.
For color images and video frames, color quantization schemes or clustering may be used, as known in the art, to generate a small number of meaningful color prototypes. Once a set of color prototypes is extracted, a method similar to that used for pseudo-color images can be applied to produce real foreground and background-complementary foreground images for the color-quantized images.
After the decomposition of the multi-valued image, a look-up table is obtained that associates pixel values with foreground images. Each pixel in the input image may contribute to one or more foreground images, as specified by the table.
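The decomposition and look-up table described above may be sketched, under stated assumptions, as follows. The sketch assumes a pseudo-color image given as rows of integer pixel values, treats the most frequent value as the background, excludes the background value from the real foreground images, and takes the background-complementary foreground image to be the set of all non-background pixels. These choices, and the names `decompose` and `min_pixels`, are illustrative assumptions rather than details of the known method.

```python
from collections import Counter


def decompose(image, min_pixels):
    """Build a look-up table from pixel value to foreground-image indices.

    Indices 0..k-1 are real foreground images (one per retained value);
    index k is the background-complementary foreground image.
    """
    counts = Counter(value for row in image for value in row)
    background = counts.most_common(1)[0][0]
    # Retain non-background values whose pixel count exceeds the threshold.
    real_values = sorted(v for v, n in counts.items()
                         if v != background and n > min_pixels)
    complement_index = len(real_values)

    lookup = {}
    for value in counts:
        targets = []
        if value in real_values:
            targets.append(real_values.index(value))
        if value != background:
            targets.append(complement_index)
        lookup[value] = targets

    # Render one binary mask per foreground image using the table.
    masks = [[[0] * len(row) for row in image]
             for _ in range(complement_index + 1)]
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            for k in lookup[value]:
                masks[k][y][x] = 1
    return lookup, masks


image = [
    [0, 0, 0, 0, 0, 0],
    [0, 3, 3, 0, 7, 0],
    [0, 3, 3, 0, 0, 0],
    [0, 0, 0, 0, 0, 5],
]
lookup, masks = decompose(image, min_pixels=2)
```

Here value 3 contributes both to its own real foreground image and to the background-complementary image, whereas the sparse values 5 and 7 contribute only to the latter, which shows how a single pixel may contribute to more than one foreground image.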
It is known to generate connected components on the now-binary mask of each foreground image, as described by Jain and B. Yu in “Automatic text Location in Images and Video Frames”, TR. A connected component algorithm may be implemented in parallel for all foreground images. If it is assumed that most of the important text content is horizontally positioned in the frame, the connected components are clustered in the horizontal direction, resulting in candidate text lines.
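A minimal sketch of connected component generation followed by horizontal clustering is given below. It assumes a binary mask given as rows of 0 and 1, 4-connectivity, and a simple greedy chaining rule with a hypothetical `max_gap` parameter; the cited reference may use different connectivity and clustering criteria.

```python
from collections import deque


def connected_components(mask):
    """Return inclusive bounding boxes (x0, y0, x1, y1) of 4-connected regions."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:  # breadth-first flood fill of one component
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


def cluster_into_lines(boxes, max_gap):
    """Chain boxes left to right into candidate text lines.

    A box joins a line when it overlaps the line's last box vertically and
    its left edge is at most max_gap columns past that box's right edge.
    """
    lines = []
    for box in sorted(boxes):
        for line in lines:
            last = line[-1]
            overlaps_vertically = min(last[3], box[3]) >= max(last[1], box[1])
            if overlaps_vertically and box[0] - last[2] <= max_gap:
                line.append(box)
                break
        else:
            lines.append([box])
    return lines


mask = [
    [1, 1, 0, 1, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
]
boxes = connected_components(mask)
lines = cluster_into_lines(boxes, max_gap=2)
```

The three components in the top two rows are chained into one candidate text line, while the isolated component in the bottom row forms a separate candidate.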
A verification module in the system determines whether candidate text lines contain text or non-text, based on statistical features of the connected components. For separated characters, the corresponding connected components should be well aligned, and the number of connected components should be in proportion to the length of the text line. For characters touching each other, features can be extracted based on the projection profiles of the text line in both the horizontal and vertical directions.
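The kinds of features mentioned above may be illustrated by the following sketch. It assumes inclusive bounding boxes of the form (x0, y0, x1, y1), measures alignment by the spread of the box bottoms, and measures proportionality as the number of components per line-height of line width. The thresholds and function names are hypothetical; an actual verification module would be expected to use tuned statistical criteria.

```python
def separated_characters_look_like_text(boxes, max_misalignment=2,
                                        min_density=0.5, max_density=2.0):
    """Check alignment and component count for a line of separated characters.

    This covers only the separated-character case; lines of touching
    characters would instead be examined through projection profiles.
    """
    if len(boxes) < 2:
        return False
    bottoms = [b[3] for b in boxes]
    aligned = max(bottoms) - min(bottoms) <= max_misalignment
    line_width = max(b[2] for b in boxes) - min(b[0] for b in boxes) + 1
    line_height = max(b[3] for b in boxes) - min(b[1] for b in boxes) + 1
    # Assumed proportionality: about one component per line-height of width.
    density = len(boxes) * line_height / line_width
    return aligned and min_density <= density <= max_density


def projection_profiles(mask):
    """Return (row sums, column sums) of a binary mask of a text line."""
    row_sums = [sum(row) for row in mask]
    column_sums = [sum(column) for column in zip(*mask)]
    return row_sums, column_sums


text_line = [(0, 0, 1, 1), (3, 0, 3, 1), (5, 0, 6, 1)]
scattered = [(0, 0, 1, 1), (3, 5, 4, 9)]
is_text = separated_characters_look_like_text(text_line)
is_scattered_text = separated_characters_look_like_text(scattered)
rows, columns = projection_profiles([[1, 1, 0, 1],
                                     [1, 0, 0, 1]])
```

A zero in the column sums, as at the third column of the small mask, marks a gap between strokes; counting such gaps is one feature that could be derived from the profiles of touching characters.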
Connected component analysis and text identification modules are applied to individual foreground images. Text lines extracted from different foreground images may overlap and therefore need to be merged. Heuristic rules may be used in merging the information into a final set of bounding boxes localizing the text in the individual frame.
Examples of the text localization stage are illustrated in FIG. 3, to which reference is now made. FIG. 3a shows examples of foreground images extracted from a given input image. FIG. 3b exemplifies a connected component scenario and the generation of a candidate text line, and FIG. 3c demonstrates the use of projections in the verification stage, the string formation stage and the binarization procedure.
FIG. 4 is a schematic block diagram illustration of the detection-binarization-OCR process for a single frame image.
Optical character recognition (OCR) is well known in the prior art. Generally, recognition is performed on text images that are bi-level (black or white). OCR engines are commercially available, for example, from Caere Corp., 100 Cooper Court, Los Gatos, Calif. 95032 USA.
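One simple merging rule, offered only as an assumed example of such a heuristic, is to replace any two overlapping bounding boxes by their union and to repeat until no overlaps remain. The sketch below uses inclusive boxes of the form (x0, y0, x1, y1); the function names are hypothetical.

```python
def overlaps(a, b):
    """True when two inclusive boxes (x0, y0, x1, y1) share at least one pixel."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def merge_boxes(boxes):
    """Merge overlapping boxes into their unions until none overlap."""
    merged = []
    for box in boxes:
        box = tuple(box)
        changed = True
        while changed:
            changed = False
            for other in merged:
                if overlaps(box, other):
                    # Absorb the overlapping box and re-check the grown box.
                    merged.remove(other)
                    box = (min(box[0], other[0]), min(box[1], other[1]),
                           max(box[2], other[2]), max(box[3], other[3]))
                    changed = True
                    break
        merged.append(box)
    return merged


# Text lines found in different foreground images of the same frame.
candidates = [(0, 0, 10, 5), (8, 1, 20, 6), (30, 0, 40, 5), (19, 2, 25, 4)]
final_boxes = merge_boxes(candidates)
```

In this example the first, second and fourth candidates collapse into a single box, while the third remains separate. A practical system might add further conditions, such as requiring similar heights, before merging.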
The clarity of the text in the frame being analyzed is critical. The accuracy of text detection (as well as of the binarization process and the OCR) depends on factors such as contrast and occlusion. For example, if the text string is white and is overlayed on a brightly colored shirt in the scene, recognition of the words may be only partly successful. If the text string is partly occluded, as when a person walks in front of a street sign and only part of the sign is visible, the recognition accuracy is again greatly diminished.
In a single frame the text may be unclear, as in the above examples, or may be degraded by lighting conditions. It may also be the case that only partial information is present in a single frame. This occurs for “in-scene text” when there is camera motion, and for “overlayed text” when the text scrolls on the screen. In the scrolling scenario, only partial words and partial sentences may be obtained from a single frame.
When utilizing a frame-by-frame system, it may be possible to increase the accuracy of the results (with respect to the issues mentioned above) by combining the results of the individual frame analyses. However, analyzing each frame independently involves considerable redundant work and increases the search time. Moreover, because information is preserved only on a per-frame basis, such a system still does not enable temporal content to be queried.
A further disadvantage of the prior art is that categorization of the text into in-scene vs. overlayed text, or static vs. scrolling text, is not possible in a frame-by-frame analysis scenario.
A possible application for frame-by-frame video text indexing is in monitoring the exposure of brand names. Companies put their names on billboards and other objects at events with high television exposure, such as sports events. These companies want to know the actual level and quality of exposure of their brand name to the audience. This data can later be matched with audience metering data to reflect the actual commercial value of the brand name exposure. While the quality of presentation of overlayed text is controlled, the quality of presentation of in-scene text such as billboards is governed by the motion of the camera, which generally tracks the action in the scene. On such occasions the exposure of the brand name is uncontrolled. What is needed is a method to index text in video and to compare the indexing data with a list of brand names to derive brand name exposure data.
The prior art teaches how to detect billboards that contain a known pattern or image and to track them over time. A disadvantage of the known art is that it cannot derive indexing data when the same brand name appears in many sizes and colors, none of which is known to the system beforehand.
Furthermore, existing techniques for text detection, binarization and recognition in single-frame scenarios have several disadvantages. They lose the temporal information that is a vital component of the video sequence and, as a result, lose accuracy in recognition. In addition, the existing systems lose the temporal coding information regarding the start and end points of each text string, and lose the ability to categorize the text strings into selected categories.
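Purely to illustrate what comparing indexing data with a brand name list could involve, the following sketch counts the frames in which each brand name appears among the recognized strings and converts the count to seconds. It assumes per-frame lists of recognized strings, single-word brand names, case-insensitive exact word matching, and a known frame rate; the function name and data layout are hypothetical and do not describe any existing system.

```python
def brand_exposure(frame_texts, brands, frames_per_second):
    """Return the on-screen time in seconds for each brand name.

    frame_texts holds one list of recognized strings per video frame.
    Matching is by whole word, ignoring case (an assumption of this sketch).
    """
    wanted = {brand.lower(): brand for brand in brands}
    frames_seen = {brand: 0 for brand in brands}
    for strings in frame_texts:
        words = {word.lower() for text in strings for word in text.split()}
        for key, brand in wanted.items():
            if key in words:
                frames_seen[brand] += 1
    return {brand: count / frames_per_second
            for brand, count in frames_seen.items()}


frame_texts = [["ACME Cola"], ["acme"], [], ["Breaking News"]]
exposure = brand_exposure(frame_texts, ["Acme", "Globex"], frames_per_second=2)
```

Such a per-frame count captures only the level of exposure. It says nothing about its quality, and it inherits every recognition error made in the individual frames.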