In the past, it has been very difficult to identify and read text placed in a still or video image. Digital photographs and videos play an increasing role in education, entertainment, and multimedia applications. With hundreds of thousands of videos available, there is an urgent demand for efficiently storing, browsing, and retrieving video data. Text such as the credits in a movie, subtitles in foreign films or subtitles for the hearing impaired, dates in home videos, or even logos and trademarks is important because it makes it possible to determine the contents of still and moving images (hereinafter referred to as "pictures").
Various video understanding techniques using one or a combination of image contents, audio, and textual information presented in the videos have been proposed to index, parse, and abstract the massive amount of video data. Among these, the text presented in the video images plays an important role in understanding the raw video sequences. For example, the captions in news broadcasts usually annotate the where, when, and who of the ongoing events. In home videos, the captions/credits depict the title, the producers/actors, or sometimes the context of the story. In advertisements, the text presented identifies the product being advertised. Furthermore, specific texts/symbols presented at specific places in the video images can be used to identify the TV station or program of the video. Essentially, the texts/captions in videos provide highly condensed information about the contents of the video. They are very useful for understanding and abstracting videos and facilitate browsing, skimming, searching, and retrieval in digital video databases.
While extracting information from images and videos is easy for humans, it is very difficult for computers. First, there are the optical character recognition (OCR) problems, which still prevent 100% recognition even of black characters on a white background. The problems are compounded when the text is superimposed on complex backgrounds containing natural images or complex graphics.
Many attempts to solve these problems have operated on uncompressed still images or decompressed video sequences. These methods generally exploit the characteristics of text, including: 1) restricted character size; 2) text lines appearing as clusters of vertically oriented characters aligned horizontally; and 3) high contrast between the text and its background.
Almost all the previously published methods on locating text can be categorized as either component-based or texture-based.
For component-based text extraction methods, text is detected by analyzing the geometrical arrangement of edges or segmented color/grayscale components that belong to characters. For example, in one system text is identified as horizontal rectangular structures of clustered sharp edges. Another system extracts text as connected components of monotonous color that satisfy certain size constraints and horizontal alignment constraints. In a similar manner, a further system identifies text as connected components which are of the same color, which fall in some size range, and which have corresponding matching components in consecutive video frames. Since the component-based approach assumes that the characters appear as connected color/grayscale components, these systems usually require images of relatively high resolution in order to segment the characters from their background.
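The component-based approach can be illustrated with a minimal sketch in pure Python: connected components are labeled in a binary image, filtered by a size range, and grouped when their top rows roughly align horizontally. The function names, thresholds, and the grouping heuristic below are illustrative assumptions, not the method of any particular system described above.

```python
from collections import deque

def connected_components(binary):
    """Label 4-connected components in a binary image (list of lists of 0/1);
    return a dict mapping each label to its bounding box."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    boxes = {}  # label -> [min_row, min_col, max_row, max_col]
    next_label = 0
    for r in range(h):
        for c in range(w):
            if binary[r][c] and not labels[r][c]:
                next_label += 1
                labels[r][c] = next_label
                box = [r, c, r, c]
                queue = deque([(r, c)])
                while queue:  # breadth-first flood fill of one component
                    y, x = queue.popleft()
                    box[0] = min(box[0], y); box[1] = min(box[1], x)
                    box[2] = max(box[2], y); box[3] = max(box[3], x)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
                boxes[next_label] = box
    return boxes

def candidate_text_boxes(binary, min_size=2, max_size=40, row_tolerance=2):
    """Keep components within a size range, then group components whose top
    rows roughly align horizontally -- a crude version of the alignment
    constraint. A text line requires at least two aligned components."""
    kept = []
    for box in connected_components(binary).values():
        height = box[2] - box[0] + 1
        width = box[3] - box[1] + 1
        if min_size <= height <= max_size and min_size <= width <= max_size:
            kept.append(box)
    rows = {}
    for box in kept:
        rows.setdefault(box[0] // (row_tolerance + 1), []).append(box)
    return [sorted(group, key=lambda b: b[1]) for group in rows.values() if len(group) >= 2]
```

Real systems of this kind add color segmentation and inter-frame matching on top of such a skeleton; the sketch only shows the size and alignment filtering.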
For texture-based extraction methods, text is detected by using the characteristic that text areas possess a special texture. Text usually consists of character components which contrast with the background, which at the same time exhibit a periodic horizontal intensity variation due to the horizontal alignment of characters, and which form text lines with about the same spacings between them. As a result, text can be segmented using texture features. One system uses the distinguishing texture presented in text to determine and separate text, graph, and halftone image areas in scanned grayscale document images. Another system further utilizes the texture characteristics of text lines to extract text from grayscale images with complex backgrounds. That system defines for each pixel a text energy as the horizontal spatial variation in a 1×n neighborhood window, and locates text as rectangular areas of high text energy. This method was applied to a variety of still intensity images with acceptable performance.
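The text-energy idea can be sketched as follows: for each pixel, the energy is taken as the sum of absolute horizontal intensity differences inside a 1×n window centred on the pixel, and pixels whose energy exceeds a threshold are kept as text candidates. The window size, threshold, and function names are illustrative assumptions, not the parameters of the system described above.

```python
def text_energy(image, n=5):
    """Per-pixel text energy: sum of absolute horizontal intensity
    differences inside a 1-by-n window centred on the pixel."""
    h, w = len(image), len(image[0])
    half = n // 2
    energy = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            lo = max(0, c - half)
            hi = min(w - 1, c + half)  # clamp the window at the image border
            energy[r][c] = sum(abs(image[r][x + 1] - image[r][x]) for x in range(lo, hi))
    return energy

def high_energy_mask(image, n=5, threshold=100):
    """Binary mask of pixels whose text energy meets the threshold."""
    energy = text_energy(image, n)
    return [[1 if e >= threshold else 0 for e in row] for row in energy]
```

A full texture-based system would then merge high-energy pixels into rectangular text regions; the sketch stops at the per-pixel mask.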
Almost all the previously published methods operate on uncompressed still images or decompressed image sequences, even when they are designed for digital videos. However, digital videos and some still-image formats, for example JPEG (Joint Photographic Experts Group, a worldwide standards body) images, are usually compressed to reduce the size of the data for efficient storage and transmission. MPEG (Moving Picture Experts Group, a worldwide standards body) videos, for instance, are compressed by exploiting the spatial redundancy within a video frame and the temporal redundancy between consecutive frames. The spatial information of an image/frame is obtained only by applying a decompression algorithm to the compressed version. As a result, it is difficult to apply the image processing procedures of the previously mentioned systems directly in the compressed domain: the digital video sequence must be decompressed before such text detection/extraction algorithms can be applied. None of the previous systems attempts to utilize features in the compressed domain to locate text directly.
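As background on what compressed-domain features look like: under the 8×8 DCT convention used by JPEG and MPEG intra coding, the DC coefficient of a block equals eight times the mean of the (level-shifted) block, so a 1/64-resolution thumbnail of a frame can be assembled from entropy-decoded DC coefficients without any inverse DCT. The hypothetical function below assumes the DC coefficients have already been extracted and dequantized, in raster block order:

```python
def dc_thumbnail(dc_coefficients, blocks_per_row):
    """Arrange per-block DC coefficients into a reduced-resolution image.
    Dividing each DC value by 8 recovers the mean intensity of its 8x8
    block, giving a 1/64-size approximation of the frame with no IDCT."""
    thumb = []
    for i in range(0, len(dc_coefficients), blocks_per_row):
        thumb.append([dc / 8.0 for dc in dc_coefficients[i:i + blocks_per_row]])
    return thumb
```

Real MPEG streams complicate this with differential DC prediction and motion-compensated (P/B) frames, which the sketch deliberately ignores.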
In the multimedia arena, as digital image and video data accumulate and compression techniques become more and more sophisticated, there is an emerging trend toward, and need for, feature extraction and manipulation directly from compressed-domain images/videos. By manipulating features directly in the compressed domain, it would be possible to save the resources (computation time and storage) otherwise spent decompressing the complete video sequence.
One system has been proposed to extract embedded captions in a partially uncompressed domain for MPEG videos, where the video frames are reconstructed at reduced resolution using either the DC components alone or the DC components plus two AC components. Areas of large between-frame difference are then detected as the appearance and disappearance of captions. Unfortunately, this method only detects captions that appear or disappear abruptly and does not handle captions that gradually enter or leave the frames. It is further vulnerable to moving objects in the image. Because the image resolution is reduced by a factor of 64 (DC sequence only) or 16 (DC+2AC), a considerable amount of information is lost and the accuracy of the method deteriorates.
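The between-frame difference test that such a system relies on can be sketched on reduced-resolution DC images. The threshold and function names are illustrative assumptions, and the sketch inherits the weaknesses just noted: a gradual caption wipe never produces a sharp difference, and a large moving object triggers a false event.

```python
def frame_difference(prev_dc, curr_dc):
    """Mean absolute difference between two reduced-resolution (DC) images."""
    total = count = 0
    for row_a, row_b in zip(prev_dc, curr_dc):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count

def detect_caption_events(dc_frames, threshold=20.0):
    """Flag frame indices where the DC image changes sharply, taken as the
    abrupt appearance or disappearance of a caption."""
    events = []
    for i in range(1, len(dc_frames)):
        if frame_difference(dc_frames[i - 1], dc_frames[i]) >= threshold:
            events.append(i)
    return events
```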
Thus, a great deal of unsuccessful effort has been expended trying to find a system that can identify text anywhere in a picture. Further, with moving pictures, the system must be able to recognize text in real time.
In addition, since no current system can distinguish text from a background image before OCR is applied, such a system is highly desirable.
Further, no existing system can read text where the images are compressed, such as MPEG-1, MPEG-2, or MPEG-4. Decompressing compressed images consumes time and storage space and is therefore undesirable.
Thus, a system that can determine which areas are likely to contain text and then decompress only those areas would be extremely valuable.