1. Field of the Invention
The present invention relates to a video image processing apparatus, and more specifically to a text image extraction apparatus for e-Learning video. The text change frame detection apparatus locates the video frames that contain text information. The text extraction apparatus extracts the text information from those video frames and sends it to an optical character recognition (OCR) engine for recognition.
2. Description of the Related Art
Text retrieval in video and images is an important technique with a variety of applications, such as storage capacity reduction, video and image indexing, and digital libraries.
The present invention focuses on a special type of video, e-Learning video, which often contains a large amount of text information. To retrieve the text content in such video efficiently, two techniques are needed: text change frame detection in video and text extraction from images. A text change frame is a frame that marks a change of text content in a video. The first technique quickly scans the video and selects the video frames that contain text areas. The second technique then extracts the text information from those video frames and sends it to an OCR engine for recognition.
Text change frame detection can be regarded as a special case of scene change frame detection. Techniques for detecting scene change frames, which mark changes of content, from the plurality of frames in a video have been studied actively in recent years. Some methods focus on the intensity difference between frames; others focus on differences in color histogram and texture. However, these methods are not suitable for text change frame detection in video, especially in the e-Learning field.
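As an illustration of the conventional histogram-based approach mentioned above, the following is a minimal sketch and not the method of the present invention. It assumes grayscale frames given as nested sequences of pixel values from 0 to 255; the function names, the bin count, and the threshold are hypothetical choices made only for this example.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]  # grayscale image, pixel values 0..255 (assumed layout)


def gray_histogram(frame: Frame, bins: int = 16) -> List[float]:
    """Return a normalized intensity histogram with `bins` equal-width bins."""
    counts = [0] * bins
    total = 0
    for row in frame:
        for pixel in row:
            counts[min(pixel * bins // 256, bins - 1)] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * bins


def histogram_difference(a: Frame, b: Frame, bins: int = 16) -> float:
    """Half the L1 distance between the two normalized histograms (range 0..1)."""
    ha, hb = gray_histogram(a, bins), gray_histogram(b, bins)
    return 0.5 * sum(abs(x - y) for x, y in zip(ha, hb))


def detect_scene_changes(frames: Sequence[Frame], threshold: float = 0.5) -> List[int]:
    """Indices i where frame i differs from frame i-1 by more than `threshold`."""
    return [
        i
        for i in range(1, len(frames))
        if histogram_difference(frames[i - 1], frames[i]) > threshold
    ]
```

Because a histogram discards pixel positions, two slides with the same number of dark text pixels in different places produce a difference of zero under this sketch, so a change of text content can go undetected.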
Take presentation video, a typical kind of e-Learning video, as an example. Its video frames often contain a slide image, such as a PowerPoint® image or a film image from a projector. A change in slide content does not cause a dramatic change in color or texture. Also, the focus of the video camera often moves around within a slide image during the talk, which causes image shifting. Image shifting also occurs when the speaker moves his or her slides. Conventional methods mark these content-shifting frames as scene change frames. Another drawback of the conventional methods is that they cannot tell directly whether a frame contains text information.
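The image-shifting problem can be illustrated with a small sketch of a conventional intensity-difference measure. This is an assumed toy model, not the method of the present invention: the frame is a synthetic striped pattern standing in for text strokes, and `mean_abs_difference` and `shift_right` are hypothetical helper names.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[int]]  # grayscale image, pixel values 0..255 (assumed layout)


def mean_abs_difference(a: Frame, b: Frame) -> float:
    """Mean absolute pixel-wise intensity difference between two equal-size frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    return total / count if count else 0.0


def shift_right(frame: Frame, dx: int, fill: int = 255) -> List[List[int]]:
    """Shift every row right by `dx` pixels, padding the left edge with `fill`."""
    return [[fill] * dx + list(row[: len(row) - dx]) for row in frame]


# Synthetic "slide": vertical dark strokes two pixels wide on a white background.
slide = [[0 if (x // 2) % 2 == 0 else 255 for x in range(16)] for _ in range(8)]

# The same content, shifted by two pixels as if the camera had moved.
shifted = shift_right(slide, 2)

unchanged_score = mean_abs_difference(slide, slide)
shifted_score = mean_abs_difference(slide, shifted)
```

In this toy case the shifted frame differs from the original at every pixel even though the slide content is unchanged, so any threshold on `shifted_score` that detects real changes would also flag the shift as a scene change.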
Another way to extract text change frames from video is to perform a text extraction method on every frame in the video and judge whether the content has changed. The problem with this strategy is that it is very time consuming.
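The brute-force strategy described above can be sketched as follows. The sketch is illustrative only: `extract_text` is a hypothetical caller-supplied function standing in for a full text extraction and recognition step, and the rule that frames without text are never reported is an assumption made for this example.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

FrameT = TypeVar("FrameT")


def brute_force_text_change_frames(
    frames: Sequence[FrameT],
    extract_text: Callable[[FrameT], str],
) -> List[int]:
    """Return indices of frames whose extracted text differs from the previous frame's.

    `extract_text` is called once for every frame, which is what makes
    this strategy expensive on long videos.
    """
    changes: List[int] = []
    previous: Optional[str] = None
    for index, frame in enumerate(frames):
        text = extract_text(frame)  # costly step, executed for every single frame
        if text and text != previous:
            changes.append(index)
        previous = text
    return changes
```

The cost grows linearly with the number of frames, and each step carries the full price of text extraction, whereas only a small fraction of frames in a lecture video typically mark an actual change of text.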
After the text change frames are detected, a text extraction method should be used to extract the text lines from the frames. Many methods have been proposed to extract text lines from video and static images, such as:
V. Wu, R. Manmatha, and E. M. Riseman, “TextFinder: An Automatic System to Detect and Recognize Text in Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 21, No. 11, pp. 1224-1229, November 1999.
T. Sato, T. Kanade, E. Hughes, M. Smith, and S. Satoh, “Video OCR: Indexing Digital News Libraries by Recognition of Superimposed Captions,” ACM Multimedia Systems Special Issue on Video Libraries, February 1998.
Also, some patents related to this field have been published, such as U.S. Pat. Nos. 6,366,699, 5,465,304, and 5,307,422.
These methods run into problems when dealing with video frames in e-Learning. The characters in e-Learning video images are typically very small, their boundaries are very dim, and there are many disturbances around the text area, such as the bounding box of a text line and the shading and occlusion caused by a human body.
However, the above-mentioned conventional video image processing has the following problems.
It is very time consuming to perform a text extraction method on every frame in the video and judge whether the content has changed.
The characters in e-Learning video images are typically very small, their boundaries are very dim, and there are many disturbances around the text area. Therefore, the conventional text extraction methods leave many false character strokes in the final binary image, which lead to wrong recognition results in the subsequent OCR stage.