Digital still cameras and digital video cameras permit imagery to be stored and displayed for human viewing. However, the captured digital imagery contains information that, if automatically extracted, could be used for other purposes. Information about real-world scenery imaged by the camera, such as text appearing in the scene, could then be processed and/or disseminated by new computing-based applications.
Additionally, the volume of collected video data is expanding at a tremendous rate. A capability to automatically characterize the contents of video imagery would enable video data to be indexed in a convenient and meaningful way for later reference, and would enable actions (such as automatic notification and dissemination) to be triggered in real time by the contents of streaming video. Methods of realizing this capability that rely on the automated recognition of objects and scenes directly in the imagery have had limited success because (1) scenes may be arbitrarily complex and may contain almost anything, and (2) the appearance of individual objects may vary greatly with lighting, point of view, etc. It has been noted that the recognition of text is easier than the recognition of objects in an arbitrarily complex scene, because text was designed to be readable and has a regular form that humans can easily interpret.
Research in text recognition, for both printed documents and other sources of imagery, has generally assumed that the text lies in a plane oriented roughly perpendicular to the optical axis of the camera. However, text such as street signs, name plates, and billboards appearing in captured video imagery often lies in a plane oriented at an oblique angle to the optical axis, and therefore may not be recognized accurately by conventional optical character recognition (OCR) methods. Therefore, a need exists in the art for an apparatus and method that take advantage of 3-D scene geometry to detect the orientation of the plane on which text is printed, thereby improving text detection and extraction.
Additionally, in video data of text appearing in real-world scenery, the text usually persists in the scene for some length of time, and therefore appears in multiple video frames. Digitized video frames of the same scene may vary slightly, thereby causing an OCR process operating on individual frames to produce slightly different results. Therefore, a need exists for a method that combines OCR results from multiple images to produce a single, accurate result for a single instance of text.
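As an illustration only, the idea of combining per-frame OCR outputs into a single result can be sketched as a simple character-position vote. This is a minimal sketch under stated assumptions, not the method of the present disclosure: the function name `combine_ocr_results` is hypothetical, the per-frame readings are assumed to be plain strings for the same instance of text, and readings whose length differs from the most common length are simply discarded rather than aligned.

```python
from collections import Counter


def combine_ocr_results(frame_results):
    """Return a consensus string from per-frame OCR outputs.

    Hypothetical helper: assumes every entry in frame_results is an OCR
    reading of the same instance of text taken from a different video frame.
    """
    if not frame_results:
        return ""

    # Keep only readings of the most common length so that characters can be
    # compared position by position (an assumption made for simplicity).
    modal_length = Counter(len(r) for r in frame_results).most_common(1)[0][0]
    aligned = [r for r in frame_results if len(r) == modal_length]

    # Vote independently at each character position and keep the winner.
    return "".join(
        Counter(chars).most_common(1)[0][0] for chars in zip(*aligned)
    )


# Example with made-up readings of one street sign across five frames.
readings = ["MAIN ST", "MA1N ST", "MAIN 5T", "MAIN ST", "MAINST"]
consensus = combine_ocr_results(readings)
```

A positional vote of this kind corrects isolated character substitutions that differ from frame to frame, but it does not handle inserted or dropped characters; a fuller approach would need to align the readings before voting.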