The present invention relates to systems that recognize patterns in digitized images and more particularly to such systems that isolate symbols such as text characters in video data streams.
Real-time broadcast, analog tape, and digital video are important for education, entertainment, and a host of multimedia applications. With video collections now running to millions of hours, technology is needed to interpret video data so that this material can be used and accessed more effectively. Various such enhanced uses have been proposed. For example, the use of text and sound recognition can lead to the creation of a synopsis of an original video and the automatic generation of keys for indexing video content. Another range of applications relies on rapid real-time classification of text and/or other symbols in broadcast (or multicast, etc.) video data streams. For example, text recognition can be used for video content indexing, among other purposes.
Various text recognition techniques have been used to recognize digitized patterns. The most common example is document optical character recognition (OCR). The general model for all of these techniques is that an input vector is derived from an image, the input vector characterizing the raw pattern. The vector is mapped to one of a fixed number or range of symbol classes to “recognize” the image. For example, the pixel values of a bitmap image may serve as an input vector and the corresponding classification set may be an alphabet, for example, the English alphabet. No particular technique for pattern recognition has achieved universal dominance. Each recognition problem has its own set of application difficulties: the size of the classification set, the size of the input vector, the required speed and accuracy, and other issues. Also, reliability is an area that cries out for improvement in nearly every area of application.
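The general model described above (bitmap, input vector, symbol class) can be illustrated with a minimal sketch. It assumes a nearest-template classifier using a pixel-mismatch count, chosen only for brevity; it is not the classifier of the invention, and the names `to_input_vector`, `classify`, and `TEMPLATES` as well as the 3x3 templates are hypothetical.

```python
from typing import Dict, List, Sequence

Bitmap = Sequence[Sequence[int]]


def to_input_vector(bitmap: Bitmap) -> List[int]:
    """Flatten a bitmap row by row into an input vector of pixel values."""
    return [pixel for row in bitmap for pixel in row]


def classify(vector: Sequence[int], templates: Dict[str, Sequence[int]]) -> str:
    """Map the input vector to the class whose template differs in the fewest pixels."""
    def distance(label: str) -> int:
        return sum(a != b for a, b in zip(vector, templates[label]))
    return min(templates, key=distance)


# Hypothetical two-symbol classification set (assumption, for illustration only).
TEMPLATES = {
    "I": to_input_vector([[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    "L": to_input_vector([[1, 0, 0], [1, 0, 0], [1, 1, 1]]),
}

# An "L" with one corrupted pixel is still mapped to the class "L".
noisy_l = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]
label = classify(to_input_vector(noisy_l), TEMPLATES)
```

Note that the input vector here grows with the bitmap area and the classification set with the alphabet, which are two of the application difficulties named above.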
As a result of the foregoing shortcomings, pattern recognition is a field of continuous active research, the various applications receiving varying degrees of attention based on their respective perceived merits, such as utility and practicability. Probably the most mature of these technologies is the application of pattern recognition to text characters, or optical character recognition (OCR). This technology has developed because of the desirability and practicality of converting printed subject matter to computer-readable characters. From a practicality standpoint, printed documents offer a data source that is relatively clear and consistent. Such documents are generally characterized by high-contrast patterns set against a uniform background and are storable with high resolution. For example, printed documents may be scanned at arbitrary resolution to form a binary image of the printed characters. Also, there is a clear need for such an application of pattern recognition in that the conversion of documents to computer-based text avoids the labor of keyboard transcription, realizes economy in data storage, permits documents to be searched, etc.
Some application areas have received scant attention because of the attending difficulty of performing symbol or character classification. For example, the recognition of patterns in video streams is difficult due to at least the following factors. Characters in a video stream tend to be presented against spatially non-uniform (and sometimes temporally variable) backgrounds, with poor resolution and low contrast. Recognizing characters in a video stream is therefore difficult, and no reliable methods are known. In addition, for some applications, as disclosed in the foregoing related applications at least, fast recognition speeds are highly desirable.
Systems and methods for indexing and classifying video have been described in numerous publications, including: M. Abdel-Mottaleb et al., “CONIVAS: Content-based Image and Video Access System,” Proceedings of ACM Multimedia, pp. 427-428, Boston (1996); S-F. Chang et al., “VideoQ: An Automated Content Based Video Search System Using Visual Cues,” Proceedings of ACM Multimedia, pp. 313-324, Seattle (1994); M. Christel et al., “Informedia Digital Video Library,” Comm. of the ACM, Vol. 38, No. 4, pp. 57-58 (1995); N. Dimitrova et al., “Video Content Management in Consumer Devices,” IEEE Transactions on Knowledge and Data Engineering (November 1998); U. Gargi et al., “Indexing Text Events in Digital Video Databases,” International Conference on Pattern Recognition, Brisbane, pp. 916-918 (August 1998); M. K. Mandal et al., “Image Indexing Using Moments and Wavelets,” IEEE Transactions on Consumer Electronics, Vol. 42, No. 3 (August 1996); and S. Pfeiffer et al., “Abstracting Digital Movies Automatically,” Journal on Visual Communications and Image Representation, Vol. 7, No. 4, pp. 345-353 (1996).
The extraction of characters by a method that uses local thresholding, and the detection of image regions containing characters by evaluating gray-level differences between adjacent regions, have been described in “Recognizing Characters in Scene Images,” Ohya et al., IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 16, pp. 214-224 (February 1994). Ohya et al. further disclose the merging of detected regions having close proximity and similar gray levels in order to generate character pattern candidates.
Using the spatial context and high contrast characteristics of video text to merge regions with horizontal and vertical edges in close proximity to one another in order to detect text has been described in “Text, Speech, and Vision for Video Segmentation: The Informedia Project,” by A. Hauptmann et al., AAAI Fall 1995 Symposium on Computational Models for Integrating Language and Vision (1995). R. Lienhart and F. Suber discuss a non-linear color system for reducing the number of colors in a video image in “Automatic Text Recognition for Video Indexing,” SPIE Conference on Image and Video Processing (January 1996). The reference describes a split-and-merge process to produce homogeneous segments having similar color. Lienhart and Suber use various heuristic methods to detect characters in homogeneous regions, including foreground characters, monochrome or rigid characters, size-restricted characters, and characters having high contrast in comparison to surrounding regions.
The use of multi-valued image decomposition for locating text and separating images into multiple real foreground and background images is described in “Automatic Text Location in Images and Video Frames,” by A. K. Jain and B. Yu, Proceedings of IEEE Pattern Recognition, pp. 2055-2076, Vol. 31 (Nov. 12, 1998). J-C. Shim et al. describe using a generalized region-labeling algorithm to find homogeneous regions and to segment and extract text in “Automatic Text Extraction from Video for Content-Based Annotation and Retrieval,” Proceedings of the International Conference on Pattern Recognition, pp. 618-620 (1998). Identified foreground images are clustered in order to determine the color and location of text.
Other useful algorithms for image segmentation are described by K. V. Mardia et al. in “A Spatial Thresholding Method for Image Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 10, pp. 919-927 (1988), and by A. Perez et al. in “An Iterative Thresholding Method for Image Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 9, pp. 742-751 (1987).
Various techniques for locating text in a digitized bitmap are known. Also known are techniques for binarizing character data to form an image that can be characterized as black-on-white and for performing character recognition on bitmap images. Text and other patterns in video streams range from the predictable, large, and clear, which are easy to classify, to the crude, fleeting, and unpredictably oriented and positioned, which contain insufficient information, even in principle, to classify without assistance from auxiliary contextual data. There is also ongoing research to increase recognition speed as well as accuracy. Therefore, there is room for improvement in the current state of the art, particularly where the application, such as video stream data, strains current technology.
Briefly, an image processing device and method for classifying symbols, such as text, in a video stream employs a back propagation neural network (BPNN) whose feature space is derived from size-, translation-, and rotation-invariant shape-dependent features. Various example feature spaces are discussed, such as regular and invariant moments and an angle histogram derived from a Delaunay triangulation of a thinned, thresholded symbol. Such feature spaces provide a good match to a BPNN as a classifier because of the poor resolution of characters in video streams. The shape-dependent feature spaces are made practicable by the accurate isolation of character regions using the technique described in the current application.
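As an illustration of an invariant-moment feature space, the following minimal sketch computes two moment invariants from a binary symbol bitmap. It assumes the first two of Hu's classical invariants built from normalized central moments; the text above does not state which invariant moments are used, so this choice, and the names `invariant_moments`, `rotate90`, `shift`, and `GLYPH`, are assumptions for illustration only.

```python
import math
from typing import List, Sequence, Tuple

Bitmap = Sequence[Sequence[int]]


def invariant_moments(bitmap: Bitmap) -> Tuple[float, float]:
    """Return two moment invariants (Hu's phi1 and phi2) of the foreground pixels."""
    points = [(x, y) for y, row in enumerate(bitmap)
              for x, value in enumerate(row) if value]
    if not points:
        raise ValueError("bitmap contains no foreground pixels")
    m00 = float(len(points))
    # Measuring from the centroid gives translation invariance.
    cx = sum(x for x, _ in points) / m00
    cy = sum(y for _, y in points) / m00

    def eta(p: int, q: int) -> float:
        # Central moment, normalized by area to reduce sensitivity to size.
        mu = sum((x - cx) ** p * (y - cy) ** q for x, y in points)
        return mu / m00 ** (1 + (p + q) / 2)

    eta20, eta02, eta11 = eta(2, 0), eta(0, 2), eta(1, 1)
    phi1 = eta20 + eta02
    phi2 = (eta20 - eta02) ** 2 + 4 * eta11 ** 2
    return phi1, phi2


def rotate90(bitmap: Bitmap) -> List[List[int]]:
    """Rotate a rectangular bitmap by 90 degrees clockwise."""
    return [list(row) for row in zip(*bitmap[::-1])]


def shift(bitmap: Bitmap, dx: int, dy: int) -> List[List[int]]:
    """Translate a bitmap by padding dx columns on the left and dy rows on top."""
    width = len(bitmap[0]) + dx
    padded = [[0] * dx + list(row) for row in bitmap]
    return [[0] * width for _ in range(dy)] + padded


# Hypothetical "L"-shaped symbol used only to exercise the functions.
GLYPH = [[1, 0, 0],
         [1, 0, 0],
         [1, 0, 0],
         [1, 1, 1]]

base = invariant_moments(GLYPH)
rotated = invariant_moments(rotate90(GLYPH))
shifted = invariant_moments(shift(GLYPH, 3, 2))
```

Here `base`, `rotated`, and `shifted` agree to within floating-point rounding, so the same feature values would be presented to a classifier regardless of where the symbol sits in the frame or whether it is turned by a quarter turn. On a discrete pixel grid, invariance to size and to arbitrary rotation angles holds only approximately.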
The ability to detect and classify text appearing in video streams has many uses. For example, video sequences, and portions thereof, can be characterized and indexed according to classifications derived from such text. This can lead to indexing, enhanced search capabilities, annotation features, etc. In addition, recognition of text in a video stream can permit the presentation of context-sensitive features such as an invokable link to a web site generated in response to the appearance of a web address in a broadcast video stream.
Text in video presents a very different problem set from that of document OCR, which is a well-developed, but still maturing, technology. Text in documents tends to be uni-colored and of high quality. In video, scaled-down scene images may contain noise and uncontrolled illumination. Characters appearing in video can vary in color, size, font, orientation, and thickness, and their backgrounds can be complex and temporally variant. Also, many applications for video symbol recognition require high speed.
To classify video text, the invention employs an accurate, high-speed technique for symbol isolation. The symbol bitmap is then used to generate a shape-dependent feature vector, which is applied to a BPNN. The feature vector provides greater emphasis on overall image shape while being relatively insensitive to the variability problems identified above. In the technique for isolating character regions, connected component structures are defined based on the edges detected. Since edge detection produces far fewer pixels overall than binarizing the entire field occupied by a symbol, the process of generating connected components can be much more rapid. The selection of feature space also enhances recognition speed. With simulated BPNNs, the size of the input vector can seriously affect throughput. It is therefore very important to be selective with regard to the components used from the selected feature space. Of course, heterogeneous feature spaces may be formed by combining mixes of different features such as moments and line-segment features. Also, computational economies may be realized where the selected features share computational steps.
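The idea of building connected components from edge pixels only can be sketched as follows. This is a minimal illustration under stated assumptions, not the isolation technique of the invention: the edge detector (a thresholded difference to the right and lower neighbors), the threshold value, the use of 8-connectivity, and the names `edge_pixels` and `connected_components` are all hypothetical.

```python
from collections import deque
from typing import List, Sequence, Set, Tuple

Pixel = Tuple[int, int]


def edge_pixels(gray: Sequence[Sequence[int]], threshold: int) -> Set[Pixel]:
    """Return (x, y) pixels whose right or lower neighbor differs by more than threshold."""
    height, width = len(gray), len(gray[0])
    edges: Set[Pixel] = set()
    for y in range(height):
        for x in range(width):
            dx = abs(gray[y][x] - gray[y][x + 1]) if x + 1 < width else 0
            dy = abs(gray[y][x] - gray[y + 1][x]) if y + 1 < height else 0
            if max(dx, dy) > threshold:
                edges.add((x, y))
    return edges


def connected_components(pixels: Set[Pixel]) -> List[List[Pixel]]:
    """Group pixels into 8-connected components with a breadth-first search."""
    remaining = set(pixels)
    components: List[List[Pixel]] = []
    while remaining:
        seed = remaining.pop()
        component = [seed]
        queue = deque([seed])
        while queue:
            x, y = queue.popleft()
            for nx in (x - 1, x, x + 1):
                for ny in (y - 1, y, y + 1):
                    if (nx, ny) in remaining:
                        remaining.remove((nx, ny))
                        component.append((nx, ny))
                        queue.append((nx, ny))
        components.append(component)
    return components


# Hypothetical frame: two bright 2x2 blobs on a dark 12x5 background.
image = [[0] * 12 for _ in range(5)]
for y in (1, 2):
    for x in (1, 2, 8, 9):
        image[y][x] = 255

edges = edge_pixels(image, threshold=100)
components = connected_components(edges)
```

The search visits only the pixels in `edges`, so its cost scales with the number of edge pixels rather than with the frame area. In this toy frame the two blobs yield two separate components, each of which could serve as a candidate symbol region.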
The invention will be described in connection with certain preferred embodiments, with reference to the following illustrative figures so that it may be more fully understood. With reference to the figures, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention only, and are presented in the cause of providing what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention, the description taken with the drawings making apparent to those skilled in the art how the several forms of the invention may be embodied in practice.