The present invention is directed, in general, to video processing systems and, more specifically, to a system for analyzing and characterizing a video stream based on the attributes of text detected in the content of the video.
The advent of digital television (DTV), the increasing popularity of the Internet, and the introduction of consumer multimedia electronics, such as compact disc (CD) and digital video disc (DVD) players, have made tremendous amounts of multimedia information available to consumers. As video content becomes readily available and products for accessing it reach the consumer market, searching, indexing and identifying large volumes of multimedia data becomes even more challenging and important.
Systems and methods for indexing and classifying video have been described in numerous publications, including: M. Abdel-Mottaleb et al., “CONIVAS: Content-based Image and Video Access System,” Proceedings of ACM Multimedia, pp. 427-428, Boston (1996); S-F. Chang et al., “VideoQ: An Automated Content Based Video Search System Using Visual Cues,” Proceedings of ACM Multimedia, pp. 313-324, Seattle (1994); M. Christel et al., “Informedia Digital Video Library,” Comm. of the ACM, Vol. 38, No. 4, pp. 57-58 (1995); N. Dimitrova et al., “Video Content Management in Consumer Devices,” IEEE Transactions on Knowledge and Data Engineering (Nov. 1998); U. Gargi et al., “Indexing Text Events in Digital Video Databases,” International Conference on Pattern Recognition, Brisbane, pp. 916-918 (Aug. 1998); M. K. Mandal et al., “Image Indexing Using Moments and Wavelets,” IEEE Transactions on Consumer Electronics, Vol. 42, No. 3 (Aug. 1996); and S. Pfeiffer et al., “Abstracting Digital Movies Automatically,” Journal on Visual Communications and Image Representation, Vol. 7, No. 4, pp. 345-353 (1996).
The detection of advertising commercials in a video stream is also an active research area. See R. Lienhart et al., “On the Detection and Recognition of Television Commercials,” Proceedings of IEEE International Conference on Multimedia Computing and Systems, pp. 509-516 (1997); and T. McGee et al., “Parsing TV Programs for Identification and Removal of Non-Story Segments,” SPIE Conference on Storage and Retrieval in Image and Video Databases, San Jose (Jan. 1999).
Recognition of text in document images is well known in the art. Document scanners and associated optical character recognition (OCR) software are widely available and well understood. However, detection and recognition of text in video frames presents unique problems and requires a very different approach than does text in printed documents. Text in printed documents is usually restricted to single-color characters on a uniform background (plain paper) and generally requires only a simple thresholding algorithm to separate the text from the background. By contrast, characters in scaled-down video images suffer from a variety of noise components, including uncontrolled illumination conditions. Also, the background frequently moves and text characters may be of different colors, sizes and fonts.
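The simple thresholding mentioned above for printed documents can be illustrated with a minimal sketch. It assumes an 8-bit grayscale image stored as rows of pixel values, dark text on a light background, and a fixed cutoff of 128; the function name and the cutoff are illustrative assumptions, not taken from any cited system.

```python
from typing import List

Image = List[List[int]]  # grayscale rows, pixel values 0-255 (assumed layout)


def threshold_text(image: Image, threshold: int = 128) -> List[List[bool]]:
    """Mark every pixel darker than `threshold` as text (True)."""
    return [[pixel < threshold for pixel in row] for row in image]


# A tiny "scanned page": dark strokes (values near 10) on light paper.
page = [
    [250, 248, 251, 249],
    [247, 12, 15, 250],
    [249, 10, 252, 248],
]
mask = threshold_text(page)
```

A single global cutoff of this kind works only because the background is uniform. With moving backgrounds and uncontrolled illumination, as in video frames, no single cutoff separates text from background reliably.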
The extraction of characters by local thresholding and the detection of image regions containing characters by evaluating gray-level differences between adjacent regions has been described in “Recognizing Characters in Scene Images,” Ohya et al., IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 16, pp. 214-224 (Feb. 1994). Ohya et al. further disclose the merging of detected regions having close proximity and similar gray levels in order to generate character pattern candidates.
Using the spatial context and high contrast characteristics of video text to merge regions with horizontal and vertical edges in close proximity to one another in order to detect text has been described in “Text, Speech, and Vision for Video Segmentation: The Informedia Project,” by A. Hauptmann et al., AAAI Fall 1995 Symposium on Computational Models for Integrating Language and Vision (1995). R. Lienhart and F. Suber discuss a non-linear red, green, and blue (RGB) color system for reducing the number of colors in a video image in “Automatic Text Recognition for Video Indexing,” SPIE Conference on Image and Video Processing (Jan. 1996). A subsequent split-and-merge process produces homogeneous segments having similar color. Lienhart and Suber use various heuristic methods to detect characters in homogeneous regions, including foreground characters, monochrome or rigid characters, size-restricted characters, and characters having high contrast in comparison to surrounding regions.
Using multi-valued image decomposition for locating text and separating images into multiple real foreground and background images is described in “Automatic Text Location in Images and Video Frames,” by A. K. Jain and B. Yu, Proceedings of IEEE Pattern Recognition, pp. 2055-2076, Vol. 31 (Nov. 12, 1998). J-C. Shim et al. describe using a generalized region-labeling algorithm to find homogeneous regions and to segment and extract text in “Automatic Text Extraction from Video for Content-Based Annotation and Retrieval,” Proceedings of the International Conference on Pattern Recognition, pp. 618-620 (1998). Identified foreground images are clustered in order to determine the color and location of text.
Other useful algorithms for character segmentation are described by K. V. Mardia et al. in “A Spatial Thresholding Method for Image Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 10, pp. 919-927 (1988), and by A. Perez et al. in “An Iterative Thresholding Method for Image Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 9, pp. 742-751 (1987).
The prior art text-recognition systems do not take into account, however, the non-semantic attributes of text detected in the content of the video. The prior art systems simply identify the semantic content of the image text and index the video clips based on the semantic content. Other attributes of the image text, such as physical location in the frame, duration, movement, and/or temporal location in a program are ignored. Additionally, no attempt has been made to use video content to identify and edit video clips.
There is therefore a need in the art for improved video processing systems that enable a user to search through an archive of video clips and to selectively save and/or edit all or portions of video clips that contain image text attributes that match image text attributes selected by a user.
To address the above-discussed deficiencies of the prior art, the present invention discloses a video processing device for searching or filtering video streams for one or more user-selected image text attributes. Generally, “searching” video streams refers to searching in response to user-defined inputs, whereas “filtering” generally refers to an automated process that requires little or no user input. However, in the disclosure, “searching” and “filtering” may be used interchangeably. An image processor detects and extracts image text from frames in video clips, determines the relevant attributes of the extracted image text, and compares the extracted image text attributes and the user-selected image text attributes. If a match occurs, the video processing device may modify, transfer, label or otherwise identify at least a portion of the video stream in accordance with user commands. The video processing device uses the user-selected image text attributes to search through an archive of video clips to 1) locate particular types of events, such as news programs or sports events; 2) locate programs featuring particular persons or groups; 3) locate programs by name; 4) save or remove all or some commercials, and to otherwise sort, edit, and save all of, or portions of, video clips according to image text that appears in the frames of the video clips.
It is a primary object of the present invention to provide, for use in a system capable of analyzing image text in video frames, a video processing device capable of searching and/or filtering video streams in response to receipt of at least one selected image text attribute. In an exemplary embodiment, the video processing device comprises an image processor capable of receiving a first video stream comprising a plurality of video frames, detecting and extracting image text from the plurality of video frames, determining at least one attribute of the extracted image text, comparing the at least one extracted image text attribute and the at least one selected image text attribute, and, in response to a match between the at least one extracted image text attribute and the at least one selected image text attribute, at least one of: 1) modifying at least a portion of the first video stream in accordance with a first user command; 2) transferring at least a portion of the first video stream in accordance with a second user command; and 3) labeling at least a portion of the first video stream in accordance with a third user command.
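The compare-and-act step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: image text attributes are assumed to have been extracted already and are represented as plain key/value pairs, and the names `Clip`, `matches`, and `label_matching_clips` are hypothetical. Only the labeling action is shown; modifying or transferring a portion of the stream would hook in at the same point.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Attributes = Dict[str, str]  # hypothetical representation of text attributes


@dataclass
class Clip:
    name: str
    text_events: List[Attributes]            # attributes per detected text event
    labels: List[str] = field(default_factory=list)


def matches(extracted: Attributes, selected: Attributes) -> bool:
    """True if every user-selected attribute equals the extracted value."""
    return all(extracted.get(key) == value for key, value in selected.items())


def label_matching_clips(clips: List[Clip], selected: Attributes, label: str) -> List[str]:
    """Label each clip having at least one matching text event; return their names."""
    labeled = []
    for clip in clips:
        if any(matches(event, selected) for event in clip.text_events):
            clip.labels.append(label)
            labeled.append(clip.name)
    return labeled


archive = [
    Clip("clip_a", [{"motion": "scroll_vertical", "position": "center"}]),
    Clip("clip_b", [{"motion": "static", "position": "bottom"}]),
]
hits = label_matching_clips(archive, {"motion": "scroll_vertical"}, "credits")
```

In this sketch a clip matches when any one of its text events carries all of the selected attributes; other matching policies (for example, requiring a minimum number of matching events) are equally possible.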
According to an exemplary embodiment of the present invention, the at least one extracted image text attribute indicates that the image text in the plurality of video frames is one of: scrolling horizontally; scrolling vertically; fading; special effects; and animation effects.
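One way a scrolling attribute might be derived is from the position of a text region's bounding box across consecutive frames. The sketch below is an assumption for illustration only: it takes the box centers as already tracked, uses a hypothetical `min_shift` of 2 pixels to ignore jitter, and does not attempt to detect fading, special effects, or animation effects.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) center of a text bounding box in one frame


def classify_motion(centers: List[Point], min_shift: float = 2.0) -> str:
    """Classify text motion from its first and last tracked positions."""
    if len(centers) < 2:
        return "static"
    dx = centers[-1][0] - centers[0][0]
    dy = centers[-1][1] - centers[0][1]
    # Displacements below min_shift on both axes are treated as jitter.
    if abs(dx) < min_shift and abs(dy) < min_shift:
        return "static"
    return "scroll_horizontal" if abs(dx) >= abs(dy) else "scroll_vertical"


credits_track = [(100.0, 400.0), (100.0, 380.0), (100.0, 360.0)]
ticker_track = [(300.0, 50.0), (280.0, 50.0)]
caption_track = [(10.0, 10.0), (10.5, 10.0)]
```

Here `classify_motion(credits_track)` yields a vertical scroll, `classify_motion(ticker_track)` a horizontal scroll, and `classify_motion(caption_track)` a static result.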
According to one embodiment of the present invention, the at least one extracted image text attribute indicates that the image text in the plurality of video frames is one of: a name of a person; and a name of a group.
According to another embodiment of the present invention, the at least one extracted image text attribute indicates that the image text in the plurality of video frames is part of a commercial advertisement.
According to still another embodiment of the present invention, the at least one extracted image text attribute indicates that the image text in the plurality of video frames is text appearing at one of: a start of a program; and an end of a program.
According to yet another embodiment of the present invention, the at least one extracted image text attribute indicates that the image text in the plurality of video frames is part of a program name.
According to a further embodiment of the present invention, the at least one extracted image text attribute indicates that the image text in the plurality of video frames is part of a news program.
According to a still further embodiment of the present invention, the at least one extracted image text attribute indicates that the image text in the plurality of video frames is part of a sports program.
The foregoing has outlined rather broadly the features and technical advantages of the present invention so that those skilled in the art may better understand the detailed description of the invention that follows. Additional features and advantages of the invention will be described hereinafter that form the subject of the claims of the invention. Those skilled in the art should appreciate that they may readily use the conception and the specific embodiment disclosed as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the invention in its broadest form.
Before undertaking the DETAILED DESCRIPTION, it may be advantageous to set forth definitions of certain words and phrases used throughout this patent document: the terms “include” and “comprise,” as well as derivatives thereof, mean inclusion without limitation; the term “or” is inclusive, meaning and/or; the phrases “associated with” and “associated therewith,” as well as derivatives thereof, may mean to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, or the like; and the term “processor” or “controller” means any device, system or part thereof that controls at least one operation, where such a device may be implemented in hardware, firmware or software, or some combination of at least two of the same. It should be noted that the functionality associated with any particular controller may be centralized or distributed, whether locally or remotely. Additionally, the term “video clip” may mean a video segment, a video sequence, video content, or the like. Definitions for certain words and phrases are provided throughout this patent document. Those of ordinary skill in the art should understand that in many, if not most, instances, such definitions apply to prior, as well as future, uses of such defined words and phrases.