The invention relates generally to a system for processing images and, more particularly, to an apparatus and a concomitant method for identifying and using region(s) of interest to provide functionalities such as zooming, composition, selective input image formation and adaptive allocation of processing resources, e.g., bit allocation.
An image sequence, such as a video image sequence, typically includes a sequence of image frames or pictures. The reproduction of video containing moving objects typically requires a frame rate of thirty image frames per second, with each frame possibly containing in excess of a megabyte of information. Consequently, transmitting or storing such image sequences requires a large amount of either transmission bandwidth or storage capacity. To reduce the necessary transmission bandwidth or storage capacity, the frame sequence undergoes image processing, e.g., compression, such that redundant information within the sequence is not stored or transmitted. Television, video conferencing and CD-ROM archiving are examples of applications that can benefit from efficient video sequence encoding.
Additionally, in an image processing environment where processing resources are limited or constrained by the requirements of a particular application, it is necessary to carefully allocate the available resources. Namely, although many powerful image processing methods are available, such methods are often impractical or must be applied sparingly and selectively to meet application requirements.
For example, in a real-time application such as videophone or video conferencing, the talking person's face is typically one of the most important parts of an image sequence. The ability to detect and exploit such regions of importance will greatly enhance an encoding system.
For example, the encoding system in a low bitrate application (e.g., real-time application) must efficiently allocate limited bits to address various demands, e.g., allocating bits to code motion information, allocating bits to code texture information, allocating bits to code shape information, allocating bits to code header information and so on. At times, it may be necessary to allocate available bits such that one parameter will benefit at the expense of another parameter, e.g., spending more bits to provide accurate motion information at the expense of spending fewer bits to provide texture information. Without information as to which regions in a current frame are particularly important, i.e., deserving of more bits from a limited bit pool, the encoder may not allocate the available bits in the most efficient manner.
Furthermore, although the encoder may have additional resources to dedicate to identified regions of importance, it is often still unable to improve these regions beyond the quality of the existing input image sequence. Namely, changing the encoding parameters of the encoder cannot increase the quality of the regions of importance beyond what is presented to the encoder.
Therefore, there is a need in the art for an apparatus and a concomitant method for classifying regions of interest in an image, based on the relative “importance” of the various areas, and to adaptively use the importance information to allocate processing resources and to control manipulation of the input image sequence prior to encoding.
An embodiment of the present invention is an apparatus and method for classifying regions of an image as important or region(s) of interest. The parameters that contribute to such classification may initially be derived from a block classifier that detects the presence of facial blocks, edge blocks and motion blocks. Such detected blocks can be deemed important blocks and are then collected and represented in an “importance map” or “class map”.
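By way of illustration only, the collection of detected blocks into an importance map might be sketched as follows. The sketch assumes that the block classifier has already produced per-block detection flags; the names `BlockFlags` and `build_importance_map`, and the binary (0 or 1) representation of importance, are hypothetical and are not taken from the described embodiments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockFlags:
    """Hypothetical per-block output of a block classifier."""
    face: bool = False
    edge: bool = False
    motion: bool = False


def build_importance_map(flags: List[List[BlockFlags]]) -> List[List[int]]:
    """Mark a block as important (1) if any detector fired, else 0.

    This is one assumed combination rule; a real classifier could
    instead weight the detectors or emit more than two classes.
    """
    return [
        [1 if (b.face or b.edge or b.motion) else 0 for b in row]
        for row in flags
    ]


# A 2x3 grid of blocks: one facial block, one motion block, the rest plain.
grid = [
    [BlockFlags(face=True), BlockFlags(), BlockFlags()],
    [BlockFlags(), BlockFlags(motion=True), BlockFlags()],
]
importance_map = build_importance_map(grid)
```

A binary map is the simplest form such a representation could take; a multi-level map would allow finer gradations of importance at the cost of a more elaborate classifier.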
Additionally, other parameters can be used in the generation or refinement of the importance map. Namely, a voice detector can be employed to detect and associate a voice to a speaker in the image sequence, thereby classifying the region in the image that encompasses the identified speaker as important or a region of interest. Furthermore, additional importance information may include user defined importance information, e.g., interactive inputs from a user who is viewing the decoded images.
Once the importance information is made available, the present invention allocates processing resources in accordance with the importance information. For example, more bits are allocated to “important” regions as compared to the less “important” regions; more motion processing is applied to “important” regions; coding modes are changed for “important” regions; and/or segmentation processing is refined for “important” regions as well.
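As a non-limiting sketch of importance-driven bit allocation, a fixed per-frame bit budget could be divided among blocks in proportion to a weight derived from the importance map. The function name `allocate_bits`, the 4:1 weighting of important to unimportant blocks, and the rule for distributing rounding leftovers are all assumptions made for illustration; the text above does not specify a particular allocation rule.

```python
from typing import List


def allocate_bits(importance_map: List[List[int]],
                  frame_budget: int,
                  important_weight: int = 4) -> List[List[int]]:
    """Split frame_budget bits across blocks, favoring important blocks.

    Each important block (map value 1) gets important_weight shares and
    every other block gets one share. Integer arithmetic keeps the total
    exactly equal to frame_budget.
    """
    weights = [important_weight if v else 1
               for row in importance_map for v in row]
    if not weights:
        return []
    total = sum(weights)
    shares = [frame_budget * w // total for w in weights]

    # Bits lost to rounding down go to the highest-weight blocks first.
    leftover = frame_budget - sum(shares)
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    for i in order[:leftover]:
        shares[i] += 1

    # Restore the two-dimensional block layout.
    out, pos = [], 0
    for row in importance_map:
        out.append(shares[pos:pos + len(row)])
        pos += len(row)
    return out


allocation = allocate_bits([[1, 0], [0, 0]], frame_budget=700)
```

With these assumed weights, the single important block in the example receives four times the bits of each remaining block. An actual encoder would typically realize such a split indirectly, for instance through per-block quantizer selection.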
In another embodiment, the formation of the input image sequence is also accomplished in accordance with the importance information. Namely, a higher resolution for the identified regions of interest is acquired from a higher quality source, e.g., directly from an NTSC signal, to form the input image sequence prior to encoding. Such input image sequence formation allows functionalities such as zooming and composition. Thus, the relative “importance” of the various areas of a frame is rapidly classified and used in resource allocation and input image formation.