Conventionally, when a video camera records video, the video is recorded as a plurality of sequential still images. Each still image comprises a plurality of pixels. For example, a single high definition television image may have a matrix of 2048×1152 pixels, or over two million pixels. Each pixel contains data having parameters, non-limiting examples of which include color and luminance values. A color parameter may determine, for example, what amount of red, green and/or blue (RGB) is associated with a particular pixel. These RGB values may be on a scale, for example a three-byte scale wherein each value may range from zero to 255 units. Similarly, the luminance value may be on a scale, for example a byte scale, wherein each value may range from zero to 255. If other parameters are used to differentiate the pixels, these parameters may also be described with a number of bits. Just taking a simple case having RGB values at three bytes for each pixel, a single high definition image would require data corresponding to 2048×1152×3×8 bits, or over 56 million bits.
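The arithmetic in the simple case above can be checked with a short sketch. The function name `raw_frame_bits` is hypothetical and is used only for illustration; the dimensions and the three bytes per pixel are taken from the example above.

```python
def raw_frame_bits(width: int, height: int, bytes_per_pixel: int = 3) -> int:
    """Return the number of bits needed for one uncompressed still image."""
    return width * height * bytes_per_pixel * 8


# Values from the high definition example: 2048x1152 pixels, 3 bytes (RGB) each.
pixels = 2048 * 1152
bits = raw_frame_bits(2048, 1152)

print(pixels)  # 2359296 pixels, i.e. over two million
print(bits)    # 56623104 bits for a single still image
```

Because this amount of data recurs for every still image, the total grows linearly with the number of images in the video.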
A video player may play back such a video by decoding the data for each pixel within each image. Storing or transmitting the data for each pixel within each image consumes a large amount of processing resources. This amount of processing would be needed for each still image. To reduce the amount of processing resources required to store, transmit or decode a video comprising a plurality of still images, encoding techniques have been developed to reduce the amount of data corresponding to the original video data. This will be described in greater detail with reference to FIG. 1.
FIG. 1 illustrates an example video system 100.
As illustrated in the figure, video system 100 includes a video recorder 102, a video encoder 104 and a transmitter 106.
Video recorder 102 is arranged to output a video signal 110. Video encoder 104 is arranged to output an encoded signal 112. Transmitter 106 is operable to transmit a transmission signal (not shown).
Video recorder 102, for example a video camera, records sequential still images. Each image is translated into large amounts of data corresponding to the parameters, e.g. color and luminance, of each pixel. The data for each image is sent to video encoder 104 as video signal 110.
Video encoder 104 transforms the format of the data within video signal 110. Typically, video encoder 104 will compress, i.e., greatly reduce, data within video signal 110 in order to reduce the processing power and time required by transmitter 106 and by post processing equipment (not shown). There are many conventional standards for encoding video data, non-limiting examples of which include MPEG and H.264. Conventionally, a video encoder may compress video data by transforming still image data into different amounts of data, for example with a high-level encoding scheme and a low-level encoding scheme. For example, a high-level encoding scheme may include merely passing all the data for a single still image as one encoded frame, which will be described in greater detail below with reference to FIGS. 3 and 4. Further, a low-level encoding scheme may include merely passing changes from one still image to the next still image as one encoded frame, which will also be described in greater detail below with reference to FIGS. 3 and 4. Of course there may be multiple levels of encoding. However, for ease of discussion, two levels of encoding will be discussed herein. An example conventional encoding process will be further described with reference to FIG. 2A.
FIGS. 2A-2C illustrate a person 200 speaking, at a time t1, a time t2 and a time t3, respectively, in front of an audience 204, while being recorded by a video camera 206.
As illustrated in FIG. 2A, person 200 is speaking to audience 204. Video camera 206 corresponds to video recorder 102 of FIG. 1. The first still image recorded by video camera 206 comprises a large matrix of pixels, each of which has data corresponding to parameters such as color and luminance. Each of these parameters will have values. The values of these parameters of the pixels are the data that is sent to a video encoder (not shown in FIG. 2A), such as video encoder 104. The data corresponding to the first image is very large. A conventional video encoder may reduce the amount of data corresponding to the original still image. However, for purposes of explanation, presume that video encoder 104 generates data corresponding to all the data of the original still image. In other words, video encoder 104 encodes the first still image with high-level encoding. The data generated by video encoder 104 for each still image provided by video recorder 102 is called an encoded frame. The encoded frame is sent downstream for storage or transmission.
A conventional video encoder may generate encoded frames having a reduced amount of data as compared to the data associated with the corresponding original still images. The amount of compression is based on a proper determination of a scene change, type of video content, video encoder capability etc. In other words, if a still image changes a sufficient amount as compared to the immediately preceding still image, then the “scene” of the video is considered to be changed. A scene change may take many forms, such as movement within a video, i.e., portions of an image in a still image have a change in position as compared to their counterparts in an immediately preceding image. In these cases, if the “movement” of objects is more than a predetermined movement threshold, then the encoder may determine that the next still image should be encoded with a high-level encoding scheme. Another form of scene change may take the form of a drastic change in luminance of an image. For example, portions of an image in a still image have a change in luminance as compared to their counterparts in an immediately preceding image. In these cases, if the change is more than a predetermined luminance threshold, then the encoder may determine that the next still image should be encoded with the high-level encoding scheme. This will be described in greater detail below.
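The luminance-based form of scene change detection described above may be sketched as follows. This is only an illustration under stated assumptions: the function names, the use of a mean absolute difference as the measure of change, and the threshold value of 40 are all hypothetical and are not taken from any particular encoder.

```python
def mean_abs_luma_diff(prev, curr):
    """Average absolute luminance difference between two equally sized images.

    Each image is a list of rows, and each row is a list of luminance
    values in the range 0..255.
    """
    total = 0
    count = 0
    for row_prev, row_curr in zip(prev, curr):
        for a, b in zip(row_prev, row_curr):
            total += abs(a - b)
            count += 1
    return total / count


def is_scene_change(prev, curr, luma_threshold=40.0):
    """Return True if the change exceeds the (hypothetical) luminance threshold."""
    return mean_abs_luma_diff(prev, curr) > luma_threshold


# Tiny 4x4 stand-ins for a dimly lit image and an image lit by a flash.
dim_image = [[50] * 4 for _ in range(4)]
flash_image = [[220] * 4 for _ in range(4)]

print(is_scene_change(dim_image, dim_image))    # False: nothing changed
print(is_scene_change(dim_image, flash_image))  # True: abrupt luminance change
```

A movement threshold could be checked in the same manner, with a measure of object displacement in place of the luminance difference.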
Returning to FIG. 2A, a second still image recorded by video camera 206 comprises another large matrix of pixels, each of which has data corresponding to parameters such as color and luminance. This data is also sent to video encoder 104. The data corresponding to the second image is also very large. A conventional video encoder may reduce the amount of data corresponding to the second still image by generating an encoded frame based on the difference between the first still image and the second still image. This process is called motion estimation. The motion estimation is based on the Sum of Absolute Difference (SAD) value obtained by subtracting the best-matching macroblock in the past frame from the current frame macroblock. In general, the SAD value will be relatively small for low motion sequences and relatively high in fast motion sequences. If the SAD value is relatively high, then the macroblocks are coded with a high-level encoding scheme. If the SAD value is relatively low, then the macroblocks are coded with a low-level encoding scheme.
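A minimal sketch of SAD-based motion estimation is given below, assuming an exhaustive search over a small window. The function names, the block size, the search range and the threshold passed to `choose_level` are hypothetical choices made for illustration; practical encoders use faster search strategies and their own thresholds.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def best_match(ref_frame, cur_block, top, left, size, search_range):
    """Exhaustively search the past frame around (top, left).

    Returns (lowest SAD, (dy, dx)) for the best-matching block.
    """
    height, width = len(ref_frame), len(ref_frame[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue  # candidate block falls outside the past frame
            cost = sad(cur_block, get_block(ref_frame, y, x, size))
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best


def choose_level(sad_value, sad_threshold):
    """High SAD -> high-level scheme; low SAD -> low-level scheme."""
    return "high" if sad_value > sad_threshold else "low"


# An 8x8 past frame with a bright 2x2 patch, and a current frame in which
# the same patch has moved one pixel to the right.
SIZE = 2
ref = [[0] * 8 for _ in range(8)]
cur = [[0] * 8 for _ in range(8)]
for y in (2, 3):
    for x in (2, 3):
        ref[y][x] = 200
        cur[y][x + 1] = 200

cost, motion = best_match(ref, get_block(cur, 2, 3, SIZE), 2, 3, SIZE, search_range=2)
print(cost, motion)                           # 0 (0, -1)
print(choose_level(cost, sad_threshold=100))  # low
```

In this low motion example the best match is found one pixel to the left in the past frame with a SAD of zero, so the block would be coded with the low-level encoding scheme.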
In this example, the first still image may include person 200 and the background. The second still image will also include person 200 and the background. For purposes of explanation, presume in this example that person 200 is not moving very much and the lighting is not changing very much. Accordingly, the difference in the background is negligible and the difference in the image of person 200 may only lie in a slight change in the position of her mouth. Therefore, in this example, the second encoded frame corresponding to the second still image generated by video encoder 104 will only have data corresponding to the change in the pixel data between the position of the mouth of person 200 in the first still image and the position of the mouth of person 200 in the second still image. Therefore the amount of data within the second encoded frame is drastically smaller than the amount of data within the first encoded frame. This conventional encoding technique is repeated to generate encoded frames for each still image. Of course, if a change between one still image and the next is too drastic, as determined by the particular video encoder, the encoder may encode the drastically different still image with the high-level encoding scheme. This will be described in greater detail with additional reference to FIGS. 2B, 2C, 3 and 4.
FIG. 3 illustrates a group of still images 300 corresponding to the video of person 200 speaking in FIGS. 2A-2C.
As illustrated in FIG. 3, group of still images 300 includes a still image 302, a still image 304, a still image 306, a still image 308, a still image 310, a still image 312, a still image 314, a still image 316, a still image 318, a still image 320, a still image 322 and a still image 324.
Still image 302 corresponds to the image recorded by video camera 206 at time t1 as illustrated in FIG. 2A. Still images 304, 306, 308 and 310 correspond to the images recorded by video camera 206 at sequential times between time t1 and time t2 as illustrated in FIG. 2B. Still image 312 corresponds to the image recorded by video camera 206 at time t2 as illustrated in FIG. 2B. In this example, still image 312 is shown in a brighter shade than the remaining still images to illustrate the effects of flash 208. Still images 314, 316 and 318 correspond to the images recorded by video camera 206 at sequential times between time t2 and time t3 as illustrated in FIG. 2C. Still image 320 corresponds to the image recorded by video camera 206 at time t3 as illustrated in FIG. 2C. Still images 322 and 324 correspond to the images recorded by video camera 206 at sequential times after time t3.
In this example, presume person 200 does not move much and the lighting does not change much between time t1 and time t2. However at time t2, as illustrated in FIG. 2B, a person in the audience (not shown) takes a flash photograph of person 200. A flash 208 from the flash photograph abruptly and drastically changes the lighting of person 200 and of the background. Further, flash 208 creates a new shadow 210 of person 200 in the background.
In this example, presume person 200 again does not move much and the lighting does not change much between time t2 and time t3. However at time t3, as illustrated in FIG. 2C, person 200 moves from one position to another position, and then remains there.
Clearly, as seen in FIG. 3, there are many still images within group of still images 300. As discussed above, each image is comprised of a large matrix of pixels, each of which has data corresponding to parameters, such as color and luminance. All of this data would take a great deal of power and time to process and transmit. To reduce such power and time, video encoder 104 generates encoded frames for each still image. This will now be discussed in greater detail with reference to FIG. 4.
FIG. 4 illustrates a group of encoded frames 400 that have been encoded with a conventional encoding scheme and that correspond to the group of still images 300 of FIG. 3.
As illustrated in FIG. 4, group of encoded frames 400 includes an encoded frame 402, an encoded frame 404, an encoded frame 406, an encoded frame 408, an encoded frame 410, an encoded frame 412, an encoded frame 414, an encoded frame 416, an encoded frame 418, an encoded frame 420, an encoded frame 422 and an encoded frame 424.
Encoded frame 402 corresponds to still image 302, which corresponds to the image recorded by video camera 206 at time t1 as illustrated in FIG. 2A. Encoded frames 404, 406, 408 and 410 correspond to still images 304, 306, 308 and 310, respectively, which correspond to the images recorded by video camera 206 at sequential times between time t1 and time t2 as illustrated in FIG. 2B. Encoded frame 412 corresponds to still image 312, which corresponds to the image recorded by video camera 206 at time t2 as illustrated in FIG. 2B. Encoded frames 414, 416 and 418 correspond to still images 314, 316 and 318, respectively, which correspond to the images recorded by video camera 206 at sequential times between time t2 and time t3 as illustrated in FIG. 2C. Encoded frame 420 corresponds to still image 320, which corresponds to the image recorded by video camera 206 at time t3 as illustrated in FIG. 2C. Encoded frames 422 and 424 correspond to still images 322 and 324, respectively, which correspond to the images recorded by video camera 206 at sequential times after time t3.
In this example, since still image 302 is the first still image, video encoder 104 will generate encoded frame 402 with a high-level encoding scheme. The high-level encoding scheme will require a relatively high level of processing and transmitting resources down the line.
Since person 200 moves little and since the lighting changes little between times t1 and t2, changes between still image 302 and still image 304 are small. Accordingly, video encoder 104 will generate encoded frame 404 with a low-level encoding scheme. Again, the benefit of encoding includes generating much less data, i.e., encoded frame 404 is drawn to only a difference between the data corresponding to still image 302 and the data corresponding to still image 304, as indicated by arrow 426, as opposed to all of the data corresponding to still image 304. In other words, the low-level encoding scheme will require a relatively low level of processing and transmitting resources down the line. Video encoder 104 will similarly generate encoded frames 406, 408 and 410 with a low-level encoding scheme, wherein: encoded frame 406 corresponds to a difference between still image 304 and still image 306; encoded frame 408 corresponds to a difference between still image 306 and still image 308; and encoded frame 410 corresponds to a difference between still image 308 and still image 310.
In this example, although still image 312 is drastically different from still image 310 as a result of flash 208, video encoder 104 will generate encoded frame 412 with a low-level encoding scheme. However, video encoder 104 will be triggered to encode the next still image with a high-level encoding scheme.
At this point, because video encoder 104 is triggered to encode the next still image with a high-level encoding scheme, video encoder 104 will generate encoded frame 414 with a high-level encoding scheme. Again, the high-level encoding scheme will require a relatively high level of processing and transmitting resources down the line.
Since person 200 moves little and since the lighting changes little between times t2 and t3, changes between still image 314 and still image 316 are small, and changes between still image 316 and still image 318 are small. Accordingly, video encoder 104 will generate encoded frame 416 and encoded frame 418 with a low-level encoding scheme. Encoded frame 416 corresponds to a difference between still image 314 and still image 316, as indicated by arrow 428. Encoded frame 418 corresponds to a difference between still image 316 and still image 318.
At this point, still image 320 is drastically different from still image 318 because person 200 has moved drastically by the time of still image 320. Video encoder 104 will generate encoded frame 420 with a low-level encoding scheme, but the following encoded frame 422 will be encoded with a high-level encoding scheme. Again, the high-level encoding scheme will require a relatively high level of processing and transmitting resources down the line.
At this point, because video encoder 104 is triggered to encode the next still image with a high-level encoding scheme, video encoder 104 will generate encoded frame 422 with a high-level encoding scheme. Again, the high-level encoding scheme will require a relatively high level of processing and transmitting resources down the line.
Finally, since person 200 moves little and since the lighting changes little after time t3, changes between still image 322 and still image 324 are small. Accordingly, video encoder 104 will generate encoded frame 424 with a low-level encoding scheme, wherein encoded frame 424 corresponds to a difference between still image 322 and still image 324.
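The frame-by-frame decisions walked through above for FIG. 4 can be summarized in a short sketch. The function name `conventional_levels` and the representation of each still image as a single true/false "drastic change" flag are assumptions made for illustration only. The rule modeled is the one described above: the first still image is encoded with the high-level scheme, and a drastic change in one still image triggers high-level encoding of the next still image.

```python
def conventional_levels(drastic_change):
    """Return the encoding level chosen for each still image.

    drastic_change[i] is True when still image i differs drastically from
    the still image immediately preceding it.
    """
    levels = []
    force_high = True  # the first still image is always encoded high-level
    for changed in drastic_change:
        levels.append("high" if force_high else "low")
        # A drastic change triggers high-level encoding of the NEXT image.
        force_high = changed
    return levels


# Still images 302, 304, ..., 324; the flash occurs at still image 312 and
# the large movement occurs at still image 320.
image_ids = list(range(302, 326, 2))
drastic = [image_id in (312, 320) for image_id in image_ids]

levels = conventional_levels(drastic)
# Encoded frame numbers are the still image numbers plus 100 (302 -> 402).
high_frames = [image_id + 100
               for image_id, level in zip(image_ids, levels)
               if level == "high"]
print(high_frames)  # [402, 414, 422]
```

The result matches the walkthrough: encoded frames 412 and 420, which coincide with the drastic changes, are themselves encoded with the low-level scheme, while the frames that follow them are encoded with the high-level scheme.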
In the example discussed above with reference to FIGS. 2A-2C, group of encoded frames 400 includes three frames, encoded frames 402, 414 and 422, that were encoded with a high-level encoding scheme. Each of these frames increases the required processing power and the bandwidth required for transmission. It is a goal to reduce this amount of processing power and time. A conventional method to reduce the number of high-level encoded frames will now be described with reference to FIG. 5.
FIG. 5 illustrates another group of encoded frames 500 that have been encoded with a conventional encoding scheme and that correspond to the group of still images of FIG. 3.
As illustrated in FIG. 5, group of encoded frames 500 includes encoded frame 402, encoded frame 404, encoded frame 406, encoded frame 408, encoded frame 410, encoded frame 412, an encoded frame 502, encoded frame 416, encoded frame 418, encoded frame 420, encoded frame 422 and encoded frame 424.
Group of encoded frames 500 of FIG. 5 differs from group of encoded frames 400 of FIG. 4 in that encoded frame 414 of group of encoded frames 400 is replaced with encoded frame 502. In this scheme, the conventional video encoder always uses a plurality of previous still images (the number of reference frames, which is set by the encoder configuration) to generate the next encoded frame. As shown in FIG. 5, to generate encoded frame 502, the encoder performs motion estimation against multiple previous still images. In this example, the encoder generates encoded frame 502 based on a difference between still image 312 and still image 314, as indicated by arrow 504; a difference between still image 310 and still image 314, as indicated by arrow 508; and a difference between still image 308 and still image 314, as indicated by arrow 506. This scheme increases the processing power and memory space required to encode each frame. However, mobile platform video encoders cannot afford a large number of reference frames because of the limited processing power and memory allocated to encode each still image in real time.
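The use of multiple reference frames may be illustrated with the simplified sketch below. As assumptions made only for brevity, it compares whole images at zero displacement rather than performing a per-macroblock motion search, and it uses small uniform 4×4 images with made-up luminance values as stand-ins for still images 308, 310, 312 and 314. The function names are hypothetical.

```python
def frame_sad(a, b):
    """Sum of absolute differences between two equally sized images."""
    return sum(abs(x - y)
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


def pick_reference(current, references):
    """Compare the current image against every reference image.

    references maps a label to an image. Returns (best_label, costs),
    where costs holds one SAD value per reference that was examined.
    """
    costs = {label: frame_sad(current, ref) for label, ref in references.items()}
    best_label = min(costs, key=costs.get)
    return best_label, costs


def flat(value, size=4):
    """Uniform size x size image, used only as a stand-in for a real image."""
    return [[value] * size for _ in range(size)]


references = {
    "308": flat(48),   # before the flash
    "310": flat(51),   # before the flash
    "312": flat(220),  # brightened by the flash
}
current_314 = flat(52)  # after the flash, lighting back to normal

best, costs = pick_reference(current_314, references)
print(best)        # 310
print(costs)       # {'308': 64, '310': 16, '312': 2688}
print(len(costs))  # 3 comparisons instead of 1
```

In this sketch the image affected by the flash is a poor reference, while an earlier image yields a small difference. The cost is that one comparison is performed and one reference image is held in memory for every reference frame, which is the burden on processing power and memory noted above.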
In the conventional encoding scheme illustrated in FIG. 5, a video encoder will generate encoded frame 502 with a modified low-level encoding scheme. Performing the modified low-level encoding scheme per frame requires a lower level of processing for the encoder as compared to the high-level encoding scheme per frame. However, in conventional encoders, the modified low-level encoding scheme will require a higher level of processing as compared to the low-level encoding scheme.
Motion estimation using only one reference frame, for example as discussed above with reference to FIG. 4, requires nearly 35-40% of the total encoding time. Having more reference frames to perform motion estimation, for example as discussed above with reference to FIG. 5, requires more time and processing resources to determine the best match for the current macroblock. The H.264 encoding standard uses multiple reference frames, such as described with reference to FIG. 5. If provided with more processing power, encoders may opt for using multiple reference frames. Hence, although there is a provision for multiple reference frames, most mobile platform H.264 video encoders will use only one reference frame for the motion estimation algorithm in order to reduce the complexity, the amount of required power and the memory space.
What is needed is a system and method for efficiently encoding video data having abrupt illumination variations.