Standard video decompression algorithms, such as MPEG-2, MPEG-4 and H.263, use a temporal prediction scheme, which is explained below. One common feature of these schemes is a large external memory for storing video frames at different stages of decoding. The platforms that run the various decoding algorithms range from standard CPUs to dedicated hardware, as well as various combinations of the two. Whether the decompression scheme is implemented on a standard CPU or as a dedicated hardware solution, the cost of accessing external memory is high in terms of both real-time performance and system power. Both of these characteristics are very important for modern communication systems. The complexity of multimedia algorithms increases from generation to generation. For example, the newest video decoding algorithm, H.264 (also known as MPEG-4 Part 10), is three times more complex than the “initial” MPEG-4 Part 2. Although video quality is far superior in the H.264 case, there are very serious problems in implementing this algorithm on existing platforms, whether a standard CPU or dedicated hardware. There is not enough real-time performance to address the growing demand for higher resolutions, higher frame rates and lower bandwidth. To increase the real-time performance of such platforms, the processing frequency can be increased, but this causes higher power dissipation and a higher silicon cost. Many such multimedia platforms, such as smart phones, PDAs, etc., are very cost sensitive; hence other ways of improving performance should be explored. Additionally, power dissipation is very important for all mobile applications, because battery life is limited and there is little added value in a multimedia-enabled PDA that can play back a multimedia clip for only a few minutes. The requirement is around two hours of playback, which is the average duration of a trip by train, car, etc.
It is conventional practice to provide a caching scheme for video decoding in order to reduce power dissipation and improve real-time performance. However, such a scheme allows only a very specific, narrow kind of video processing to be run from the internal cache. Additionally, if the cache is used by other tasks running on the CPU simultaneously with video decoding, it will be contaminated by other kinds of data, and the whole advantage in power and real-time savings is lost.
Having described the above constraints for the decompression part of video signal processing, we need to point out that there are no platform-based constraints on the compression part. Usually, the input material, whether an advertisement, an entertainment clip, etc., is compressed off-line ahead of time and simply distributed to various mobile platforms. Apart from having to produce an output bitstream that complies with a particular standard, an encoder has no “implementation” limitations. It is natural to ask whether decoding platform constraints can be addressed during the actual encoding. It is true that additional constraints may degrade the quality of the compression. But what if this quality decrease is very small whereas the decoder-side advantages are large? Unfortunately, there are no methods in the prior art that describe decoder-platform-constrained encoding of video signals.
A temporal prediction scheme, which is common to various compression algorithms for video signals, is based on the idea that the current frame of video data being decoded is predicted from previously decoded frames. The frames from which such a prediction is formed are called reference frames. In the natural display order, reference frames can either temporally precede or succeed the frame being decoded. Furthermore, most standard video decompression algorithms use a block-based prediction structure, wherein the prediction for each block of video data is formed from a corresponding block of data from a reference frame, as indicated by a motion vector. In a typical video decompression system, reference frames are too large to be fully accommodated in the primary (typically on-chip) memory. So, the process of forming the prediction involves:
    Moving blocks of data from the secondary (typically off-chip, external) reference frame memory to the primary (typically on-chip) memory;
    Performing simple averaging and/or filtering operations on the block of data; and
    Writing the predicted block back to the secondary memory.
For example, in the case of the MPEG-2 compression standard, each forward-predicted 16×16 block encoded in the frame mode needs a 17×17 block from the reference frame memory to form a suitable prediction. On average, some blocks of the reference frames are used multiple times in the process of prediction, while other blocks are not used at all.
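To make the 17×17 figure concrete, the following is a minimal sketch of half-sample prediction for one 16×16 block. It assumes the reference frame is held as a list of rows of integer samples; the function names `fetch_block` and `predict_half_pel` are hypothetical and are not taken from any standard or from this document. For simplicity the sketch always fetches the full 17×17 patch, which is the worst case, and then averages neighbouring samples with rounding.

```python
def fetch_block(ref, top, left, height, width):
    """Copy a height x width block out of the reference frame.

    This stands in for the transfer from external (secondary) memory
    to on-chip (primary) memory described in the text.
    """
    return [row[left:left + width] for row in ref[top:top + height]]


def predict_half_pel(ref, top, left, half_y, half_x, size=16):
    """Form a size x size prediction at (top, left) plus a half-sample offset.

    half_y and half_x are 0 (integer position) or 1 (half-sample position).
    One extra row and column are fetched so that neighbouring samples
    can be averaged, hence 17 x 17 samples for a 16 x 16 block.
    """
    patch = fetch_block(ref, top, left, size + 1, size + 1)
    prediction = []
    for y in range(size):
        row = []
        for x in range(size):
            a = patch[y][x]
            b = patch[y][x + half_x]
            c = patch[y + half_y][x]
            d = patch[y + half_y][x + half_x]
            # Rounded average; with both offsets 0 this reduces to a plain copy.
            row.append((a + b + c + d + 2) // 4)
        prediction.append(row)
    return prediction
```

With both half-sample flags set to 0 the result is simply the 16×16 block at the integer position, so a real decoder could fetch fewer samples in that case; the sketch does not attempt that optimisation.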
Referring to FIG. 1, the operation of a conventional video decoding system in receiving and decoding compressed video information according to the MPEG-4 standard will now be described by way of further background. As is fundamental in the art, MPEG-4 video decoding predicts motion from frame to frame in a video sequence by way of motion vectors, which are two-dimensional vectors that provide offsets from a coordinate position in a prior reference frame (and/or a future reference frame) to coordinates in the frame currently being decoded. The redundant, or non-moving, information from frame to frame is encoded by way of a transform, in this case the discrete cosine transform (DCT), the inverse of which is performed by the video decoding system. The inverse transform of the redundant portion of the transmitted frame, in combination with the results of the motion vectors, produces an output frame.
According to the MPEG-4 standard, an incoming bitstream is demultiplexed in block 10, motion vectors are decoded in block 11, and texture is decoded in block 14. Decoded motion vectors from block 11 are fed to Motion Compensation block 12, where the needed block of information is extracted from previously reconstructed frames (block 13) according to the decoded motion vectors. Block 13 requires a large amount of memory to store reconstructed frames and is therefore generally implemented by means of external memory. It is not practical to keep this much memory inside a chip of reasonable size.
Reconstruction of the actual displayable frame takes place in block 15, where the outputs of block 12 and block 14 are added with appropriate clipping.
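The addition with clipping performed in block 15 can be sketched as follows. This is only an illustration under the assumption of 8-bit samples (range 0 to 255) stored as lists of rows; the function name `reconstruct_block` is hypothetical.

```python
def reconstruct_block(prediction, residual, max_value=255):
    """Add the motion-compensated prediction to the decoded texture residual.

    Each sum is clipped to [0, max_value] so the result is a valid sample.
    The 8-bit range is an assumption made for this sketch.
    """
    return [
        [min(max(p + r, 0), max_value) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(prediction, residual)
    ]
```

For instance, a predicted sample of 250 with a residual of 10 is clipped to 255, and a predicted sample of 5 with a residual of -9 is clipped to 0.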
FIG. 1 also shows the operation of texture decoding performed in block 14, which begins with variable length decoding in block 21. The output of block 21 is fed to Inverse Scan block 22, and from block 22 the information is fed to block 23, which performs inverse prediction for the AC and DC components. The output of block 23 is inverse quantized in block 24, and the inverse DCT then takes place in block 25 as the final stage of texture decoding.
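A simplified sketch of part of this texture path is given below: inverse zigzag scan, inverse quantization and an 8×8 inverse DCT. Several things here are assumptions made for illustration only: variable length decoding and AC/DC inverse prediction are omitted, the inverse quantizer is a plain multiplication by a single step size rather than the exact MPEG-4 reconstruction rule, and the inverse DCT is a direct floating-point evaluation rather than a fast or bit-exact implementation. All function names are hypothetical.

```python
import math


def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zigzag order."""
    order = []
    for s in range(2 * n - 1):
        diagonal = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        # Even anti-diagonals are traversed from bottom-left to top-right.
        if s % 2 == 0:
            diagonal.reverse()
        order.extend(diagonal)
    return order


def inverse_scan(levels, n=8):
    """Place a 1-D list of n*n levels back into a 2-D block."""
    block = [[0] * n for _ in range(n)]
    for value, (row, col) in zip(levels, zigzag_order(n)):
        block[row][col] = value
    return block


def inverse_quantize(block, quant_step):
    """Simplified uniform reconstruction (an assumption, not the MPEG-4 rule)."""
    return [[level * quant_step for level in row] for row in block]


def idct_8x8(coeffs):
    """Direct (slow) 8 x 8 inverse DCT, returning floating-point samples."""
    def scale(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * 8 for _ in range(8)]
    for y in range(8):
        for x in range(8):
            total = 0.0
            for v in range(8):
                for u in range(8):
                    total += (scale(u) * scale(v) * coeffs[v][u]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[y][x] = total / 4
    return out


def decode_texture(levels, quant_step):
    """Chain the three stages for one 8 x 8 block."""
    return idct_8x8(inverse_quantize(inverse_scan(levels), quant_step))
```

A block whose only non-zero level is the DC term decodes to a flat block, which is a quick sanity check on the scaling of the transform.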
A conventional algorithm for performing video encoding is shown in FIG. 2. An input frame is processed in block 31, where motion estimation is performed. The exact algorithm for motion estimation is usually not standardized, and the developer is free to use various methods of motion estimation as long as the output bitstream from block 34 complies with a given standard. The main idea of motion estimation is to find the “best” match between the block being processed and any block in the already encoded and decoded frames, which are stored in block 38. An encoder has to perform both encoding and partial decoding so that motion estimation is carried out on decoded frames, because the decoder will have only decoded frames available. Hence, to avoid drift between an encoder and a decoder, any encoder includes decoder operations such as inverse quantization and inverse transform.
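Because motion estimation is not standardized, the following is only one possible illustration: an exhaustive (full) search that minimizes the sum of absolute differences (SAD). The matching criterion, the block size, the search range and the function names `sad` and `full_search` are all assumptions chosen for this sketch; practical encoders typically use faster search strategies.

```python
def sad(cur, ref, cy, cx, ry, rx, size):
    """Sum of absolute differences between a current and a reference block."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )


def full_search(cur, ref, cy, cx, size=8, search=4):
    """Find the motion vector (dy, dx) with the lowest SAD.

    cur is the frame being encoded; ref is a previously encoded and
    decoded frame. Candidates that fall outside ref are skipped.
    Returns (dy, dx, cost).
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = cy + dy, cx + dx
            if ry < 0 or rx < 0 or ry + size > height or rx + size > width:
                continue
            cost = sad(cur, ref, cy, cx, ry, rx, size)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best[1], best[2], best[0]
```

Note that the search reads many overlapping reference blocks for each current block, which is one reason reference-frame memory traffic matters on the encoder side as well.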
The output of block 31 is a delta frame between the current frame and the previous frame or frames, together with a motion vector for each video frame block. The resultant delta signal is further processed by block 32, which usually applies a DCT or a similar transform. The results of block 32 are quantized in block 33 to further reduce the needed bandwidth. Motion vectors and quantized transform values are then encoded losslessly in block 34, which usually performs variable length coding or a similar entropy coding scheme. Block 37 performs rate control for the whole encoding scheme: it has a particular output bit rate target and allocates resources to the motion estimation and quantization blocks. Various control parameters are transferred from block 37 to block 34 to be multiplexed into the output bitstream. Blocks 35 and 36 perform the partial decoding described above. The rate-control mechanism in block 37 and the motion estimation in block 31 are responsible for the video quality achieved for a particular bit rate target. At the encoding stage there is no awareness of decoder implementation issues such as power dissipation. The only decoder-related parameter taken into account is the buffer size at the input of the decoder, which is normally standardized.
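The role of the partial decoding loop can be illustrated with a deliberately reduced sketch in which the transform stage is omitted and the residual is quantized directly. This is an assumption made to keep the example short and is not how blocks 32 through 36 are actually specified; the uniform quantizer, the 8-bit sample range and the function names are likewise hypothetical. The point shown is that the encoder rebuilds exactly the samples the decoder will produce and uses those, not the original input, as its future reference.

```python
def quantize(value, step):
    """Uniform quantizer with rounding to nearest, symmetric around zero."""
    magnitude = (abs(value) + step // 2) // step
    return magnitude if value >= 0 else -magnitude


def clip(value, max_value=255):
    """Clip a sample to the assumed 8-bit range."""
    return min(max(value, 0), max_value)


def decode_block(levels, prediction, step):
    """Decoder side: inverse quantize, add the prediction, clip."""
    return [
        [clip(p + q * step) for p, q in zip(pred_row, level_row)]
        for pred_row, level_row in zip(prediction, levels)
    ]


def encode_block(current, prediction, step):
    """Encoder side: quantize the residual, then decode it locally.

    Returns the quantized levels (to be entropy coded) and the locally
    reconstructed block (to be stored as future reference data).
    """
    levels = [
        [quantize(c - p, step) for c, p in zip(cur_row, pred_row)]
        for cur_row, pred_row in zip(current, prediction)
    ]
    return levels, decode_block(levels, prediction, step)
```

Since `encode_block` calls the same `decode_block` routine that a decoder would run, the reference data on both sides stays identical even though quantization is lossy.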