In the field of digital video compression and decompression, the compression format often follows standards such as MPEG-2. A newer standard under development is known as MPEG AVC, also called MPEG-4 Part 10 or ITU-T H.264; it is referred to here simply as AVC. MPEG-2, AVC and other digital video compression formats generally use motion-compensated image prediction as part of their compression and decompression processes.
In motion-compensated image prediction, also called motion compensation (“MC”), the encoder finds a region of a reference image that can be translated such that the translated image region resembles a region of the image currently being compressed. The region size may be, e.g., 16×16 (also called a macroblock), or it may be smaller, such as 16×8, 8×16, 8×8, 8×4, 4×8, 4×4, or other sizes and shapes, depending on the specifics of the video compression format. The region being predicted is called an MC block. The reference picture may be any of a number of pictures that have previously been encoded, in the case of an encoder, or that have previously been decoded, in the case of a decoder. Reference pictures are normally stored in DRAM rather than in on-chip memory, because of the amount of memory required to store all of the possible reference pictures.
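For a whole-pixel motion vector, forming the MC prediction amounts to copying a block from the reference picture at the position displaced by the motion vector. The sketch below is a minimal illustration, assuming the reference picture is held as a list of rows of pixel values and that the displaced block lies entirely inside the picture (practical codecs also handle references that point outside the picture). The function name `predict_block_integer` is hypothetical.

```python
def predict_block_integer(ref, x, y, w, h, mv_x, mv_y):
    """Return the w x h prediction for the MC block whose top-left corner is
    (x, y) in the current picture, using a whole-pixel motion vector (mv_x, mv_y).

    ref is the reference picture as a list of rows of pixel values.
    Assumes the displaced block lies entirely inside the reference picture.
    """
    rx, ry = x + mv_x, y + mv_y  # top-left corner of the reference region
    if rx < 0 or ry < 0 or ry + h > len(ref) or rx + w > len(ref[0]):
        raise ValueError("reference region falls outside the picture")
    # Copy h rows of w pixels starting at the displaced position.
    return [row[rx:rx + w] for row in ref[ry:ry + h]]


# 8x8 reference picture whose pixel values encode their position: 10*row + col
ref = [[10 * r + c for c in range(8)] for r in range(8)]
pred = predict_block_integer(ref, x=2, y=2, w=4, h=4, mv_x=1, mv_y=-1)
print(pred[0])  # [13, 14, 15, 16]
```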
The translation in typical video compression formats is a re-positioning along the horizontal (X) and/or vertical (Y) axis by amounts that may include both whole-pixel (integer) and fractional-pixel components. Such translations are expressed as motion vectors (MVs). A pixel is a picture element, also known as a pel. X and Y translations using MVs with fractional components require multi-tap filters to perform the fractional-pixel re-positioning. In the case of MPEG-2, a 2-tap filter may be used in each of the X and Y axes; in the case of AVC, a 6-tap filter may be used in each axis. Because of these filters, the number of pixels needed from the reference image to produce the correct MC prediction may be greater than the size of the MC block. For example, if the MC block size is 4×4 and the fractional-pixel filter uses 6 taps in each dimension, the region of pixels needed to perform the prediction is (4 + 6 − 1) × (4 + 6 − 1) = 9 × 9.
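The region-size arithmetic above generalizes to any block size and filter length: a T-tap filter needs T − 1 extra pixels along each axis in which the MV has a fractional component. The sketch below computes the bounding rectangle to be fetched under that simplified rule; the function name `reference_region_size` is hypothetical, and the result is the rectangle enclosing all needed pixels, not a statement of how any particular codec organizes its filtering.

```python
def reference_region_size(block_w, block_h, taps, frac_x=True, frac_y=True):
    """Width and height of the reference-picture rectangle needed to predict
    one MC block with a filter of `taps` taps per axis.

    An axis whose MV component is fractional needs taps - 1 extra pixels;
    an axis with a whole-pixel component needs only the block's own pixels.
    """
    extra = taps - 1
    width = block_w + (extra if frac_x else 0)
    height = block_h + (extra if frac_y else 0)
    return width, height


print(reference_region_size(4, 4, taps=6))                # (9, 9): 6-tap, as in AVC
print(reference_region_size(16, 16, taps=2))              # (17, 17): 2-tap, as in MPEG-2
print(reference_region_size(4, 4, taps=6, frac_y=False))  # (9, 4): fractional in X only

w, h = reference_region_size(4, 4, taps=6)
print(w * h / (4 * 4))  # 5.0625 reference pixels fetched per predicted pixel
```

The last line shows why small MC blocks are costly: for a 4×4 block with a 6-tap filter, roughly five reference pixels must be read for every pixel predicted, whereas the same filter on a 16×16 block needs (21 × 21) / 256 ≈ 1.7.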
The integer (whole-pixel) portion of an MV can in general take almost any value, pointing to almost any part of the reference image. One effect of this variability is that the reference picture region needed for MC may straddle DRAM page boundaries as well as memory word boundaries.
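The cost of boundary crossing can be illustrated by counting the DRAM words and pages that hold a given reference region. The sketch below assumes, purely hypothetically, one byte per pixel, a raster-scan picture layout starting at address 0, 16-byte memory words and 2048-byte pages; real systems may use other word sizes, page sizes and tiled layouts. The function name `words_and_pages_touched` is also hypothetical.

```python
def words_and_pages_touched(x, y, w, h, stride, word_bytes=16, page_bytes=2048):
    """Count the distinct DRAM words and pages holding a w x h pixel region
    whose top-left corner is (x, y).

    Hypothetical layout: one byte per pixel, raster-scan order starting at
    address 0, `stride` bytes per picture row.
    """
    words, pages = set(), set()
    for row in range(y, y + h):
        first = row * stride + x  # address of the first pixel in this row
        last = first + w - 1      # address of the last pixel in this row
        words.update(range(first // word_bytes, last // word_bytes + 1))
        pages.update(range(first // page_bytes, last // page_bytes + 1))
    return len(words), len(pages)


# 9x9 region (a 4x4 block with a 6-tap filter) in a 720-pixel-wide picture
aligned = words_and_pages_touched(16, 0, 9, 9, stride=720)
straddling = words_and_pages_touched(14, 0, 9, 9, stride=720)
print(aligned)     # (9, 3): each row fits in one word
print(straddling)  # (18, 3): each row straddles a word boundary
```

Under these assumptions, moving the same 9×9 region two pixels to the left doubles the number of words that must be read, even though the number of useful pixels is unchanged.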
Because of motion compensation itself, the wide range of possible MV values, the variable and possibly small size of MC blocks, and the possible use of fractional-pixel filters, the number of DRAM cycles required to read all of the reference image regions needed to encode or decode one picture may be very high, which makes systems for encoding and decoding digital video expensive. Encoding or decoding digital video is generally considered a real-time process, i.e., one that must be completed on a specific schedule to ensure proper operation. This real-time requirement makes it difficult or impossible to spread DRAM accesses over extended intervals of time, which increases the cost of performing the DRAM accesses required for MC operations in real time.
Conventional data caches in microprocessors, such as those described in “Data Type Dependent Cache Pre-fetching for MPEG Applications”, R. Cucchiara, A. Prati and M. Piccardi, IEEE (Journal unknown), 2002, are not efficient enough at reducing the number of DRAM cycles required to provide the reference image regions needed for MC, at preventing unwanted DRAM cycle usage, or at minimizing the overall physical size of the cache.
Conventional CPU data caches typically use either a 2-way or 4-way set-associative design, which does not work efficiently for this purpose. Typical CPU caches are also encumbered by the need to complete a tag match and return data within one or two clock cycles, which results in increased cost compared to the present invention.
Pre-fetching and aggregation of read requests (for DRAM efficiency) without caching are described in UK patent number GB2343808. Pre-fetching can help in some cases by predicting which data is likely to be requested and fetching it before it is requested. However, such predictions are not always accurate, and pre-fetching sometimes reads data from DRAM that the video decoder never actually requests. As a result, DRAM performance is in some cases made worse rather than better, for example in video sequences that already represent the worst case. This is highly undesirable.
Aggregation of reads can make DRAM transactions more efficient, but it does not allow data returned from DRAM as part of one aggregated set of reads to be reused when the decoder requests some of that data again in a subsequent request.
Further limitations and disadvantages of conventional and traditional approaches will become apparent to one of skill in the art, through comparison of such systems with the present invention as set forth in the remainder of the present application with reference to the drawings.