One of the major challenges in designing a memory subsystem is organizing data so as to enable efficient memory access and thereby increase system throughput. Data organization becomes an even bigger challenge in a UMA (Unified Memory Architecture) subsystem, where it has a direct impact on the efficiency of the system as a whole. Data in the memory should therefore be organized so that a high bandwidth client (a client is an agent that initiates data transfer between itself and the memory subsystem) benefits the most, without compromising the access efficiency of the other, low bandwidth clients. In other words, the data organization should help reduce the DDR-SDRAM overheads for high bandwidth clients, which in turn improves the efficiency of the memory subsystem as a whole.
In a video decompression engine (a.k.a. a video decoder), a substantial portion of the system bandwidth is used for transferring video pixel data. The video decoder uses neighboring macro-block data (a macro-block is a 16×16 pixel block) from the previous and future frames of the video to predict the current macro-block information. Thus, the right choice would be a memory subsystem that is macro-block oriented.
However, the current column sizes in DDR-SDRAM technology do not allow a full macro-block row of an SD size picture to be packed into one bank of the DRAM. At the same time, a very simple linear arrangement of macro-blocks, stored contiguously in the same bank of the DDR-SDRAM, would increase the SDRAM overheads, because fetching a horizontally or vertically adjacent macro-block would require a different row of the same bank to be activated. In such a case, the current row of the current bank needs to be precharged and a new row of the same bank needs to be activated, resulting in an overhead of roughly 6 clocks per row change. In the worst case, a particular video decode fetch could involve four macro-blocks' worth of data lying in four different rows of the same bank, resulting in an overhead as high as 18 clocks (three row switches). On the other hand, if the horizontally or vertically adjacent macro-blocks were to reside in different banks of the DRAM, it would be possible to reduce the SDRAM overhead to zero in the best case, and the worst-case overhead would be much less than 18 clocks.
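The overhead arithmetic above can be illustrated with a minimal sketch. It assumes a deliberately simplified model that is not part of the original description: each macro-block access is reduced to a (bank, row) pair, a row change within one bank costs the roughly 6 clocks quoted above, and an access to a different bank is taken to be fully overlapped with ongoing transfers (the best case). The function name and the access lists are hypothetical.

```python
# Approximate precharge + activate cost per row change, as quoted in the text.
ROW_SWITCH_CLOCKS = 6


def row_switch_overhead(accesses, clocks_per_switch=ROW_SWITCH_CLOCKS):
    """Estimate the row-switch overhead, in clocks, for a sequence of accesses.

    accesses: iterable of (bank, row) pairs in fetch order.
    Simplifying assumptions: the first activation of each bank is free, and
    switching to another bank costs nothing (best-case overlap).
    """
    open_rows = {}  # bank -> row currently open in that bank
    overhead = 0
    for bank, row in accesses:
        # A penalty applies only when the bank already has a different row open.
        if bank in open_rows and open_rows[bank] != row:
            overhead += clocks_per_switch
        open_rows[bank] = row
    return overhead


# Four macro-blocks in four different rows of the same bank: three row switches.
same_bank = [(0, 0), (0, 1), (0, 2), (0, 3)]
# The same four macro-blocks spread over four different banks.
spread_banks = [(0, 0), (1, 0), (2, 0), (3, 0)]

print(row_switch_overhead(same_bank))
print(row_switch_overhead(spread_banks))
```

Under these assumptions the same-bank case reproduces the 18-clock worst case, while the multi-bank case incurs no row-switch overhead. A real controller would also have to account for timing constraints that this sketch ignores.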
Conventionally, four macro-blocks' worth of data are packed into one bank before switching to the next bank of the DDR-SDRAM. This packing would be efficient for images whose number of horizontal macro-blocks (NMBX) satisfies the equation
NMBX = 16*N + 8
where N is any positive integer.
The above equation ensures efficient data fetching and packing for an HD size picture (NMBX=120). However, for an SD size picture (NMBX=45), the smallest value of N for which the equation yields at least 45 is N=3, resulting in a required NMBX of 56. This means that 11 macro-blocks are wasted for every macro-block row of the image. For an SD size picture, this would be roughly 75 Kbytes of wasted memory per frame storage (roughly 20% wastage per frame).
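The padding implied by the NMBX equation can be checked with a short sketch. The function names are hypothetical, and the wasted fraction is assumed here to be measured against the padded row width. Only the macro-block counts are computed, since the byte size per stored macro-block is not stated above.

```python
def padded_nmbx(nmbx):
    """Smallest value of the form 16*N + 8, with N a positive integer,
    that is greater than or equal to nmbx."""
    # Ceiling division of (nmbx - 8) by 16, clamped so that N is at least 1.
    n = max(1, -(-(nmbx - 8) // 16))
    return 16 * n + 8


def wasted_per_row(nmbx):
    """Unused macro-block slots per macro-block row, and their share of the
    padded row width (assumed basis for the percentage)."""
    padded = padded_nmbx(nmbx)
    wasted = padded - nmbx
    return wasted, wasted / padded


for name, nmbx in (("HD", 120), ("SD", 45)):
    wasted, fraction = wasted_per_row(nmbx)
    print(name, padded_nmbx(nmbx), wasted, round(fraction * 100, 1))
```

For NMBX=120 the padded width equals the picture width, so nothing is wasted. For NMBX=45 the padded width is 56, leaving 11 unused macro-blocks per row, or just under 20% of the padded row.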
Further limitations and disadvantages of conventional and traditional approaches will become apparent to one of skill in the art, through comparison of such systems with the present invention as set forth in the remainder of the present application with reference to the drawings.