Media applications have been driving microprocessor development for more than a decade. In fact, future media applications will place even higher computational requirements on available microprocessors. As a result, tomorrow's personal computing (PC) experiences will be even richer in audio/video effects as well as being easier to use. More importantly, future microprocessors will enable computing to merge with communications.
Accordingly, the display of images has become increasingly popular for current, as well as future, computing devices. Unfortunately, the quantity of data required for these types of media applications tends to be very large. In addition, increases in computational power, memory and disk storage, as well as network bandwidth, have facilitated the creation and use of larger and higher quality images. However, the use of larger and higher quality images often results in a bottleneck between the processor and memory. As such, image/video processing (media) applications often suffer from memory latency problems in spite of the fact that microprocessor clock speeds are continuously increasing.
Although Random Access Memory (RAM) nominally provides random access to its contents, in practice RAM is not equally accessible in every access pattern. In fact, the time required to access different portions of the RAM may vary. For example, horizontal access of memory locations within a memory device is generally quite fast. In contrast, vertical memory access is quite slow when utilizing conventional memory devices.
As a result, raster-scan memory arrangements for video images, which place pixel data linearly across the image plane within a cache memory, often lead to numerous problems. First, a cache line holds parts of several basic image blocks (e.g., 8×8 or 16×16). Conversely, a basic image block is spread across several cache lines. As such, accessing a block, say 8×8, is equivalent to accessing several cache lines (e.g., at least eight cache lines in current architectures, assuming the image is wider than a single cache line).
Accessing an image block using conventional memory devices therefore requires at least eight memory accesses. Furthermore, eight software pre-caching instructions are likely required in order to avoid eight cache misses. Moreover, when processing an image block, conventional applications load superfluous data into the cache in addition to the block itself. Consequently, unless nearby blocks are processed immediately, this superfluous data occupies cache memory without being used, which reduces cache efficiency.
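The cache-line cost described above can be illustrated with a minimal sketch. It is not taken from the source: the function names, the 64-byte cache line, the one-byte pixels, the 640-pixel image width, and the assumption that the image starts on a cache-line boundary are all hypothetical values chosen for illustration.

```python
def cache_lines_touched(x, y, block, image_width, line_bytes=64, bytes_per_pixel=1):
    """Return the set of cache-line indices covering a block-by-block tile
    whose top-left pixel is (x, y) in a raster-scan (row-major) image.

    Assumption (hypothetical): the image base address is cache-line aligned.
    """
    lines = set()
    for row in range(y, y + block):
        # Byte offsets of the first and last pixel of this block row.
        first = (row * image_width + x) * bytes_per_pixel
        last = first + block * bytes_per_pixel - 1
        lines.update(range(first // line_bytes, last // line_bytes + 1))
    return lines


def useful_fraction(x, y, block, image_width, line_bytes=64, bytes_per_pixel=1):
    """Fraction of the bytes brought into the cache that belong to the block."""
    touched = cache_lines_touched(x, y, block, image_width, line_bytes, bytes_per_pixel)
    useful_bytes = block * block * bytes_per_pixel
    return useful_bytes / (len(touched) * line_bytes)


if __name__ == "__main__":
    # An 8x8 block aligned to a cache line: one line per block row.
    print(len(cache_lines_touched(0, 0, 8, 640)))
    print(useful_fraction(0, 0, 8, 640))
    # The same block shifted so each row straddles a cache-line boundary.
    print(len(cache_lines_touched(60, 0, 8, 640)))
```

Under these assumptions, an aligned 8×8 block touches eight cache lines and only one eighth of the loaded bytes belong to the block; a block whose rows straddle a line boundary touches sixteen.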
One solution for providing improved cache locality when processing image data involves block-based memory arrangements. Although block-based memory arrangement schemes provide higher cache locality, access to a vertical set of 8 or 16 pixels using one instruction is still not supported. To access a vertical set of data, most conventional implementations use pack and unpack operations to transpose the data. Unfortunately, this is a slow procedure in most applications. Moreover, in some applications, such as image/video compression, the pixel data must be accessed in varying scan orders, which amounts to nearly random access of the pixel data. Therefore, there remains a need to overcome one or more of the limitations in the above-described existing art.
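One possible block-based arrangement can be sketched as follows, to show both the locality gain and the remaining difficulty with vertical access. The source does not specify a particular layout; the tile ordering, the function names, the one-byte pixels, and the requirement that the image width be a multiple of the block size are assumptions made for illustration.

```python
def tiled_offset(x, y, width, block=8):
    """Byte offset of pixel (x, y) when the image is stored tile by tile.

    Assumptions (hypothetical): one byte per pixel, tiles stored in
    row-major order, pixels inside a tile stored in row-major order,
    and width is a multiple of block.
    """
    tiles_per_row = width // block
    tile_index = (y // block) * tiles_per_row + (x // block)
    return tile_index * block * block + (y % block) * block + (x % block)


def block_offsets(x0, y0, width, block=8):
    """Offsets of every pixel in the block whose top-left pixel is (x0, y0)."""
    return [tiled_offset(x0 + dx, y0 + dy, width, block)
            for dy in range(block) for dx in range(block)]


def column_offsets(x, y0, width, block=8):
    """Offsets of a vertical run of `block` pixels starting at (x, y0)."""
    return [tiled_offset(x, y0 + dy, width, block) for dy in range(block)]


if __name__ == "__main__":
    # A whole 8x8 block occupies 64 consecutive bytes (one 64-byte line).
    offsets = block_offsets(8, 0, 640)
    print(min(offsets), max(offsets))
    # A vertical set of 8 pixels is still strided, not contiguous.
    print(column_offsets(8, 0, 640))
```

In this sketch an aligned 8×8 block maps to 64 consecutive bytes, so it fits a single 64-byte cache line. A vertical set of eight pixels, however, is still spaced eight bytes apart, so one contiguous load cannot fetch it, which is consistent with the need for a transpose described above.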