Motion estimation (ME) is an example of a time-critical application that requires a great deal of processing power. Therefore, specialized circuitry is usually implemented in hardware in a massively parallel way, as single instruction multiple data (SIMD) architectures. These architectures commonly have one processing element (PE) per value to be calculated, e.g. for comparing a pixel of a current picture to reference pixels. Usually, the corresponding pixel in the previous picture and its neighbors serve as reference pixels. In a more generalized view, any one-, two- or multi-dimensional data set can serve as input to the processing. The PE for ME can access a current pixel and a number of reference pixels stored in a memory. The pixels are usually copied from a large image memory into a smaller operating memory that can be accessed faster. This copy operation takes a relatively long time, since the large image memory is slow. The operating memory contains a number of blocks that may, according to the employed encoding scheme, serve as reference for predicting the current block. Blocks are often square, with 16×16, 8×8 or 4×4 pixels.
Usually, the results of the PEs for a pixel block are accumulated to calculate a measure of similarity between the current block and a particular reference block. Subsequent circuitry then determines the most similar reference block and encodes the current block based on it.
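The accumulation and selection steps described above can be illustrated in software. The following is a minimal sketch, not a description of any particular circuit. It assumes that the measure of similarity is the sum of absolute differences (SAD) and that every candidate position in the operating memory is tested (a full search); the text prescribes neither. The function names sad and best_match and the sample data are hypothetical.

```python
def sad(current, reference, top, left):
    """Sum of absolute differences between `current` and the same-sized
    block of `reference` whose top-left corner is at (top, left)."""
    total = 0
    for y, row in enumerate(current):
        for x, pixel in enumerate(row):
            # In hardware, one PE per pixel would compute this difference;
            # here the per-pixel results are accumulated sequentially.
            total += abs(pixel - reference[top + y][left + x])
    return total


def best_match(current, reference):
    """Return (sad, top, left) for the most similar candidate block.
    Assumption: full search over every position where the block fits."""
    block_h, block_w = len(current), len(current[0])
    ref_h, ref_w = len(reference), len(reference[0])
    best = None
    for top in range(ref_h - block_h + 1):
        for left in range(ref_w - block_w + 1):
            cost = sad(current, reference, top, left)
            if best is None or cost < best[0]:
                best = (cost, top, left)
    return best


# Hypothetical 4x4 operating memory and 2x2 current block.
reference = [
    [10, 10, 10, 10],
    [10, 10, 50, 60],
    [10, 10, 70, 80],
    [10, 10, 10, 10],
]
current = [[50, 60], [70, 80]]
print(best_match(current, reference))  # (0, 1, 2)
```

In this toy example the current block is found unchanged at row 1, column 2 of the operating memory, so the accumulated difference there is zero. A hardware implementation would evaluate the inner loops in parallel, one PE per pixel.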
Thus, a PE needs to have access to a number of reference pixels that are distributed all over the operating memory, which is relatively large compared to the current block. If redundant pixel storage is to be avoided, an architecture with complicated connection circuitry is required. For example, US2003/0174252 uses a programmable crossbar switch for distributing pixel values from a memory subsystem to the PEs. A bit mask controls which pixel data can be accessed by a PE. Crossbar switches commonly contain a large number of multiplexer elements in a deep hierarchy, which is disadvantageous at higher operating frequencies.
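To give a rough sense of why such a hierarchy becomes deep, the sketch below estimates the size of a generic crossbar under a simplifying assumption: each output is driven by its own tree of pixel-wide 2:1 multiplexers that selects one of all inputs. This is a textbook model chosen for illustration only. It does not describe the circuit of US2003/0174252, and the function name mux_tree_cost and the 48×48 memory size are hypothetical.

```python
def mux_tree_cost(num_inputs, num_outputs):
    """Estimate a crossbar built from one 2:1 multiplexer tree per output.

    Returns (levels, total_muxes), where `levels` is the number of
    multiplexers a value passes through on its way to an output.
    """
    if num_inputs < 1 or num_outputs < 1:
        raise ValueError("need at least one input and one output")
    # ceil(log2(num_inputs)) computed with integers only.
    levels = (num_inputs - 1).bit_length()
    # A binary tree that selects 1 of N inputs contains N - 1 multiplexers.
    total_muxes = num_outputs * (num_inputs - 1)
    return levels, total_muxes


# Assumed example: 16x16 = 256 PEs, each able to select any pixel of a
# hypothetical 48x48 operating memory.
levels, muxes = mux_tree_cost(48 * 48, 16 * 16)
print(levels, muxes)  # 12 589568
```

Under these assumptions every pixel value passes through 12 multiplexer levels before reaching a PE, and this combinational depth grows with the size of the operating memory. Longer paths of this kind make it harder to reach a high clock frequency.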
Usually, ME circuitry is implemented in hardware, e.g. as an ASIC. However, known implementations generally suffer from long and complicated connection paths between the PEs and the memory subsystem, and from the resulting limit on the maximum operating frequency. An optimized architecture is therefore desirable.