Multi-dimensional data structures, such as digital images and digital video, are commonly stored in 2-dimensional memories. When these data structures are processed, the data values are often accessed non-sequentially. For example, when a spatial filter is applied to a sub-array or tile of an image, data values (pixels) must be retrieved from the memory in the correct order. Often, the data values are fed into a data pipeline for efficient processing, and the filtering is performed ‘in-place’ for efficient memory use.
A common problem in the filtering of 2-dimensional images is how to handle border conditions on output, especially when the processing is done in-place on tiles within a larger image array. For example, when a 3×3 filter is applied to a 16×16 tile, 18 input pixels per row need to be processed to produce the 16 output pixels. An efficient pipelined implementation results in 1 output pixel for every input pixel in a row. This results in 18 output pixels, of which the first two are invalid since they were created from input values in the partially filled pipeline that had not been initialized.
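The border behavior described above can be illustrated with a minimal sketch of a one-dimensional, three-tap pipelined filter applied to a single row. The function name `pipelined_row_filter`, the tap values, and the use of `None` to mark an output computed from an uninitialized pipeline are assumptions made for illustration; in hardware the invalid outputs would simply be arbitrary values.

```python
def pipelined_row_filter(row, taps):
    """Emit one output per input pixel.

    Outputs produced before the pipeline is full are marked None,
    standing in for the invalid values a real pipeline would emit.
    """
    n = len(taps)
    pipeline = [None] * n                 # None marks an uninitialized stage
    outputs = []
    for pixel in row:
        pipeline = pipeline[1:] + [pixel]  # shift the new pixel in
        if None in pipeline:
            outputs.append(None)           # invalid: pipeline not yet full
        else:
            outputs.append(sum(t * p for t, p in zip(taps, pipeline)))
    return outputs


row = list(range(18))                      # 18 input pixels for a 16-pixel-wide tile
out = pipelined_row_filter(row, [1, 1, 1])
valid = [v for v in out if v is not None]  # the 16 usable output pixels
```

With 18 inputs the sketch yields 18 outputs, of which the first two are invalid and the remaining 16 correspond to the tile row.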
One approach to solving this problem is to pre-load the first two input pixels of each row before processing and then only produce 16 output pixels. This requires extra steps for each output row that do not fit into the normal flow of processing. The extra steps and the time needed to restart the processing add complexity and reduce performance.
Another approach is to produce all 18 output pixels but reserve a border of extra pixels around the output buffer to hold the invalid output. This border is ignored for all subsequent uses of the results. This gives good performance, but when the output is written back into the source image, the invalid values overwrite the pixels surrounding the tile, making the approach unsuitable for processing images in-place.
A further problem when sequentially accessing a multi-dimensional array is that each dimension has a harmonic relation to the next smaller dimension, i.e. it is an integer multiple. This means that only the rollover of an index of one dimension increments the index of the next larger dimension. This creates a chain of dependencies between the indices that affects the inputs to the adder network used to generate the next element address. This, in turn, may result in a long propagation delay in calculating the next address for the worst cases of rollover.
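The dependency chain between indices can be sketched in software as follows. The names `next_indices` and `linear_address`, and the convention that dimension 0 is the fastest-varying, are assumptions for illustration only; the loop models the sequential carry that hardware would have to resolve within one address-generation step.

```python
def next_indices(indices, sizes):
    """Increment a multi-dimensional index.

    Returns the new indices and the number of dimensions that rolled
    over, i.e. the length of the carry chain for this step.
    """
    indices = list(indices)
    rollovers = 0
    for d in range(len(sizes)):       # d = 0 is the fastest-varying dimension
        indices[d] += 1
        if indices[d] < sizes[d]:
            break                     # no rollover: larger dimensions untouched
        indices[d] = 0                # rollover: carry into the next larger dimension
        rollovers += 1
    return indices, rollovers


def linear_address(indices, sizes):
    """Element address when each dimension is an integer multiple of the previous."""
    addr, stride = 0, 1
    for i, n in zip(indices, sizes):
        addr += i * stride
        stride *= n
    return addr


sizes = (4, 3, 2)
step_plain = next_indices((0, 0, 0), sizes)   # no rollover
step_worst = next_indices((3, 2, 0), sizes)   # two indices roll over in one step
```

In the second call, the index of the third dimension cannot be updated until the rollovers of the first two have been resolved, which is the worst-case propagation the text refers to.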
One approach to minimizing this problem is to limit the number of dimensions according to the clock rate, so that the worst-case rollover can still be resolved within one clock period. This approach is simple, but does not maximize performance.
Another approach is to add logic that creates stall cycles when the rollover will propagate over more than one index. This adds some complexity and reduces performance.
Yet another approach is to add extra adders and logic to pre-calculate all possible rollovers and their effects on the final address, and then select the appropriate one based on the amount of rollover. This adds complexity.
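A software analogue of this pre-calculation approach is sketched below, under the assumption that the effect of each possible amount of rollover can be expressed as a fixed address increment. The names `rollover_deltas` and `walk_tile`, and the example tile sizes and strides, are hypothetical. In hardware the candidate addresses would be computed in parallel and one selected by a multiplexer; here the selection is modeled by indexing a table with the rollover count.

```python
def rollover_deltas(sizes, strides):
    """Pre-calculate the address increment for each possible rollover count.

    deltas[k] is the increment applied when the k fastest dimensions
    roll over and dimension k advances by one.
    """
    deltas = []
    rewind = 0                          # distance to return rolled-over dims to zero
    for n, s in zip(sizes, strides):
        deltas.append(s - rewind)
        rewind += (n - 1) * s
    return deltas


def walk_tile(base, sizes, strides):
    """List the addresses of a tile, using one table lookup and one add per step."""
    deltas = rollover_deltas(sizes, strides)
    idx = [0] * len(sizes)
    total = 1
    for n in sizes:
        total *= n
    addr = base
    addrs = []
    for _ in range(total):
        addrs.append(addr)
        k = 0
        while k < len(sizes) and idx[k] == sizes[k] - 1:
            idx[k] = 0                  # this dimension rolls over
            k += 1
        if k < len(sizes):
            idx[k] += 1
            addr += deltas[k]           # select the pre-calculated increment
    return addrs


# A 2x2x2 tile inside a larger array with a row pitch of 8 and a plane pitch of 64.
tile_addrs = walk_tile(0, (2, 2, 2), (1, 8, 64))
```

The table needs one entry per dimension, which reflects the added cost noted above: every additional dimension requires another pre-calculated candidate and a wider selection.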