Most digital image processing algorithms are memory-access intensive. For example, the simplest software implementation of a 3×3 window mean filter requires, for each pixel, 9 memory reads, 10 memory address calculations, 8 add operations, one divide operation, and one memory store operation, as follows:
OUT[i][j]=(IN[i−1][j−1]+IN[i−1][j]+IN[i−1][j+1]+IN[i][j−1]+IN[i][j]+IN[i][j+1]+IN[i+1][j−1]+IN[i+1][j]+IN[i+1][j+1])/9;
where OUT[i][j] is the output image gray scale value at row i and column j, and IN[i][j] is the input image gray scale value at row i and column j.
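As an illustration only, the direct form above can be sketched in Python. The function name mean3x3_naive is hypothetical, and copying border pixels unchanged is an assumption, since the text does not say how borders are handled. A counter of input reads makes the nine reads per output pixel visible.

```python
def mean3x3_naive(img):
    """Direct 3x3 mean filter over the interior pixels of img (a list of rows).

    Returns (out, reads). Border pixels are copied unchanged (assumption:
    border handling is not specified), and reads counts input pixel fetches.
    """
    rows, cols = len(img), len(img[0])
    out = [row[:] for row in img]          # start from a copy so borders are kept
    reads = 0
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            total = 0
            for di in (-1, 0, 1):          # nine reads per output pixel
                for dj in (-1, 0, 1):
                    total += img[i + di][j + dj]
                    reads += 1
            out[i][j] = total // 9         # integer divide, as for gray levels
    return out, reads
```

For a 4×5 image, the six interior pixels cost 6 × 9 = 54 reads.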
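Once SUM has been initialized to the sum of the nine input pixels in the first window of a row, the above algorithm can be optimized to update SUM incrementally as the window slides one column to the right, subtracting the column that leaves the window (column j−2) and adding the column that enters it (column j+1):
SUM=SUM−(IN[i−1][j−2]+IN[i][j−2]+IN[i+1][j−2])+(IN[i−1][j+1]+IN[i][j+1]+IN[i+1][j+1]);
OUT[i][j]=SUM/9;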
Even with this optimization, each pixel still requires 6 memory read operations, one memory write operation, 3 add operations, 3 subtract operations, one divide operation, and 7 memory address calculations.
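A minimal sketch of this running-sum variant follows, again in Python and again assuming that border pixels are simply copied; the name mean3x3_running is hypothetical. Each row pays nine reads to initialize SUM and then six reads per pixel, matching the counts given above.

```python
def mean3x3_running(img):
    """3x3 mean filter using a running window sum (img is a list of rows).

    Per row: nine reads initialise SUM for the window centred at column 1;
    each later step subtracts column j-2 (leaving) and adds column j+1
    (entering), i.e. six reads per pixel. Borders are copied unchanged
    (assumption: border handling is not specified). Returns (out, reads).
    """
    rows, cols = len(img), len(img[0])
    out = [row[:] for row in img]
    reads = 0
    for i in range(1, rows - 1):
        total = 0
        for di in (-1, 0, 1):              # initial window: columns 0..2
            for dj in (0, 1, 2):
                total += img[i + di][dj]
                reads += 1
        out[i][1] = total // 9
        for j in range(2, cols - 1):
            leaving = img[i - 1][j - 2] + img[i][j - 2] + img[i + 1][j - 2]
            entering = img[i - 1][j + 1] + img[i][j + 1] + img[i + 1][j + 1]
            reads += 6
            total = total - leaving + entering
            out[i][j] = total // 9
    return out, reads
```

For a 4×5 image this gives 2 × (9 + 2 × 6) = 42 reads instead of 6 × 9 = 54 for the direct form; the saving grows with image width, since the nine-read initialization is paid only once per row.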
For a general-purpose microprocessor, memory address calculations and memory read and write operations account for a large portion of the executed instruction stream. From an architectural viewpoint, traditional microprocessors can be classified into three types: 1) general-purpose CPUs (CISC/RISC), 2) DSPs, and 3) parallel array processors.
A general-purpose CPU normally has one arithmetic logic unit (ALU) that handles all data manipulations and address calculations serially. A DSP has one or more simple adders that update the data address registers while the main ALU performs data calculations at the same time. Combined with a single-cycle multiplier and accumulator (MAC), this feature can double or triple the speed of a one-dimensional (1-D) filter.
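To make the DSP argument concrete, the inner loop of a direct-form 1-D FIR filter is sketched below in Python; the name fir_1d is hypothetical. Each tap needs one multiply-accumulate plus two address updates. A general-purpose CPU executes these serially, whereas a DSP's address adders can perform the two updates in parallel with the MAC. Python does not model that parallelism; the index variables only stand in for address registers.

```python
def fir_1d(x, h):
    """Direct-form 1-D FIR filter y[n] = sum_k h[k] * x[n - k].

    Only outputs where the filter fully overlaps x are produced.
    data_ptr and coef_ptr mimic a DSP's data and coefficient address registers.
    """
    taps = len(h)
    y = []
    for n in range(taps - 1, len(x)):
        acc = 0                    # accumulator register
        data_ptr = n               # data address register, walks backwards
        coef_ptr = 0               # coefficient address register, walks forwards
        for _ in range(taps):
            acc += h[coef_ptr] * x[data_ptr]   # one MAC per tap
            data_ptr -= 1          # address update (a separate adder on a DSP)
            coef_ptr += 1          # address update (a separate adder on a DSP)
        y.append(acc)
    return y
```

For example, fir_1d([1, 2, 3, 4], [1, 1]) computes pairwise sums of neighbours.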
A parallel array processor has many similar, simplified ALUs, and the data to be processed is fed through a hardwired data path. Many special-function processors, such as FFT or motion estimation processors, and so-called general-purpose systolic or wave-front processors share the same basic idea. However, these kinds of processors are typically inflexible, support only limited types of image processing operations, and occupy substantial silicon area.
Another example of a computation-intensive application is a histogram operation on an image. A histogram is the distribution of gray levels in a given input image, and it provides a representation of the appearance of the image. Histogram-based image enhancement and noise filtering methods have been widely used in various image processing fields and have proven very effective. Histogram equalization is the most widely known method for image contrast enhancement and is described in J. S. Lim, “Two-Dimensional Signal and Image Processing”, Prentice Hall, Englewood Cliffs, N.J., 1990. Furthermore, a conventional histogram extraction circuit is disclosed in U.S. Pat. No. 6,219,447, issued to Hyo-seung Lee on Apr. 17, 2001, the entire contents of which are hereby incorporated by reference.
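For readers unfamiliar with histogram equalization, one common textbook form is sketched below in Python. It is not necessarily the exact formulation in the Lim reference, and the name equalize and the default of 256 gray levels are assumptions. Each gray level k is mapped to (levels − 1) × CDF(k) / N, rounded to the nearest integer, where CDF(k) counts the pixels with gray level ≤ k and N is the total pixel count.

```python
def equalize(img, levels=256):
    """Histogram equalisation of an integer gray-level image (list of rows).

    Maps level k to round((levels - 1) * CDF(k) / N), where CDF(k) is the
    number of pixels with level <= k and N is the pixel count.
    """
    hist = [0] * levels
    for row in img:
        for g in row:
            hist[g] += 1                   # build the gray-level histogram
    n = sum(hist)                          # total pixel count N
    mapping = []
    cumulative = 0
    for count in hist:
        cumulative += count                # CDF(k)
        # Integer round-to-nearest (ties up) of (levels - 1) * CDF(k) / N.
        mapping.append(((levels - 1) * cumulative + n // 2) // n)
    return [[mapping[g] for g in row] for row in img]
```

An image whose gray levels are crowded into a narrow low range is spread toward the full 0..255 range, which is the contrast-enhancement effect described above.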
Traditional histogram operations are also memory intensive. After the histogram is initialized, each image pixel requires one image memory address calculation, one image memory read operation, one histogram memory address calculation, one histogram memory read operation, one add operation, and one histogram memory write operation, as in:
Histogram[Image[u][v]]++;
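The per-pixel sequence above can be sketched in Python as follows; the function name histogram and the 256-level default are illustrative assumptions. The comments mark which of the listed memory operations each statement corresponds to.

```python
def histogram(img, levels=256):
    """Gray-level histogram: Histogram[Image[u][v]]++ for every pixel."""
    hist = [0] * levels                    # initialisation
    for u in range(len(img)):
        for v in range(len(img[0])):
            g = img[u][v]                  # image address calculation + image read
            hist[g] = hist[g] + 1          # histogram read, add, histogram write
    return hist
```

Every pixel thus touches two different memories, and each histogram update is a read-modify-write whose address depends on the image data just read.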
Therefore, there is a need for an efficient and fast method and apparatus for two-dimensional (2-D) image processing that minimizes the memory access bottleneck.