1. Field of the Invention
The present invention relates to the processing of performance sensitive transforms, and more particularly to improved processing of performance sensitive transforms.
2. Description of the Related Art
Today's image processing applications require ever-increasing processing power as image resolution and quality demands increase. For example, a high-end production continuous-tone color digital printer prints four separate colors (CMYK) on both sides of 24-inch-wide paper at six inches per second. The combined (four colors×one byte per color×24 inches wide×six inches per second×two sides) output rate of 1152 square inches per second at a resolution of 600 pixels (or pels) per inch requires a total image throughput rate of 415 megabytes per second. This is already several times the rate of High Definition TV (HDTV) video output data streams. Fortunately, there are eight print-heads and the printer has only 16 shades per color (four bits per pel), so the output to each print engine is a more manageable 25 megabytes per second. Leaving the data encoded in JPEG during transmission to the hardware, and decoding the data in the hardware, further cuts down on the total bandwidth required.
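The throughput figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names are hypothetical, and the per-head figure assumes that four bits per pel halves the one-byte-per-color data and that the load is split evenly across the eight print-heads.

```python
# Worked arithmetic for the printer example (names are illustrative).
COLORS = 4             # CMYK
BYTES_PER_COLOR = 1
WIDTH_INCHES = 24
INCHES_PER_SECOND = 6
SIDES = 2
PELS_PER_INCH = 600


def total_bytes_per_second():
    # 4 x 1 x 24 x 6 x 2 = 1152 (the combined figure quoted in the text)
    combined = COLORS * BYTES_PER_COLOR * WIDTH_INCHES * INCHES_PER_SECOND * SIDES
    # Each square inch holds 600 x 600 pels.
    return combined * PELS_PER_INCH ** 2


def per_head_bytes_per_second(heads=8, bits_per_pel=4):
    # Assumption: 4-bit pels halve the data, which is then split evenly by head.
    return total_bytes_per_second() * bits_per_pel / 8 / heads


if __name__ == "__main__":
    print(total_bytes_per_second() / 1e6)     # total, in megabytes per second
    print(per_head_bytes_per_second() / 1e6)  # per print-head, in megabytes per second
```

Under these assumptions the total is about 415 megabytes per second and the per-head rate is just under 26 megabytes per second, in line with the approximate figures quoted above.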
However, future printers are likely to have twice the resolution on each axis and print at least an order of magnitude faster. Thus, the demand for processing power for high-end color printers is increasing much more rapidly than Moore's law.
These processing demands are in no way unique to printing. Image processing is now a pervasive technology in hardware domains that have neither the cooling capabilities nor the processing power of high-speed color printers. These include domains without special-purpose hardware, where the processing power is limited by the strength and life of a battery (e.g. personal digital assistants (PDAs) or cellular telephones) or by technology long since deployed, such as in orbiting satellites.
One approach to meeting the increasing demands for image processing applications is to mitigate the processing requirements of these applications themselves. That is, simplify the implementation and power requirements of the underlying digital filter (i.e. transform), and parallelize the corresponding transform algorithm. This approach is in contrast to simply improving the hardware (i.e. Moore's law) such that the algorithms execute faster.
The Discrete Cosine Transform (DCT) is a widely used transform for image processing. For example, it is the transform used in both the JPEG (for example see: J. L. Mitchell, W. B. Pennebaker, JPEG Still Image Data Compression Standard, Van Nostrand Reinhold: New York © 1993) and MPEG (for example see: J. L. Mitchell, W. B. Pennebaker, D. LeGall, and C. Fogg, MPEG Video Compression Standard, Chapman & Hall: New York © 1997) standards. By its mathematical definition, it is a computationally complex algorithm that relies on cosine multiplications to transform data into and from the frequency domain.
An example of an order-eight one dimensional (1-D) DCT can be described with the following mathematical definitions.
FDCT:

$$S(u) = \frac{C_u}{2} \sum_{x=0}^{7} f(x) \cos\left[\frac{(2x+1)\,u\,\pi}{16}\right]$$

IDCT:

$$f(x) = \sum_{u=0}^{7} \frac{C_u}{2}\, S(u) \cos\left[\frac{(2x+1)\,u\,\pi}{16}\right]$$

where

$$C_u = 2^{-1/2} \ \text{for} \ u = 0, \qquad C_u = 1 \ \text{for} \ u > 0$$

$$x = 0, 1, \ldots, 7 \qquad u = 0, 1, \ldots, 7$$

$S(u)$ = the DCT coefficients

$f(x)$ = the input sample data

Note the computations required for each output of the forward DCT (FDCT): eight cosine multiplications, seven additions, and one multiplication by the constant $C_u$, while the inverse DCT (IDCT) is equally as complex. As a result, because a transform implementation with this amount of complexity is unacceptable in most image and video compression applications, many fast and efficient implementations of the DCT have been proposed in which the complexity of the algorithm is mitigated through various means.
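For reference, the order-eight definitions above can be evaluated directly. The following is a minimal, unoptimized sketch intended only to make the operation count concrete; the function names `fdct` and `idct` are illustrative and are not taken from any cited implementation.

```python
import math

N = 8  # order-eight transform


def c(u):
    # Cu = 2^(-1/2) for u = 0, and 1 for u > 0
    return 2 ** -0.5 if u == 0 else 1.0


def fdct(f):
    # S(u) = (Cu/2) * sum over x of f(x) * cos[(2x+1) u pi / 16]
    return [
        (c(u) / 2) * sum(f[x] * math.cos((2 * x + 1) * u * math.pi / 16)
                         for x in range(N))
        for u in range(N)
    ]


def idct(S):
    # f(x) = sum over u of (Cu/2) * S(u) * cos[(2x+1) u pi / 16]
    return [
        sum((c(u) / 2) * S[u] * math.cos((2 * x + 1) * u * math.pi / 16)
            for u in range(N))
        for x in range(N)
    ]


if __name__ == "__main__":
    samples = [52, 55, 61, 66, 70, 61, 64, 73]  # arbitrary sample data
    coefficients = fdct(samples)
    print(coefficients)
    print(idct(coefficients))  # reproduces the samples up to rounding error
```

Each output in this direct form costs eight cosine multiplications, which is the cost the fast algorithms discussed next set out to reduce.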
For example, the Vetterli and Ligtenberg fast 1-D DCT (see: Martin Vetterli and Adriaan Ligtenberg, “A Discrete Fourier-Cosine Transform Chip”, IEEE Journal on Selected Areas in Communications, Vol. SAC-4, No. 1, pp. 49-61, January 1986) reduces the total number of operations for all eight outputs to 13 multiplications and 29 additions by exploiting the trigonometric properties of the equations. The Arai, Agui, and Nakajima (AAN) DCT (see: Y. Arai, T. Agui, and M. Nakajima, “A Fast DCT-SQ Scheme for Images”, Transactions of the IEICE E 71(11):1095, November 1988) demonstrates the ability to scale the DFT to a DCT, thus producing a scaled DCT. In this DCT, the quantization step is exploited to include the scale terms necessary to convert the DFT outputs into DCT outputs.
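The principle of folding scale terms into the quantization step can be illustrated as follows. This sketch does not implement the AAN algorithm; the coefficient values, scale factors, and step sizes are hypothetical numbers chosen only to show that quantizing scaled outputs with a correspondingly scaled table yields the same quantized values.

```python
def quantize(coefficients, table):
    # Divide each coefficient by its step size and round to an integer.
    return [round(value / step) for value, step in zip(coefficients, table)]


# Hypothetical true DCT coefficients S(u) and quantization step sizes.
true_coefficients = [212.0, -35.5, 10.25, 47.0, -3.0, 90.0, -120.0, 7.5]
step_sizes = [16, 11, 10, 16, 24, 40, 51, 61]

# Hypothetical per-coefficient scale factors k(u). A scaled transform is
# assumed to output k(u) * S(u) instead of S(u), saving multiplications.
scale = [0.5, 0.72, 0.77, 0.85, 1.0, 1.27, 1.85, 3.6]
scaled_outputs = [s * k for s, k in zip(true_coefficients, scale)]

# Fold the scale factors into the table once, ahead of time.
folded_table = [q * k for q, k in zip(step_sizes, scale)]

reference = quantize(true_coefficients, step_sizes)
folded = quantize(scaled_outputs, folded_table)

if __name__ == "__main__":
    print(reference)
    print(folded)
```

Because the folded table is computed once per quantization table rather than once per block, the scaling adds no per-sample cost.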
J. Bracamonte, P. Stadelmann, M. Ansorge, F. Pellandini, “A Multiplierless Implementation Scheme for the JPEG Image Coding Algorithm”, NORSIG 2000, IEEE Nordic Signal Processing Symposium, Kolmarden, Sweden, June 2000, pp. 17-20, describes the implementation of the 1-D DCT using the AAN algorithm, but with cosine multiplications implemented in terms of dyadic rationals (i.e. shift and add operations).
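As an illustration of a dyadic rational approximation, a multiplication by cos(π/4) ≈ 0.70711 can be replaced by a multiplication by 181/256 ≈ 0.70703, which in turn reduces to shifts and adds. The constant 181/256 and the function name are chosen here for illustration and are not taken from the cited paper.

```python
import math


def mul_cos_pi_over_4(x):
    # 181 = 128 + 32 + 16 + 4 + 1, so x * 181 needs only shifts and adds.
    acc = (x << 7) + (x << 5) + (x << 4) + (x << 2) + x
    # Dividing by 256 = 2^8 is a right shift (rounds toward minus infinity).
    return acc >> 8


if __name__ == "__main__":
    x = 1000
    print(mul_cos_pi_over_4(x))       # shift-and-add approximation
    print(x * math.cos(math.pi / 4))  # floating point reference
```

A wider numerator gives a closer approximation at the cost of more additions, which is the accuracy versus complexity trade-off such schemes must balance.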
“Fast Multiplierless Approximations of the DCT With the Lifting Scheme”, Jie Liang, Trac D. Tran, IEEE Transactions on Signal Processing Vol. 49, No. 12, December 2001, also discloses the implementation of a multiplierless DCT, but using lifting functions.
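The general idea of lifting can be sketched with a single plane rotation, which factors into three lifting steps. This is a generic illustration and not the structure disclosed in the cited paper; the angle, the eight-bit precision, and the function names are assumptions. The integer products shown would themselves be expanded into shifts and adds in a multiplierless design.

```python
import math

K = 8                # assumed number of fractional bits
THETA = math.pi / 8  # assumed rotation angle

# Rotation by THETA = three lifting steps with p = (cos - 1)/sin and u = sin.
P = round((math.cos(THETA) - 1) / math.sin(THETA) * (1 << K))
U = round(math.sin(THETA) * (1 << K))


def lift_forward(x, y):
    # Each step updates one value using only the other, so it can be undone.
    x += (P * y) >> K
    y += (U * x) >> K
    x += (P * y) >> K
    return x, y


def lift_inverse(x, y):
    # Undo the steps in reverse order with subtraction: exact on integers.
    x -= (P * y) >> K
    y -= (U * x) >> K
    x -= (P * y) >> K
    return x, y


if __name__ == "__main__":
    rotated = lift_forward(100, 50)
    print(rotated)                 # approximates a rotation by pi/8
    print(lift_inverse(*rotated))  # recovers (100, 50) exactly
```

The exact invertibility holds regardless of how coarsely the coefficients are rounded, which is one reason lifting is attractive for integer transforms.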
Further, improvements to DCT processing have been described in the following co-pending and commonly-assigned patent applications: “Reducing errors in performance sensitive transformations” to Hinds et al., having application Ser. No. 10/960,253; “Compensating for errors in performance sensitive transformations” to Hinds et al., having application Ser. No. 10/960,255; and “Approximations used in performance sensitive transformations which contain sub-transforms” to Mitchell et al., having application Ser. No. 11/041,563. Ser. No. 10/960,253 discloses replacing the cosine constants in a transform equation with approximations which comprise an integer numerator and a common floating point denominator. Ser. No. 10/960,255 further improves on Ser. No. 10/960,253 by modifying the result of the DCT using an adjustment factor to compensate for errors introduced as a result of the approximation used. Ser. No. 11/041,563 also improves on Ser. No. 10/960,255 by considering each sub-transform of the transform equation separately when selecting the approximations to replace the cosine constants.
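The notion of an integer numerator over a common floating point denominator can be illustrated as follows. This sketch is not the selection procedure of the referenced applications, which is not described here; the search strategy, the candidate denominators, and the function names are assumptions made purely for illustration.

```python
import math


def approximate(constants, denominator):
    # Replace each constant c with n / denominator, where n is an integer.
    numerators = [round(value * denominator) for value in constants]
    worst_error = max(abs(n / denominator - value)
                      for n, value in zip(numerators, constants))
    return numerators, worst_error


def best_denominator(constants, candidates):
    # Assumed criterion: minimize the worst-case approximation error.
    return min(candidates, key=lambda d: approximate(constants, d)[1])


if __name__ == "__main__":
    # Cosine constants of the order-eight DCT: cos(k * pi / 16), k = 1..7.
    cosines = [math.cos(k * math.pi / 16) for k in range(1, 8)]
    candidates = [d / 8 for d in range(8, 257)]  # hypothetical search range
    chosen = best_denominator(cosines, candidates)
    print(chosen, approximate(cosines, chosen))
```

Because the denominator is shared by every constant, it can be factored out of the transform and applied once elsewhere, for example in a scaling or quantization step.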
However, faster and more accurate DCT implementations are an ongoing need in the industry. Such implementations may make use of parallel processing by loading several elements into one register such that a single operation on the register acts on each element loaded into the register. To exploit such parallel processing fully, it is necessary to keep elements small whilst at the same time controlling the error introduced by lowering the precision of the elements.
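The packing of several elements into one register can be sketched as follows, using a Python integer masked to 32 bits to stand in for a hardware register. The two-lane, 16-bit layout and the function names are assumptions for illustration only.

```python
LANE_BITS = 16                    # assumed element width
LANE_MASK = (1 << LANE_BITS) - 1
WORD_MASK = (1 << 32) - 1         # simulated 32-bit register


def pack(low, high):
    # Place two unsigned 16-bit elements side by side in one word.
    return ((high & LANE_MASK) << LANE_BITS) | (low & LANE_MASK)


def unpack(word):
    return word & LANE_MASK, (word >> LANE_BITS) & LANE_MASK


def packed_add(a, b):
    # One addition on the word adds both lanes at once.
    return (a + b) & WORD_MASK


if __name__ == "__main__":
    # Both lanes are added by a single operation.
    print(unpack(packed_add(pack(100, 200), pack(7, 9))))
    # If a lane overflows, its carry corrupts the neighbouring lane.
    print(unpack(packed_add(pack(0xFFFF, 0), pack(1, 0))))
```

The second call shows why element size and precision must be controlled: a carry out of the low lane spills into the high lane, so intermediate values must be guaranteed to fit within their lanes.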