The present invention relates to encoding of visual images.
Wireless data services now enable a new generation of high-performance, low-power-consumption mobile devices to access network-centric applications and content anywhere, anytime. Handheld devices include personal digital assistants (PDAs), email companions, and other data-centric mobile products such as Palm OS, Symbian, and Pocket PC products. The main functionality of such devices has been personal information manager (PIM) applications. But as more of these devices gain network connectivity options, applications such as voice and email are becoming important. Additionally, next-generation mobile phones are hybrid devices that extend the voice-centric nature of current-generation (2G) handsets. These devices are connected to packet-based networks, which deliver data services in addition to voice services. Handsets connected to 2.5G networks such as GPRS and PHS allow an always-on data network connection. This enables further proliferation of multimedia- and graphics-based applications in the consumer segment of this market. 3G handsets have been designed from the ground up to interface to high-speed, packet-based networks that deliver speeds from 20 Kbps to 2 Mbps. These handsets, in addition to the features of 2.5G phones, have the capability to support 2-way video, share pictures and video clips, use location-based information, provide a rich web experience, and support next-generation server-based applications for business, such as always-on email.
As mobile applications become richer and more complex, the ability to optimally process multimedia becomes a necessity on mobile devices such as PDAs and smart phones. Applications such as video mail, mapping services, reading PDF files, and graphics-rich games all require high performance graphics and multimedia capabilities. These capabilities enable new applications that benefit from rich images and system performance in ways that were previously unavailable to most handheld users. These mobile devices face the challenge of providing a compelling user experience while reducing overall system energy consumption.
To minimize transmission time and storage requirements, compression is used to efficiently store and transmit digitized images. Compression methods have been described by the Joint Photographic Experts Group (JPEG) for still images, and the Moving Picture Experts Group (MPEG) for moving images. For example, U.S. Pat. No. 5,734,755, entitled “JPEG/MPEG Decoder-Compatible Optimized Thresholding for Image and Video Signal Compression,” shows signal encoding of still images and video sequences using DCT.
The JPEG method involves a discrete cosine transform (DCT), followed by quantization and variable-length encoding. The method requires substantial computation. JPEG compression uses controllable losses to reach high compression rates. Information is transformed to a frequency domain through a DCT. Since neighboring pixels in an image have high likelihood of showing small variations in color, the DCT output groups higher amplitudes in lower spatial frequencies. The higher spatial frequencies can be discarded, generating a high compression rate with only a small perceptible loss in the image quality.
In a conventional forward DCT (FDCT), image data is subdivided into small two-dimensional segments, for example symmetrical 8×8 pixel blocks, and each block is processed through a two-dimensional DCT independently of its neighboring blocks. Conventionally, the FDCT operation is as follows:
$$C_u = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{if } u = 0 \\ 1 & \text{else} \end{cases} \qquad C_v = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{if } v = 0 \\ 1 & \text{else} \end{cases}$$

$$F_{vu} = \frac{1}{4}\, C_u\, C_v \sum_{y=0}^{N-1} \sum_{x=0}^{N-1} S_{yx}\, \cos\left(v\pi\,\frac{2y+1}{2N}\right) \cos\left(u\pi\,\frac{2x+1}{2N}\right)$$
Implementing this formula in hardware or in a hardware/software combination is resource intensive, and the cost grows steeply with the size of the N by N block to be transformed: direct evaluation takes on the order of N^4 multiply-accumulate operations.
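To make that cost concrete, the double sum can be evaluated literally. The following is a minimal Python sketch for illustration only, not an implementation taken from this disclosure. The function name `fdct_2d_direct` is hypothetical, and the fixed 1/4 factor is carried over from the formula as written, which assumes N = 8.

```python
import math


def fdct_2d_direct(block):
    """Evaluate the 2-D FDCT double sum term by term.

    `block` is an N x N list of lists indexed as block[y][x] (S_yx).
    The 1/4 factor follows the formula in the text and assumes N = 8.
    """
    n = len(block)

    def c(i):
        # Normalization term C_u / C_v: 1/sqrt(2) for index 0, else 1.
        return 1.0 / math.sqrt(2.0) if i == 0 else 1.0

    out = [[0.0] * n for _ in range(n)]
    for v in range(n):              # vertical frequency index
        for u in range(n):          # horizontal frequency index
            acc = 0.0
            for y in range(n):
                cos_v = math.cos(v * math.pi * (2 * y + 1) / (2 * n))
                for x in range(n):
                    cos_u = math.cos(u * math.pi * (2 * x + 1) / (2 * n))
                    acc += block[y][x] * cos_v * cos_u
            out[v][u] = 0.25 * c(u) * c(v) * acc
    return out
```

The four nested loops visit N^2 input samples for each of the N^2 output coefficients, which for an 8×8 block is 64 × 64 = 4096 accumulated terms.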
Since the FDCT is a separable transform, a 2-dimensional transform can be computed using a sequence of 1-dimensional transforms. A 2-D transform of an 8×8 block can be accomplished by 16 1-D transforms. First, each row is transformed using a 1-D (8-point) FDCT. Results are stored in consecutive rows of an 8×8 storage array. Then a 1-D transform is applied to each array column. Results are stored in consecutive columns of the output array, which then contains the resulting 2-D transform.
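The row-column procedure can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the names `fdct_1d` and `fdct_2d_row_column` are hypothetical, and the 1-D transform uses the standard 8-point form with the C_k/2 normalization so that two passes reproduce the 1/4 · C_u · C_v factor of the 2-D formula.

```python
import math

N = 8  # block size assumed throughout this sketch


def fdct_1d(vec):
    """8-point 1-D FDCT: Y_k = (C_k / 2) * sum_m x_m * cos((2m + 1) k pi / 16)."""
    out = []
    for k in range(N):
        ck = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        acc = sum(vec[m] * math.cos((2 * m + 1) * k * math.pi / 16.0)
                  for m in range(N))
        out.append(0.5 * ck * acc)
    return out


def fdct_2d_row_column(block):
    """2-D FDCT of an 8x8 block (block[y][x]) using 16 one-dimensional passes."""
    # Pass 1: transform each of the 8 rows; results fill consecutive rows
    # of the intermediate storage array.
    rows = [fdct_1d(row) for row in block]

    # Pass 2: transform each of the 8 columns of the intermediate array;
    # results fill consecutive columns of the output array.
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        col = fdct_1d([rows[y][u] for y in range(N)])
        for v in range(N):
            out[v][u] = col[v]
    return out
```

Each 1-D pass in this straightforward form accumulates 64 terms, so the 16 passes total 1024 accumulated terms, compared with 4096 for term-by-term evaluation of the 2-D double sum on an 8×8 block.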
The operation described above implements the 2-D transform defined by the following matrix formula: $F = D \times P \times D^{T}$, where $D$ is the DCT coefficient matrix, $P$ contains the 8×8 pixel array and $(\cdot)^{T}$ is the matrix transpose operator. Let $D_{k,m}$ be $D$'s entry in row $k$ and column $m$. Then,
$$D_{k,m} = \cos\left(\frac{(2m+1)\cdot k\cdot \pi}{16}\right)$$
When each row $k$ of $D$ also carries the normalization factor $C_k/2$ of the 1-D formula, the matrix $D$ has the unitary property $D \times D^{T} = I$, where $I$ is the unit matrix. Therefore, $D$'s inverse is easily computed as $D^{-1} = D^{T}$. As mentioned above, the 2-D transform can be implemented by a sequence of 1-D transforms. From the previous expressions, the 1-D FDCT formula is given by:
$$Y_k = \frac{C_k}{2} \sum_{m=0}^{7} x_m \cdot \cos\left(\frac{(2m+1)\cdot k\cdot \pi}{16}\right), \qquad 0 \le k \le 7$$

$$C_k = \begin{cases} \dfrac{1}{\sqrt{2}} & \text{if } k = 0 \\ 1 & \text{otherwise} \end{cases}$$

where $x_m$ are elements of the input vector and $Y_k$ are elements of the transform vector.
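The matrix form and the 1-D formula can be tied together in a short numerical check. The sketch below is illustrative Python, not part of the original disclosure. All function names are hypothetical, and it assumes that the factor C_k/2 from the 1-D formula is folded into row k of D, which is what makes D × D^T equal the unit matrix.

```python
import math

N = 8


def dct_matrix():
    """Build D with D[k][m] = (C_k / 2) * cos((2m + 1) k pi / 16).

    Assumption: the C_k / 2 normalization of the 1-D formula is part of D.
    """
    d = []
    for k in range(N):
        ck = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        d.append([0.5 * ck * math.cos((2 * m + 1) * k * math.pi / 16.0)
                  for m in range(N)])
    return d


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def fdct_2d(p):
    """Forward transform F = D x P x D^T."""
    d = dct_matrix()
    return matmul(matmul(d, p), transpose(d))


def idct_2d(f):
    """Inverse transform P = D^T x F x D, valid because D^-1 = D^T."""
    d = dct_matrix()
    return matmul(matmul(transpose(d), f), d)
```

With this normalization, multiplying D by an input column vector gives the Y_k of the 1-D formula, and applying `idct_2d` to the output of `fdct_2d` recovers the pixel block up to floating-point rounding.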
Various methods have been developed for efficient implementation of both 1-D and 2-D FDCT. All those methods attempt to exploit certain symmetries in the FDCT formulas. Many methods focus on reducing the total number of multiplication operations, because these are very expensive to implement in hardware, and can be expensive in software on certain microprocessor architectures. One popular FDCT algorithm was developed by Arai, Agui and Nakajima (hereinafter AAN) in “A Fast DCT-SQ Scheme for Images,” Transactions of the IEICE, vol. E71, no. 11, 1988, pp. 1095-1097, the content of which is hereby incorporated by reference. The main advantages of this algorithm are:
1. A total of 13 multiplications are required.
2. Of those 13, 8 multiplications can be deferred to the quantization process following the FDCT. In practice those 8 operations are completely folded into the quantization operations, so only 5 multiplications remain in the transform itself.
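The split between the 5 in-transform multiplications and the 8 deferred scaling multiplications can be sketched as follows. This Python sketch follows a commonly published form of the AAN factorization and is not necessarily the exact arrangement of FIG. 1. The constant and function names are hypothetical, and no mapping to the coefficients a1 through a5 is implied.

```python
import math

# Multiplier constants of the flowgraph (illustrative names).
C4 = math.cos(4 * math.pi / 16)
C6 = math.cos(6 * math.pi / 16)
C2_MINUS_C6 = math.cos(2 * math.pi / 16) - C6
C2_PLUS_C6 = math.cos(2 * math.pi / 16) + C6

# The 8 deferred scale factors; in a codec they would be folded into the
# quantization step instead of being applied as separate multiplications.
SCALE = [1.0 / (2.0 * math.sqrt(2.0))] + [
    1.0 / (4.0 * math.cos(k * math.pi / 16)) for k in range(1, 8)
]


def aan_fdct_1d_unscaled(x):
    """8-point AAN flowgraph: 5 multiplications, output still unscaled."""
    t0, t7 = x[0] + x[7], x[0] - x[7]
    t1, t6 = x[1] + x[6], x[1] - x[6]
    t2, t5 = x[2] + x[5], x[2] - x[5]
    t3, t4 = x[3] + x[4], x[3] - x[4]

    # Even part.
    e0, e3 = t0 + t3, t0 - t3
    e1, e2 = t1 + t2, t1 - t2
    y0, y4 = e0 + e1, e0 - e1
    z1 = (e2 + e3) * C4              # multiplication 1
    y2, y6 = e3 + z1, e3 - z1

    # Odd part.
    o0, o1, o2 = t4 + t5, t5 + t6, t6 + t7
    z5 = (o0 - o2) * C6              # multiplication 2
    z2 = C2_MINUS_C6 * o0 + z5       # multiplication 3
    z4 = C2_PLUS_C6 * o2 + z5        # multiplication 4
    z3 = o1 * C4                     # multiplication 5
    z11, z13 = t7 + z3, t7 - z3
    y5, y3 = z13 + z2, z13 - z2
    y1, y7 = z11 + z4, z11 - z4
    return [y0, y1, y2, y3, y4, y5, y6, y7]


def aan_fdct_1d(x):
    """Apply the 8 deferred scale factors to obtain the Y_k of the 1-D formula."""
    return [s * y for s, y in zip(SCALE, aan_fdct_1d_unscaled(x))]
```

Keeping `SCALE` separate from the flowgraph mirrors the idea of folding: a quantizer that divides coefficient k by a step size q_k can instead multiply the unscaled output by SCALE[k] / q_k, so the scaling costs no extra multiplication per coefficient.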
FIG. 1 shows a prior art implementation of the AAN fast DCT process 200. As shown in FIG. 1, a vector-matrix multiplication is converted into a sequence of operations that requires fewer memory-consuming operations (such as multiplication) than the original DCT vector-matrix multiplication. The process 200 of FIG. 1 is performed using six computation stages, not counting the final scaling stage between the seventh and eighth columns. The computation stages exist between each column in the DCT process 200, where the columns correspond to clock domains that move the implementation of the AAN DCT algorithm from one computation stage to the next. Variable $X_m$ is an element of the input vector, and $Y_k$ is an element of the transform vector. In this embodiment, five unique coefficients $a_1$ through $a_5$ are used as weights for one or more of the $X_m$ values. The arrows in FIG. 1 represent multiplication by −1. In a hardware implementation, each coefficient requires either a dedicated multiplier or a general-purpose multiplier that allows the use of a different coefficient for each multiply operation.
The two-dimensional transform of an 8×8 pixel block is accomplished by sixteen one-dimensional transforms. First, each row is transformed using a one-dimensional (8-point) DCT. The results are then stored in consecutive rows of an 8×8 storage array. Next, the one-dimensional transform is applied to each array column. The results are stored in consecutive columns of the output array, which then contains the resulting two-dimensional transform. The operations of the AAN DCT process 200 include multiply, add, multiply-accumulate, and move (no-op), as well as accumulate-multiply, in which two inputs are summed and the sum is subsequently fed into a multiplier.
Each computation stage includes eight simple dyadic operations. More specifically, eight add operations are performed in computation stage 1. Two move operations, three add operations, one multiply operation, and two multiply-accumulate operations are performed in computation stage 2. Two move operations, three add operations, one multiply operation, and two multiply-accumulate operations are performed in computation stage 3. Two move operations, two add operations, two multiply operations, and two multiply-accumulate operations are performed in computation stage 4. The accumulate-multiply operations, represented by each of the two pairs of diagonal lines connected to the coefficient $a_5$, demand more memory resources to perform than the other operations. Four move operations and five add operations are performed in computation stage 5. Eight multiply operations are performed in computation stage 6. Further, the multiply operations are not evenly distributed across the computation stages.