Continuing breakthroughs in high-density nano-fabrication technology and system-on-chip (SoC) design enable a single chip to accommodate a plurality of processing units. Demand for consumer digital products is the focus of the electronics industry. Popular products include hand-held devices, such as mobile phones, digital cameras, and portable media players (PMP), and home theater equipment, such as LCD TV, DVD, PVR, and RG. The types of data that must be processed on these electronic devices are increasing, and include video, audio, and text.
In response to these complex processing demands, the multi-core platform is considered a promising solution. The multi-core platform uses not only a microprocessor with a reduced instruction set computing (RISC) based micro processing unit (MPU), such as ARM, MIPS, or PowerPC, but also a digital signal processing (DSP) unit for processing signals.
Each of these processing units can itself be a multi-core platform, such as a multi-core RISC-based network processor from Broadcom, Freescale, or PMC-Sierra, or a multimedia processor combining RISC, DSP, and even a reconfigurable accelerator, such as OMAP (TI), i.Smart (Freescale), Vision (Agere), and PAC (ITRI).
To meet the ever-growing demand for multimedia applications, the dual-core processor with a RISC-based MPU and a DSP is gaining popularity. The RISC-based microprocessor, such as ARM, is responsible for the operating system (OS), the man-machine interface (MMI), and other routine tasks, while the DSP executes complex mathematical computations, such as audio coding/decoding and video decoding.
In other words, the RISC-based microprocessor of the dual-core platform performs tasks different from those of the DSP. A RISC-enhanced DSP may be powerful in signal processing, but not in general RISC processing. The DSP is optimized for real-time signal processing, for which it may require less power and computing cost than a RISC processor. In addition, the pipeline of the DSP, although it can perform complex signal processing efficiently, is not suitable for simple control. Therefore, the DSP is not efficient as a general-purpose control processor.
Multimedia applications on portable devices, such as PDAs and smart phones, are common. As portable devices are battery-powered, it is important to prolong the battery life. However, video signal processing is usually complex and consumes a large amount of power. With the advanced video compression standard H.264/AVC (advanced video coding), the computing complexity of reconstructing a frame varies widely from frame to frame. FIG. 1 shows the number of cycles required for decoding a QCIF image: the minimum is 1,020,140 cycles, the maximum is 4,002,744 cycles, the average is 2,446,444 cycles, and the standard deviation is as high as 710,647 cycles.
In general, the microprocessor is designed for the worst-case scenario. Therefore, the microprocessor usually has a large amount of idle time. When the microprocessor is idle, the operating voltage or frequency can be reduced to save power.
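The amount of idle time implied by the FIG. 1 figures can be estimated with a short calculation. The following is only a sketch: the frame rate of 30 frames per second and the helper names are assumptions made for illustration and do not come from the text above.

```python
# Cycle counts quoted above for decoding one QCIF frame (FIG. 1).
MIN_CYCLES = 1_020_140
AVG_CYCLES = 2_446_444
MAX_CYCLES = 4_002_744

FPS = 30.0  # assumed frame rate; the text does not state one


def required_frequency_hz(cycles_per_frame: int, frames_per_second: float) -> float:
    """Lowest clock frequency that finishes one frame within its frame period."""
    return cycles_per_frame * frames_per_second


def idle_fraction(cycles_per_frame: int, clock_hz: float, frames_per_second: float) -> float:
    """Fraction of a frame period during which the processor has nothing to decode."""
    cycle_budget = clock_hz / frames_per_second  # cycles available per frame
    return 1.0 - cycles_per_frame / cycle_budget


# A processor provisioned for the worst case runs at this clock all the time.
worst_case_hz = required_frequency_hz(MAX_CYCLES, FPS)

print(f"worst-case clock: {worst_case_hz / 1e6:.1f} MHz")
print(f"idle on an average frame: {idle_fraction(AVG_CYCLES, worst_case_hz, FPS):.1%}")
print(f"idle on the lightest frame: {idle_fraction(MIN_CYCLES, worst_case_hz, FPS):.1%}")
```

Under these assumptions, a clock sized for the heaviest frame is idle for roughly 39 percent of an average frame period and roughly 75 percent of the lightest one, which is the slack that voltage or frequency scaling can exploit.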
FIG. 2 shows a decoding flowchart 200 of H.264/AVC. The coded bitstream, after entropy decoding, yields two types of data. The first type is the syntax elements, including block file header data, motion vectors, and so on, and the second type is the quantized residual coefficients.
H.264 uses exponential-Golomb (Exp-Golomb) codes to decode the first type of data, and uses context-adaptive variable length coding (CAVLC) to decode the second type of data. CAVLC decoding includes the following steps 101-106, with each step using a different code table.
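The variable-length code that H.264 applies to syntax elements is the exponential-Golomb (Exp-Golomb) code. As an illustration only, the sketch below decodes the unsigned and signed forms from a string of '0' and '1' characters; representing the bitstream as a text string and the function names `read_ue` and `read_se` are assumptions chosen for readability, not part of any decoder described here.

```python
def read_ue(bits: str, pos: int = 0) -> tuple[int, int]:
    """Decode one unsigned Exp-Golomb code word.

    A code word is N leading zeros, a '1' marker, then N further bits.
    Returns (value, index of the next unread bit).
    """
    leading_zeros = 0
    while pos + leading_zeros < len(bits) and bits[pos + leading_zeros] == "0":
        leading_zeros += 1
    start = pos + leading_zeros          # index of the '1' marker
    end = start + leading_zeros + 1      # one past the last bit of the code word
    if end > len(bits):
        raise ValueError("truncated Exp-Golomb code word")
    code_num = int(bits[start:end], 2) - 1
    return code_num, end


def read_se(bits: str, pos: int = 0) -> tuple[int, int]:
    """Decode one signed Exp-Golomb code word (mapping 0, +1, -1, +2, -2, ...)."""
    code_num, next_pos = read_ue(bits, pos)
    magnitude = (code_num + 1) // 2
    value = magnitude if code_num % 2 == 1 else -magnitude
    return value, next_pos


# Decode three consecutive unsigned code words: '1', '010', '00100'.
stream = "1" + "010" + "00100"
position = 0
decoded = []
while position < len(stream):
    value, position = read_ue(stream, position)
    decoded.append(value)
print(decoded)
```

Short code words are assigned to small values, which suits syntax elements such as motion vector differences that are usually close to zero.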
Step 101: decoding the total number of non-zero coefficients TC and the number of trailing ±1 coefficients T1s. The range of TC is 0-16, and the range of T1s is 0-3. This step selects the lookup table based on nC, where nC is the average of the numbers of non-zero coefficients in the already-decoded blocks above and to the left of the current block.
Step 102: based on T1s, decoding the sign of each trailing ±1 coefficient, with 0 representing the positive sign and 1 representing the negative sign.
Step 103: based on TC, decoding the levels of the non-zero coefficients. The lookup table used in this step is determined by the previously decoded non-zero coefficient.
Step 104: decoding the total number of zeros preceding the last non-zero coefficient in scan order. The lookup table used in this step is determined based on TC.
Step 105: decoding the number of zeros preceding each non-zero coefficient. The lookup table used in this step is determined by the number of zeros preceding the non-zero coefficient.
Step 106: recovering the 16 zig-zag sequenced coefficients, based on the values decoded in the previous steps.
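Two of these steps lend themselves to a compact illustration: the table selection of step 101 and the reassembly of step 106. The sketch below assumes the nC thresholds (2, 4, 8) and the rounded average used by the H.264 standard, and it assumes the levels and runs have already been decoded by steps 102-105; the function names and argument layout are hypothetical.

```python
def select_coeff_token_table(n_upper: int, n_left: int) -> int:
    """Step 101: choose one of four tables from the rounded average nC."""
    n_c = (n_upper + n_left + 1) >> 1
    if n_c < 2:
        return 0
    if n_c < 4:
        return 1
    if n_c < 8:
        return 2
    return 3  # fixed-length table for busy neighbourhoods


def rebuild_zigzag(levels, total_zeros, runs_before, size=16):
    """Step 106: place the decoded levels back into zig-zag scan order.

    levels       non-zero values, highest frequency first (decoding order)
    total_zeros  zeros preceding the last non-zero coefficient (step 104)
    runs_before  zeros directly preceding each level (step 105); the run of
                 the final, lowest-frequency level is implied and not listed
    """
    coeffs = [0] * size
    # Scan index of the highest-frequency non-zero coefficient.
    pos = len(levels) + total_zeros - 1
    for i, level in enumerate(levels):
        coeffs[pos] = level
        run = runs_before[i] if i < len(runs_before) else 0
        pos -= run + 1  # skip the zeros, then move to the next coefficient slot
    return coeffs


# Example block in scan order: 0, 3, 0, 1, -1, -1, 0, 1, 0, ...
# TC = 5, T1s = 3, total_zeros = 3.
print(rebuild_zigzag([1, -1, -1, 1, 3], total_zeros=3, runs_before=[1, 0, 0, 1]))
```

Because the levels arrive from the highest frequency downward, the reassembly walks the scan backwards and needs only the total zero count to find its starting position.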
FIG. 3 shows the quantized residual coefficients generated after entropy decoding 201. The quantized residual coefficients include the coefficients of 27 small blocks. With the exception of the 16th and 17th small blocks 302, which are 2×2, all of the small blocks are 4×4. Also, the −1th small block 301 is generated by the entropy decoding only in the Intra_16×16 coding mode.
In the H.264/AVC decoding process, inverse quantization 202 multiplies the quantized residual coefficient matrix by the corresponding quantization matrix. The computation equations are shown in FIGS. 4A-4E, where matrix [cij] is the quantized residual coefficient matrix, S is determined by the quantization parameter QP divided by 6, and T is the matrix after the inverse quantization, called the transform residual coefficient matrix.
Inverse quantization 202 performs the 4×4 inverse quantization computing on the −1th small block 301, performs the 2×2 DC inverse quantization computing on the 16th and 17th small blocks 302, and performs the 4×4 DC inverse quantization computing on the rest of the small blocks 303.
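For orientation, the rescaling of an ordinary 4×4 residual block can be sketched as follows. This uses the scaling factors and the QP/6 shift of the H.264 standard with flat scaling, which is an assumption: FIGS. 4A-4E are not reproduced here and may present the computation differently, and the DC variants are omitted. The function name is hypothetical.

```python
# Scaling factors from the H.264 standard, one row per value of QP % 6.
# Column 0: positions with both indices even; column 1: both odd; column 2: the rest.
V = [
    [10, 16, 13],
    [11, 18, 14],
    [13, 20, 16],
    [14, 23, 18],
    [16, 25, 20],
    [18, 29, 23],
]


def dequantize_4x4(c, qp):
    """Rescale a 4x4 block of quantized coefficients c for quantization parameter qp."""
    factors = V[qp % 6]
    step_multiplier = 1 << (qp // 6)  # the scale doubles every 6 steps of QP
    out = []
    for i in range(4):
        row = []
        for j in range(4):
            if i % 2 == 0 and j % 2 == 0:
                v = factors[0]
            elif i % 2 == 1 and j % 2 == 1:
                v = factors[1]
            else:
                v = factors[2]
            row.append(c[i][j] * v * step_multiplier)
        out.append(row)
    return out


block = [[1, 2, 0, 0],
         [0, -1, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
print(dequantize_4x4(block, qp=28))
```

Splitting QP into a remainder (table row) and a quotient (power of two) keeps the table at six rows while covering the full QP range.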
The transform residual coefficients after the inverse quantization are also arranged as shown in FIG. 3. The transform residual coefficients include the coefficients of 27 small blocks. With the exception of the 16th and 17th small blocks 302, which are 2×2, all of the small blocks are 4×4. Also, the −1th small block 301 is present only in the Intra_16×16 coding mode.
In the H.264/AVC decoding process, the computing equations of inverse transform 203 are shown in FIGS. 5A-5C, where matrix [yij] is the transform residual coefficient matrix, and X is the residual coefficient matrix. Inverse transform 203 performs the 4×4 inverse transform computing on the −1th small block 301, performs the 2×2 DC inverse transform computing on the 16th and 17th small blocks 302, and performs the 4×4 DC inverse transform computing on the rest of the small blocks 303.
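As a rough illustration of what these transforms compute, the sketch below follows the integer 4×4 inverse transform and the 2×2 DC transform as defined in the H.264 standard. It is not taken from FIGS. 5A-5C, which are not reproduced here, and the function names are hypothetical.

```python
def _butterfly(w0, w1, w2, w3):
    """One-dimensional 4-point inverse transform using only adds and shifts."""
    e0 = w0 + w2
    e1 = w0 - w2
    e2 = (w1 >> 1) - w3
    e3 = w1 + (w3 >> 1)
    return [e0 + e3, e1 + e2, e1 - e2, e0 - e3]


def inverse_transform_4x4(w):
    """Rows first, then columns, then rounding and division by 64."""
    rows = [_butterfly(*row) for row in w]
    # cols[c][r] is the value at row r, column c after the vertical pass.
    cols = [_butterfly(*(rows[r][c] for r in range(4))) for c in range(4)]
    return [[(cols[c][r] + 32) >> 6 for c in range(4)] for r in range(4)]


def inverse_dc_2x2(c):
    """2x2 transform [[1, 1], [1, -1]] * c * [[1, 1], [1, -1]] for chroma DC."""
    (a, b), (d, e) = c
    return [[a + b + d + e, a - b + d - e],
            [a + b - d - e, a - b - d + e]]


dc_only = [[64, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(inverse_transform_4x4(dc_only))   # a lone DC term spreads evenly over the block
print(inverse_dc_2x2([[1, 2], [3, 4]]))
```

The absence of multiplications in the butterfly is the reason this transform is inexpensive on both RISC and DSP cores.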
In the H.264/AVC decoding process, motion compensation (MC) 204 sums the inverse transform output and the predictor found by intra-frame prediction 207 or inter-frame prediction 208.
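The summation can be sketched in a few lines. The clipping to the valid sample range is an assumption borrowed from common decoder practice for 8-bit video; it is not stated in the text, and the function name is hypothetical.

```python
def reconstruct_block(residual, predictor, bit_depth=8):
    """Add the residual to the predictor and clip each sample to [0, 2^bit_depth - 1]."""
    max_value = (1 << bit_depth) - 1
    return [
        [min(max(r + p, 0), max_value) for r, p in zip(res_row, pred_row)]
        for res_row, pred_row in zip(residual, predictor)
    ]


residual = [[5, -3], [300, -300]]
predictor = [[100, 100], [100, 100]]
print(reconstruct_block(residual, predictor))
```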
Intra-frame prediction 207 provides intra_4×4 and intra_16×16 types. Intra_4×4 finds the predictor using the luma 4×4 small block as the unit, with 9 prediction directions available. Intra_16×16 is similar to intra_4×4, but uses the luma 16×16 block as the unit and has 4 prediction directions. The intra-frame prediction technique also provides 4 intra-frame prediction directions for chroma, using the chroma 8×8 block as the unit.
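To make the idea of a prediction direction concrete, the following sketch builds a 4×4 luma predictor for three of the nine modes (vertical, horizontal, and DC) from the reconstructed samples above and to the left of the block. It assumes both neighbours are available; the mode names passed as strings and the function name are hypothetical, and the remaining six diagonal modes are omitted.

```python
def intra4x4_predict(mode, top, left):
    """Build a 4x4 predictor from the 4 samples above (top) and the 4 to the left (left)."""
    if mode == "vertical":
        # Each column repeats the sample directly above it.
        return [list(top) for _ in range(4)]
    if mode == "horizontal":
        # Each row repeats the sample directly to its left.
        return [[left[r]] * 4 for r in range(4)]
    if mode == "dc":
        # Every sample is the rounded mean of the eight neighbours.
        dc = (sum(top) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError(f"mode not covered by this sketch: {mode}")


top = [10, 20, 30, 40]
left = [1, 2, 3, 4]
for mode in ("vertical", "horizontal", "dc"):
    print(mode, intra4x4_predict(mode, top, left))
```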
Inter-frame prediction 208 uses the motion vector 206 to generate the prediction block in the reference frame. The unit of motion vector 206 can be an integer pixel, ½ pixel, or ¼ pixel. As the ½-pixel and ¼-pixel information is not recorded when the frame is stored, the ½-pixel and ¼-pixel values must be computed from the integer pixels.
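The computation of the fractional positions can be illustrated in one dimension. The sketch below assumes the 6-tap filter (1, −5, 20, 20, −5, 1) and the rounded averaging that the H.264 standard specifies for luma, applied to 8-bit samples along a single row; the function names are hypothetical.

```python
def clip_pixel(value):
    """Clamp to the 8-bit sample range."""
    return max(0, min(255, value))


def half_sample(row, i):
    """Value midway between row[i] and row[i + 1], from the six surrounding samples."""
    if i < 2 or i + 3 >= len(row):
        raise IndexError("needs two samples to the left and three to the right of i")
    e, f, g, h, k, m = row[i - 2:i + 4]
    return clip_pixel((e - 5 * f + 20 * g + 20 * h - 5 * k + m + 16) >> 5)


def quarter_sample(row, i):
    """Value a quarter of the way from row[i] to row[i + 1]: rounded mean of the
    integer sample and the neighbouring half sample."""
    return (row[i] + half_sample(row, i) + 1) >> 1


edge = [0, 0, 0, 255, 255, 255]
print(half_sample(edge, 2), quarter_sample(edge, 2))
```

Because every fractional motion vector triggers this filtering, the interpolation cost is one source of the frame-to-frame variation in decoding cycles.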
The type of intra-frame prediction 207 can be obtained from the first type of data after entropy decoding, and the motion vector of inter-frame prediction 208 can be computed from the first type of data after entropy decoding.
In the H.264/AVC decoding process, the operation of deblocking filter 205 is shown in FIG. 6A. The four vertical boundary lines a-d and the four horizontal boundary lines e-h divide a 16×16 luma block into sixteen 4×4 luma sub-blocks. Similarly, two vertical boundary lines i, j and two horizontal boundary lines k, l divide an 8×8 chroma block into four 4×4 chroma sub-blocks, as shown in FIG. 6B.
When executing deblocking filtering on luma blocks, the four vertical boundary lines a-d are processed first, followed by the four horizontal boundary lines e-h. Similarly, when executing deblocking filtering on chroma blocks, the two vertical boundary lines i, j are processed first, followed by the two horizontal boundary lines k, l.
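The processing order can be written down as a small helper. This is only a sketch of the ordering described above; the edge labels follow FIGS. 6A and 6B, while the function name and the tuple layout are hypothetical.

```python
def deblocking_edge_order(component):
    """Return the boundary lines in the order they are filtered: vertical, then horizontal."""
    if component == "luma":
        vertical, horizontal = "abcd", "efgh"
    elif component == "chroma":
        vertical, horizontal = "ij", "kl"
    else:
        raise ValueError(f"unknown component: {component}")
    return ([("vertical", edge) for edge in vertical]
            + [("horizontal", edge) for edge in horizontal])


for component in ("luma", "chroma"):
    print(component, [edge for _, edge in deblocking_edge_order(component)])
```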
During the deblocking filtering, the boundary strength (BS) is used to determine whether the filtering is required. When BS=1, 2, 3, 4, the filtering is performed. When BS=0, no filtering is performed. The BS is determined by the conditions in FIG. 7. In summary, the conventional decoding process is shown in FIG. 8.
U.S. Pat. No. 6,944,229 disclosed two methods of dynamically adjusting the voltage and frequency of the processor. The first method is DVS-DM, and the second is DVS-PD. DVS-DM uses the previous load record to adjust the voltage and frequency. The decoding time is categorized into a delay state and a drop state. The delay state implies that the CPU has sufficient time to decode; the greater the delay, the more time the CPU has for decoding, and a delay of zero implies that the CPU has just enough time to decode. The drop state implies that the CPU has no time to decode and must drop the current frame. When decoding I-type and P-type frames, the voltage and frequency must be adjusted to the highest level, and when decoding a B-type frame, the voltage and frequency are tuned high. In the delay state, when the delay is greater than 100, the voltage and frequency are tuned down.
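One possible reading of the DVS-DM rules is sketched below. Several details are assumptions made only for illustration: the list of operating points, the interpretation of "tuned high" and "tuned down" as a single step up or down, and the precedence given to the delay test for B-type frames. None of these is confirmed by the description above, and the function name is hypothetical.

```python
# Hypothetical operating points in MHz, lowest to highest.
LEVELS_MHZ = [100, 200, 300, 400]


def dvs_dm_next_level(level, frame_type, delay):
    """Pick the index into LEVELS_MHZ to use for the next frame.

    level       current index into LEVELS_MHZ
    frame_type  "I", "P", or "B"
    delay       slack measured for the previous frame (units as in the text)
    """
    top = len(LEVELS_MHZ) - 1
    if frame_type in ("I", "P"):
        return top                 # I- and P-type frames always run at the highest level
    if delay > 100:
        return max(level - 1, 0)   # ample slack: tune down (assumed: one step)
    return min(level + 1, top)     # B-type frame without ample slack: tune up (assumed: one step)


level = 2
for frame_type, delay in [("I", 0), ("B", 150), ("B", 150), ("B", 20), ("P", 500)]:
    level = dvs_dm_next_level(level, frame_type, delay)
    print(frame_type, delay, LEVELS_MHZ[level])
```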
DVS-PD uses the previous load record and an estimate of the decoding time to adjust the voltage and frequency. Because the times required for decoding I-type, P-type, and B-type frames differ, the decoding time can be estimated from the frame type and the load record of frames of the same type. The voltage and frequency can then be tuned in a manner similar to DVS-DM.
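The per-type estimation can be sketched as a moving average over the most recent frames of the same type. The window length, the use of a plain average, and the rule for mapping the estimate to a frequency are assumptions chosen for illustration rather than the method of the cited patent; all class and function names are hypothetical.

```python
from collections import defaultdict, deque


class DecodeTimePredictor:
    """Keeps a short load record per frame type and averages it."""

    def __init__(self, history=4):
        self._records = defaultdict(lambda: deque(maxlen=history))

    def record(self, frame_type, cycles):
        """Store the measured cost of a frame that has just been decoded."""
        self._records[frame_type].append(cycles)

    def estimate(self, frame_type, default):
        """Predict the cost of the next frame of this type; fall back to default."""
        record = self._records.get(frame_type)
        if not record:
            return default
        return sum(record) / len(record)


def lowest_sufficient_mhz(estimated_cycles, frame_period_s, levels_mhz):
    """Lowest frequency whose cycle budget per frame covers the estimate."""
    for mhz in sorted(levels_mhz):
        if mhz * 1e6 * frame_period_s >= estimated_cycles:
            return mhz
    return max(levels_mhz)  # nothing suffices: run as fast as possible


predictor = DecodeTimePredictor()
predictor.record("B", 1_000_000)
predictor.record("B", 2_000_000)
estimate = predictor.estimate("B", default=4_002_744)
print(estimate, lowest_sufficient_mhz(estimate, 1 / 30, [50, 100, 200]))
```

Keeping a separate record per frame type prevents an expensive I-type frame from inflating the estimate for the B-type frames that follow it.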