The present invention relates to real-time video compression and coding with multiple processors, and more specifically relates to a method for optimum use of the computing power available for the improvement of the performance of such systems.
The most successful techniques for video compression and coding, including the international standards for moving images H.261/3 and MPEG-1/2/4, are based on motion estimation (ME), discrete cosine transform (DCT) and quantisation. FIG. 1 shows the basic procedure by which this is performed. The video input undergoes a series of processes including discrete cosine transform (DCT) 10, quantisation (Q) 11, inverse quantisation (Q⁻¹) 12, inverse discrete cosine transform (IDCT) 13, frame store (FS) 14, interpolation 15, motion estimation (ME) 16, motion compensation (MC) 17, source coding (SC) 18 and buffering 19. The intensive computation involved in these techniques poses challenges to real-time implementation.
Performance with a single processor is limited by silicon technology. Parallel processing is the way to achieve the necessary speedup for video coding with existing algorithms and computing power. However, load balancing among processors is difficult because of the data dependent nature of video processing. Imbalance in load results in a waste of computing power.
The load-balancing problem has been tackled with block-, MB (macroblock)- or GOB (group of blocks)-based approaches due to the inherent data dependency in an image neighbourhood. This is described in, for example, T. Fujii and N. Ohata, "A load balancing technique for video signal processing on a multicomputer type DSP", Proc. IEEE Int'l Conf. Acoustics, Speech, Image Processing, USA, 1988, pp. 1981-1984, and K. Asano et al, "An ASIC approach to a video coding processor", Proc. Int'l Conf. Image Processing and its Applications, London, 1992, pp. 127-130. Such an approach has various disadvantages: (1) the computational complexity for adjoining MBs/GOBs may not be similar, especially when they are coded differently (inter- or intra-) or lie on the border of motion; (2) extra computation and storage are required in GOB-based approaches because source coding (e.g., VLC, SAC) is sequential; (3) larger programs and working memory (very often on-chip memory space) are usually needed in real-time implementation because of the need for dedicated code for each processor; (4) some approaches require measurement of timing for every block's processing, and this is not feasible in real-time implementation; (5) there is no opportunity for dynamic load-balancing (the actual amount of computation depends on input pictures).
In the case of two processors, in order to avoid or at least alleviate the problems (1) to (3) listed above, the encoding process for an MB can be divided. This is described in, for example, W. Lin, K. H. Goh, B. J. Tye, G. A. Powell, T. Ohya and S. Adachi, "Real time H.263 video codec using parallel DSP", Proc. IEEE Int'l Conference Image Processing, USA, 1997, pp. 586-589, also in B. J. Tye, K. H. Goh, W. Lin, G. A. Powell, T. Ohya and S. Adachi, "DSP implementation of very low bit rate videoconferencing system", Proc. Int'l Conf. Information, Communications and Signal Processing, Singapore, 1997, pp. 1275-1278, and in K. H. Goh, W. Lin, B. J. Tye, G. A. Powell, T. Ohya and S. Adachi, "Real time full-duplex H.263 video codec system", Proc. IEEE First Workshop on Multimedia Signal Processing, USA, 1997, pp. 445-450. The division of the encoding process is as follows:
Processor 1 (p1): interpolation, ME, and all auxiliary processing before ME,
referred to as the ME sub-process hereafter for convenience;
Processor 2 (p2): DCT, IDCT, Quant, Dequant, source coding (SC), and all auxiliary processing after SC,
referred to as the DCT-Q-SC sub-process hereafter for convenience.
The ME sub-process in p1 is carried out independently, regardless of the progress in p2, until the end of a frame. The DCT-Q-SC sub-process in p2 is also carried out independently, provided that there are ME-completed MBs waiting for DCT processing. FIG. 2 depicts a status table for a QCIF frame with 99 macroblocks showing the progress of p1 and p2. A frame 20 consists of a series of macroblocks (MBs) 21. In this example, the first twelve MBs 22 have had DCT-Q-SC completed. ME but not DCT has been completed on a further seven MBs 23, so the total number of MBs which have had ME completed is nineteen. The rest of the MBs 24 have not yet had ME completed.
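By way of illustration only, the status table of FIG. 2 can be modelled with two counters, one per processor. The sketch below assumes that MBs are handled in raster order; the class name `FrameProgress` and its method names are hypothetical and do not appear in the cited references.

```python
from dataclasses import dataclass

QCIF_MACROBLOCKS = 99  # a QCIF frame holds 99 macroblocks


@dataclass
class FrameProgress:
    """Progress counters for one frame (assumes raster-order processing)."""
    total: int = QCIF_MACROBLOCKS
    me_done: int = 0   # MBs for which p1 has completed the ME sub-process
    dct_done: int = 0  # MBs for which p2 has completed the DCT-Q-SC sub-process

    def finish_me(self) -> None:
        """p1 advances regardless of p2, until the end of the frame."""
        if self.me_done < self.total:
            self.me_done += 1

    def finish_dct(self) -> bool:
        """p2 advances only if an ME-completed MB is waiting.

        Returns False when p2 has nothing to do and must idle.
        """
        if self.dct_done < self.me_done:
            self.dct_done += 1
            return True
        return False

    def waiting(self) -> int:
        """MBs with ME completed but DCT-Q-SC not yet completed."""
        return self.me_done - self.dct_done

    def untouched(self) -> int:
        """MBs for which ME has not yet been completed."""
        return self.total - self.me_done
```

With nineteen calls to `finish_me` and twelve successful calls to `finish_dct`, `waiting()` gives the seven MBs 23 and `untouched()` gives the remaining MBs 24 of the example.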
In real-time applications, motion search can be performed in an efficient way, such as the 3-stage search or a pruning approach (described in Lin et al, Tye et al and Goh et al mentioned above) to meet the frame-rate requirement. The amount of computation for the DCT-Q-SC sub-process can be controlled in p2 as shown in FIG. 3. When the minimum SAD (sum of absolute differences) for an MB is less than Tdct (a threshold to be determined), the DCT-Q will be skipped. The reasoning for this is that when the sum of grey-level differences for an MB against its match in the previous frame is less than a certain value, it is acceptable to code just the motion vectors, and no DCT coefficients need be processed.
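The skip decision described above can be sketched as follows. This is a minimal illustration, assuming blocks are given as lists of rows of grey-level values; the function names `sad` and `needs_dct` are hypothetical.

```python
from typing import Sequence


def sad(block: Sequence[Sequence[int]], ref: Sequence[Sequence[int]]) -> int:
    """Sum of absolute differences between two equally sized blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block, ref)
        for a, b in zip(row_a, row_b)
    )


def needs_dct(min_sad: int, t_dct: int) -> bool:
    """Return True when DCT-Q should be performed for the MB.

    DCT-Q is skipped when the minimum SAD is less than the threshold,
    in which case only the motion vectors are coded.
    """
    return min_sad >= t_dct
```

A larger Tdct therefore skips DCT-Q for more MBs and reduces the computation in the DCT-Q-SC sub-process; a smaller Tdct has the opposite effect.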
The object of the invention is to provide a simple and yet effective method to achieve load balance dynamically and adaptively, so that either processing time is minimised while reasonable image quality and bit count are maintained, or image quality is optimised while processing time and bit count are maintained.
In general, more than one processor can be used to implement each of the sub-processes for an MB: processing means 1 (P1) for the ME sub-process, and processing means 2 (P2) for the DCT-Q-SC sub-process.
In real-time applications, the processing time and coded bit length are major considerations if image quality is acceptable. The invention provides that when the strategy of ME in P1 is set, the DCT-Q-SC sub-process can be adjusted for optimum use of P2's computing power.
According to the invention, the amount of computation in P2 should be reduced when P1 progresses faster than P2; otherwise the amount of computation in P2 should be increased. The amount of computation in P1 is determined by the ME strategy adopted, and is therefore kept largely constant for similar images, except for some small variation due to the effects of DCT/IDCT/quantisation/dequantisation.
The amount of computation in P2 may be controlled by Tdct. Tdct is therefore decided adaptively for each MB in order to optimise the performance of the system, according to the dynamic load pattern between P1 and P2. Determination of the load pattern should not introduce significant additional computation to the video encoding process. Preferably, therefore, measures of the idle time in P2 and of the progress of ME in P1 may be used.
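One possible per-MB adjustment of Tdct is sketched below for illustration. The text above fixes only the direction of the adjustment; the particular update rule, the function name `update_t_dct`, and the parameters `step`, `t_min`, `t_max` and `backlog_target` are assumptions made for this sketch and are not taken from the description.

```python
def update_t_dct(
    t_dct: int,
    backlog: int,
    p2_idle: bool,
    step: int = 32,           # assumed adjustment step
    t_min: int = 0,           # assumed lower bound on the threshold
    t_max: int = 4096,        # assumed upper bound on the threshold
    backlog_target: int = 4,  # assumed tolerable number of waiting MBs
) -> int:
    """Return an adjusted Tdct for the next MB (illustrative rule only).

    backlog: number of ME-completed MBs waiting for DCT-Q-SC,
             i.e. a measure of the progress of ME in P1 relative to P2.
    p2_idle: True if P2 had to wait for P1 since the last update.
    """
    if p2_idle or backlog == 0:
        # P2 has spare capacity: lower the threshold so that fewer MBs
        # skip DCT-Q, increasing the computation in P2.
        t_dct -= step
    elif backlog > backlog_target:
        # P1 progresses faster than P2: raise the threshold so that more
        # MBs skip DCT-Q, reducing the computation in P2.
        t_dct += step
    # Keep the threshold within the assumed bounds.
    return max(t_min, min(t_max, t_dct))
```

The rule uses only a counter difference and an idle flag, so the additional computation per MB is a few comparisons and one addition.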
The invented method balances the workload for video encoding with multiple processors according to the dynamic load pattern, rather than dividing the task according to predetermined statistical information. The control is therefore more effective, aiming at optimum use of the computing resources for best performance. A further advantage is that the additional computation brought to the video encoder by this load balancing method is negligible.