Described below are the problems which exist in the case of encoding/decoding a video sequence formed of successive images, and with which the inventors of the present patent application were confronted. The disclosure is, of course, not limited to this particular case of application, but is of interest for any image encoding/decoding technique having to confront close or similar problems.
Designing a video encoder which is both real time and of high quality is a true technological challenge, in particular in the case of high-resolution videos (e.g., SD (“Standard Definition”), HD (“High Definition”)). As a matter of fact, video encoding is a particularly complex application.
It appears that processing all of the blocks of an image via a single processor is not optimal in terms of computing time. In order to bring together the necessary computing power, use is therefore often made of parallelization: several processing units operating simultaneously on various portions of the video. The computing time can theoretically be divided by the number of processing units implemented.
A first known technique for parallelizing a video encoder consists in limiting the spatial and temporal dependencies. Thus, the H.264/AVC standard (ITU-T H.264, ISO/IEC 14496-10) enables the images to be cut up into separate slices. The slices of a single image can be encoded in parallel, each slice being processed by a separate processing unit (processor). It then suffices to concatenate the bit streams resulting from the processing of the various slices.
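By way of illustration only, the slice-based parallelization described above can be sketched as follows. This is a minimal sketch and not an H.264/AVC encoder: the function names are hypothetical, the image is modeled as a list of byte rows, and `zlib.compress` merely stands in for a real slice encoder. The property it shows is that each slice is encoded from its own rows alone, so that the resulting bit streams can simply be concatenated.

```python
from concurrent.futures import ThreadPoolExecutor
import zlib


def split_into_slices(rows, num_slices):
    """Split a list of image rows into num_slices contiguous groups."""
    base, extra = divmod(len(rows), num_slices)
    slices, start = [], 0
    for i in range(num_slices):
        size = base + (1 if i < extra else 0)
        slices.append(rows[start:start + size])
        start += size
    return slices


def encode_slice(slice_rows):
    """Hypothetical stand-in for a slice encoder (assumption: zlib).

    It sees only the rows of its own slice, so it cannot exploit any
    correlation with the neighboring slices.
    """
    return zlib.compress(b"".join(slice_rows))


def encode_image_in_slices(rows, num_slices):
    """Encode every slice on its own worker, then concatenate the streams."""
    slices = split_into_slices(rows, num_slices)
    with ThreadPoolExecutor(max_workers=num_slices) as pool:
        streams = list(pool.map(encode_slice, slices))  # order is preserved
    return b"".join(streams), streams
```

Because no slice reads data belonging to another slice, no synchronization between the workers is needed in this sketch.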
This first known technique has the major disadvantage of limiting the encoder performance in terms of compression/quality (loss of compression efficiency). As a matter of fact, besides the overhead of the syntax elements relating to the slices, cutting into slices prohibits the use of inter-slice spatial correlation. Yet the purpose of the spatial and temporal dependencies is precisely to make the best use of the correlations present in the video source. This is what makes it possible to maximize the compression efficiency of the video encoder. Recent video compression formats (H.264/AVC, MPEG-4 ASP, H.263) introduce strong spatial and temporal dependencies in video processing. The images are generally cut up into blocks of 16×16 pixels (macro-blocks). The processing of these blocks is sequential by nature, insofar as the processing of each block requires knowledge of the result of the processing of the neighboring blocks. In the same way, the images can conventionally be temporally encoded according to three different modes, I, P or B. The encoding of an image B requires knowledge of at least two previously encoded images I or P. The encoding of an image P requires knowledge of at least one previously encoded image I or P.
A second known technique for parallelizing a video encoder is described in the patent application published under the number WO 2004/100557, and filed by Envivio. This involves a spatial parallelization method for processing blocks on N processors, making it possible to preserve the dependencies required by video compression standards. The general principle consists in cutting the image up into bands which are perpendicular to the sequential block processing direction. This makes it possible to obtain an optimal distribution of the loads between processors. For example, if the processing of the macro-blocks is carried out sequentially, line-by-line, the image is separated into vertical bands. In addition, synchronization of the processing carried out by the N processors makes it possible to prevent one processor from attempting to process a given block while other blocks on which this block depends have not yet been processed.
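The synchronization principle described above can be illustrated by the following minimal sketch. It does not reproduce the method of WO 2004/100557; all names are hypothetical, and the dependency model is a simplifying assumption: a band may process a given line of blocks only once the band to its left has finished the same line. The dependency on the line above within the same band is satisfied by the sequential order of each worker, and dependencies on the band to the right are ignored here.

```python
import threading


def run_band_workers(num_bands, num_rows):
    """Process an image cut into vertical bands, one worker per band.

    Returns the order in which the (band, row) units were completed.
    """
    rows_done = [0] * num_bands   # number of rows finished by each band
    cond = threading.Condition()
    log = []

    def worker(band):
        for row in range(num_rows):
            if band > 0:
                with cond:
                    # Assumed rule: wait until the left band has finished
                    # this row before starting it in the current band.
                    cond.wait_for(lambda: rows_done[band - 1] > row)
            # ... the blocks of (band, row) would be processed here ...
            with cond:
                log.append((band, row))
                rows_done[band] = row + 1
                cond.notify_all()

    threads = [threading.Thread(target=worker, args=(b,))
               for b in range(num_bands)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In such a scheme the workers advance in a staggered, diagonal pattern: band 0 starts alone, band 1 follows one row behind, and so on.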
This second known technique is effective, but can turn out to be insufficient, for several reasons:
the number N of processors is limited. As a matter of fact, it is not possible to use more than W processors, with W being the number of blocks per line of the image (i.e., the width of the image in blocks) in the case of line-by-line processing, with the image being cut up into vertical bands. Furthermore, the larger the number of processors used, the less efficient the parallelism, due to the initialization and termination phases during which the processors are not all used;
designing a machine with many processors is complicated and costly;
even with multiprocessor platforms comprising many processors, the total computing power is limited and can turn out to be insufficient. It is necessary to have more power in order to improve the compression performance of real-time video encoders.
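The loss of efficiency caused by the initialization and termination phases can be estimated with a rough model. The following sketch rests on assumptions that are not taken from the cited application: each band processes one line of blocks per time step, all bands take the same time, and a band can start a line only one step after the band to its left. The function names are hypothetical.

```python
def wavefront_steps(num_bands, num_rows):
    """Time steps needed when each band starts one step after its left neighbor."""
    return num_rows + num_bands - 1


def parallel_efficiency(num_bands, num_rows):
    """Fraction of the available processor time spent on useful work."""
    useful = num_bands * num_rows                      # band-row units to process
    available = num_bands * wavefront_steps(num_bands, num_rows)
    return useful / available
```

Under these assumptions, for an image 68 macro-blocks high, the efficiency is about 0.96 with 4 bands and falls to about 0.82 with 16 bands, which illustrates why adding processors yields diminishing returns.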
It is conventional practice to make use of coprocessors in order to increase the processing capacity of the processors. In general, each processor is assigned one coprocessor. Processors and coprocessors are generally differentiated not by their technical nature (CPU, FPGA, ASIC, DSP, etc.), but by their role within the system. The processor has a master processing unit role; it is responsible for the overall control of the application, as well as for a certain number of decision-making and computing tasks. The coprocessor has a slave processing unit role; it is used by the processor for the more complex computations. It should be noted that, in a processor/coprocessor configuration such as this, the communication of data between processor and coprocessor can take a considerable amount of time, which has an adverse effect on the overall performance of the device.
In actual practice, the so-called “generic” processors often enable all sorts of computations to be made, have rapid memory access and are efficient in the case of “jumps” (“if” instructions, loops). On the other hand, they are not necessarily the most powerful. Coprocessors, e.g., DSP or FPGA, are better suited to intensive computation. However, they are more hampered by jumps and do not have the same storage capacities.
However, within the context of the aforesaid second known technique for parallelizing a video encoder, the combined use of processors and coprocessors is neither easy nor problem-free.
As a matter of fact, the basic solution consisting of assigning one coprocessor to each processor (and of therefore using a number N of processors equal to the number M of coprocessors) is not optimal. In order for such a solution to be effective, it would be necessary to ensure that the coprocessors are correctly dimensioned with regard to the required processing, which is unfortunately difficult, or even impossible, in actual practice. If the coprocessors have an insufficient amount of power, the system will obviously not be capable of operating. If, on the other hand, the coprocessors are too powerful, they will be under-exploited and the additional cost related to the implementation of these more powerful coprocessors will be unnecessary.
Therefore, it would be appropriate to adopt a more complex solution, wherein the number N of processors would be different from the number M of coprocessors (i.e., N≠M, with N>0 and M>0). For example, by seeking to develop products based on generic processors and FPGA type coprocessors, the inventors of the present application were confronted with a significant gap between the processing capacities of the processors and coprocessors. Furthermore, it is costly and complex to implement a coprocessor for each processor. In this regard, it would be more advantageous to use a single very powerful coprocessor for several processors, for example a single FPGA coprocessor for four processors.
However, nothing in the prior art indicates how to manage the parallelization and synchronization of processing operations in such a context.