The present invention relates to a method of analyzing data to schedule processing of the data for more efficient use of codec processing resources.
Modern video coders use a hybrid approach of prediction and transform coding to reduce the bandwidth of coded signals. For processing purposes, a coded picture is divided into smaller units referred to as “macroblocks”, the fundamental coding units. On the pixel level, there are two forms of prediction in video coding: temporal and spatial. In spatial prediction, pixels of already reconstructed blocks in the current picture are employed in directional extrapolation and/or averaging in order to predict the block currently being processed. In temporal prediction, previous pictures may serve as “reference pictures” and be used to predict pixel blocks (macroblocks or smaller units) in the current picture. Temporal prediction can be described by a motion vector (a displacement relative to the reference picture), a reference picture and/or prediction weights. Motion vectors may themselves be predicted. When a picture is marked as a reference picture, the decoder stores it, after reconstruction, in a reference picture buffer for prediction of future pictures. The encoder prediction loop contains a decoder, replicating the decoder-side behavior at the encoder. After prediction, the prediction residuals are transformed (typically for energy compaction), quantized, and converted from 2D data into 1D data via a scanning order. The resulting data is then written to the bitstream via an entropy coding method. The prediction loops and the bitstream as outlined above serialize operations, making it difficult to execute them in parallel. Further, for compression efficiency, pictures may be encoded out of display order, which results in additional delay when the encoder or decoder has to wait for full reconstruction of a reference picture. A number of techniques for mitigating this problem using concurrent processing approaches (i.e., “multi-threading”) are known.
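By way of illustration only, the quantization and 2D-to-1D scanning steps described above can be sketched as follows. The sketch assumes a 4x4 block of already transformed residual coefficients, a uniform quantizer that truncates toward zero, and a zigzag scanning order; the function names, the block values and the step size are hypothetical, and actual codecs use standard-specific scaling, rounding and scan tables.

```python
def quantize(block, step):
    """Uniform quantizer (illustrative assumption): divide by the step size
    and truncate toward zero."""
    return [[int(v / step) for v in row] for row in block]


def zigzag_order(n):
    """Return the (row, col) positions of an n x n block in zigzag order."""
    # Positions are visited one anti-diagonal at a time (row + col constant).
    # Odd diagonals run top-right to bottom-left (row ascending); even
    # diagonals run bottom-left to top-right (col ascending).
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )


def scan(block):
    """Convert a square 2D block into a 1D list using the zigzag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


# Hypothetical transformed residuals: energy is compacted toward the top-left.
coefficients = [
    [52, 12, 3, 0],
    [10, 4, 0, 0],
    [3, 0, 0, 0],
    [0, 0, 0, 0],
]
levels = scan(quantize(coefficients, step=4))
```

With these example values, the non-zero levels are gathered at the start of the 1D sequence and followed by a run of zeros, which is the property an entropy coder can exploit.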
Encoder, transcoder (a special form of encoder that takes a bitstream already compressed according to one standard/profile/specification and encodes it according to a different standard/profile/specification), and decoder implementations can be threaded in a number of different ways to take advantage of the multiple processing units available in a computing device. Presently, there are three common threading methods: 1) slice-based threading, 2) function-based threading, and 3) picture-based threading.
A slice is an independent unit on the bitstream level and contains a collection of macroblocks in one picture. Each picture may contain one or more slices. Slice-based threading processes multiple slices within one picture in parallel, with each slice being allocated to one processor at any one time. It is more efficient when the number of slices is greater than or equal to the number of processors. Further, slice-based threading requires the threads to wait or block until all threads have completed before proceeding to the next picture, resulting in underutilized computational resources and significant wait times when the amount of computation is distributed unequally between slices. Slice-based threading also introduces serialization of tasks that cannot be factored into independent threads.
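By way of illustration only, the per-picture synchronization of slice-based threading can be sketched as follows. The function decode_slice is a hypothetical stand-in for real slice decoding, and barrier_utilization is a simplified model that assumes one slice per processor and a known cost per slice; neither is taken from any particular codec implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_slice(slice_data):
    # Hypothetical stand-in for slice decoding: it only sums the payload.
    return sum(slice_data)


def decode_pictures_slice_threaded(pictures, num_workers):
    """Decode the slices of each picture in parallel, one picture at a time."""
    results = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for slices in pictures:
            # list(pool.map(...)) returns only once every slice of this
            # picture is finished, which acts as the per-picture barrier.
            results.append(list(pool.map(decode_slice, slices)))
    return results


def barrier_utilization(slice_costs, num_workers):
    """Fraction of processor time spent working on one picture.

    Simplified model (assumption): one slice per processor, so the picture
    takes as long as its most expensive slice and all other processors idle
    until the barrier is reached.
    """
    picture_time = max(slice_costs)
    return sum(slice_costs) / (num_workers * picture_time)


decoded = decode_pictures_slice_threaded([[[1, 2], [3]], [[4], [5, 6]]], num_workers=2)
utilization = barrier_utilization([8, 2, 1, 1], num_workers=4)
```

In this model, four slices with costs of 8, 2, 1 and 1 units on four processors yield a utilization of 0.375, which illustrates the wait times that arise when computation is distributed unequally between slices.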
Function-based threading processes stages of functions in a pipeline fashion, with each stage being allocated to one processor at any one time. These functions may include bitstream parsing, data prediction, transformation and (inverse) quantization, reconstruction and post-filtering. The number of stages, i.e., the individual functions in the video pipeline and their granularity, limits scalability. Granularity that is too coarse results in poor resource utilization, while overly fine granularity may introduce significant threading overhead. Another problem with this approach is that there are often significant data dependencies among stages, which may result in synchronization overhead (e.g., memory traffic and the like).
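By way of illustration only, a function-based pipeline can be sketched with one thread per stage and a queue between adjacent stages. The helper names run_stage and run_pipeline, the end-of-stream marker and the two arithmetic stages used in the example are hypothetical placeholders for functions such as parsing, prediction or post-filtering.

```python
import queue
import threading

_DONE = object()  # end-of-stream marker passed from stage to stage


def run_stage(func, inbox, outbox):
    """Apply one pipeline function to every item arriving on its input queue."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # forward the marker so later stages also stop
            return
        outbox.put(func(item))


def run_pipeline(stages, items):
    """Run each stage on its own thread, linked by first-in first-out queues."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]
    for thread in threads:
        thread.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    output = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        output.append(item)
    for thread in threads:
        thread.join()
    return output


# Hypothetical stages standing in for, e.g., parsing and reconstruction.
pipeline_output = run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3])
```

The number of threads in this sketch equals the number of stages, so the degree of parallelism is bounded by how the functions are divided, and every item handed from one stage to the next passes through a queue, which is where the synchronization overhead noted above arises.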
Picture-based threading processes multiple pictures in parallel by assigning one picture to one processor at any one time. In this scheme, a coding unit (e.g., a slice, a row of macroblocks, or an individual macroblock) can be processed as soon as all of its reference data is available. Picture-based threading avoids or ameliorates the issues of the first two threading methods, but is coarse-grained in the synchronization among the threads, which may incur unnecessary stalling of threads.
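By way of illustration only, the effect of synchronization granularity on stalling in picture-based threading can be sketched with a simple timing model. The model assumes that each picture is handled by its own processor, that each picture references only the immediately preceding picture, that every macroblock row takes one time unit, and that a row depends on the reference rows up to a hypothetical number of rows, reach, below its own position. All names and values are assumptions chosen for the example.

```python
def finish_time(num_pictures, rows, reach, row_cost=1):
    """Time at which the last row of the last picture is completed.

    Row r of a picture may start once (a) row r - 1 of the same picture is
    done and (b) row min(r + reach, rows - 1) of the previous picture is done.
    Setting reach = rows - 1 models waiting for the entire reference picture.
    """
    previous = None  # per-row completion times of the reference picture
    for _ in range(num_pictures):
        current = []
        for r in range(rows):
            own_ready = current[r - 1] if r > 0 else 0
            if previous is None:
                reference_ready = 0  # the first picture has no reference
            else:
                reference_ready = previous[min(r + reach, rows - 1)]
            current.append(max(own_ready, reference_ready) + row_cost)
        previous = current
    return previous[-1]


# Three pictures of four macroblock rows each (hypothetical sizes).
row_level_sync = finish_time(num_pictures=3, rows=4, reach=1)
whole_picture_sync = finish_time(num_pictures=3, rows=4, reach=3)
```

Under these assumptions, synchronizing on individual rows completes the three pictures in 8 time units, whereas waiting for each reference picture to be fully reconstructed takes 12 time units, the difference being time that threads spend stalled.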
The inventors noticed a need for more efficient grouping of data when processing video (e.g., encoding, transcoding, decoding) to improve processor utilization while minimizing overhead due to data dependencies. The inventors of the present application propose several processing improvements to a video coding system as described herein.