Video codecs (COmpressor-DECompressor) are compression algorithms designed to encode/compress and decode/decompress video data streams to reduce the size of the streams for faster transmission and smaller storage space. While typically lossy, video codecs attempt to maintain video quality while compressing the binary data of a video stream. Video codecs are typically implemented in both hardware and software. Examples of popular video codecs and associated file formats are MPEG-4, AVI, WMV, RM, RV, H.261, H.263, and H.264.
A video stream is composed of a sequence of video frames where each frame is composed of multiple macroblocks. A video codec encodes each frame in the sequence by dividing the frame into slices or sub-portions, each slice containing an integer number of macroblocks. Each macroblock is typically a 16×16 array of luminance pixels, although other sizes of macroblocks are also possible. The number of macroblocks per slice (i.e., slice size) and the number of slices per frame (i.e., slice number) are determined by the video codec. Typically, the video frame is divided into evenly sized slices so that each slice contains the same number of macroblocks. A slice can be measured by the percentage of the frame that the slice comprises. For example, a frame can be divided into five even slices where each slice comprises 20% of the frame.
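By way of illustration only, the division of a frame into even slices can be sketched as follows. The function names, the 1920×1080 frame dimensions, and the rounding of partial macroblocks up to whole macroblocks are assumptions made for this sketch and are not taken from any particular codec.

```python
import math

def macroblocks_in_frame(width, height, mb_size=16):
    """Count the macroblocks covering a frame (assumption: partial blocks round up)."""
    return math.ceil(width / mb_size) * math.ceil(height / mb_size)

def even_slices(total_macroblocks, num_slices):
    """Split the macroblocks into num_slices runs whose sizes differ by at most one."""
    base, extra = divmod(total_macroblocks, num_slices)
    return [base + 1 if i < extra else base for i in range(num_slices)]

# Hypothetical 1920x1080 frame: 120 columns x 68 rows of 16x16 macroblocks.
total = macroblocks_in_frame(1920, 1080)
sizes = even_slices(total, 5)                  # macroblocks per slice
shares = [100 * s / total for s in sizes]      # slice size as a percentage of the frame
```

In this example each of the five slices holds 1632 of the 8160 macroblocks, i.e., 20% of the frame. When the macroblock count is not divisible by the slice number, the sketch gives the first few slices one extra macroblock each.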
Frames are encoded in slices to allow the frame to be later decoded/decompressed using parallel multithread processing. In multithread processing, each thread performs a single task (such as decoding a slice) so that multiple tasks can be performed simultaneously, for example, by multiple central processing units (CPUs). By dividing a frame into multiple slices, two or more slices can be decoded/decompressed simultaneously by two or more threads/CPUs. Each slice is considered a task unit that is put into a task list that is processed by a thread pool (a set of threads). A main thread (having the task of decoding an entire frame) and the thread pool need to synchronize after all the tasks in the task list have been processed (i.e., when all the slices of a frame have been decoded).
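The task list, thread pool, and synchronization described above can be sketched with the Python standard library, purely as an illustration. The `decode_slice` function is a hypothetical placeholder that performs no actual decoding, and `decode_frame` is likewise a name chosen for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor, wait

def decode_slice(slice_index, macroblock_count):
    """Hypothetical placeholder for decoding one slice; returns its inputs unchanged."""
    return slice_index, macroblock_count

def decode_frame(slice_sizes, num_threads=2):
    """Main-thread task: decode every slice of one frame using a thread pool."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Each slice is one task unit placed in the pool's task list.
        tasks = [pool.submit(decode_slice, i, n) for i, n in enumerate(slice_sizes)]
        # Synchronization point: the main thread blocks until all slices are decoded.
        wait(tasks)
        return [t.result() for t in tasks]

decoded = decode_frame([1632] * 5)
```

The call to `wait` corresponds to the synchronization between the main thread and the thread pool: no work on a subsequent frame would begin in this sketch until every slice of the current frame has completed.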
There are, however, disadvantages to encoding a frame in slices as each slice carries an amount of overhead. First, each slice requires a header that consumes memory and processing resources as it increases the encoding size and decoding time required for each frame. Second, predictive ability is lost across slice boundaries. Typically, macroblocks benefit from other macroblocks within the same slice in that information from other macroblocks can be used as predictive information for another macroblock. A macroblock in one slice, however, cannot benefit from predictive information based on a macroblock in another slice. As such, the greater the number of slices per frame, the greater the amount of predictive loss per frame.
The overhead of a frame slice must be considered when determining the slice size and slice number of a frame. Dividing a frame into fewer and larger slices reduces slice overhead but causes a higher typical idle time in the threads/CPUs that decode the slices (as discussed below in relation to FIGS. 1A-B). Conversely, dividing a frame into numerous smaller slices causes a lower typical idle time in the threads/CPUs that decode the slices but increases slice overhead.
FIG. 1A is a timing diagram illustrating the time required to decode two large slices comprising a video frame. A first slice is decoded by a first thread/CPU and a second slice is decoded by a second thread/CPU. The first and second slices each comprise 50% of the frame. Note that although the first and second slices are of equal size (i.e., contain the same number of macroblocks), due to processing variations, the first and second slices will be decoded at different rates so that the times for completing the decoding of the first and second slices vary. This is true even if it is assumed that the first and second slices have identical content (although typically the first and second slices have different content) and the first and second slices are processed by identical CPUs. Processing variations are caused, for example, by the operating system and other applications that are concurrently running on the system and “stealing” processing cycles of the CPUs.
Typically, each slice in the previous frame must be decoded before decoding of a next frame in the sequence can begin. This is due to the decoding methods of video codecs that use predictive information derived from previous frames thereby requiring the decoding of an entire previous frame before beginning the decoding of the next frame. As stated above, the main thread (having the task of decoding an entire frame) and the thread pool synchronize after all the slices of a frame have been decoded.
As such, a thread/CPU (referred to herein as an “idling” thread/CPU) that finishes decoding all of the slices assigned to the thread/CPU before other threads/CPUs experiences “idle time,” i.e., a period of time during which it does not decode a slice. “Idle time” of a thread/CPU exists when the last slice in a frame to be decoded is in the process of being decoded by another thread/CPU and there are no additional slices in the frame to be decoded. In other words, when a thread in the thread pool cannot find a task (because the task list is empty), in order to synchronize with the other threads, it has to wait for the other threads to complete their respective tasks. In general, all but one thread/CPU in a set of threads/CPUs available for processing slices of a frame (referred to herein as decoding threads/CPUs) will experience “idle time.” For example, for a set of four threads/CPUs, three of the four threads/CPUs will experience “idle time” during the processing of a frame. The only thread/CPU in the set of threads/CPUs that will not experience “idle time” (i.e., will always be busy) is the last thread/CPU to finish processing of all slices of the frame assigned to the thread/CPU (referred to herein as the “non-idling” thread/CPU). The “non-idling” thread/CPU in the set of threads/CPUs is random and varies for each frame.
The duration of the “idle time” of a thread/CPU begins when the thread/CPU finishes decoding the last slice assigned to the thread/CPU and ends when the last slice in the frame is decoded by the “non-idling” thread/CPU (and hence the thread/CPU can begin decoding a slice of the next frame of the sequence). As such, the idle time of a thread/CPU is determined, in large part, by the size of the last slice being decoded by the “non-idling” thread/CPU: typically, the larger the size of the last slice, the longer the idle time of the thread/CPU.
In the example of FIG. 1A, there are two threads/CPUs available for decoding slices and each frame is divided into two slices each comprising 50% of the frame. Dividing a frame into such large slices reduces the amount of slice overhead but causes a higher typical idle time in the threads/CPUs. As shown in FIG. 1A, the first thread/CPU completes decoding of the slice before the second thread/CPU and experiences an idle time of duration x. In the example of FIG. 1B, a frame is divided into ten smaller slices each comprising 10% of the frame. Dividing a frame into such smaller slices reduces the typical idle time in the threads/CPUs but increases the amount of slice overhead. As shown in FIG. 1B, the first thread/CPU completes decoding all slices assigned to it before the second thread/CPU and experiences an idle time of duration y, where y is less than x.
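The relationship between slice size and idle time in the two-slice and ten-slice cases can be illustrated with a small deterministic simulation. The sketch rests on several assumptions that do not come from the figures: each thread takes the next slice from a shared task list as soon as it becomes free, the second thread is uniformly slower (13 versus 10 time units per 1% of the frame) as a stand-in for processing variations, and slice overhead is not modeled at all. The name `simulate_idle` and all numeric values are hypothetical.

```python
import heapq

def simulate_idle(slice_percents, cost_per_percent):
    """Return (frame_time, idle_times) for threads pulling slices from a shared list.

    slice_percents:   size of each slice as a percentage of the frame.
    cost_per_percent: time units each thread needs to decode 1% of the frame
                      (assumed constant per thread for this sketch).
    """
    num_threads = len(cost_per_percent)
    # Heap of (time at which the thread becomes free, thread index).
    free_at = [(0, t) for t in range(num_threads)]
    heapq.heapify(free_at)
    finish = [0] * num_threads
    for percent in slice_percents:
        now, t = heapq.heappop(free_at)            # earliest-free thread takes the slice
        finish[t] = now + percent * cost_per_percent[t]
        heapq.heappush(free_at, (finish[t], t))
    frame_time = max(finish)                       # the non-idling thread finishes last
    return frame_time, [frame_time - f for f in finish]

two_large = simulate_idle([50, 50], [10, 13])      # two slices of 50% each
ten_small = simulate_idle([10] * 10, [10, 13])     # ten slices of 10% each
```

Under these assumptions the two-slice case leaves one thread idle for 150 time units, while the ten-slice case leaves one thread idle for 80 time units, consistent with y being less than x. Because the sketch omits per-slice header cost and predictive loss, it shows only the idle-time side of the trade-off.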
As such, there is a need for a method for determining the slice size of a frame in a multithread environment that both reduces slice overhead and reduces the typical idle time of the threads/CPUs decoding the slices.
Also, in decoding an image frame, a deblocking/loop filter is used to reduce the appearance of macroblock borders in the image frame. As discussed above, a popular video codec is H.264. Typically, however, during the filtering stage of the deblocking filter, macroblocks are processed/filtered sequentially with strict dependencies specified under the H.264 codec and are not processed/filtered in parallel using multithreading.