The present invention relates to video processing and in particular to real-time video processing in dedicated hardware devices.
In the design of such dedicated hardware video processing devices, it is generally desired to reduce the need for external memory components, and for internal memory.
In a video processing device embodied as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA), input frames are stored in a frame buffer usually located in external memory, because they do not fit in the device itself. For processing, several frames are loaded line by line to be stored in an internal memory of the device, called a line buffer.
FIG. 1 shows the typical data flow and storage involved in a conventional video processing device 8. The input pixels 1 received at an input port 2 are stored into a frame buffer 4, usually implemented as one or more dynamic random access memory (DRAM) chips, via a DRAM interface 3. Then, the video processor 6 fetches lines from the DRAM 4 through the DRAM interface 3, storing them temporarily in the line buffer 5. The output 9 of processor 6 is fed to the output port 7 to be transmitted to the next device to which the video processing device 8 is connected. All image transfers are done in raster order, i.e. each frame full line by full line, and each line of a frame pixel by pixel from left to right.
In such a device 8, an external DRAM 4 is required if the video processor 6 needs to process pixels originating from different frames simultaneously. This is necessary, for example, in applications such as deinterlacing, frame rate conversion, and overdrive processing in LCD timing controllers.
If the video processor 6 also needs to have access to pixels of different lines at the same time, a line buffer 5 of substantial size needs to be present inside the device 8. Important design parameters include the size of the DRAM 4, the available bandwidth between the device 8 and the DRAM chip(s) 4, and the size of the line buffer 5.
Considering input video frames of Y lines of X pixels each, with an input frame rate of F, the input pixel rate is X×Y×F, not taking blanking into account. Typical values are X=1920, Y=1080 and F=50 or 60 FPS (frames per second). Similar parameters X′, Y′ and F′ describe the output frame size and frame rate. In order to output one pixel, the video processor 6 needs to have simultaneous access to a context of C lines of the input video frames, for N different video frames. The DRAM 4 must then be able to store at least N frames of video, i.e. a total of X×Y×N pixels. At the DRAM interface, the pixel rate is X×Y×F pixels per second for writing and X×Y×N×F′ pixels per second for reading. Typical data rates are then about 1 billion pixels per second, which amounts to about 30 Gb/s if a pixel is represented in RGB with 10 bits per channel. High transfer rates between the device 8 and the DRAM 4 are not desirable because they may require using a higher number of DRAM chips in parallel. The video processing device (in the case of an ASIC) then needs to have a large number of pins to access all the DRAM chips.
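By way of illustration only, the rates given above can be evaluated numerically. The following Python sketch applies the expressions X×Y×F and X×Y×N×F′ directly. The function name `dram_traffic` and the values N=4 and F′=120 are assumptions chosen for the example; they are not taken from any particular device.

```python
def dram_traffic(x, y, f_in, f_out, n_frames, bits_per_pixel):
    """Return (write pixels/s, read pixels/s, total bits/s) at the DRAM interface.

    Blanking is ignored, as in the text.
    """
    write_px = x * y * f_in              # X * Y * F
    read_px = x * y * n_frames * f_out   # X * Y * N * F'
    total_bits = (write_px + read_px) * bits_per_pixel
    return write_px, read_px, total_bits


# Assumed example: 1080p input at 60 FPS, output at 120 FPS,
# N = 4 frames read per output frame, RGB with 10 bits per channel.
write_px, read_px, total_bits = dram_traffic(1920, 1080, 60, 120, 4, 30)
print(write_px)          # 124416000 pixels/s written
print(read_px)           # 995328000 pixels/s read
print(total_bits / 1e9)  # 33.59232 Gb/s in total
```

Under these assumed values the read traffic alone is close to 1 billion pixels per second, consistent with the order of magnitude stated above.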
The required size of the internal line buffer 5 is X×C×N pixels. Hosting such a large line buffer in an ASIC is expensive, because it increases the die size of the ASIC and has a negative impact on the manufacturing yield. It is thus desirable to limit the size of the line buffer as much as possible.
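As a rough numerical illustration of the X×C×N expression, the following Python sketch computes the buffer capacity in pixels and in bytes. The helper name `line_buffer_size` and the values C=8 and N=4 are assumptions made for the example only.

```python
def line_buffer_size(x, c_lines, n_frames, bits_per_pixel):
    """Return (pixels, bytes) needed to hold C lines of X pixels for N frames."""
    pixels = x * c_lines * n_frames            # X * C * N
    size_bytes = pixels * bits_per_pixel // 8  # whole bytes (exact here)
    return pixels, size_bytes


# Assumed example: 1920-pixel lines, C = 8 lines of context, N = 4 frames,
# RGB with 10 bits per channel (30 bits per pixel).
pixels, size_bytes = line_buffer_size(1920, 8, 4, 30)
print(pixels)      # 61440
print(size_bytes)  # 230400
```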
One way of reducing the size of the internal line buffer is to perform sequential processing by splitting the images into tiles, instead of working on full frames in raster order. This is illustrated in FIG. 2. The input video frames 1 are written into DRAM 4 via the input port 2 and the DRAM interface 3 as in FIG. 1. However, the lines of the frames are not read in their entirety at once. Instead, the frames are split horizontally into smaller vertical windows, or tiles, and the tiles are processed in succession. The gain is that the lines of the line buffer 5 have a length smaller than the full width of the video frame, corresponding to the width of the tiles. The overall size of the line buffer 5 can then be reduced in proportion. The downside is that the tiles must overlap so that the output tiles can be merged without any boundary artifact between the tiles. This causes an increase in the data rate in proportion to the overlapping factor, which can be 20-30%. This proportion increases with the number of tiles. In addition, the output of the video processor 6 cannot be directly sent to the output port 7 because it is not in raster order, but rather in the order of the tiles. A reordering of the pixels is necessary, and this requires an additional transit via the DRAM 4 between the video processor 6 and the output port 7. This can also substantially increase the required bandwidth at the DRAM interface. The solution illustrated by FIG. 2 thus trades a reduction of the internal memory required by the line buffer 5 for an increase in bandwidth to the external memory 4.
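The trade-off between tile width and overlap can be made concrete with a short sketch. The Python code below is a simplified model, not the scheme of FIG. 2 itself: it assumes that the frame width is an exact multiple of the number of tiles and that each tile is extended by a fixed margin of pixels on every side that borders another tile. The names `tile_spans` and `read_overhead` and the 32-pixel margin are hypothetical.

```python
def tile_spans(frame_width, num_tiles, margin):
    """Return the [start, stop) column range read for each overlapping tile."""
    if frame_width % num_tiles != 0:
        raise ValueError("this sketch assumes frame_width is a multiple of num_tiles")
    core = frame_width // num_tiles  # width of the non-overlapping part
    spans = []
    for t in range(num_tiles):
        start = max(0, t * core - margin)                 # no margin past the left edge
        stop = min(frame_width, (t + 1) * core + margin)  # nor past the right edge
        spans.append((start, stop))
    return spans


def read_overhead(frame_width, num_tiles, margin):
    """Fraction of extra pixels read per line because of the overlaps."""
    total = sum(stop - start for start, stop in tile_spans(frame_width, num_tiles, margin))
    return total / frame_width - 1.0


spans = tile_spans(1920, 8, 32)
print(max(stop - start for start, stop in spans))  # 304: line length in the buffer
print(round(read_overhead(1920, 8, 32), 3))        # 0.233
print(round(read_overhead(1920, 16, 32), 3))       # 0.5
```

With these assumed numbers, eight tiles shorten each buffered line from 1920 to 304 pixels at the cost of about 23% more pixels read, and doubling the number of tiles with the same margin raises the overhead to 50%.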
Compression techniques are another way of reducing both the required size of the internal memory and the bandwidth to the external DRAM chip(s). One way of using compression to this end is illustrated in FIG. 3. Between the input port 2 and the DRAM interface 3, an encoder 10 compresses the input pixel sequence for storage into DRAM 4. For operating the video processor 6, a decoder 20 receives the compressed pixel data read from DRAM 4 and restores decompressed pixel lines, which are written into the decompressed line buffer 15. This buffer may contain several adjacent lines forming a stripe. The video processor 6 reads pixel values from the decompressed line buffer 15, and delivers output pixels 9 via the output port 7.
The bandwidth to or from the external DRAM chip(s) is divided by the compression factor. The number or size of the external DRAM chip(s) can be reduced by the same factor. Applying compression in such a context is disclosed in US 2007/0110151 A1, where a differential pulse code modulation (DPCM) scheme is used for compression.
In certain known compression techniques, the RGB pixels are converted to a YUV color space, and the color channels U and V are low-pass filtered and down-sampled by a factor of 2 horizontally. The frame is then stored in what is commonly called YUV 422 format. Other color sub-sampling schemes exist, like YUV 420 or YUV 411. See, e.g., WO 2006/090334. Recovering the RGB pixels requires first up-sampling the U and V color planes again, and then converting the color space from YUV back to RGB. In this way, the color information is simply down-sampled. For certain kinds of content, such as video games, the reduced color resolution produces visible artifacts. Such compression schemes allow compression factors of 1.5:1, or 2:1 in the very best case.
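The compression factors quoted above follow from counting samples per pixel. The Python sketch below assumes that all samples have the same bit depth and that the reference is a full-resolution three-channel image. The table of decimation factors reflects the usual meaning of the format names, and the function name `compression_factor` is hypothetical.

```python
from fractions import Fraction

# (horizontal, vertical) chroma decimation factor for each format.
SCHEMES = {
    "YUV 422": (2, 1),
    "YUV 420": (2, 2),
    "YUV 411": (4, 1),
}


def compression_factor(h_decim, v_decim):
    """Ratio of 3 samples per pixel to the samples kept after chroma sub-sampling."""
    # One luma sample per pixel plus two chroma planes, each decimated by h * v.
    samples_per_pixel = 1 + 2 * Fraction(1, h_decim * v_decim)
    return Fraction(3) / samples_per_pixel


for name, (h, v) in SCHEMES.items():
    print(name, compression_factor(h, v))
# YUV 422 3/2
# YUV 420 2
# YUV 411 2
```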
More efficient compression schemes such as JPEG or JPEG-2000 are widely known. They offer a visual quality close to lossless with compression factors of 3 to 5. They are not well suited to this application, however, because in most cases random access to an image region is not possible without decompressing the entire image. Also, it is desirable that the frame buffer compression process provide a constant bit rate (CBR) reduction factor in order to ensure that the peak bit rate for transmitting the frame buffers at a constant pixel rate is controlled.
There is a need for a new way of dealing with frame and line buffer constraints in video processing devices. There is also a need for a compression scheme usable in such a context, which provides a good tradeoff between compression ratio and image quality, while satisfying a CBR constraint with a fine granularity.