The present invention relates to video coding, and more particularly to H.264 and related coding methods.
Currently, H.264 is the most advanced video compression standard and is being jointly developed by MPEG and ITU-T. It offers much higher coding efficiency compared to the existing video standards such as MPEG1, MPEG2, and MPEG4. It is widely expected that H.264 will be adopted in applications such as video conferencing, streaming video, HD-DVD, and digital video broadcasting.
In H.264 the video element bitstream is defined in the form of network abstraction layer (NAL) units. A NAL unit is a syntax structure containing an indication of the type of data to follow and bytes containing that data in the form of a raw byte stream payload (RBSP) interspersed as necessary with emulation prevention bytes. A NAL unit could carry a sequence header, a picture header, or a slice with header and data. In the byte stream format described in Annex B of H.264, the start code is defined as byte-aligned 0x000001 (i.e. twenty-three 0 bits followed by a single 1 bit). The byte stream format consists of a sequence of byte stream NAL unit syntax structures. Each byte stream NAL unit syntax structure contains one three-byte start code prefix (0x000001) followed by one nal_unit (NumBytesInNALunit) syntax structure. Indeed, decoding the byte stream to yield NAL units, as described in H.264 Annex B, has the following steps:
1. Find the next 0 byte plus three-byte start code (0x000001), and discard the 0 byte.
2. Discard the three-byte start code.
3. NumBytesInNALunit is set equal to the number of bytes up to and including the last byte preceding one of: a sequence of three 0 bytes, the next start code, or the end of the byte stream.
4. NumBytesInNALunit bytes are removed from the byte stream and are decoded using the NAL unit decoding process.
5. When the next three bytes are not a start code (otherwise go to step 2) and the next four bytes are not a 0 byte plus a start code (otherwise go to step 1), repeatedly discard a 0 byte until a 0 byte plus a start code are found and then go to step 1.
The NAL unit is then decoded.
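The extraction steps above can be sketched in a few lines. This is a minimal illustration only, assuming the whole byte stream is held in memory; the function name `split_annex_b` is hypothetical and does not come from the standard or from any library.

```python
from typing import List

START_CODE = b"\x00\x00\x01"   # three-byte start code prefix
THREE_ZEROS = b"\x00\x00\x00"  # also terminates a NAL unit (step 3)


def split_annex_b(stream: bytes) -> List[bytes]:
    """Return the NAL units of an Annex B byte stream, start codes removed."""
    nal_units = []
    # Steps 1-2: locate the first start code; any leading 0 byte is skipped.
    pos = stream.find(START_CODE)
    while pos != -1:
        start = pos + len(START_CODE)
        # Step 3: the NAL unit ends before the next 0x000000, the next
        # start code, or the end of the stream, whichever comes first.
        hits = [i for i in (stream.find(THREE_ZEROS, start),
                            stream.find(START_CODE, start)) if i != -1]
        end = min(hits) if hits else len(stream)
        # Step 4: remove NumBytesInNALunit bytes for NAL unit decoding.
        nal_units.append(stream[start:end])
        # Step 5: discard 0 bytes until the next start code is found.
        pos = stream.find(START_CODE, end)
    return nal_units
```

For a stream holding three short NAL units, two behind four-byte prefixes and one behind a three-byte prefix, the function returns the three payloads without their start codes.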
In order to avoid emulation of the three-byte start code within the NAL unit, certain rules are defined. First of all, the last byte of a NAL unit shall not be equal to 0x00. Secondly, within a NAL unit, the following three-byte sequences shall not occur at any byte-aligned position:
0x000000
0x000001
0x000002
Finally, within a NAL unit, any four-byte sequence that starts with the three bytes 0x000003 other than the following sequences shall not occur at any byte-aligned position:
0x00000300
0x00000301
0x00000302
0x00000303
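These three rules can be checked mechanically. The following is a small sketch, not part of the standard; the name `violates_nal_rules` is hypothetical, and the function only tests the byte patterns listed above.

```python
def violates_nal_rules(nal: bytes) -> bool:
    """Return True if the NAL unit bytes break any of the rules above."""
    # Rule 1: the last byte of a NAL unit shall not be 0x00.
    if nal and nal[-1] == 0x00:
        return True
    for i in range(len(nal) - 2):
        if nal[i] == 0x00 and nal[i + 1] == 0x00:
            third = nal[i + 2]
            # Rule 2: 0x000000, 0x000001 and 0x000002 are forbidden.
            if third <= 0x02:
                return True
            # Rule 3: 0x000003 may only be followed by 0x00..0x03.
            if third == 0x03 and i + 3 < len(nal) and nal[i + 3] > 0x03:
                return True
    return False
```

A NAL unit that ends in 0x000003 passes the check, which matches the case where a final 0x03 is appended after a trailing zero byte.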
An encoder can produce a NAL unit from RBSP data (the raw bitstream data of a NAL unit before emulation prevention) by the following procedure.
The RBSP data is searched for byte-aligned bits of the following binary patterns:
‘00000000 00000000 000000xx’
(where xx represents any 2-bit pattern: 00, 01, 10, or 11), and a byte equal to 0x03 is inserted between the second and third bytes to replace these bit patterns with the bit patterns
‘00000000 00000000 00000011 000000xx’,
and finally, when the last byte of the RBSP data is equal to 0x00, a final byte equal to 0x03 is appended to the end of the data.
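The insertion procedure can be written as a single pass over the RBSP bytes. This is a minimal sketch under the assumption that the RBSP data is byte-aligned and held in memory; the function name `insert_stuff_bytes` is hypothetical, and the NAL unit header byte is not handled here.

```python
def insert_stuff_bytes(rbsp: bytes) -> bytes:
    """Insert 0x03 stuff bytes so no start code is emulated in the output."""
    out = bytearray()
    zeros = 0  # number of consecutive 0x00 bytes just written
    for b in rbsp:
        # Two zero bytes followed by 0x00..0x03 matches the pattern
        # '00000000 00000000 000000xx', so a 0x03 goes in between.
        if zeros >= 2 and b <= 0x03:
            out.append(0x03)
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0x00 else 0
    # A trailing 0x00 is followed by a final 0x03.
    if rbsp and rbsp[-1] == 0x00:
        out.append(0x03)
    return bytes(out)
```

For example, the RBSP bytes 00 00 01 become 00 00 03 01, and four zero bytes become 00 00 03 00 00 03.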
During decoding, a decoder should recognize the stuff byte 0x03 and discard it from the bitstream.
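The decoder-side removal is the mirror image of the insertion. The sketch below assumes the input obeys the NAL unit rules stated earlier, so that every 0x03 following two zero bytes is a stuff byte; the name `remove_stuff_bytes` is hypothetical.

```python
def remove_stuff_bytes(nal_payload: bytes) -> bytes:
    """Drop each 0x03 that follows two consecutive 0x00 bytes."""
    out = bytearray()
    zeros = 0  # number of consecutive 0x00 bytes just seen
    for b in nal_payload:
        if zeros >= 2 and b == 0x03:
            zeros = 0  # discard the stuff byte and restart the zero count
            continue
        out.append(b)
        zeros = zeros + 1 if b == 0x00 else 0
    return bytes(out)
```

Each discarded stuff byte shortens the output by one byte relative to the input, which is the size change that matters for the buffer handling discussed later.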
A simple method to prevent start code emulation would be to have two on-chip bitstream buffers. As shown in FIG. 1, on the encoder side, the encoder puts all of the RBSP data (of a NAL unit) into the first bitstream buffer. After encoding the NAL unit, the encoder parses through the RBSP bitstream data byte by byte, inserts stuff bytes 0x03 as needed to form the NAL unit data, and stores the NAL unit data in the second bitstream buffer. The bitstream is finally written to off-chip memory, e.g., SDRAM. On the decoder side, the NAL unit data is loaded into the second bitstream buffer from the SDRAM. The decoder then parses through the NAL unit data byte by byte to produce RBSP data by eliminating stuff bytes 0x03 from the NAL unit data. The RBSP data is stored in the first bitstream buffer for decoding.
However, the two-buffer method leads to problems for H.264 implementation on 16-bit devices, such as the DM270 manufactured by Texas Instruments, especially on the decoder side. The DM270 has a C54-based DSP subsystem, and a C54 supports either 16-bit or 32-bit memory access but not 8-bit memory access. The SDRAM off-chip memory on a DM270 requires that data be accessed at BURST boundaries (1 BURST=32 bytes in this particular case). Moreover, a circular decoder bitstream buffer is used to avoid bitstream shifting and to satisfy the BURST-aligned SDRAM access requirements.
On the decoder side, eliminating the stuff byte “0x03” from the bitstream results in a decrease of the active bitstream size. This creates problems for bitstream handling because of the circular bitstream buffer usage and the BURST-aligned SDRAM access requirements. An example is shown in FIG. 2 to explain the problems. If the circular bitstream buffer is 512 words (1 word=16 bits), then the DSP first loads 512 words from the SDRAM into the second bitstream buffer, parses through it, finds and eliminates, for example, one stuffing byte from the incoming stream, and then copies the resulting bitstream to the circular buffer (first buffer). In this particular example, the circular buffer has only 511.5 active words due to the deleted stuffing byte. However, the active circular buffer size has to be a multiple of 16 bits (because a 16-bit DSP cannot access memory in bytes). Otherwise, the circular buffer will not work. In order to operate the bitstream buffer in a circular manner, the decoder has to load one more byte from the SDRAM to fill up the bitstream buffer. However, this moves the start address of the next SDRAM access off the BURST-aligned boundary. The decoder can choose not to load the additional byte so as to keep the next SDRAM access start address BURST-aligned, but this prevents the bitstream buffer from working in a circular manner and greatly decreases the decoder performance. Indeed, any odd number of stuffing bytes in the given size of bitstream will lead to these problems. Note that the start code plus the nal_unit_type byte take up four bytes.
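The arithmetic of the example can be made explicit. The sketch below uses the sizes given in the text (512-word buffer, 16-bit words, 32-byte BURST); the function name `after_unstuffing` and the simple "top up with as many bytes as were removed" model are assumptions made for illustration only.

```python
WORD_BYTES = 2      # 1 word = 16 bits
BURST_BYTES = 32    # 1 BURST = 32 bytes
BUFFER_WORDS = 512  # circular bitstream buffer size in words


def after_unstuffing(stuff_bytes_removed: int, sdram_start: int = 0) -> dict:
    """Buffer state after one load of BUFFER_WORDS and stuff byte removal."""
    loaded = BUFFER_WORDS * WORD_BYTES
    active = loaded - stuff_bytes_removed
    # Assumed model: the decoder tops the buffer up with as many extra
    # bytes as it removed, so the next SDRAM read starts that much later.
    next_address = sdram_start + loaded + stuff_bytes_removed
    return {
        "active_words": active / WORD_BYTES,
        "word_aligned": active % WORD_BYTES == 0,
        "next_address": next_address,
        "next_address_burst_aligned": next_address % BURST_BYTES == 0,
    }
```

With one stuffing byte removed, the active size is 511.5 words, and topping up moves the next SDRAM read to byte address 1025, which is not a multiple of 32.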
On the encoder side, the problem is not as serious as on the decoder side. After the emulation prevention on the RBSP data of a NAL unit, the encoder writes out NAL unit data (in the second buffer) to SDRAM in sizes that are multiples of BURST. After writing out data, there is residual data left in the second buffer (of size less than the BURST length). The encoder copies the residual NAL unit data back to the first bitstream buffer and starts encoding the next NAL unit. During the emulation prevention process, the encoder should skip (just make a direct copy from the first buffer to the second buffer) the residual data of the previous NAL unit, and start the emulation prevention at the beginning of the RBSP data of the current NAL unit.
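The encoder-side bookkeeping can be sketched as follows. This is an illustration only: the function names are hypothetical, the 32-byte BURST is taken from the DM270 description, and the emulation prevention step is passed in as a caller-supplied function `stuff` rather than implemented here.

```python
BURST_BYTES = 32  # 1 BURST = 32 bytes, per the DM270 description


def split_for_burst_write(second_buffer: bytes):
    """Return (bytes to write to SDRAM now, residual shorter than one BURST)."""
    writable = len(second_buffer) - len(second_buffer) % BURST_BYTES
    return second_buffer[:writable], second_buffer[writable:]


def build_second_buffer(residual: bytes, rbsp: bytes, stuff) -> bytes:
    """Copy the residual unchanged, then stuff only the current RBSP data."""
    # The residual is already NAL unit data from the previous unit, so it
    # is copied directly and emulation prevention starts at the new RBSP.
    return residual + stuff(rbsp)
```

For a 70-byte second buffer, 64 bytes are written out and 6 bytes are carried over to the next NAL unit.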