Generally, the amount of data used to represent video data is very large. Accordingly, an apparatus handling such video data compresses the video data by encoding before transmitting the video data to another apparatus or before storing the video data in a storage device. Coding standards such as MPEG-2 (Moving Picture Experts Group Phase 2), MPEG-4, and H.264 MPEG-4 Advanced Video Coding (MPEG-4 AVC/H.264), devised by the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC), are typical video coding standards widely used today.
Such coding standards employ an inter-coding method that encodes a picture by using not only information from itself but also information from pictures before and after it, and an intra-coding method that encodes a picture by using only information contained in the picture to be encoded. These standards use three types of pictures, referred to as the intra-coded picture (I picture), the forward predicted picture (P picture), which is usually predicted from a past picture, and the bidirectional predicted picture (B picture), which is usually predicted from both past and future pictures.
Generally, the amount of code of a picture or block encoded by inter-coding is smaller than the amount of code of a picture or block encoded by intra-coding. In this way, the amount of code varies from picture to picture within the same video sequence, depending on the coding mode selected. Similarly, the amount of code varies from block to block within the same picture, depending on the coding mode selected. Therefore, in order to enable a data stream containing encoded video to be transmitted at a constant transmission rate even if the amount of code temporally varies, a transmit buffer for buffering the data stream is provided at the transmitting end, and a receive buffer for buffering the data stream is provided at the receiving end.
MPEG-2 and MPEG-4 AVC/H.264 each define the behavior of a receive buffer in an idealized video decoding apparatus referred to as the video buffering verifier (VBV) or the coded picture buffer (CPB), respectively. For convenience, the idealized video decoding apparatus will hereinafter be referred to simply as the idealized decoder. It is specified that the idealized decoder performs instantaneous decoding that takes zero time to decode. For example, Japanese Laid-open Patent Publication No. 2003-179938 discloses a video encoder control method concerning the VBV.
In order for the receive buffer in the idealized decoder to not overflow or underflow, the video encoder controls the amount of code to guarantee that all the data needed to decode a given picture are available in the receive buffer when the idealized decoder decodes that given picture.
When the video encoder is transmitting an encoded video data stream at a constant transmission rate, the receive buffer may underflow if the transmission of the data needed to decode the picture has not been completed by the time the picture is to be decoded and displayed by the video decoder. In other words, the receive buffer underflow refers to a situation in which the data needed to decode the picture are not available in the receive buffer of the video decoder. If this happens, the video decoder is unable to perform decoding, and frame skipping occurs.
In view of this, the video decoder displays the picture after delaying the stream by a prescribed time from its receive time so that the decoding can be done without causing the receive buffer to underflow. As described earlier, it is specified that the idealized decoder accomplishes decoding in zero time. As a result, if the input time of the i-th picture to the video encoder is t(i), and the decode time of the i-th picture at the idealized decoder is tr(i), then the earliest time at which the picture becomes ready for display is the same as tr(i). Since the picture display period {t(i+1)−t(i)} is equal to {tr(i+1)−tr(i)} for any picture, the decode time tr(i) is given as tr(i)=t(i)+dly, i.e., the time delayed by a fixed time dly from the input time t(i). This means that the video encoder has to complete the transmission of all the data needed for decoding to the receive buffer by the time tr(i).
Referring to FIG. 1, a description will be given of how the receive buffer operates. In FIG. 1, the abscissa represents the time, and the ordinate represents the buffer occupancy of the receive buffer. Solid line graph 100 depicts the buffer occupancy as a function of time.
The buffer occupancy of the receive buffer is restored at a rate synchronized to a prescribed transmission rate, and the data used for decoding each picture is retrieved from the buffer at the decode time of the picture. The data of the i-th picture starts to be input to the receive buffer at time at(i), and the final data of the i-th picture is input at time ft(i). The idealized decoder completes the decoding of the i-th picture at time tr(i), and thus the i-th picture becomes ready for display at time tr(i). However, if the data stream contains a B picture, the actual display time of the i-th picture may become later than tr(i) due to the occurrence of picture reordering (changing the encoding order).
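The receive-buffer behavior described above can be illustrated with a minimal sketch. It assumes that the data of consecutive pictures arrive back to back at a constant bit rate, so that at(i) equals ft(i−1), that t(i) = i·tc, and that the buffer capacity is unbounded (overflow is not modeled). The function name find_underflows and its parameters are hypothetical and are not taken from any standard.

```python
from fractions import Fraction


def find_underflows(picture_bits, bit_rate, tc, dly):
    """Return the indices of pictures whose final bit arrives after tr(i).

    picture_bits: coded size of each picture in bits (decoding order)
    bit_rate:     constant transmission rate in bits per second
    tc:           picture interval in seconds (assumed t(i) = i * tc)
    dly:          fixed delay so that tr(i) = t(i) + dly
    """
    underflows = []
    arrival_end = Fraction(0)  # ft(i-1); the first bit arrives at time 0
    for i, bits in enumerate(picture_bits):
        arrival_start = arrival_end                             # at(i)
        arrival_end = arrival_start + Fraction(bits, bit_rate)  # ft(i)
        decode_time = dly + i * tc                              # tr(i)
        if arrival_end > decode_time:
            # Data for picture i is incomplete at its decode time.
            underflows.append(i)
    return underflows


sizes = [400_000, 100_000, 100_000, 600_000]
print(find_underflows(sizes, 1_000_000, Fraction(1, 10), Fraction(1, 2)))
print(find_underflows(sizes, 1_000_000, Fraction(1, 10), Fraction(1)))
```

In this example, the large fourth picture underflows when dly is 0.5 seconds but not when dly is 1 second, which shows why the encoder has to control the amount of code relative to the fixed delay.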
The method of describing the decode time and display time of each picture in MPEG-4 AVC/H.264 will be described in detail below.
In MPEG-4 AVC/H.264, supplemental information not directly relevant to the decoding of pixels is described in a supplemental enhancement information (SEI) message. Tens of SEI message types are defined, and each type is identified by a payloadType parameter. The SEI is appended to each picture.
BPSEI (Buffering Period SEI), one type of SEI, is appended to a self-contained picture, i.e., a picture (generally, an I picture) that can be decoded without any past pictures. A parameter InitialCpbRemovalDelay is described in the BPSEI. The InitialCpbRemovalDelay parameter indicates the difference between the time of arrival in the receive buffer of the first bit of the BPSEI-appended picture and the decode time of the BPSEI-appended picture. The difference is expressed in units of a 90 kHz clock. With the time of arrival in the video decoder of the first bit of the encoded video data designated as time 0, i.e., at(0)=0, the decode time tr(0) of the first picture is delayed from time at(0) by an amount of time equal to InitialCpbRemovalDelay÷90,000 [sec].
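As a worked illustration of the conversion from 90 kHz clock ticks to seconds, the following minimal sketch computes tr(0). The function name first_decode_time and the constant CLOCK_HZ are hypothetical names introduced only for this example.

```python
from fractions import Fraction

CLOCK_HZ = 90_000  # InitialCpbRemovalDelay is counted in 90 kHz clock ticks


def first_decode_time(initial_cpb_removal_delay, first_bit_arrival=Fraction(0)):
    """Return tr(0) in seconds, given at(0) and InitialCpbRemovalDelay."""
    return first_bit_arrival + Fraction(initial_cpb_removal_delay, CLOCK_HZ)


# A delay of 45,000 ticks corresponds to half a second.
print(first_decode_time(45_000))
```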
Generally, PTSEI (Picture Timing SEI) as one type of SEI is appended to each picture. Parameters CpbRemovalDelay and DpbOutputDelay are described in the PTSEI. The CpbRemovalDelay parameter indicates the difference between the decode time of the immediately preceding BPSEI-appended picture and the decode time of the PTSEI-appended picture. The DpbOutputDelay parameter indicates the difference between the decode time of the PTSEI-appended picture and the display time of the picture. The resolution of these differences is one field picture interval. Accordingly, when the picture is a frame, the value of each of the parameters CpbRemovalDelay and DpbOutputDelay is a multiple of 2.
The decode time tr(i) of each of the second and subsequent pictures is delayed from the decode time tr(0) of the first picture by an amount of time equal to tc*CpbRemovalDelay(i) [sec]. CpbRemovalDelay(i) is the CpbRemovalDelay appended to the i-th picture. Here, tc is the inter-picture time interval [sec], i.e., the field picture interval in which CpbRemovalDelay is counted; for example, in the case of 29.97-Hz progressive video, tc is 1001/60000.
The display time of each of the pictures, including the BPSEI-appended picture, is delayed from tr(i) by an amount of time equal to tc*DpbOutputDelay(i). DpbOutputDelay(i) is the DpbOutputDelay appended to the i-th picture. In other words, each picture is decoded and displayed at a time that is offset from tr(0) by an integral multiple of tc.
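The two timing relations above can be checked with a short sketch. It assumes that tr(0) is already known and that both delay parameters are counted in units of tc, as described. The function name picture_times is hypothetical.

```python
from fractions import Fraction


def picture_times(tr0, tc, cpb_removal_delay, dpb_output_delay):
    """Return (decode time, display time) of one picture in seconds.

    tr0: decode time of the preceding BPSEI-appended picture
    tc:  interval in which both delay parameters are counted
    """
    decode_time = tr0 + tc * cpb_removal_delay       # tr(i)
    display_time = decode_time + tc * dpb_output_delay
    return decode_time, display_time


# 29.97-Hz progressive video: tc = 1001/60000, so frame delays are even.
tc = Fraction(1001, 60000)
decode, display = picture_times(Fraction(1, 2), tc, 2, 4)
print(decode, display)
```

Exact fractions are used instead of floating-point values so that the times stay exact multiples of tc.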
Depending on the purpose of video data, the encoded video may be edited. Editing the encoded video involves dividing the encoded video data into smaller portions and splicing them to generate a new encoded video data stream. For example, insertion of another video stream (for example, an advertisement) into the currently broadcast video stream (i.e., splicing) is one example of editing.
When video encoded by inter-frame predictive coding is edited, an inter-coded picture in particular cannot be decoded correctly by itself. Accordingly, when splicing two encoded video data streams at a desired picture position, an encoded video data editing machine first decodes the two encoded video data streams to be spliced, then splices them on a decoded picture-by-picture basis, and thereafter re-encodes the spliced video data.
However, since re-encoding can be very laborious, particularly in real-time processing such as splicing, it is common to restrict the splicing point and edit the encoded video data directly, thereby eliminating the need for re-encoding. When splicing two encoded video data streams by editing without the need for re-encoding, the first picture of the encoded video data stream to be spliced on the temporally downstream side has to be an I picture. Furthermore, the GOP structure of the encoded video data stream to be spliced on the temporally downstream side is limited to the so-called closed GOP structure in which all the pictures that follow the starting I picture are decodable without referring to any pictures temporally preceding the starting I picture. With this arrangement, it is possible to correctly decode all the pictures that follow the starting I picture of the encoded video data stream spliced on the downstream side by editing at the desired splicing point.
However, since the coding efficiency of the closed GOP structure is lower than that of the non-closed GOP structure, the non-closed GOP structure may be employed. In that case, some of the pictures immediately following the starting I picture after the splicing point are not correctly decoded, but since these pictures precede the starting I picture in display order, there will be no problem if they are not displayed. Therefore, as a general practice, after displaying the last picture of the temporally preceding encoded video data stream, the video decoder performs processing such as freezing the display, thereby masking the display of the pictures that failed to be decoded correctly.
In the prior art, even when the inter-frame predictive coded video data is edited without re-encoding, the header information is also edited so that a discrepancy does not occur between the two encoded video data streams spliced together. For example, in MPEG-4 AVC/H.264, POC (Picture Order Count) and FrameNum are appended to the slice header in order to maintain the inter-picture temporal relationship and identify the reference picture. POC indicates the relative display order of the picture. FrameNum is a value that increments by 1 each time a reference picture appears in the encoded video. Since POC values and FrameNum values need to be continuous between the two spliced encoded video data streams, there arises a need to edit all of the POC values and FrameNum values in the encoded video data stream to be spliced on the downstream side of the temporally preceding encoded video data stream.
On the other hand, in the method disclosed in non-patent document JCTVC-I1003, “High-Efficiency Video Coding (HEVC) text specification Working Draft 7,” Joint Collaborative Team on Video Coding of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, May 2012, FrameNum is abolished because a new method for identifying reference pictures has been introduced. Furthermore, since the POC value of the first picture of the encoded video data stream spliced on the downstream side need not have continuity with respect to the encoded video data stream spliced on the upstream side, there is no need to edit the slice header. In the method disclosed in the above non-patent document, a CRA (Clean Random Access) picture, a BLA (Broken Link Access) picture, and a TFD (Tagged For Discard) picture have been introduced as new picture types in addition to the IDR (Instantaneous Decoding Refresh) picture defined in MPEG-4 AVC/H.264.
Of these pictures, the CRA picture and the BLA picture are both self-contained pictures, i.e., pictures that do not refer to any other pictures, so that pictures that follow the CRA picture or the BLA picture can be decoded correctly. When the video decoder starts decoding starting with a CRA picture, for example, any subsequent pictures other than the TFD picture that immediately follows the CRA picture can be decoded correctly.
The TFD picture is a picture that appears immediately following the CRA picture or the BLA picture, and that refers to a picture appearing earlier than the CRA picture or the BLA picture in time order and in decoding order. In the case of the non-closed GOP structure that conforms to MPEG-2, the plurality of B pictures immediately following the I picture at the head of the GOP each correspond to the TFD picture.
The BLA picture occurs as a result of editing of the encoded video data. Of the two spliced encoded video data streams, the encoded video data stream spliced on the downstream side generally begins with a CRA picture, but if this CRA picture appears partway through the spliced encoded video data, its picture type is changed from the CRA picture to the BLA picture. In the method disclosed in the above non-patent document, when the BLA picture appears, the POC values are permitted to become discontinuous. Further, the TFD picture that immediately follows this BLA picture is unable to be decoded correctly from any point in the spliced encoded video data because the picture to be referred to by it is lost from the spliced encoded video data. Therefore, the video encoder may delete from the encoded video data any TFD picture that follows the BLA picture at the head of the encoded video data stream to be spliced on the downstream side.
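The splicing behavior described above can be sketched as follows. This is a simplified model, not the procedure of the cited working draft: each picture is reduced to a type label and a POC value, the Picture class and the splice function are hypothetical, and the optional deletion of TFD pictures is always performed.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Picture:
    kind: str  # "IDR", "CRA", "BLA", "TFD", or "OTHER" (simplified labels)
    poc: int   # picture order count, left unedited by the splice


def splice(upstream, downstream):
    """Append downstream to upstream without re-encoding (simplified model)."""
    if not downstream:
        return list(upstream)
    head, rest = downstream[0], list(downstream[1:])
    if upstream and head.kind == "CRA":
        # The CRA picture now appears partway through the stream.
        head = replace(head, kind="BLA")
        # The TFD pictures immediately following it refer to lost pictures.
        n = 0
        while n < len(rest) and rest[n].kind == "TFD":
            n += 1
        rest = rest[n:]
    return list(upstream) + [head] + rest


up = [Picture("IDR", 0), Picture("OTHER", 1)]
down = [Picture("CRA", 8), Picture("TFD", 6), Picture("TFD", 7), Picture("OTHER", 9)]
print([(p.kind, p.poc) for p in splice(up, down)])
```

Note that the POC values of the downstream pictures are left as they are, reflecting that POC discontinuity is permitted at a BLA picture.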