According to video coding standards, decodable frame types include intra coded frames (I-frames), unidirectionally predicted frames (P-frames), and bi-directionally predicted frames (B-frames). In video applications, an I-frame is used as the start of decodable data and is generally referred to as a random access point; it may support services such as random access and quick browsing. In a transmission process, errors in different frame types affect the subjective quality at the decoder differently. Because the I-frame is capable of stopping error propagation, an error in an I-frame has a great impact on the overall video decoding quality. The P-frame is usually used as a reference frame for other inter coded frames and is less important than the I-frame. The B-frame is usually not used as a reference frame, and therefore the loss of a B-frame has no obvious impact on the video decoding quality.
Thus, distinguishing different types of frames in a data stream is important in video transmission applications. For example, the frame type is an important parameter for evaluating video quality, and the accuracy of frame type determination directly affects the accuracy of the evaluation result. Differential protection may be applied to different frame types so that the video can be transmitted effectively. In addition, to save transmission resources when the bandwidth is insufficient, frames that do not greatly affect the subjective quality may be discarded.
The Internet Streaming Media Alliance (ISMA) and Moving Picture Expert Group-2 Transport Stream over Internet Protocol (MPEG-2 TS over IP) are two frequently used stream transmission technologies. Both protocol modes are designed with an indicator that can indicate the video data type in the encapsulation of a compressed video data stream. The ISMA mode encapsulates the compressed video data stream directly over the Real-time Transport Protocol (RTP), where MPEG-4 Part 2 complies with Request For Comments 3016 (RFC 3016), and H.264/Advanced Video Coding (AVC) complies with RFC 3984. Taking RFC 3984 as an example, an RTP header includes a sequence number and a timestamp, which can be used to determine frame loss and help to detect the frame type. The MPEG-2 TS over IP mode further includes two modes: transport stream over User Datagram Protocol/IP (TS over UDP/IP) and transport stream over Real-time Transport Protocol/UDP/IP (TS over RTP/UDP/IP). In video transmission, TS over RTP/UDP/IP (abbreviated to “TS over RTP” hereinafter in this application) is frequently used: the compressed video data stream is encapsulated into an elementary stream, the elementary stream is further divided into a plurality of TS packets, and finally the TS packets are encapsulated and transmitted over RTP.
The RTP is a transport protocol for multimedia data streams and is responsible for end-to-end real-time data transmission. An RTP packet mainly includes four parts: an RTP header, an RTP extension header, a payload header, and payload data. The RTP header mainly includes the following data: a sequence number, a timestamp, and an indicator (the marker bit). The sequence numbers correspond to the RTP packets on a one-to-one basis: each time a packet is sent, the sequence number increases by 1, so the sequence number may be used to detect packet loss. The timestamp indicates the sampling time of the video data; different frames have different timestamps, which indicate the play sequence of the video data. The indicator is used to indicate the end of a frame. The preceding information is an important basis for determining a frame type.
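As an illustration, the fixed 12-byte RTP header described above can be parsed as follows. This is a minimal sketch based on the standard RFC 3550 header layout; the function and field names are chosen here for illustration.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the 12-byte fixed RTP header (RFC 3550 layout)."""
    if len(packet) < 12:
        raise ValueError("packet too short for an RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": b0 & 0x0F,
        "marker": bool(b1 & 0x80),    # the end-of-frame indicator
        "payload_type": b1 & 0x7F,
        "sequence_number": seq,       # +1 per packet; gaps reveal loss
        "timestamp": ts,              # sampling time; shared by one frame
        "ssrc": ssrc,
    }
```

Packets of one frame share a timestamp, and the marker bit on the last packet of a frame is what allows frame boundaries to be recovered from the RTP layer alone.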
A TS packet includes 188 bytes and is made up of a packet header, a variable-length adaptation field, and payload data. A payload unit start indicator (PUSI) indicates whether the payload data includes a packetized elementary stream (PES) header or program specific information (PSI). With respect to the H.264 media format, each PES packet header indicates the start of a NAL unit. Some indicators in the TS packet adaptation field, such as the random access indicator and the elementary stream priority indicator, may be used to determine the importance of the transported content. For a video, if the random access indicator is 1, the subsequent first PES packet includes sequence start information; if the elementary stream priority indicator is 1, the payload of the TS packet includes a large amount of intra-coded block data.
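The header fields mentioned above (PUSI, PID, continuity counter, scrambling flag, and the adaptation-field indicators) can be extracted with straightforward bit operations. The sketch below follows the standard ISO/IEC 13818-1 TS header layout; the returned field names are illustrative.

```python
def parse_ts_header(packet: bytes) -> dict:
    """Parse a 188-byte MPEG-2 TS packet header and adaptation-field flags."""
    if len(packet) != 188 or packet[0] != 0x47:  # 0x47 is the sync byte
        raise ValueError("not a valid TS packet")
    pusi = bool(packet[1] & 0x40)                # payload unit start indicator
    pid = ((packet[1] & 0x1F) << 8) | packet[2]
    scrambling = (packet[3] >> 6) & 0x03         # non-zero => payload encrypted
    afc = (packet[3] >> 4) & 0x03                # adaptation field control
    cc = packet[3] & 0x0F                        # continuity counter
    rai = espi = False
    if afc in (2, 3) and packet[4] > 0:          # adaptation field present
        flags = packet[5]
        rai = bool(flags & 0x40)                 # random access indicator
        espi = bool(flags & 0x20)                # elementary stream priority
    return {"pusi": pusi, "pid": pid, "scrambling": scrambling,
            "cc": cc, "rai": rai, "espi": espi}
```

Note that all of these fields sit outside the (possibly scrambled) payload, which is why they remain readable even for encrypted streams.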
If it is determined by using the PUSI that the payload of the TS packet includes a PES packet header, further information useful for transmission can be obtained. A PES packet is made up of a PES packet header and the packet data that follows it; the original stream data (video and audio) is encapsulated in the PES packet data. The PES packet is inserted into transport stream packets such that the first byte of each PES packet header is the first byte of the payload of a transport stream packet. To be specific, each PES packet header must start in a new TS packet, and the payload area of that TS packet must be filled with PES packet data. If the end of the PES packet data is not aligned with the end of a TS packet, an appropriate number of padding bytes must be inserted in the adaptation field of the TS packet so that the two ends are aligned. The PES priority field indicates the importance of the payload in the PES packet data; for a video, the value 1 indicates intra data. In addition, a presentation time stamp (PTS) indicates the display time, and a decoding time stamp (DTS) indicates the decoding time. The PTS and DTS may be used to determine the correlation between earlier and later video payload content so as to determine the payload type.
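As a sketch of how the PTS mentioned above is recovered, the 33-bit value is spread over 5 bytes of the PES header, interleaved with marker bits, per the ISO/IEC 13818-1 syntax. The byte offsets below assume a standard PES header with no preceding optional fields.

```python
def parse_pes_pts(pes: bytes):
    """Extract the PTS from a PES packet header, if present (ISO/IEC 13818-1)."""
    if pes[:3] != b"\x00\x00\x01":
        raise ValueError("missing PES start code prefix")
    flags = pes[7]                 # PTS_DTS_flags occupy the top two bits
    if not (flags & 0x80):
        return None                # no PTS in this header
    p = pes[9:14]                  # 33-bit PTS spread over 5 bytes
    pts = ((p[0] >> 1) & 0x07) << 30   # bits 32..30 (skip marker bit)
    pts |= p[1] << 22                  # bits 29..22
    pts |= (p[2] >> 1) << 15           # bits 21..15 (skip marker bit)
    pts |= p[3] << 7                   # bits 14..7
    pts |= p[4] >> 1                   # bits 6..0  (skip marker bit)
    return pts                         # in units of 1/90000 s
```

A PTS difference between consecutive PES headers of one PID is what lets earlier and later payload content be correlated as described above.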
In the TS over RTP mode, to protect copyrighted video content, the payload is usually encrypted for transmission. Encrypting a TS packet means encrypting the payload part of the packet: if the scrambling flag in the TS packet header is set to 1, the payload of the packet is encrypted. In this case, the payload data type can only be determined from the amount of data, for a given PID, between adjacent PUSIs (equivalent to the size of a video frame). If the PES packet header in the TS packet is not encrypted, the PTS may also be used, in addition to the length of the video frame, to help determine the frame type.
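The frame-size measurement just described can be sketched as follows: accumulate the payload bytes of TS packets carrying one PID, starting a new count at each PUSI. This simplified sketch counts a full 184-byte payload per packet and ignores adaptation-field stuffing, so the sizes are approximate; only unencrypted header bits are read.

```python
def frame_sizes(ts_packets, video_pid):
    """Approximate per-frame byte counts for one PID, delimited by PUSI flags.

    Works on scrambled streams because only TS header bits are inspected.
    """
    sizes, current = [], None
    for pkt in ts_packets:
        pid = ((pkt[1] & 0x1F) << 8) | pkt[2]
        if pid != video_pid:
            continue
        if pkt[1] & 0x40:             # PUSI set: a new PES packet (frame) starts
            if current is not None:
                sizes.append(current)
            current = 0
        if current is not None:
            current += 184            # payload bytes per 188-byte TS packet
                                      # (adaptation-field length ignored here)
    if current is not None:
        sizes.append(current)         # flush the last frame
    return sizes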
As known from the preceding description, the amount of data in a frame varies with the frame type. The I-frame, from which only intra redundancy has been removed, generally carries more data than an inter coded frame, from which inter redundancy has also been removed; likewise, the P-frame generally carries more data than the B-frame. In view of this feature, some current frame type detection algorithms use the data amount of a frame to determine its type when the TS packets are encrypted. Two frequently used methods are described herein.
Method 1: Obtain the length of each video frame by parsing the TS packets, and infer the frame type from the length information. This method determines the frame type in the case that the payload part of a TS packet is encrypted.
The method determines the packet loss status by parsing the Continuity Counter field in the TS packet, estimates the status of lost packets by using the structure information of the previous group of pictures (GOP), and determines the type of a video frame with reference to the available information in the adaptation field of the TS packet header (i.e., the random access indicator, RAI, or the elementary stream priority indicator, ESPI).
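The packet-loss check based on the Continuity Counter can be sketched as follows: the counter is 4 bits and increments modulo 16 for each payload-carrying packet of one PID, so any gap between consecutive values estimates how many packets were lost. (Duplicate packets and the no-payload case are ignored in this simplified sketch.)

```python
def lost_packets(cc_values):
    """Estimate lost TS packets from a sequence of 4-bit continuity
    counter values for one PID (increments modulo 16 per packet)."""
    lost = 0
    for prev, cur in zip(cc_values, cc_values[1:]):
        expected = (prev + 1) % 16
        lost += (cur - expected) % 16   # gap size, modulo the 4-bit wrap
    return lost
```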
Three methods below may be used to identify an I-frame.
1. Use a RAI or an ESPI to identify an I-frame.
2. If the RAI or ESPI cannot be used to identify an I-frame, buffer the data of one GOP and treat the frame having the maximum data amount in the buffered data as the I-frame. The GOP length needs to be predefined, and once the GOP length changes, the method becomes invalid.
3. Use the maximum one of the detected I-frame periods as the determination period, and within each such period use the frame having the maximum data amount as the I-frame.
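The second and third I-frame methods above reduce to the same operation: pick the largest frame within a fixed-length window. A minimal sketch, assuming the frame sizes and the period (GOP length or maximum detected I-frame period) are already known:

```python
def detect_iframes(frame_sizes, period):
    """Within each consecutive window of `period` frames, mark the
    frame with the largest data amount as the I-frame; return indices."""
    iframe_idx = []
    for start in range(0, len(frame_sizes), period):
        window = frame_sizes[start:start + period]
        iframe_idx.append(start + window.index(max(window)))
    return iframe_idx
```

This makes the weakness noted later concrete: each window is judged in isolation, so only local features of the size sequence are used.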
For a P-frame, three methods below may be used.
1. Among the frames from a start frame to the frame immediately preceding an I-frame, select a frame having a larger data amount than all the other frames as a P-frame. With respect to the determined frame modes included in the GOP structure used to process the target stream, select consecutive frames corresponding to N determined frame modes in a determination period as determination target frames, match the data amounts of the determination target frames with the determined frame modes, and determine a P-frame based on the matching. In the GOP structure, the following mode is used as a determined frame mode: all consecutive B-frames immediately preceding a P-frame together with the B-frame next to the P-frame. In this case, some information about the GOP needs to be input beforehand.
2. Compare the frame data amount of each frame in a presentation mode with a threshold calculated from the average of the frame data amounts of multiple frames at predetermined positions in that presentation mode.
3. Based on frame data amounts, use an adjustment coefficient to adjust the threshold for distinguishing P-frames from B-frames. To obtain the adjustment coefficient, sequentially select temporary adjustment coefficients from a given range and perform the same processing as the frame type determination, so as to estimate the frame type of each frame in a given determination period. Then calculate the ratio of wrongly determined frame types from the estimation results and the actual frame types obtained from an unencrypted stream, find the temporary adjustment coefficient with the lowest ratio of wrong determination, and use it as the real adjustment coefficient.
A method for determining B-frames is: determining all frames other than I-frames and P-frames as B-frames.
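The threshold-style P/B separation described above can be sketched as follows. This is a simplified illustration, not the exact patented procedure: the adjustment coefficient of 0.75 and the use of the average of all non-I frames are assumptions made here for the sketch.

```python
def classify_pb(frame_sizes, iframe_idx, coef=0.75):
    """Label frames I/P/B given known I-frame indices: frames larger
    than coef * (average size of non-I frames) are P-frames, the rest
    are B-frames. `coef` plays the role of the adjustment coefficient."""
    iframe_idx = set(iframe_idx)
    others = [s for i, s in enumerate(frame_sizes) if i not in iframe_idx]
    thresh = coef * (sum(others) / len(others)) if others else 0.0
    types = []
    for i, size in enumerate(frame_sizes):
        if i in iframe_idx:
            types.append("I")
        elif size > thresh:
            types.append("P")
        else:
            types.append("B")
    return types
```

Training, as described in method 3, would amount to sweeping `coef` over a range and keeping the value with the fewest misclassifications against ground truth from an unencrypted stream.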
In the case of packet loss, the preceding methods for determining frame types detect the loss based on the RTP sequence number and the Continuity Counter (CC) in the TS packet header, and estimate the status of lost packets by mode matching against a GOP structure, thereby achieving correction to some extent. However, for the method using a non-adjustable threshold, GOP information needs to be input beforehand; and for the method using an adjustable threshold, coefficients need to be trained with frame type information obtained from an unencrypted stream, which requires a lot of human intervention. In addition, a whole GOP needs to be buffered before the frame types can be estimated, so the methods are not applicable to real-time applications. Moreover, the I-frame is determined only once in each determination period; if the frame with the maximum data amount in each period is simply taken as the I-frame, only local features are considered and global features are ignored.
Method 2: The method of using thresholds to distinguish different frames may include four steps.
1. Update Thresholds:
Threshold for distinguishing an I-frame (Ithresh):
scaled_max_iframe=scaled_max_iframe*0.995, where scaled_max_iframe is the size of a previous I-frame.
If nbytes>scaled_max_iframe,
then ithresh=(scaled_max_iframe/4+av_nbytes*2)/2, where av_nbytes is the moving average over the most recent 8 frames.
Threshold for distinguishing a P-frame (Pthresh):
scaled_max_pframe=scaled_max_pframe*0.995, where scaled_max_pframe is the size of a previous P-frame.
If nbytes>scaled_max_pframe, then pthresh=av_nbytes*0.75.
2. Detect an I-frame: In a video, there is an I-frame in each period of time. The data amount of the I-frame is larger than the average and larger than the data amount of the P-frame. If the data amount of the current frame is larger than Ithresh, the frame is considered as an I-frame.
3. Detect a P-frame: Use the fact that the data amount of a B-frame is smaller than the average. If the data amount of the current frame is larger than Pthresh but smaller than Ithresh, the frame is considered a P-frame.
4. Other Frames are B-Frames.
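The four steps above can be sketched end to end as follows. The update rules and constants (0.995, 0.75, the 8-frame moving average) come from the description above; the initial values of the thresholds and maxima, and the use of `>=` so that a frame exactly at the threshold counts as an I-frame, are assumptions made for this sketch.

```python
from collections import deque

def classify_frames(sizes):
    """Method 2 sketch: classify each frame size as I, P, or B using
    adaptive thresholds decayed by the fixed 0.995 reduction factor."""
    recent = deque(maxlen=8)                  # window for the moving average
    scaled_max_iframe = scaled_max_pframe = 0.0
    ithresh = pthresh = 0.0
    types = []
    for nbytes in sizes:
        recent.append(nbytes)
        av_nbytes = sum(recent) / len(recent)
        # Step 1: update thresholds.
        scaled_max_iframe *= 0.995
        if nbytes > scaled_max_iframe:
            ithresh = (scaled_max_iframe / 4 + av_nbytes * 2) / 2
        scaled_max_pframe *= 0.995
        if nbytes > scaled_max_pframe:
            pthresh = av_nbytes * 0.75
        # Steps 2-4: classify against the thresholds.
        if nbytes >= ithresh:
            types.append("I")
            scaled_max_iframe = nbytes        # remember the previous I-frame size
        elif nbytes >= pthresh:
            types.append("P")
            scaled_max_pframe = nbytes        # remember the previous P-frame size
        else:
            types.append("B")
    return types
```

Running this on a size sequence with one large frame, one medium frame, and several small ones illustrates the intended I/P/B split, as well as the sensitivity to the fixed 0.995 factor discussed next.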
The second method for determining frame types uses a reduction factor to control the thresholds, and this factor has a direct impact on I-frame determination. When a subsequent I-frame is larger than the current I-frame, it is easily determined. However, when the subsequent I-frame is far smaller than the current I-frame, it can be determined only after the threshold has been reduced over many frames. Furthermore, the reduction factor in the algorithm is fixed at 0.995, without considering sharp changes between GOPs, so the method is not applicable in many cases. If the reduction factor is small, the ratio of undetected I-frames is low, but the probability of wrongly determining P-frames as I-frames is high. If the reduction factor is large, the ratio of undetected I-frames is high (when the I-frame size changes sharply within a sequence), and I-frames may be wrongly determined as P-frames. Therefore, the detection accuracy is low. In addition, because only thresholds are used to distinguish B-frames from P-frames, in a frame structure of I/P/P/P . . . , the algorithm may wrongly determine many P-frames as B-frames, resulting in a high ratio of wrongly determined frames.