Video transmission systems typically include a video encoder, a transmission medium (e.g., the Internet, LANs, and/or telephone lines), and a video decoder. Such systems are generally used to transfer voice, video, and/or other data between remote parties. Video transmission may include live streaming, which allows remote parties to transmit and receive video in real time, and video teleconferencing (also referred to as video conferencing), which allows two or more remote parties to participate in a discussion.
Data transmitted in a video transmission system may be formatted as data packets rather than bit streams for transmission over a network. Each packet may contain a frame of the video. When compressing the video data into frames, inter-frame or intra-frame compression can be used. In inter-frame compression, each frame is coded with reference to surrounding frames, so those frames are needed in order to reconstruct its image. In intra-frame compression, each frame contains all of the information needed to produce its image, independent of the frames that precede or follow it in time.
Packet loss or delay over fixed and/or mobile packet networks can degrade the received video quality. This reduced video quality is exemplified by the artifact of frame freezing and the consequent temporal jerkiness observed by the receiving party. In applications with a low delay requirement, such as live streaming or video conferencing, any frame that is not completely received by its display deadline is considered lost, and the receiver may need to choose an error concealment method to recover the frame. One error concealment method displays the most recent correctly received frame in place of the lost frame. In such cases, however, the frames subsequent to the lost frame, if predictively coded from the previous frame, will have decoding errors even if they are correctly received. In order to avoid this error propagation problem, all frames after a lost frame must also be replaced by the last correctly received frame until the next intra-frame is received. This artifact is referred to as “frame freeze due to packet loss.” In applications allowing more elastic delay, such as streaming of pre-coded video, when a frame arrives past its display deadline, the receiver continues to display the previous frame until the new frame actually arrives. This artifact is referred to as “frame freeze due to packet delay.” Both artifacts manifest as temporal jerkiness in the received video.
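The freeze-until-next-intra-frame behavior described above can be illustrated with a minimal sketch. The function name, the "I"/"P" frame-type labels, and the list-based inputs are assumptions made for illustration only and are not part of any particular decoder; the sketch covers only frame freeze due to packet loss, not packet delay.

```python
def displayed_frames_after_loss(frame_types, received):
    """Return, for each display slot, the index of the frame actually shown.

    frame_types: list of "I" (intra-frame) or "P" (predictively coded frame).
    received:    list of booleans, True if the frame arrived by its deadline.
    Both inputs are hypothetical representations chosen for this sketch.
    """
    shown = []
    last_good = None   # index of the last frame that was correctly displayed
    frozen = False     # True while frames must be replaced to avoid error propagation
    for i, (ftype, ok) in enumerate(zip(frame_types, received)):
        if not ok:
            frozen = True          # a lost frame starts (or continues) a freeze
        elif ftype == "I":
            frozen = False         # a received intra-frame ends the freeze
        if not frozen:
            last_good = i
        shown.append(last_good)    # None if nothing has been displayed yet
    return shown


frame_types = ["I", "P", "P", "P", "I", "P"]
received = [True, True, False, True, True, True]
print(displayed_frames_after_loss(frame_types, received))
```

In this example, frame 2 is lost, so frame 1 is held through slot 3 even though frame 3 was received, and normal display resumes at the intra-frame in slot 4.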
Video quality metrics may be used to evaluate the impact of frame freezing due to either packet loss or packet delay. There are several methods and systems for measuring the impact of frame freeze on the perceived quality of video. These methods and systems fall into two categories: reference video quality metrics and no-reference video quality metrics (NR metrics). Reference video quality metrics provide a quality assessment based on a comparison of the transmitted or degraded video with the original pristine reference video at the receiver. NR metrics evaluate the quality of the video based solely on the transmitted or degraded video. NR metrics are important for quality assessment in real applications, as the pristine video is often not available at the receiving device.
Previous NR metrics have been based on the duration of each freeze event and the number of freeze events. However, these measures do not depend on the video content, which is undesirable because the same frame freeze pattern can have a different impact on perceived quality for videos with different characteristics.
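As a rough illustration of why such measures are content independent, the following sketch derives the number and duration of freeze events from nothing more than the sequence of displayed frame indices. The function name and the index-list input are assumptions made for illustration; no pixel data enters the computation, so any two videos with the same freeze pattern yield identical results.

```python
from itertools import groupby


def freeze_event_durations(shown):
    """Return the duration, in frames, of each freeze event.

    shown: list giving, per display slot, the index of the frame displayed
           (a hypothetical representation). A run of repeated indices of
           length n is treated as one freeze event lasting n - 1 slots.
    """
    durations = []
    for _, run in groupby(shown):
        n = len(list(run))
        if n > 1:
            durations.append(n - 1)
    return durations


shown = [0, 1, 1, 1, 4, 5]
durations = freeze_event_durations(shown)
print(len(durations), durations)  # number of freeze events, then each duration
```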
Another version of this NR metric utilizes a more advanced form of frame freeze detection that uses the squared value of the 1-step frame differences and adds an extra encoding pass for the received video. This version uses different thresholds according to the frame types of neighboring frames after that additional encoding. By using a non-zero and dynamic threshold, the system becomes more robust because there are fewer false-positive freeze frame detections. While providing more accurate frame freeze detection, this method is too complex for use in a real-time system. Finally, the NR metric standardized by ITU-T, which relies on packet header information, estimates the frame freezing quality degradation by calculating the ratio of the number of damaged video frames to the total number of video frames, as well as the packet loss event frequency. This ITU-T metric likewise does not consider the video content characteristics, nor does it differentiate between random individual frame drops and consecutive frame losses.
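A simplified sketch of freeze detection from squared 1-step frame differences is given below. It is not the method described above: the extra encoding pass and the frame-type-dependent dynamic thresholds are omitted, and a single fixed threshold (a hypothetical parameter) is used instead. Frames are assumed to be equally sized two-dimensional lists of luma values, and the function names are illustrative.

```python
def mean_squared_difference(frame_a, frame_b):
    """Mean of squared pixel differences between two equally sized frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count


def detect_frozen_frames(frames, threshold=0.0):
    """Flag each frame whose difference from the previous frame is <= threshold.

    The first frame has no predecessor and is never flagged.
    """
    if not frames:
        return []
    flags = [False]
    for prev, cur in zip(frames, frames[1:]):
        flags.append(mean_squared_difference(prev, cur) <= threshold)
    return flags


frames = [
    [[10, 10], [10, 10]],
    [[10, 10], [10, 10]],   # exact repeat of the previous frame
    [[10, 11], [10, 10]],   # differs by one small step in one pixel
    [[50, 60], [70, 80]],   # clearly new content
]
print(detect_frozen_frames(frames, threshold=0.0))
print(detect_frozen_frames(frames, threshold=0.5))
```

With a zero threshold only the exact repeat is flagged, while the non-zero threshold also flags the frame that differs by a single small pixel change. How the threshold should be chosen is left open in this sketch.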
The present system and method utilize a more robust method of extracting video features and mapping these features onto a pre-trained neural network in order to provide a video metric. The present system and method operate directly on the video content and explicitly consider the differences in the video content for more accurate video quality metrics. Further, they provide more consistent results than the prior art by using a pre-trained neural network to provide the final video quality assessment. The present system and method provide an NR metric with low complexity that can be utilized under real-time processing constraints.