1. Field of the Invention
This invention relates to technology to mitigate degradation and distortion of video transmission over wireless networks.
The demand for real-time video transmission over wireless networks is rapidly increasing. In order to efficiently utilize wireless bandwidth, video data is compressed using sophisticated encoding techniques such as H.264 AVC.
Transmission of compressed video over wireless channels is highly susceptible to channel errors, which introduce packet losses, delay, and jitter, causing error propagation and severe quality degradation. In H.264, error resiliency schemes such as flexible macroblock ordering (FMO), data partitioning, and error concealment help to minimize the effects of channel impairments. Each video frame is divided into independently coded slices, each consisting of a certain number of macroblocks. The distortion contributed by a slice loss is characterized by its spatio-temporal coding dependencies within the compressed bitstream. For example, a slice loss in a reference frame is likely to propagate to the subsequent frames that are predicted from it.
Since compressed video is vulnerable to impairments, it is useful to gauge the quality degradation caused by slice losses. A widely used measure of the distortion introduced by a slice loss is the cumulative mean squared error (CMSE), which takes into account the effect of temporal error propagation measured over the group of pictures (GOP). However, computing the CMSE for each slice loss is computationally intensive and introduces delay because it requires decoding all the frames of a GOP. The subject technology comprises a robust, low-complexity, and low-delay model to predict the CMSE contributed by the loss of each individual H.264 AVC video slice. In this scheme, a generalized linear model (GLM) is built using video factors extracted during the encoding of the current video frame, such as motion vectors, initial mean squared error, temporal duration, maximum residual energy, and signal characteristics. For a database of video sequences, the model is trained using the measured CMSE contributed by the loss of each individual slice (known as the ground truth) for all the slices of a video sequence. To keep the complexity and delay of the model low, no video factors that require decoding of future frame(s) in the GOP are used.
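By way of illustration only, the measured (ground truth) CMSE of a slice loss can be sketched as follows. The sketch assumes that the CMSE is the sum of the per-frame MSE between the error-free decoded GOP and the same GOP decoded with the slice missing; the function names and the toy 2x2 frames are hypothetical and stand in for real decoder output.

```python
def frame_mse(ref, test):
    """Mean squared error between two equally sized frames (2-D lists of pixels)."""
    total = 0
    count = 0
    for row_ref, row_test in zip(ref, test):
        for a, b in zip(row_ref, row_test):
            total += (a - b) ** 2
            count += 1
    return total / count


def slice_cmse(error_free_gop, lossy_gop):
    """Assumed definition: sum of per-frame MSE over every frame of the GOP."""
    return sum(frame_mse(r, t) for r, t in zip(error_free_gop, lossy_gop))


# Toy three-frame GOP of 2x2 frames (hypothetical pixel values).
error_free = [[[10, 10], [10, 10]] for _ in range(3)]
lossy = [
    [[10, 10], [10, 10]],  # frame before the loss: unaffected
    [[14, 10], [10, 10]],  # frame containing the lost slice: MSE = 16 / 4
    [[12, 10], [10, 10]],  # error propagated by prediction: MSE = 4 / 4
]
cmse = slice_cmse(error_free, lossy)  # 0.0 + 4.0 + 1.0 = 5.0
```

Because both inputs must be fully decoded GOPs, the sketch also shows why measuring the CMSE per slice is costly and why a prediction model is attractive.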
The goals of the subject invention are to develop a robust model for slice CMSE prediction and to demonstrate the effectiveness of slice prioritization schemes that use the slice CMSE prediction model. The performance of the proposed slice CMSE prediction model is compared with the measured slice CMSE values for test videos encoded using two different GOP structures, varying GOP lengths, and varying bit rates. The model is used to design two priority assignment schemes for video slices based on their predicted CMSE values, applied at the GOP level and the frame level, respectively. The performance of the slice prioritization schemes is also studied for two applications: (i) unequal error protection (UEP) over AWGN channels, where slices are given protection against channel errors according to their priority, and (ii) a slice discard scheme for use when the network is congested.
This model can be used for many real-time video applications including: (i) predicted CMSE values of slices can replace their measured CMSE values for determining the optimal fragment or aggregated packet sizes in cross layer packet fragmentation or aggregation schemes; (ii) predicting the slice loss distortion enables real-time slice prioritization for a differentiated-services network, packet scheduling in a transmitter, and traffic shaping for streaming applications; (iii) assigning a priority to the slices allows an intermediate router within a network to discard some low priority slices to minimize the quality degradation of transmitted video stream in an event of network congestion; (iv) it facilitates allocation of access categories in IEEE 802.11e or quality of service (QoS)-aware medium access control (MAC); and (v) the UEP scheme where more parity bits are assigned to higher priority slices to protect them against channel errors.
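As a minimal sketch of application (v), a priority-to-code-rate table can determine how many parity bits each slice receives. The four priority levels and the specific code rates below are assumptions chosen for illustration only; they are not taken from this disclosure, and the function name is hypothetical.

```python
import math
from fractions import Fraction

# Hypothetical mapping: priority 1 is the most important slice class and
# receives the lowest code rate (that is, the most parity).
RATE_BY_PRIORITY = {
    1: Fraction(8, 16),
    2: Fraction(8, 12),
    3: Fraction(8, 10),
    4: Fraction(8, 9),
}


def parity_bits(payload_bits, priority):
    """Parity bits added when payload_bits are coded at the priority's rate."""
    rate = RATE_BY_PRIORITY[priority]
    coded_bits = math.ceil(payload_bits / rate)
    return coded_bits - payload_bits


# Parity overhead for a 2400-bit slice at each assumed priority level.
overhead = {p: parity_bits(2400, p) for p in sorted(RATE_BY_PRIORITY)}
```

In this sketch a higher-priority slice always carries at least as much parity as a lower-priority slice of the same size.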
Video quality is influenced by various network dependent and application oriented factors, such as packet losses, video loss recovery techniques, and encoding configurations. A considerable effort has been made to understand the relationship between packet losses and quality degradation.
The relative importance of bandwidth, latency, and packet loss on the average consumer judgments for video conferencing has also been the subject of study.
Studies were conducted to evaluate the video quality for wireless networks. Real-time video monitoring has two basic requirements. First, the video quality model should account for various network and application parameters accurately. Second, the complexity of parameter calculation should be kept to a minimum. Fulfilling these requirements facilitates an accurate and efficient mapping of a model to the video quality. In an effort to develop a quality monitor, the authors in [4] estimated the mean squared error (MSE) using three approaches, namely, a full-parse technique that estimates several spatio-temporal parameters from the received bitstream, a quick-parse technique that extracts the start codes and the header information, and a no-parse technique that accesses spatio-temporal parameters that rely on packet loss ratio (PLR) inside the network. This was extended in [5] by using (i) a tree-structured classifier, called classification and regression trees (CART), that labeled each possible packet loss as being either visible or invisible, and (ii) a GLM to predict the probability that a packet loss will be visible to a viewer. The effect of dual packet losses occurring in close proximity was investigated in [6]. The significance of scene characteristics was explored in [7] by examining packet loss impairments in MPEG-2 and H.264 compressed videos. It was shown that packet losses in scenes with a still camera are more likely to be visible, and that packet losses shortly before or after a scene change are less visible. A versatile GLM was developed in [3] for predicting the slice loss visibility to human observers. The model was trained on a dataset of different videos using the MPEG-2 and H.264 encoding standards, two GOP structures, and other encoding configurations by considering the loss of one slice at a time. Various video parameters affecting the packet loss visibility were considered in the model.
2. Background of the Invention
The current state of knowledge is as follows.
The disclosed technology is a robust, low-complexity, and low-delay generalized linear model (GLM) for predicting cumulative mean squared error (CMSE) contributed by the loss of individual H.264 AVC encoded video slices. The model is trained over a video database by using a combination of video factors that are extracted during the encoding of a frame. The slices are prioritized within a group-of-pictures (GOP) based on their predicted CMSE values.
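The GOP-level prioritization can be sketched as follows. This is an illustrative example only: it assumes four priority levels of equal size, with priority 1 assigned to the slices whose loss is predicted to cause the largest CMSE. The number of levels, the equal-size split, and the function name are assumptions not specified above.

```python
def assign_gop_priorities(predicted_cmse, num_priorities=4):
    """Return one priority per slice of a GOP (1 = most important).

    Slices are ranked by predicted CMSE, largest first, and the ranking is
    cut into num_priorities groups of (nearly) equal size.
    """
    n = len(predicted_cmse)
    order = sorted(range(n), key=lambda i: predicted_cmse[i], reverse=True)
    priorities = [0] * n
    for rank, idx in enumerate(order):
        priorities[idx] = rank * num_priorities // n + 1
    return priorities


# Hypothetical predicted CMSE values for the eight slices of one GOP.
gop_cmse = [0.5, 12.0, 3.1, 0.1, 7.4, 1.2, 25.0, 0.9]
gop_priorities = assign_gop_priorities(gop_cmse)
```

Ranking within the GOP means that the priorities adapt to the content: a slice with a modest CMSE can still receive priority 1 in a low-motion GOP.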
This scheme is extended to the frame-level by estimating the expected number of slices contributed towards the different priorities based on the slice's CMSE, frame type and location within the GOP, and video motion activity.
The accuracy of the CMSE prediction model is analyzed using scatter plots, analysis of variance (ANOVA), and leave-one-out cross-validation to determine the normalized root mean squared deviation (NRMSD). The priority misclassification of the slices is computed and shows that second-degree and third-degree misclassifications are minimal. The schemes are validated by applying unequal error protection (UEP) using rate-compatible punctured convolutional (RCPC) codes to the prioritized slices and evaluating their performance over noisy AWGN channels.
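The two accuracy measures named above can be sketched as follows. Two assumptions are made for illustration: the NRMSD is normalized by the range of the measured values (other normalizations exist), and the degree of a misclassification is taken to be the absolute difference between the priority derived from the measured CMSE and the priority derived from the predicted CMSE. The function names are hypothetical.

```python
import math


def nrmsd(measured, predicted):
    """Root mean squared deviation divided by the range of the measured values."""
    squared = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    rmsd = math.sqrt(sum(squared) / len(squared))
    return rmsd / (max(measured) - min(measured))


def misclassification_histogram(measured_prio, predicted_prio):
    """Count slices by misclassification degree (0 means correctly classified)."""
    hist = {}
    for t, p in zip(measured_prio, predicted_prio):
        degree = abs(t - p)
        hist[degree] = hist.get(degree, 0) + 1
    return hist


# Hypothetical values for four slices.
score = nrmsd([0.0, 2.0, 4.0, 10.0], [1.0, 2.0, 3.0, 10.0])
hist = misclassification_histogram([1, 2, 3, 4], [1, 3, 3, 2])
```

Under these assumptions, a small count at degrees 2 and 3 means that slices are rarely assigned a priority far from the one their measured CMSE would give.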
An application of the slice prioritization scheme is demonstrated by implementing a slice discard scheme, in which slices are dropped at the router when the network experiences congestion. The simulation results show that the slice CMSE prediction model is robust to varying GOP structures, GOP lengths, and encoding bit rates. The PSNR and VQM performance obtained when slices are prioritized using the predicted CMSE is similar to that obtained using the measured CMSE values, for different videos and channel SNRs.
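A minimal sketch of a priority-based slice discard policy follows. It assumes that the router knows each slice's priority and size and that, under congestion, it drops the lowest-priority slices first until the remaining slices fit an available capacity. The tuple layout, the capacity parameter, and the function name are hypothetical.

```python
def discard_slices(slices, capacity_bytes):
    """Drop lowest-priority slices until the total size fits capacity_bytes.

    slices: list of (slice_id, priority, size_bytes), priority 1 = most important.
    Returns (kept_ids, dropped_ids); kept_ids preserves the original order.
    """
    total = sum(size for _, _, size in slices)
    # Least important slices (largest priority number) are considered first.
    candidates = sorted(slices, key=lambda s: s[1], reverse=True)
    dropped = []
    for slice_id, _, size in candidates:
        if total <= capacity_bytes:
            break
        dropped.append(slice_id)
        total -= size
    dropped_set = set(dropped)
    kept = [slice_id for slice_id, _, _ in slices if slice_id not in dropped_set]
    return kept, dropped


# Hypothetical queue of five slices totalling 1600 bytes, capacity 1100 bytes.
queue = [("s0", 1, 400), ("s1", 4, 300), ("s2", 2, 350), ("s3", 4, 250), ("s4", 3, 300)]
kept, dropped = discard_slices(queue, capacity_bytes=1100)
```

Here the two priority-4 slices are discarded and every higher-priority slice is forwarded.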
The scheme of the subject invention is motivated by the subjective video packet loss visibility model proposed in [3]. However, the disclosed approach differs in the following important ways.
First, instead of predicting the slice loss visibility, the model disclosed herein predicts the CMSE distortion introduced by a slice loss. Note that the CMSE is a widely used measure of distortion.
Second, the model disclosed herein assumes that the response variable (i.e., CMSE) is a continuous random variable from an exponential family of distributions, and the Gaussian distribution is selected. In [3], by contrast, the response variable (i.e., packet loss visibility) was Boolean, so the binomial family was chosen as the preferred distribution for that model, which was implemented using logistic regression.
Third, the random forest [8] is used in the subject model to determine the importance of the video factors, which aids in selecting additional factors based on the interactions among the most important factors.
Fourth, unlike [3], which uses the factors from future frames (e.g., DisToRef and OtherSceneConceal), the scheme of the subject model does not use any such factors.
Due to the preceding points of novelty, the model disclosed herein can be useful in frame-based slice priority assignment for applications (e.g., video conferencing) that cannot tolerate a delay of even a few video frames. Moreover, because the CMSE is measured objectively rather than judged by viewers, the model can be retrained automatically, without involving human observers, if the CMSE prediction accuracy drops below a threshold. Such retraining may be required when the video characteristics, encoder configuration, network congestion, packet loss patterns, and/or application change significantly.
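By way of illustration, a Gaussian-family GLM of the kind described in the second point above can be sketched as follows. The sketch assumes an identity link, in which case maximum-likelihood fitting reduces to ordinary least squares; the link function actually used is not stated above. The two factors and all numeric values are hypothetical, and the solver is a plain Gauss-Jordan elimination written out to keep the example self-contained.

```python
def fit_gaussian_glm(factors, response):
    """Least-squares fit of response ~ b0 + b1*x1 + ... (Gaussian GLM, identity link)."""
    rows = [[1.0] + list(r) for r in factors]  # prepend the intercept column
    k = len(rows[0])
    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, response))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def predict_cmse(coeffs, x):
    """Predicted CMSE for one slice from its factor vector x."""
    return coeffs[0] + sum(b * v for b, v in zip(coeffs[1:], x))


# Hypothetical training set: two factors per slice and the measured CMSE.
train_x = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
train_y = [4.0, 5.5, 9.0, 10.5, 13.5]  # generated as 1 + 2*x1 + 0.5*x2
coeffs = fit_gaussian_glm(train_x, train_y)
estimate = predict_cmse(coeffs, [2, 2])
```

Once the coefficients are fitted offline, predicting the CMSE of a new slice costs one multiply-add per factor, which is consistent with the low-complexity, low-delay aim of the model. Automatic retraining would amount to calling the fitting routine again on freshly measured CMSE values.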