Television content or video content can be transmitted across an IP network from a content provider to a device which is used by an end user. The device may be a personal computer, a wireless communications device, a set-top box, a television with set-top box functionality built in, a smart TV, or a smart set-top box. The television content or video content may have associated audio content, which is usually transmitted therewith. Where the transmission occurs in "real time", meaning that the content is displayed before the transmission is complete, this is referred to as streaming.
Video streaming across communications networks is becoming increasingly common. To ensure the end-to-end quality of video streamed over a communications network, the network operator and the video service provider may use video quality models. A video quality model generates an objective assessment of video quality by measuring artifacts or errors from coding and transmission that would be perceptible to a human observer. This can replace subjective quality assessment, where humans watch a video sample and rate its quality.
Video quality models have been known for some time in the academic world, but it is only recently that their use has been standardized. Perceptual video quality models are described in the International Telecommunication Union (ITU) standards J.144, J.247 and J.341. Perceptual models have the advantage that they can use pixel values in the processed video to determine a quality score. In the case of full-reference models (as in the ITU standards mentioned above), a reference signal is also used to predict the degradation of the processed video. A major disadvantage of perceptual models is that they are computationally demanding and therefore not suitable for large-scale deployment for the purposes of network monitoring.
A more light-weight approach is therefore currently being standardized in ITU-T SG12/Q14 under the working name P.NAMS. The model takes as its input network layer protocol headers and uses these to make a quality estimation of the transmitted video. This makes the model very efficient to implement and use, but on its own the quality estimation of the transmitted video is rather coarse. Therefore ITU-T SG12/Q14 will also standardize a video bit stream quality model under the working name P.NBAMS. This model uses not just the network layer protocol headers but also the encoded elementary stream, or "bit stream". Using both sets of inputs has the advantage that the model remains fairly light-weight while obtaining a better estimate of the quality of the video than the P.NAMS model.
Block based coding is the dominant video encoding technology, with codec standards such as H.263, MPEG-4 Visual, MPEG-4 AVC (H.264) and the emerging H.265 standard being developed in the ITU Joint Collaborative Team on Video Coding (JCT-VC). Block based coding uses different picture types, which employ different types of prediction, to compress the video as efficiently as possible. Intra pictures (I-pictures) may only be predicted spatially, from areas in the picture itself. Predictive pictures (P-pictures) are temporally predicted from previously coded picture(s); however, some macro-blocks in P-pictures may be intra-encoded. Bidirectional predictive pictures (B-pictures) are predicted from both previous and following pictures. An I-picture with the restriction that no picture preceding it may be used for prediction of pictures following it is called an Instantaneous Decoding Refresh (IDR) picture. I- and IDR-pictures are often much more expensive to encode, in terms of bits, than P-pictures and B-pictures.
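The picture types above can be illustrated with a short sketch. The GOP pattern, prediction descriptions and relative bit costs below are hypothetical examples for illustration only; actual costs depend on the content and the encoder.

```python
# Illustrative sketch of picture types in a block-based coding GOP.
# The prediction summaries follow the description above; the relative
# bit costs are made-up numbers reflecting only that I/IDR pictures
# are typically far more expensive than P- and B-pictures.

PREDICTION = {
    "I": "spatial (intra) prediction only, from the picture itself",
    "P": "temporal prediction from previously coded picture(s)",
    "B": "temporal prediction from both previous and following pictures",
}

# Hypothetical relative bit costs per picture type.
RELATIVE_COST = {"I": 10, "P": 3, "B": 1}

def gop_bit_cost(gop: str) -> int:
    """Sum the (hypothetical) relative bit cost of a GOP pattern
    such as 'IBBPBBPBBP'."""
    return sum(RELATIVE_COST[t] for t in gop)

# A typical pattern: one I-picture followed by alternating B- and P-pictures.
cost = gop_bit_cost("IBBPBBPBBP")  # 1*10 + 3*3 + 6*1 = 25
```

Even in this toy accounting, the single I-picture accounts for a large share of the GOP's bits, which is why the placement of I- and IDR-pictures matters for both quality and bit rate.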
To increase error resilience in error-prone communications networks, I- or IDR-pictures are inserted periodically to refresh the video. I- or IDR-pictures are also inserted periodically to allow for random access and channel switching. Moreover, I- or IDR-pictures are inserted when the cost (both in terms of induced distortion and bit allocation) of encoding a picture as a P-picture is greater than the cost of encoding it as an I- or IDR-picture. This occurs when the spatial redundancy of the picture is higher than the temporal redundancy of the picture with respect to its reference pictures. This typically happens when the picture under consideration is a scene change, also known as a scene cut, which means that the depicted scene is quite different from its previous picture. When such intra pictures should be inserted is not defined by the video coding standard (which defines only the decoding procedure); this decision is left to the encoder.
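The encoder-side decision described above can be sketched as follows. This is a simplified, hypothetical decision rule, not the behaviour of any particular encoder or standard; the function name, parameters and costs are assumptions made for illustration.

```python
def choose_picture_type(cost_as_p: float, cost_as_i: float,
                        frames_since_intra: int, intra_period: int) -> str:
    """Hypothetical sketch of an encoder's intra-insertion decision.

    An intra picture is chosen either periodically (for refresh and
    random access / channel switching) or when encoding the picture
    predictively would cost more than intra-coding it, as is typical
    at a scene change where temporal redundancy is low.
    """
    if frames_since_intra >= intra_period:
        return "I"   # periodic refresh / random-access point
    if cost_as_p > cost_as_i:
        return "I"   # e.g. scene change: spatial redundancy dominates
    return "P"       # temporal prediction is cheaper; encode as P-picture
```

In a real encoder the two costs would themselves combine induced distortion and bit allocation (for example via rate-distortion optimization), but the structure of the decision is the same.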
Television content typically contains a transition between scenes, known as a scene change, every 3-5 seconds on average. Scene changes may occur instantly between two pictures or be faded over several pictures. Because it is usually the case that no good temporal prediction can be made from one side of a scene change to the other, a smart encoder will often try to align a scene cut with an I- or IDR-picture.
WO 2009/012297 describes a method and system for estimating the content of frames in an encrypted packet video stream without decrypting the packets, by exploiting information only from the packet headers. An I-frame is denoted as the start of a new scene if the length of the prior Group of Pictures (GOP) is abnormally short and the penultimate GOP length is equal to its maximum value. However, a major shortcoming of this method is that scene changes which occur at normal GOP lengths cannot be identified. For example, if the normal GOP length is 25 frames, then a scene change which occurs at frame number 25, 50, 75, 100, etc. cannot be detected. Moreover, a shorter GOP length does not necessarily mean that the picture under consideration is a scene change, thus leading to many false positives.
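The heuristic from WO 2009/012297, as summarized above, can be sketched to make its blind spot concrete. This is an illustrative reconstruction, not the actual method of the publication; in particular, "abnormally short" is simplified here to "shorter than the maximum GOP length".

```python
def flag_scene_change_gops(gop_lengths, max_gop):
    """Sketch of the header-only heuristic: flag the I-frame that starts
    GOP k as a scene change if the prior GOP (k-1) was short and the
    penultimate GOP (k-2) had the maximum length.

    A scene change that happens to fall exactly on a normal GOP boundary
    leaves every GOP at max_gop frames, so it is never flagged; this is
    the shortcoming described above.
    """
    flagged = []
    for k in range(2, len(gop_lengths)):
        prior_short = gop_lengths[k - 1] < max_gop
        penultimate_max = gop_lengths[k - 2] == max_gop
        if prior_short and penultimate_max:
            flagged.append(k)
    return flagged

# A mid-GOP scene change shortens one GOP and is detected:
detected = flag_scene_change_gops([25, 25, 12, 25, 25], max_gop=25)
# A scene change aligned with the normal 25-frame boundary leaves all
# GOP lengths at 25 and is missed:
missed = flag_scene_change_gops([25, 25, 25, 25], max_gop=25)
```

The second call returns an empty list even though a scene change may be present, illustrating why scene changes at frames 25, 50, 75, etc. are invisible to a purely GOP-length-based detector.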