The proliferation of video communication applications in recent years has necessitated the development of robust video quality measures to assess the Quality of Experience (QoE), defined as the service quality perceived by the user. The assessment of video quality is critical for the efficient design, planning, and monitoring of services by content providers.
Nowadays, hybrid video quality assessment models/systems use a combination of packet information, bit stream information and the decoded reconstructed image. In general, in a hybrid video quality assessment algorithm, the features extracted or calculated from the bit stream (e.g., motion vectors, macroblock types, transform coefficients, quantization parameters, etc.) and the information extracted from the packet headers (e.g., bit rate, packet loss, delay, etc.) are combined with the features extracted from the output reconstructed images in the pixel domain. However, if the former features do not temporally correspond to the latter due to loss of temporal synchronisation, then the evaluation of quality will not be accurate. Thus, the first step in every hybrid video quality assessment algorithm is the synchronisation of the video bit stream with the decoded reconstructed images.
A block diagram of a hybrid video quality assessment system is depicted in FIG. 1. At the end-user side, a probe device captures the incoming bit stream, and then parses and analyses it in order to extract and compute some features. These features are input to the module which is responsible for the temporal synchronisation of the video bit stream with the output video sequence.
Moreover, the decoding device, e.g., the set-top-box (STB), decodes the received bit stream and generates the processed video sequence (PVS) which is displayed by the output device. The PVS is also input to the module which is responsible for the temporal synchronisation so that it can be temporally synchronised with the video bit stream.
In general, the main reason for the loss of temporal synchronisation between the bit stream and the PVS is the delay. When the video stream is transmitted over a best-effort network, such as the Internet, the arrival time of each packet is not constant and may vary significantly. The variability over time of the packet latency across a network is called jitter. To ensure a smooth playback of the sequence without jerkiness, most video systems employ a de-jitter buffer. The received bit stream is written to the input buffer based on the arrival time of each packet, and the picture data corresponding to a frame are read out of it into the decoder at predetermined time intervals corresponding to the frame period. The display timing of each picture is determined by the timestamp field recorded in the packet header. That is, the timestamp value corresponds to the delay time period which elapses from the detection of picture start code until the picture display timing.
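The de-jitter buffering described above can be illustrated with a minimal Python sketch. It is not taken from any particular decoder: the function names, the fixed `buffer_delay`, the use of one arrival time per frame, and the choice of the standard deviation of inter-arrival deviations as the jitter measure are all assumptions made for illustration. Times are in milliseconds.

```python
from statistics import pstdev


def jitter(arrival_times, frame_period):
    """Illustrative jitter measure (an assumption, not a standardised one):
    standard deviation of the deviation of each inter-arrival gap
    from the nominal frame period."""
    deviations = [
        (later - earlier) - frame_period
        for earlier, later in zip(arrival_times, arrival_times[1:])
    ]
    return pstdev(deviations)


def playout_schedule(arrival_times, frame_period, buffer_delay):
    """Return a list of (read_out_time, is_late) tuples, one per frame.

    Frame i is read out of the buffer at
        first_arrival + buffer_delay + i * frame_period,
    i.e. at fixed intervals equal to the frame period. A frame is
    flagged as late when it arrives after its read-out time.
    """
    start = arrival_times[0] + buffer_delay
    schedule = []
    for index, arrival in enumerate(arrival_times):
        read_out = start + index * frame_period
        schedule.append((read_out, arrival > read_out))
    return schedule


# Hypothetical arrival times for five frames of a 25 fps stream (40 ms period).
arrivals = [0, 45, 70, 130, 160]
without_buffer = playout_schedule(arrivals, frame_period=40, buffer_delay=0)
with_buffer = playout_schedule(arrivals, frame_period=40, buffer_delay=20)
```

In this hypothetical trace, two frames miss their read-out time when no buffering delay is applied, whereas a 20 ms buffering delay absorbs the arrival-time variation at the cost of shifting the whole display timeline. That shift is one way the PVS can become offset in time from the captured bit stream.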
In the above-described video decoding system, the display timing of each video frame is determined according to the data which is included in the video bit stream for determination of the display timing. Since the time for the display of a frame is not fixed, the PVS cannot always be matched exactly to the original bit stream.
In the literature, the problem of temporal synchronisation between a source and a distorted video sequence has been previously studied and is also referred to as video registration. In M. Barkowsky, R. Bitto, J. Bialkowski, and A. Kaup, “Comparison of matching strategies for temporal frame registration in the perceptual evaluation of video quality,” Proc. of the Second International Workshop on Video Processing and Quality Metrics for Consumer Electronics, January 2006, a comparison between block matching and phase correlation for video registration is presented and examined in terms of performance and complexity. Also, a frame-matching algorithm to account for frame removal, insertion, shuffling, and data compression was presented in Y. Y. Lee, C. S. Kim, and S. U. Lee, “Video frame-matching algorithm using dynamic programming,” Journal of Electronic Imaging, SPIE, 2009, based on the minimization of a matching cost function using dynamic programming. In J. Lu, “Fast video temporal alignment estimation,” (U.S. Pat. No. 6,751,360 B1), a fast temporal alignment estimation method for temporally aligning a distorted video with a corresponding source video for video quality measurements was presented. Each video sequence is transformed into a signature curve by calculating a data-point for each frame as a cross-correlation between two subsequent frames. The temporal misalignment of the distorted video is then determined by finding the maximum value of the normalized cross-correlation between the signature curves of the examined video sequences. Another method for identifying the spatial, temporal, and histogram correspondence between two video sequences is described in H. Cheng, “Video registration based on local prediction errors,” (U.S. Pat. No. 7,366,361 B2). The PVS is aligned to the reference video sequence by generating a mapping from a selected set of one or more original frames to the processed set so that each mapping minimizes a local prediction error. In K. Ferguson, “Systems and methods for robust video temporal registration,” (US-A-2008/0253689), frame and sub-image distillation measurements are produced from the reference and test video sequences. Then, they are linearly aligned using local Pearson's cross-correlation coefficient between frames. Additionally, in C. Souchard, “Spatial and temporal alignment of video sequences,” (US-A-2007/0097266), a motion function is defined to describe the motion of a set of pixels between the frames of the test and the reference video sequence and a transform is used to align the two images.
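The signature-curve approach summarised above can be sketched in Python as follows. This is an illustrative reading of the general idea, not the implementation of the cited patent: the function names, the zero-mean form of the normalized cross-correlation, the representation of frames as flat lists of pixel values, and the exhaustive search over a bounded shift range are all assumptions.

```python
from math import sqrt


def ncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return num / den if den else 0.0


def signature_curve(frames):
    """One data point per pair of consecutive frames.

    Each frame is assumed to be a flat list of pixel values."""
    return [ncc(prev, curr) for prev, curr in zip(frames, frames[1:])]


def estimate_offset(ref_sig, test_sig, max_shift, min_overlap=3):
    """Return (shift, score), where test_sig[i] best matches ref_sig[i + shift].

    Every shift in [-max_shift, max_shift] is scored by the NCC of the
    overlapping parts of the two signature curves; the maximum wins."""
    best_shift, best_score = 0, -2.0
    for shift in range(-max_shift, max_shift + 1):
        pairs = [
            (ref_sig[i + shift], test_sig[i])
            for i in range(len(test_sig))
            if 0 <= i + shift < len(ref_sig)
        ]
        if len(pairs) < min_overlap:
            continue  # too little overlap for a meaningful correlation
        ref_part, test_part = zip(*pairs)
        score = ncc(ref_part, test_part)
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift, best_score


# Hypothetical signature curves: the test curve is a clip of the reference
# curve starting at index 3.
ref_sig = [0.9, 0.85, 0.3, 0.7, 0.95, 0.2, 0.6, 0.88, 0.4, 0.75]
test_sig = ref_sig[3:8]
shift, score = estimate_offset(ref_sig, test_sig, max_shift=5)
```

Working on one scalar per frame rather than on full frames is what makes this kind of alignment fast. Note that it still requires the source video, i.e. it is a full-reference registration between two pixel-domain sequences.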
In J. Baina et al., “Method for controlling digital television metrology equipment,” U.S. Pat. No. 6,618,077 B1, 2003, a method for the extraction of parameters from an MPEG-2 Transport Stream is proposed to generate synchronisation signals. However, this method is only applicable when the video elementary stream is packetized in an MPEG-2 Transport Stream and cannot be applied to an arbitrary transport protocol. Contrary to that, the proposed method can be applied to any video bitstream without the need for a specific transport or application protocol. Moreover, the above method provides synchronisation signals to a video quality monitoring algorithm to indicate which pictures (video frames) of the video signal should be used for the quality prediction. In contrast to that, the proposed method identifies the part of the bitstream that corresponds to each picture under consideration from an external decoder. Finally, the above method does not exploit the bitstream information to synchronise the video bitstream with the picture from the external video decoder, whereas the proposed invention exploits the bitstream to perform the synchronisation. The exploitation of the video bitstream enables the consideration of the effects of packet losses, so that the method can be applied in case of transmission errors.
Another method for the alignment of two data signals was presented in M. Keyhl, C. Schmidmer, and R. Bitto, “Apparatus for determining data in order to temporally align two data signals,” WO 2008/034632 A1, 2008. In contrast to that, the proposed invention provides synchronisation between the picture from an external video decoder and the input video bitstream. Moreover, the above method performs the synchronisation in the pixel domain and thus requires a full decoding of the input video bitstream. In contrast, the proposed method provides two embodiments (second and third embodiment) in which the synchronisation is performed without full decoding and from the packet headers.
Yet another method for synchronising digital signals was presented in J. Baina et al., “Method for synchronising digital signals,” US 2003/0179740 A1, 2003. It is a full-reference method, i.e. the reference signal is required to perform the synchronisation. Contrary to that, the present invention proposes a no-reference method for the synchronisation between a video bitstream and the decoded pictures from an external video decoder; thus, the reference signal (video sequence) is not necessary. Moreover, the above method requires the extraction of a parameter from the bitstreams for the synchronisation and, therefore, cannot be applied in case of encrypted bitstreams. In contrast, the proposed invention describes an embodiment for the synchronisation of an encrypted bitstream with the PVS.