Streaming media is more popular than ever, as both consumer and enterprise users increase content consumption. It is used on social media such as YouTube, Twitter, and Facebook, and of course also by the providers of on-demand video services such as Netflix. According to some reports, Netflix and YouTube together make up half of peak Internet traffic in North America. Moreover, the number of subscription video on demand homes is forecast to reach 306 million across 200 countries by 2020.
When the transmission capacity of a network fluctuates, for instance over a wireless connection, the media player can often adapt the bitrate so that the video can still be delivered, albeit sometimes with worse quality (lower bitrate, lower resolution, etc.). An example is shown in FIG. 1A for a 60-second video, where the segment heights represent the bitrate and each segment is 5 seconds long. In almost all cases, the quality varies in a corresponding way, i.e. a higher bitrate gives a higher quality and a lower bitrate gives a lower quality.
It is therefore of vital importance for providers to estimate the users' Quality of Experience (QoE), which is fundamentally the users' subjective opinion of the quality of a service. For this purpose, subjective tests may be used, in which a panel of viewers is asked to evaluate the perceived quality of streaming media. Typically, the quality is rated on a scale from 1 (“bad”) to 5 (“excellent”), and the ratings are then averaged over all viewers to form a Mean Opinion Score (MOS). However, these subjective tests are costly in both time and money, and to circumvent this, objective QoE estimation methods (“objective quality models”) have been developed.
The Mean Opinion Score (MOS) is a measure of users' subjective opinion of the performance of a service or application. It has been widely used to evaluate the quality of multimedia applications. ITU-T Recommendation P.800 has standardized the use of MOS on a 5-point Absolute Category Rating (ACR) scale for the evaluation of audio-visual test sequences. The ACR scale ranges from 5 (Excellent) to 1 (Bad). This method is particularly relevant in scenarios where a user is presented with one test sequence at a time and then asked to rate it.
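The averaging step described above can be sketched as follows. This is a minimal illustration only: the function name `mean_opinion_score` and the panel ratings are made up for the example, and the check simply enforces the 1 to 5 ACR scale mentioned in the text.

```python
def mean_opinion_score(ratings):
    """Average a panel's ACR ratings (integers 1..5) into a single MOS."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        # The ACR scale runs from 1 ("bad") to 5 ("excellent").
        if rating not in (1, 2, 3, 4, 5):
            raise ValueError(f"rating {rating!r} is outside the 1-5 ACR scale")
    return sum(ratings) / len(ratings)


# Hypothetical ratings from a five-viewer panel for one test sequence.
panel = [4, 5, 3, 4, 4]
print(mean_opinion_score(panel))  # 4.0
```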
Different objective quality models are normally used for audio and video. The models estimate the quality degradation due to the coding itself, taking into account parameters such as bitrate (audio and video), sampling rate (audio), number of channels (audio), resolution (video), frame rate (video), GOP size (video, a parameter related to video coding), etc. The output from the audio or video quality model for a complete session (as in FIG. 1A) is typically a list of objective MOS scores, where each score represents the quality of an individual media segment (i.e. each score represents the quality during 5 seconds in FIG. 1A). Examples of audio and video coding quality models can be found in the ITU-T P.1201 recommendation.
When created, the audio and video quality models are trained on a set of subjective tests. This is accomplished in the following manner: a specific number of parameters are varied, and multimedia clips are produced using these parameters. These clips are then graded by viewers during a subjective test, and the quality models are tuned to match the results of the subjective tests as closely as possible (in some sense).
The models are typically trained on short signal segments, around 5 to 10 seconds long, where the media quality is more or less constant during the clip. This means that the models in principle only give accurate results when presented with segments of corresponding durations in which no major quality variations are present. To obtain an objective score for a multimedia clip that is much longer than this, an aggregation model is needed. Because human perception is non-linear, it is not possible simply to, e.g., average the individual segment scores.
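One limitation of plain averaging can be seen with a small sketch. The per-segment scores below are invented for illustration and do not come from any quality model. The two sessions have very different temporal quality patterns, yet their arithmetic means are identical, so an average alone cannot tell them apart.

```python
def plain_average(segment_scores):
    """Arithmetic mean of per-segment objective MOS values."""
    return sum(segment_scores) / len(segment_scores)


# Hypothetical per-segment scores for two sessions of six segments each.
steady_then_bad = [5, 5, 5, 1, 1, 1]   # quality drops once, halfway through
oscillating = [1, 5, 1, 5, 1, 5]       # quality switches on every segment

# Both means are equal, although the sessions differ in how quality evolves.
print(plain_average(steady_then_bad) == plain_average(oscillating))
```

The sketch only shows that averaging discards the ordering of the segment scores. It does not suggest what form the aggregation model should take.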
An aggregation model also combines the audio and video model quality scores into combined media scores, representing the total perception of the media. Another task for the aggregation model is to take into account degradations due to buffering. Buffering occurs when the transmission speed in the network is not high enough, so that the media player consumes more data than the network delivers. This causes “gaps” in the media play-out during which the media player fills up its data buffer, as exemplified in FIG. 1B. The aggregation model will consequently need to take both of these effects into account, i.e. both a varying intrinsic audio and video quality and degradations due to buffering, as in the more complex example shown in FIG. 1C.
The buffering can be either initial buffering (before any media is presented to the user) or rebuffering during play-out.
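The way initial buffering and rebuffering arise can be illustrated with a toy play-out simulation. This is a sketch under stated assumptions, not a description of any real media player: the function `simulate_playout`, the one-second time step, the constant media bitrate, and the `start_threshold_s` parameter (seconds of media required before play-out starts or resumes) are all hypothetical.

```python
def simulate_playout(throughput_kbps, media_kbps, start_threshold_s=2.0):
    """Return the player state for each second: 'initial', 'play' or 'rebuffering'.

    throughput_kbps: network throughput per one-second step (assumed input).
    media_kbps: constant media bitrate (simplifying assumption).
    """
    buffered_s = 0.0   # seconds of media currently held in the buffer
    playing = False
    started = False    # becomes True once play-out has begun at least once
    timeline = []
    for rate in throughput_kbps:
        # Media seconds received during this one-second step.
        buffered_s += rate / media_kbps
        if not playing and buffered_s >= start_threshold_s:
            playing = True
            started = True
        if playing and buffered_s >= 1.0:
            buffered_s -= 1.0          # one second of media is consumed
            timeline.append("play")
        else:
            playing = False            # buffer ran dry or is still filling
            timeline.append("rebuffering" if started else "initial")
    return timeline


# Hypothetical trace: the network stalls for three seconds mid-session.
trace = [1000, 1000, 1000, 0, 0, 0, 1000, 1000, 1000]
print(simulate_playout(trace, media_kbps=1000))
```

With this trace, the first second is spent on initial buffering, and the gap in throughput later empties the buffer and produces a rebuffering period before play-out resumes.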