The amount of video data sent over the internet, broadcast networks and mobile networks is increasing every year. This trend is driven by the increased usage of over-the-top (OTT) services like Netflix, Hulu and YouTube, as well as by an increased demand for high-quality video and for more flexible ways of watching TV and other video services.
To keep up with the increasing bitrate demand for video, good video compression is essential. Recently, JCT-VC, in collaboration with MPEG, developed version 1 of the high efficiency video coding (HEVC) video codec, which cuts the bitrate roughly in half at the same quality compared to its predecessor AVC/H.264.
HEVC, also referred to as H.265, is a block-based video codec that utilizes both temporal and spatial prediction. Spatial prediction is achieved using intra (I) prediction from within the current picture. A picture consisting of only intra coded blocks is referred to as an I-picture. Temporal prediction is achieved at block level using either inter prediction (P), also referred to as uni-predictive prediction, or bi-directional inter prediction (B), also referred to as bi-predictive prediction. In uni-predictive inter prediction, the prediction is made from a single previously decoded picture. In bi-directional inter prediction, the prediction is made from a combination of two predictions, which may reference either the same previously decoded picture or two different previously decoded pictures. These reference pictures are decoded before the current picture and may come before or after the current picture in display time (output order). A picture containing at least one inter coded block but no bi-directionally coded inter blocks is referred to as a P-picture. A picture containing at least one bi-directional inter block is referred to as a B-picture. Both P-pictures and B-pictures may also contain intra coded blocks. For a typical block, intra coding is generally much more expensive in bit cost than uni-predictive inter coding, which in turn is generally more expensive than bi-predictive coding.
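The picture-type definitions above can be expressed as a small classification function. This is a minimal sketch: `BlockMode` and `classify_picture` are hypothetical names, and a picture is modeled simply as the collection of its block prediction modes. In the HEVC syntax itself, the type is signaled per slice rather than derived this way.

```python
from enum import Enum
from typing import Iterable


class BlockMode(Enum):
    """Prediction mode of one block (hypothetical labels, not HEVC syntax)."""
    INTRA = "intra"  # spatial prediction from within the current picture
    UNI = "uni"      # inter prediction from one reference (P)
    BI = "bi"        # bi-predictive inter prediction (B)


def classify_picture(modes: Iterable[BlockMode]) -> str:
    """Return 'I', 'P' or 'B' using the picture-type definitions in the text."""
    present = set(modes)
    if not present:
        raise ValueError("a picture must contain at least one block")
    if BlockMode.BI in present:
        return "B"  # at least one bi-predictive block
    if BlockMode.UNI in present:
        return "P"  # inter blocks present, but none bi-predictive
    return "I"      # only intra-coded blocks


print(classify_picture([BlockMode.INTRA, BlockMode.UNI]))  # P
```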
An instantaneous decoding refresh (IDR) picture is an I-picture for which no following picture may reference a picture that precedes the IDR picture. A clean random access (CRA) picture is an I-picture that allows a random access skipped leading (RASL) picture, which follows the CRA picture in decoding order and precedes it in display or output order, to reference a picture that precedes the CRA picture in decoding order. If decoding starts at the CRA picture, the RASL pictures must be dropped, since they are allowed to predict from pictures preceding the CRA picture, and those pictures are not available when the CRA picture is used for random access. Broken link access (BLA) pictures are I-pictures that are used to indicate splicing points in the bitstream. Bitstream splicing can be performed by changing the picture type of a CRA picture in a first bitstream to a BLA picture and concatenating the first bitstream, starting at that picture, at a proper position in a second bitstream.
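The RASL rule can be sketched as a filter over a list of pictures in decoding order. This is a simplified illustration, not the HEVC decoding process: the `Picture` class, the type labels `"CRA"`, `"RASL"` and `"TRAIL"`, and the function `decodable_from` are hypothetical, and only CRA pictures are treated as random access points. RASL pictures belonging to the CRA picture where decoding starts are dropped, while RASL pictures of later CRA pictures are kept, because their references have been decoded by then.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Picture:
    kind: str  # hypothetical simplified type: "CRA", "RASL" or "TRAIL"
    poc: int   # position in output (display) order


def decodable_from(pictures: List[Picture], start: int) -> List[Picture]:
    """Pictures that can be decoded when decoding starts at pictures[start].

    `pictures` is in decoding order and pictures[start] must be a CRA.
    RASL pictures associated with that CRA are skipped, since their
    references may precede the CRA in decoding order and are missing.
    """
    if pictures[start].kind != "CRA":
        raise ValueError("this sketch only handles random access at a CRA")
    kept = [pictures[start]]
    skipping = True  # still inside the leading pictures of the start CRA
    for pic in pictures[start + 1:]:
        if pic.kind == "CRA":
            skipping = False  # RASL pictures of later CRAs have their references
        if skipping and pic.kind == "RASL":
            continue
        kept.append(pic)
    return kept


stream = [Picture("CRA", 8), Picture("RASL", 6), Picture("RASL", 7),
          Picture("TRAIL", 9), Picture("CRA", 12), Picture("RASL", 10),
          Picture("RASL", 11), Picture("TRAIL", 13)]
print([p.poc for p in decodable_from(stream, 0)])  # [8, 9, 12, 10, 11, 13]
```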
An intra random access point (IRAP) picture may be an IDR, CRA or BLA picture. All IRAP pictures guarantee that pictures following the IRAP picture in both decoding and output order do not reference any picture that precedes the IRAP picture in decoding order. The first picture of a bitstream must be an IRAP picture, but there may be many other IRAP pictures throughout the bitstream. IRAP pictures make it possible to tune in to a video bitstream, for example when starting to watch TV or when switching from one TV channel to another. IRAP pictures can also be used for seeking in a video clip, for example when the play position is moved using the control bar of a video player. Moreover, an IRAP picture refreshes the video in case there are errors or losses in the video bitstream.
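Seeking can be illustrated by locating the last IRAP picture at or before the requested position. A minimal sketch, assuming a sequence without picture reordering so that picture indices in decoding and output order coincide; `seek_start` is a hypothetical helper, not a player API. Pictures between the returned IRAP picture and the target would still have to be decoded before the target can be displayed.

```python
import bisect
from typing import List


def seek_start(irap_positions: List[int], target: int) -> int:
    """Picture index at which to start decoding to reach `target`.

    Assumes decoding order equals output order (no leading pictures) and
    that `irap_positions` is sorted; irap_positions[0] is 0 because the
    first picture of a bitstream is an IRAP picture.
    """
    i = bisect.bisect_right(irap_positions, target) - 1
    if i < 0:
        raise ValueError("target precedes the first IRAP picture")
    return irap_positions[i]  # last IRAP at or before target


iraps = [0, 24, 48, 72]  # e.g. 24 fps with one IRAP picture per second
print(seek_start(iraps, 50))  # 48
```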
Video sequences are typically compressed using a fixed maximum picture distance between IRAP pictures. More frequent IRAP pictures make channel switching faster and increase the granularity of seeking in a video clip. This is balanced against the bit cost of IRAP pictures. Common IRAP picture intervals range from 0.5 to 1.0 seconds.
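The trade-off can be put into numbers with simple arithmetic. In this sketch (the helper names are hypothetical), the IRAP period in pictures is the frame rate times the interval, and a viewer who tunes in at a uniformly random time waits on average half an interval for the next IRAP picture, and at most one full interval; decoding and buffering delays are ignored.

```python
def irap_period_pictures(fps: float, interval_s: float) -> int:
    """Number of pictures from one IRAP picture to the next (rounded)."""
    return round(fps * interval_s)


def mean_tune_in_wait_s(interval_s: float) -> float:
    """Average wait for the next IRAP when joining at a uniformly random time."""
    return interval_s / 2


for fps in (24, 50, 60):
    for interval in (0.5, 1.0):
        print(f"{fps} fps, {interval} s: IRAP every "
              f"{irap_period_pictures(fps, interval)} pictures, "
              f"mean wait {mean_tune_in_wait_s(interval):.2f} s")
```

Halving the interval halves the expected wait, but doubles the number of IRAP pictures, and hence their share of the bitrate.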
One way of looking at the difference between IRAP and temporal predictive pictures, such as P and B pictures, is that the IRAP picture is like an independent still picture, while a temporal predictive picture is a dependent delta picture relative to previous pictures.
FIG. 1 shows an example video sequence where the first picture is an IRAP picture and the following pictures are P-pictures. The top row shows what is sent in the bitstream and the bottom row shows what the decoded pictures look like. As can be seen, the IRAP picture conveys a full picture while the P-pictures are delta pictures. Since the IRAP picture does not use temporal picture prediction, its compressed size is usually many times larger than that of a corresponding temporal predictive picture.
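The "still picture versus delta picture" view can be illustrated with a toy codec that sends the first picture in full and every later picture as its sample-wise difference from the previous one. This is only an analogy under stated assumptions: real P-pictures use motion-compensated prediction, transforms and quantization, and the non-zero count used here is a crude stand-in for bit cost. The functions `encode`, `decode` and `nonzero` are hypothetical.

```python
from typing import List

Picture = List[int]  # a toy "picture": a flat list of sample values


def encode(pictures: List[Picture]) -> List[Picture]:
    """First picture sent as-is (IRAP-like); later ones as deltas (P-like)."""
    coded = [list(pictures[0])]
    for prev, cur in zip(pictures, pictures[1:]):
        coded.append([c - p for c, p in zip(cur, prev)])
    return coded


def decode(coded: List[Picture]) -> List[Picture]:
    """Rebuild each picture by adding its delta to the previous decoded one."""
    decoded = [list(coded[0])]
    for delta in coded[1:]:
        decoded.append([d + r for d, r in zip(delta, decoded[-1])])
    return decoded


def nonzero(p: Picture) -> int:
    """Crude bit-cost proxy: number of non-zero values to transmit."""
    return sum(1 for v in p if v != 0)


frames = [[10, 20, 30, 40], [10, 21, 30, 40], [10, 21, 31, 40]]
coded = encode(frames)
print([nonzero(p) for p in coded])  # [4, 1, 1]
print(decode(coded) == frames)      # True
```

The first coded picture carries every sample, while each delta carries only what changed, which mirrors why the IRAP picture dominates the bit cost in FIG. 1.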
By looking at actual coded sequences, one can get an indication of how many more bits are spent on IRAP pictures than on P-pictures. Consider the common test condition bitstreams for the HEVC codec provided by the JCT-VC standardization group. Tables 1 and 2 report, for two sets of sequences, an estimate of the bitrate savings achievable by converting every IRAP picture except the first one into a P-picture. As an example, Table 1 shows that encoding the Kimono test sequence with only the first picture as an IRAP picture results in a 10.5% lower bitrate (at QP22) than encoding the same sequence with IRAP pictures once per second.
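A rough idea of the IRAP-to-P cost ratio behind such numbers can be obtained from a toy bit-budget model. Assume, as a simplification, that each IRAP period holds n pictures, that the IRAP picture costs k times as much as a P-picture, and that a converted picture costs the same as any other P-picture. The fractional saving is then s = (k - 1) / (k + n - 1), which inverts to k = (1 + s(n - 1)) / (1 - s). With the Kimono figure quoted above (s = 0.105 at 24 fps with one IRAP picture per second, so n = 24), the model suggests an IRAP picture costing roughly 3.8 times a P-picture. The actual encodings may use prediction structures in which pictures differ in cost, so this is an indication only; the function names are hypothetical.

```python
def saving_from_ratio(k: float, n: int) -> float:
    """Toy model: fractional saving when the IRAP in each n-picture period
    (cost k) is replaced by a P-picture (cost 1); other pictures cost 1."""
    return (k - 1) / (k + n - 1)


def ratio_from_saving(s: float, n: int) -> float:
    """Invert saving_from_ratio: IRAP-to-P cost ratio implied by saving s."""
    return (1 + s * (n - 1)) / (1 - s)


# Kimono, QP22 in Table 1: 10.5% saving, 24 fps, one IRAP picture per second
k = ratio_from_saving(0.105, 24)
print(f"implied IRAP/P cost ratio: {k:.2f}")  # implied IRAP/P cost ratio: 3.82
```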
TABLE 1
HEVC HM11.0, 8-bit YUV 4:2:0

Sequence         Format        Fps  QP22      QP27      QP32      QP37
Kimono           1920 × 1080   24   −10.50%   −11.40%   −12.10%   −12.10%
Nebuta           2560 × 1600   60   −0.60%    −1.00%    −2.80%    −8.90%
ParkScene        1920 × 1080   24   −13.70%   −20.40%   −25.80%   −29.30%
PartyScene       832 × 480     50   −6.60%    −10.30%   −14.80%   −19.60%
PeopleOnStreet   2560 × 1600   30   −2.50%    −3.80%    −4.30%    −4.40%
RaceHorses       416 × 240     30   −4.00%    −5.80%    −6.70%    −7.70%
RaceHorses       832 × 480     30   −2.50%    −4.30%    −6.50%    −8.40%
SlideEditing     1280 × 720    30   −56.50%   −57.70%   −57.60%   −59.90%
SlideShow        1280 × 720    20   −14.80%   −17.20%   −20.50%   −20.30%
SteamLocomotive  2560 × 1600   60   −2.60%    −5.00%    −7.80%    −10.40%
Traffic          2560 × 1600   30   −12.80%   −21.90%   −28.90%   −33.90%
Average                             −11.55%   −14.44%   −17.07%   −19.54%
TABLE 2
SCC HM14.0, 8-bit YUV 4:4:4

Sequence             Format        Fps  QP22      QP27      QP32      QP37
Basketball_Screen    2560 × 1440   60   −26.30%   −34.00%   −40.10%   −44.80%
EBURainFruits        1920 × 1080   50   −8.90%    −12.30%   −14.90%   −17.10%
Kimono               1920 × 1080   24   −3.80%    −4.20%    −4.40%    −5.90%
MissionControlClip2  2560 × 1440   60   −5.70%    −7.10%    −8.70%    −9.30%
MissionControlClip3  1920 × 1080   60   −7.20%    −8.70%    −11.50%   −17.10%
sc_console           1920 × 1080   60   −4.10%    −4.40%    −5.10%    −5.50%
sc_desktop           1920 × 1080   60   −32.70%   −31.40%   −29.80%   −28.10%
sc_flyingGraphics    1920 × 1080   60   −0.60%    −0.80%    −1.40%    −2.10%
sc_map               1280 × 720    60   −10.10%   −10.70%   −10.30%   −13.00%
sc_programming       1280 × 720    60   −3.60%    −5.20%    −8.40%    −13.00%
sc_robot             1280 × 720    30   −13.40%   −21.20%   −27.20%   −31.30%
sc_slideshow         1280 × 720    20   −16.10%   −18.10%   −20.10%   −19.10%
sc_web_browsing      1280 × 720    30   −14.20%   −17.00%   −20.40%   −19.70%
Average                                 −11.28%   −13.47%   −15.56%   −17.38%

HM: HEVC test model; FPS: frames per second; SCC: screen content coding; YUV: luma component (Y) and chroma components (U, V); QP: quantization parameter.
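The Average rows appear to be unweighted arithmetic means over the sequences in each table; recomputing them for Table 1 this way reproduces the printed values. The sketch below uses the per-sequence values copied from Table 1; the `TABLE1` name is only for illustration.

```python
from statistics import mean

# Bitrate changes (%) from Table 1 (HEVC HM11.0), one list per QP column,
# sequences in the order listed in the table.
TABLE1 = {
    "QP22": [-10.5, -0.6, -13.7, -6.6, -2.5, -4.0, -2.5, -56.5, -14.8, -2.6, -12.8],
    "QP27": [-11.4, -1.0, -20.4, -10.3, -3.8, -5.8, -4.3, -57.7, -17.2, -5.0, -21.9],
    "QP32": [-12.1, -2.8, -25.8, -14.8, -4.3, -6.7, -6.5, -57.6, -20.5, -7.8, -28.9],
    "QP37": [-12.1, -8.9, -29.3, -19.6, -4.4, -7.7, -8.4, -59.9, -20.3, -10.4, -33.9],
}

for qp, values in TABLE1.items():
    # Unweighted mean over all sequences in the column
    print(f"{qp}: {mean(values):.2f}%")
# QP22: -11.55%, QP27: -14.44%, QP32: -17.07%, QP37: -19.54%
```

Because the mean is unweighted, a single sequence with an unusually large saving, such as SlideEditing, noticeably raises the average.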
There is, thus, a need for efficient video encoding and decoding, and in particular for video encoding and decoding that achieve a balance between the number of random access points and the bit cost of those random access points.