The consumption of video content distributed over traditional IP networks has been increasing for some time, due in part to the availability of VOD (Video On Demand) services and the multiplication of devices on which such video content may be viewed. For example, video content may be accessed from various types of devices (such as smart phones, tablets, PCs, TVs, Set Top Boxes, and game consoles) that are connected through various types of networks (for example, broadcast, satellite, cellular, ADSL, and fiber).
Video content can be expressed or represented in many different ways. For example, the digital representation of video content is dependent on many factors, one of which is resolution. The resolutions available today include for example 480p (720×480 pixels), 576p (720×576 pixels), 720p (1280×720 pixels), 1080i (1920×1080 pixels split in two interlaced fields of 540 lines), 1080p (1920×1080 pixels), 2160p (3840×2160 pixels) and 4320p (7680×4320 pixels). The resolutions 720p, 1080i and 1080p are generally referred to as “HD” (High Definition) or “HDTV” (High Definition Television), the resolution 1080p being more specifically referred to as “Full HD” (Full High Definition). Resolutions 2160p and 4320p are generally called “UHD” (Ultra High Definition) or “UHDTV” (Ultra High Definition Television), 2160p being more specifically called “4K UHD” (4 kilo Ultra High Definition) and 4320p being more specifically called “8K UHD” (8 kilo Ultra High Definition).
Due to the large size of raw digital video, digital video content is generally accessed while represented in a compressed form. The digital representation of video content is therefore also associated with a video compression standard. The most widely used video standards belong to the “MPEG” (Moving Picture Experts Group) family, which notably comprises the MPEG-2, AVC (Advanced Video Coding, also called H.264) and HEVC (High Efficiency Video Coding, also called H.265) standards. Generally speaking, a more recent format offers more encoding features and/or provides a better compression ratio. For this reason, more recent formats are considered more advanced: HEVC is more recent and more advanced than AVC, and therefore offers more encoding features and greater compression efficiency; AVC stands in the same relation to MPEG-2. Compression standards of the MPEG family are block-based compression standards; other block-based compression standards exist, such as the VP8, VP9 and VP10 Google formats.
Even within the same video compression standard, video content can be encoded differently. For example, the same digital content may be encoded at different bitrates. As another example, the same digital content may also be encoded using only I-Frames (I-Frame standing for Intra-Frame), I and P-Frames (P standing for Predicted Frame) or I, P and B-Frames (B standing for Bi-directional Frame). More generally, the number of available encoding options increases with the complexity of the video format.
The diversity of devices which may be used to access and play video content yields a multitude of optimal video formats for which support is desirable, as each type of device may have a preferred video format driven by its screen size. Each type of device may also have its own resolution and its own decoding features. For example, a set-top box is now able to decode UHD resolutions. Meanwhile, an average smart phone such as the Nokia™ Lumia 520 has a resolution of 480×800 pixels, while an up-market smart phone such as the iPhone™ 6 Plus has a 1080p resolution. The vast diversity of video display devices therefore results in a large diversity of video display resolutions.
The diversity of video decoding devices also leads to a diversity of decodable video formats. Recent and advanced decoding devices are already able to decode HEVC video, a standard smart phone may be able to decode AVC video only, and a large number of set-top boxes are still only able to decode MPEG-2 video.
Depending on the available communication link, a video decoding device will receive video content at different bitrates. For example, a set-top box or a PC using Wi-Fi or a wired connection may receive digital video of a high quality and, therefore, a high bitrate. In contrast, a mobile device using a 3G connection may only be able to receive video of a low quality at a low bitrate.
The high diversity of video decoding devices, video compression standards, resolutions, and available bitrates therefore leads to a large number of possible combinations of video types and formats in which digital video may need to be represented to serve a wide range of customers across heterogeneous networks.
The purpose of OTT (Over-The-Top) video is to deliver video content to any user of an IP network. In OTT delivery, a video is available in a number of different representations. A representation is a digital expression of video content according to certain characteristics such as, for example, video resolution, bitrate, compression format, encoding options, and packaging. A client receives information about the available representations; such information may be made available for example via a manifest file. The video is split into a succession of small files (called segments) and encoded at different bitrates and resolutions corresponding to the different representations listed in the manifest. The client then downloads and plays the small files in the representation of interest. In case of variations in the available bandwidth, a video client can download video segments encoded at various bitrates to match the variations of available bandwidth. A representation also defines a packaging for video content, such as a file format and extension. Note that devices are typically not able to parse all file formats.
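The per-segment choice of representation described above can be illustrated with a minimal Python sketch. The manifest entries, the `Representation` type, and the functions `pick_representation` and `plan_downloads` are hypothetical names introduced only for illustration; the selection rule (highest bitrate that fits the measured bandwidth) is one simple assumption among many possible client strategies and is not prescribed by any particular OTT system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Representation:
    """One entry of a hypothetical manifest."""
    name: str
    width: int
    height: int
    bitrate_kbps: int


# Hypothetical manifest contents; a real manifest carries more fields
# (compression format, packaging, segment locations, and so on).
MANIFEST = [
    Representation("low", 640, 360, 800),
    Representation("mid", 1280, 720, 2500),
    Representation("high", 1920, 1080, 6000),
]


def pick_representation(manifest, bandwidth_kbps):
    """Return the highest-bitrate representation that fits the bandwidth.

    Falls back to the lowest-bitrate representation when none fits.
    """
    fitting = [r for r in manifest if r.bitrate_kbps <= bandwidth_kbps]
    if fitting:
        return max(fitting, key=lambda r: r.bitrate_kbps)
    return min(manifest, key=lambda r: r.bitrate_kbps)


def plan_downloads(manifest, bandwidth_per_segment_kbps):
    """Choose one representation name per segment from measured bandwidth."""
    return [
        pick_representation(manifest, bw).name
        for bw in bandwidth_per_segment_kbps
    ]


if __name__ == "__main__":
    # Assumed bandwidth measurements (kbit/s), one per segment.
    print(plan_downloads(MANIFEST, [7000, 3000, 500, 2500]))
```

With the assumed bandwidth measurements, the client would request the first segment in the "high" representation, fall back to "mid" and then "low" as the bandwidth drops, and return to "mid" when it recovers.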
Video content is typically delivered over an IP network using a Content Delivery Network (CDN). A CDN is generally composed of a large number of servers. A Head-End video transcoder produces the different representations and sends them to an Origin Server, which is the entry point on the CDN. Information about the contents stored at the Origin Server is made available to clients or playback devices. When a large number of clients are present in a region, a CDN usually comprises a server called an Edge Server to service clients in that region. The role of the Edge Server is to receive client requests from that region and send those client requests to the Origin Server. When multiple client requests for the same content are received by an Edge Server, the Edge Server is able to send only one request for the requested content to the Origin Server and dispatch the requested content locally to each requesting client using a caching mechanism. This approach has the advantage of saving bandwidth within the CDN between the Origin Server and the Edge Server(s).
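The caching behaviour of an Edge Server described above can be sketched as follows. The `OriginServer` and `EdgeServer` classes and the segment key are hypothetical and deliberately simplified (single-threaded, no eviction, no expiry); the sketch only shows how repeated requests for the same content result in a single request to the Origin Server.

```python
class OriginServer:
    """Hypothetical origin that counts how many requests reach it."""

    def __init__(self, content):
        self.content = content
        self.requests_served = 0

    def fetch(self, key):
        self.requests_served += 1
        return self.content[key]


class EdgeServer:
    """Hypothetical edge that forwards a miss once and serves repeats locally."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}

    def get(self, key):
        if key not in self.cache:
            # Cache miss: a single request goes upstream to the origin.
            self.cache[key] = self.origin.fetch(key)
        # Cache hit (or freshly filled entry): served without upstream traffic.
        return self.cache[key]


if __name__ == "__main__":
    origin = OriginServer({"segment_0001_720p": b"video bytes"})
    edge = EdgeServer(origin)
    for _ in range(100):  # 100 clients in the region request the same segment
        edge.get("segment_0001_720p")
    print(origin.requests_served)
```

In this sketch, one hundred client requests for the same segment reach the Origin Server only once; the remaining requests are answered from the Edge Server's local cache.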
It is possible to achieve OTT delivery by encoding and packaging all possible representations of the video on the head-end transcoder, then sending, upon the requests of clients, video data all across the CDN. Undesirably, however, doing so would tend to saturate the bandwidth and the storage of the CDN. Indeed, with this approach each representation that has been requested by at least one client is sent throughout the CDN, thereby requiring a large amount of data to be sent through the CDN. Data is commonly stored on servers while it is delivered. Thus, if a large number of representations are sent over the CDN, a large number of OTT files will be stored on Origin Servers, Cache Servers, and Edge Servers during the delivery of the video.
A solution to this issue is Just-In-Time Packaging (JITP) and Just-In-Time Transcoding (JITT). JITP and JITT are disclosed by Robinson et al., HEVC Benefits and the Path to Deployment, Society of Cable Telecommunications Engineers, Proceedings of the Cable-Tec Expo '14. In the solution disclosed by Robinson, only the highest resolution of the video in a single packaging is transmitted between the Head-End and the Edge Servers, and the video content is packaged and transcoded on the fly at the Edge Servers, for example to another video format or resolution, to meet the clients' requests. This solution offers the advantage of limiting both the bandwidth consumed between the Head-End and the Edge Servers and the storage consumed on the network.
A classical JITT solution induces a very heavy computing load on an Edge Server. Video encoding is a very complex operation that consumes a large amount of computational resources. The compression of video at an acceptable quality/rate ratio requires a large number of decisions on block types, block sizes, and motion vectors, which are reached by evaluating a large number of encoding options over many iterations. All these loops, iterations, and decisions consume a lot of resources. Performing on-the-fly transcoding at an Edge Server as disclosed by Robinson becomes impossible at a reasonable cost and good quality/bitrate ratio when the number of representations in OTT delivery approaches the number required by a practical implementation.
European Patent Application No. 15305550.4, also filed by the Applicant of the instant application on Apr. 14, 2015, which is hereby incorporated by reference for all purposes as if fully set forth herein, discloses a solution to the above-mentioned problem. European Patent Application No. 15305550.4 discusses a video coding device that encodes a video in a first representation and computes symbols that define a correspondence between the decisions taken in the first representation and decisions in at least one further representation, having for example a resolution or video codec which is different from the first representation. Thus, an Edge Server, having access to the video encoded in the first representation and symbols of correspondence between decisions, is able to perform Just-In-Time Transcoding and serve end users with video content at an acceptable quality-rate ratio in a number of different representations while limiting the storage requirements for the video content.
The approach discussed by European Patent Application No. 15305550.4 is able to serve video content to users in a large number of different representations at an acceptable quality/rate ratio while limiting the storage needs on servers. However, although encoding video using an Open GOP scheme offers improved quality compared to encoding video using a Closed GOP scheme, an Open GOP scheme is difficult to use in this context when the content is available in multiple representations or resolutions.
Conventional video coding methods use three types of frames: I or Intra-predicted frames, P or Predicted frames, and B or bi-directional frames. I frames can be decoded independently, like a static image. P frames use reference frames that were displayed previously. B frames use reference frames that may have been previously displayed or may have yet to be displayed. The use of reference frames involves encoding only the difference between a block in the current frame and a combination of blocks from reference frames.
A GOP is defined as the Group of Pictures between one I-frame and the next one in encoding/decoding order. In the prior art, the ability to switch between resolutions is primarily available using an encoding scheme called “Closed GOP” at segment boundaries. Closed GOP refers to any block-based encoding scheme where the information to decode a GOP is self-contained. This means that a GOP contains one I-frame, P-frames that only reference that I-frame and P-frames within the GOP, and B-frames that only reference frames within the GOP. Thus, there is no need to obtain any reference frame from a prior GOP to decode the current GOP. When such a “Closed GOP” scheme is used at a segment boundary, it means that the first GOP of a new segment does not need any information from the previous GOP which belongs to the previous segment. By contrast, in the coding scheme called Open GOP, the first B-frames which are displayed before the I-frame in a current GOP can reference frames from prior GOPs.
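The distinction between Closed GOP and Open GOP can be made concrete with a short Python sketch. The `Frame` type, its fields, and the function `is_closed_gop` are hypothetical names chosen for illustration; frames are listed in decoding order, and each frame simply records the identifiers of the frames it references. This is a simplified model under those assumptions, not the reference structure of any specific codec.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    frame_id: int  # position in decoding order
    kind: str      # "I", "P" or "B"
    gop: int       # index of the GOP the frame belongs to
    refs: list = field(default_factory=list)  # ids of reference frames


def is_closed_gop(frames, gop):
    """Return True if no frame of `gop` references a frame outside `gop`."""
    gop_of = {f.frame_id: f.gop for f in frames}
    return all(
        gop_of[ref] == gop
        for f in frames if f.gop == gop
        for ref in f.refs
    )


# GOP 0 is self-contained. In GOP 1, the B-frame (id 4) is displayed
# before the I-frame (id 3) and references a frame of GOP 0 (id 1),
# which makes GOP 1 an Open GOP.
FRAMES = [
    Frame(0, "I", gop=0),
    Frame(1, "P", gop=0, refs=[0]),
    Frame(2, "B", gop=0, refs=[0, 1]),
    Frame(3, "I", gop=1),
    Frame(4, "B", gop=1, refs=[1, 3]),
    Frame(5, "P", gop=1, refs=[3]),
]

if __name__ == "__main__":
    print(is_closed_gop(FRAMES, 0), is_closed_gop(FRAMES, 1))
```

In this example the first GOP is reported as closed and the second as open, because decoding the second GOP requires a reference frame from the first.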
In state-of-the-art implementations of an OTT scheme based on adaptive streaming technologies, an Open GOP coding scheme can be used inside a segment if the segment contains several GOPs, but it is not possible to use an Open GOP coding scheme for the first GOP of each segment in a manner that allows switching between resolutions, or more generally representations, at segment boundaries. To illustrate, a B-frame in an Open GOP after a switch of resolutions would reference frames in the previous GOP that have a different resolution, which would lead to an inconsistency when decoding. To obtain consistent reference frames for the first B-frames in a GOP after a switch, the player would need to request the prior segment in the resolution of the new segment, which is practically not feasible if bandwidth is constrained.
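The inconsistency described above can be illustrated with a minimal Python sketch. The `Segment` type and the function `inconsistent_switches` are hypothetical; the model assumes that a segment whose first GOP is open has leading B-frames referencing the last frames of the preceding segment, and it flags every boundary at which those reference frames would have a different resolution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    index: int
    resolution: tuple     # (width, height)
    first_gop_open: bool  # True if leading B-frames reference the previous segment


def inconsistent_switches(segments):
    """Return the indices of segments whose leading B-frames would reference
    frames of a different resolution in the preceding segment."""
    problems = []
    for prev, cur in zip(segments, segments[1:]):
        if cur.first_gop_open and cur.resolution != prev.resolution:
            problems.append(cur.index)
    return problems


# Assumed download sequence: the client switches from 1080p to 720p
# at the boundary between segments 1 and 2.
OPEN_GOP_PLAYLIST = [
    Segment(0, (1920, 1080), False),
    Segment(1, (1920, 1080), True),
    Segment(2, (1280, 720), True),
    Segment(3, (1280, 720), True),
]

# Same sequence, but segment 2 starts with a Closed GOP.
CLOSED_AT_SWITCH = [
    Segment(0, (1920, 1080), False),
    Segment(1, (1920, 1080), True),
    Segment(2, (1280, 720), False),
    Segment(3, (1280, 720), True),
]

if __name__ == "__main__":
    print(inconsistent_switches(OPEN_GOP_PLAYLIST))
    print(inconsistent_switches(CLOSED_AT_SWITCH))
```

Under these assumptions, only the segment that follows the resolution switch is flagged when its first GOP is open, and no segment is flagged when that first GOP is closed.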
It is nearly impossible, by using segments whose first GOP is encoded in a Closed GOP scheme, to obtain at the same time an acceptable latency and a good quality of video. It is possible to obtain good video quality by dividing video content into very long segments that contain a large number of Open GOPs and a single Closed GOP. However, using large segments leads to poor latency due to segment length. On the other hand, using a Closed GOP encoding scheme with short segments leads to better latency but a poorer video quality experience, as quality changes between successive GOPs may be visible since there are no B-frames to link the two GOPs by using reference frames in each of them.
European Patent Application No. 15305190.9, also filed by the Applicant of the instant application on Feb. 10, 2015, which is hereby incorporated by reference for all purposes as if fully set forth herein, discusses a video decoder which is able to decode video encoded in an Open GOP scheme despite changes of resolutions between segments. This video decoder is able to receive and decode video encoded in Open GOP at different resolutions, or more generally, different representations. The combination of JITT and such a decoder enables video content to be served in OTT in a number of different representations while limiting the storage and computational needs of the servers, providing an optimal quality-bitrate ratio of video, using good video coding decisions, and supporting an Open GOP coding scheme.
However, the specifications of existing standards have not yet been adapted to such decoders. Thus, it is in many cases impossible for an OTT Edge Server receiving requests from a large number of different clients to ensure that the corresponding video decoders will properly decode video when changing resolutions or representations in an Open GOP scheme.