Transmission of media content (e.g., video, audio, and/or data, etc., collectively or individually referred to herein also as content) between different nodes on a network may be performed in a variety of ways. The type of content that is the subject of the transfer and the underlying network conditions usually determine the methods used for communication. For instance, for a simple file transfer over a lossy network, one emphasis is on reliable delivery. The packets may be protected against losses with added redundancy or the lost packets may be recovered by retransmissions. In the case of audio/video content delivery with real-time viewing requirements, one emphasis is on low latency and efficient transmission to enable the best possible viewing experience, where occasional losses may be tolerated.
The structure of the packets and the algorithms used for real-time content transmission on a given network may collectively define a chosen content streaming protocol. Although various content streaming protocols available today differ in implementation details, they can generally be classified into two main categories: push-based protocols and pull-based protocols. In push-based streaming protocols, once a connection is established between a server (e.g., server device or server software) and a client (e.g., client device or client software), the server remains active on the session and streams packets to the client until the session is torn down or interrupted, for example, by a client pausing or skipping within the content. In pull-based streaming protocols, the client is the active entity that requests the content from the server. Thus, the server responds only to client requests and is otherwise idle or blocked for that client. Further, the bitrate at which the client wishes to receive the content is determined entirely by the client. The actual rate of reception depends upon the client's capabilities, the load on the server, and the available network bandwidth. As the primary download protocol of the Internet, HTTP is a common communication protocol upon which pull-based content delivery is based.
In pull-based adaptive streaming, the client makes a decision about which specific representation of any given content it will request next from a server, where each representation may be received at the client in the form of a plurality of requested segments or chunks (e.g., 2-10 seconds in duration, such as a plurality of video frames of a given scene). Such a decision may be based on various parameters and/or observations, including the current (observed/available) bandwidth and the amount of data currently residing in a client buffer. Throughout the duration of a given viewing experience, the client may upshift or downshift (e.g., switch to a representation using a higher or lower bitrate) or stay at the same bitrate based on the available bandwidth and buffer conditions, among other factors. As a result of the bitrate transitions, encoded video quality as seen by the client's decoder may change considerably, most notably with scenes of high motion compared to more static scenes (e.g., in constant bitrate implementations). Even in variable bitrate encoding schemes, despite an advertised (e.g., via a manifest) long-term average encoding bitrate, each of the chunks of a given representation may vary considerably in bitrate. In other words, while the long-term average quality and bitrate behave as conjugate variables, over the temporal chunking intervals (as used in adaptive streaming), quality and bitrate may diverge significantly.
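As a minimal sketch of the kind of client-side decision described above, the following Python function picks the next representation from the observed bandwidth and the buffer level. The function name `choose_bitrate`, the 5-second low-buffer threshold, and the 0.8 safety factor are hypothetical values chosen for illustration; they are not taken from any particular streaming client.

```python
from typing import Sequence


def choose_bitrate(advertised_bps: Sequence[int],
                   observed_bandwidth_bps: float,
                   buffer_seconds: float,
                   low_buffer_seconds: float = 5.0,   # assumed threshold
                   safety_factor: float = 0.8) -> int:  # assumed headroom
    """Return the advertised bitrate the client would request next."""
    if not advertised_bps:
        raise ValueError("at least one representation is required")

    ladder = sorted(advertised_bps)

    # A nearly empty buffer risks a stall, so downshift to the lowest bitrate.
    if buffer_seconds < low_buffer_seconds:
        return ladder[0]

    # Otherwise request the highest bitrate that fits within a fraction
    # of the observed bandwidth.
    budget = observed_bandwidth_bps * safety_factor
    candidates = [bps for bps in ladder if bps <= budget]
    return candidates[-1] if candidates else ladder[0]
```

Under these assumptions, a client observing 1.5 Mbps with a healthy buffer would request the 1 Mbps representation from a 0.5/1/2 Mbps ladder, and would downshift to 0.5 Mbps once the buffer drops below the threshold. Note that the decision is driven by advertised bitrate only, which is why the quality seen by the decoder may still vary.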
Adaptive streaming (e.g., adaptive video streaming) generally structures a content stream as a multi-dimensional array of content chunks (e.g., pieces of content, where a chunk may be one or more Groups of Pictures (GoP) as known in MPEG-compatible systems, a “fragment” in MPEG-4 (MP4) systems, or another suitable sub-division of an entire instance of content, and may also be called a fragment or a segment). Each chunk represents a temporal slice of the content (e.g., 2-10 seconds in duration) that has been encoded or otherwise processed to produce differing levels of quality, different resolutions, etc., and, in particular, different sizes requiring different amounts of bandwidth to deliver to one or more client devices. Virtually all current adaptive streaming systems use a two-dimensional matrix, with one dimension consisting of time and the other dimension consisting of (target) encoding rate. In addition, current adaptive streaming systems use a variety of storage structures for the content, such as directories with individual files for each chunk, fragmented MP4 files (e.g., a standardized file format), or custom packaging schemes. The structure of the content matrix, along with associated metadata describing each chunk, is contained in a separate structure, generally referred to as a manifest. The manifests are typically divided into representations, each of which describes one row of the content matrix (e.g., all the chunks encoded at a bitrate X). There exist various schemes and emerging standards for the manifests.
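The two-dimensional content matrix and its manifest can be pictured with a small in-memory model. The sketch below is illustrative only: the class names `Chunk`, `Representation`, and `Manifest`, and the helper `build_toy_manifest`, are hypothetical and do not correspond to any standardized manifest format. The toy chunk sizes assume a constant bitrate, which real encoders need not follow.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class Chunk:
    index: int          # position along the time dimension
    duration_s: float   # temporal length of the slice
    size_bytes: int     # encoded size, which drives the bandwidth needed


@dataclass
class Representation:
    target_bps: int     # position along the encoding-rate dimension
    chunks: List[Chunk] = field(default_factory=list)


@dataclass
class Manifest:
    # One entry per row of the content matrix, keyed by target bitrate.
    representations: Dict[int, Representation] = field(default_factory=dict)

    def lookup(self, target_bps: int, index: int) -> Chunk:
        """Return the matrix cell at (target_bps, index)."""
        return self.representations[target_bps].chunks[index]


def build_toy_manifest(bitrates_bps: Sequence[int],
                       chunk_count: int,
                       chunk_duration_s: float = 2.0) -> Manifest:
    """Fill a manifest with constant-bitrate toy chunks (an assumption)."""
    manifest = Manifest()
    for bps in bitrates_bps:
        rep = Representation(target_bps=bps)
        for i in range(chunk_count):
            size = int(bps * chunk_duration_s / 8)  # bits to bytes
            rep.chunks.append(Chunk(i, chunk_duration_s, size))
        manifest.representations[bps] = rep
    return manifest
```

For instance, `build_toy_manifest([500_000, 1_000_000], 3).lookup(1_000_000, 2)` returns the third chunk of the 1 Mbps row, which is the cell a client would request after deciding on that representation for that time slot.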
Continuing the overview, during the bitrate transitions, the encoding quality may change. In particular, if an encoder adopts a constant bitrate (CBR)-based encoding scheme to exactly match the advertised (e.g., via the manifest) bitrates for each representation, the quality may vary widely over time within each representation. When the client stays at the same bitrate, the quality may vary from a high-motion or high-complexity scene to a low-motion or low-complexity scene.
In variable bitrate (VBR) encoding schemes, the bitrate is allowed to vary in the short term to keep quality close to constant. The representation bitrate in VBR systems represents a longer-term average encoding bitrate. For example, in a 1 Mbps representation, some of the chunks may be encoded at 500 kbps, whereas other chunks may be encoded at 1.5 Mbps, and these chunks might have a comparable quality level. In such VBR systems, large bitrate variations among the chunks belonging to a given representation are observed. However, schemes to address quality fluctuations in the context of representation changes (e.g., upshifts and downshifts) are not presently known.
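The relationship between per-chunk bitrates and the advertised long-term average in the 1 Mbps example can be checked with a short calculation. The sketch below assumes a duration-weighted average and equal 2-second chunks; the function name `average_bitrate_bps` is hypothetical.

```python
from typing import Sequence


def average_bitrate_bps(chunk_bitrates_bps: Sequence[float],
                        chunk_durations_s: Sequence[float]) -> float:
    """Duration-weighted average bitrate over the chunks of a representation."""
    if len(chunk_bitrates_bps) != len(chunk_durations_s):
        raise ValueError("need one duration per chunk")
    total_time = sum(chunk_durations_s)
    if total_time <= 0:
        raise ValueError("total duration must be positive")
    # Total bits delivered divided by total playback time.
    total_bits = sum(b * d for b, d in zip(chunk_bitrates_bps, chunk_durations_s))
    return total_bits / total_time


# Two equal-length chunks at 500 kbps and 1.5 Mbps (assumed 2 s each).
bitrates = [500_000, 1_500_000]
durations = [2.0, 2.0]
average = average_bitrate_bps(bitrates, durations)   # 1_000_000.0
peak_to_average = max(bitrates) / average            # 1.5
```

Here the average matches the advertised 1 Mbps even though an individual chunk demands 50% more bandwidth than the manifest suggests, which illustrates how per-chunk bitrate can diverge from the representation's nominal rate.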
Accordingly, certain embodiments of adaptive streaming systems address these and/or other issues.