The speed of consumer broadband internet access varies widely. In September 2008, a leading DSL provider in the United States offered consumers four DSL options ranging from maximum download speeds of 768 Kb/s to 6 Mb/s. During the same period a leading cable provider offered cable modem service with maximum download speeds ranging from 768 Kb/s to 10 Mb/s. In both these cases, the quoted download speeds are maximum rates and are not guaranteed. Furthermore, download speeds are generally not guaranteed to be sustainable for any duration of time.
The delivery of quality video assets over a data communication network, such as the Internet, is hindered both by the wide variation in consumer broadband internet access speeds and by the fact that, for any given consumer, the download rate is not guaranteed to be sustained at a consistent or known level. These limitations have forced producers of on-line video content to produce a given video asset at a number of data rates (also referred to as bit rates or encoding rates) that can be offered as alternatives to consumers. When consumers opt to watch on-line video content, they are given the choice to select among versions of the content having different bit rates. A consumer may then choose to watch the content at the highest bit rate that is less than their known maximum data rate. For example, during 2008, a major sports broadcaster produced, for each game, live content at bit rates of approximately 1.21 Mb/s, 800 Kb/s and 400 Kb/s.
Typically, the higher the encoded bit rate, the higher the video quality. The overall quality of consumers' viewing experiences has been hindered because consumers typically have to choose from among a small set of data rates, and because, among these rates, they must choose one that happens to be less than their expected sustainable broadband internet download speed. If the consumer's download speed is not sustained at a speed that is at least equal to the video bit rate, then the viewing experience will occasionally be interrupted by pauses as more video is fetched from the source. These pauses, often referred to as re-buffering, also degrade the quality of the viewing experience. Since it is unlikely that end users will actually experience their maximum achievable download speed, they are forced to choose a bit rate much lower than their maximum download speed unless they are willing to suffer periodic stream re-buffering. Having to choose a lower video bit rate means that a consumer's actual download capacity may not be fully utilized, and therefore the quality of video service may not be maximized.
Adaptive Streaming is a technique that attempts to optimize a consumer's actual bit rate from moment to moment. The technique involves encoding a given video asset at a range of video bit rates. During the consumer's playback, the delivery system dynamically switches between the various rates depending on the actual download speeds the consumer experiences while watching the content. In this scenario, the consumer does not have to initially choose a lower quality video experience. The consumer simply chooses to watch a given video asset, and the best quality video stream that is achievable based on their momentary download speed is dynamically delivered to them. If their download speed goes down, the video stream that is being delivered to them is switched to a lower bit rate stream. If the consumer's download speed goes up, a higher bit rate stream is delivered.
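The rate-switching behavior described above can be illustrated with a minimal sketch. It is not drawn from any particular delivery system: the function name select_bit_rate, the fallback to the lowest rate, and the sample download speeds are assumptions made for illustration. The three rates are the approximate 2008 figures mentioned earlier, expressed in Kb/s.

```python
from typing import Sequence


def select_bit_rate(available_kbps: Sequence[int], measured_kbps: float) -> int:
    """Return the highest encoded rate that does not exceed the measured speed.

    Assumption: when even the lowest rate exceeds the measurement, the lowest
    rate is delivered anyway (the viewer may then experience re-buffering).
    """
    candidates = [rate for rate in available_kbps if rate <= measured_kbps]
    return max(candidates) if candidates else min(available_kbps)


# Encoded rates in Kb/s (approximately 1.21 Mb/s, 800 Kb/s and 400 Kb/s).
ladder = [400, 800, 1210]

# Hypothetical momentary download speeds observed during playback, in Kb/s.
samples = [1500, 900, 350, 1300]

# The delivered stream follows the download speed down and back up.
chosen = [select_bit_rate(ladder, speed) for speed in samples]
print(chosen)
```

A real delivery system would likely smooth the speed measurements and take buffer occupancy into account before switching; the sketch only shows the basic selection rule.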
A digital video signal, also known as a video stream, includes a sequence of video frames. Each frame has a timestamp associated with it describing the time when the frame is to be displayed relative to other frames in the stream. When two streams of the same video signal having different bit rates are provided, as in Adaptive Streaming, switching between streams should be seamless, such that frames continue to be displayed in the proper order and are displayed at the time specified in the timestamp. In order to cleanly switch to a new stream, a frame-accurate relationship should exist between the current stream and the new stream. That is, proper display of the video signal requires knowledge of the next frame in the new stream. Thus, if a delivery system is currently displaying frame N of a stream, the delivery system needs to know where frame N+1 exists in the new stream to be switched to. Having a frame-accurate relationship between video streams means that there is a frame-to-frame correspondence between frames in multiple different video streams that are generated from the same input source but that may have different encoding parameters, such as bit rate, picture size, etc.
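As a rough illustration of why a frame-accurate relationship makes switching simple, the following sketch models two streams of the same source as mappings from frame number to frame. The Frame type, the build_stream helper, and the 40 ms frame duration (25 frames per second) are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Frame:
    number: int         # frame number within the stream
    timestamp_ms: int   # display time relative to the other frames
    bit_rate_kbps: int  # bit rate of the stream this frame belongs to


def build_stream(bit_rate_kbps: int, frame_count: int,
                 frame_duration_ms: int = 40) -> Dict[int, Frame]:
    """Build a stream whose first frame is frame 1 with timestamp 0."""
    return {
        n: Frame(n, (n - 1) * frame_duration_ms, bit_rate_kbps)
        for n in range(1, frame_count + 1)
    }


def next_frame_after_switch(current: Frame,
                            new_stream: Dict[int, Frame]) -> Optional[Frame]:
    """Locate frame N+1 in the new stream, given frame N of the current one.

    This lookup is only valid when the streams are frame-accurate, that is,
    when frame N carries the same timestamp in every stream.
    """
    return new_stream.get(current.number + 1)


low = build_stream(bit_rate_kbps=400, frame_count=5)
high = build_stream(bit_rate_kbps=1210, frame_count=5)

current = low[3]                              # displaying frame 3 of the low stream
nxt = next_frame_after_switch(current, high)  # frame 4 of the high stream
print(nxt)
```

Because frame 4 has the same timestamp in both streams, the switch preserves both display order and display time.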
The task of having a frame-accurate relationship is simple when the source video asset being encoded is a file based asset, meaning that all frames already exist on a storage medium, such as a hard disk. A file asset has a fixed set of frames and timestamps associated with those frames. The asset can be encoded many times, perhaps even on different machines, and, in each output file, a given frame N will have the same timestamp in the encoded output.
For example, referring to FIG. 1, a source video file asset 10 including M frames is encoded by a first encoding system 12A and a second encoding system 12B. The first encoding system 12A encodes the source video 10 into a first encoded video asset 20A including M frames and the second encoding system 12B encodes the source video 10 into a second encoded video asset 20B, also including M frames. The M frames of the first encoded video asset 20A correspond to the M frames of the second encoded video asset 20B on a frame-by-frame basis with identical timestamps.
A live asset, such as a live video feed, does not have a fixed set of frames and timestamps associated with those frames. However, when the capture of live video starts, it is typical for the first frame captured to be considered frame 1 having a timestamp of 0. Thereafter frame numbering and timestamps increment just as if the asset was from a file. For example, FIG. 2 illustrates capture of video from a live video source. Referring to FIG. 2, capture started at frame A+1 of the source live stream. The first captured frame in the captured video file is typically referred to as frame 1 and has a timestamp of 0.
The task of having a frame-accurate relationship is therefore straightforward when the source video asset being encoded is live and where the frames feeding multiple encoders are sourced from a single capture system. The overall encoding architecture can be a single system including multiple encoders (as illustrated in FIG. 3) or multiple encoding systems (as illustrated in FIG. 4), but in each case there remains a single capture source for the video frames. In the systems illustrated in both FIG. 3 and FIG. 4, a captured video stream is encoded at different bit rates using first and second encoders (Encode 1 and Encode 2). In the system of FIG. 3, the two encoders are implemented in a single capture and encoding system, while in FIG. 4, the two encoders are implemented as separate encoding systems that receive captured video frames from a single common capture system.
The quality and/or scalability of an adaptive streaming model may be directly related to the number of encoding rates that can be produced for a given asset. For example, producing just three encoding rates such as 200 Kb/s, 800 Kb/s and 1.4 Mb/s (e.g., 600 Kb/s spacing between encoding rates) is not as scalable as having five rates at 200 Kb/s, 500 Kb/s, 800 Kb/s, 1.1 Mb/s and 1.4 Mb/s (300 Kb/s spacing), which in turn is not as scalable as having nine rates at 200 Kb/s, 350 Kb/s, 500 Kb/s, 650 Kb/s, 800 Kb/s, 950 Kb/s, 1.1 Mb/s, 1.25 Mb/s and 1.4 Mb/s (150 Kb/s spacing). More bit rates are better from a playback standpoint, because the visual transitions between streams may be less noticeable.
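The relationship between the number of encoding rates and the spacing between them can be checked with a short calculation. The helper rate_ladder below is a hypothetical name, and even spacing between the lowest and highest rate is assumed; rates are in Kb/s.

```python
from typing import List


def rate_ladder(low_kbps: int, high_kbps: int, count: int) -> List[int]:
    """Return `count` evenly spaced encoding rates from low to high, in Kb/s."""
    if count < 2:
        raise ValueError("at least two rates are needed to define a spacing")
    step = (high_kbps - low_kbps) / (count - 1)
    return [round(low_kbps + i * step) for i in range(count)]


# Three, five and nine rates between 200 Kb/s and 1.4 Mb/s.
for count in (3, 5, 9):
    ladder = rate_ladder(200, 1400, count)
    spacing = ladder[1] - ladder[0]
    print(count, "rates,", spacing, "Kb/s spacing:", ladder)
```

Doubling the number of intervals halves the spacing, from 600 Kb/s to 300 Kb/s to 150 Kb/s, which is why each additional set of encoders makes transitions between adjacent streams smaller.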
The number of output streams in the single live encoding system illustrated in FIG. 3 is limited by the overall processing capabilities of the encoding system (processor, memory, I/O, etc). This system architecture also does not handle failure well. If a fault occurs at the single live capture system, all of the output streams may be lost.
At first glance, the system depicted in FIG. 4 offers apparently unlimited scalability and more robust failure handling. Any number of encoders can be added to the architecture, and if a single encoding system fails, only a single adaptive stream is lost, although if the single capture system fails, then, as with the architecture of FIG. 3, all adaptive streams may be lost. However, since there is a single capture system providing frames and associated timestamps, this architecture does allow for restart of a failed system. The restarted system can start encoding again and start providing streams that are frame-accurate relative to streams generated by the other encoders.
However, in practice, the architecture shown in FIG. 4 may be impractical, as it relies on a single capture system feeding uncompressed captured video to multiple encoding systems. Uncompressed video is very large (HD uncompressed video in 4:2:2, 8-bit format requires nearly 1 Gb/s for transmission), and the network requirements to deliver uncompressed video feeds to a scalable number of encoding machines are not practical.
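The figure of nearly 1 Gb/s can be reproduced with simple arithmetic. The text does not state a resolution or frame rate, so the calculation below assumes 1920x1080 pixels at 30 frames per second and counts only active picture samples; the function name and the count of nine encoders are likewise illustrative.

```python
def uncompressed_bit_rate(width: int, height: int,
                          bits_per_pixel: int, frames_per_second: int) -> int:
    """Return the bit rate of uncompressed video in bits per second."""
    return width * height * bits_per_pixel * frames_per_second


# 4:2:2 sampling at 8 bits: each pixel has its own 8-bit luma sample, and each
# horizontal pair of pixels shares one 8-bit Cb and one 8-bit Cr sample,
# giving an average of 8 + 4 + 4 = 16 bits per pixel.
BITS_PER_PIXEL_422_8BIT = 8 + 4 + 4

# Assumed format: 1920x1080 at 30 frames per second.
rate = uncompressed_bit_rate(1920, 1080, BITS_PER_PIXEL_422_8BIT, 30)
print(f"one feed: {rate / 1e9:.3f} Gb/s")

# Feeding, for example, nine separate encoding systems multiplies the load.
encoders = 9
print(f"{encoders} feeds: {encoders * rate / 1e9:.2f} Gb/s")
```

Under these assumptions a single feed needs 995,328,000 bits per second, and the required network capacity grows linearly with the number of encoding systems.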
A modified system architecture is illustrated in FIG. 5. This architecture uses common live encoding systems (Live Encoding System 1 and Live Encoding System 2) that are not fed uncompressed video from a single capture system. If a single capture and encoding system fails, then a subset of adaptive streams may be lost. There is also no single point of failure in the capture and encoding components that impacts all adaptive streams. However, this architecture still has a variety of limitations. For example, in order to have a frame-accurate relationship among encoded outputs, each live encoding system must start encoding on exactly the same frame. By doing so, frame N of one output will be the same frame N in another output. If the encoding systems do not start synchronously with each other, this requirement will not be met.
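The need for a synchronous start can be seen in a small sketch. Each encoding system numbers its first captured frame as frame 1, so the source frame behind a given local frame number depends on where capture began. The function name and the source frame numbers are hypothetical.

```python
def local_to_source(local_frame: int, start_source_frame: int) -> int:
    """Map an encoder's local frame number to the source frame it came from.

    Local frame 1 is whichever source frame the encoder captured first.
    """
    return start_source_frame + local_frame - 1


# Hypothetical start points: the second system began three frames late.
system_1_start = 1001
system_2_start = 1004

n = 10
frame_in_1 = local_to_source(n, system_1_start)
frame_in_2 = local_to_source(n, system_2_start)
print(frame_in_1, frame_in_2, frame_in_2 - frame_in_1)

# When both systems start on the same source frame, frame N is the same
# source frame in both outputs and the frame-accurate relationship holds.
synchronized = local_to_source(n, 1001) == local_to_source(n, 1001)
```

In the unsynchronized case, frame 10 of one output and frame 10 of the other are three source frames apart, so switching streams by frame number would skip or repeat content.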
Starting capture at a specific time code can solve the problem of synchronous start across multiple encoders, because all encoders start on exactly the same frame. However, such a method precludes the possibility of a system restarting after failure.