With the advent of digital video streaming technology (such as video-on-demand (VOD) systems), users are able to see and hear digital videos more or less as the data is being received from a video server.
When video is streamed, the incoming video stream is typically buffered on the user's receiving device (e.g., computer or set-top box) while data is downloaded into it. At some defined point (generally, when the buffer is full), the video contents are presented to the user. As the video content plays, the receiving device empties the data stored in the buffer. However, while the receiving device is playing the stored video, more data is being downloaded to re-fill the buffer. As long as the data can be downloaded at least as fast as it is being played back, the file will play smoothly.
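The buffering behavior described above can be illustrated with a minimal simulation. This is only a sketch under simplifying assumptions: time advances in discrete ticks, data is measured in arbitrary units, and presentation begins when the buffer is first full. The function name count_stalls and all of its parameters are hypothetical and are not part of any streaming standard.

```python
def count_stalls(download_rate, playback_rate, buffer_capacity, total_data):
    """Count the ticks on which playback could not proceed (buffer underrun)."""
    if download_rate <= 0 or playback_rate <= 0:
        raise ValueError("rates must be positive")
    if buffer_capacity < playback_rate:
        raise ValueError("buffer must hold at least one tick of playback")

    buffered = 0      # data currently held in the receive buffer
    downloaded = 0    # data fetched from the server so far
    played = 0        # data already presented to the user
    playing = False
    stalls = 0

    while played < total_data:
        # Download as much as the link and the free buffer space allow.
        chunk = min(download_rate,
                    total_data - downloaded,
                    buffer_capacity - buffered)
        buffered += chunk
        downloaded += chunk

        # Begin presenting once the buffer is full (or everything has arrived).
        if not playing and (buffered == buffer_capacity
                            or downloaded == total_data):
            playing = True

        if playing:
            needed = min(playback_rate, total_data - played)
            if buffered >= needed:
                buffered -= needed
                played += needed
            else:
                stalls += 1  # not enough data buffered for this tick

    return stalls
```

In this sketch, a download rate equal to the playback rate yields no stalls, while a slower download rate eventually drains the buffer and playback pauses.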
MPEG
The predominant digital video compression and transmission formats are from a family called MPEG (Moving Picture Experts Group). MPEG is the name of a family of standards used for coding audio-visual information (e.g., movies, video, and music) in a digital compressed format.
For the convenience of explanation of video streaming, the MPEG-family video stream is generally discussed and described herein. However, those who are skilled in the art understand and appreciate that other such digital video compression and transmission formats exist and may be used.
Of course, there are other digital video compression and transmission formats, such as the H.264 codec. Those of ordinary skill in the art will understand how the concepts discussed herein with relationship to MPEG apply to other formats.
GOP and Frames
An MPEG video stream is typically defined by a series of segments called Groups of Pictures (GOP). Typically, a GOP consists of a set of pictures intended to be displayed in sequence over a short duration (e.g., ½ second) when displayed at their intended speed.
A GOP typically includes three types of frames:
    an intra frame (I-frame);
    predictive frames (P-frames); and
    bi-directionally predictive frames (B-frames).
There is no specific limit to the number of frames which may be in a GOP, nor is there a requirement for an equal number of pictures in all GOPs in a video sequence.
The I-frame is an encoded still image. It is not dependent upon any other frame that the decoder has already received. Each GOP typically has only one I-frame. It is sometimes called a random access point (or “RAP”) since it is an entry point for accessing its associated GOP.
From the point of view of a video-stream decoder, the P-frames are predicted from the most recently reconstructed I- or P-frame. A P-frame (such as frame 120p) requires data from a previously decompressed anchor frame (e.g., an I-frame or a P-frame) to enable its decompression.
Switching to the point of view of a video-stream encoder and transmitter, the B-frames are predicted from the closest two I- or P-frames: one frame in the past and one frame in the future. A B-frame (such as frame 132p) requires data from both preceding and succeeding anchor frames (e.g., I-frames or P-frames) to decode its image. It is bi-directionally dependent.
Of course, other digital video compression and transmission formats (such as the H.264 codec) may employ other labels, some different frame types, and different relationships between frames. For example, in H.264, the frame types, frame dependence relationships, and frame ordering are much more decoupled than they are in MPEG. In H.264, the I-frames are independently decodable and are random access points. Also, frames have a defined presentation order (as they do in MPEG). However, the other frames relate to one another differently than do the MPEG P-frames and B-frames.
So, those of ordinary skill in the art will understand how the concepts discussed herein with relationship to MPEG apply to other formats.
Transmission and Presentation Timelines
FIG. 1 illustrates two manifestations of the same MPEG video stream. The first is the transmission timeline 100t and the other is the presentation timeline 100p. 
The transmission timeline 100t illustrates a video stream from the perspective of its transmission by a video-stream encoder and transmitter. Alternatively, it may be viewed from the perspective of the receiver of the transmission of the video stream.
As shown in FIG. 1, the I-frames (e.g., 110t and 150t) are typically temporally longer than the other frames in the transmission timeline. Since an I-frame doesn't utilize data from any other frame, it contains all of the data necessary to produce one complete image for presentation. Consequently, an I-frame includes more data than any of the other frames, and it typically requires more time for transmission (and, of course, reception) than the other frame types.
FIG. 1 also shows P-frames (such as 120t) and B-frames (such as 130t and 132t) of the transmission timeline 100t. Relative to the B-frames, the P-frames are temporally longer in the transmission timeline because they typically include more data than the B-frames. However, P-frames are temporally shorter than I-frames because they include less data than I-frames. Since the B-frames rely on data from at least two other frames, they typically do not need as much data of their own to decode their image as do P-frames (which rely on one other frame).
FIG. 1 also illustrates the presentation timeline 100p of the video stream from the perspective of its presentation by the video decoder and presenter. In contrast to their transmission duration, the presentation duration of each frame—regardless of type—is exactly the same. In other words, it displays at a fixed frequency.
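The contrast between per-frame transmission time and per-frame presentation time can be illustrated with a brief sketch. The frame sizes below are hypothetical values chosen only to preserve the ordering described above (I larger than P, P larger than B); actual sizes vary with content and encoder settings. The sketch also assumes a constant-bit-rate channel, and the function names are illustrative.

```python
# Hypothetical compressed frame sizes in bits (illustrative values only).
FRAME_BITS = {"I": 400_000, "P": 150_000, "B": 50_000}


def transmission_seconds(frame_type, channel_bits_per_second):
    """Time to transmit one frame over a constant-bit-rate channel."""
    return FRAME_BITS[frame_type] / channel_bits_per_second


def presentation_seconds(frames_per_second):
    """Time each frame is displayed; it does not depend on the frame type."""
    return 1.0 / frames_per_second
```

Under these assumed sizes and a 4 Mbit/s channel, an I-frame occupies 0.1 second of the transmission timeline, whereas every frame occupies the same 1/24 second of a 24 fps presentation timeline.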
The incoming frames of the video stream are decoded, buffered, and then presented at a fixed frequency (e.g., 24 frames per second (fps)) to produce a relatively smooth motion picture presentation to the user. When MPEG 2 is used to convey NTSC video, the field rate is fixed, and each MPEG 2 picture may produce 1, 2, or 3 fields. Field pictures are required to produce 1 field, and frame pictures may produce 2 or 3 fields. Thus, the frame picture presentation rate may not be fixed, but it is not dictated by the transmission rate of the frame pictures.
FIG. 1 also illustrates a typical decoded GOP 105 of MPEG in its presentation timeline. This GOP example includes an I-frame 110p; six P-frames (e.g., 120p); and 14 B-frames (e.g., 130p and 132p). Typically, each GOP includes a series of consecutively presented decoded frames that begin with an I-frame (such as frame 110p).
Order of Transmission and Presentation
As shown in FIG. 1, the order in which the frames are presented typically does not directly match the order in which the frames are transmitted. The arrows shown in FIG. 1 between the frames of the transmission timeline 100t and the presentation timeline 100p illustrate a typical way that frames are reordered between reception and presentation. The tail of each arrow has a bullet (i.e., circle) anchor at the end of a transmitted frame. The head of each arrow has an arrowhead pointing to its corresponding presentation frame.
For example, the transmission I-frame 110t corresponds to the presentation I-frame 110p. In reality these are the same frames, but their timeline representations indicate their different manifestations.
Returning to the explanation of this example, the transmission P-frame 120t corresponds to the presentation P-frame 120p. The transmission B-frames 130t and 132t correspond to the presentation B-frames 130p and 132p. As shown in FIG. 1, these B-frames 130t and 132t are encoded, transmitted, received, and decoded after their P-frame 120t in the transmission timeline 100t, but their corresponding presentation B-frames 130p and 132p are presented before their P-frame 120p in the presentation timeline 100p. Note that the encoder typically receives the frames in non-compressed form in the same order that the frames are eventually displayed, and the encoder typically performs the frame re-ordering before compressing the frames.
Furthermore, the next GOP to be transmitted starts with I-frame 150t, but two B-frames 134t and 136t typically come along after this new GOP has begun. As illustrated in FIG. 1, the straggling B-frames 134p and 136p are presented in sequence and before the presentation of the I-frame 150p of the new GOP.
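The reordering described above can be sketched as follows. This is a simplified illustration, not the reordering rule of any particular encoder: it assumes that frames are represented by hypothetical string labels whose first letter gives the frame type, and that every B-frame is transmitted immediately after the next anchor frame (I or P) that follows it in presentation order.

```python
def to_transmission_order(presentation_order):
    """Reorder frame labels so each B-frame follows its succeeding anchor."""
    transmitted = []
    pending_b = []  # B-frames waiting for their future anchor frame
    for frame in presentation_order:
        if frame.startswith("B"):
            pending_b.append(frame)
        else:
            # An anchor (I or P) is sent first, then the B-frames
            # that precede it in presentation order.
            transmitted.append(frame)
            transmitted.extend(pending_b)
            pending_b.clear()
    # Any trailing B-frames without a later anchor are flushed at the end.
    transmitted.extend(pending_b)
    return transmitted
```

For the presentation order I0, B1, B2, P3, B4, B5, I6, this sketch produces the transmission order I0, P3, B1, B2, I6, B4, B5. The last two B-frames trail the I-frame of the following GOP, in the manner of the straggling B-frames discussed above.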
GOP Presentation Delay
FIG. 1 shows that the I-frame 110t of an example GOP is first received beginning at point T1 in time; however, it is not first presented until point T2. The time gap between the two points is called herein the “GOP presentation delay” and is labeled 170 in FIG. 1. It represents the delay from when the receiver first begins receiving the first frame of a GOP (which is typically the I-frame) until the device first presents the first frame of the GOP.
There are many reasons for this delay. Some are a natural consequence of the video streaming technology and others are imposed into the process to address known technical issues. Some of the reasons for the GOP presentation delay include:
    contrast between the time required to receive a frame transmission and the time required to display a frame;
    the time required to decode a frame (especially considering inter-frame dependencies for decoding); and
    built-in delay to facilitate smooth presentation of frames without needing to wait for frame transmission or decoding.
The details of these reasons and the knowledge of other reasons are known to those of ordinary skill in the art.
Video-Stream Presentation Start-up Delay
To tune channels in a video-streaming environment (such as digital cable), a receiver receives a video stream and waits for an access point into the stream. A channel change cannot occur until an access point is received. From the perspective of the user, this can lead to lengthy channel change times.
FIG. 2 illustrates an example of a video-stream presentation start-up delay at 280. The start-up delay is the effective delay experienced by a user. It includes a delay between when a particular video stream is requested and the actual presentation of the first frame of a GOP from the particular video stream. As shown in FIG. 2, the start-up delay 280 includes the GOP presentation delay 270 (discussed above).
Referring to FIG. 2, this example is explained. A GOP, starting with I-frame 210t, is being transmitted. This is shown in the transmission timeline 200t. The receiver tunes into this video stream at request point R. This selection is illustrated as a user selecting a video-stream channel using a remote control 260.
Again, this is an example illustration for explanatory purposes. This point R could be at any moment in time after the beginning of a GOP (i.e., after the beginning of its I-frame 210t).
The receiver must wait for a random access point (or RAP) in order to access the video stream. In this example, each GOP has one RAP. An I-frame is an example of a typical RAP. Therefore, each GOP has one I-frame. So, the receiver must wait for the next I-frame (at the beginning of the next GOP) before it can access the video-stream transmission as shown by transmission timeline 200t. 
Once the receiver has an I-frame in its buffer, it may refer back to it for dependency decoding of P- and B-frames. Consequently, a conventional system must wait for a RAP before it can start buffering frames (that are useful).
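The wait for a random access point can be illustrated with a short sketch. It assumes, for illustration only, that the receiver sees the incoming frames as a sequence of one-letter type labels and that an I-frame is the only kind of RAP; the function name is hypothetical.

```python
def first_useful_index(received_frame_types):
    """Return the index of the first I-frame received after tuning in.

    Frames before this index cannot be decoded because the anchor frames
    they depend on were never received. Returns None if no RAP has
    arrived yet.
    """
    for index, frame_type in enumerate(received_frame_types):
        if frame_type == "I":
            return index
    return None
```

For example, a receiver that joins mid-GOP and receives B, P, B, I, P would discard the first three frames and begin buffering at index 3.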
In FIG. 2, the receiver starts buffering the next GOP at point M1 with I-frame 250t. Thus, the first frame that may be eventually presented to the user is I-frame 250t, because it is the first RAP in the stream after the point at which the receiver joined the stream. Because of the GOP presentation delay (discussed above), it actually starts presenting the GOP (with I-frame 250p of presentation timeline 200p) at point M2—which is also the presentation start-up point S of the start-up delay 280.
As demonstrated by the screens 262–266, the start-up delay is the effective delay experienced by a user. The user selects a video-stream channel at request point R (using, for example, a remote 260) and sees a blank screen, as shown by screen 262. Of course, there may be information presented here (such as electronic programming information), but since it is not yet the desired video-stream content it is effectively blank.
Screen 264 shows that the screen remains blank even while the next GOP is being received. Screen 266 shows that the first image of frame 250p is finally presented to the user.
The average length of this start-up delay is directly proportional to the average GOP length. Some video-stream providers employ relatively long average GOP lengths. For these instances, this delay is even more acute because the user is waiting longer for the next GOP to come round after she has changed channels.
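The relationship between GOP length and start-up delay can be made concrete with a simple model. The sketch below is illustrative and rests on stated assumptions: every GOP has the same transmission duration, each GOP contains exactly one RAP at its start, and the request point R is equally likely to fall anywhere within a GOP. The function and parameter names are hypothetical.

```python
def startup_delay(request_offset, gop_length, gop_presentation_delay):
    """Delay from request point R to the first presented frame.

    request_offset is the time, in seconds, between the start of the GOP
    currently being transmitted and the request point R.
    """
    if not 0 < request_offset <= gop_length:
        raise ValueError("request_offset must lie within the current GOP")
    wait_for_next_rap = gop_length - request_offset
    return wait_for_next_rap + gop_presentation_delay


def average_startup_delay(gop_length, gop_presentation_delay):
    """Mean start-up delay when R is uniformly distributed within a GOP."""
    return gop_length / 2 + gop_presentation_delay
```

Under this model, the average wait for the next RAP is half of the GOP length, so doubling the GOP length doubles that portion of the start-up delay, while the GOP presentation delay is added on top.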
In short, this start-up delay is very annoying to typical users and tries their patience.