Referring to FIG. 1, a television programming provider, such as a national satellite aggregator, typically produces a continuous set of programming signals (also known as “network feeds”) for distribution by a service provider over a video transmission network 5 to a wide audience of viewers. Conventionally, the programming signal begins as an uncompressed video sequence 6 and at least one corresponding uncompressed audio sequence (not shown). The sequence 6 consists of a series of sequential pictures i and is assembled at a production facility 7.
After assembly, the uncompressed video sequence 6 is compressed by a video encoder, which may be a conventional video encoder (CVE) 8. The CVE 8 encodes each picture i (i=1, 2, . . . ) to create a corresponding coded picture (also known as an access unit) of bi bits, using a conventional video coding algorithm defined by a video coding standard such as MPEG-2 or H.264. Any corresponding audio sequences are compressed by an audio encoder (not shown). The video and audio encoders are synchronized by a common clock signal.
In order to maximize coding efficiency, many modern video coding algorithms encode pictures as one of three picture types: intra-coded, predictive-coded and bi-directionally predictive-coded. An intra-coded picture (or I-picture) contains a complete description of the original picture. A predictive-coded picture (or P-picture) contains a description of the picture relative to a temporally earlier reference picture. This allows the encoder to use considerably fewer bits to describe a P-picture than would be required for an equivalent I-picture. A bi-directionally predictive-coded picture (or B-picture) contains a description of the picture relative to both a temporally earlier reference picture and a temporally later reference picture. This allows the encoder to use approximately an order of magnitude fewer bits to describe a B-picture than an equivalent I-picture. However, in order to use information from a temporally later picture to encode a B-picture, the temporally later picture must be encoded before the B-picture.
Referring to FIG. 2 as an example, pictures 38 (i=1, 2, . . . , 19) of a partial uncompressed video sequence are shown in display order 40 and the corresponding coded pictures 41 are shown in encode order 44. For each picture 38 the CVE determines the appropriate type for the corresponding coded picture 41 and the coded picture's place in the encoding order. In the example, the CVE encodes picture 1 as an I-picture I1, then picture 4 as a P-picture P4 using picture 1 as a reference. Next, the CVE encodes pictures 2 and 3 as B-pictures B2 and B3 using picture 1 as the temporally earlier reference and picture 4 as the temporally later reference. Then picture 7 is encoded as a P-picture P7 using picture 4 as a reference, pictures 5 and 6 are encoded as B-pictures B5, B6 using pictures 4 and 7 as references, and so on. P-pictures and B-pictures are said to be dependent on the picture or pictures used as reference(s).
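The reordering from display order to encode order can be sketched as follows (a minimal illustration assuming the fixed three-picture anchor spacing of FIG. 2; a real CVE selects picture types and reference structures adaptively, and the function name is hypothetical):

```python
def encode_order(display_order, anchor_spacing=3):
    """Reorder display-order picture numbers into encode order for the
    I B B P B B P ... pattern of FIG. 2: each anchor (I- or P-picture)
    is emitted first, followed by the B-pictures that depend on it."""
    out = []
    pending_b = []  # B-pictures waiting for their temporally later reference
    for i in display_order:
        if (i - display_order[0]) % anchor_spacing == 0:  # anchor picture
            ptype = "I" if i == display_order[0] else "P"
            out.append((ptype, i))
            out.extend(("B", b) for b in pending_b)
            pending_b = []
        else:
            pending_b.append(i)
    # trailing pictures without a later anchor would be coded differently in practice
    out.extend(("B", b) for b in pending_b)
    return out
```

For pictures 1 through 19 this yields I1, P4, B2, B3, P7, B5, B6, and so on, matching the encode order 44 of FIG. 2.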
Referring again to FIG. 1, the bits of each coded picture leave the CVE as a video elementary stream 46 at either a constant bit rate R or a variable bit rate R(t). The video elementary stream and any corresponding audio elementary streams (not shown) are input to a system encoder 48. The system encoder 48 packetizes the elementary streams into packetized elementary stream (PES) packets, each PES packet containing one or more access units of a given type. Each PES packet includes a packet header and packet data from one of the elementary streams. The PES packets are then multiplexed together and placed in transport stream (TS) packets for transmission across the network 5. For each picture i, the CVE determines the picture's playout time relative to the other pictures and relative to a system time clock (STC). The playout time is inserted into the coded picture's PES packet header in the form of a presentation time stamp (PTS). The encoder's STC is periodically sampled to generate a program clock reference (PCR) which is embedded in the transport stream containing the associated PES. A downstream decoder 16 will use a phase-locked loop to generate its own STC based on the received PCRs and thereby synchronize to the encoder's STC. The decoder then compares the PTS of each coded picture in the received transport stream to the recovered STC to determine the correct time to display the coded pictures so the audio and video playout may be synchronized.
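The relationship between a picture's display position and its PTS can be illustrated with a small sketch (a simplification with assumed parameter names: MPEG systems express PTS and DTS as 33-bit counts of a 90 kHz clock derived from the 27 MHz STC, and the frame rate here is a stand-in value):

```python
PTS_CLOCK_HZ = 90_000  # PTS/DTS tick rate per ISO/IEC 13818-1

def presentation_time_stamp(picture_index, frame_rate=30, stc_offset=0):
    """PTS for the picture at the given 0-based display position,
    expressed as a 33-bit count of the 90 kHz clock."""
    pts = stc_offset + picture_index * PTS_CLOCK_HZ // frame_rate
    return pts % (1 << 33)  # the PTS field wraps at 33 bits
```

At 30 frames per second each picture period is 3000 ticks, so consecutive PTS values differ by 3000; the decoder presents a picture when its recovered STC reaches that count.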
The video and audio data packetized by the system encoder 48 represent a single program 50. After leaving the system encoder 48, the TS packets are combined with other TS packets, representing other programs, in a statistical multiplexer 67 to form a multi-program transport stream (MPTS). The MPTS is input to an up-link station 68 and used to modulate a carrier. The up-link station 68 transmits the modulated carrier 72 to a distributor head-end 76, via a satellite 77. At the head-end 76 the modulated carrier 72 is demodulated and demultiplexed, and the program 50 is re-encapsulated in a single program transport stream (SPTS) 78. The SPTS 78 is transmitted from the head-end 76 across a network 80 to customer premises over a transmission medium, such as optical fiber, copper wire, or coaxial cable. At the customer premises 14, the SPTS 78 is input to the decoder 16. The decoder 16 is often provided by the distributor (e.g. as part of a ‘set-top’ box (STB)). The decoder uses the SPTS 78 to generate the recreated video sequence 18.
Since dependent coded pictures depend on their decoded reference pictures, the decoder 16 must decode the reference pictures before the dependent picture can be decoded. Therefore, although the coded pictures are transmitted, and subsequently decoded, in the encoding order 44 (FIG. 2), the downstream decoder 16 may not simply display the decoded pictures in the order they are received. For coded pictures transmitted earlier in the sequence than they are to be displayed (e.g. P4 in the example shown in FIG. 2), a decode time stamp (DTS) td,i, relative to the STC, is inserted into the coded picture's packet header in addition to the PTS. Coded picture P4 will be decoded at time td,4, relative to the recovered STC, and the recreated picture 4 stored in a decoded picture buffer (DPB) 60 (FIG. 1) until the picture's PTS, being used as a reference where necessary. For pictures that require no reordering (e.g. B2 and B3 in FIG. 2), the DTS and PTS would be identical, assuming instantaneous decoding; therefore only the PTS is transmitted, and the PTS is also used to determine the decode time.
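The DTS/PTS relationship can be sketched for the encode order of FIG. 2 (a hypothetical helper; the one-frame-period reorder delay and the 30 fps rate are illustrative assumptions, not values taken from the figure):

```python
def assign_time_stamps(coded_order, frame_rate=30, reorder_delay=1):
    """For (type, display_number) pairs in encode order, compute DTS from
    the decode slot and PTS from the display position (90 kHz ticks).
    An explicit DTS is carried only when reordering makes it differ
    from the PTS."""
    ticks = 90_000 // frame_rate
    stamps = []
    for slot, (ptype, disp) in enumerate(coded_order):
        dts = slot * ticks
        pts = (disp - 1 + reorder_delay) * ticks
        entry = {"picture": f"{ptype}{disp}", "pts": pts}
        if dts != pts:
            entry["dts"] = dts
        stamps.append(entry)
    return stamps
```

Running this on I1, P4, B2, B3 gives P4 a DTS one frame period after I1 but a PTS three periods later, while B2 and B3 need no explicit DTS at all.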
As the bits of the coded pictures stream into the decoder 16, the decoder will place the bits in the coded picture buffer (CPB) 54 until the recovered STC reaches each picture's decode time, at which point the bits of the coded picture are instantaneously removed from the CPB 54 and decoded. For AVC, the behavior of the CPB is defined by H.264; for MPEG-2, an equivalent virtual buffer is defined by H.262. The CVE 8 assumes the decoder's CPB 54 is of size B bits. The CVE 8 tracks the fullness of the assumed decoder CPB by maintaining its own “virtual buffer.”
FIG. 3 shows the relationship between the fullness of the encoder's virtual buffer and the decoder's CPB for the example shown in FIG. 2. It is well understood in the art that the fullness of the encoder's virtual buffer at time t with respect to the STC will mirror the fullness of the decoder's CPB 54 at time t with respect to the recovered STC. For example, at time t1, relative to the encoder's STC, the encoder's virtual buffer contains (B/2)+C bits, whereas at time t1, relative to the decoder's recovered STC, the decoder's CPB contains (B/2)−C bits. It is the encoder's responsibility to control the video elementary stream in order to prevent underflow of the decoder's CPB by preventing its own virtual buffer from overflowing. Underflow of the encoder's virtual buffer is acceptable because it generally results only in a brief pause in data transmission.
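The mirrored buffer behavior can be modeled with a small constant-bit-rate simulation (a sketch only; the function name, the startup-delay parameter, and the clamping of arrivals at a full buffer are illustrative assumptions rather than details from the standards):

```python
def cpb_trace(picture_bits, bit_rate, frame_rate, cpb_size, startup_delay_frames=2):
    """Track decoder CPB fullness: bits arrive continuously at bit_rate,
    and each coded picture's bits are removed instantaneously at its
    decode time.  The encoder's virtual buffer is the mirror image,
    i.e. cpb_size minus each value returned here."""
    bits_per_period = bit_rate / frame_rate
    fullness = startup_delay_frames * bits_per_period  # buffered before first decode
    trace = []
    for b in picture_bits:
        if b > fullness:
            raise RuntimeError("CPB underflow: picture incomplete at its decode time")
        fullness -= b                                        # instantaneous removal
        fullness = min(fullness + bits_per_period, cpb_size)  # arrivals until next decode
        trace.append(fullness)
    return trace
```

With every picture sized exactly at the per-period bit budget the fullness holds steady; an oversized picture whose bits have not all arrived by its decode time raises the underflow condition the encoder must prevent.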
To prevent the CPB from underflowing (or overflowing), the CVE uses a conventional rate control algorithm that controls the allocation of bits to each coded picture. In addition to controlling the buffer fullness, the rate control algorithm also works to maintain a given target bit rate R (or, for a variable bit rate system, a peak bit rate Rp and some average bit rate less than Rp) for the program while optimizing the overall picture quality. The rate control algorithm can also interact with a statistical multiplexer to find an optimal balance between the quality of the video elementary stream and the bit rate requirements of the MPTS.
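A toy bit-allocation rule conveys the flavor of such a rate control algorithm (the picture-type weights and the linear headroom scaling are invented purely for illustration; production rate control is far more elaborate):

```python
def target_bits(picture_type, virtual_buffer_fullness, buffer_size, avg_picture_bits):
    """Budget bits for the next picture: weight by picture type (I-pictures
    need many more bits than B-pictures), then shrink the budget as the
    encoder's virtual buffer fills, which keeps the mirrored decoder CPB
    from underflowing."""
    weight = {"I": 4.0, "P": 2.0, "B": 1.0}[picture_type]
    headroom = 1.0 - virtual_buffer_fullness / buffer_size
    return max(1, int(avg_picture_bits * weight * headroom))
```

An I-picture with an empty virtual buffer receives four times the average budget, while any picture type is squeezed toward a minimal allocation as the buffer approaches capacity.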
Referring to FIG. 4, the uncompressed video sequence 6 includes feature content 22, such as episodes of recurring television programs F1, F2, F3 and F4 that are to be transmitted sequentially, interspersed with advertising content blocks 28 (i.e. one or more commercials, public service announcements, station identification messages, etc.). At the production facility (7, FIG. 1), the programming provider uses conventional video editing techniques to insert the advertising content blocks 28 into the feature content 22 at predetermined intervals, as shown at 6.
The advertising content blocks 28 that are inserted into the uncompressed video sequence 6 at the production facility typically take the form of a series of video sequences having relatively short duration (e.g. 8 distinct video sequences each having a duration of 30 seconds or 1 minute). As part of a commercial arrangement between the programming provider and the service providers, some advertising content blocks may contain some low priority advertising content 92, such as advertisements provided by the television network itself (or the block may not be full, e.g. an advertising content block may contain 4 minutes of video sequences and 1 minute of “black” 100). This allows the service providers to overwrite the low priority advertising content 92 (or the “black” data 100) in the programming signal with their own targeted advertising content. This ‘ad-insertion’ capability is advantageous for the service providers because they can provide targeted advertising content specifically aimed at their customer base.
Referring again to FIG. 1, the traditional approach to ad-insertion in the compressed video domain is to use a conventional transport stream splicer 116 to effect an ideally seamless splice between the content of the “primary” compressed video transport stream (i.e. the SPTS 78) and the content of a “secondary” compressed video transport stream 120 containing targeted advertising content. A “seamless” splice is invisible to a person viewing the programming signal—that is, no visual artifacts are created by the splice and the viewer is unaware he or she is not viewing content from the uncompressed video sequence 6. The secondary video transport stream 120 is usually streamed out from a video on demand (VOD) server 124. U.S. Pat. Nos. 6,678,332 and 6,792,047 describe examples of the splicing technology applicable to the conventional approach. Industry standards, such as ISO/IEC 13818-1 and ANSI/SCTE 35, may be used to define how potential splice points are identified in the SPTS 78 by the CVE 8, for instance by adding “digital cue tones” to the primary transport stream 78 temporally ahead of the splice points. Regardless of the specific means by which the potential splice points are signaled, the conventional transport stream splicer 116 detects these signals, locates the potential splice points and, when appropriate, splices the secondary transport stream 120 into the primary transport stream 78.
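At a conceptual level the splicer's job can be sketched as follows (a deliberately simplified toy: string markers stand in for real digital cue tones such as SCTE 35 splice messages, list items stand in for stretches of transport stream, and the actual operation on compressed streams additionally involves timestamp and buffer management):

```python
def splice(primary, secondary, out_cue="SPLICE_OUT", in_cue="SPLICE_IN"):
    """Pass primary-stream content through; between an out-cue and the
    matching in-cue, substitute content from the secondary (targeted ad)
    stream in place of the primary's low-priority content."""
    spliced = []
    in_ad_break = False
    ad_content = iter(secondary)
    for item in primary:
        if item == out_cue:
            in_ad_break = True
        elif item == in_cue:
            in_ad_break = False
        elif in_ad_break:
            replacement = next(ad_content, None)  # None if the ad runs short
            if replacement is not None:
                spliced.append(replacement)
        else:
            spliced.append(item)
    return spliced
```

The hard part omitted here, and the reason conventional splicers are expensive, is making the substituted stream seamless: matching timestamps, splicing only at suitable picture boundaries, and keeping the downstream CPB from underflowing or overflowing.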
At a minimum, a conventional transport stream splicer 116, capable of effecting a seamless splice in the compressed video domain, needs to partially decode the SPTS 78, for instance to calculate buffer fullness. Because the ad-insertion needs to take place ‘on the fly’ as the SPTS 78 is en route to the customer premises 14, conventional transport stream splicers are complex and computationally expensive. This precludes cost-effective implementation of conventional splicing applications as close to the customer premises as would be desirable for the service providers.
Referring again to FIG. 1, the farther downstream in the service provider network 80 the ad-insertion occurs, the more specifically the service provider can target a particular customer. For instance, if the ad-insertion occurs at the service provider's head-end 76 (as shown in FIG. 1), then all of the service provider's customers may receive and view the same targeted advertising content. If the service provider's network has multiple zones 108a, 108b, the service provider may splice different advertising content into each zone at intermediate points 112 of the network 80, targeting the demographic characteristics of the respective zones. It is well understood in the art that, due to the nature of a compressed video transport stream, ad-insertion in the compressed video domain is not as straightforward as the process of inserting the advertising content blocks into the uncompressed video sequence.
Thus what is needed is a technique for allowing seamless splicing in the compressed video domain, anywhere in the chain between the encoder and the decoder, without requiring a complex and computationally expensive splicer application. Specifically, ad-insertion would be most beneficial within the customer premises 14, thereby allowing individually targeted advertising content.