Recent advances in cable and satellite distribution of subscription and “on-demand” audio, video and other content to subscribers have given rise to a growing number of digital set-top boxes (STBs, sometimes referred to as Digital Consumer Terminals or “DCTs”) for decoding and delivering digitally broadcast programming. These set-top boxes often include additional circuitry to make them compatible with older analog encoding schemes for audio/video distribution. As the market for digital multimedia content of this type grows and matures, there is a corresponding growth of demand for new, more advanced features.
Video-on-demand (VOD) and audio-on-demand are examples of features made practical by broadband digital broadcasting via cable and satellite. Unlike earlier services where subscribers were granted access only to scheduled encrypted broadcasts (e.g., movie channels, special events programming, etc.), these on-demand services permit a subscriber to request a desired video, audio or other program at any time. Upon receiving the request for programming (and, presumably, authorization to bill the subscriber's account), the service provider then transmits the requested program to the subscriber's set-top box for viewing/listening. The program material is typically “streamed” to the subscriber in MPEG format for immediate viewing/listening, but can also be stored or buffered in the set-top box (typically on a hard-disk drive or “HDD”) for subsequent viewing/listening.
Digital video broadcasts are typically transmitted (via cable or satellite) using a digital video compression scheme for encoding. Video compression is a technique for encoding a video “stream” or “bitstream” into a different encoded form (preferably a more compact form) than its original representation. A video “stream” is an electronic representation of a moving picture image.
The Motion Picture Association of America (MPAA) is a trade association of the American film industry, whose members include the industry's largest content providers (i.e., movie producers, studios). The MPAA requires protection of video-on-demand (VOD) content from piracy. Without security to protect content against unauthorized access, MPAA member content providers will not release their content (e.g., movies) for VOD distribution. Without up-to-date, high-quality content, the VOD market would become non-viable.
Encryption methods are continually evolving to keep pace with the challenges of video-on-demand (VOD) and other consumer-driven interactive services. With VOD, headend-based sessions are necessarily becoming more personalized. In this scenario, video streams are individually encrypted and have their own set of unique keys.
One of the best known and most widely used video compression standards for encoding moving picture images (video) and associated audio is the MPEG-2 standard, provided by the Moving Picture Experts Group (MPEG), a working group of the ISO/IEC (International Organization for Standardization/International Electrotechnical Commission) in charge of the development of international standards for compression, decompression, processing, and coded representation of moving pictures, audio and their combination. The ISO has offices at 1 rue de Varembé, Case postale 56, CH-1211 Geneva 20, Switzerland. The IEC has its central office in Geneva, Switzerland.
The international standard ISO/IEC 13818-2 “Generic Coding of Moving Pictures and Associated Audio Information: Video” and ATSC document A/54 “Guide to the Use of the ATSC Digital Television Standard” describe the MPEG-2 encoding scheme for encoding and decoding digital video (and audio) data. The MPEG-2 standard allows for the encoding of video over a wide range of resolutions, including higher resolutions commonly known as HDTV (high definition TV).
In MPEG-2, encoded pictures are made up of pixels. Each 8×8 array of pixels is known as a block. A 2×2 array of blocks is referred to as a macroblock. MPEG-2 video compression is achieved using a variety of well known techniques, including prediction (motion estimation in the encoder, motion compensation in the decoder), 2-dimensional discrete cosine transformation (DCT) of 8×8 blocks of pixels, quantization of DCT coefficients, and Huffman and run-length coding. Reference frame images, called “I-frames”, are encoded without prediction. Predictively-coded frames known as “P-frames” are encoded as a set of predictive parameters relative to a previous I-frame or P-frame. Bidirectionally predictive coded frames known as “B-frames” are encoded as predictive parameters relative to both previous and subsequent I-frames or P-frames. In MPEG-2 encoded video streams, all video data is packaged into fixed-size 188-byte packets for transport.
The MPEG-2 standard specifies formatting for the various component parts of a multimedia program. Such a program might include, for example, MPEG-2 compressed video, compressed audio, control data and/or user data. The standard also defines how these component parts are combined into a single synchronous bit stream. The process of combining the components into a single stream is known as multiplexing. The multiplexed stream may be transmitted over any of a variety of links such as, for example, Radio Frequency Links (UHF/VHF), Digital Broadcast Satellite Links, Cable TV Networks, Standard Terrestrial Communication Links, Microwave Line of Sight (LoS) Links (wireless), Digital Subscriber Links (ADSL family), and Packet/Cell Links (ATM, IP, IPv6, Ethernet).
A fundamental component of any MPEG bit stream is an elementary stream (ES). A “program” comprises a plurality of ESs. Each ES is provided as an input to an MPEG-2 processor (e.g., a video compressor) which formats the ES into a series of Packetized Elementary Stream (PES) packets. A PES packet may be a fixed (or variable) sized block, with up to 65536 bytes per block and a six byte protocol header (first field of the PES Header). Typically, a PES packet contains an integer number of ES access units.
The PES header starts with a three-byte start code, followed by a one-byte stream ID and a two-byte length field (the protocol header). The MPEG-2 standard defines a number of stream IDs. Following the protocol header are PES Indicators that provide formatting/encoding information about the stream, to assist in decoding. These PES Indicators include information about whether encryption is used, the encryption method, the priority of the current PES packet, an indicator of whether the payload starts with an audio or with a video start code, copyright information, and an indicator of whether the PES is an original or a copy. A one-byte flag field completes the PES header. The information in the PES header is, generally speaking, independent of the transmission method being used.
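The six-byte protocol header described above can be illustrated with a minimal sketch. The function and type names are hypothetical and chosen for illustration only. The start code value 0x000001 and the example stream ID 0xE0 (a video stream) are taken from the MPEG-2 systems standard rather than from the text above.

```python
import struct
from typing import NamedTuple

# Three-byte start code that opens every PES packet (value per ISO/IEC 13818-1).
PES_START_CODE = b"\x00\x00\x01"


class PesProtocolHeader(NamedTuple):
    stream_id: int      # one-byte stream ID
    packet_length: int  # two-byte length field


def parse_pes_protocol_header(data: bytes) -> PesProtocolHeader:
    """Parse the six-byte protocol header at the start of a PES packet."""
    if len(data) < 6:
        raise ValueError("a PES protocol header is six bytes long")
    if data[:3] != PES_START_CODE:
        raise ValueError("missing PES start code")
    # Big-endian: one unsigned byte followed by one unsigned 16-bit integer.
    stream_id, packet_length = struct.unpack(">BH", data[3:6])
    return PesProtocolHeader(stream_id, packet_length)
```

For example, the bytes 00 00 01 E0 01 00 parse to a stream ID of 0xE0 and a length field of 256. The PES Indicators and flag field that follow the protocol header are not decoded in this sketch.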
The MPEG-2 standard defines two forms of multiplexing (combining of ESs into a single stream):
MPEG Program Stream: A group of tightly coupled PES packets referenced to a common time base. Such streams are suited for transmission in a relatively error-free environment and enable easy software processing of the received data. This form of multiplexing is used for video playback and for some network applications.
MPEG Transport Stream: Each PES packet is broken into fixed-sized transport packets, providing the basis of a general-purpose technique for combining one or more streams, possibly with independent time bases. This is suited for transmission in which there may be potential packet loss or corruption by noise, and/or where there is a need to send more than one program at a time.
The Program Stream is widely used in digital video storage devices, and also where the video is reliably transmitted over a network (e.g. video-clip download). Digital Video Broadcast (DVB) uses the MPEG-2 Transport Stream over a wide variety of underlying networks. Since both the Program Stream and Transport Stream multiplex a set of PES inputs, interoperability between the two formats may be achieved at the PES level. The discussion herein is directed mainly to processing the MPEG Transport Stream (TS).
A transport stream consists of a sequence of fixed sized transport packets of 188 bytes. Each packet comprises 184 bytes of payload and a four-byte header. One of the items in this four-byte header is the 13 bit Packet Identifier (PID) which plays a key role in the operation of the Transport Stream.
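As a minimal sketch of the packet layout just described, the following splits a byte sequence into 188-byte transport packets and decodes the four-byte header. The names are hypothetical. The text above names only the PID; the positions of the remaining header fields and the sync byte value 0x47 follow the MPEG-2 systems standard and are assumptions relative to this paragraph.

```python
from typing import Iterator, NamedTuple

TS_PACKET_SIZE = 188  # four-byte header plus 184-byte payload
SYNC_BYTE = 0x47


class TsHeader(NamedTuple):
    transport_error: bool
    payload_unit_start: bool
    transport_priority: bool
    pid: int                       # 13-bit Packet Identifier
    scrambling_control: int        # 2 bits
    adaptation_field_control: int  # 2 bits
    continuity_counter: int        # 4 bits


def parse_ts_header(packet: bytes) -> TsHeader:
    """Decode the four-byte header of one transport packet."""
    if len(packet) != TS_PACKET_SIZE:
        raise ValueError("a transport packet is exactly 188 bytes")
    if packet[0] != SYNC_BYTE:
        raise ValueError("missing sync byte")
    b1, b2, b3 = packet[1], packet[2], packet[3]
    return TsHeader(
        transport_error=bool(b1 & 0x80),
        payload_unit_start=bool(b1 & 0x40),
        transport_priority=bool(b1 & 0x20),
        pid=((b1 & 0x1F) << 8) | b2,  # low 5 bits of byte 1, all of byte 2
        scrambling_control=(b3 >> 6) & 0x03,
        adaptation_field_control=(b3 >> 4) & 0x03,
        continuity_counter=b3 & 0x0F,
    )


def iter_packets(stream: bytes) -> Iterator[bytes]:
    """Yield consecutive 188-byte packets, ignoring any trailing partial packet."""
    for offset in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        yield stream[offset:offset + TS_PACKET_SIZE]
```

The sketch assumes the input is already aligned on a packet boundary; a receiver would normally search for the sync byte first.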
Typically, two elementary streams are sent in the same MPEG-2 transport stream (e.g., two elementary streams containing video and audio packets, respectively). Each packet is tagged with a PID value that identifies it as being associated with a specific PES. Typically, audio packets are tagged with a unique PID and video packets are tagged with a different PID. The actual PID values are arbitrary, but they necessarily have different values. Usually there are many more video packets than audio packets, so the two types of packets are usually not evenly spaced in time.
Accordingly, an MPEG transport stream (TS) is not time-division multiplexed, and packets with any PID may appear in the TS at any time. If no source packets are available, null packets (denoted by a PID value of 0x1FFF) are inserted into the TS to maintain a constant TS bit rate. PESs in a TS are not synchronized with one another; indeed the encoding and decoding delay for each PES may be different (and usually is different).
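The null-packet stuffing described above might be sketched as follows. The function names and the notion of a fixed number of packet "slots" per interval are hypothetical, introduced only to illustrate padding up to a constant rate. Filling the null packet payload with 0xFF bytes is likewise an assumption, since a decoder discards null packets without inspecting the payload.

```python
from typing import List, Sequence

TS_PACKET_SIZE = 188
NULL_PID = 0x1FFF


def make_null_packet() -> bytes:
    """Build a null packet: sync byte, PID 0x1FFF, payload-only, filler payload."""
    header = bytes([
        0x47,                    # sync byte
        (NULL_PID >> 8) & 0x1F,  # upper five PID bits, all flag bits clear
        NULL_PID & 0xFF,         # lower eight PID bits
        0x10,                    # payload only, continuity counter 0
    ])
    return header + b"\xFF" * (TS_PACKET_SIZE - len(header))


def pad_to_constant_rate(packets: Sequence[bytes], slots: int) -> List[bytes]:
    """Fill the unused packet slots of one interval with null packets."""
    if len(packets) > slots:
        raise ValueError("more source packets than available slots")
    return list(packets) + [make_null_packet()] * (slots - len(packets))
```

In this sketch the null packets are simply appended; an actual multiplexer would interleave them wherever no source packet is ready.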
Single and Multiple Program Transport Streams
A TS may correspond to a single TV program, or multimedia stream (e.g. with a video PES and an audio PES). This type of TS is normally called a Single Program Transport Stream (SPTS).
An SPTS contains all of the information required to reproduce the encoded TV channel or multimedia stream. It may contain only audio and video PESs, but there are usually other types of PESs as well. Each PES in an SPTS shares a common time base. Although some equipment outputs and uses SPTS, this is not the normal form of stream transmitted over a DVB link.
In most cases one or more SPTS streams are combined to form a Multiple Program Transport Stream (MPTS). This larger aggregate also contains all the control information (Program Specific Information (PSI)) required to coordinate a DVB system, along with any other data which is to be sent.
Most transport streams consist of a number of related elementary streams (e.g. the video and audio portions of a TV program). Decoding of the elementary streams typically needs to be coordinated (synchronized) to ensure that the audio playback is in synchronism with the corresponding video frames. The elementary streams may be tightly synchronized (usually necessary for digital TV programs, or for digital radio programs), or unsynchronized (in the case of programs offering downloading of software or games, as an example). To aid in synchronization, time stamps may optionally be sent in the transport stream.
There are two types of time stamps:
The first type is usually called a reference time stamp. This time stamp is an indication of the current time. Reference time stamps are found in the PES syntax (ESCR), in the Program Stream syntax (SCR), and in the Program Clock Reference (PCR) field of the transport packet adaptation field.
The second type of time stamp is called a Decoding Time Stamp (DTS) or Presentation Time Stamp (PTS). These time stamps are inserted close to the material to which they refer (normally in the PES packet header). They indicate the exact moment at which a video frame or an audio frame is to be decoded or presented to the user, respectively. These rely on reference time stamps for operation.
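By way of illustration, a PTS or DTS value can be recovered from the PES packet header as sketched below. The text above does not give the field layout; the sketch assumes the encoding defined in the MPEG-2 systems standard, in which a 33-bit value counted in units of a 90 kHz clock is spread over five bytes with interleaved marker bits. The function names are hypothetical, and the encoder is included only so that the decoder can be exercised.

```python
CLOCK_HZ = 90_000  # PTS/DTS tick rate per ISO/IEC 13818-1 (assumption here)


def decode_timestamp(field: bytes) -> int:
    """Extract the 33-bit time stamp from a five-byte PTS or DTS field."""
    if len(field) != 5:
        raise ValueError("a PTS/DTS field is five bytes long")
    return (
        ((field[0] >> 1) & 0x07) << 30  # bits 32..30
        | field[1] << 22                # bits 29..22
        | (field[2] >> 1) << 15         # bits 21..15
        | field[3] << 7                 # bits 14..7
        | field[4] >> 1                 # bits 6..0
    )


def encode_timestamp(value: int, prefix: int = 0b0010) -> bytes:
    """Pack a 33-bit time stamp with a four-bit prefix and three marker bits."""
    if not 0 <= value < 1 << 33:
        raise ValueError("time stamp must fit in 33 bits")
    return bytes([
        (prefix << 4) | (((value >> 30) & 0x07) << 1) | 1,
        (value >> 22) & 0xFF,
        (((value >> 15) & 0x7F) << 1) | 1,
        (value >> 7) & 0xFF,
        ((value & 0x7F) << 1) | 1,
    ])


def timestamp_seconds(ticks: int) -> float:
    """Convert 90 kHz ticks to seconds."""
    return ticks / CLOCK_HZ
```

A decoder would compare the decoded value against its reconstructed reference clock to decide when to decode or present the associated frame.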
To decode a particular transport stream, the PID values associated with relevant elementary streams (e.g., audio and video elementary streams) must be determined. The transport stream is then “filtered” for transport packets having those PID values. The “filtered” packets are then decoded. To aid in identifying which PID corresponds to which program, a special set of streams, known as Signaling Tables, are transmitted with a description of each program carried within the MPEG-2 Transport Stream. Signaling tables are transmitted via an independent PES, and are not synchronized with, e.g., audio and video elementary streams associated with a program stream (i.e., they are provided via an independent control channel).
Video or audio payload data is organized into PES packets before being broken up into fixed length transport packet payloads. A PES packet may be much longer than a transport packet. When segmenting PES packets for placement in transport packet payloads, the PES header is always placed immediately following a transport header. Subsequent portions of the PES packet are then distributed into a series of transport packets. Any “slack” space in the final transport packet of the series is padded with bytes=0xFF (all ones).
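The segmentation described above can be sketched as follows, following the simplified description in the text: the PES packet is cut into 184-byte pieces and the slack in the final packet is filled with 0xFF bytes. The function name is hypothetical. Setting the payload unit start flag on the first packet and incrementing a four-bit continuity counter follow the MPEG-2 systems standard; note that the standard carries stuffing for PES data in an adaptation field ahead of the payload, which this sketch does not model.

```python
from typing import List

TS_PAYLOAD_SIZE = 184  # 188-byte packet minus four-byte header


def packetize_pes(pes_packet: bytes, pid: int) -> List[bytes]:
    """Split one PES packet into transport packets carrying the given PID."""
    if not 0 <= pid <= 0x1FFF:
        raise ValueError("PID must fit in 13 bits")
    packets = []
    for index, offset in enumerate(range(0, len(pes_packet), TS_PAYLOAD_SIZE)):
        chunk = pes_packet[offset:offset + TS_PAYLOAD_SIZE]
        # Pad the final, short chunk with 0xFF (all ones).
        chunk = chunk.ljust(TS_PAYLOAD_SIZE, b"\xFF")
        # The PES header lands right after the transport header of the
        # first packet, which is flagged as the start of a payload unit.
        start_flag = 0x40 if index == 0 else 0x00
        header = bytes([
            0x47,                      # sync byte
            start_flag | (pid >> 8),   # start flag plus upper PID bits
            pid & 0xFF,                # lower PID bits
            0x10 | (index & 0x0F),     # payload only, continuity counter
        ])
        packets.append(header + chunk)
    return packets
```

A 204-byte PES packet, for example, yields two transport packets, the second of which carries 20 bytes of PES data followed by 164 padding bytes.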
Each transport packet starts with a sync byte=0x47. (In the ATSC US terrestrial DTV VSB transmission system, this byte is not processed, but is replaced by a different sync symbol especially suited to RF transmission.)
At the receiving end of a multiplexed MPEG-2 transport stream (TS), the transport stream must be de-multiplexed in order that digital data can be extracted therefrom.
For example, a multiple program transport stream (MPTS) may comprise a video packet, followed by an audio packet, followed by another video packet, followed by a program association table (PAT), followed by a program map table (PMT), followed by other packets (such as program guides), followed by another video packet, etc.
The tables, called Program Specific Information (PSI) in MPEG-2, consist of a description of the elementary streams that need to be combined to build programs, and a description of the programs. The PAT lists the PIDs of tables describing each program. The PMT defines the set of PIDs associated with a program (e.g., audio, video, . . . ).
Each PSI table is carried in a sequence of PSI Sections, which may be of variable length (but are usually small compared with PES packets). Each section is protected by a CRC (checksum) to verify the integrity of the table being carried. The length of a section allows a decoder to identify the next section in a packet. A PSI section may also be used for downloading data to a remote site. Tables are sent periodically by inserting them into the transmitted transport stream.
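The integrity check on a PSI section might be sketched as follows. The text above does not name the CRC; the sketch assumes the 32-bit CRC used by the MPEG-2 systems standard (polynomial 0x04C11DB7, initial value 0xFFFFFFFF, no bit reflection, no final inversion), with the four CRC bytes stored big-endian at the end of the section. The function names are hypothetical.

```python
def crc32_mpeg2(data: bytes) -> int:
    """Bitwise CRC-32 with the MPEG-2 parameters (MSB first, no final XOR)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc


def append_crc(section_body: bytes) -> bytes:
    """Return the section body followed by its four-byte CRC."""
    return section_body + crc32_mpeg2(section_body).to_bytes(4, "big")


def section_is_intact(section: bytes) -> bool:
    """A section that includes its trailing CRC yields a remainder of zero."""
    return crc32_mpeg2(section) == 0
```

Because this CRC has no final inversion, running it over the whole section, CRC bytes included, leaves zero when the section is undamaged, so the decoder does not need to locate the CRC field separately.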
The transport packet comprises a header, adaptation fields, and a payload. The transport packet header comprises a sync byte, flags, a continuity counter, and a 13-bit packet ID (PID). PID 0x0000 is reserved for transport packets carrying a program association table (PAT). The PAT identifies PIDs associated with Program Map Tables (PMTs), which in turn identify PIDs of ESs associated with particular elements (e.g., audio, video, etc.) of a program.
Accordingly, decoding a transport stream involves:
finding the PAT by selecting packets with PID=0x0000;
determining PIDs for the PMTs;
determining the PIDs for the elements of a desired program from its PMT (for example, a basic program will have a PID for audio and a PID for video); and
detecting packets with the desired PIDs and routing them to an appropriate decoding process (i.e., an audio decoder for audio PES data and a video decoder for video PES data).
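The steps above can be sketched as follows. The function names are hypothetical. The byte offsets used for the PAT and PMT follow the section syntax of the MPEG-2 systems standard and are assumptions relative to the text. The sketch further assumes that each table fits in a single section that has already been reassembled from its transport packets and checked against its CRC.

```python
from typing import Dict, Iterable, List, Set, Tuple


def packet_pid(packet: bytes) -> int:
    """Extract the 13-bit PID from a transport packet header."""
    return ((packet[1] & 0x1F) << 8) | packet[2]


def parse_pat(section: bytes) -> Dict[int, int]:
    """Map each program number to the PID of its PMT."""
    section_length = ((section[1] & 0x0F) << 8) | section[2]
    end = 3 + section_length - 4  # stop before the four CRC bytes
    programs = {}
    for off in range(8, end, 4):  # four-byte entries follow an 8-byte header
        program_number = (section[off] << 8) | section[off + 1]
        pid = ((section[off + 2] & 0x1F) << 8) | section[off + 3]
        if program_number != 0:   # program 0 points at the network table
            programs[program_number] = pid
    return programs


def parse_pmt(section: bytes) -> List[Tuple[int, int]]:
    """List (stream_type, elementary PID) pairs for one program."""
    section_length = ((section[1] & 0x0F) << 8) | section[2]
    end = 3 + section_length - 4
    program_info_length = ((section[10] & 0x0F) << 8) | section[11]
    off = 12 + program_info_length  # skip program-level descriptors
    streams = []
    while off + 5 <= end:
        stream_type = section[off]
        pid = ((section[off + 1] & 0x1F) << 8) | section[off + 2]
        es_info_length = ((section[off + 3] & 0x0F) << 8) | section[off + 4]
        streams.append((stream_type, pid))
        off += 5 + es_info_length   # skip stream-level descriptors
    return streams


def pids_for_program(pat_section: bytes, pmt_sections: Dict[int, bytes],
                     program_number: int) -> Set[int]:
    """Follow PAT -> PMT to collect the elementary PIDs of one program."""
    pmt_pid = parse_pat(pat_section)[program_number]
    return {pid for _stream_type, pid in parse_pmt(pmt_sections[pmt_pid])}


def filter_packets(packets: Iterable[bytes], wanted: Set[int]) -> List[bytes]:
    """Keep only the transport packets whose PID is in the wanted set."""
    return [p for p in packets if packet_pid(p) in wanted]
```

In use, the PAT section would be gathered from packets with PID 0x0000 and each PMT section from packets with the PID the PAT reports; the filtered packets would then be routed to an audio or video decoder according to the stream type.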
An outgrowth of digital set-top box (DCT) technology is set-top boxes (STBs) with embedded PVRs/DVRs (Personal Video Recorder/Digital Video Recorder), whereby video content can be recorded directly to a storage device (e.g., hard disk or local memory) for subsequent playback. As with conventional video recording applications (e.g., video cassette recorders—VCRs), it is often desirable to record one program “stream” while viewing another—an application that operates on two video streams simultaneously.
Another common application of modern set-top boxes, televisions, etc., is Picture-In-Picture (PIP), where an inset (thumbnail) display of a first video stream is overlaid on a full-screen display of a second video stream. Like simultaneous viewing and recording, PIP operates on two video streams simultaneously.
Historically, for analog television broadcasts, these dual-stream applications required “dual tuner” functionality—one tuner for receiving the program to be viewed, the other to receive the program to be recorded. Since most VCRs include an independent tuner for recording and a broadband pass-through capability, the “dual tuner” requirement is effectively satisfied. To provide the same capability, embedded DVR and PIP applications (when built into a single unit) must provide for the ability to decode at least two digital video streams simultaneously, either or both of which may be encrypted.
Generally, encryption of an MPEG-2 transport stream involves encryption of the data content of a transport stream, but not the structure thereof. That is, only the data payload portion of transport packets in a transport stream is encrypted, but the structure of the transport packets themselves (header, flags, framing, etc.) is sent in the clear (unencrypted). Encrypted and non-encrypted stream data can be mixed in a transport stream.
As described hereinabove, the encryption method (if any) used to encrypt a particular PES is identified in the PES header. Once it has been determined that a PES contains an encrypted payload (e.g., encrypted video or audio), then all transport packets with PIDs associated with that PES must be routed through a decryption mechanism prior to decoding. Typically, this decryption mechanism is a dedicated decryption engine, e.g., an integrated circuit (IC) chip or dedicated hardware specifically designed to perform the decryption function. One example of a chip with this type of decryption capability is Motorola's MC 1.7 (MediaCipher v1.7) Conditional Access Control chip.
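The routing of encrypted packets through a decryption mechanism, with the packet structure left in the clear, might be sketched as follows. The function name and the decrypt_payload callable are hypothetical stand-ins; no actual conditional access cipher is modeled. The sketch also assumes packets that carry no adaptation field, so that the payload is the 184 bytes following the four-byte header.

```python
from typing import Callable, Iterable, List, Set


def route_through_decryptor(
    packets: Iterable[bytes],
    encrypted_pids: Set[int],
    decrypt_payload: Callable[[bytes], bytes],
) -> List[bytes]:
    """Decrypt the payload of packets on encrypted PIDs; pass others through."""
    out = []
    for packet in packets:
        pid = ((packet[1] & 0x1F) << 8) | packet[2]
        if pid in encrypted_pids:
            # Only the payload is transformed; the header stays untouched.
            header, payload = packet[:4], packet[4:]
            packet = header + decrypt_payload(payload)
        out.append(packet)
    return out
```

Because the header is never encrypted, the PID can be read and the packet routed before any decryption takes place, which is what allows encrypted and non-encrypted streams to be mixed in one transport stream.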