In order to understand the present invention it is useful to review the state of the art of video tape transport devices and video tape formats. It is also beneficial to review certain aspects of video compression. Both are discussed below.
Video Tape Transport and Formatting
FIG. 1 shows a conventional video tape transport and scanner assembly 10. The scanner includes a rotary drum 14 with recording/playback heads A,B positioned thereon. (The invention is illustrated herein using a two head A,B drum 14, although the principles described herein are equally applicable to drums having a different number of heads, such as four.) A video tape 12 is wrapped partially around the circumference of the drum 14, e.g., 180.degree. around the circumference, and is transported around the drum 14. As shown, the tape is transported at an angle .theta. to a line perpendicular to the axis of rotation of the drum 14. While the tape is transported in the indicated direction, the recording/playback heads A,B rotate in the indicated direction. As each head A or B rotates in the proximity of the portion of the tape 12 wrapped around the drum 14, the head A or B scans over a portion of the tape 12. The heads A and B scan successive portions of the tape 12 in turn, in a round robin fashion.
FIG. 2 illustrates the scanning of the heads A,B over the tape 12 in greater detail. As shown, each head A or B scans a diagonal segment of the tape 12 referred to as a track 18. During recording, a signal (e.g., an analog composite NTSC or PAL video signal) is recorded by the heads A and B onto the tracks 18 as each head A or B scans a track 18. (Illustratively, alternate tracks are recorded with opposite magnetic polarities, as shown by oppositely slanted diagonal lines, to reduce inter-track interference.) Likewise, during playback, the signal is reproduced from the tracks as the heads A and B scan each track 18. The angle .theta..sub.r of the track 18 with respect to the tape axis depends on the relative transport speed of the tape 12 and the rotational speed of the drum 14 and the angle of transport .theta..degree.. As may be appreciated, to reproduce a signal properly, the transport and scanner assembly 10 must cause each head A,B to substantially scan each diagonal track 18 in sequence and in relative alignment with the angle of the track 18, as shown by the arrow 25. To that end, an automatic tracking frequency (ATF) word is illustratively recorded on each track 18, which ATF word is reproduced by a head A or B during playback. The ATF word of each track produces a signal with a particular frequency. As shown in FIG. 1, this signal is fed to a feedback circuit 22 which controls the drum servo 24 and the capstan servo 26. The feedback circuit 22 compares the frequency of the ATF signal to a target frequency. Depending on this comparison, the relative speed of the drum 14 and the tape 12 transport is either increased or decreased to ensure that the heads A,B scan each track 18 successively.
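The comparison performed by the feedback circuit 22 can be sketched as a simple proportional control loop. The sketch below is illustrative only: the function names, the gain, the target frequency value and the toy model in which the reproduced ATF frequency is proportional to the relative speed are all assumptions and are not taken from any tape format. In particular, the sign convention (a low frequency means the relative speed is too low) is assumed.

```python
def atf_step(relative_speed, measured_hz, target_hz, gain=0.5):
    """Return a corrected relative drum/tape speed (hypothetical controller).

    Assumption: a measured ATF frequency below the target means the
    relative speed is too low, so the speed is increased, and vice versa.
    """
    error = (target_hz - measured_hz) / target_hz
    return relative_speed * (1.0 + gain * error)


def simulate(initial_speed, target_hz=100_000.0, steps=20):
    """Run the loop against a toy plant and return the final speed.

    Toy plant (assumption): the reproduced frequency is proportional to the
    relative speed, and a speed of 1.0 reproduces exactly the target.
    """
    speed = initial_speed
    for _ in range(steps):
        measured_hz = target_hz * speed
        speed = atf_step(speed, measured_hz, target_hz)
    return speed


if __name__ == "__main__":
    print(simulate(0.9))  # starts 10% slow
    print(simulate(1.1))  # starts 10% fast
```

Under these assumptions both runs settle close to a relative speed of 1.0, i.e., the speed is increased when it starts low and decreased when it starts high.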
In a conventional analog VTR, each field of video occupies an equal amount of space on the tape. In particular, each field is recorded on a single track; there is a one-to-one correspondence between tracks and fields. Thus, during playback, the scanning of one head A or B produces a video signal for presenting one field of the video on a display device.
The trick modes of concern herein are fast forward and fast reverse playback modes in which the video information is played back at a faster rate than the normal playback speed. In order to provide such fast or n.times.normal speed playback, only a fraction of the video information is presented on the display device. For instance, during 3.times.normal speed playback, only one third of the video information is presented on the display device. During fast forward or fast reverse, the speed of the tape relative to the rotation of the heads is much greater than during normal speed playback. Thus, the heads do not scan in relative alignment with a single track but rather cross a number of tracks, as illustrated by the arrow 30 in FIG. 2. Recall that each track corresponds to a single field. Furthermore, there is a correspondence between the location within a given track 18 at which a particular portion of the video information is recorded and the location within the field that the particular video information portion reproduces. For example, assume the scan of a head A or B crosses the first, middle and last thirds 45, 50 and 55 of three tracks 60, 65, 70, which tracks 60, 65, 70 correspond to first, second and third consecutively displayed fields. This scan produces a video signal with video information for the first third of the first field, the middle third of the second field and the last third of the third field. Thus, the video signal produced by an n.times.normal speed playback includes a piece of the video signal from each of n different fields, each piece at its respective portion of the image. Because there is a large correlation from field to field, this "piece-meal" video signal can be presented as an intelligible image; the viewer will perceive a single, reasonably correlated image of low fidelity, even though the viewer is seeing concatenated portions from n fields.
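The correspondence described above between the crossed tracks and the portions of the displayed image can be sketched as follows. This is an idealized model: it assumes the head crosses exactly n tracks per scan and recovers an equal 1/n portion from each, and the function name is hypothetical.

```python
from fractions import Fraction


def trick_scan_segments(first_field, n):
    """List the field portions picked up by one scan at n times normal speed.

    Returns (field_index, start, end) tuples, where start and end are the
    fractions of the image height covered by the piece taken from that field.
    Idealized assumption: the scan crosses exactly n tracks and reads an
    equal 1/n slice of each one.
    """
    segments = []
    for k in range(n):
        segments.append((first_field + k, Fraction(k, n), Fraction(k + 1, n)))
    return segments


if __name__ == "__main__":
    for field, start, end in trick_scan_segments(first_field=0, n=3):
        print(f"field {field}: image rows from {start} to {end} of the height")
```

For n = 3 this yields the first third of the first field, the middle third of the second field and the last third of the third field, matching the example of tracks 60, 65 and 70.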
Video Compression
Advantageously, an audio-visual presentation or program bearing signal is digitized and compressed before it is recorded on the video tape. For example, the video and audio portions of an audio-video program may be compressed according to the Moving Picture Experts Group (MPEG) II recommendations. See ISO/IEC DIS 13818-2: Information Technology--Generic Coding of Moving Pictures and Associated Audio Information. The contents of this document are incorporated herein by reference. Illustratively, such encoding and storage produces a hierarchically organized signal. Furthermore, yet another layer in the hierarchy may be provided, namely, a storage or channel layer and a tape format, for formatting such a compressed MPEG II signal for physical storage on the tracks of a tape. In summary, the different layers of the hierarchy are the elementary stream layer, the transport stream layer, and the storage (or channel) layer with its tape format.
MPEG II provides a specification for the elementary stream and transport stream layers and is believed to be a preferred way to compress and organize video and associated audio information. Therefore, this invention is illustrated using the above hierarchy and in particular, using MPEG II compliant elementary streams and transport streams. Each of these streams is discussed in greater detail below. Furthermore, because this invention is directed to trick mode playback, i.e., n.times.normal speed playback, only video reproduction is of concern. Therefore, audio and other non-video data is not discussed for purposes of brevity.
Video Elementary Streams
MPEG II provides for compressing video by reducing both spatial and temporal redundancy. A good tutorial for MPEG II video compression is contained in D. Le Gall, "MPEG: A Video Compression Standard for Multimedia Applications", Communications of the ACM, April 1991. The contents of this document are incorporated herein by reference. A spatial encoder 80 is shown in FIG. 3 including an orthogonal transform circuit 82, a quantizer 84 and a variable length encoder circuit 86. Likewise, a spatial decoder 90 is shown including a variable length decoder 96, an inverse quantizer 94 and an inverse orthogonal transform circuit 92, which perform the inverse functions of their counterparts 86, 84 and 82. To spatially encode a picture, the picture is divided into blocks of pixels, e.g., 8.times.8 blocks of pixels. Each block of pixels is orthogonally transformed (e.g., using a discrete cosine transform or DCT) to produce a number of transform coefficients. For example, as shown in FIG. 3, a matrix of transform coefficients is produced by the orthogonal transform circuit 82 for an 8.times.8 block of pixels. As shown, the horizontal spatial frequencies of the coefficients increase in the right hand direction and the vertical spatial frequencies of the coefficients increase in the downward direction of the matrix. From a psycho-visual perspective, the lower spatial frequency coefficients tend to be more important than the higher spatial frequency coefficients for purposes of decompressing the block to reproduce the original block. Furthermore, the higher frequency coefficients tend to be close to zero in magnitude. The coefficient for the lowest vertical and horizontal frequency is the most important coefficient, and is referred to as the DC coefficient (because it contains information regarding the average intensity of the block of pixels). The other coefficients are referred to as AC coefficients.
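The role of the DC coefficient can be illustrated with a direct, unoptimized two-dimensional DCT of an 8.times.8 block. The sketch below uses the orthonormal DCT-II definition and is meant only to show the behavior of the coefficients; the function name is hypothetical, and a practical encoder would use a fast transform.

```python
import math

N = 8  # block size used for spatial coding in this illustration


def dct_2d(block):
    """Return the N x N matrix of DCT-II coefficients of an N x N pixel block.

    coeffs[v][u] holds vertical frequency v and horizontal frequency u, so
    coeffs[0][0] is the DC coefficient and all other entries are AC.
    """
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for v in range(N):
        for u in range(N):
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            coeffs[v][u] = scale(u) * scale(v) * total
    return coeffs


if __name__ == "__main__":
    flat = [[100] * N for _ in range(N)]  # uniform block, average intensity 100
    coeffs = dct_2d(flat)
    print(round(coeffs[0][0], 6))  # DC term: N times the average intensity
    print(max(abs(coeffs[v][u]) for v in range(N) for u in range(N) if (u, v) != (0, 0)))
```

With this scaling, a uniform block produces a DC coefficient equal to eight times its average intensity and AC coefficients that are zero to within rounding error, which is why the DC coefficient alone conveys the average intensity of the block.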
As shown by the arrows, the coefficients are read out of the orthogonal transform circuit 82 in a zig-zag fashion in relative increasing spatial frequency, from the DC coefficient to the highest vertical and horizontal spatial frequency AC coefficient AC.sub.77. This tends to produce a sequence of coefficients containing long runs of near zero magnitude coefficients. The coefficients are quantized in the quantizer 84 which, amongst other things, converts the near-zero coefficients to zero. This produces coefficients with non-zero amplitude levels and runs (or subsequences) of zero amplitude level coefficients. The coefficients are then (zero) run-level encoded and variable length encoded in the variable length encoder 86.
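The zig-zag read-out, quantization and run-level steps described above can be sketched as follows. This is a simplified illustration: a single uniform step size stands in for the quantizer 84, the DC coefficient is treated like any other coefficient, and the final variable length coding is omitted. All function names are hypothetical.

```python
def zigzag_order(n=8):
    """Return (row, col) positions from DC to the highest frequency AC term."""
    order = []
    for s in range(2 * n - 1):  # s = row + col indexes one anti-diagonal
        diagonal = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Odd diagonals are walked with the row increasing (down and left),
        # even diagonals in the opposite direction (up and right).
        order.extend(diagonal if s % 2 else reversed(diagonal))
    return order


def quantize(coeffs, step):
    """Uniform quantizer (simplifying assumption); truncates toward zero."""
    return [[int(value / step) for value in row] for row in coeffs]


def run_level_encode(levels):
    """Encode a scanned sequence as (zero_run, level) pairs.

    Trailing zeros produce no pair; a real bit stream would signal them
    with an end-of-block code.
    """
    pairs = []
    run = 0
    for level in levels:
        if level == 0:
            run += 1
        else:
            pairs.append((run, level))
            run = 0
    return pairs


if __name__ == "__main__":
    coeffs = [[0.3] * 8 for _ in range(8)]  # small, near-zero AC values
    coeffs[0][0], coeffs[0][1], coeffs[2][0] = 80.0, -17.0, 9.0
    quantized = quantize(coeffs, step=8)
    scanned = [quantized[r][c] for r, c in zigzag_order()]
    print(run_level_encode(scanned))
```

In this example the 64 coefficients collapse to three (run, level) pairs, because the quantizer converts every near-zero coefficient to zero and the zig-zag order groups those zeros into runs.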
Blocks which are solely spatially encoded such as described above are referred to as intrablocks because they are encoded based only on information self-contained in the block. An intra-picture or I picture is a picture which contains only intrablocks. (Herein, "picture" means field or frame as per MPEG II nomenclature).
In addition to spatial coding, an encoder can also reduce temporal redundancy via temporal coding. In temporal coding, it is presumed that there is a high correlation between groups of pixels in one picture and groups of pixels in another picture of a sequence of pictures. Thus, a group of pixels can be thought of as moving from one relative position in one picture, called an anchor picture, to another relative position of another picture, with only small changes in the luminosity and chrominance of its pixels. In MPEG II, the group of pixels is a block of pixels, although such blocks need not be the same size as those on which spatial coding is performed. (For instance, temporal coding may be performed on "macroblocks" equal in size to four of the blocks which are used for spatial coding. Thus, if spatial coding is performed on 8.times.8 pixel blocks, temporal encoding is performed on 16.times.16 pixel macroblocks.) The temporal coding proceeds as follows. A block of pixels, in a picture to be encoded, is compared to different possible blocks of pixels, in a search window of a potential anchor frame, to determine the best matching block of pixels in the potential anchor frame. This is illustrated in FIG. 4. A motion vector MV is determined which indicates the relative shift of the best matching block in the anchor frame to the block of the picture to be encoded. Furthermore, a difference between the best matching block and the block in the picture to be encoded is formed. The difference is then spatially encoded.
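The block matching procedure described above can be sketched as an exhaustive search that minimizes the sum of absolute differences (SAD). The SAD criterion, the small search range and the function names are assumptions chosen for illustration; the description above does not prescribe a particular matching criterion or search strategy.

```python
def sad(anchor, block, top, left):
    """Sum of absolute differences between block and the anchor region at (top, left)."""
    size = len(block)
    return sum(abs(anchor[top + y][left + x] - block[y][x])
               for y in range(size) for x in range(size))


def best_match(anchor, block, top, left, search=2):
    """Find the motion vector (dx, dy) and residual for one block.

    (top, left) is the position of the block in the picture being encoded;
    candidates within +/- search pixels of that position are tested in the
    anchor picture.
    """
    size = len(block)
    height, width = len(anchor), len(anchor[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            t, l = top + dy, left + dx
            if 0 <= t <= height - size and 0 <= l <= width - size:
                cost = sad(anchor, block, t, l)
                if best is None or cost < best[0]:
                    best = (cost, (dx, dy))
    dx, dy = best[1]
    # The difference block is what would then be spatially encoded.
    residual = [[block[y][x] - anchor[top + dy + y][left + dx + x]
                 for x in range(size)] for y in range(size)]
    return (dx, dy), residual


if __name__ == "__main__":
    anchor = [[y * 8 + x for x in range(8)] for y in range(8)]
    block = [row[2:4] for row in anchor[3:5]]  # content found at row 3, col 2 of the anchor
    print(best_match(anchor, block, top=2, left=1))
```

Here the block sits at row 2, column 1 of the picture being encoded but its content is found one pixel to the right and one pixel down in the anchor picture, so the search returns the motion vector (1, 1) and an all-zero difference block.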
Blocks which are temporally encoded are referred to as interblocks. Interblocks are not permitted in I pictures but are permitted in predictive pictures or P pictures or bidirectionally predictive pictures or B pictures. P pictures are pictures which each only have a single anchor picture, which single anchor picture is presented in time before the P picture encoded therewith. Each B picture has an anchor picture that is presented in time before the B picture and an anchor picture which is presented in time after the B picture. This dependence is illustrated in FIG. 5 by arrows. Note that pictures may be placed in the elementary stream in a different order than they are presented. For instance, it is advantageous to place both anchor pictures for the B pictures in the stream before the B pictures which depend thereon (so that they are available to decode the B pictures) even though half of those anchor pictures will be presented after the B pictures. While P and B pictures can have interblocks, some blocks of P and B pictures may be encoded as intrablocks if an adequate matching block cannot be found therefor.
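The reordering described above, in which each anchor picture is placed in the stream ahead of the B pictures that depend on it, can be sketched as follows. The picture labels and the function name are hypothetical, and the sketch assumes that every B picture depends on the nearest anchor before it and the nearest anchor after it in presentation order.

```python
def display_to_stream_order(display):
    """Reorder pictures from presentation order to stream order.

    Pictures are labeled with their type followed by a display index,
    e.g. "I0", "B1", "P3". B pictures are held back until the anchor
    picture that follows them in presentation order has been emitted.
    """
    stream = []
    pending_b = []
    for picture in display:
        if picture[0] == "B":
            pending_b.append(picture)
        else:
            stream.append(picture)    # the I or P anchor goes first
            stream.extend(pending_b)  # then the B pictures that needed it
            pending_b = []
    stream.extend(pending_b)  # B pictures whose later anchor is not in this list
    return stream


if __name__ == "__main__":
    print(display_to_stream_order(["I0", "B1", "B2", "P3", "B4", "B5", "P6"]))
```

For the presentation order I0 B1 B2 P3 B4 B5 P6 this gives the stream order I0 P3 B1 B2 P6 B4 B5, so that P3 is available before B1 and B2 are decoded even though it is presented after them.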
Note that the amount of compressed information produced by the above encoding processes varies from picture to picture. I pictures tend to require significantly more bits than P and B pictures. Furthermore, it is possible for an encoder to arbitrarily encode inputted video pictures as I, P or B pictures. However, many implementations at least specify that an I picture should be produced every predetermined number of pictures. In particular, MPEG II defines a video stream syntax wherein a group of pictures (GOP) start code is provided followed by a predetermined number of I, P and B pictures. Such GOPs have an I picture as the very first picture.
Also note that only I pictures can be independently decompressed. In order to decode P and B pictures, the anchor frames, on which they depend, must also be decompressed.
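The dependence on anchor pictures can be made concrete by computing which pictures must be decompressed in order to present one selected picture. The sketch below assumes that a P picture depends on the nearest preceding I or P picture and that a B picture depends on the nearest I or P picture on each side in presentation order; the function name and the example picture pattern are hypothetical.

```python
def pictures_needed(display, index):
    """Return the set of display indices that must be decoded to present one picture.

    display is a string of picture types in presentation order, e.g. "IBBPBBP".
    Assumption: anchors are always the nearest I or P pictures, and every
    P or B picture has the anchors it needs within the string.
    """
    kind = display[index]
    if kind == "I":
        return {index}  # I pictures are independently decodable
    previous_anchor = max(i for i in range(index) if display[i] in "IP")
    needed = {index} | pictures_needed(display, previous_anchor)
    if kind == "B":
        next_anchor = min(i for i in range(index + 1, len(display))
                          if display[i] in "IP")
        needed |= pictures_needed(display, next_anchor)
    return needed


if __name__ == "__main__":
    pattern = "IBBPBBP"
    for index in (0, 4, 6):
        print(index, sorted(pictures_needed(pattern, index)))
```

For the pattern IBBPBBP, presenting the last P picture requires three pictures to be decoded and presenting the B picture at index 4 requires four, whereas the I picture requires only itself.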
Transport Stream
MPEG II provides two higher layer streams called the program stream and the transport stream. However, it is believed that most storage and transmission uses of MPEG II compressed video and audio will utilize the transport stream. Therefore, this invention is explained in the context of the transport stream. A good tutorial of MPEG II transport streams is contained in A. Wasilewski, MPEG-2 Systems Specification: Blueprint for Network Interoperability, COMM. TECH., February, 1994. The contents of this document are incorporated herein by reference.
According to the MPEG II standard, each digital elementary stream is first placed into packetized elementary stream (PES) packets of arbitrary length. The PES packet data, and other data, relating to one or more programs may be combined into one or more transport streams. The transport stream is organized into fixed length (more precisely, 188 byte length) packets. Each of the transport stream packets includes a four byte header and a 184 byte payload.
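The fixed packet structure can be illustrated by building and parsing the four byte header. The field layout used below (sync byte 0x47, 13 bit packet identifier, payload unit start indicator, adaptation field control and 4 bit continuity counter) is taken from the MPEG-2 Systems specification, ISO/IEC 13818-1, rather than from the description above, and the function names are hypothetical. Only a subset of the header fields is handled.

```python
import struct

TS_PACKET_SIZE = 188
TS_HEADER_SIZE = 4
TS_PAYLOAD_SIZE = TS_PACKET_SIZE - TS_HEADER_SIZE  # 184 bytes
SYNC_BYTE = 0x47


def build_header(pid, continuity_counter, payload_unit_start=False):
    """Pack a payload-only transport packet header (no adaptation field)."""
    # Bits of the 16-bit word: error indicator, payload unit start,
    # priority, then the 13-bit packet identifier (PID).
    word = (int(payload_unit_start) << 14) | (pid & 0x1FFF)
    # Scrambling control = 00, adaptation field control = 01 (payload only).
    last = (0b01 << 4) | (continuity_counter & 0x0F)
    return struct.pack(">BHB", SYNC_BYTE, word, last)


def parse_header(packet):
    """Extract the PID, start flag and continuity counter from a packet."""
    sync, word, last = struct.unpack(">BHB", packet[:TS_HEADER_SIZE])
    if sync != SYNC_BYTE:
        raise ValueError("not aligned on a transport packet boundary")
    return {
        "pid": word & 0x1FFF,
        "payload_unit_start": bool(word & 0x4000),
        "continuity_counter": last & 0x0F,
    }


if __name__ == "__main__":
    packet = build_header(pid=0x100, continuity_counter=5, payload_unit_start=True)
    packet += bytes(TS_PAYLOAD_SIZE)  # placeholder payload
    print(len(packet), parse_header(packet))
```

The resulting packet is exactly 188 bytes long, with the 184 bytes after the header available for PES packet data or other data.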
Each transport packet can carry PES packet data, e.g., video or audio data compressed and formed into streams according to MPEG II syntax, or program specific information (PSI) data. The PSI data, header portions of the PES packet data, as well as other portions of a given transport packet, may be used to provide information, other than elementary stream data, which is necessary to decode the PES packet data. Such information includes snapshots of the encoder clock, time stamps for decoding and presenting units (e.g., video pictures) of PES packet data relative to the encoder clock, information regarding which video and audio streams (and other data, such as closed caption text) contained in the transport stream are related to the same program and where such streams may be found within the transport stream, conditional access information for descrambling or decrypting encrypted PES packet data, etc. A single transport packet may contain PES packet data for only a single stream, and PES packet data and PSI data must be placed in separate transport packets. The transport stream packets may also contain optional adaptation fields for carrying, amongst other things, private data.
The transport stream contains only limited forward error correction (FEC) information. This is because transport streams are designed to be used ubiquitously in any kind of communication network/system or storage device, such as satellite transponders, asynchronous transfer mode (ATM) networks, magnetic and optical disk drives, switched telephone networks, non-switched local area networks, etc. Each of these networks and devices has its own physical format and can introduce different kinds of error and noise. Therefore, FEC has been specifically omitted from the transport layer and is instead provided at the storage or channel layer. Thus, inter-operability is provided (so that a transport stream can be stored and reproduced from any storage device and then transported by any combination of networks and systems, each such system encapsulating and decapsulating the transport stream at that system's endpoints) without a great deal of overhead (i.e., without utilizing a large part of the bandwidth of the transport stream).
Storage (Channel) Layer and Tape Format
MPEG II does not provide a syntax or semantics for this layer. However, the Standard Definition VCR (SD VCR) specification, as developed by several well-known VCR manufacturers and research organizations, may be considered a de-facto standard. See HD-Digital VCR Conference, "Basic Specifications for Consumer-Use Digital VCR," August, 1993. Two data rate streams and formats, namely a 25 M bit/sec and a 50 M bit/sec stream and format, for recording on video cassette recorder (VCR) tape have been proposed.
Recently a draft for Advanced Television (ATV) has been submitted for approval as the HDTV standard for the United States. The submitted draft complies with MPEG II and produces a video elementary stream with a nominal data rate of approximately 18.4 M bits/sec. Assume that a constant rate Dolby AC-3 compressed audio elementary stream of 384 K bits/sec is to be combined with the video elementary stream into an MPEG II transport stream. Combined with the transport stream overhead (assuming no adaptation fields), the bit rate of the transport stream is 19.2 M bits/sec. Such a transport stream may be easily encapsulated in the SD VCR data stream for recording on video tape, with about 5 M bits/sec of capacity left over.
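The bit rates quoted above can be checked with a short calculation. The sketch assumes that the only transport overhead is the four byte header on every 188 byte packet (PES header and PSI overhead are ignored) and that the 25 M bit/sec SD VCR stream is the one used for recording; the function name is hypothetical.

```python
TS_PACKET_BYTES = 188
TS_PAYLOAD_BYTES = 184


def transport_stream_rate(elementary_bps):
    """Scale an elementary stream rate by the packet-to-payload size ratio.

    Assumption: the 4-byte packet header is the only overhead, i.e. no
    adaptation fields and no PES or PSI overhead.
    """
    return elementary_bps * TS_PACKET_BYTES / TS_PAYLOAD_BYTES


if __name__ == "__main__":
    video_bps = 18.4e6       # ATV video elementary stream
    audio_bps = 384e3        # Dolby AC-3 audio elementary stream
    sd_vcr_bps = 25e6        # assumed SD VCR channel rate
    ts_bps = transport_stream_rate(video_bps + audio_bps)
    print(f"transport stream: {ts_bps / 1e6:.2f} M bits/sec")
    print(f"left over:        {(sd_vcr_bps - ts_bps) / 1e6:.2f} M bits/sec")
```

Under these assumptions the transport stream rate comes to roughly 19.2 M bits/sec, leaving between 5 and 6 M bits/sec of the 25 M bit/sec stream unused.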
The problem with utilizing the SD VCR channel layer and format for encapsulating and formatting the ATV MPEG II transport stream is that there is no provision for supporting trick play modes on the VTR during playback. First, the amount of information stored on the tape is highly variable from picture to picture. It is therefore difficult to concatenate the picture portions reproduced and decoded from each track portion as the heads obliquely scan a number of tracks. This is because there simply is no relation between the location of information on a track of the tape and the location in the picture to which the information corresponds. Second, it is not practical to simply display every n.sup.th encoded picture from the video tape during n.times.normal speed playback. This is because an MPEG II compliant stream contains P and B pictures which can only be decoded and presented using the appropriate anchor pictures from which they were encoded. It is difficult to locate such anchor pictures without reproducing the recorded signal in sequence. Considering the oblique scanning constraints of the scanner assembly during high speed playback, this makes playback of only selected pictures very difficult.
It is therefore an object of the present invention to overcome the disadvantages of the prior art. Specifically, it is an object of the present invention to provide a physical storage format/storage layer stream for storing compressed video which facilitates trick mode playback. It is an object of the present invention to provide for proper tracking, in accordance with the tape storage format described herein, to enable recovery of necessary information for trick mode playback. It is another object of the present invention to generate replica information of the compressed video for separate storage according to the described format herein, which replica information is used for trick mode playback.