1. Field of the Invention
This invention relates to the field of compressed motion video and, more specifically, to pre-compressed, stored video for video-on-demand applications.
2. Description of the Related Art
Digital video signals are typically compressed for transmission from a source to a destination. One common type of compression is "interframe" coding, such as is described in the International Telecommunications Union-Telecommunications (ITU-T) Recommendations H.261 and H.262, or the Recommendation H.263. Interframe coding exploits the spatial similarities of successive video frames by using previous coded and reconstructed video frames to predict the current video signal. By employing a differential pulse code modulation (DPCM) loop, only the difference between the prediction signal and the actual video signal amplitude (i.e. the "prediction error") is coded and transmitted.
In interframe coding, the same prediction is formed at the transmitter and the receiver, and is updated frame-by-frame at both locations using the prediction error. If a transmission error causes a discrepancy to arise between the prediction signal at the transmitter and the prediction signal at the receiver, the error propagates temporally over several frames. Only when the affected region of the image is updated by an intraframe coded portion of the transmission (i.e. a frame coded without reference to a previous frame), will the error propagation be terminated. In practice, this error propagation may result in an annoying artifact which may be visible for several seconds in the decoded, reconstructed signal.
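The temporal propagation of a transmission error, and its termination by an intraframe update, may be illustrated by the following minimal sketch. It is a simplification under stated assumptions and not the coder of any cited Recommendation: each "frame" is reduced to a single number, no quantization is applied, and the names `dpcm_encode` and `dpcm_decode` are hypothetical.

```python
def dpcm_encode(frames):
    """Return the prediction error for each frame (prediction = previous frame)."""
    prediction = 0
    errors = []
    for frame in frames:
        error = frame - prediction
        errors.append(error)
        prediction += error  # lossless here, so the prediction equals the frame
    return errors

def dpcm_decode(errors, intra=None):
    """Rebuild frames from prediction errors.

    intra (an assumption of this sketch) maps a frame index to an absolute
    value, standing in for an intraframe coded update at that index.
    """
    prediction = 0
    decoded = []
    for i, error in enumerate(errors):
        if intra and i in intra:
            prediction = intra[i]  # coded without reference to earlier frames
        else:
            prediction += error
        decoded.append(prediction)
    return decoded

frames = [10, 12, 15, 15, 18, 20]
errors = dpcm_encode(frames)

corrupted = list(errors)
corrupted[1] += 5  # a transmission error hits the second frame

drifting = dpcm_decode(corrupted)                         # error persists to the end
refreshed = dpcm_decode(corrupted, intra={4: frames[4]})  # error stops at index 4
```

In this sketch the offset introduced at index 1 appears in every later decoded value until the intraframe update at index 4 replaces the prediction outright.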
Shown in FIG. 1 is a schematic representation of a conventional hybrid interframe coder 10. Only the fundamental elements of the coder are shown in FIG. 1. However, this type of hybrid coder is known in the art, and the omitted elements are not germane to understanding its operation.
The coder of FIG. 1 receives an input video signal at summing node 12. The output of summing node 12 is a subtraction from a current frame of the input signal, of a motion-compensated version of a previous frame of the input signal (discussed in more detail hereinafter). The output of summing node 12 is received by discrete cosine transform block 14 (hereinafter DCT 14). The DCT 14 is a hardware, software, or hybrid hardware/software component that performs a discrete cosine transform on the data received from the summing node 12, in a manner well-known in the art. The result is the transform of the incoming video signal (one block of elements at a time) to a set of coefficients which are then input to quantizer 16. The quantizer 16 assigns one of a plurality of discrete values to each of the received coefficients, resulting in an amount of compression provided by the quantizer which depends on the number of quantization levels used by the quantizer (i.e. the "coarseness" of the quantization). Since the quantizer maps each coefficient to one of a finite number of quantization levels, there is an error introduced by the quantizer, the magnitude of which increases with a decreasing number of quantization levels.
In order to perform the desired interframe coding, the output of quantizer 16 is received by an inverse quantizer 17 and an inverse discrete cosine transform element (hereinafter "inverse DCT") 18. Inverse quantizer 17 maps the quantizer index into a quantizer representative level. The inverse DCT 18 is a hardware, software, or hybrid hardware/software component that performs an inverse discrete cosine transform on the data received from inverse quantizer 17, in a manner well-known in the art. This inverse transform decodes the coded data to create a reconstruction of the prediction error. The error introduced into the signal by quantizer 16 reduces the quality of the image which is later decoded, the reduced quality being a side effect of the data compression achieved through quantization.
The decoded version of the video signal is output by summing node 19, and is used by the coder 10 to determine variations in the video signal from frame to frame for generating the interframe coded signal. However, in the coder of FIG. 1, the decoded signal from summing node 19 is first processed using some form of motion compensation means (hereinafter "motion compensator") 20, which works together with motion estimator 21. Motion estimator 21 makes motion estimations based on the original input video signal, and passes the estimated motion vectors to both motion compensator 20 and entropy coder 23. These vectors are used by motion compensator 20 to build a prediction of the image by representing changes in groups of pixels using the obtained motion vectors. The motion compensator 20 may also include various filtering functions known in the art.
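The relationship between the coarseness of quantization and the error it introduces may be sketched as follows. The sketch assumes a uniform quantizer with a fixed step size, which the description above does not require; the function names, the step sizes, and the coefficient values are all hypothetical and chosen only for illustration.

```python
def quantize(coefficient, step):
    """Map a coefficient to an integer quantizer index (assumed uniform quantizer)."""
    return round(coefficient / step)

def dequantize(index, step):
    """Map a quantizer index back to its representative level."""
    return index * step

def max_error(coefficients, step):
    """Largest absolute difference between a coefficient and its reconstruction."""
    return max(abs(c - dequantize(quantize(c, step), step)) for c in coefficients)

# Hypothetical transform coefficients for one block.
coefficients = [-37.0, -4.0, 0.0, 9.0, 52.0]

fine_error = max_error(coefficients, step=4.0)     # many levels, small error
coarse_error = max_error(coefficients, step=16.0)  # few levels, larger error
```

A larger step leaves fewer quantization levels over the same range of coefficient values, so fewer bits are needed per index, while the worst-case reconstruction error grows to as much as half the step size.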
At summing node 12, a frame-by-frame difference is calculated, such that the output of summing node 12 is only pixel changes from one frame to the next. Thus, the data which is compressed by DCT 14 and quantizer 16 is only the interframe prediction error representing changes in the image from frame to frame. This compressed signal may then be transmitted over a network or other transmission media, or stored in its compressed form for later recall and decompression. Prior to transmission or storage, the interframe coded signal is also typically coded using entropy coder 22. The entropy coder provides still further compression of the video data by mapping the symbols output by the quantizer to variable length codes based on the probability of their occurrence. After entropy coding, the signal output from entropy coder 22 is transmitted along with the compressed motion vectors output from entropy coder 23.
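The mapping of quantizer output symbols to variable length codes according to their probability may be sketched with a Huffman construction. The use of Huffman coding is an assumption of this sketch, since the description above does not name a particular entropy code; the function `huffman_code_lengths` and the sample indices are hypothetical.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return a dict mapping each symbol to its Huffman code length in bits."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (count, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

# Hypothetical quantizer indices: small values occur far more often.
indices = [0] * 8 + [1] * 4 + [-1] * 2 + [3] + [-3]
lengths = huffman_code_lengths(indices)

variable_bits = sum(lengths[s] for s in indices)
fixed_bits = 3 * len(indices)  # 5 distinct symbols need 3 bits each at fixed length
```

In this sketch the most frequent index receives the shortest code, so the total number of bits falls below that of a fixed-length code for the same symbols.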
In practice, if a compressed video signal such as the one output from the coder of FIG. 1 is transmitted over unreliable channels (e.g. the internet, local area networks without quality of service (QoS) guarantees, or mobile radio channels), it is particularly vulnerable to transmission errors. Certain transmission errors have the characteristic of lowering the possible maximum throughput (i.e. lowering the channel capacity or "bandwidth") of the transmission medium for a relatively long period of time. Such situations might arise due to a high traffic volume on a store-and-forward network such as the internet, or due to an increasing distance between a transmitter and receiver of a mobile radio channel.
In order to maintain a real-time transmission of the video information in the presence of a reduced bandwidth, the transmitter must reduce the bit rate of the compressed video. Networks without QoS guarantees often provide messaging channels that allow the receiver or the network to request a lower transmission bit rate from the transmitter. For example, real-time protocol (RTP), designed by the Internet Engineering Task Force and now part of the ITU-T Draft International Standard H.225.0 "Media Stream Packetization and Synchronization on Non-Guaranteed Quality of Service LANs", can be used to "throttle" the transmitter bit rate. For a point-to-point transmission with real-time coding, the video source coder can usually accommodate the request for a reduced bit rate by using a coarser quantization, by reducing the spatial resolution of the frames of the video, or by periodically dropping video frames altogether. However, if the video has been coded and stored previously, the bit rate is chosen in advance, making such a request difficult to satisfy.
To accommodate the desire for a variable bit rate in the transmission of stored video, a "scalable" video representation is used. The term "scalable" is used herein to refer to the ability of a particular bitstream to be decoded at different bit rates. With scalable video, a suitable part of the bitstream can be extracted and decoded to yield a reconstructed video sequence with a quality lower than what could be obtained by decoding a larger portion of the bitstream. Thus, scalable video supports "graceful degradation" of the picture quality with decreasing bit rate.
In a video-on-demand server, the same original motion video sequence can be coded and stored at a variety of bit rates. When a request for the sequence is made to the server, the appropriate bit rate would be selected, taking into account the current capacity of the network. A problem arises, however, if it becomes necessary to change the bit rate during the transmission. The server may switch from a first bitstream having a first bit rate to a second bitstream having a second bit rate due to a different coarseness of quantization or different spatial resolution. However, if the sequences are interframe coded, the switchover produces annoying artifacts due to the difference in the image quality of the two bitstreams. These can be avoided by the regular use of intraframe coded frames (generally referred to as "I-frames"), in which the entire image is coded, rather than just the differences from the previous frame. The Moving Picture Experts Group (MPEG) standard (i.e. ITU-T H.262) calls for the regular inclusion of I-frames, typically every few hundred milliseconds. However, the use of I-frames, requiring a significant amount of data, dramatically increases the overall bit rate. For example, an I-frame might require six times as much data as an interframe coded frame. In such a case, coding every fifth frame as an I-frame would double the bit rate.
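The doubling of the bit rate in the foregoing example may be verified by simple arithmetic, sketched below. The function name `relative_bit_rate` is hypothetical, and the data required by one interframe coded frame is taken as the unit of measure.

```python
def relative_bit_rate(i_frame_cost, i_frame_period):
    """Average data per frame, in units of one interframe coded frame.

    i_frame_cost:   data for one I-frame, relative to an interframe coded frame
    i_frame_period: one frame in every i_frame_period frames is an I-frame
    """
    interframe_frames = i_frame_period - 1
    return (i_frame_cost + interframe_frames) / i_frame_period

# One I-frame costing 6 units plus four interframe coded frames costing
# 1 unit each gives 10 units per 5 frames, against 5 units without I-frames.
ratio = relative_bit_rate(i_frame_cost=6, i_frame_period=5)
```

With these figures the average is 2 units per frame, which is twice the rate of a sequence containing interframe coded frames only.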
U.S. Pat. No. 5,253,058, to Gharavi, discloses a scalable video architecture which uses a base layer and an enhancement layer (called a contribution layer) which must be encoded by a separate encoder. The method does not support different frame rates for the video at different quality levels but, rather, for different spatial resolutions. More importantly, in this method, the enhancement layer cannot be transmitted and decoded independently; it always requires the transmission and decompression of the base layer first. This makes bandwidth-adaptive serving a complicated task, leads to inefficient compression, and ultimately affects the performance of the whole system.
It is therefore an object of this invention to allow the coding of video sequences for storage and retrieval over networks without QoS guarantees, such that the bit rate provided by the server can be changed during the transmission of the sequence without resorting to the use of I-frames, but while minimizing artifacts produced by the different degrees of quantization used in coding different bitstreams at different bit rates.
The present invention avoids the aforementioned artifacts by providing a set of transition data that can be interframe decoded between decoding of a first bitstream (at a first bit rate) and a second bitstream (at a second bit rate). The transition data compensates for visual discrepancies between a decoded version of the first bitstream and a decoded version of the second bitstream. Thus, after a first bitstream has been decoded, the transition data is decoded, and then the second bitstream. The second bitstream provides a continuation of the video sequence that was begun with the first bitstream, and the transition data compensates for visual artifacts that would otherwise be present due to the difference in the bit rates of the first and second bitstreams.
In one embodiment of the invention, the transition data is created by periodically imputing the characteristics of a first (typically lower bit rate) bitstream to a second (typically next higher bit rate) bitstream. During interframe coding of the first bitstream, coded data is decoded and employed by the first bitstream coder for use in comparing to data in a subsequent frame, thus allowing the differences between the frames to be determined. The decoded (i.e., reconstructed) video signal has image characteristics due to the relatively coarse quantization used during coding of the first bitstream, or due to a different spatial resolution. This embodiment therefore uses the reconstructed signal as a source from which to periodically code a frame of the second bitstream. That is, while the second bitstream is normally coded directly from the analog video signal, frames of the signal are periodically coded using the signal reconstructed from the first bitstream. In effect, a lower bit rate frame is "inserted" into the higher bit rate data stream. These frames are therefore referred to herein as "lower bit rate insert frames" (LBIFs).
The LBIFs inserted into the second bitstream provide points of correspondence between the image data of the two bitstreams in that the effects of the coarser quantization (or different spatial resolution) of the first bitstream are periodically introduced to the second bitstream. These LBIFs therefore provide points in the temporal progression of the video sequence at which a change from one bitstream to the other may be made, without the introduction of any significant visual artifacts into the decoded video. Thus, when switching from the first bitstream to the second bitstream, it is most desirable to have the first frame received from the second bitstream be a frame that follows an LBIF. Similarly, when switching from the second bitstream to the first bitstream, it is desirable to have the last frame received from the second bitstream be an LBIF. In this way, the two frames will be as closely related as possible.
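The manner in which LBIFs bring the two bitstreams into correspondence may be sketched as follows. The sketch rests on several assumptions not found in the description: each frame is reduced to a single number, the transform and motion compensation are omitted, a uniform quantizer is used, and the name `encode_stream` and all numeric values are hypothetical.

```python
def encode_stream(frames, step, insert_source=None, insert_period=0):
    """Closed-loop interframe coding of scalar 'frames'.

    Returns the reconstructed values a decoder of this bitstream would hold.
    When insert_source is given, every insert_period-th frame is coded from
    insert_source (the lower-rate reconstruction) instead of the original.
    """
    prediction = 0.0
    reconstructed = []
    for t, frame in enumerate(frames):
        source = frame
        if insert_source is not None and insert_period and t % insert_period == 0:
            source = insert_source[t]  # this frame is an LBIF
        index = round((source - prediction) / step)  # quantized prediction error
        prediction += index * step                   # same update as the decoder
        reconstructed.append(prediction)
    return reconstructed

frames = [3.0, 11.0, 14.0, 22.0, 27.0, 29.0, 37.0, 41.0]
low = encode_stream(frames, step=8.0)  # coarse quantization, lower bit rate
high = encode_stream(frames, step=2.0, insert_source=low, insert_period=4)

# Mismatch between what the two decoders would hold at each time index.
mismatch = [abs(h - l) for h, l in zip(high, low)]
```

At the LBIF time indexes the mismatch is bounded by half the fine quantizer step, whereas at other time indexes it can be as large as the coarse quantization error, which is why a change of bitstream is best made at an LBIF.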
This embodiment of the invention preferably makes use of LBIFs in a video-on-demand server. Multiple bitstreams are stored to be decoded using different relative bit rates. For all but the bitstream having the lowest bit rate, LBIFs are periodically inserted into the bitstreams from the bitstream having the next lower bit rate. Thus, the server has the same video sequence at different bit rates, with LBIFs to enable switching between the bitstreams. As the server is streaming the video data at one bit rate, a request for a different bit rate (higher or lower) is satisfied by switching to another stored bitstream at the temporal point in the video sequence corresponding to an LBIF in the bitstream having the higher bit rate. Effectively seamless bit rate "throttling" is therefore accomplished with a minimization of artifacts.
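The selection of a switching point by the server may be sketched as follows for two stored bitstreams. The representation of LBIF locations as a sorted list of time indexes, and the names `next_switch_index` and `plan_transmission`, are assumptions made for illustration only.

```python
from bisect import bisect_left

def next_switch_index(lbif_indexes, request_index):
    """Return the first LBIF time index at or after the request, or None.

    lbif_indexes must be sorted; they are the LBIF locations in the
    higher bit rate bitstream of the pair.
    """
    pos = bisect_left(lbif_indexes, request_index)
    return lbif_indexes[pos] if pos < len(lbif_indexes) else None

def plan_transmission(num_frames, lbif_indexes, request_index, current, target):
    """List which bitstream supplies the frame at each time index."""
    switch = next_switch_index(lbif_indexes, request_index)
    if switch is None:
        return [current] * num_frames  # no LBIF remains, so no switch occurs
    # Frames up to and including the LBIF time index come from the current
    # bitstream; later frames come from the target bitstream.
    return [current if t <= switch else target for t in range(num_frames)]

lbif_indexes = [0, 10, 20, 30]
plan_up = plan_transmission(40, lbif_indexes, 13, current="low", target="high")
plan_down = plan_transmission(40, lbif_indexes, 13, current="high", target="low")
```

In both directions the change takes effect immediately after the LBIF time index: when switching up, the first frame taken from the higher bit rate bitstream follows an LBIF, and when switching down, the last frame taken from it is an LBIF.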
In an alternative embodiment, the multiple bitstreams are transmitted simultaneously over a transmission medium, such as the airwaves. The bitstreams are multiplexed together, and demultiplexed at the site of a decoder. With all of the bitstreams being available at the decoder location, the switching from one bitstream to the next is accomplished in the manner described above, only by switching between the received, demultiplexed bitstreams. Preferably, each frame of each bitstream is accompanied by coded data regarding the nature of the frame (i.e. whether it is a frame after which one may switch to a next higher bit rate bitstream, a next lower bit rate bitstream, or not at all).
In another alternative embodiment, the input video signal is periodically coded in intraframe mode, such that frames of data are generated which correspond to interframe coded frames of the lowest rate bitstream, but which include all of the data necessary to independently recreate that frame of the video sequence. This embodiment does not have the high level of data compression of the preferred embodiment, but allows for random access. LBIFs are used in the higher rate bitstreams as points at which one may switch between the bitstreams with a minimum of quantization-based artifacts. However, the intraframe coded frames allow a user to begin the video sequence at any of the temporal points corresponding to the location of the intraframe coded frames. If a higher bit rate is thereafter desired, the rate may be increased at the appropriate LBIF locations, as described above. This embodiment is also useful in that it allows for fast forward and fast rewind of the video sequence by displaying the intraframe coded frames only, thus allowing a user to search quickly through the video sequence.
In yet another embodiment of the invention, LBIFs are not inserted into the existing bitstreams. Instead, at least one (and typically a plurality of) "switch" frames are created. That is, transition data is stored on the server separate from the bitstreams containing the video data, and is used to provide an interframe decoding system with data that compensates for the difference in reconstructed frames of the two bitstreams. This compensation is typically for a given frame of video data at any point in time, each switch frame (or "S-frame") therefore providing a point of continuity between the bitstreams only for that frame. The S-frame is preferably the difference between the two bitstreams for similar frames. Since a given frame represents a "time index" (a specific temporal point in the video sequence), any difference between frames that are reconstructed for a given time index from the first and second bitstream comes from the different bit rates (e.g., a difference in quantization levels or spatial resolution). Thus, taking the difference between reconstructed frames of the same time index (or consecutive time indexes) for the two bitstreams provides the information necessary to compensate the decoder for bitstream transition related artifacts.
In one version of the S-frame embodiment, the S-frames do not have a common time index with a frame from each of the higher and lower bitstreams, and the coding of the difference between reconstructed frames is enhanced by motion compensation. Thus, the direction of transition (e.g., from the higher bit rate bitstream to the lower bit rate bitstream) determines which difference must be taken. That is, since the lower bit rate and upper bit rate frames used to construct the S-frame are from consecutive (not simultaneous) time indexes, it is necessary to subtract the motion compensated frame having the earlier time index from the frame having the later time index to generate the right S-frame. Therefore, if the S-frame is intended to create a point at which the decoding may change from the lower bit rate bitstream to the higher bit rate bitstream, the S-frame is generated by subtracting a motion compensated lower bit rate frame (having an earlier time index) from a higher bit rate frame (having a later time index).
If the S-frame is generated using frames from the lower bit rate bitstream and the higher bit rate bitstream that have the identical time index, a two-directional point of continuity is created between the bitstreams by the S-frame. In that case, motion compensation is omitted, and a single S-frame can be used to transition from either the lower bit rate bitstream to the higher bit rate bitstream, or vice versa. In such an embodiment, the transmitted S-frame has the same time index as the last frame of the first bitstream, and is transmitted before a first frame of the second bitstream, which typically has a subsequent time index. If the S-frame was created by subtracting a frame of the first bitstream from a frame of the second bitstream, it may be processed directly by the decoder. However, for an S-frame used to switch from the second bitstream to the first bitstream, the frame is first inverted by the decoder before being added. This ensures that the correct compensation is being provided by the S-frame.
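The two-directional use of an S-frame formed from frames having the identical time index may be sketched as follows. The sketch assumes that the S-frame is conveyed without loss, that each frame is a short list of pixel values, and that motion compensation is omitted as stated above; the names `make_s_frame` and `apply_s_frame` and the pixel values are hypothetical.

```python
def make_s_frame(first_recon, second_recon):
    """S-frame = reconstructed second-bitstream frame minus first-bitstream frame."""
    return [b - a for a, b in zip(first_recon, second_recon)]

def apply_s_frame(decoded, s_frame, invert=False):
    """Add the S-frame to the decoder's current frame.

    invert=True negates the S-frame first, as needed when switching from
    the second bitstream back to the first.
    """
    sign = -1 if invert else 1
    return [p + sign * d for p, d in zip(decoded, s_frame)]

# Hypothetical reconstructions of the same time index from each bitstream.
first_recon = [24, 32, 40, 16]   # first (lower bit rate) bitstream
second_recon = [25, 29, 41, 18]  # second (higher bit rate) bitstream

s_frame = make_s_frame(first_recon, second_recon)

switched_up = apply_s_frame(first_recon, s_frame)                  # first -> second
switched_down = apply_s_frame(second_recon, s_frame, invert=True)  # second -> first
```

After the S-frame is applied, the decoder holds the frame that the destination bitstream expects as its reference, so that decoding of the destination bitstream may continue from the next time index. In practice the S-frame would itself be coded, so the correspondence would be approximate rather than exact as in this sketch.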