1. Field of the Invention
The present invention generally relates to decoding of digital video image information and, more particularly, to decoding of digital motion video signals with high quality and arbitrary scaling and positioning while achieving low latency and minimal memory capacity requirements.
2. Description of the Prior Art
Transmission of data in digital form is generally favored over transmission in analog form at the present time in view of the inherent degree of noise immunity and potential for facilitating error correction and data compression. These qualities are particularly desirable in regard to image data where small artifacts caused by noise or transmission errors are readily evident to the eye. However, the volume of data that may be necessary to express a full range of color and intensity in an image at high resolution is very large and data compression is therefore important for acceptable performance of systems for communicating, manipulating and presenting still or moving images.
In fact, the amount of data in an image is so large that a high degree of compression is required for practical management of transmission and storage during decoding of digitized data, particularly for moving images or video programs. Accordingly, a number of data compression standards have been developed to deliver a sufficient degree of compression and allow image decoding within required response times. For example, a standard referred to as JPEG (Joint Photographic Experts Group) has been developed and widely adopted for compression of still images. This standard allows substantial flexibility in coding in order to allow an arbitrarily high degree of data compression with minimized degradation of image quality. Similarly, a standard known as MPEG (Moving Picture Experts Group) has been developed for coding sequences of images to be reproduced by a display device in sufficiently rapid succession to achieve the illusion of motion, referred to as motion video. The MPEG standard particularly exploits redundancy between frames or fields to achieve a higher degree of compression and higher decoding speed.
Due to differences in the implementation of these standards and the way image data is utilized between still and video images, decoding in accordance with the JPEG standard is usually implemented in software with a data processor and display such as a personal computer (PC) while decoding in accordance with the MPEG standard is generally implemented in hardware such as a so-called set-top box (STB) in connection with a television set or a computer monitor. A set-top box comprises an audio decoder and a video decoder for signals which are fed to them from a multiplexed transport signal carrying numerous (e.g. often several hundred or more) audio and video channels.
An audio and video channel pair is coupled together by control parameters in their respective data streams that permit re-synchronization of the two streams with respect to each other but with a substantially fixed transmission delay. In the video stream, at a basic level, the signal for re-synchronization of audio and video is referred to as the presentation time stamp (PTS).
By far the simplest and most widely implemented method to achieve a PTS synchronization point is to choose a fixed point in time relative to the video decode and display of a motion video image and to make a synchronization decision based upon whether the fixed point falls before or after an internal clock signal. The chosen fixed point typically is located between one video field and the next sequential video field and generally defines a frame of two fields which make up a single image of the motion video image sequence. Another typical point is the change between the decode of one compressed video picture and the decode of the next sequential compressed video picture. In most cases, these criteria result in the same point, depending on the decode latency for each picture.
At this PTS synchronization point, a decision is made to either continue with normal decode processing or to take corrective action such as omitting or skipping over an image if decoding is slow or repeating the output of an image if decoding is running too fast. To make this decision, the difference between the PTS and the internal clock signal is tested at the same point relative to the displayed picture and a confident decision can be made at every image to take corrective action as soon as excessive differences are detected.
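The decision described above can be sketched as follows. This is a minimal illustration only, not the logic of any particular decoder: the names `SyncAction` and `pts_sync_decision` and the `threshold` parameter are hypothetical, the time stamps are assumed to be plain integer tick counts on a common time base, and counter wrap-around is ignored.

```python
from enum import Enum


class SyncAction(Enum):
    CONTINUE = "continue normal decode"
    SKIP = "skip a picture"
    REPEAT = "repeat a picture"


def pts_sync_decision(pts: int, clock: int, threshold: int) -> SyncAction:
    """Decide what to do at the PTS synchronization point.

    pts       -- presentation time stamp of the picture, in clock ticks
    clock     -- internal clock value sampled at the synchronization point
    threshold -- largest tolerated |pts - clock| (assumed value, not from the text)
    """
    # Positive when the picture is due in the future, negative when it is overdue.
    diff = pts - clock
    if diff < -threshold:
        # The picture is late: decoding is slow, so omit a picture to catch up.
        return SyncAction.SKIP
    if diff > threshold:
        # The picture is early: decoding is running fast, so repeat the output.
        return SyncAction.REPEAT
    return SyncAction.CONTINUE
```

Because the test is made at the same point relative to every displayed picture, the same threshold can be applied at each picture without compensating for where in the field the comparison happens.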
The above synchronization technique has worked well for full-screen images decoded by early video decoders. However, some functions such as arbitrarily scaled images and picture-in-picture displays at arbitrary screen locations, which were developed and provided in television sets receiving analog transmissions which do not require significant decoding, are now being demanded in STBs. Arbitrary change of dimensions (e.g. image aspect ratio) has also been demanded (particularly for so-called letterbox format presentation of uncropped movie film frames which are desirably presented in a band across the center of a television screen and well above the bottom of the image area). That is, it is now required that the STB display a scaled image or picture-in-picture of any desired size, shape and location on the display screen in addition to the data processing intensive decoding of images from compressed data.
As is understood by those skilled in the art, frame structured compressed image data is decoded in an order which is different from the order in which the image elements are presented in a scanned display pattern such as an interlaced raster. Therefore, decoded data must be stored in the STB for a period which varies within a field or frame until the proper location is reached in the scan pattern. The maximum storage period or latency of data from decoder to display is referred to as decode latency or decoder latency and is important in determining the capacity and configuration of memory required to perform the required storage and read out of data in synchronism with the display while decoding time may vary widely.
The functionality requirement of arbitrarily scaled and positioned image presentation has important implications in regard to decoder and display synchronization and image quality. As can be appreciated by those skilled in the art, if the bottom of any arbitrarily scaled and located image is significantly above the bottom of the display frame, a correspondingly significant amount of decoding time is lost since the entire image must be decoded and processing for scaling and positioning performed well before the end of the display frame or field (where the PTS synchronization point is typically located) is reached. Further, as the picture position rises on the screen, the time gap between the end of the picture display and the PTS synchronization point, also referred to as the frame switch point, increases as does the error in the calculation for monitoring PTS synchronization.
For example, in a PAL, NTSC or other standard system using 1.5 frames of latency between decode and display of reference images, often referred to as “I” (independent) or “P” (predicted) images (which are fully decoded rather than wholly or partially interpolated or predicted from prior and/or subsequent images), there is only 0.5 frames of latency between decode and display of bidirectionally interpolated (“B”) images, for which a future image must be decoded to enable interpolation. The PTS synchronization point, where a decision is made whether or not to decode the next field or frame, is located at the vertical retrace interval of the display. Therefore, it can be easily understood that if the bottom of the scaled and/or positioned image is above the bottom of the raster display area, the decoding of a B image must be completed significantly before the vertical retrace interval and frame switch PTS point to avoid corruption of the top (e.g. first displayed) field of the B picture.
That is, in the case of a frame structured encoded picture where both the top and bottom fields (sometimes referred to as odd and even fields) are decoded simultaneously, the image B0 (an interpolated image using both preceding and following images) is required to be fully decoded significantly before the next frame switch point to maintain a 0.5 image latency, without which the top field of the B0 picture being displayed in an interlaced fashion would be corrupted. In other words, the decoded image must be buffered for 0.5 frames (one field) in order to read out data in the proper order for display because both fields of a frame are decoded concurrently and the decoding must be complete by the time the lower edge of the top field of the scaled image is reached for interlaced display. The lost decode time in this case is inherent for a decode latency of 0.5 and increases as the picture placement rises on the screen.
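The growth of lost decode time as the picture rises on the screen can be illustrated with a deliberately simplified model. It assumes that the frame switch point coincides with the end of the field, that every scan line takes the same time, and that lines are counted from the top of the field; the function name and parameters are hypothetical.

```python
def lost_decode_fraction(picture_bottom_line: float, lines_per_field: float) -> float:
    """Fraction of a field period lost for decoding.

    The loss is modeled as the gap between the scan line at which the bottom
    of the scaled picture is displayed and the frame switch point, which is
    assumed here to fall at the end of the field.
    """
    if lines_per_field <= 0:
        raise ValueError("lines_per_field must be positive")
    if not 0 <= picture_bottom_line <= lines_per_field:
        raise ValueError("picture_bottom_line must lie within the field")
    return (lines_per_field - picture_bottom_line) / lines_per_field
```

Under this model a picture whose bottom edge sits at the bottom of the field loses nothing, while a picture whose bottom edge sits halfway up the field loses half a field period of decode time.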
Longer decode latency is a possible solution although it may complicate synchronization and, more importantly, carries substantial hardware costs that are commercially unacceptable to provide the amount of memory required to store the decoded image (and audio) data for the increased period. For example, to increase latency from 1.5 frames to 2.0 frames in the above scenario would increase the amount of high speed access memory required by at least an amount to contain a display field and, at a minimum, similarly increase the amount of circuitry necessary to control and appropriately access the memory, particularly during corrective action for synchronization even in the rudimentary forms described above.
Further, in order to maintain the image decode latency of only 0.5 for partially interpolated pictures, the size of the frame buffer must be increased as the size of the displayed result is decreased and the upper boundary of the scaled image is made to appear lower on the display. The specific component of the frame buffer that must be increased in capacity is referred to as a spill buffer and is disclosed in U.S. Pat. No. 5,576,765 which is hereby fully incorporated by reference.
In essence, a spill buffer is required even for full frame images since new decoded data cannot be stored until the previous frame/field data has been read out, thus requiring an increase of frame buffer capacity for an interval equal to the vertical blanking interval, any top border of the field and a number of scan lines corresponding to a macroblock band (usually sixteen lines) into which the image is divided for compression. For arbitrary scaling and positioning where the top border could approach the frame height, a spill buffer capacity in excess of a full field would be required, which is substantially the same amount of extra storage as would be required for 2.0 frame latency. If a spill buffer is not provided, this period, subsequent to the PTS timing, would also be lost for decoding purposes since no memory locations would be available to receive decoded data in the frame buffer. Again, provision of such amounts of memory as are required by spill buffers in STBs is prohibitive and will be so well into the future.
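The interval that a spill buffer must cover can be tallied in scan lines as a rough sketch. The function names and the example line counts are assumptions for illustration, all quantities are taken to be in lines of a single field, and the comparison against the field height assumes that decoded data arrives at roughly the rate at which it is displayed.

```python
MACROBLOCK_BAND_LINES = 16  # usual macroblock band height given in the text


def spill_interval_lines(vblank_lines: int, top_border_lines: int,
                         band_lines: int = MACROBLOCK_BAND_LINES) -> int:
    """Scan lines during which decoded data has nowhere to go without a spill buffer."""
    if min(vblank_lines, top_border_lines, band_lines) < 0:
        raise ValueError("line counts must be non-negative")
    return vblank_lines + top_border_lines + band_lines


def spill_fraction_of_field(vblank_lines: int, top_border_lines: int,
                            active_lines_per_field: int) -> float:
    """Spill interval expressed as a fraction of the active field height."""
    if active_lines_per_field <= 0:
        raise ValueError("active_lines_per_field must be positive")
    return spill_interval_lines(vblank_lines, top_border_lines) / active_lines_per_field
```

With an assumed 20 blanking lines and no top border the interval is 36 lines, whereas a top border of 220 lines in an assumed 240-line field pushes the interval past a full field, consistent with the observation that capacity in excess of a field would be needed.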
Additionally, as images are scaled to smaller sizes the image quality tends to degrade since the frame buffers are constructed with a finite number of taps to subsample the image data. The number of taps is not increased and cannot be effectively increased as the picture is scaled. Therefore, the resolution of the picture becomes relatively lower as size is reduced. Increasing the number of taps causes a significant increase in memory access bandwidth required to feed original picture data to additional taps. Variation of the number of effective taps with picture size that is desirably continuously variable is also prohibitively complicated for inclusion in STBs.
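The effect of a fixed tap count can be illustrated with a simple coverage estimate. This is a sketch under stated assumptions rather than a description of any actual scaler: `decimation` is the ratio of source samples to output samples along one axis, each output sample is assumed to read exactly `num_taps` adjacent source samples, and both function names are hypothetical.

```python
import math


def source_coverage(num_taps: int, decimation: float) -> float:
    """Approximate fraction of source samples that contribute to the output."""
    if num_taps < 1:
        raise ValueError("num_taps must be at least 1")
    if decimation < 1.0:
        raise ValueError("decimation must be >= 1 (downscaling only)")
    # Each output sample spans `decimation` source samples but reads only
    # `num_taps` of them, so coverage saturates at 1.0.
    return min(1.0, num_taps / decimation)


def taps_for_full_coverage(decimation: float) -> int:
    """Taps (and source reads per output sample) needed to touch every source sample."""
    if decimation < 1.0:
        raise ValueError("decimation must be >= 1 (downscaling only)")
    return math.ceil(decimation)
```

Under these assumptions a four-tap filter covers every source sample at 2:1 reduction but only half of them at 8:1, and restoring full coverage at 8:1 would take eight reads per output sample, which is the memory bandwidth cost noted above.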
Due to the memory requirements alluded to above and the difficulty of increasing speed of decoding, at some point, achieving display of small and high-positioned images requires an adjustment between decode and display latency. One approach is disclosed in “MPEG Video Decoder with Integrated Scaling and Display Functions”, now U.S. Pat. No. 6,470,051, which is hereby fully incorporated by reference. In that approach, a fixed amount of scaling was performed in the decoding path and the scaled image simply placed on the screen. Performing scaling in the decoding path reduces the amount of required data storage capacity in the frame buffer to the point that a spill buffer may not be used and its size thus limited as may be dictated by economic constraints. However, this function has substantially less than the full desired flexibility of continuous and arbitrarily variable image scaling.
In summary, decoding of B images is the most difficult among the types of image an STB may be required to decode because they are the most compressed and require a greater number of references to previous and future decoded images in the decoding process. Decoding of B images also presents the most demanding requirements for scheduling buffering and output of the decoded image data. Therefore the loss of decoding time due to scaling and positioning is most critical for B images which make up a substantial fraction of the images in a compressed motion picture sequence and effectively prevents flexible scaling and positioning in a commercially feasible STB at the present state of the art.