Pixonics High Definition (PHD) significantly improves perceptual detail of interpolated digital video signals with the aid of a small amount of enhancement side information. In its primary application, PHD renders the appearance of High Definition Television (HDTV) picture quality from a Standard Definition Television (SDTV) coded DVD movie which has been optimized, for example, for a variable bitrate average around 6 mbps (megabits-per-second), while the multiplexed enhancement stream averages approximately 2 mbps.
In 1953, the NTSC broadcast system added a scalable and backwards-compatible color sub-carrier signal to the then widely deployed 525-line black-and-white modulation standard. Newer television receivers that implemented NTSC were equipped to decode the color enhancement signal, and then combine it with the older black-and-white component signal in order to create a full color signal for display. At the same time, neither the installed base of older black-and-white televisions, nor the newer black-and-white only televisions designed with foreknowledge of NTSC would need color decoding circuitry, nor would be noticeably affected by the presence of the color sub-carrier in the modulated signal. Other backwards-compatible schemes followed NTSC.
Thirty years later, PAL-Plus (ITU-R BT.1197) added a sub-carrier to the existing PAL format that carries additional vertical definition for letterboxed video signals. Only a few scalable analog video schemes have been deployed, but scalability has been more widely adopted in audio broadcasting. Like FM radio, the North American MTS stereo (BTSC) audio standards for television added a sub-carrier to modulate the stereo difference signal, which when matrix converted back to discrete L+R channels, could be combined in advanced receivers with the mono carrier to provide stereo audio.
In most cases, greater spectral efficiency would have resulted if the encoding and modulation schemes had been replaced with state-of-the-art methods of the time that provided the same features as the scalable schemes. However, each new incompatible approach would have displaced the installed base of receiving equipment, or required spectrum inefficient simulcasting. Only radical changes in technology, such as the transition from analog to digital broadcast television, have prompted simultaneous broadcasting (“simulcasting”) of related content, or outright replacement of older equipment.
Attempts to divide a compressed video signal into concurrent scalable signals containing a base and at least one enhancement layer have been under development since the 1980's. However, unlike analog, no digital scalable scheme has been deployed in commercial practice, largely due to the difficulties and overheads created by the scalable digital signals. The key reason perhaps is found in the very nature in which the respective analog and digital consumer distribution signals are encoded: analog spectra have regular periods of activity (or inactivity) where the signal can be cleanly partitioned, while digital compressed signals have high entropy and irregular time periods over which content is modulated.
Analog signals contain a high degree of redundancy, owing to their intended memory-less receiver design, and can therefore be efficiently sliced into concurrent streams along arbitrary boundaries within the signal structure. Consumer digital video distribution streams such as DVD, ATSC, DVB, Open Cable, etc., however, apply the full toolset of MPEG-2 for the coded video representation, removing most of the accessible redundancy within the signal, thereby creating highly variable, long-term coding dependencies within the coded signal. This leaves fewer clean dividing points for scalability.
The sequence structure of different MPEG picture coding types (I, P, B) has a built-in form of temporal scalability, in that the B pictures can be dropped with no consequence to other pictures in the sequence. This is possible due to the rule that no other pictures are dependently coded upon any B picture. However, the instantaneous coded bitrate of pictures varies significantly from one picture to another, so the temporal scalable benefits of discrete streams are not provided by a single MPEG bitstream with B-pictures.
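The built-in temporal scalability described above can be sketched in a few lines. This is only an illustration: the picture records, the function names and the coded sizes are hypothetical, not taken from any real bitstream.

```python
# Hypothetical picture record: (coding type, coded size in bits).
# The sizes below are illustrative only, not measured data.
def drop_b_pictures(pictures):
    """Keep I and P pictures; no other picture is predicted from a B picture."""
    return [p for p in pictures if p[0] != "B"]

def total_bits(pictures):
    """Sum of the coded sizes of a list of pictures."""
    return sum(size for _, size in pictures)

gop = [("I", 400_000), ("B", 40_000), ("B", 45_000), ("P", 150_000),
       ("B", 38_000), ("B", 42_000), ("P", 140_000)]
base = drop_b_pictures(gop)

print([t for t, _ in base])                # picture types that remain decodable
print(total_bits(base), total_bits(gop))   # bits kept versus bits in the full sequence
```

The remaining pictures still differ widely in size, which is the point made above: dropping B pictures thins the sequence but does not turn it into a constant-rate stream.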
The size of each coded picture is usually related to the content, or rate of change of content in the case of temporally predicted areas of the picture. Scalable streams modulated on discrete carriers, for the purposes of improved broadcast transmission robustness, are traditionally designed for constant payload rates, especially when a single large video signal, such as HDTV, occupies the channel. Variable Bit Rate (VBR) streams provide in practice 20% more efficient bit utilization that especially benefits a statistical multiplex of bitstreams.
Although digital coded video for consumer distribution is only a recent development, and the distribution mediums are undergoing rapid evolution, such as higher density disks, improved modems, etc., scalable schemes may bridge the transition period between formats.
The Digital Versatile Disc (DVD), a.k.a. “Digital Video Disc,” format is divided into separate physical, file systems, and presentation content specifications. The physical and file formats (Micro-UDF) are common to all applications of DVD (video, audio only, computer file). Video and audio-only have their respective payload specifications that define the different data types that consume the DVD storage volume.
The video application applies MPEG-2 Packetized Elementary Streams (PES) to multiplex at least three compulsory data types. The compulsory stream types required by DVD Video are: MPEG-2 Main Profile @ Main Level (standard definition only) for the compressed video representation; Dolby AC-3 for compressed audio; a graphic overlay (sub-picture) format; and navigation information to support random access and other trick play modes. Optional audio formats include: raw PCM; DTS; and MPEG-1 Layer II. Because elementary streams are encapsulated in packets, and a systems demultiplexer with buffering is well defined, it is possible for arbitrary streams types to be added in the future, without adversely affecting older players. It is the role of the systems demultiplexer to pass only relevant packets to each data type specific decoder.
Future supplementary stream types envisioned include “3D” stereo vision, metadata for advanced navigation, additional surround-sound or multilingual audio channels, interactive data, and additional video streams (for supporting alternate camera angles) that employ more efficient, newer generation video compression tools.
Two major means exist for multiplexing supplementary data, such as enhancement stream information of this invention, in a backwards-compatible manner. These means are not only common to DVD, but many other storage mediums and transmission types including D-VHS, Direct Broadcast Satellite (DBS), digital terrestrial television (ATSC & DVB-T), Open Cable, among others. As the first common means, the systems stream layer multiplex described above is the most robust solution since the systems demultiplexer, which comprises a parser and buffer, is capable of processing streams at highly variable rates without consequence to other stream types multiplexed within the same systems stream. Further, the header of these system packets carry a unique Registered ID (RID) that, provided they are properly observed by the common users of the systems language, uniquely identify the stream type so that no other data type could be confused for another, including those types defined in future. SMPTE-RA is such an organization charged with the responsibility of tracking the RID values.
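As a rough sketch of why the systems-layer multiplex is backwards compatible, the following models a demultiplexer that forwards only the stream types a given player knows about. The packet tuples and the stream identifiers ("video", "audio", "enh") are hypothetical stand-ins for real packet headers and registered IDs.

```python
from collections import defaultdict

def demultiplex(packets, known_stream_ids):
    """Route each payload to its stream's queue; packets of unknown type are skipped."""
    queues = defaultdict(list)
    for stream_id, payload in packets:
        if stream_id in known_stream_ids:
            queues[stream_id].append(payload)
    return dict(queues)

# Hypothetical multiplex containing an enhancement stream alongside video and audio.
packets = [("video", b"v0"), ("enh", b"e0"), ("audio", b"a0"),
           ("video", b"v1"), ("enh", b"e1")]

legacy = demultiplex(packets, {"video", "audio"})            # older player
enhanced = demultiplex(packets, {"video", "audio", "enh"})   # enhancement-aware player
```

The older player never sees the enhancement payloads, while the newer player receives them in multiplex order.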
The other, second means to transport supplementary data, such as enhancement data of the invention, is to embed such data within the elementary video stream. The specific such mechanisms available to MPEG-1 and MPEG-2 include user_data( ), extension start codes, and reserved start codes. Other coding languages also have their own means of embedding such information within the video bitstream. These mechanisms have been traditionally employed to carry low-bandwidth data such as closed captioning and teletext. Embedded extensions provide a simple, automatic means of associating the supplementary data with the intended picture the supplementary data relates to, since these embedded transport mechanisms exist within the data structure of the corresponding compressed video frame. Thus, if a segment of enhancement data is found within a particular coded picture, then it is straightforward for a semantic rule to assume that such data relates to the coded picture with which the data was embedded. The drawbacks are, first, that there is no recognized registration authority for these embedded extensions, and thus collisions between users of such mechanisms can arise, and second, that the supplementary data must be kept to a minimum data rate. ATSC and DVD have made attempts to create unique bit patterns that essentially serve as the headers and identifiers of these extensions, and register the ID's, but it is not always possible to take a DVD bitstream and have it translate directly to an ATSC stream.
Any future data stream or stream type therefore should have a unique stream identifier registered with, for example, SMPTE-RA, ATSC, DVD, DVB, OpenCable, etc. The DVD author may then create a Packetized Elementary Stream with one or more elementary streams of this type.
Although the sample dimensions of the standard definition format defined by the DVD video specification are limited to 720×480 and 720×576 (NTSC and PAL formats, respectively), the actual content of samples may be significantly less due to a variety of reasons.
The foremost reason is the “Kell Factor,” which effectively limits the vertical content to approximately somewhere between ⅔ and ¾ response. Interlaced displays have a perceived vertical rendering limit between 300 and 400 vertical lines out of a total possible 480 lines of content. DVD video titles are targeted primarily towards traditional 480i or 576i displays associated with respective NTSC and PAL receivers, rather than more recent 480p or computer monitors that are inherently progressive (the meaning of “p” in 480p). A detailed description of the Kell Factor can be found in the books “Television Engineering Handbook” by Wilkonson et al, and “Color Spaces” by Charles Poynton. A vertical reduction of content is also a certain measure to avoid the interlace flicker problem implied by the Kell Factor. Several stages, such as “film-to-tape” transfer, can reduce content detail. Interlace cameras often employ lenses with an intentional vertical low-pass filter.
Other, economical reasons favor moderate content reduction. Pre-processing stages, especially low-pass filtering, prior to the MPEG video encoder can reduce the amount of detail that would need to be prescribed by the video bitstream. Assuming the vertical content is already filtered for anti-flicker (Kell Factor), filtering along the horizontal direction can further lower the average rate of the coded bitstream by a factor approximately proportional to the strength of the filtering. A 135 minute long movie would have an average bitrate of 4 mbps if it were to consume the full payload of a single-sided, single-layer DVD (volume of 4.7 billion bytes). However, encoding of 720×480 interlace signals has been shown to require sustained bitrates as high as 7 or 8 mbps to achieve transparent or just-noticeable-difference (JND) quality, even with a well-designed encoder. Without pre-filtering, a 4 mbps DVD movie would likely otherwise exhibit significant visible compression artifacts. The measured spectral content of many DVD titles is effectively less than 500 horizontal lines wide (out of 720), and thus the total product (assuming 350 vertical lines) is only approximately half of the potential information that can be expressed in a 720×480 sample lattice. It is not surprising then that such content can fit into half the bitrate implied at least superficially by the sample lattice dimensions.
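The two figures in this paragraph can be checked with a short calculation. The sketch below uses hypothetical helper names; the whole-disc rate it computes covers every stream on the disc, so the split between video and the other streams is an assumption that the text does not spell out.

```python
DISC_BYTES = 4.7e9      # single-sided, single-layer DVD volume
MOVIE_MINUTES = 135

def average_mbps(total_bytes, minutes):
    """Average rate in megabits per second if the payload fills the given duration."""
    return total_bytes * 8 / (minutes * 60) / 1e6

def content_fraction(eff_w, eff_h, lattice_w=720, lattice_h=480):
    """Share of the sample lattice occupied by the effective resolution."""
    return (eff_w * eff_h) / (lattice_w * lattice_h)

total_rate = average_mbps(DISC_BYTES, MOVIE_MINUTES)   # all streams combined
fraction = content_fraction(500, 350)                  # effective 500x350 content

print(f"{total_rate:.2f} mbps for the whole disc payload")
print(f"{fraction:.3f} of the 720x480 lattice")
```

The whole-disc average comes out a little above 4.6 mbps, which is consistent with roughly 4 mbps for video if the remainder is taken up by audio and other streams, and the 500×350 product is just over half of the 720×480 lattice.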
The impact of this softening is minimized by the fact that most 480i television monitors are not capable of rendering details within the Nyquist limits of 720×480. The displays are likely optimized for an effective resolution of 500×350 or worse. Potentially, anti-flicker filters, as commonly found in computer-to-television format converters, could be included in every DVD decoder or player box, thus allowing true 480 “p” content to be encoded on all DVD video discs. Such a useful feature was neither given as a mandate nor suggested as an option in the original DVD video specification. The DVD format was essentially seen as a means to deliver the best standard definition signals of the time to consumers.
Prior art interpolation methods can interpolate a standard definition video signal to, for example, a high definition display, but do not add or restore content beyond the limitations of the standard-definition sampling lattice. Prior art methods include, from simplest to most complex: sample replication (“zero order hold”), bi-linear interpolation, poly-phase filters, spline fitting, POCS (Projection on Convex Sets), and Bayesian estimation. Inter-frame methods such as super-resolution attempt to fuse sub-pixel (or “sub-sample”) detail that has been scattered over several pictures by aliasing and other diffusion methods, and can in fact restore definition above the Nyquist limit implied by the standard definition sampling lattice. However such schemes are computationally expensive, non-linear, and do not always yield consistent quality gains frame-to-frame.
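The two simplest methods in this list can be sketched in one dimension as follows. The function names are hypothetical, and the linear case is shown on a single row of samples rather than as full bi-linear interpolation of a picture.

```python
def replicate(samples, factor):
    """Sample replication (zero order hold): repeat each sample `factor` times."""
    return [s for s in samples for _ in range(factor)]

def linear_interpolate(samples, factor):
    """Insert linearly weighted values between each pair of neighbouring samples."""
    out = []
    for i in range(len(samples) - 1):
        a, b = samples[i], samples[i + 1]
        for k in range(factor):
            out.append(a + (b - a) * k / factor)
    out.append(samples[-1])   # the last input sample has no right-hand neighbour
    return out

held = replicate([10, 20], 2)
ramp = linear_interpolate([10, 20, 40], 2)
```

Both outputs contain only values derived from the input samples, which illustrates the limitation noted above: neither method adds content beyond what the original lattice carried.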
The essential advantage of a high-resolution representation is that it is able to convey more of the actual detail of a given content than a low-resolution representation. The motivation of providing more detail to the viewer is that it improves enjoyment of the content, such as the quality difference experienced by viewers between the VHS and DVD formats.
High Definition Television (HDTV) signal encoding formats are a direct attempt to bring truly improved definition, and detail, inexpensively to consumers. Modern HDTV formats range from 480p up to 1080p. This range implies that content rendered at such resolutions has anywhere from two to six times the definition as the traditional, and usually diluted, standard definition content. The encoded bitrate would also be correspondingly two to six times higher. Such an increased bitrate would not fit onto modern DVD volumes with the modern MPEG-2 video coding language. Modern DVDs already utilize both layers, and have only enough room left over for a few short extras such as documentaries and movie trailers.
Either the compression method or the storage capacity of the disc would have to improve to match the increase in definition and corresponding bitrate of HDTV. Fortunately both storage and coding gains have been realized. For example, H.264 (a.k.a. MPEG-4 Part 10 “Advanced Video Coder”) has provided a nominal 2× gain in coding efficiency over MPEG-2. Meanwhile, blue-laser recording has increased disc storage capacity by at least 3× over the original red-laser DVD physical format. The minimal combined coding and physical storage gain factor of 6:1 means that it is possible to place an entire HDTV movie on a single-sided, single-layer disc, with room to spare.
A high-definition format signal can be expressed independently (simulcast) or dependently (scalable) with respect to a standard-definition signal. The simulcast method codes the standard definition and high definition versions of the content as if they were separate, unrelated streams. Streams that are entirely independent of each other may be multiplexed together, or transmitted or stored on separate mediums, carriers, and other means of delivery. The scalable approach requires the base stream (standard definition) to be first decoded, usually one frame at a time, by the receiver, and then the enhancement stream (which generally contains the difference information between the high definition and standard definition signals) to be decoded and combined with the frame. This may be done piecewise, as for example, each area of the base picture may be decoded just in time prior to the addition of the enhancement data. Many implementation schedules between the base and enhancement steps are possible.
The simulcast approach is cleaner, and can be more efficient than enhancement coding if the tools and bitrate ratios between the two are not tuned properly. Empirical data suggests that some balance of rates should exist between the base and enhancement layers in order to achieve optimized utilization of bits. Thus if a data rate is required to achieve some picture quality for the base layer established by the installed base of DVD players, for example, then the enhancement layer may require significantly more bits in order to achieve a substantial improvement in definition.
In order to lower the bitrate of the enhancement layer, several tricks can be applied that would not noticeably impact quality. For example, the frequency of intra pictures can be decreased, but at the tradeoff of reduced robustness to errors, greater IDCT drift accumulation, and reduced random access frequency.
Previous scalable coding solutions have not been deployed in main-stream consumer delivery mediums, although some forms of scalability have been successfully applied to internet streaming. With the exception of temporal scalability (FIG. 2e) that is inherently built-in all MPEG bitstreams that utilize B-frames, the spatial scalable scheme (FIG. 2d), SNR scalable (FIG. 2c) and Data Partitioning schemes documented in the MPEG-2 standard have all incurred a coding efficiency penalty rendering scalable coding efficiency little better, or even worse, than the total bandwidth consumed by the simulcast approach (FIG. 2b). The reasons behind the penalties have not been adequately documented, but some of the known factors include: excessive block syntax overhead incurred when describing small enhancements, and re-circulation of quantization noise between the base and enhancement layers.
FIG. 2a establishes the basic template where, in subsequent figures, the different scalable coding approaches most fundamentally differ in their structure and partitioning. Bitstream Processing (BP) 2010 includes those traditional serially dependent operations that have a varying density of data and hence variable complexity per coding unit, such as stream parsing, Variable Length Decoding (VLD), Run-Length Decoding (RLD), and header decoding. Inverse Quantization (IQ) is sometimes placed in the BP category if only the non-zero transform coefficients are processed rather than applying a matrix operation upon all coefficients. Digital signal processing (DSP) 2020 operations however tend to be parallelizable (e.g. SIMD scalable), and have regular operations and complexity. DSP includes IDCT (Inverse Discrete Cosine Transform) and MCP (Motion Compensated Prediction). Reconstructed blocks 2025 are stored 2030 for later display processing (4:2:0 to 4:2:2 conversion, image scaling, field and frame repeats) 2040, and to serve as reference for prediction 2031. From the bitstream 2005, the BP 2010 produces Intermediate decoded bitstream 2015 comprising arrays of transform coefficients, reconstructed motion vectors, and other directives that when combined and processed through DSP produce the reconstructed signal 2025.
FIG. 2b demonstrates the “simulcast” case of two independent streams and decoders that optionally, through multiplexer 2136, feed the second display processor 2140. The most typical application fitting the FIG. 2b paradigm is a first decoder system for SDTV, and a second decoder system for HDTV. Notably, the second decoder's BP 2110 and DSP 2120 stages do not depend upon state from the first decoder.
The scalable schemes are best distinguished by what processing stages and intermediate data they relate with the base layer. The relation point is primarily application-driven. FIG. 2c illustrates frequency layering, where the relation point occurs at the symbol stages prior to DSP (symbols are an alternate name for bitstream elements). In block based transform coding paradigms, the symbol stream is predominately in the frequency domain, hence frequency layering. The enhanced intermediate decoded symbols 2215 combined with the intermediate decoded base symbols 2015 create a third intermediate symbol stream 2217 that is forward-compatible decodable, in this example, by the base layer DSP decoder 2220. The combined stream appears as an ordinary base layer stream with increased properties (bitrate, frame rate, etc.) over the base stream 2005. Alternatively, the enhanced DSP decoder could have tools not present in the base layer decoder DSP, and the combined stream 2217 would, depending on the tool combination and performance level, therefore only be backward-compatible (assuming the enhanced DSP is a superset of the base DSP). SNR scalability and Data partitioning are two known cases of frequency layering that produce forward-compatible intermediate data streams 2217 decodable by base layer DSP stages 2020. Frequency layering is generally chosen for robustness over communications mediums.
In a forward-compatible application example of frequency layering, detailed frequency coefficients that could be added directly to the DCT coefficient block would be encoded in the enhancement stream, and added 2216 to the coefficients 2015 to produce a higher fidelity reconstructed signal 2225. The combined stream 2217 resembles a plausible base layer bitstream coded at a higher rate, hence the forward compatible designation. Alternatively, a backward-compatible example would be an enhancement stream that inserted extra chrominance blocks into the bitstream in a format only decodable by the enhanced DSP decoder. The original Progressive JPEG mode and the more recent JPEG-2000 are examples of frequency layering.
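The forward-compatible case, where enhancement coefficients are added directly to the base coefficients before the DSP stage, might be sketched as follows. The 2×2 block, the quantizer step and the function names are hypothetical simplifications; real MPEG-2 blocks are 8×8 and use weighted quantization matrices.

```python
def quantize(block, step):
    """Coarse uniform quantization, standing in for the base layer's lossy coding."""
    return [[round(c / step) * step for c in row] for row in block]

def subtract(a, b):
    """Element-wise difference of two coefficient blocks."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def combine_coefficients(base, enhancement):
    """Add enhancement coefficients to base coefficients, element by element."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(base, enhancement)]

original = [[100, -37], [22, 5]]            # illustrative transform coefficients
base = quantize(original, 16)               # what the base layer conveys
enhancement = subtract(original, base)      # what the enhancement stream conveys
combined = combine_coefficients(base, enhancement)
```

The combined block has the same shape as a base layer block, only with finer values, which is why a base layer DSP stage could decode it unchanged.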
Spatial scalability falls into the second major scalable coding category, spatial layering, whose basic decoding architecture is shown in FIG. 2d. The spatial scalability paradigm exploits the base layer spatial-domain reconstruction 2025 as a predictor for the enhanced reconstruction signal 2327, much like previously reconstructed pictures serve as reference 2031 for future pictures (only the reference pictures in this example are, as an intermediate step, scaled in resolution). A typical application would have the base layer contain a standard definition (SDTV) signal, while the enhancement layer would encode the difference between the scaled high definition (HDTV) and standard definition reconstruction 2025 scaled to match the lattice of 2325.
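A minimal one-dimensional sketch of spatial layering follows, assuming sample replication as the scaling step and lossless residual coding. Both assumptions, and all names and sample values, are illustrative only; a practical system would use a better interpolator and would quantize the residual.

```python
def upsample(row, factor):
    """Scale the base reconstruction to the enhancement lattice by replication."""
    return [s for s in row for _ in range(factor)]

def encode_enhancement(hd_row, sd_recon, factor):
    """Encoder side: residual between the HD signal and the scaled base reconstruction."""
    prediction = upsample(sd_recon, factor)
    return [h - p for h, p in zip(hd_row, prediction)]

def decode_enhanced(sd_recon, residual, factor):
    """Decoder side: scaled base reconstruction plus the decoded residual."""
    prediction = upsample(sd_recon, factor)
    return [p + r for p, r in zip(prediction, residual)]

hd_row = [10, 14, 20, 26]      # illustrative high definition samples
sd_recon = [12, 23]            # illustrative base layer reconstruction
residual = encode_enhancement(hd_row, sd_recon, 2)
restored = decode_enhanced(sd_recon, residual, 2)
```

The better the scaled base reconstruction predicts the high definition signal, the smaller the residual the enhancement layer has to carry.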
Spatial layering is generally chosen for scaled decoder complexity, but also serves to improve robustness over communications mediums when the smaller base layer bitstream is better protected against errors in the communications channel or storage medium.
A third scalability category is temporal layering, where the base layer produces a discrete set of frames, and an enhancement layer adds additional frames that can be multiplexed in between the base layer frames. An example application is a base layer bitstream consisting of only I and P pictures could be decoded independently of an enhancement stream containing only B-pictures, while the B-pictures would be dependent upon the base layer reconstruction, as the I and P frame reconstructions would serve as forward and backward MCP (Motion Compensated Prediction) references. Another application is stereo vision, where the base layer provides the left eye frames, and the enhancement layer predicts the right eye frames from the left eye frames, with additional correction (enhancement) to code the left-right difference.
Enhancement methods that do not employ side information or any significant enhancement layer stream are applied by default in the conversion of SDTV to HDTV. Interpolation, through scaling and sharpening, a standard definition (SDTV) signal to a high definition (HDTV) signal is a method to simulate high definition content, necessary to display SDTV on a high definition monitor. Although the result will not look as good as genuine HDTV content, certain scaling or interpolation algorithms do a much better job than others, as some algorithms better model the differences between a HDTV and SDTV representation of the same content. Edges and textures can be carefully sharpened to provide some of the appearance of HDTV, but will at the same time look artificial since the interpolation algorithm will not sufficiently estimate the true HDTV from the content. Plausible detail patterns can be substituted, but may also retain a synthetic look upon close examination.
Many methods falling under the genre of superresolution can partially restore HDTV detail from an SDTV signal under special circumstances, although to do so requires careful and complex motion compensated interpolation since the gain is realized by solving for detail that has been mixed over several pictures through iterative mathematical operations. Superresolution tools require sub-pixel motion compensated precision, similar to that found in newer video coders, and with processing at sub-pixel granularity rather than whole blocks. Thus, instead of one motion vector for every 8×8 block (every 64 pixels), there would be one to four motion vectors generated by the superresolution restoration algorithm at the receiver for every high-definition pixel.
Optimization techniques can reduce this complexity, but the end complexity would nonetheless exceed the combined decoding and post-processing complexity of the most advanced consumer video systems. In an effort to improve stability of the restored image, and reduce implementation costs, several approaches have been investigated by researchers to restore high resolution from a combination of a lower resolution image and side information or explicit knowledge available only to the encoder.
Gersho's 1990 publication “non-linear VQ interpolation . . . ” [Gersho 90] first proposes to interpolate lower resolution still images by means of Vector Quantization (VQ) codebooks (2410 and 2516) trained on their original higher resolution image counterparts. Prior interpolation methods, such as multi-tap polyphase filter banks, generate the interpolated image sample-by-sample (or point-wise) where data is fitted to a model of the interpolated signal through convolution with curves derived from the model. The model is typically a sinc function. Gersho's interpolation procedure (FIG. 2f) closely resembles block coding, where the picture (example shown in FIG. 7e) is divided into a grid of input blocks similar to the grid 7411. Each block (whose relationship to the grid 7411 is demonstrated by block 7431) in signal 2506 may be processed independently of other blocks within the same picture. The mapping stage 2504 models some form of distortion such as sub-sampling of the original signal 2502 to the input signal 2506. It is the goal of the Gersho 90 interpolator that the reconstructed block 2518 best approximates the original block 2502 given the information available in the receiver, namely, input block 2506 and previously derived codebooks 2510 and 2516. Input block 2506 is matched to a best-fit entry within a first codebook 2510. FIG. 2g adapts the mapping stage 2604 as a combination of decimation followed by the MPEG encode-decode process, the focus of this disclosure's application. Specifically, the mapping stage is the conversion of an HDTV signal to an SDTV signal (via sub-sampling or decimation) that is then MPEG encoded. While the classic VQ picture coder transmits codebook indices to the receiver, in the nonlinear VQ interpolation application (FIGS. 2f through 2j), the first index 2512 of the matching codebook entry in 2510 serves as the index of a corresponding entry in a second codebook 2516.
“Super-resolution” is achieved in that the second codebook contains detail exceeding the detail of the input blocks 2506. Gersho 90 is targeted for the application of image restoration, operating in a receiver that is given the distorted image and codebooks 2510, 2516, 2610, and 2616 trained on content 2502 available only at the transmitter.
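The two-codebook lookup can be sketched as below. The tiny codebooks, the squared-error match criterion and the function names are all hypothetical; trained codebooks would be far larger and the blocks two-dimensional.

```python
def nearest_index(block, codebook):
    """Exhaustive search of the first codebook for the best-fit entry (squared error)."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: distance(block, codebook[i]))

def nlivq_interpolate(input_block, low_codebook, high_codebook):
    """Use the first-codebook index to fetch the paired higher-detail entry."""
    index = nearest_index(input_block, low_codebook)
    return high_codebook[index]

# Illustrative paired codebooks: entry i of each describes the same pattern at two resolutions.
low_codebook = [(0, 0), (10, 10), (0, 10)]
high_codebook = [(0, 0, 0, 0), (10, 10, 10, 10), (0, 3, 7, 10)]

output_block = nlivq_interpolate((1, 9), low_codebook, high_codebook)
```

No index is transmitted in this arrangement; the receiver derives it from the input block, and the added detail comes entirely from the second codebook.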
Gersho's non-linear VQ interpolation method is applied for image restoration, and therefore places the codebook search matching and index calculation routine at the receiver. In contrast, the typical applications of VQ are for compression systems whose search routine is at the transmitter where indices and the codebooks are generated and transmitted to the receiver. The receiver then uses the transmitted elements to reconstruct the encoded images. While in the Gersho 90 design, the index generator 2008 is at the receiver, the codebook generator still resides at the transmitter, where the higher resolution source content 2002 upon which C* (2016, 2116) is trained, is available.
The principal step of Non-linear Interpolative Vector Quantization for Image Restoration described by [Sheppard 97], over the [Gersho 90] paper that it builds upon, is the substitution of the first VQ stage (2508,2608) with a block waveform coder comprising a Discrete Cosine Transform 2904 and transform coefficient Quantization stage 2908. The quantized coefficients are packed 2912 to form the index 2914 applied to the second codebook 2716, 2812. Thus, a frequency domain codebook is created rather than the traditional, spatial domain VQ codebook. The significance of this step is many-fold. First, the codebook search routine is reduced to negligible complexity thanks to the combination of DCT, quantization, and packing stages (2904, 2908, 2912 respectively) that collectively calculate the second codebook index 2712 directly from a combination of quantized DCT coefficients 2906 within the same block 2902. Prior methods, such as Gersho 90, generated the index through comprehensive spatial domain match tests (similar to the process in 5400) of many codebook entries (similar to 5140) to find the best match, where the index 2712 of the best match serves as the index sought by the search routine.
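The idea of computing the index directly, rather than searching for it, might be sketched as follows. The 4-sample one-dimensional DCT, the quantizer step, the number of coefficients kept and the bit width per coefficient are all assumptions chosen for brevity, not parameters from [Sheppard 97].

```python
import math

def dct_1d(samples):
    """Orthonormal DCT-II of a short row of samples."""
    n = len(samples)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * sum(s * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, s in enumerate(samples)))
    return out

def packed_index(samples, step=16, bits=4, kept=2):
    """Quantize the first `kept` coefficients and pack them into one integer index."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    index = 0
    for c in dct_1d(samples)[:kept]:
        q = max(low, min(high, int(round(c / step))))    # clamp to the signed range
        index = (index << bits) | (q & ((1 << bits) - 1))
    return index

flat = packed_index([32, 32, 32, 32])
nearly_flat = packed_index([33, 32, 32, 31])
edge = packed_index([0, 0, 64, 64])
```

Blocks that quantize to the same coefficients share an index and therefore the same second-codebook entry, and the cost per block is one transform plus a few shifts instead of a comparison against every codebook entry.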
Sheppard further overlaps each input block by a pre-determined number of samples. Thus, a window of samples is formed around the projected area to be interpolated, and the input window steps through the picture at a number of samples smaller than the dimensions of the input block. Alternatively, in a non-overlapping arrangement, the projected and input block dimensions and step increments would be identical. An overlapping arrangement induces a smoothing constraint, resulting in a more accurate mapping of input samples to their output interpolated counterparts. This leads to fewer discontinuities and other artifacts in the resulting interpolated image. However, the greater the overlap, the more processing work must be done in order to scale an image of a given size. For example, in a combination of a 4×4 process block overlapping a 2×2 input block, sixteen samples are processed for every four samples that are interpolated. This is a 4:1 ratio of process bandwidth to input work. In a non-overlapping arrangement, sixteen samples (in a 4×4 block) are produced for every sixteen input samples. The overlapping example given here requires four times as much work per average output sample as the non-overlapping case.
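The work ratios quoted above follow from the block and step dimensions. A small sketch, with a hypothetical function name and assuming square blocks and equal horizontal and vertical steps:

```python
def processed_per_projected(process_dim, step_dim):
    """Samples read per sample projected, for square blocks stepped by step_dim."""
    return (process_dim * process_dim) / (step_dim * step_dim)

overlapping = processed_per_projected(4, 2)       # 4x4 window stepping over 2x2 areas
non_overlapping = processed_per_projected(4, 4)   # 4x4 window stepping by 4

print(overlapping, non_overlapping)   # prints 4.0 1.0
```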
Although the DCT method by Sheppard et al does permit larger codebooks than the NLIVQ methods of Gersho et al, it does not address the cost and design of sending such codebooks to a receiver over a communications or storage medium. The application is a “closed circuit” system, with virtually unlimited resources, for restoring images of similar resolution. Thus, an improved system that is designed specifically for entropy-constrained, real-time transmission and can scale across image resolutions is needed.
DVD
DVD is the first inexpensive medium to deliver to mainstream consumers nearly the full quality potential of SDTV. Although a rigid definition of SDTV quality does not exist, the modern definition has settled on “D-1” video, the first recording format to adopt CCIR 601 parameters. SDTV quality has evolved significantly since the first widespread introduction of television in the 1940's, spawning many shades of quality that co-exist today.
In the late 1970's, the first popular consumer distribution formats, VHS and Betamax tape, established the most common denominator for standard definition with approximately 250 horizontal luminance lines and a signal-to-noise ratio (SNR) in the lower to mid 40's dB range. Early television broadcasts had similar definition. In the 1980's, television monitors, analog laserdiscs, Super-VHS and the S-Video connector offered consumers improved SD video signals with up to 425 horizontal lines and SNR as high as 50 dB, exceeding the 330 horizontal-line-per-picture-height limit of the broadcast NTSC signal format today.
Starting in 1982, professional video engineering organizations collaborated on the creation of the CCIR 601 discrete signal representation standard for the exchange of digital signals between studio equipment. Although it is only one set of parameters among many possible choices, CCIR 601 effectively established the upper limit for standard definition at 540 horizontal lines per picture height (on a 4:3 aspect ratio monitor). Applications such as DVD later diluted the same pixel grid to cover a one third wider screen area. Thus the horizontal density on 16:9 anamorphic DVD titles is one quarter less than on standard 4:3 “pan & scan” titles. The CCIR 601 rectangular grid sample lattice was defined as 720 samples per line, with approximately 480 lines per frame at the 30 Hz frame rate most associated with NTSC, and 576 lines at the 25 Hz frame rate of PAL and SECAM. Horizontal line density is calculated as (total lines per picture width)/(aspect ratio). For a 4:3 aspect ratio, the yield is therefore ((3/4)×(720))=540 lines.
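As a worked check of the 540-line figure, the following sketch takes horizontal density per picture height to be the number of samples per line divided by the display aspect ratio (width over height). The function name is hypothetical, and exact fractions are used only to avoid rounding.

```python
from fractions import Fraction

def lines_per_picture_height(samples_per_line, aspect_ratio):
    """Horizontal samples that fit in a width equal to one picture height."""
    return samples_per_line / aspect_ratio

sd_4x3 = lines_per_picture_height(720, Fraction(4, 3))
sd_16x9 = lines_per_picture_height(720, Fraction(16, 9))

print(sd_4x3)            # 540
print(sd_16x9)           # 405
print(sd_16x9 / sd_4x3)  # 3/4
```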
Although technically a signal format, CCIR 601 cultivated its own connotation as the ultimate watermark of “studio quality.” By the late 1990's, CCIR 601 parameters were ushered to consumers by the ubiquitous MPEG-2 video standard operating mode, specifically designated “Main Profile @ Main Level” or “MP@ML”. MPEG-2 MP@ML was adopted as the exclusive operating point by products such as DVD, DBS satellite, and digital cable TV. While the sample dimensions of DVD may be fixed to 720×480 (“NTSC”) and 720×576 (“PAL”), the familiar variables such as bitrate (bandwidth), content, and encoder quality very much remain dynamic, and up to the discretion of the content author.
Concurrent with the end of the SDTV evolution, HDTV existed almost from its beginning as a handful of digital formats. SMPTE 274M has become HDTV's ubiquitous analog to SDTV's CCIR 601. With 1920 samples per line by 1080 lines per frame, and a 16:9 aspect ratio (one third wider than the 4:3 ratio of SDTV), SMPTE 274M meets the canonical requirement that HD be capable of rendering twice the horizontal and vertical detail of SDTV. The second HDTV format, SMPTE 296M, has image dimensions of 1280×720 samples.
Until all programming is delivered in an HDTV format, there will be a need to convert SDTV signals to fit on HDTV displays. SDTV legacy content may also circulate indefinitely. In order to be displayed on a traditional HDTV display, SDTV signals from sources such as broadcast, VHS, laserdisc, and DVD need to first be up-converted to HDTV. Classic picture scaling interpolation methods, such as many-tap FIR poly-phase filters, have been regarded as the state of the art in practical interpolation methods. However, the interpolated SD signal will still be limited to the detail prescribed in the original SD signal, regardless of the sample density or number of lines of the HD display. Interpolated SD images will often appear blurry compared to their true HD counterparts, and if the interpolated SD images are sharpened, they may simulate some aspect of HD at the risk of looking too synthetic.
One reason SD content looks better on HD displays is that most display devices are incapable of rendering the full detail potential of the signal format they operate upon as input. The HD display has the advantage that details within the SD image that were too fine or subtle to be sufficiently resolved by an SD display can become much more visible when scaled up on the HD display. Early on, however, the interpolation processing and HD display will reach a point of diminishing returns with the quality and detail that can be rendered from an SD signal. In the end, information must be added to the SD signal in order to render true detail beyond the native limits of the SD format. Several enhancement schemes, such as the Spatial Scalable coders of MPEG-2, have been attempted to meet this goal, but none have been deployed in commercial practice due to serious shortcomings.
Enhancement methods are sensitive to the quality of the base layer signal that they build upon. To optimize the end quality, a balance in bitrate and quality must be struck between the base layer and enhancement layer reconstructions. The enhancement layer should not always spend bits correcting deficiencies of the base layer, while at the same time the base layer should not stray too close to its own point of diminishing returns.