1. Field of the Invention
The field of the invention relates to real time video processing, and, more specifically, to measurement of digital video image quality using principles of human physiological perception.
2. Background of the Technology
The future of image transmission, and indeed much of its present, is the streaming of digital data over high-speed channels. Streaming audio and video and other forms of multimedia technologies are becoming increasingly common on the Internet and in digital broadcast satellite television, and will take over most of the television broadcast industry in the next decade.
Broadcasters naturally want to build quality assurance into the product they send their customers. Such quality assurance is difficult, especially when video streams originate in a variety of different formats. Furthermore, various transmission channels have quite different degradation characteristics. Experts in video quality analysis and the standardization communities continue to grapple with this problem, assessing various methods of digital video quality assessment and correction in order to standardize quality measurement.
These considerations drive the search for the most objective mathematical and computational techniques to enable quality metrics. Ultimately, to be of any use, calculated quality measurements and the quality humans perceive during viewing must correlate. Mathematically modeling the visual pathways and perceptual processes inside the human body is a natural way to maximize this correlation.
Previous methods to computationally model the way humans judge visual quality relied on the lowest perceptual mechanisms, principally at the retinal level. A good example of these methods is edge detection, a visual function that takes place in the retina. There is an unmet need for visual quality measurement methods that model the higher functions of the human visual pathway in the visual cortex, the level at which the brain understands what is seen.
Specifically, a number of problems with the prior art exist in the regime of video quality analysis or measurement and in the fundamental technique of video quality analysis with regard to digital video. One example in terms of digital video is what viewers often receive from a dish network, such as provided by Echostar Satellite of Littleton, Colo., or DirecTV® of El Segundo, Calif. Digital video is also what viewers typically see when working with a computer to, for example, view Internet streaming and other video over the Internet. Other examples of digital video include QuickTime™ movies, supported by Apple Computer, Inc., of Cupertino, Calif., AVI movies in Windows, and video played by a Windows media player. Another important example of digital video is high definition television (HDTV). HDTV requires a substantially greater amount of bandwidth than analog television due to the high data volume of the image stream.
What viewers currently watch, in general, on standard home television sets is analog video. Even though the broadcast may be received as digital video, broadcasts are typically converted to analog for presentation on the television set. In the future, as HDTV becomes more widespread, viewers will view digital video on home televisions. Many viewers also currently view video on computers in a digital format.
An unmet need exists in the prior art for a fundamental method of analyzing video quality. The need arises typically to address some type of degradation in the video. For example, noise may have been introduced into a video stream, disturbing the original picture. There are various types of noise, and the particular type of noise can be critical. For example, one form of digital video quality measurement involves examination of the specific type of degradation encountered.
Examples of various types of noise include the following. In one type of digital noise, the viewer sees "halos" around the heads of images of people. This type of noise is referred to as "mosquito noise." Another type of noise is a motion compensation noise that often appears, for example, around the lips of images of people. With this type of noise, the lips appear to the viewer to "quiver." This "quivering" noise is noticeable even on current analog televisions when viewing HDTV broadcasts that have been converted to analog.
The analog conversion of such broadcasts, as well as the general transmittal of data for digital broadcasts for digital viewing, produces output that is greatly reduced in size from the original HDTV digital broadcast, in terms of the amount of data transferred. Typically, this reduction in data occurs as a result of compression of the data, such as occurs with the Moving Picture Experts Group (MPEG) conversion process or otherwise via lossy data compression schemes known in the art. The compression process selectively transfers data, reducing the transmittal of information among frames containing similar images and thus greatly improving transmission speed. Generally, the data in common among these frames is transferred once, and the repetitive data for subsequent similar frames is not transferred again, while the changing data in the frames continues to be transmitted. Some of the noise results from the recombination of the continually transferred changing data with the reused repetitive data.
For example, when a news broadcaster is speaking, the broadcaster's body may not move, but the lips and face may continuously change. The portions of the broadcaster's body, as well as the background behind the broadcaster on the set, which are not changing from frame to frame, are transmitted only once as a result of the compression routine. The continuously changing facial information is constantly transmitted. Because the facial information represents only a small portion of the screen being viewed, the amount of information transmitted from frame to frame is much smaller than would be required for transmission of the entire frame for each image. As a result, among other advantages, the transmission rate for such broadcasts is greatly increased through reduced use of bandwidth.
As can be seen from the above example, one type of changing data that MPEG continuously identifies for transfer is data for motion occurring among frames, an important part of the transferred video. For video quality purposes, accurate detection of motion is important. Inaccuracies in the identification of such motion, however, lead to subjective image quality degradation, such as the lip "quivering" seen in such broadcasts.
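The inter-frame data reduction described above can be sketched in a few lines. This is an illustrative simplification, not MPEG itself: frames are plain lists of rows of pixel values, and the hypothetical `changed_blocks` helper merely reports which blocks differ from the previous frame and so would need retransmission.

```python
def changed_blocks(prev_frame, curr_frame, block_size=2, threshold=0):
    """Report (row, col) origins of blocks in curr_frame that differ from
    prev_frame; only these blocks would be (re)transmitted.
    Frames are lists of rows of pixel values (illustrative representation)."""
    height, width = len(curr_frame), len(curr_frame[0])
    blocks = []
    for by in range(0, height, block_size):
        for bx in range(0, width, block_size):
            # sum of absolute pixel differences within this block
            diff = sum(
                abs(curr_frame[y][x] - prev_frame[y][x])
                for y in range(by, min(by + block_size, height))
                for x in range(bx, min(bx + block_size, width))
            )
            if diff > threshold:
                blocks.append((by, bx))
    return blocks
```

In the newscaster example, only the blocks covering the moving lips and face would appear in the returned list; the static background blocks are transmitted once and skipped thereafter.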
There remains an unmet need to determine, using an objective technique, the quality of video streams in a manner that is consistent with human subjective opinion of video quality. There is a further unmet need to improve on the existing state of the art for making such objective assessments, in that none of the existing techniques has proven to be superior to analysis using peak signal-to-noise ratio (PSNR).
PSNR is a mathematical comparison of differences among video frames, once the frames have been reduced to numerical data following capture and processing in a computer. For example, video that is operated upon, such as by undergoing transmission to a remote site for viewing, typically can undergo degradation in video quality. Such operations upon the video stream are generically referred to as "hypothetical reference circuits" (HRCs). Comparison may be made in this example between the original source video stream and the transmitted, possibly degraded video stream in order to determine the amount of degradation that has occurred.
In one existing method for subjectively measuring such possible degradation, the original frames or video sequences are shown to human observers, and then the possibly degraded frames or sequences are shown to the observers. The observers are then asked to rank the degradation on a scale, such as a scale of one to ten.
In one simple existing objective technique of video quality analysis, the numerical data produced from each original frame is compared to the numerical data for each possibly degraded frame. The difference between the numerical data of the original and possibly degraded frames is determined on a pixel by pixel basis, and the differences are then squared, summed, and normalized to yield a mean squared error, which is expressed logarithmically relative to the peak signal value. The resulting value produced by this method is referred to as PSNR.
It has been found that some correlation exists between PSNR and human subjective analysis. This correlation, however, is not sufficiently robust for PSNR to serve as a full substitute for subjective analysis. Further, in some cases, PSNR indicates that video quality is good when subjective measures find the quality to be poor.
One drawback of PSNR is that this method does not account for aspects of human visual perception other than gross numeric comparison. This is one reason why PSNR is sometimes inaccurate. There remains an unmet need to address video quality using biomimetic principles, those principles that mimic biological behaviors. In particular, there is a need to focus on biomimetic principles based on the human visual system.
Other techniques have been developed to attempt to provide objective measures that mimic human perceptions. One known technique, described in U.S. Pat. Nos. 5,446,492 and 5,596,364 to Wolf et al., involves edge detection. Edges of images have long been recognized as a part of the human perception of images; the importance of edges is well known in the literature on human psycho-visual analysis and, for example, in machine vision. In a broad sense, the method of the patents to Wolf et al. involves detecting the edges within each frame and then performing a PSNR type of analysis between the edges. This approach has been found to be statistically equivalent to PSNR.
U.S. Pat. No. 6,075,884 to Lubin et al. involves a technique known as "just noticeable difference." The method of the patent to Lubin et al. attempts to localize the variations between pixels in the reference and the analysis frame. Generally, rather than determining arithmetic differences among frames globally, the method of Lubin et al. attempts to identify and count small variations between the frames; the total of these counted variations produces a result called the Just Noticeable Difference (JND). More specifically, the patent includes three types of measurements: 1) a luma measurement; 2) a chroma measurement; and 3) a combined measurement of these two maps, referred to as a "JND map." The JND measurements, or maps, are produced by processing four levels in the luma and seven levels in the chroma, applying a Gaussian filter, and then decimating by two for each level. The JND measurements are performed on the source video and the video to be compared, and the differences are then determined between the two resulting maps. This method also involves the use of a front-end processing engine. The method of this patent has not been found to produce a significantly different result from PSNR.
Another example of the prior art is a method produced by KDD Media Will Corp. of Japan. The KDD method generally involves analysis of regions and determination of specific differences between regions for the source and compared frames. The method also includes some edge analysis. This method has the theoretical advantage that it mimics human focus on specific items or objects, which correspond to regions of the image.
U.S. Pat. No. 5,818,520 to Bozidar Janko et al., describes a method of automatic measurement of compressed video quality that superimposes special markings in the active video region of a subset of contiguous frames within a test video sequence. The special markings provide a temporal reference, a spatial reference, a gain/level reference, a measurement code, a test sequence identifier, and/or a prior measurement value. The temporal reference is used by a programmable instrument to extract a processed frame from the test video sequence after compression encoding-decoding, which is temporally aligned with a reference frame from the test video sequence. Using the spatial reference, the programmable instrument spatially aligns the processed frame to the reference frame. The programmable instrument uses the measurement code to select the appropriate measurement protocol from among a plurality of measurement protocols. In this way video quality measures for a compression encoding-decoding system are determined automatically as a function of the special markings within the test video sequence.
U.S. Pat. No. 4,623,837 to Edward Efron et al., describes a method and means for evaluating the quality of audio and/or video transfer characteristics of a device upon which, or through which, audio and/or video information is contained, or passes, respectively. Both the method and apparatus of this patent concern the evaluation of the quality of information transfer in the recording and playing back of a recording medium or in the transferring of audio and/or video information through an information handling device, referred to as a throughput device. Unit evaluation is accomplished by establishing an input signal of known content, measuring selected parameters of selected parts of the input signal, feeding the input signal to the unit under test, measuring the parameters of parts of the output signal from the unit under test corresponding to the same selected parts of the input signal, and comparing the selected parameters of the input signal with the corresponding parameters of the output signal. Whether monitoring the quality of the signal transfer characteristics of a throughput device, a magnetic tape containing program material, or a video disc, master disc, or replica, a "signature" is created for the unit under test. Subsequent analysis of the unit as it progresses along a production line, or of a copy made on the same or an alternate recording medium, results in a second "signature," which is compared against the first signature to make a determination as to the quality of the signal handling or transfer characteristics of the unit. In this manner, out-of-tolerance conditions can be automatically detected, thereby eliminating subjectivity and providing consistency in the quality level of device testing.
U.S. Pat. No. 5,574,500 to Takahiro Hamada et al., describes a sync controller that controls an amount of delay of a delay part so that original video data entered from a video source is synchronized with reproduced video data, which is compressed and reproduced by a system to be evaluated. A first orthogonal transformation (OT) calculator orthogonally transforms a reproduced image, a second OT calculator orthogonally transforms an original image, and a subtractor obtains a difference value of the same order coefficients in one block. A weighted signal to noise ratio (WSNR) calculator weights the difference with a weighting function which varies with a position of a coefficient of orthogonally transformed data and a magnitude of an alternating current (AC) power in the block after orthogonal transform of the original image and subsequently obtains an average weighted signal to noise (S/N) ratio of each video frame or a plurality of video frames. Finally, a subjective evaluation value calculator calculates a subjective evaluation value (deterioration percentage) according to the average weighted S/N ratio. Consequently, the invention provides video quality evaluating equipment for a reproduced image of a video signal subject to digital compression capable of economically evaluating video quality in a short period of time.
U.S. Pat. No. 5,940,124 to Bozidar Janko et al., describes attentional maps, which reflect an observer's subjective response to the effects of degradation in a video image and are used in the objective measurement of video quality degradation. The observer assists in generating an attentional map for each image of a test image sequence, which provides different thresholds or weighting factors for different areas of each image. A video image sequence from a system under test is compared with the test image sequence, and the error results are displayed as a function of the corresponding attentional maps.
U.S. Pat. No. 5,929,918 to Ricardo Alberto Marques Pereira et al., describes an interpolation filter for video signals that includes four circuits to improve video quality in both intra-field and inter-field modes. The interpolation filter is configured to interpolate according to the direction of an image edge. The interpolation filter is also configured to interpolate in a prescribed spatial direction when no image edges can be univocally determined. The first circuit detects an image edge of discrete image elements to generate a first signal. The second circuit uses output from the first circuit to generate a first signal corresponding to an average of the discrete image elements along a direction of the image edge. The third circuit uses output from the first circuit to detect a texture image area wherein an image edge cannot be univocally determined and for generating a second signal depending on a degree of existence of the image edge. The fourth circuit is supplied by the first signal, the second signal, and a third signal. The fourth circuit generates an output signal obtained by combining the first signal with the third signal in a proportion dependent upon the second signal. Additionally, the fourth circuit is configured for multiplexing to selectively couple the third signal to a fourth signal, corresponding to an average of the discrete image elements along a prescribed direction, or to a fifth signal corresponding to a previously received image element value.
U.S. Pat. No. 5,790,717 to Thomas Helm Judd describes an apparatus and method for predicting a subjective quality rating associated with a reference image compressed at a given level. The invention includes components and steps for storing a digitized color image representing a reference image in memory and compressing at a given level and decompressing the reference image to produce a processed image. The invention also entails converting the reference image and the processed image each to a grayscale image and dividing each grayscale image into an array of blocks. The invention further includes generating a first intensity variance array corresponding to the array of blocks of the grayscale reference image and a second intensity variance array corresponding to the array of blocks of the grayscale processed image. Lastly, the invention involves generating a variance ratio based on the first and second intensity variance arrays, determining a block variance loss based on the variance ratio, and generating the subjective quality rating indicated by the impairment level based on the variance loss.
U.S. Pat. No. 5,835,627 to Eric W. Higgens et al., describes an image processing system and method for processing an input image that provides a virtual observer for automatically selecting, ordering, and implementing a sequence of image processing operations, which will yield maximum customer satisfaction as measured by a customer satisfaction index (CSI), which, for example, can balance the image quality and the processing time. The CSI evaluates an effect of the sequence of image processing operations on the input image in response to upstream device characteristic data received from an input device profile, downstream device characteristic data received from an output device profile, host configuration data, user selection data, trial parameter values, and data corresponding to the sequence of image processing operations. In a preferred embodiment, the effect is evaluated in accordance with predetermined psychovisual attributes of the input image, as attained and codified by human observers who have subjectively selected a most pleasing test image corresponding to objective metrics of the predetermined psychovisual attributes.
Other existing methods include identifying specific degradations, such as blocking effects of the discrete cosine transform (DCT) or mosquito noise for format conversions, and correcting for these specifically identified degradations.
There thus remains a number of unsolved problems with these existing methods. For example, none of these existing methods uses a Gabor filter as a basis for measuring quality. A second problem is that none uses a spherical coordinate transform (SCT), a process shown to enhance objective results when compared to subjective analyses of image quality. A third problem is that none mimics human visual functions at the visual cortex level, an approach capable of producing higher likelihood of correlation with subjective analyses.
One advantage of the present invention is that it does not require reference source data to be transmitted along with the video data stream. Another advantage of the present invention is that it is suitable for online, real-time monitoring of digital video quality. Yet another advantage of the present invention is that it detects many artifacts in a single image, and is not confined to a single type of error.
In order for objective computations of digital video quality to correlate as closely as possible with the subjective human perception of quality, embodiments of the present invention mimic the highest perceptual mechanism in the human body as the model for the measure of quality. Embodiments of the present invention therefore provide methods to measure digital video quality objectively using biomimetic principles, those principles that mimic biological behaviors.
A meaningful digital video quality measurement must match the quality perceived by human observers. Embodiments of the present invention provide methods for correlating objective measurements with the subjective results of human perceptual judgments.
The present invention includes a method and system for analyzing and measuring image quality between two images. A series of conversions and transformations of the image information is performed to produce a single measure of quality. A YCrCb frame sequence (YCrCb is component digital nomenclature for video, in which the Y component is luma and Cr and Cb, the red and blue chroma components, carry the color content of the image) is first converted to an RGB (red, green, blue) frame sequence. The resulting RGB frame sequence is converted using spherical coordinate transform (SCT) conversion to SCT images. A Gabor filter is applied to the SCT images to produce a Gabor Feature Set, and a statistics calculation is applied to the Gabor Feature Set. The resulting Gabor Feature Set statistics are produced for both the reference frame and the frame to be compared. Quality is computed from these Gabor Feature Set statistics to produce a video quality measure. A spectral decomposition of the frames may also be performed on the Gabor Feature Set, rather than the statistics calculation, allowing graphical comparison of results for the compared frames.
The YCrCb to RGB conversion is made because the eye appears to operate as an RGB device. The conversion to SCT images simulates functions performed in the visual cortex of the brain; performing operations using SCT transforms more closely matches the behavior of the human brain than working in other formats. The SCT conversion is made because studies have shown that the visual system of the brain tracks objects in a spherical coordinate system, as opposed to a Cartesian coordinate system.
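One common formulation of a spherical coordinate transform of RGB data can be sketched as follows: each RGB triple is treated as a vector whose length carries intensity and whose two angles carry chromatic information. The angle conventions here are assumptions for illustration and may differ from those used in any particular system.

```python
import math

def rgb_to_sct(r, g, b):
    """Map an RGB triple to spherical coordinates (rho, theta, phi):
    one common SCT formulation, with intensity separated from chroma."""
    rho = math.sqrt(r * r + g * g + b * b)  # vector length: overall intensity
    if rho == 0:
        return 0.0, 0.0, 0.0  # black: angles undefined, return zeros
    theta = math.acos(b / rho)              # angle from the blue axis
    phi = math.atan2(g, r)                  # angle in the red-green plane
    return rho, theta, phi
```

A design point worth noting: because intensity is isolated in `rho`, the two angular components are largely invariant to brightness changes, which is one reason angle-based color representations track objects more stably than raw Cartesian RGB.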
Further, the brain appears to perform the equivalent of a Gabor transform on images prior to the brain analyzing the visual content of the frame. The result of the application of the Gabor transform is essentially a reduced set of data, produced by filtering the original image data, that comprises the extracted features of the image. The extracted features have been shown to correspond to features that are initially extracted at the visual cortex of the brain.
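A Gabor filter of the kind described above can be sketched by constructing its kernel directly: a sinusoidal carrier at a chosen orientation and wavelength under a Gaussian envelope. The parameter values below are illustrative assumptions, not those of any particular system.

```python
import math

def gabor_kernel(size=7, wavelength=4.0, orientation=0.0, sigma=2.0):
    """Real part of a 2-D Gabor kernel: a cosine carrier at the given
    orientation and wavelength under an isotropic Gaussian envelope."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # project (x, y) onto the carrier's direction of oscillation
            xr = x * math.cos(orientation) + y * math.sin(orientation)
            envelope = math.exp(-(x * x + y * y) / (2 * sigma * sigma))
            carrier = math.cos(2 * math.pi * xr / wavelength)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel
```

Convolving an image with a bank of such kernels at several orientations and wavelengths yields strong responses where the image contains oriented structure at the matching scale; collecting those responses is one way a feature set of the kind described above can be formed.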
Comparisons of quality computations using Gabor feature set statistics have indicated that the method and system of the present invention provide results for comparing digital video quality that are as effective as PSNR.
Another advantage of the present invention is that the invention's use of algorithms that model the biological processing at the visual cortex level, including the SCT and Gabor filters, provides the secondary benefit that the transform and filtering lend themselves to real-time processing using a Single Instruction Multiple Data (SIMD) architecture.
Additional advantages and novel features of the invention will be set forth in part in the description that follows, and in part will become more apparent to those skilled in the art upon examination of the following or upon learning by practice of the invention.