1. Field of the Invention
This invention relates generally to processing video signals, and more specifically to systems for interpolating and coding video signals using speech analysis techniques.
2. Description of the Related Art
The progressive developments in digital electronics and digital computing since the 1960s have resulted in the conversion, from analog to digital technology, of devices for storing and processing audio and video signals. Storing, processing and transmitting signals digitally offers significant advantages. Digital signals are less sensitive to transmission noise than analog signals. Moreover, digital signals of different types can be treated in a unified way and, provided adequate decoding arrangements exist, can be mixed on the same channel. The latter approach is the main feature of the Integrated Services Digital Network (ISDN) which is currently being developed and implemented. The ISDN can handle, for example, speech, image and computer data on a single channel.
A major disadvantage of digital communication, however, is that it requires greater channel bandwidth. This can be several times the bandwidth of an equivalent analog channel. In multimedia, videotelephony, and teleconferencing applications, bandwidth and storage space limitations permit only a relatively low frame rate (typically 5-10 frames per second, but as low as 1-2 frames per second for some applications). Thus, there is currently a strong emphasis on techniques and systems which reduce the channel bandwidth required to transmit the signals. In the context of speech signals, for example, a number of techniques have been proposed which are capable of efficiently coding at very low bit rates (between 4.8 and 64 kbits/s). Such techniques include logarithmic pulse code modulation (Log PCM), adaptive pulse code modulation (APCM), adaptive differential pulse code modulation (ADPCM), delta modulation (DM), and continuously variable slope delta modulation (CVSD). All of these techniques operate directly on the time domain signal and achieve reduced bit rates by exploiting the sample-to-sample correlation or redundancy in the speech signal.
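As an illustration of how a time-domain coder exploits sample-to-sample correlation, the following is a minimal sketch of fixed-step delta modulation, which sends one bit per sample indicating whether the signal lies above or below a running estimate. The function names, the zero initial estimate, and the fixed step size are assumptions made for this example; practical coders such as CVSD adapt the step size.

```python
def dm_encode(samples, step):
    """Encode samples as one bit each: 1 if at or above the running estimate, else 0."""
    bits = []
    estimate = 0  # assumed starting value shared by encoder and decoder
    for s in samples:
        bit = 1 if s >= estimate else 0
        # The encoder tracks the same estimate the decoder will rebuild.
        estimate += step if bit else -step
        bits.append(bit)
    return bits


def dm_decode(bits, step):
    """Rebuild a staircase approximation of the signal from the bit stream."""
    out = []
    estimate = 0
    for bit in bits:
        estimate += step if bit else -step
        out.append(estimate)
    return out


bits = dm_encode([1, 2, 3, 3, 2, 1], step=1)
approx = dm_decode(bits, step=1)
```

Because successive speech samples are strongly correlated, a slowly varying signal can be tracked closely with a single bit per sample, which is the source of the bit-rate reduction.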
While the coding techniques discussed above permit very low bit rates to be achieved for the transmission or storage of speech signals, they are less suitable for the coding of video signals. Thus, although current visual coding standards may also operate at very low bit rates, the trade-off between temporal and spatial resolution results in visually annoying motion or spatial artifacts. As such, various techniques have been proposed to interpolate between transmitted or stored frames as a means of increasing the frame rate for flicker-free and smooth motion rendition.
In the interframe coding of television pictures, for example, it is known to drop or discard information from some frames or fields by subsampling the video signal at a fraction of the normal rate. At the receiver, a reconstructed version of the information contained in the nontransmitted frames or fields is obtained by interpolation, using information derived from the transmitted fields. Simple linear interpolation may be performed by averaging the intensity information defining picture elements (pels) in the preceding and succeeding transmitted fields at fixed locations most closely related to the location of the picture element being processed. In certain instances, the interpolation may be performed adaptively, such that the pels used to form certain reconstructed or estimated intensity values are selected from two or more groups having different spatial patterns, or such that the information obtained from pels in the same relative spatial positions in the prior and succeeding frames is combined in two or more different ways.
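The simple linear interpolation described above can be sketched as follows. This is an illustrative example only: frames are assumed to be equal-sized lists of rows of integer intensities, the dropped frame is assumed to lie midway between the two transmitted frames, and the function name is hypothetical.

```python
def interpolate_frame(prev_frame, next_frame):
    """Estimate a dropped frame by averaging co-located pels.

    Each output pel is the integer mean of the pels at the same
    row and column in the preceding and succeeding frames.
    """
    return [
        [(a + b) // 2 for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_frame, next_frame)
    ]


prev_frame = [[0, 100], [50, 200]]
next_frame = [[100, 100], [150, 0]]
middle = interpolate_frame(prev_frame, next_frame)
```

Since each pel is averaged at a fixed location, any object that moves between the two transmitted frames contributes to two different locations in the estimate.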
Although both the fixed and the adaptive techniques described above adequately recover nontransmitted or unstored picture information when little motion occurs in a picture, their performance is less than adequate when objects in the picture are moving quickly in the field of view. That is, reconstruction by these interpolation techniques often causes blurring and other objectionable visual distortion. Thus, a more advanced interframe coding technique is proposed in U.S. Pat. No. 4,383,272 issued to Netravali et al. on May 10, 1983 and entitled VIDEO SIGNAL INTERPOLATION USING MOTION ESTIMATION. In accordance with the technique disclosed therein, information defining elements of a picture is estimated by interpolation using information from related locations in preceding and succeeding versions of the picture. The related locations are determined by forming an estimate of the displacement of objects in the picture. Displacement estimates are advantageously formed recursively, with updates being formed only in moving areas of the picture. While this coding technique is capable of eliminating the annoying distortion and flicker associated with the other prior art techniques described above, it is still incapable of reproducing the motion of a speaker's mouth in so-called talking-head (i.e., speaking-person) sequences.
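The idea of interpolating from displaced rather than fixed locations can be illustrated on a single scan line. The sketch below is not the recursive estimator of the cited patent; it assumes the displacement is already known, is the same for every pel on the line, and is an even number of pels, and it clamps out-of-range locations to the line edges. The function name is hypothetical.

```python
def mc_interpolate_line(prev_line, next_line, displacement):
    """Interpolate the midway scan line using a known displacement in pels.

    An object at location p in prev_line is assumed to appear at
    p + displacement in next_line, so at the midpoint it sits at
    p + displacement / 2. Each output pel therefore averages the
    previous line half a displacement back and the next line half
    a displacement forward.
    """
    half = displacement // 2
    last = len(prev_line) - 1
    out = []
    for x in range(len(prev_line)):
        xp = min(max(x - half, 0), last)  # related location in previous line
        xn = min(max(x + half, 0), last)  # related location in next line
        out.append((prev_line[xp] + next_line[xn]) // 2)
    return out


prev_line = [0, 0, 200, 0, 0, 0, 0, 0]   # bright object at location 2
next_line = [0, 0, 0, 0, 0, 0, 200, 0]   # same object at location 6
compensated = mc_interpolate_line(prev_line, next_line, displacement=4)
fixed = mc_interpolate_line(prev_line, next_line, displacement=0)
```

With the displacement taken into account, the object appears once at full intensity at the midway location; with a displacement of zero the routine reduces to fixed-location averaging and yields two half-intensity copies, which is the kind of blurring noted above.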
Normal speech has about 13 speech sounds per second, and the positions of the lips, jaw, teeth, and tongue change at even higher rates. At 5-10 frames per second, each frame interval therefore spans roughly 1.3 to 2.6 speech sounds, and more at lower frame rates. As such, it will be readily appreciated that at rates of 5-10 frames per second or lower, a great deal of information about mouth movements is necessarily lost. Accordingly, it is a principal object of the present invention to enable improved reconstruction of non-transmitted or non-stored fields or frames of a video signal indicative of a speaking-person sequence using information from the speaking person's utterances and at least one transmitted or stored field or frame.