1. Field of the Invention
The present invention relates generally to audio and video decompression. More particularly, the present invention relates to a system and method for reducing audio breakups and other artifacts during audio and video decompression.
2. Description of the Related Art
Compression and subsequent decompression of audio and video data have widespread application. Applications include digital video transmission and digital television. In a typical digital video transmission, the participants transmit and receive audio and video signals that allow them to see and hear one another.
To efficiently transmit the large amount of video and audio data generated at particular digital video transmission sites, digital video transmission systems typically digitize and compress the video and audio data for transmission across digital networks. Various compression schemes and various digital networks are available.
Decompression of audio and video data at the receiving end of a digital video transmission system may be implemented in hardware, software, or a combination of hardware and software. Decompression of video data typically includes decoding sequential frames of the video data, converting the decoded frames of video data from one format (e.g., the luminance-chrominance color space YUV) to another (e.g., the red, green, and blue (RGB) color space), and rendering the converted frames of data. Decoding frames of video data may include decoding blocks of pixels, performing motion compensation, inverse quantization to de-quantize the data, inverse scanning to remove the zig-zag ordering of discrete cosine transform coefficients, and/or an inverse discrete cosine transform to convert the data from the frequency domain back to the pixel domain. Compressed frames of audio data are received and sequentially decoded into decoded frames of audio data, which are subsequently rendered.
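The inverse scanning step mentioned above can be illustrated with a short sketch. It assumes the classic 8 x 8 zig-zag order used by baseline JPEG and MPEG-1 style coders; the text does not name a codec, and some codecs (e.g., MPEG-2 with its alternate scan) use other orders. The function names `zigzag_order` and `inverse_zigzag` are hypothetical.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in classic zig-zag scan order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col indexes one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:
            # Even anti-diagonals are walked upward (row decreasing);
            # odd ones are walked downward (row increasing).
            rows = reversed(rows)
        for r in rows:
            order.append((r, s - r))
    return order


def inverse_zigzag(coeffs, n=8):
    """Rebuild an n x n coefficient block from a list received in zig-zag order."""
    if len(coeffs) != n * n:
        raise ValueError(f"expected {n * n} coefficients, got {len(coeffs)}")
    block = [[0] * n for _ in range(n)]
    for value, (r, c) in zip(coeffs, zigzag_order(n)):
        block[r][c] = value
    return block
```

For example, `inverse_zigzag(list(range(64)))[0][:5]` yields `[0, 1, 5, 6, 14]`: the first row of the rebuilt block collects the coefficients whose scan positions land on row 0. The resulting block would then be passed to the inverse discrete cosine transform.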
FIG. 1 shows, in block diagram form, a prior art system for decompressing streams of video and audio frames in a digital video transmission system. The system in FIG. 1 includes a video decoder 102, a YUV to RGB converter 104, a video renderer 108, a video reference memory 106, an audio decoder 110, and an audio renderer 112. The video decoder 102, YUV to RGB converter 104, video renderer 108, audio decoder 110, and audio renderer 112 are implemented wholly or partly in a processor executing respective software algorithms.
In FIG. 1, compressed video frames are first decoded by video decoder 102. Each decoded frame of video data is subsequently stored in the video reference memory 106, and a subsequent compressed frame of video data is decoded as a function of the decoded frame of video data previously stored in the reference memory. After a frame of video data is decoded, YUV to RGB converter 104 converts its format from YUV into RGB. Finally, video renderer 108 renders the converted frames of video data, which are then displayed on a monitor. Compressed frames of audio data are received and decoded by audio decoder 110, and the decoded frames of audio data are then rendered by audio renderer 112. More particularly, FIG. 2 shows that decoded frames of audio data are first stored in a buffer 202 (typically a FIFO). Speaker 204 generates audio corresponding to the individual frames of decoded audio data stored in buffer 202. In a correctly operating digital video transmission system, speaker 204 continuously generates audio from the decoded audio data stored in buffer 202.
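The FIFO arrangement of FIG. 2 can be modeled as a minimal sketch, assuming the speaker pulls exactly one decoded frame per frame period. The class `AudioRenderer` and its methods `push` and `tick` are hypothetical names for illustration; they show why an empty buffer 202 is heard as a gap.

```python
from collections import deque


class AudioRenderer:
    """Toy model of audio renderer 112: FIFO buffer 202 drained by speaker 204."""

    def __init__(self):
        self.fifo = deque()
        self.underruns = 0  # frame periods in which the speaker had nothing to play

    def push(self, frame):
        # Audio decoder 110 appends each newly decoded frame to the tail of the FIFO.
        self.fifo.append(frame)

    def tick(self):
        # Called once per frame period: the speaker plays the oldest decoded frame.
        if self.fifo:
            return self.fifo.popleft()
        # Empty FIFO: no decoded audio is available, which is heard as a breakup.
        self.underruns += 1
        return None
```

If only two frames are pushed before three frame periods elapse, `[r.tick() for _ in range(3)]` returns `['A1', 'A2', None]` and `r.underruns` becomes 1. As long as the decoder keeps the FIFO non-empty, the speaker output stays continuous.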
The various activities of decoder 100 must be performed in near real time. Failure to keep up with real time results in unnatural gaps, jerkiness, or slowness in the motion video or audio presentation. Prolonged failure to keep up with the incoming compressed data will eventually cause a buffer to overflow and a valid video reference frame to be lost, which, in turn, results in chaotic video images because of the differential nature of the compressive coding.
The decompression algorithms described above may execute concurrently with an operating system. Frequently, software applications other than the digital video transmission decompression algorithms must be executed concurrently (i.e., multitasked) with those algorithms. The combined processing requirements of the operating system, these unrelated software applications, and the decompression algorithms may cause the processor to become congested. In addition, some portions of the coded video require more processing than others; when an extended epoch of video of high computational complexity is received, the decoder may become congested. A congested processor may not have enough processing power to execute the decompression algorithms at a rate sufficient to keep up with the source of the encoded data. Processor congestion is therefore often incompatible with the real time requirements of digital video transmission.
Processor congestion may cause noticeable effects in the decompression of audio and video data. FIGS. 3 and 4 contrast the effects of processor congestion during video and audio decompression. FIG. 3 illustrates video and audio decompression when the processor has sufficient processing power. FIG. 4 illustrates potential effects on video and audio decompression when the processor is overloaded or congested.
FIG. 3 shows the display timing of successive images I1 through I6 corresponding to compressed video frames VF1 through VF6, respectively, after video frames VF1 through VF6 have been decompressed. FIG. 3 also shows the timing of generating audio intervals A1 through A6 corresponding to compressed frames of audio data AF1 through AF6, respectively, after frames of audio data AF1 through AF6 have been decompressed. It is noted that various transmission formats may be used, and the number of audio frames and video frames may be unequal in some transmission formats. When the processor has sufficient processing power (i.e., the processor is not congested), successive images I1 through I6 are displayed on a display screen at time intervals in general compliance with digital video transmission scheduling standards, thereby creating a continuous and artifact-free sequence of displayed images. Likewise, when the processor has sufficient processing power, successive intervals of audio A1 through A6 are generated at time intervals in general compliance with digital video transmission scheduling standards, thereby creating continuous and artifact-free audio. Audio artifacts occur when a noticeable time gap occurs between the generation of audio corresponding to any two consecutive frames of audio data.
FIG. 4, as noted above, illustrates the effects on video and audio decompression when the processor experiences congestion. With respect to video decompression, when the processor is congested, the scheduled decoding, converting, or rendering of one or more compressed frames of video data VF1 through VF6 may be delayed, which, in turn, may delay the display of one or more corresponding images I1 through I6, as shown in FIG. 4. Likewise, if the processor is congested, the scheduled decoding of one or more compressed frames of audio data AF1 through AF6 may be delayed, which, in turn, delays the generation of one or more corresponding audio intervals A1 through A6, as shown in FIG. 4. The delay in audio generation manifests itself as audio breakup. It is noted that digital video transmission participants are far more sensitive to audio breakups than to video artifacts.
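The timing contrast between FIGS. 3 and 4 can be reproduced with a small simulation. This is a sketch under stated assumptions: times are integers in milliseconds, each frame plays for a fixed duration, the speaker starts a frame as soon as it is decoded and the previous frame has finished, and `threshold` is a hypothetical stand-in for the "noticeable" gap, which the text does not quantify. The names `playback_schedule` and `audible_gaps` are hypothetical.

```python
def playback_schedule(decode_done, frame_duration):
    """Return (start, end) times of the audio generated for each decoded frame.

    Assumes the speaker starts a frame as soon as it is decoded and the
    previous frame has finished playing."""
    schedule = []
    prev_end = None
    for done in decode_done:
        start = done if prev_end is None else max(done, prev_end)
        end = start + frame_duration
        schedule.append((start, end))
        prev_end = end
    return schedule


def audible_gaps(schedule, threshold):
    """Return (index, gap) for each frame whose audio starts more than
    `threshold` after the previous frame's audio ended."""
    gaps = []
    for k in range(1, len(schedule)):
        gap = schedule[k][0] - schedule[k - 1][1]
        if gap > threshold:
            gaps.append((k, gap))
    return gaps
```

With 20 ms frames and decode completion times `[0, 15, 35, 80, 95, 110]` (the fourth frame delayed by congestion), `audible_gaps(playback_schedule(...), 5)` returns `[(3, 20)]`: a 20 ms silence before A4, i.e., an audio breakup of the kind shown in FIG. 4.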
Disclosed is a method and system for detecting congestion during decompression of a stream of video and audio data. The system and method include receiving and decoding first audio data. Thereafter, a first audio time stamp ATS1 is generated, which indicates the time at which the decoding of the first audio data has finished. Subsequently, second audio data is received and decoded. A second audio time stamp ATS2 is also generated, indicating the time at which the decoding of the second audio data has finished. The first audio time stamp ATS1 is added to a predetermined amount of time T, and the result is compared with ATS2. T, in one embodiment, is the time it takes a speaker to generate audio corresponding to a decoded frame of audio data. If ATS2 is later in time than (ATS1+T) by a predetermined amount TMIN, a corresponding signal is generated. The signal can be used to indicate that received audio data are not being decoded fast enough. If ATS2 is not later in time than (ATS1+T) by the predetermined amount TMIN, a corresponding signal is generated which can be used to indicate that received audio data are being decoded fast enough.
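The comparison of ATS2 against (ATS1+T) can be sketched as follows. The sketch assumes that the check is applied to every pair of consecutive decoded frames (the text describes a single pair), that time stamps are integer milliseconds, and that "later by TMIN" means a lateness of at least TMIN. The class `AudioCongestionMonitor` and its method `on_frame_decoded` are hypothetical names; in practice the time stamp might come from a monotonic clock such as Python's `time.monotonic()`.

```python
class AudioCongestionMonitor:
    """Compares each audio time stamp with the previous one plus T, allowing TMIN of slack."""

    def __init__(self, frame_duration, t_min):
        self.t = frame_duration  # T: time the speaker needs to play one decoded frame
        self.t_min = t_min       # TMIN: lateness tolerated before congestion is signalled
        self.prev_ats = None     # time stamp of the previously decoded frame

    def on_frame_decoded(self, ats):
        """Record the time stamp at which a frame finished decoding.

        Returns None for the first frame (nothing to compare against),
        True if ats is later than prev_ats + T by at least TMIN (decoding
        is too slow), and False otherwise (decoding keeps up)."""
        if self.prev_ats is None:
            result = None
        else:
            result = ats - (self.prev_ats + self.t) >= self.t_min
        self.prev_ats = ats
        return result
```

With T = 20 ms and TMIN = 5 ms, feeding the time stamps `0, 20, 38, 70` yields `None, False, False, True`: the last frame finished decoding 12 ms after (ATS1+T), so the monitor signals that audio data are not being decoded fast enough.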