1. Field of the Invention
The present invention relates generally to an audio/video decoding system which conforms to the MPEG (Moving Picture Experts Group) standards. More particularly, this invention relates to an audio/video decoding system that synchronizes the output of audio and video data to compensate for the time delays that arise when decoding audio and video streams according to MPEG standards.
2. Description of the Related Art
Widespread use of multimedia recording formats in personal computer, business and home entertainment systems has highlighted the need to process digitally recorded video and audio information at increasingly faster rates. This need for faster information processing has been accompanied by developments in data compression and expansion techniques that directly affect processing speed. Many types of multimedia recording formats in fact utilize data compression and expansion to enhance processing speeds. The "MPEG" standards are one popular type of standard that defines and governs data compression and expansion techniques. Current MPEG standards are continuing to be established by the MPEG Committee (ISO/IEC JTC1/SC29/WG11) under the ISO (International Organization for Standardization)/IEC (International Electrotechnical Commission).
The MPEG standard consists of three parts. In part 1 (ISO/IEC IS 11172-1), the MPEG defines a "system" or way of synchronizing and multiplexing the video and audio data. In part 2 (ISO/IEC IS 11172-2), the MPEG defines video standards that govern the format and encoding process for video data and govern the decoding process for a video bitstream. In part 3 (ISO/IEC IS 11172-3), the MPEG defines audio standards that govern the format and encoding process for audio data and govern the decoding process for an audio bitstream.
At present, there are two MPEG standards, MPEG-1 and MPEG-2, which differ from each other principally in the rate at which video and audio data are encoded. Video data, handled according to MPEG video standards, contains dynamic images, each of which consists of several tens (e.g., 30) of frames per second. This video data has a six-layered structure: a sequence of Groups of Pictures (GOPs), the individual GOPs, a plurality of pictures within each GOP, a collection of slices within each picture, macroblocks within each slice, and a plurality of blocks within each macroblock. In the MPEG-1, frames are associated with pictures. In the MPEG-2, a frame or a field may but need not be associated with a picture. That is, MPEG-1 standards utilize only the frame structure, whereas MPEG-2 standards may utilize both frame and field structures. Two fields constitute one frame. The structure where a frame is associated with a picture is called a frame structure, whereas the structure where a field is associated with a picture is called a field structure.
In the MPEG, a compression technique called inter-frame coding is employed. The inter-frame coding compresses inter-frame data based on a chronological correlation of frames. A bidirectional prediction is performed to obtain the inter-frame coding. This bidirectional prediction uses both the forward prediction for predicting a current (or present) reproduction image from a past reproduction image (or picture) and the backward prediction for predicting a current reproduction image from a future reproduction image.
The bidirectional prediction defines three types of pictures called "I (Intra-coded) picture, P (Predictive-coded) picture and B (Bidirectionally predictive-coded) picture". An I picture is produced independently irrespective of a past or future reproduction image. A P picture is produced by forward prediction (prediction from a past I picture or P picture). A B picture is produced by the bidirectional prediction. In the bidirectional prediction, a B picture is produced by one of the following three predictions:
(1) Prediction from a past I picture or P picture;
(2) Prediction from a future I picture or P picture; and
(3) Prediction from past and future I pictures or P pictures.
Each I picture is produced without a past picture or a future picture, whereas every P picture is produced with a past picture and every B picture is produced with a past or future picture. The individual I, P and B pictures are then separately encoded.
In the inter-frame prediction, an I picture is periodically produced first. Then, a frame several frames ahead of the I picture is produced as a P picture. This P picture is produced by the prediction in one direction from the past to the future (i.e., in a forward direction). Next, a frame located after the I picture and before the P picture is produced as a B picture. At the time this B picture is produced, an optimal prediction scheme is selected from among the forward, backward and bidirectional prediction schemes.
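The production order described above can be illustrated with a minimal sketch. It assumes a simplified scheme in which every B picture is emitted immediately after the next I or P picture that follows it in display order; the function name and the sample picture pattern are hypothetical and are not taken from the MPEG standards.

```python
def display_to_coded_order(display):
    """Reorder pictures so that each anchor (I or P picture) precedes
    the B pictures that lie before it in display order."""
    coded = []
    pending_b = []  # B pictures waiting for their future anchor picture
    for index, picture_type in enumerate(display):
        if picture_type == "B":
            pending_b.append((index, picture_type))
        else:
            # An I or P picture is produced first, then the B pictures
            # located between it and the previous anchor.
            coded.append((index, picture_type))
            coded.extend(pending_b)
            pending_b.clear()
    coded.extend(pending_b)  # trailing B pictures with no future anchor
    return coded


# Hypothetical display-order pattern of picture types.
display = ["I", "B", "B", "P", "B", "B", "P"]
coded = display_to_coded_order(display)
print(coded)
```

Each tuple holds the display position and the picture type, so the result shows which pictures are produced out of display order.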
In general, a current image differs only slightly from its preceding and succeeding images in a consecutive dynamic image. It is assumed that the previous frame (e.g., I picture) and the next frame (e.g., P picture) are virtually the same. Should a slight difference exist between consecutive frames (i.e., in the B picture) the difference is extracted and compressed. Accordingly, inter-frame data can be compressed based on the chronological relationship between consecutive frames.
A series of encoded video data in accordance with MPEG video standards, as described above, is called an MPEG video bitstream (hereinafter simply referred to as "video stream"). Likewise, a series of encoded audio data in accordance with MPEG audio standards is called an MPEG audio bitstream (hereinafter simply referred to as "audio stream"). The video stream and audio stream are multiplexed in interleave fashion in accordance with the MPEG system standards as a series of data or MPEG system stream (sometimes referred to as a multiplex stream). The MPEG-1 is mainly associated with a storage medium such as a CD-ROM (Compact Disc-Read Only Memory), and the MPEG-2 incorporates many of the MPEG-1 standards and is used in a wide range of applications.
The following describes the flow from the encoding process. The MPEG encoding system first separately encodes video data and audio data to produce a video stream and an audio stream. Next, a multiplexer (MUX) incorporated in the MPEG encoding system multiplexes the video stream and audio stream in a way that matches the format of a transfer medium or a recording medium, thus producing a system stream. The system stream is either transferred from the MUX via the transfer medium or is recorded on the recording medium.
A demultiplexer (DMUX) incorporated in the MPEG decoding system separates the system stream into a video stream and an audio stream. The decoding system separately decodes the individual streams to produce a decoded video output (hereinafter called "video output") and a decoded audio output (hereinafter called "audio output"). The video output is sent to a display, and the audio output is sent to a loudspeaker via a D/A (Digital/Analog) converter and a low-frequency amplifier.
A system stream consists of a plurality of packs each having a plurality of packets. Each packet includes a plurality of access units, which are the units for the decoded reproduction. Each access unit corresponds to a single picture for a video stream, and corresponds to a single audio frame for an audio stream.
The encoding system affixes a pack header to the head of a pack and a packet header to the head of a packet. The pack header includes reference information, such as a reference time for synchronous reproduction called an SCR (System Clock Reference). Here, "reproduction" means the output of video and audio to an external unit.
The packet header includes information that identifies whether the subsequent data is video data, audio data or time stamp (TS) information. This TS information is used by the decoder to manage time during the decoding process. The packet length depends on the transfer medium and the application. For example, there is a short packet length of 53 bytes as in an ATM (Asynchronous Transfer Mode) and a long packet length of 4096 bytes for a CD-ROM. The upper limit of the packet length is set to 64 kbytes.
For instance, data is recorded on a CD-ROM continuously in units called sectors. Data is read from the CD-ROM by a CD-ROM player at a constant speed of 75 sectors per second. Each sector on a CD-ROM corresponds to one pack, which is the same as a packet in this case. When the head of an access unit is present in a packet, the encoding system affixes a TS to the packet header corresponding to that access unit. When the head of an access unit does not exist in a packet, the encoding system does not affix a TS. When the heads of two or more access units are present in a packet, the system encoder affixes a TS to the packet header corresponding to the first access unit.
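The three cases for affixing a TS can be expressed as a short sketch. The model is an assumption made for illustration only: each packet is represented by the list of time stamps belonging to the access units whose heads fall inside it, in stream order, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def timestamp_for_packet(access_unit_heads: Sequence[int]) -> Optional[int]:
    """Return the TS to affix to a packet header, or None if no TS is affixed.

    access_unit_heads holds the time stamps of the access units whose heads
    are present in the packet, in stream order (hypothetical model).
    """
    if not access_unit_heads:
        # No access unit begins in this packet: no TS is affixed.
        return None
    # One or more heads are present: the TS of the first access unit is used.
    return access_unit_heads[0]


print(timestamp_for_packet([]))              # no head in the packet
print(timestamp_for_packet([9000]))          # exactly one head
print(timestamp_for_packet([9000, 12000]))   # two heads, first one wins
```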
There are two types of TS's: PTS (Presentation Time Stamp) and DTS (Decoding Time Stamp). The decoding standards are specified by a virtual reference decoder called an STD (System Target Decoder) in the MPEG system part. The reference clock for the STD is a sync signal called an STC (System Time Clock).
PTS information is used to manage reproduction output timing, and has a precision determined by its 32-bit length and the timing of a 90-kHz clock. When the value of the PTS matches that of the STC, the audio/video decoding system decodes the access unit to which the PTS is affixed and produces the reproduction output of sector data.
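As a rough illustration of the timing arithmetic, the following sketch converts PTS values counted with a 90-kHz clock into seconds and tests whether an access unit is due for reproduction. The function names are hypothetical, and the equality test simply mirrors the "PTS matches STC" condition stated above.

```python
PTS_CLOCK_HZ = 90_000  # the 90-kHz clock used to count PTS values


def pts_to_seconds(pts_ticks: int) -> float:
    """Convert a PTS value in 90-kHz ticks to seconds."""
    return pts_ticks / PTS_CLOCK_HZ


def is_due(pts_ticks: int, stc_ticks: int) -> bool:
    """An access unit is reproduced when its PTS matches the STC."""
    return pts_ticks == stc_ticks


print(pts_to_seconds(45_000))   # half a second of 90-kHz ticks
print(is_due(45_000, 44_999))   # STC has not yet reached the PTS
print(is_due(45_000, 45_000))   # values match, so the unit is reproduced
```

At 30 frames per second, consecutive pictures are 3000 ticks apart, since 90,000 / 30 = 3000.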
Since the inter-frame coding scheme is used for video, an I picture and a P picture are sent out in a video stream before a B picture. When receiving the video stream, therefore, the audio/video decoding system rearranges the pictures in an order based on the picture header affixed to each picture in that video stream, thus yielding a video output. The DTS is information for managing the decoding start time after the rearrangement of pictures. The encoding system affixes both the PTS and DTS to the packet header when both stamps differ from each other. The encoding system affixes only the PTS to the packet header when both stamps match. For example, in a video stream containing a B picture, a packet containing an I picture and a P picture would be affixed with both the PTS and DTS while a packet containing a B picture would be affixed with only the PTS. For a video stream having no B pictures, only the PTS would be affixed to the packet header.
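The rule for choosing which stamps accompany a packet can be sketched as follows. The function name, the dictionary layout and the tick values are hypothetical; only the comparison between the two stamps comes from the description above.

```python
def stamps_to_affix(pts: int, dts: int) -> dict:
    """Return the time stamps placed in the packet header.

    Both stamps are affixed when they differ; only the PTS is affixed
    when the two values match.
    """
    if pts == dts:
        return {"PTS": pts}
    return {"PTS": pts, "DTS": dts}


# An I or P picture that is decoded before it is presented (values assumed).
anchor_picture = stamps_to_affix(pts=12_000, dts=3_000)
# A B picture whose decoding and presentation times coincide (values assumed).
b_picture = stamps_to_affix(pts=6_000, dts=6_000)
print(anchor_picture)
print(b_picture)
```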
The SCR is information for setting or correcting the value of the STC to a value intended by the encoding system. The precision of the SCR is expressed by a 32-bit value measured with a 90-kHz clock according to MPEG-1 standards and is expressed by a 42-bit value measured with a 27-MHz clock according to MPEG-2 standards. The SCR is transferred in 5-byte segments under MPEG-1 standards and in 6-byte segments under MPEG-2 standards. When the last byte is encoded, the decoding system sets the STC in accordance with the value of the SCR.
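How an SCR corrects a free-running STC can be pictured with the small model below. The class and method names are hypothetical, the clock rate is passed in as a parameter, and the example uses the 90-kHz rate given above for MPEG-1; the 45-tick drift is an invented figure.

```python
class SystemTimeClock:
    """Toy model of an STC that is set or corrected by received SCR values."""

    def __init__(self, clock_hz: int):
        self.clock_hz = clock_hz
        self.ticks = 0

    def advance(self, seconds: float) -> None:
        """Let the local clock run freely for the given time."""
        self.ticks += round(seconds * self.clock_hz)

    def load_scr(self, scr_ticks: int) -> int:
        """Set the STC to the SCR value and return the correction applied."""
        correction = scr_ticks - self.ticks
        self.ticks = scr_ticks
        return correction


stc = SystemTimeClock(clock_hz=90_000)
stc.advance(1.0)                      # local clock counts one second
correction = stc.load_scr(90_045)     # encoder intended a slightly later value
print(stc.ticks, correction)
```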
FIG. 13A shows one example of a system stream. One pack consists of a pack header H and individual packets V1, V2, A1, . . . , V6 and V7. The packets include the individual packets V1 to V7 of video data and the individual packets A1 to A3 of audio data. While the video data packets and the audio data packets are each arranged in numerically increasing order, the sequential placement of individual audio and video data packets varies within the pack. For example, the audio data packet A1 follows the video data packets V1 and V2, the video data packet V3 follows this packet A1, and then the audio data packets A2 and A3 follow the packet V3. In this case, an SCR is affixed to the pack header H, the PTS(V1) is affixed to the packet header of the packet V1, the PTS(A1) is affixed to the packet header of the packet A1, and the PTS(V6) is affixed to the packet header of the packet V6. Therefore, the packets V1 to V5 constitute an access unit α as shown in FIG. 13B, the packets A1 to A3 constitute an access unit β as shown in FIG. 13C, and the packets V6 and V7 constitute an access unit γ as shown in FIG. 13D. In this case, the access units α and γ each correspond to a single picture, and the access unit β corresponds to a single audio frame. The DTS is not shown in FIGS. 13A through 13D.
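The way the packets of FIG. 13A fall into access units can be reproduced with a short sketch. It assumes, as a simplification, that each access unit begins exactly at a packet carrying a PTS; the function name and the tuple representation are hypothetical. The video and audio packets are listed per stream in their numerically increasing order.

```python
def group_access_units(packets):
    """Group the packets of one stream into access units.

    packets is a list of (name, pts) tuples, where pts is None when no PTS
    is affixed. A packet carrying a PTS starts a new access unit.
    """
    units = []
    for name, pts in packets:
        if pts is not None or not units:
            units.append({"pts": pts, "packets": []})
        units[-1]["packets"].append(name)
    return units


video = [("V1", "PTS(V1)"), ("V2", None), ("V3", None), ("V4", None),
         ("V5", None), ("V6", "PTS(V6)"), ("V7", None)]
audio = [("A1", "PTS(A1)"), ("A2", None), ("A3", None)]

video_units = group_access_units(video)
audio_units = group_access_units(audio)
print(video_units)
print(audio_units)
```

The video stream yields two access units (V1 to V5, and V6 to V7) and the audio stream yields one (A1 to A3), matching FIGS. 13B through 13D.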
FIG. 14 shows a block circuit of a conventional audio/video decoding system 111.
The audio/video decoding system 111 comprises an MPEG audio decoder 112, an MPEG video decoder 113 and an audio video parser (AV parser) 114. The AV parser 114 includes a demultiplexer (DMUX) 115.
The AV parser 114 receives a system stream sent from an external unit (e.g., a reader for a recording medium such as a video CD). The DMUX 115 separates the system stream into a video stream and an audio stream based on the packet header in the system stream. More specifically, the system stream shown in FIG. 13A is separated into a video stream consisting of the video data packets V1 to V7 and an audio stream consisting of the audio data packets A1 to A3.
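The separation performed by the DMUX can be sketched as below. The packet representation with a "stream" field standing in for the packet header is hypothetical, and only the first six packets of FIG. 13A, whose order is stated explicitly, are used; the remaining packets are omitted rather than guessed.

```python
def demultiplex(packets):
    """Split interleaved packets into video and audio streams by header type."""
    streams = {"video": [], "audio": []}
    for packet in packets:
        # The "stream" field models the identification carried in the
        # packet header (assumed representation).
        streams[packet["stream"]].append(packet["name"])
    return streams


# Leading packets of the pack in FIG. 13A, in their stated order.
pack = [
    {"name": "V1", "stream": "video"},
    {"name": "V2", "stream": "video"},
    {"name": "A1", "stream": "audio"},
    {"name": "V3", "stream": "video"},
    {"name": "A2", "stream": "audio"},
    {"name": "A3", "stream": "audio"},
]
streams = demultiplex(pack)
print(streams)
```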
The AV parser 114 extracts the SCR, the PTS for audio (hereinafter called "PTS(A)") and the PTS for video (hereinafter called "PTS(V)") from the system stream. The AV parser 114 outputs the audio stream, the SCR and the PTS(A) to the audio decoder 112, and outputs the video stream, the SCR and the PTS(V) to the video decoder 113.
The audio decoder 112 decodes the audio stream in accordance with the MPEG audio portion in order to produce an audio output. The video decoder 113 decodes the video stream in accordance with the MPEG video portion in order to produce a video output. The video output is sent to a display 116, while the audio output is sent to a loudspeaker 118 via an audio player 117, which has a D/A converter and a low-frequency amplifier.
The audio decoder 112 and the video decoder 113 perform synchronous reproduction of the audio output and the video output based on the associated SCRs and PTS's. That is, the audio decoder 112 times the audio reproduction to produce audio output based on the SCR and the PTS(A), namely PTS(A1). The audio decoder then starts reproducing the access unit β at time t2 as shown in FIG. 13C. The video decoder 113 times the video reproduction to produce video output based on the SCR and the PTS(V), namely PTS(V1) and PTS(V6). The video decoder then reproduces the access units α and γ at the respective times t1 and t3 as shown in FIGS. 13B and 13D. The reproduction timings of the audio output by the audio decoder 112 and the video output by the video decoder 113 are determined separately according to the respective time stamps PTS(A) and PTS(V).
In the synchronous reproduction of an audio output and a video output, "lip sync" should be considered. "Lip sync" refers to the synchronization between the movement of the mouth of a person appearing on the display and the accompanying audio. When the sound of the voice is heard earlier or later than the movement of the mouth, a lip sync error occurs. Such an error, however, is not significant if it is below the limit perceptible to the human ear. If the lip sync error is above the limit of human audio sensitivity, a listener/viewer will notice the discrepancy. Generally, the sensible limit of the lip sync error is said to be about several milliseconds.
The audio/video decoding system 111 shown in FIG. 14 cannot reliably prevent lip sync error because the decoding time of the STD (reference decoder) or the internal delay time in the STD is assumed to be zero. The actual decoding times of the audio decoder 112 and the video decoder 113, however, are not zero, although they are very short. The decoding times (i.e., internal delay times) of the decoders 112 and 113 differ from each other, and also differ depending on the amount of data in an access unit to be processed. For instance, the number of packets forming each of the access units α to γ shown in FIGS. 13B to 13D will normally differ among various access units. Because individual packets do not always share the same packet length, the amounts of data in the access units α to γ will normally differ.
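The effect of nonzero internal delays can be made concrete with a small calculation. The sketch assumes that each decoder begins decoding when the PTS matches the STC, so that its output appears one internal delay later; the function name and the delay figures are invented for illustration and do not describe any particular decoder.

```python
CLOCK_HZ = 90_000  # time stamps are counted with a 90-kHz clock


def lip_sync_error_ms(pts_video, delay_video, pts_audio, delay_audio):
    """Return how far the actual audio/video offset departs from the
    offset intended by the time stamps, in milliseconds.

    All arguments are in 90-kHz ticks. A positive result means the video
    output lags the audio output by more than the encoder intended.
    """
    intended_offset = pts_video - pts_audio
    actual_offset = (pts_video + delay_video) - (pts_audio + delay_audio)
    return (actual_offset - intended_offset) * 1000 / CLOCK_HZ


# Assumed internal delays: 20 ms for video (1800 ticks), 5 ms for audio (450 ticks).
error = lip_sync_error_ms(pts_video=90_000, delay_video=1_800,
                          pts_audio=90_000, delay_audio=450)
print(error)
```

With equal internal delays the error vanishes, which is why the zero-delay assumption of the STD conceals the problem.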
As a solution to the above shortcomings, a method has been proposed which synchronizes the video output with the audio output by delaying either the video or audio output based on the computed difference between the time stamps PTS(V) and PTS(A). This method requires a delay memory for delaying the video output or the audio output, thus increasing circuit scale and cost. Moreover, delay memory control presents considerable difficulties. Unfortunately, if the AV parser 114 assumes control over that function, the software demands on the AV parser 114 increase to such an extent as to affect the proper operation of the AV parser 114.
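To give a sense of why a delay memory enlarges the circuit, the following sketch estimates the storage needed to hold back an audio output for a given time. Every figure here is an assumption introduced for illustration: the delayed output is taken to be uncompressed audio at 44.1 kHz, 16 bits, two channels, and the function name is hypothetical.

```python
def delay_memory_bytes(delay_ms: int,
                       sample_rate_hz: int = 44_100,
                       channels: int = 2,
                       bytes_per_sample: int = 2) -> int:
    """Return the bytes needed to buffer delay_ms of audio output,
    rounded up to a whole byte (all defaults are assumed values)."""
    total = sample_rate_hz * channels * bytes_per_sample * delay_ms
    return -(-total // 1000)  # ceiling division by 1000 ms per second


print(delay_memory_bytes(20))    # buffer for a 20 ms delay
print(delay_memory_bytes(100))   # buffer for a 100 ms delay
```

Delaying the video output instead would require buffering decoded pictures, which is considerably larger; that case is not modeled here.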