For storage on or distribution via such media as CD-ROMs, laser disks (LDs), video tapes, magneto-optical (MO) storage media, digital compact cassette (DCC), terrestrial or satellite broadcasting, cable systems, fibre-optic distribution systems, telephone systems, ISDN systems etc., video and audio signals are compressed and coded, and the resulting video stream and audio stream are then multiplexed to provide a bit stream for feeding to the medium. The bit stream is later reproduced from the medium, is demultiplexed, and the resulting video stream and audio stream are decoded and expanded to recover the original audio and video signals.
Two of the main international standards related to compressing audio and video signals for storage on or distribution via a medium are those known as MPEG-1 and MPEG-2. These standards have been established by the Moving Picture Experts Group (MPEG) operating under the auspices of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC).
The MPEG standards are established under the assumption that they will be used in a wide range of applications. As a result, the standards allow for such possibilities as a phase-locked system, in which the sampling rate clock of the audio signal is phase locked to the same system clock reference (SCR) as the frame rate clock of the video signal, and a non phase-locked system, in which the sampling rate clock of the audio system and the frame rate clock of the video system operate independently. Irrespective of whether the system is phase locked, the MPEG standards require the addition of a time stamp to the multiplexed bit stream at least once every 0.7 s, and that the encoder provide separate time stamps for use by the audio decoder and by the video decoder.
One of the aims of the MPEG standards is to provide maximum flexibility for encoder and decoder design while ensuring that the bit stream provided by any encoder can be successfully decoded by any decoder. One of the ways in which this compatibility is established is by the concept of the System Target Decoder.
A typical audio and video signal processing system 110 according to the MPEG-1 and MPEG-2 standards is shown in FIG. 1. In this, the encoder 100 receives the video signal S2 from the video signal storage medium 2, and receives the audio signal S3 from the audio signal storage medium 3. The audio signal S3 could alternatively be (and is more usually) also received from the video signal storage medium 2 instead of from a separate audio storage medium.
The encoder 100 compresses and codes the video and audio signals, and multiplexes the resulting audio stream and video stream to provide the multiplexed bit stream S100, which is fed for storage or distribution by the medium 5. The medium can be any medium suitable for storing or distributing a digital bit stream, for example, a CD-ROM, a laser disk (LD), a video tape, a magneto-optical (MO) storage medium, a digital compact cassette (DCC), a terrestrial or satellite broadcasting system, a cable system, a fibre-optic distribution system, a telephone system, an ISDN system, etc.
The encoder 100 compresses and codes the video signal picture-by-picture. Each picture of the video signal is compressed in one of three compression modes. A picture compressed in the intra-picture compression mode is called an I-picture. In the intra-picture compression mode, the picture is compressed by itself without reference to other pictures of the video signal. Pictures compressed in the inter-picture compression mode are called P-pictures or B-pictures. A P-picture is compressed using forward prediction coding using as a reference picture a previous I-picture or P-picture, i.e., a picture occurring earlier in the video signal. Each block of a B-picture may use as a reference block any one of the following: a block of a previous I-picture or P-picture, a block of a following P-picture or I-picture (i.e., a picture occurring later in the video signal), or a block obtained by performing linear processing on a block of a previous I-picture or P-picture and a block of a following I-picture or P-picture. In addition, blocks of a B-picture may be compressed in the intra-picture compression mode. Typically, about 150 kbits (kb; 1 kb=1024 bits) of the video stream are required for an I-picture, 75 kb of the video stream are required for a P-picture, and 5 kb of the video stream are required for a B-picture.
The digital video and audio processing system 110 also includes the decoder 600, which receives as its input signal the bit stream S5 from the medium 5. The decoder performs demultiplexing inverse to the multiplexing performed in the encoder 100. The decoder also applies decoding and expansion to the resulting audio stream and video stream using processing complementary to that performed by the encoder 100 to provide the recovered video signal 6A and the recovered audio signal 6B. The recovered video signal 6A and the recovered audio signal 6B respectively closely match the video signal S2 and the audio signal S3 fed into the encoder 100.
FIG. 1 also shows the system target decoder (STD) 400 which is used to define the processing performed by the encoder 100 and the decoder 600. In practical video and audio signal processing systems, the encoder seldom includes an actual system target decoder, but instead performs the encoding processing and multiplexing taking account of the system target decoder parameters. Also, in practical systems, the decoder is designed to have performance equalling or exceeding that of the system target decoder. These relationships between the system target decoder and the encoder and the decoder are indicated in FIG. 1 by the broken line labelled S4A interconnecting the system target decoder and the encoder, and the broken line labelled S4B interconnecting the system target decoder and the decoder.
The system target decoder 400 is also known as a hypothetical system target decoder, system reference decoder, or reference decoding processing system. From now on it will be referred to as a system target decoder.
System target decoders are defined in international standard specifications such as CCITT H.261 and the MPEG-1 standard to provide guidelines for the designers of video and audio encoders and decoders for these standards.
In the MPEG-1 system standard, the system target decoder includes a reference video decoder and a reference audio decoder. In addition, the system target decoder includes an input buffer for the reference video decoder and an input buffer for the reference audio decoder. The size of each input buffer is defined in the standard. The standard also defines the operation of the two reference decoders, especially with regard to the way in which they remove the audio stream and the video stream from their respective buffers.
The concept of the system target decoder provides compatibility between encoders and decoders of different designs as follows. All encoders are designed to provide a bit stream that can be successfully decoded by the system target decoder, and that does not cause the respective input buffers in the system target decoder to overflow or underflow. In addition, all decoders are designed to have performance parameters that are equal to or better than those defined for the system target decoder. As a result, all such decoders will be capable of successfully decoding the bit stream produced by any of the encoders designed to produce a bit stream capable of being decoded by the system target decoder. The bit stream produced for decoding by the system target decoder is called a "constraint system parameter stream."
The structure of the hypothetical system target decoder 400 shown in FIG. 1 is as follows. The demultiplexer 401 notionally receives the bit stream S100 from the encoder 100. The demultiplexer 401 demultiplexes the bit stream into a video stream and an audio stream. The video stream is fed to the video input buffer 402, the output of which is connected to the video decoder 405. The audio stream from the demultiplexer 401 is fed into the audio input buffer 403, the output of which is connected to the audio decoder 406. In the example shown in FIG. 1, the video input buffer 402 has a storage capacity of 46 k bytes and the audio input buffer 403 has a storage capacity of 4 k bytes, as specified by the MPEG-1 standard. The video decoder 405 removes the video stream from the video input buffer 402 one video access unit at a time, i.e., one picture at a time, at a timing corresponding to the picture rate of the video signal, e.g., once every 1/29.97 seconds in an NTSC system. The amount of the video stream removed from the video input buffer for each picture varies because of the different amount of compression applied to each picture. The audio decoder 406 removes the audio stream from the audio input buffer 403 one audio access unit at a time, at a predetermined timing.
It is desirable from the standpoint of the construction of the system, and to maximize flexibility, that, in the real decoder 600, the element corresponding to the demultiplexer 401 in the STD include a switching circuit, and that the elements corresponding to the video decoder 405 and the audio decoder 406 in the STD be provided using a high-speed processor (DSP) having a configuration suitable for performing high-speed signal processing operations. Such processors normally cannot include a large amount of storage for cost reasons. Therefore, the MPEG standards take these practical considerations into account and set the storage capacities of the video input buffer 402 and the audio input buffer 403 to the relatively small values set forth above.
FIG. 2 shows the structure of the constraint parameter (multiplex) system bit stream CPSP that is notionally fed into the system target decoder 400. The bit stream shown in FIG. 2 has a multi-layer structure, and includes various headers in a multiplex layer and the audio stream and the video stream in a signal layer. In this structure, plural packs are serially arranged in time. Each pack begins with a pack header, and includes at least one video packet and at least one audio packet. Each video packet begins with a packet header and includes the video stream of at least part of at least one picture. One video packet will accommodate the video stream of more than one B-picture, but several video packets are required to accommodate the video stream of one I-picture. There is no requirement that a picture begin immediately after the packet header: the picture may start at any point in the video packet.
Each video packet header may include at least one video time stamp showing the presentation time of the first picture that begins in the packet. If the first picture is an I-picture or a P-picture, and its decoding time differs from its presentation time, a decoding time stamp may also be included. The purpose and use of the video time stamps will be described below.
Each audio packet includes at least one audio access unit of the audio stream, and begins with an audio packet header. The audio packet header may include a presentation time stamp showing the output timing of the audio signal obtained by decoding the first audio access unit beginning in the audio packet. Each audio access unit is about 384 bytes in MPEG-1.
FIG. 2 shows a video packet that includes the video stream of the end of the picture i, and the video stream of at least the beginning of the picture i+1. The video time stamp vts included in the video packet header shown is the video time stamp of the picture i+1, because the picture i+1 is the first picture that begins in the video packet. FIG. 2 also shows the audio packet that includes the audio signal of the end of the access unit j, and the audio signal of the access units j+1 and j+2. The audio time stamp ats included in the audio packet header is the time stamp of the audio access unit j+1, because the access unit j+1 is the first access unit that begins in the audio packet.
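The rule that a packet's time stamp belongs to the first access unit beginning in that packet can be illustrated with a short Python sketch. The function name stamped_unit, the use of byte offsets within the elementary stream, and the example offsets are all assumptions made for illustration; none of them is taken from the MPEG standards.

```python
from bisect import bisect_left

def stamped_unit(packet_start, packet_end, unit_starts):
    """Return the index of the first access unit that begins inside
    the half-open byte range [packet_start, packet_end), or None if
    no access unit begins in the packet.

    unit_starts holds the byte offset at which each access unit
    begins and must be sorted in ascending order (hypothetical
    representation chosen for this sketch).
    """
    k = bisect_left(unit_starts, packet_start)
    if k < len(unit_starts) and unit_starts[k] < packet_end:
        return k
    return None

# Hypothetical layout: access units begin at offsets 0, 900, 1300, 1700.
unit_starts = [0, 900, 1300, 1700]

# A packet carrying bytes 500..1499 holds the end of unit 0 and the
# beginnings of units 1 and 2, so its time stamp refers to unit 1.
print(stamped_unit(500, 1500, unit_starts))  # prints 1
```

A packet in which no access unit begins (for example, one lying wholly inside a large I-picture) yields None in this sketch, which corresponds to a packet header carrying no time stamp.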
The encoder 100 compresses and codes the video signal S2 and at least codes the audio signal S3 to provide a video stream and an audio stream, respectively, and multiplexes the audio stream, the video stream, and the various headers to provide the multiplexed bit stream S100 having the format shown in FIG. 2. The encoder feeds the multiplexed bit stream to the medium 5 for transmission or storage. The multiplexed bit stream is such that, if the encoder had fed the multiplexed bit stream to the system target decoder 400 for decoding, the system target decoder would have decoded the multiplexed bit stream successfully, and no overflow or underflow would have occurred in either of the input buffers in the system target decoder.
Because of the requirement that the multiplexed bit stream S100 be capable of being successfully decoded by the system target decoder 400, the encoder 100 applies a dynamically-varying compression and coding processing to at least the video signal S2. The compression ratio of the compression applied by the encoder 100 varies with time. Moreover, since the amount of the video stream that can be used to represent a picture of the video signal S2 depends on the occupancy of the video input buffer of the system target decoder at the instant that the picture is compressed, the amount of compression applied to a given picture varies dynamically. The amount of the video stream derived from a given video sequence will differ if the given video sequence is processed on different occasions. Accordingly, the compression ratio of at least the video stream produced by the encoder 100 varies constantly.
As shown above, the audio stream and the video stream are time multiplexed to provide the multiplexed bit stream S100. The audio stream of the audio signal belonging to a given picture of the video signal is located in the multiplexed bit stream some time earlier or later than the video stream of the picture. As a result of this, the decoder 600 must provide timing synchronization between the recovered video signal produced by expanding the video stream, and the recovered audio signal produced by expanding the audio stream. To provide this synchronization, the MPEG standard stipulates that the encoder add the above-mentioned time stamps to at least some of the video packet headers and the audio packet headers. The video time stamps and the audio time stamps show timings prescribing the clocks to be used to perform synchronized decoding of the video stream and the audio stream. The video time stamps and the audio time stamps also show the times at which units (i.e., pictures) of the recovered video signal and units of the recovered audio signal obtained by expanding respective access units of the video stream and the audio stream are to be presented at the decoder output. Such timing information is necessary to prevent audio/video synchronization errors from occurring if the decoder is unable to decode lost or corrupted audio or video access units. This will be described in more detail below.
FIG. 3 shows the structure of the decoder 600. In the decoder 600, the demultiplexer 601 receives the multiplexed bit stream from the medium 5. The demultiplexer demultiplexes the multiplexed bit stream into the video stream, the video time stamps, the audio stream, and the audio time stamps. The video time stamps and the audio time stamps are respectively fed to the picture rate control circuit 698 and the sampling rate control circuit 699 for use in decoding the video stream and the audio stream, respectively. The video stream from the output of the demultiplexer 601 is fed into the video input buffer 602, which precedes the video decoder 605. The audio stream from the demultiplexer is fed into the audio input buffer 603, which precedes the audio decoder 606.
The video decoder 605 removes each access unit of the video stream from the video input buffer 602 for decoding in the order in which the access unit was received by the video input buffer. The video decoder 605 decodes the video stream removed from the video input buffer 602 in response to timing signals received from the picture rate control circuit 698. The picture rate control circuit is, in turn, controlled by the video time stamps fed from the demultiplexer 601. Similarly, the audio decoder 606 removes each access unit of the audio stream from the audio input buffer 603 for decoding in the order in which the access unit was received by the audio input buffer. The audio decoder 606 decodes the audio stream removed from the audio input buffer 603 in response to timing signals received from the sampling rate control circuit 699. The sampling rate control circuit is, in turn, controlled by the audio time stamps fed from the demultiplexer 601.
The video input buffer 602 and the audio input buffer 603 will be described in detail next. The elementary streams entering the decoders must be buffered for the following reasons. The first reason is that, as mentioned above, the compression ratios constantly change. The second reason is that the average transfer rate of each elementary stream from the medium 5 differs from the average input rate of that elementary stream to its respective decoder, depending on clock error. The third reason is that the decoders normally receive access units of their respective streams intermittently, so that the instantaneous transfer rate of each elementary stream in the multiplexed bit stream S5 from the medium 5 and the instantaneous input rate of that elementary stream to its respective decoder do not match. Therefore, the input buffers 602 and 603 are provided between the demultiplexer 601 and the video decoder 605 and the audio decoder 606, respectively, to adjust the differences in the average transfer rate and the average input rate, and in the instantaneous transfer rate and the instantaneous input rate.
FIGS. 4B-4D are bit index curves showing the time dependency of the transfer of the audio stream in the multiplexed signal from the medium 5 into the audio input buffer 603 and the input of the audio stream into the audio decoder 606 from the audio input buffer. The arrangement of the audio input buffer 603 and the audio decoder 606 is shown in FIG. 4A.
The bit index curves show the relationship between the total number of bits (shown on the y-axis) that pass a given point in the circuit at the time indicated on the x-axis.
FIG. 4B shows the average bit index at the point I.sub.A at the input of the audio input buffer 603, which reflects the average rate at which the audio stream is transferred from the medium. The curve shows that the average transfer rate of the audio stream from the medium is more or less constant. However, the curve is not a straight line because the transfer rate varies with time due to clock drift.
FIG. 4C shows the actual bit index at the point I.sub.A at the input to the audio input buffer 603. No bits are fed into the audio input buffer at first, because the demultiplexer 601 is feeding the video stream into the video input buffer 602. Then, the demultiplexer 601 encounters the first audio packet in the multiplexed bit stream, and feeds the audio access units contained therein into the audio input buffer 603. Following the first audio packet, the demultiplexer ceases transfer of the audio stream into the audio input buffer during the time it feeds the contents of the next video packet(s) into the video input buffer. Then, the demultiplexer encounters another audio packet in the multiplexed bit stream and feeds the audio access units contained therein into the audio input buffer. This process is repeated throughout the decoding process.
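The intermittent character of the input bit index can be illustrated with the following Python sketch, in which a demultiplexer routes each packet payload to one of two running totals. The function name input_bit_indices, the (kind, payload_bits) packet representation, and the payload sizes are hypothetical and are chosen only to show that one bit index stays flat while the other stream is being transferred.

```python
def input_bit_indices(packets):
    """Return the (video_index, audio_index) pair after each packet.

    packets is an iterable of (kind, payload_bits) tuples, where kind
    is "video" or "audio" (an illustrative representation, not the
    actual MPEG packet syntax).
    """
    totals = {"video": 0, "audio": 0}
    history = []
    for kind, payload_bits in packets:
        totals[kind] += payload_bits  # only one index advances per packet
        history.append((totals["video"], totals["audio"]))
    return history

# Hypothetical packet sequence: video, audio, video, video, audio.
packets = [("video", 16000), ("audio", 3072), ("video", 16000),
           ("video", 16000), ("audio", 3072)]
history = input_bit_indices(packets)
for video_index, audio_index in history:
    print(video_index, audio_index)
```

In this example the audio index is zero after the first packet, because the first packet is a video packet, and it remains unchanged across the two consecutive video packets that follow the first audio packet.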
FIG. 4D shows the bit index at the point O.sub.A at the output of the audio input buffer 603 as the audio stream is removed from the audio input buffer by the audio decoder 606. The audio decoder removes the audio stream from the audio input buffer one access unit at a time. Removal of the access unit takes place instantaneously, once every 24 ms, for example.
When each picture of the video signal is compressed and subject to variable length coding in the encoder 100, the amount of video stream produced changes significantly from picture-to-picture, depending on the mode in which the video signal of the picture was compressed, as described above. Accordingly, the input rate at which the video decoder 605 removes the video stream from the video input buffer 602 also changes significantly from picture to picture. As a result, the storage capacity of the video input buffer 602 is required to be considerably larger than the storage capacity of the audio input buffer 603. For example, the MPEG-1 standard requires that the size, i.e., the storage capacity, of the video input buffer 602 be 46 k bytes, whereas the standard sets the size of the audio input buffer at only 4 k bytes.
FIGS. 5A-5D include three bit index curves showing the time dependency of the transfer of the video stream in the multiplexed signal from the medium 5 into the video input buffer 602 and the input of the video stream into the video decoder 605 from the video input buffer. The arrangement of the video input buffer 602 and the video decoder 605 is shown in FIG. 5A.
FIG. 5B shows the average bit index at the point I.sub.V at the input of the video input buffer 602, which reflects the average rate at which the video stream is transferred from the medium. The curve shows that the average transfer rate of the video stream from the medium is more or less constant. However, the curve is not a straight line because the transfer rate varies gradually with time due to clock drift.
FIG. 5C shows the actual bit index at the point I.sub.V at the input to the video input buffer 602. The video stream is first fed into the video input buffer at a substantially constant rate until the demultiplexer 601 encounters the first audio packet in the multiplexed bit stream. The demultiplexer interrupts feeding the video stream into the video input buffer while it feeds the contents of the audio packet into the audio input buffer 603. During this interruption, the bit index remains unchanged. At the end of the first audio packet, the demultiplexer demultiplexes the video packet header of the following video packet, and then resumes transferring the video stream into the video input buffer until it encounters another audio packet in the multiplexed bit stream. This process is repeated throughout the decoding process.
FIG. 5D shows the bit index at the point O.sub.V at the output of the video input buffer 602 as the video stream is removed from the video input buffer by the video decoder 605. The video decoder removes the video stream from the video input buffer one access unit, i.e., one picture, at a time. Removal of the access unit takes place instantaneously, once every picture period, e.g., once every 33.4 ms in an NTSC system. The amount of the video stream removed each time depends on the mode in which the picture was compressed by the encoder. FIG. 5D shows an example in which a sequence of B-pictures is followed by an I-picture, which is followed by a sequence of B-pictures. It can be seen that a much greater amount of video stream is removed from the video input buffer for one I-picture than for one B-picture.
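The stepped output curve can be reproduced numerically with the Python sketch below. It uses the typical picture sizes quoted earlier (150 kb, 75 kb and 5 kb, with 1 kb = 1024 bits) and assumes an NTSC picture period of 1/29.97 s; the function name output_bit_index and the picture sequence "BBIBB" are hypothetical, and real pictures will deviate from the typical sizes.

```python
from itertools import accumulate

KBIT = 1024  # the text defines 1 kb as 1024 bits

# Typical sizes quoted in the text; actual pictures vary.
PICTURE_BITS = {"I": 150 * KBIT, "P": 75 * KBIT, "B": 5 * KBIT}

def output_bit_index(picture_types, picture_period):
    """Return (time, cumulative_bits) pairs, one per removed picture.

    Picture n is assumed to be removed instantaneously at time
    n * picture_period.
    """
    sizes = [PICTURE_BITS[t] for t in picture_types]
    return [(n * picture_period, total)
            for n, total in enumerate(accumulate(sizes))]

curve = output_bit_index("BBIBB", 1 / 29.97)

# Step heights of the curve: the I-picture step is 30 times a B-picture step.
steps = [curve[0][1]] + [b[1] - a[1] for a, b in zip(curve, curve[1:])]
print(steps)  # prints [5120, 5120, 153600, 5120, 5120]
```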
FIGS. 6A and 6B show the buffering provided by the video input buffer 602 or the audio input buffer 603. In these Figures, the video input buffer 602 is used as an example. The Figures are both bit index curves. FIG. 6A shows ideal buffering, in which the video input buffer 602 is used simply to accommodate the differences between the transfer rate of the video stream from the medium and the input rate of the video stream to the video decoder 605. The video stream is fed into the video input buffer 602 from the demultiplexer 601 at a substantially constant transfer rate, as indicated by the straight line marked IS in FIG. 6A. The video decoder removes the video stream from the video input buffer one access unit, i.e., one picture, at a time, as shown. The amount of video stream removed for any one picture can vary from about 150 kbits for an I-picture to about 5 kbits for a B-picture. Thus, the video stream bit index at the output of the video input buffer changes in steps, the step size of which depends on the number of bits used to encode each picture, as indicated by the stepped curve marked OS.
In the ideal buffering illustrated in FIG. 6A, both of the following conditions are met at all times:
(a) the difference between the amount of the video stream transferred into the video input buffer 602 from the medium and the storage capacity of the video input buffer 602 (indicated by the broken line SC) does not exceed the amount of the video stream removed from the video input buffer by the video decoder, i.e., there is no overflow; and
(b) the amount of the video stream removed from the video input buffer 602 by the video decoder 605 does not exceed the amount of the video stream transferred into the video input buffer from the medium, i.e., there is no underflow.
In the medium slave method of FIG. 7A, described below, the two conditions to be satisfied are:
(a) the difference between the amount of the video stream (indicated by the curve L.sub.1') transferred into the video input buffer 602 from the medium and the storage capacity of the video input buffer does not exceed the amount of the video stream (indicated by the curve L.sub.3) removed from the video input buffer by the video decoder 605, i.e., there is no overflow; and
(b) the amount of the video stream (indicated by the curve L.sub.3) removed by the video decoder 605 from the video input buffer 602 does not exceed the amount of the video stream (indicated by the curve L.sub.1') transferred into the video input buffer 602, i.e., there is no underflow.
In the decoder slave method of FIG. 7B, described below, the two conditions to be met are:
(a) the amount of the video stream (indicated by the curve L.sub.2), which is the difference between the amount of the video stream (indicated by the curve L.sub.1) fed into the video input buffer 602 and the storage capacity of the video input buffer, does not exceed the amount of the video stream (indicated by the curve L.sub.3') removed from the video input buffer by the video decoder 605, i.e., there is no overflow; and
(b) the amount of the video stream (indicated by the curve L.sub.3') removed from the video input buffer by the video decoder does not exceed the amount of the video stream (indicated by the curve L.sub.1) transferred into the video input buffer 602 from the medium, i.e., there is no underflow.
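Conditions (a) and (b) can be checked mechanically against a pair of bit index curves. The following Python sketch does so for curves given as lists of discrete events. The function name check_buffer, the event representation, the interpretation of 46 k bytes as 46 x 1024 bytes, and the example figures are assumptions made for illustration only.

```python
def check_buffer(arrivals, removals, capacity):
    """Scan time-ordered events and report the first violation.

    arrivals and removals are lists of (time, bits) tuples. Returns
    ("overflow", time), ("underflow", time) or ("ok", None).
    """
    events = [(t, 0, bits) for t, bits in arrivals]
    events += [(t, 1, bits) for t, bits in removals]
    transferred = removed = 0
    # Sorting on (time, flag) handles arrivals before removals at equal times.
    for t, is_removal, bits in sorted(events):
        if is_removal:
            removed += bits
            if removed > transferred:             # condition (b) violated
                return ("underflow", t)
        else:
            transferred += bits
            if transferred - capacity > removed:  # condition (a) violated
                return ("overflow", t)
    return ("ok", None)

CAPACITY = 46 * 1024 * 8  # 46 k bytes expressed in bits (assumed 1 k = 1024)

# Hypothetical input: 40,000 bits arrive at each of ten time steps.
arrivals = [(t, 40_000) for t in range(10)]

print(check_buffer(arrivals, [(5, 150_000), (9, 150_000)], CAPACITY))  # ('ok', None)
print(check_buffer(arrivals, [(2, 150_000)], CAPACITY))  # ('underflow', 2)
print(check_buffer(arrivals, [], CAPACITY))              # ('overflow', 9)
```

The second call removes a large picture before enough of the stream has arrived, and the third never removes anything, so that the accumulated stream eventually exceeds the storage capacity.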
However, as illustrated in FIG. 6B, an overflow or an underflow can sometimes occur in buffering. In FIG. 6B, the transfer rate at which the video stream is received from the medium 5 varies with time. The video stream is otherwise similar to that shown in FIG. 6A. Initially, the video input buffer 602 receives an excess amount of video stream compared with that required by the video decoder 605, with the result that the video input buffer overflows at the point indicated by the letter A. Later, the transfer rate of the video stream received by the video input buffer falls below the demand of the video decoder for the video stream, with the result that the video input buffer underflows at the point indicated by the letter B.
By controlling various ones of the parameters involved, input buffer overflow or underflow can be prevented. Some ways of preventing overflow or underflow are illustrated in the bit index curves shown in FIGS. 7A through 7C.
The first method, illustrated in FIG. 7A, is called the medium slave method. In this method, the amount of the video stream transferred from the medium 5 to the video input buffer 602 is controlled to prevent an overflow or underflow from occurring. Without such control, the transfer rate is indicated by the curve L.sub.1. With control, the transfer rate is that indicated by the curve L.sub.1'. The amount of the video stream transferred from the medium is controlled so that the two conditions set forth above with reference to the curves L.sub.1' and L.sub.3 are satisfied.
The curve L.sub.2 shows how controlling the amount of the video stream transferred into the video input buffer 602 from the medium controls the difference between the amount of the video stream transferred into the video input buffer and the storage capacity of the video input buffer. The curve L.sub.2' shows this difference when the amount of the video stream transferred into the video input buffer from the medium is not controlled.
The second method, illustrated in FIG. 7B, is called the decoder slave method. In this method, the picture rate of the video decoder is controlled to change the amount of the video stream removed from the video input buffer by the video decoder. The picture rate is controlled such that the two conditions set forth above with reference to the curves L.sub.1 and L.sub.3' are both met.
The actual amounts of the video stream removed from the video input buffer by the video decoder are indicated by the curve L.sub.3'.
The above explanation is made with reference to the video stream, but similar results can be obtained for the audio stream by changing the sampling rate of the audio decoder 606 to adjust the rate at which the audio stream is removed from the audio input buffer 603.
The third method, illustrated in FIG. 7C, adjusts the amount of the video stream removed from the video input buffer 602 by the video decoder 605. For example, the method may cause the video decoder to skip decoding portions of the video stream or to repeat decoding portions of the video stream to adjust the amount of the video stream removed from the video input buffer.
The curve L.sub.3' shows the changes in the amount of the video stream removed from the video input buffer 602. To prevent an overflow from occurring early in the sequence, the amount of the video stream removed from the video input buffer is increased by removing some video access units from the video input buffer but not decoding them. Later, to prevent an underflow, the amount of the video stream removed from the input buffer is reduced by removing some video access units from the video input buffer and decoding them more than once. This provides additional pictures without removing video access units from the video input buffer.
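One way of picturing the skip and repeat decisions is the following Python sketch, which chooses an action at each picture period from the current buffer occupancy. The function name choose_action, the high_water threshold of 90% of capacity, and the example occupancies are all hypothetical; the text does not specify how a decoder decides when to skip or repeat.

```python
def choose_action(occupancy_bits, next_unit_bits, capacity_bits, high_water=0.9):
    """Pick what the decoder does at the next picture period.

    "repeat": the next access unit has not fully arrived, so the
              previous picture is output again and nothing is removed.
    "skip":   the buffer is nearly full, so an access unit is removed
              without being decoded to free space sooner.
    "decode": normal operation.

    high_water is an illustrative threshold, not a value taken from
    the MPEG standards.
    """
    if occupancy_bits < next_unit_bits:
        return "repeat"
    if occupancy_bits > high_water * capacity_bits:
        return "skip"
    return "decode"

CAPACITY_BITS = 46 * 1024 * 8  # 46 k bytes, assuming 1 k = 1024

print(choose_action(1_000, 5_120, CAPACITY_BITS))    # prints repeat
print(choose_action(350_000, 5_120, CAPACITY_BITS))  # prints skip
print(choose_action(100_000, 5_120, CAPACITY_BITS))  # prints decode
```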
Changing the picture rate of the video decoder, the sampling rate of the audio decoder, or the transfer rate of the multiplexed bit stream from the medium 5, as just described, causes undesirable side effects on the systems external to the video and audio signal processing system 110. Therefore, the changes just described cannot be made freely, and may only be made within a limited range. Consequently, it is desirable to control the multiplexed bit stream produced by the encoder so that the buffering requirements in the decoder can be met comfortably without having to resort to the correction methods just described.
Malfunctions in the buffering process are most likely to occur at the start of decoding. An underflow will result if the decoder attempts to remove an access unit of the stream from the input buffer before the whole of that access unit has been transferred into the input buffer from the medium. To prevent this, the decoding processing is started only after a certain delay time has elapsed after transfer of the bit stream from the medium has begun. This allows the audio stream and the video stream to accumulate in the respective audio and video input buffers before the respective decoders start removing units of the audio stream and the video stream for decoding.
FIGS. 8A through 8D show some effects of a startup delay on buffering. FIG. 8A shows ideal buffering, similar to that shown in FIG. 6A. FIG. 8B shows the beneficial effect of a suitable startup delay when the multiplexed bit stream is transferred from the medium at a varying transfer rate. In FIG. 8B, the startup delay allows additional video stream to accumulate in the video input buffer 602 before the video decoder 605 starts to remove access units of the video stream from the video input buffer.
Care must be exercised in determining the optimum startup delay. FIG. 8C shows the effect of an excessively long startup delay. In FIG. 8C, the video decoder 605 waits too long before it starts to remove the video stream from the video input buffer 602. As a result, an overflow occurs at point C. FIG. 8D shows the effect of a startup delay that is too short. The short startup delay does not allow sufficient video stream to accumulate in the video input buffer before the video decoder starts to remove the video stream from the video input buffer for decoding. As a result, insufficient video stream has accumulated in the video input buffer when the video decoder tries to remove the video stream of the first I-picture 12, and an underflow occurs at point D. FIG. 8D also shows that, with a suitable start-up delay, the video stream of the first I-picture I.sub.2 can be removed without causing an underflow.
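The range of acceptable startup delays can be estimated for a simplified case. The Python sketch below assumes that the video stream arrives at a constant rate from time zero and that picture k is removed instantaneously at time delay + k x period; it then derives the shortest delay that avoids underflow and the longest delay that avoids overflow. The function name startup_delay_bounds, the transfer rate of 1,150,000 bits/s, and the picture sequence are hypothetical, and a real multiplexed stream with a varying transfer rate would require a more detailed model.

```python
from itertools import accumulate

def startup_delay_bounds(picture_bits, rate, period, capacity):
    """Return (min_delay, max_delay) in seconds for a constant transfer rate.

    No underflow: rate * (delay + k * period) >= bits of pictures 0..k.
    No overflow:  just before picture k is removed,
                  rate * (delay + k * period) - bits of pictures 0..k-1
                  must not exceed capacity.
    """
    cumulative = list(accumulate(picture_bits))  # bits of pictures 0..k
    before = [0] + cumulative[:-1]               # bits removed before picture k
    min_delay = max(s / rate - k * period for k, s in enumerate(cumulative))
    max_delay = min((capacity + b) / rate - k * period
                    for k, b in enumerate(before))
    return max(min_delay, 0.0), max_delay

# Two B-pictures, one I-picture, two B-pictures (typical sizes from the text).
sizes = [5120, 5120, 153600, 5120, 5120]
min_delay, max_delay = startup_delay_bounds(
    sizes, rate=1_150_000, period=1 / 29.97, capacity=46 * 1024 * 8)
print(f"delay must lie between {min_delay:.3f} s and {max_delay:.3f} s")
```

In this example the lower bound is set by the I-picture, which must have arrived in full before it is removed, and the upper bound is set by the storage capacity of the video input buffer, which corresponds to the situations shown in FIGS. 8D and 8C, respectively.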
FIG. 9 illustrates in detail how the multiplexed bit stream transferred from the medium 5 is processed by the demultiplexer 601, the video input buffer 602, and the video decoder 605 to decode the video stream in the multiplexed bit stream. The circuit arrangement of the demultiplexer 601, the video input buffer 602, and the video decoder 605 is shown at the top of the drawing.
An example of a portion of the multiplexed bit stream is shown at the left side of the drawing. The portion of the multiplexed bit stream includes all of the pack n, and the beginning part of the pack n+1. Each pack begins with the pack header, which includes the clock reference SCR, which shows the decoding timing of the pack.
The pack n begins with the pack header (Pack Header n), and contains the video packet m, which, in turn, contains the video stream for the pictures i and i+1. The video packet m begins with the video packet header (V.Packet H), which includes the presentation time stamp PTSm and the decoding time stamp DTSm.
The pack n+1 follows the pack n, and includes the pack header (Pack Head n+1), which includes the clock reference SCRn+1. Following the pack header are the video packets m+1 and m+2, and possibly more video packets. Each of the video packets m+1 and m+2 includes a packet header including a decoding time stamp DTS, and the video stream of one picture.
FIG. 9 also shows the bit index curves for the input (marked I.sub.V) and the output (marked O.sub.V) of the video input buffer 602. Various events in the multiplexed bit stream are linked to the bit index curves with broken lines, and are also shown on the x-axis of the bit index curve. The bit index curve I.sub.V represents the bit index of the video stream transferred to the video input buffer 602 from the medium 5 via the demultiplexer 601. The bit index curve O.sub.V represents the bit index of the video stream removed from the video input buffer by the video decoder 605.
The multiplexed bit stream is processed as follows: at the timing indicated by the clock reference SCRn in the pack header of the pack n, the video stream contained in the pack n, i.e., the video stream of the pictures i and i+1, is transferred via the demultiplexer 601 to the video input buffer 602. Then, at the timing indicated by the clock reference SCRn+1 the video stream contained in the pack n+1 is transferred into the video input buffer 602 via the demultiplexer 601. The time stamps in the video packet headers are stored elsewhere.
Later, at the time indicated by the decoding time stamp DTSm in the header of the video packet m, the video stream of the picture i is instantaneously removed from the video input buffer 602 by the video decoder 605. Then, one picture period later, the video stream of the picture i+1, which was also included in the video packet m, is removed from the video input buffer by the video decoder. Later, at the timing indicated by the decoding time stamp DTSm+1 included in the packet header of the video packet m+1, the video stream of the picture i+2, which is the first picture beginning in the video packet m+1, is removed from the video input buffer 602 by the video decoder 605.
At the time indicated by the decoding time stamp DTSm+2 in the packet header of the video packet m+2, the video stream of the picture i+3, which is the first picture beginning in the video packet m+2, is removed from the video input buffer 602 by the video decoder 605. Following removal of the video stream of the picture i+3, the video streams of the pictures that follow the picture i+3 in the video packet m+2 are removed from the video input buffer 602 at times that are increments of one picture period later than the time indicated by the decoding time stamp DTSm+2.
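The removal schedule described for FIG. 9 can be sketched as follows: the first picture beginning in a video packet is removed at the time given by that packet's decoding time stamp, and each further picture in the same packet is removed one picture period later than the one before it. The function name `removal_times`, the representation of a packet as a `(dts, n_pictures)` pair, the time stamp values, and the picture period of 3003 ticks (a 29.97 Hz picture rate counted on a 90 kHz clock) are assumptions made for illustration.

```python
PICTURE_PERIOD = 3003  # 90 kHz ticks per picture at 29.97 Hz (assumed rate)


def removal_times(packets, picture_period=PICTURE_PERIOD):
    """Return the removal time of every picture, in stream order.

    packets: list of (dts, n_pictures) pairs, where dts is the decoding
    time stamp in the packet header and n_pictures is the number of
    pictures that begin in that packet.
    """
    times = []
    for dts, n_pictures in packets:
        for k in range(n_pictures):
            # Only the first picture carries an explicit time stamp; the
            # rest are extrapolated in steps of one picture period.
            times.append(dts + k * picture_period)
    return times


# Hypothetical values: packet m holds pictures i and i+1,
# packets m+1 and m+2 hold one picture each.
schedule = removal_times([(9000, 2), (15006, 1), (18009, 1)])
# schedule == [9000, 12003, 15006, 18009]
```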
The timings indicated by the time stamps may be stored as absolute timings using, for example, a crystal oscillator and a reference clock of 90 kHz. In this way it is possible to use the difference between the clock reference and the time stamps as the start-up delay.
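As a rough illustration of this calculation, the startup delay in seconds is the difference between the first decoding time stamp and the clock reference, divided by the 90 kHz clock rate. The function name `startup_delay_seconds` and the sample values are hypothetical. The modulo step assumes 33-bit time stamp fields that wrap around, which is not stated in the text above.

```python
SYSTEM_CLOCK_HZ = 90_000   # reference clock rate given in the text
TIME_STAMP_BITS = 33       # assumed field width; time stamps wrap at 2**33


def startup_delay_seconds(scr, first_dts, clock_hz=SYSTEM_CLOCK_HZ):
    """Startup delay implied by a clock reference and the first DTS."""
    # Modular subtraction keeps the result correct if the counter wraps
    # between the clock reference and the decoding time stamp.
    ticks = (first_dts - scr) % (1 << TIME_STAMP_BITS)
    return ticks / clock_hz


delay = startup_delay_seconds(scr=0, first_dts=45_000)  # 0.5 s
```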
As mentioned above, when a decoder according to the MPEG standard is used for decoding an audio stream and a video stream, it is necessary to synchronize the times at which units of the respective decoded signals resulting from decoding corresponding access units of the audio stream and the video stream are fed to the decoder output. The time at which a decoded signal unit is fed to the decoder output is called the presentation time of that unit. The time stamps in the multiplexed bit stream are used to provide this synchronization.
Part of providing the necessary synchronization includes reordering the video signal resulting from decoding the video stream. This is illustrated in FIG. 10. As mentioned above, the video stream includes the video streams of pictures that are compressed as I-pictures, as P-pictures, and as B-pictures. Of these pictures, the decoding time and the presentation time are only the same for B-pictures. Incidentally, the decoding time and the presentation time are also the same for the audio stream. I-pictures and P-pictures have a presentation time that is later by a number of picture periods than the decoding time. The video decoder 605 removes the video stream of an I-picture or a P-picture from the video input buffer 602 at the time indicated by the decoding time stamp DTS. After the video stream of a picture has been decoded, the resulting decoded video signal is temporarily stored in the video decoder output buffer 611. Then, at the time indicated by a presentation time stamp PTS, the video signal of the picture is fed from the video decoder output buffer to the output of the video decoder 605 to provide a picture of the video output signal.
For example, in FIG. 10, the video stream of the I-picture I.sub.2 is removed from the video input buffer 602 at the time indicated by the decoding time stamp DTSm for decoding, and the resulting video signal is stored in the output buffer 611 provided in the video decoder 605 for temporarily storing the video signals of decoded I-pictures and P-pictures.
Then, the video decoder 605 consecutively removes the video streams of the B-pictures B.sub.0 and B.sub.1 from the video input buffer 602, consecutively decodes them, and feeds the resulting video signals to its output one picture period apart.
Next, the video decoder 605 removes the video stream of the P-picture P.sub.5 from the video input buffer 602. The video decoder instantaneously decodes the video stream, and stores the resulting video signal in the output buffer 611. Also, at the time indicated by the presentation time stamp PTS of the I-picture I.sub.2, which has the same value as the decoding time stamp of the P-picture P.sub.5, the video decoder feeds the video signal of the picture I.sub.2 to its output.
Finally, in this example, the video decoder 605 consecutively removes the video streams of the B-pictures B.sub.3 and B.sub.4 from the video input buffer 602, consecutively decodes them using the stored pictures I.sub.2 and P.sub.5 as reference pictures, and feeds the resulting video signals to its output one picture period apart.
Since the video streams of I-pictures and P-pictures differ in their decoding timing and their presentation timing, a presentation time stamp and a decoding time stamp, respectively indicating the presentation time and the decoding time, are included in the video packet headers of the video packets in which the video streams of I-pictures or P-pictures begin. However, both types of time stamps need not be included, because, according to the MPEG decoding rules, the presentation time of each I-picture or P-picture is the same as the decoding time of the following I-picture or P-picture. In other words, the decoding time stamps can be omitted, and each I-picture or P-picture can be decoded at the time indicated by the presentation time stamp of the previous I-picture or P-picture.
FIG. 10 also shows the consequence of the differing decoding and presentation times of the MPEG video signal. It can be seen from the bit index curve that the video decoder removes the video streams of the pictures from the video input buffer in the order in which they were transferred into the input buffer from the medium 5, i.e., in non-sequential picture order. However, the presentation time stamps of the pictures cause the pictures to be displayed in their sequential order shown at the bottom of the Figure.
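The reordering shown in FIG. 10 can be sketched in a few lines. The sketch models only the order of presentation, not the timing or the decoding itself: a decoded I-picture or P-picture is held back until the next I-picture or P-picture is decoded, while a B-picture is presented as soon as it is decoded. The function name `presentation_order` and the use of strings such as "I2" to denote pictures are assumptions made for illustration.

```python
def presentation_order(decode_order):
    """Convert decoding order into presentation order.

    decode_order: picture labels such as "I2", "B0", "P5", whose first
    character gives the picture type.
    """
    held = None   # the I- or P-picture waiting in the output buffer
    shown = []
    for picture in decode_order:
        if picture[0] in "IP":
            # Decoding a new reference picture releases the previous one,
            # since its presentation time equals this decoding time.
            if held is not None:
                shown.append(held)
            held = picture
        else:
            # B-pictures are presented at their decoding time.
            shown.append(picture)
    if held is not None:
        shown.append(held)  # flush the last reference picture
    return shown


order = presentation_order(["I2", "B0", "B1", "P5", "B3", "B4"])
# order == ["B0", "B1", "I2", "B3", "B4", "P5"]
```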
As stated above, the time stamps are included in the multiplex layer of the multiplexed bit stream, and not in the audio or video stream layer. This means that when the multiplexed bit stream is demultiplexed in the decoder, the correlation between the time stamps and the access units to which they pertain is lost. The decoder must therefore include a provision to link the time stamps extracted from the multiplexed bit stream with their respective access units. One approach is shown in FIGS. 11A and 11B.
In FIG. 11A, the decoder 600 includes the demultiplexer 601, which receives the multiplexed bit stream from the medium 5. The demultiplexer demultiplexes the video stream and the video time stamps from the multiplexed bit stream and feeds these into the video stream reconfiguration unit 692. The demultiplexer also demultiplexes the audio stream and the audio time stamps from the multiplexed bit stream and feeds these into the audio stream reconfiguration unit 693. The output of the video stream reconfiguration unit is fed into the video input buffer 602, which precedes the video decoder 605. The decoding in the video decoder is controlled by the picture rate control circuit 698 in response to the video time stamps. The output of the audio stream reconfiguration unit 693 is fed into the audio input buffer 603, which precedes the audio decoder 606. Decoding in the audio decoder is controlled by the sampling rate control circuit 699 in response to the audio time stamps.
The demultiplexer 601 receives the multiplexed bit stream S5 from the medium 5 and separates it into the video stream, the video time stamps, the audio stream, and the audio time stamps. The video stream and the video time stamps are fed into the video stream reconfiguration unit 692, which inserts the video time stamps into the video stream. For example, a video time stamp is inserted between the picture i and the picture i+1 shown in FIG. 11B. The video stream, reconfigured as shown in FIG. 11B, is fed to the video input buffer 602, where it is temporarily stored. The video decoder 605 removes the video stream, including the video time stamps, from the video input buffer 602 in the order in which it was received by the video input buffer.
In a similar manner, the audio stream reconfiguration unit 693 receives the audio stream and the audio time stamps from the demultiplexer 601 and inserts the audio time stamps into the audio stream. For example, an audio time stamp is inserted between the access unit j and the access unit j+1 of the audio stream shown in FIG. 11B. The audio stream, reconfigured as shown in FIG. 11B, is then fed from the audio stream reconfiguration unit to the audio input buffer 603, where it is temporarily stored. The audio decoder 606 removes the audio stream, including the audio time stamps, from the audio input buffer in the order in which it was received by the audio input buffer.
The video decoder 605 decodes the video stream removed from the video input buffer 602 in response to timing signals received from the picture rate control circuit 698. The picture rate control circuit is, in turn, controlled by the video time stamps fed from the video decoder. Similarly, the audio decoder 606 decodes the audio stream removed from the audio input buffer 603 in response to timing signals received from the sampling rate control circuit 699. The sampling rate control circuit is, in turn, controlled by the audio time stamps fed from the audio decoder.
The decoder just described solves the problem of correlating the time stamps included in the multiplex layer with the video and audio access units to which they belong. However, embedding the time stamps into the audio and video streams results in streams that are no longer standard. A decoder that is suitable for decoding, for example, a video stream with embedded time stamps would be unsuitable for decoding a video stream in an application in which time stamps are not used. It is therefore preferable to correlate the time stamps with the access units to which they belong in a way that does not result in a non-standard stream and a non-standard decoder.
Recently, the MPEG standards have permitted packets of information other than an audio stream or a video stream to be included in the multiplexed bit stream. For example, packets of directory information may be added to the bit stream. Directory information allows pictures to be displayed during fast forward operations by providing the address of successive access points in the multiplexed bit stream. An access point is an access unit that can be decoded without requiring that another access unit be decoded. For example, a video access point is a picture that is wholly or partially coded using intra-picture coding. An access point is normally located at the beginning of each Group of Pictures.
The MPEG standards stipulate that the packets containing directory information (directory packets) be interleaved with the audio packets and the video packets in the multiplexed bit stream, and also stipulate that a directory information buffer be provided in the decoder. However, the MPEG standards define neither the size nor the operation of the directory buffer. Because of the memory constraints in processors used in MPEG decoders, decoder designers allocate relatively little memory for buffering the directory information. Moreover, encoder designers have customarily made the directory packets relatively large, so that the directory packets occur relatively rarely in the multiplexed bit stream.
The impact of the present relationship between the directory buffer size and the size and spacing of the directory packets on the fast-forward operation of a video tape recorder is shown in FIGS. 12A-12E. FIG. 12A shows the arrangement of part of the multiplexed bit stream as recorded on the video tape. The directory packet consists of the directory packet header (Dir.Pkt.Hdr), followed by a set of directory entries, one directory entry for each one of the following Groups of Pictures. Following the directory packet are plural video packets containing the video stream of the Groups of Pictures. Since, in this example, there are 20 Groups of Pictures following the directory packet, the directory packet includes 20 directory entries. In these Figures, the audio packets interleaved with the video packets have been omitted to simplify the drawing.
During the fast-forward operation, the directory packet header is recognized, and the contents of the directory packet are read from the tape, and transferred into the directory buffer, as shown in FIG. 12B. However, since the directory buffer typically has a capacity of about 500 bits, and each directory entry typically requires about 100 bits, the directory buffer overflows after the first five directory entries have been stored.
After the contents of the directory packet have been reproduced from the tape, the address of the beginning of the first Group of Pictures (GOP 0) is read from the directory buffer, and the tape is advanced to this address to enable the access point at the beginning of the first Group of Pictures to be reproduced from the tape, as shown in FIG. 12C. While this picture is being decoded for display, the address of the beginning of the second Group of Pictures (GOP 1) is read from the directory buffer, and the tape is advanced to this address to enable the access point, e.g., an I-picture, at the beginning of the second Group of Pictures to be reproduced from the tape, also as shown in FIG. 12C. This process is repeated, as shown in FIG. 12C, up to the fifth Group of Pictures (GOP 4), after which the contents of the directory buffer are exhausted.
Then, the tape has to be rewound back to the directory packet to reproduce the next five of the directory entries. These directory entries are stored in the directory buffer, as shown in FIG. 12D. The tape recorder then uses these five new directory entries to fast forward through the pictures at the beginnings of the sixth through tenth Groups of Pictures (GOPs 5-9), as shown in FIG. 12E. In all, the directory packet must be reproduced four times for the pictures at the beginning of each of the twenty Groups of Pictures GOP 0-GOP 19 to be reproduced.
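The arithmetic behind this example can be checked with a short sketch. The function name `directory_passes` is hypothetical; the figures of 20 directory entries, about 100 bits per entry, and a directory buffer of about 500 bits are the ones given in the example above.

```python
def directory_passes(n_entries, entry_bits, buffer_bits):
    """Number of times the directory packet must be reproduced."""
    entries_per_pass = buffer_bits // entry_bits
    if entries_per_pass == 0:
        raise ValueError("directory buffer cannot hold even one entry")
    # Ceiling division: a partly filled final pass still costs one pass.
    return -(-n_entries // entries_per_pass)


passes = directory_passes(n_entries=20, entry_bits=100, buffer_bits=500)
rewinds = passes - 1  # every pass after the first needs a rewind
# passes == 4, rewinds == 3
```

Under the same assumptions, the buffer would have to hold at least 2000 bits for a single pass to suffice for this particular packet.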
The mismatch between the directory buffer capacity, and the size and spacing of the directory packets makes the fast forward operation an extremely slow one if pictures are to be reproduced during the fast-forward operation, something that is routine during the fast forward operation in an analog video tape recorder.
Using a larger directory buffer is not a complete solution to the problem just described (although a larger buffer may reduce the seriousness of the problem) because the MPEG standards do not define the size and operation of the directory packet. Hence, no matter how large the directory buffer is made, the possibility of a directory packet larger than the directory buffer always exists.
As an alternative to embedding time stamps in the audio and video streams following demultiplexing, it has been proposed to provide time stamp buffers to store the time stamps until they are needed. Separate buffers may be provided for the time stamps relating to audio access units and for the time stamps relating to video access units. Again, the MPEG standards include no direct specification for the size and operation of these buffers. However, the current MPEG standards require that the system target decoder have a maximum buffering delay of one second for both audio and video. This means that the time stamps need only be buffered for a maximum of one second, which enables the maximum size of the time stamp buffers to be calculated. If a time stamp is provided for each picture in the video stream, a buffer capacity of 30 time stamps must be provided for the video time stamps. Similarly, if a time stamp is provided for each audio access unit, a buffer capacity of 115 time stamps must be provided for the audio time stamps.
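The buffer capacities quoted above follow from multiplying the rate at which access units occur by the maximum buffering delay. The sketch below reproduces them. The function name `time_stamp_buffer_entries` is hypothetical, and the audio parameters (384 samples per access unit at a sampling rate of 44.1 kHz) are an assumption chosen because they yield the figure of 115; the text does not state which audio parameters were used.

```python
import math


def time_stamp_buffer_entries(units_per_second, max_delay_s=1.0):
    """Time stamps that can be pending within the maximum buffer delay."""
    return math.ceil(units_per_second * max_delay_s)


# Video: 30 pictures per second, one time stamp per picture.
video_entries = time_stamp_buffer_entries(30)

# Audio: assumed 384-sample access units at 44.1 kHz (about 114.8 per second).
audio_entries = time_stamp_buffer_entries(44_100 / 384)
# video_entries == 30, audio_entries == 115
```

The same function shows how the required capacity scales with the delay: at the same unit rates, a five-second buffering delay would call for five times as many entries.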
In the manner just described, the MPEG standards indirectly impose a maximum size on the audio and video time stamp buffers. However, this way of setting the maximum size of the time stamp buffers has an undesirable side effect, namely, it makes the MPEG standards unsuitable for use in applications in which a longer buffer delay is necessary. For example, the low picture-rate, low bit-rate video signal shown in FIG. 13, although otherwise capable of being multiplexed according to an MPEG-standard bit rate, cannot be multiplexed according to the MPEG standard because it requires a decoder buffer delay of about 5 seconds.
Since the MPEG standards are meant to be used in many applications, it is desirable to eliminate the maximum delay requirement defined by the MPEG standard and to establish instead a more rational way of defining the time stamp buffer sizes.