The present invention relates to a method and a system for synchronizing audio and video signals in general and to a method and a system for synchronizing MPEG video and audio streams in particular.
Methods and systems for providing synchronized audio and video streams are known in the art. For example, MPEG specifications ISO/IEC 11172-1,2,3 (MPEG1) and the ISO/IEC 13818-1,2,3 (MPEG2) describe a method of encoding and decoding analog audio and video.
The encoding process consists of three stages. The first stage is digitizing the analog audio/video signals. The second stage is compression of the digital signals to create elementary streams. The third stage is multiplexing the elementary streams into a single stream.
The decoding process consists of inverting each of these stages and applying the inverted stages in the reverse order. Reference is now made to FIG. 1, which is a schematic illustration of an encoding and decoding system, generally referenced 10, known in the art.
System 10 includes an encoding device 20 and a decoding device 40. The encoding device 20 includes an audio encoder 12, a video encoder 22 and a multiplexor 18. The audio encoder 12 includes an audio analog to digital converter (A/D) 14 and an audio compressor 16. The video encoder 22 includes a video A/D 24 and a video compressor 26. The audio compressor 16 is connected to the audio A/D 14 and to the multiplexor 18. The video compressor 26 is connected to the video A/D 24 and to the multiplexor 18. An A/D converter is also known as a digitizer.
The decoding device 40 includes an audio decoder 32, a video decoder 42 and a de-multiplexor 38. The audio decoder 32 includes an audio digital to analog converter (D/A) 34 and an audio decompressor 36. The video decoder 42 includes a video D/A 44 and a video decompressor 46. The audio decompressor 36 is connected to the audio D/A 34 and to the de-multiplexor 38. The video decompressor 46 is connected to the video D/A 44 and to the de-multiplexor 38.
Each of the A/D converters 14 and 24 is driven by an independent sampling clock. The origin of this clock differs in audio and video encoders 12 and 22. Each of the respective compressors 16 and 26 is affected by the sampling clock of the A/D converter connected thereto.
Analog audio is a continuous, one-dimensional function of time. Digitization of analog audio amounts to temporal sampling and the quantization of each sampled value. It will be appreciated by those skilled in the art that the audio digitizer clock is not derived from the analog source signal.
Analog video is a two dimensional function of time, temporally sampled to give frames (or fields) and spatially sampled to give lines. The broadcasting standard of the analog video source signal (e.g. PAL, NTSC), defines the number of frames/fields per second and the number of lines in each frame/field.
Analog video is therefore a discrete collection of lines which are, like analog audio signals, one-dimensional functions of time. Timing information is modulated into the analog video signal to mark the start of fields/frames and the start of lines. The timing of the pixel samples within each line is left to the digitizer, but the digitizer must begin sampling lines at the times indicated by the signal.
Video digitizers typically feed analog timing information into a phase locked loop to filter out noise on the video signal and divide the clock accordingly to derive the pixel clock for digitizing each line. Thus the timing of video sampling is derived from the analog source signal. In the case of video, digitization refers only to the quantization of pixels and CCIR 601 is an example of a video digitizing standard that describes such a process.
The input of a video or audio compression module, such as compressors 16 and 26, is samples or sets of samples. The output is a compressed bit-stream.
As the compressor consumes the samples produced by its respective digitizer, its timing is slaved to that digitizer. In a simple model of the system, the compressor has no clock of its own. Instead, it uses the bit-rate specification to calculate the number of bits required per sample or set of samples. As samples appear at the input of the encoder, they are compressed and the compressed bits appear at the output.
It will be appreciated by those skilled in the art that the actual timing of audio or video compressed bit emission by an encoder is determined by the digitizer clock which times the arrival of samples at the compressor input.
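The bit budget described in the simple model above can be sketched as follows. This is only an illustration: the function name and the bit-rates, sampling rate and audio frame size used in the example are assumptions chosen for the arithmetic and do not come from the system described.

```python
from fractions import Fraction


def bits_per_unit(bit_rate_bps: int, units_per_second: Fraction) -> Fraction:
    """Average number of compressed bits allotted to each input unit.

    A unit is a sample or a set of samples (for example one video frame).
    The compressor has no clock of its own: it emits this many bits, on
    average, each time the digitizer delivers one unit.
    """
    return Fraction(bit_rate_bps) / units_per_second


# Assumed figures: 4 Mbit/s video at the PAL rate of 25 frames per second.
video_budget = bits_per_unit(4_000_000, Fraction(25))

# Assumed figures: 192 kbit/s audio, 48 kHz sampling, 1152 samples per frame.
audio_budget = bits_per_unit(192_000, Fraction(48_000, 1152))

print(video_budget, audio_budget)
```

Because the budget is tied to units rather than to time, the rate at which bits actually leave the compressor follows the digitizer clock, as noted above.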
The timing of the video digitizer 24 is derived from the video analog source and the video compressor 26 derives its own timing from the digitizer 24. Thus the timing of the video compressor is derived from the analog video source. If the timing information in the analog source is missing or incomplete, then the compressor 26 will be subject to abnormal timing constraints.
The following are examples of problematic video input sources:
The analog source is not a professional one (cheap VCR).
Noise is present on the line that carries the video signal.
The source is detached from the input for some time.
The video source is a VCR without a TBC (Time Base Corrector) and fast forward or rewind are applied.
The effects of problematic video input sources on the compressed stream depend on the nature of the problem and the implementation of the encoder.
Among the timing information present in the analog video signal are pulses that indicate the start of a field, the start of a frame and the start of a line.
If, for instance, noise is interpreted by the digitizer as a spurious pulse marking the start of a field, such that the pulse is not followed by a complete set of lines, then the timing information will become inconsistent.
One encoder might interpret the pulse as an extra field, somehow producing a complete compressed field. Another encoder might react to the glitch by discarding the field it was encoding. In both these cases, the ratio between the number of bits in the stream and compressed frames in the stream may be correct, but one encoder will have produced more frames than the other within the same interval and from the same source.
To an observer at the output of the encoders, this would appear to be caused by a variance between the clocks that drive the video encoders.
As will be appreciated by those skilled in the art, each video and audio encoder may be driven by its own clock. Decoders may also be driven by independent clocks.
As an example of the operation of system 10, consider the case in which the video encoder and the audio encoder are fed from the same PAL source (analog video combined with analog audio).
The number of frames that are compressed within a given time interval can be calculated by multiplying the interval measured in seconds by twenty five (according to the PAL broadcasting standard).
In this example, the clocks of the video and audio decoders and the clock of the audio encoder have identical timing. The clock of the video encoder is running slightly fast with respect to the others.
Thus, within a given interval measured by the video decoder clock, the video encoder will produce more frames than the number calculated from the duration of that interval. The video decoder will play the compressed stream at a slower rate than the rate at which the video encoder produces that stream. The result will be that over any given interval, the video display will be slightly delayed.
As the timing of the audio encoder and audio decoder are identical, audio decoding will progress at the same rate as audio encoding. The result will be a loss of audio video synchronization at the decoder display.
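The accumulation of delay in this example can be estimated with a short calculation. The sketch below assumes a simplified model in which the video encoder clock runs at a fixed ratio to the other three clocks and every surplus frame waits in a queue at the decoder; the function name and the ratio used in the example are hypothetical.

```python
PAL_FRAME_RATE = 25  # frames per second, per the PAL broadcasting standard


def video_lag_seconds(interval_s: float, encoder_clock_ratio: float) -> float:
    """Lag of the video display behind the audio after interval_s seconds.

    encoder_clock_ratio is the video encoder clock rate divided by the rate
    shared by the other three clocks (a value above 1 means it runs fast).
    """
    # Frames produced by the fast encoder during the interval.
    frames_produced = interval_s * PAL_FRAME_RATE * encoder_clock_ratio
    # Frames the decoder can display during the same interval.
    frames_played = interval_s * PAL_FRAME_RATE
    # Each surplus frame waiting to be displayed adds one frame time of lag.
    return (frames_produced - frames_played) / PAL_FRAME_RATE


# Assumed example: encoder clock fast by 100 parts per million, one hour.
print(video_lag_seconds(3600, 1.0001))
```

Under these assumptions the lag grows linearly with time, so even a small clock deviation eventually becomes noticeable.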
It is a basic requirement to be able to guarantee that the decoded audio and video at the output of MPEG decoders are synchronized with each other despite the relative independence of the timings of the units in the system.
One of the methods known in the art to synchronize audio and video streams is called end-to-end synchronization. This means that the timing of each encoder determines the timing of its associated decoder. End-to-end synchronization is supported by the MPEG system layers. If this were applied to the example above, the video decoder would spend the same time displaying the video as the audio decoder spends decoding the audio. The audio and video would therefore play back synchronously.
The MPEG multiplexor implements the MPEG system layers. The system layer may use the syntax of MPEG1 System, MPEG2 Program or MPEG2 Transport. These layers support end-to-end synchronization by the embedding in the multiplexed stream of presentation time stamps (PTS fields), decoding time stamps (DTS fields) and either system clock references (SCR fields), in the case of MPEG1 System, or program clock references (PCR fields), in the case of MPEG2 Program or MPEG2 Transport. In the following, SCR will be used to refer to either SCR or PCR fields.
A conventional MPEG multiplexor operates under the following assumptions:
The deviations between all clocks in the system, free-running as they may be, are bounded due to constraints on clock tolerance and rate of change.
The channel between the encoder and decoder introduces no delay or a constant delay.
The multiplexor uses an additional clock called the system time clock (STC). The multiplexor reads this clock to produce time stamps (DTS and PTS values) for each compressed stream (audio and video).
According to a first aspect of end-to-end synchronization, the SCR fields enable the reconstruction of the encoder STC by the demultiplexor 38. From time to time, the multiplexor 18 embeds SCR fields in the multiplexed stream. These fields contain the time, according to the STC, at which the last byte of the SCR field leaves the encoder.
As the delay across the channel is constant, the time that elapses between the emission of two SCR fields according to any clock is the same as that which elapses between their arrivals at the decoder according to the same clock. The elapsed time according to the STC is the difference between the SCR values embedded in those fields.
A free-running clock at the demultiplexor 38 can be adjusted using a phase-locked loop so that the interval it registers between the arrivals of any two SCR fields in the stream is exactly the difference between the times indicated by those fields. Thus the STC is reconstructed by the demultiplexor.
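The adjustment of the demultiplexor clock can be illustrated with a simple software loop. The sketch below is not a full phase-locked loop: it is a first-order tracking filter, chosen here for brevity, that estimates only the ratio between the STC rate and the local clock rate. The function name, the gain value and the test data are assumptions.

```python
def recover_rate(arrivals, scr_values, gain=0.1):
    """Estimate the ratio of the encoder STC rate to the local clock rate.

    arrivals:   local free-running clock readings taken when each SCR
                field arrives (any consistent time unit).
    scr_values: the STC times carried in those SCR fields (same unit).
    """
    rate = 1.0  # initial guess: the local clock already matches the STC
    for i in range(1, len(arrivals)):
        local_interval = arrivals[i] - arrivals[i - 1]
        stc_interval = scr_values[i] - scr_values[i - 1]
        # Difference between the observed ratio and the current estimate.
        error = stc_interval / local_interval - rate
        # First-order loop filter: move a fraction of the way to the target.
        rate += gain * error
    return rate


# Assumed test data: SCR fields every 1000 STC ticks; the local clock runs
# 0.05 percent fast, so it registers 1000.5 ticks between arrivals.
scr = [i * 1000 for i in range(200)]
local = [i * 1000.5 for i in range(200)]
print(recover_rate(local, scr))
```

Multiplying local clock intervals by the returned ratio makes them agree with the differences between SCR values, which is the condition stated above. A practical loop would also correct phase and would filter arrival jitter.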
According to a second aspect of end-to-end synchronization, the DTS and PTS associate a time value with selected video or audio frames within the stream. These time stamps indicate the time according to the STC, at which the associated frames are encoded.
As the channel introduces no delay or a constant delay, all time values (in PTS, DTS, SCR or PCR fields) measured during multiplexing can be used to schedule the demultiplexing and decoding process. The constant delay if any, can be subtracted from the times measured in the decoder device, for comparison.
After the subtraction of the constant delay, the decoding and display of frames is scheduled at precisely the times specified in the PTS and DTS fields with respect to the reconstructed STC.
Thus both aspects of end-to-end synchronization together ensure that each elementary stream decoder runs at the same rate as its peer encoder.
When applied to system 10, the video decoder 42 will decode frames at the same rate as they are encoded, and thus the audio and video will play back synchronously.
There are some disadvantages to the above method of end-to-end synchronization. One of these disadvantages is that MPEG2 Transport multiplexors cannot apply this method to Program encoders. The MPEG2 Transport specification requires encoders of audio and video, belonging to one Program, to be driven by the same clock. (A Program or Service within an MPEG2 Transport stream contains audio and video that are associated with each other and are expected to play back synchronously.)
Another disadvantage is that even for MPEG1 System and MPEG2 Program multiplexors, the method described is not trivial to implement. The method requires the multiplexor to perform certain operations at certain times and to read the STC when those events occur. If there are deviations from those times, then the multiplexor 18 must somehow correct the STC readings.
One aspect of this disadvantage involves the embedding of the SCR fields. The SCR fields contain readings of the STC when the last byte of the field is emitted by the encoder. In a real system, the output of the multiplexor might be very bursty, thus introducing a jitter in the SCR fields. In order to reduce the jitter to a tolerable level, the multiplexor needs to perform some measure of correction for the SCR readings. Some methods known in the art apply an output stage, to smooth the burstiness.
Another aspect of this disadvantage involves the embedding of the PTS and DTS fields. If the video encoder produces a few frames together in one burst, then the time stamps read from the STC might have an intolerable amount of jitter. The multiplexor will need to smooth the burstiness of the elementary stream encoders with an input stage or it will have to correct the time stamps to compensate for their burstiness.
It will be appreciated by those skilled in the art that adding an input stage or an output stage for smoothing introduces delay.
In order to eliminate these disadvantages, it is expedient to use video and audio encoders that use the same clock to drive their digitizers together with the end-to-end synchronization method supported by the MPEG system layers.
When this policy is employed, the first disadvantage is no longer relevant as this is precisely what is required for Transport Program encoders. The second disadvantage can be overcome as follows.
The STC is selected as the video encoder clock (equivalent to selecting the audio clock). Between the compression of consecutive video frames the video clock registers, by definition of the video encoder clock, the elapse of exactly the nominal frame time (e.g. 40 milliseconds for PAL). The video clock is the STC, therefore this is also the interval observed by the STC. Therefore decoding time stamps can be calculated by multiplying the index of the video frame by the nominal frame time (e.g. 40 milliseconds for PAL).
Moreover, when the STC is selected as the audio clock, audio decoding time stamps can be calculated without reading the STC. When the video clock and the audio clock are identical, all DTS and PTS values can be calculated without reading the STC.
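The calculation of time stamps from frame indices alone can be sketched as follows. The sketch assumes time stamps counted in 90 kHz ticks, an audio frame of 1152 samples at 48 kHz, a zero start offset and no frame reordering; the function names and these parameters are illustrative assumptions.

```python
from fractions import Fraction

TICKS_PER_SECOND = 90_000  # assumption: time stamps counted in 90 kHz ticks


def video_dts(frame_index: int, frame_rate: Fraction = Fraction(25)) -> int:
    """Decoding time stamp of a video frame, from its index alone.

    The nominal frame time is 1 / frame_rate (40 ms for PAL), so the time
    stamp is the frame index multiplied by the nominal frame time.
    """
    return int(frame_index * TICKS_PER_SECOND / frame_rate)


def audio_dts(frame_index: int, samples_per_frame: int = 1152,
              sample_rate: int = 48_000) -> int:
    """Decoding time stamp of an audio frame, from its index alone."""
    return frame_index * samples_per_frame * TICKS_PER_SECOND // sample_rate


# Three seconds of material: 75 PAL video frames and 125 audio frames.
print(video_dts(75), audio_dts(125))
```

Neither function reads a clock, so bursty delivery of frames by the elementary stream encoders does not introduce jitter into the computed values.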
Moreover, if the network clock is identical to the video clock, the SCR values can be calculated without reading the STC. Each SCR is calculated by dividing the position, measured in bits, of the last byte of the SCR field within the multiplexed stream by the constant bit-rate of the multiplexed stream across the network.
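The SCR calculation can be sketched in the same way: at a constant bit-rate, the time at which a byte leaves the multiplexor equals the number of bits emitted up to and including that byte divided by the bit-rate. The sketch assumes a zero-based byte index and a result expressed in 90 kHz ticks; the function name and the example bit-rate are hypothetical.

```python
def scr_ticks(last_byte_index: int, mux_bit_rate: int,
              ticks_per_second: int = 90_000) -> int:
    """SCR value for a field whose last byte has the given stream index.

    last_byte_index:  zero-based position of the last byte of the SCR field
                      within the multiplexed stream (assumption).
    mux_bit_rate:     constant bit-rate of the multiplexed stream, in bits
                      per second.
    """
    # Bits that have left the multiplexor once this byte has been emitted.
    bits_emitted = (last_byte_index + 1) * 8
    # Elapsed time is bits divided by bit-rate, converted here to ticks.
    return bits_emitted * ticks_per_second // mux_bit_rate


# Assumed example: at 1.5 Mbit/s, byte 187499 leaves exactly one second in.
print(scr_ticks(187_499, 1_500_000))
```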
Thus, if all elementary streams and the network are driven by the same clock, the implementation of end-to-end synchronization is not complicated by bursty elementary stream encoders and multiplexors.
In a typical implementation, known in the art, elementary stream encoder pairs (video and audio) are often driven by the same clock, but the network clock is independent. In these cases, time stamps can be calculated as described, but SCR values must be read in real-time from the STC.
The synchronization methods described above do not provide an adequate solution in two cases. In the first case, the video and audio encoder clocks are not synchronized (locked). In the second case the video and audio encoder clocks are synchronized, however, the video encoder clock drifts with respect to the audio encoder clock due to a problematic video source.
A time base corrector (TBC) is a device, known in the art, which filters noise out of the timing information of a video signal. The output of a TBC will be stable even if the input is a problematic video source such as a fast-forwarding VCR, a low-quality VCR or a disconnected source.
Some MPEG encoder systems require time base correction of video signals to prevent the video encoder timing from drifting. Professional video equipment often has a built-in TBC at its output. In other cases an external appliance is used.
Using a TBC has some disadvantages, one of which is that requiring a TBC adds cost to the system.
Another disadvantage is that though a TBC overcomes noise introduced by problematic video sources, it does not lock the video clock with the audio clock. Therefore a TBC is useful in the above second case but not in the above first case.
A further disadvantage of using an external TBC is that it introduces a delay in the video signal. The audio is typically not delayed in the TBC. Though this delay is small (typically one or two video frames) and constant, it can cause a detectable loss of audio video synchronization. An ideal system will need to compensate for this delay, possibly introducing further cost.
Another method known in the art is referred to as periodic reset. Some MPEG encoders are programmed to stop and restart after many hours of operation. This flushes any accumulated loss of synchronization.
It will be appreciated that this method has some disadvantages, one of them being that stopping an MPEG encoder and restarting is a time consuming process. (It may take as much as a few video frames.) During this time the output of the decoder is interrupted.
Another disadvantage is that, depending on the implementation of the decoder, additional delay may be incurred until the decoder restarts after a reset.
A further disadvantage is that the interval between resets (the reset period) should be determined by the time it takes for the system to accumulate a discernible delay. This period is system dependent and therefore is difficult to determine in the general case.
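The dependence of the reset period on the system can be illustrated with a rough estimate. The sketch below assumes a constant relative drift between the video and audio timing, which a problematic source would not necessarily exhibit; the function name and the example figures are hypothetical.

```python
def reset_period_seconds(tolerable_skew_s: float,
                         relative_drift: float) -> float:
    """Time for a constant relative drift to accumulate the given skew.

    tolerable_skew_s: largest audio video offset considered acceptable.
    relative_drift:   fractional rate difference between the two streams
                      (for example 1e-5 for 10 parts per million).
    """
    drift = abs(relative_drift)
    if drift == 0:
        raise ValueError("no drift: a reset is never required")
    return tolerable_skew_s / drift


# Assumed example: one PAL frame time (40 ms) of skew at 10 ppm drift.
print(reset_period_seconds(0.040, 1e-5))
```

Since both inputs vary from one installation to another, and the drift may not be constant, no single reset period suits the general case.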
It is an object of the present invention to provide a novel system for providing video and audio stream synchronization, which overcomes the disadvantages of the prior art.
It is another object of the present invention to provide a method for synchronizing audio and video streams, which overcomes the disadvantages of the prior art.
In accordance with a preferred embodiment of the present invention, there is thus provided a method for synchronizing between the encoded streams. The method is implemented in an encoding system receiving a plurality of elementary streams. Each of the elementary streams includes a plurality of elementary samples.
The encoding system produces an encoded stream from each of the elementary streams. Each of the encoded streams includes a plurality of encoded samples. Each elementary stream and its associated encoded stream define a stream including a plurality of samples.
The method includes the steps of:
monitoring the encoded streams,
detecting the rate of production of each encoded stream,
increasing the number of samples in one of the streams when the rate of production of the encoded stream associated with that one stream is greater than the rate of production of another encoded stream, and
decreasing the number of samples in one of the streams when the rate of production of the encoded stream associated with that one stream is lower than the rate of production of another encoded stream.
Each of the elementary streams can either be an audio stream or a video stream.
The step of increasing can either include increasing the number of the elementary samples in the elementary stream associated with that one stream or increasing the number of the encoded samples in the encoded stream associated with that one stream.
The step of decreasing can either include decreasing the number of the elementary samples in the elementary stream associated with that one stream or decreasing the number of the encoded samples in the encoded stream associated with that one stream.
The method can further include a step of normalizing between the rates of production of the encoded streams, before the steps of increasing and decreasing.
Increasing the number of samples can be performed in many ways, such as duplicating at least one of the samples, adding a sample between two selected samples, replacing selected samples with a greater number of new samples, and the like.
Decreasing the number of samples can be performed in many ways, such as deleting at least one sample, skipping at least one sample, replacing at least two samples with fewer new samples, and the like.
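Some of the operations listed above can be sketched on a stream represented as a simple list of samples. The list representation, the function names and the use of an average as the added sample are assumptions made for illustration only.

```python
def duplicate_sample(samples: list, index: int) -> list:
    """Increase the sample count by repeating the sample at index."""
    return samples[:index + 1] + [samples[index]] + samples[index + 1:]


def insert_between(samples: list, index: int) -> list:
    """Increase the sample count by adding a new sample between two others.

    The new sample is the average of its two neighbours, which assumes
    numeric samples such as audio amplitudes.
    """
    new_sample = (samples[index] + samples[index + 1]) / 2
    return samples[:index + 1] + [new_sample] + samples[index + 1:]


def delete_sample(samples: list, index: int) -> list:
    """Decrease the sample count by removing the sample at index."""
    return samples[:index] + samples[index + 1:]


def merge_pair(samples: list, index: int) -> list:
    """Decrease the sample count by replacing two samples with one new one."""
    merged = (samples[index] + samples[index + 1]) / 2
    return samples[:index] + [merged] + samples[index + 2:]


print(duplicate_sample([1, 2, 3], 1), delete_sample([1, 2, 3], 1))
```

Each function returns a new list and leaves its input unchanged, so the same operations can be applied either to elementary samples or to encoded samples held in such a list.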
In accordance with another aspect of the present invention there is provided a method for detecting synchronization loss between the encoded streams. The method includes the steps of:
monitoring the encoded streams,
detecting the rate of production of each encoded stream, and
detecting the difference between the rates of production.
Accordingly, this method can also include a step of normalizing between the rates of production of the encoded streams, before the step of detecting the difference.
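One possible reading of the detection steps is sketched below. Frame counts of the two encoded streams are normalized by dividing each by its nominal frame rate, which expresses both as seconds of material, and the difference is then compared with a threshold. The nominal rates, the threshold and all names are assumptions and are not taken from the method described.

```python
from fractions import Fraction

VIDEO_RATE = Fraction(25)            # assumed: PAL video frames per second
AUDIO_RATE = Fraction(48_000, 1152)  # assumed: audio frames per second


def normalized(frames_produced: int, nominal_rate: Fraction) -> Fraction:
    """Seconds of material represented by the frames produced so far."""
    return Fraction(frames_produced) / nominal_rate


def sync_difference(video_frames: int, audio_frames: int) -> Fraction:
    """Normalized production of video minus that of audio, in seconds."""
    return (normalized(video_frames, VIDEO_RATE)
            - normalized(audio_frames, AUDIO_RATE))


def sync_lost(video_frames: int, audio_frames: int,
              threshold_s: Fraction = Fraction(40, 1000)) -> bool:
    """True when the streams differ by more than the assumed threshold."""
    return abs(sync_difference(video_frames, audio_frames)) > threshold_s


# 75 video frames and 125 audio frames both represent three seconds.
print(sync_difference(75, 125), sync_lost(78, 125))
```

Without the normalizing step the two counts are not comparable, because the streams have different nominal frame rates.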
In accordance with a further aspect of the present invention, there is thus provided a method for reducing the rate of production of one of the streams with respect to the rate of production of another stream. The method includes the step of decreasing the number of the samples in the one stream.
In accordance with yet another aspect of the present invention, there is thus provided a method for reducing the rate of production of one of the streams with respect to the rate of production of another stream. The method includes the step of increasing the number of samples in the other stream.