Audio signals, such as speech or music, are encoded, for example, to enable efficient transmission or storage of the audio signals.
Audio encoders and decoders are used to represent audio based signals, such as music and background noise. These types of coders typically do not utilise a speech model for the coding process, rather they use processes for representing all types of audio signals, including speech.
Speech encoders and decoders (codecs) are usually optimised for speech signals, and can operate at either a fixed or variable bit rate.
An audio codec can also be configured to operate with varying bit rates. At lower bit rates, such an audio codec may work with speech signals at a coding rate equivalent to a pure speech codec. At higher bit rates, the audio codec may code any signal including music, background noise and speech, with higher quality and performance.
In some audio codecs the input signal is divided into a limited number of bands. Each of the band signals may be quantized. From the theory of psychoacoustics it is known that the highest frequencies in the spectrum are perceptually less important than the low frequencies. In some audio codecs this is reflected by a bit allocation in which fewer bits are allocated to high frequency signals than to low frequency signals.
One emerging trend in the field of media coding is the use of so-called layered codecs, for example the ITU-T Embedded Variable Bit-Rate (EV-VBR) speech/audio codec and the ITU-T Scalable Video Codec (SVC). The scalable media data consists of a core layer, which is always needed to enable reconstruction at the receiving end, and one or several enhancement layers that can be used to provide added value to the reconstructed media (e.g. improved media quality or increased robustness against transmission errors).
The scalability of these codecs may be used at the transmission level, e.g. for controlling the network capacity or shaping a multicast media stream to facilitate operation with participants behind access links of different bandwidth. At the application level the scalability may be used for controlling such variables as computational complexity, encoding delay, or desired quality level. Note that whilst in some scenarios the scalability can be applied at the transmitting end point, there are also operating scenarios where it is more suitable for an intermediate network element to perform the scaling.
For example, this scalable layered approach to audio encoding may be employed in telephony. In the packet switched network transmission protocols typically employed for Voice over IP (VoIP), the layer encoded audio signal is carried in packets transmitted according to the Real-time Transport Protocol (RTP), encapsulated in the User Datagram Protocol (UDP) and further encapsulated in the Internet Protocol (IP).
In such media transport arrangements scalable codecs can be handled in one of two ways. In the first arrangement the enhancement layers may be transmitted in the same packets, i.e. in the same RTP session as the core layer data.
The approach of carrying all of the layers (of a media frame) in a single packet provides low overhead and easy cross-layer synchronization as the receiver decoder knows that all information for a certain media frame is carried in the same packet, which implicitly also provides cross-layer media synchronization. However, the drawback of this approach is that any intermediate network element carrying out a scaling operation needs to be aware of the details of the packet and media content structure, and then carry out a filtering operation by reading, parsing, and then modifying the packet contents.
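The filtering burden described above can be illustrated with a minimal sketch. The payload layout used here (a one-byte layer identifier, a one-byte length, then the layer data) is purely hypothetical and is not taken from any real RTP payload format; the function names are likewise illustrative. The point is only that the network element must parse and rebuild the packet contents in order to scale the stream.

```python
def pack_layers(layers):
    """Build a payload as repeated [layer_id][length][data] records.

    This layout is a hypothetical illustration, not a real payload format.
    """
    out = bytearray()
    for layer_id, data in layers:
        out += bytes([layer_id, len(data)]) + data
    return bytes(out)


def strip_layers(payload, max_layer):
    """Parse the payload and rebuild it, keeping only layers <= max_layer."""
    out = bytearray()
    pos = 0
    while pos < len(payload):
        layer_id, length = payload[pos], payload[pos + 1]
        data = payload[pos + 2 : pos + 2 + length]
        if layer_id <= max_layer:
            out += bytes([layer_id, length]) + data
        pos += 2 + length  # advance past the record header and its data
    return bytes(out)


# One media frame with a core layer (0) and two enhancement layers (1, 2).
packet = pack_layers([(0, b"core"), (1, b"enh1"), (2, b"enh2")])
scaled = strip_layers(packet, max_layer=0)
```

Here the intermediate element has to understand the record structure and produce a new, shorter payload for every packet it scales.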
The second approach is that the enhancement layers (or subsets of enhancement layers) may be transmitted in packet stream(s) separate from the core layer data. This second approach also requires a signalling mechanism that can be used to synchronize the separate packet data streams carrying layers of the same media source.
However the second approach, employing separate data streams for (subsets of) layers, provides easier scaling opportunities because the scaling operations can be realized by discarding the packets of some data streams, without requiring any packet to be modified.
This approach therefore does not require in-depth knowledge about the packet structure; instead, the scaling operations can be performed based on information about the relationship between the data streams.
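As a rough sketch of scaling by discarding whole data streams, the following example filters packets using only a mapping from stream to layer. The `Packet` type, the session identifiers and the `SESSION_TO_LAYER` table are assumptions made for illustration; how such a mapping is signalled is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    session_id: int  # hypothetical identifier of the data stream (RTP session)
    payload: bytes   # opaque to the scaling element


# Hypothetical relationship between data streams and layers (0 = core layer).
SESSION_TO_LAYER = {100: 0, 101: 1, 102: 2}


def scale_by_dropping(packets, max_layer):
    """Forward only packets whose stream carries a layer <= max_layer.

    Payloads are never parsed or modified; packets are kept or dropped whole.
    """
    return [p for p in packets if SESSION_TO_LAYER[p.session_id] <= max_layer]


packets = [
    Packet(100, b"core-0"), Packet(101, b"enh1-0"), Packet(102, b"enh2-0"),
    Packet(100, b"core-1"), Packet(101, b"enh1-1"), Packet(102, b"enh2-1"),
]
forwarded = scale_by_dropping(packets, max_layer=1)
```

The scaling element inspects only the stream identity, so it needs no knowledge of the payload structure.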
Using multiple RTP sessions to carry multiple data streams is the traditional way to transmit layered media data within an RTP framework (the approach is often referred to as scalable multicast).
Synchronization of multiple data streams is obviously a problem when the receiver is reconstructing a media frame using layers distributed across multiple RTP sessions.
The timeline of a media stream received in RTP packets can be reconstructed using Time Stamp (TS) information included in the RTP header. The RTP TS provides information on the temporal difference compared to other RTP packets transmitted in the same RTP session, which enables putting each received media frame in its correct place in the timeline.
However, the initial value of the RTP TS of an RTP session is a random value. Thus the RTP TS does NOT indicate an absolute time (i.e. “wallclock time”) but only a relative time or timing reference within the RTP session. Note that this “randomness” may be considered to be an unknown offset from an absolute time. The unknown offset may and is likely to be different in each RTP session.
Thus two or more RTP sessions cannot be synchronized based on their RTP TS values alone. This also holds for separate RTP sessions used to carry (subsets of) layers of a layered encoding.
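The following sketch illustrates why the RTP TS gives a usable timeline within one session but not across sessions. The clock rate, frame duration and initial TS values are assumptions chosen for the example; only the 32-bit width of the RTP TS field is a fixed property.

```python
CLOCK_RATE = 16000   # assumed RTP clock rate in Hz
FRAME_TICKS = 320    # assumed 20 ms frames at 16 kHz
TS_MODULO = 2 ** 32  # the RTP TS is a 32-bit field and wraps around


def rtp_timestamps(initial_ts, num_frames):
    """RTP TS values for consecutive frames, starting from a random offset."""
    return [(initial_ts + n * FRAME_TICKS) % TS_MODULO for n in range(num_frames)]


def relative_times(timestamps):
    """Seconds elapsed since the first packet of the same session."""
    first = timestamps[0]
    return [((ts - first) % TS_MODULO) / CLOCK_RATE for ts in timestamps]


# Two sessions carrying layers of the same frames, with different offsets.
core_ts = rtp_timestamps(initial_ts=1_000_000, num_frames=3)
enh_ts = rtp_timestamps(initial_ts=4_000_000_000, num_frames=3)

core_rel = relative_times(core_ts)  # timeline within the core session
enh_rel = relative_times(enh_ts)    # timeline within the enhancement session
```

Each session yields the same relative timeline, but the raw TS values share nothing in common, so a receiver that joins mid-stream cannot tell which enhancement packet belongs to which core packet from the TS values alone.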
The prior-art mechanism for synchronizing multiple RTP sessions is to use the control protocol associated with the transport protocol to send additional information. Thus in the prior art RTP Control Protocol (RTCP) reports may be transmitted within each session. The transmitter of these RTCP reports includes both a timing reference (NTP) and the corresponding sending instant in the RTP TS domain in the RTCP Sender Reports (SR), which are transmitted according to a specified pattern. Furthermore, an RTCP packet also includes an identifier (SSRC) that is used to map the RTCP packet to the correct stream of RTP packets.
The receiver, on receiving the control protocol packets, may use the timing reference and the RTP time stamps within these packets to compute the RTP TS offset from the timing reference (NTP) for each of the RTP sessions it receives. These offset values may then be used to match the timing of the media data received in separate RTP sessions. Therefore, for example, the receiver may combine layers of a media frame received in multiple RTP sessions.
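The offset computation can be sketched as follows. This is a simplified illustration under stated assumptions: the `SenderReport` type and `to_wallclock` function are hypothetical names, the NTP time is held as a plain floating-point number of seconds rather than the fixed-point format used on the wire, and a single common clock rate is assumed for both sessions.

```python
from dataclasses import dataclass

CLOCK_RATE = 16000   # assumed RTP clock rate in Hz
TS_MODULO = 2 ** 32  # the RTP TS is a 32-bit field


@dataclass(frozen=True)
class SenderReport:
    ntp_seconds: float  # timing reference, simplified to float seconds
    rtp_ts: int         # RTP TS corresponding to the same instant


def to_wallclock(rtp_ts, sr, clock_rate=CLOCK_RATE):
    """Map an RTP TS to the timing reference using one sender report."""
    diff = (rtp_ts - sr.rtp_ts) % TS_MODULO
    if diff >= TS_MODULO // 2:
        diff -= TS_MODULO  # packet was sent before the report instant
    return sr.ntp_seconds + diff / clock_rate


# One report per session; the two sessions have unrelated TS offsets.
core_sr = SenderReport(ntp_seconds=1000.0, rtp_ts=1_000_000)
enh_sr = SenderReport(ntp_seconds=1000.5, rtp_ts=4_000_008_000)

# Layers of the same media frame, received in the two sessions.
core_time = to_wallclock(1_000_640, core_sr)
enh_time = to_wallclock(4_000_000_640, enh_sr)
same_frame = abs(core_time - enh_time) < 1e-6
```

Once both values are expressed against the common timing reference, packets from different sessions that map to the same instant can be treated as layers of the same media frame.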
However a problem associated with such a system is that it requires RTCP SRs to be received for each of the RTP sessions before any full reconstruction of a media frame can be carried out. In the case of layered media this means in practice that only the core layer of any layered encoding scheme is available until the synchronization information is available, i.e. until the first RTCP packets (on each of the sessions) are received.
A less complex approach, not relying on the control protocol associated with the transport protocol, has been to pre-synchronize the RTP TS across RTP sessions in the transmitting end-point. In other words the “random” initial value of the RTP TS is set to be the same value for each RTP session.
Whilst this may provide a simple cross-session synchronization mechanism without the need to transmit additional data, it is not in line with the RTP specification, and existing RTP implementations may therefore not support it.
Furthermore such an approach would provide synchronization at the RTP (header) level, but only for the subset of RTP payloads which were pre-synchronized. Such payload type-dependent processing at the RTP level may be considered an undesirable feature in a system handling multiple payload types.
A further prior-art solution has been to attach additional information for each transmitted data unit (e.g. a layer of a media frame in a layered encoding transport) indicating its temporal location in the presentation timeline. Such additional information may be for example a cross-layer sequence number or an additional timestamp information value that can be used at the receiver/intermediate network element to reconstruct the presentation order of media frames and layers within the frames.
This additional information adds overhead to each packet. Although it may be possible to use data fields smaller than an RTP Sequence Number (SN) and RTP TS (16 bits and 32 bits, respectively), any information added to each packet would still introduce additional overhead for each transmitted layer. For example, in the case of small pieces of speech/audio data (on the order of 10-20 bytes), even one additional byte of overhead per layer may have a significant effect on the overall system performance.
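The scale of this overhead can be checked with simple arithmetic. In the sketch below the payload sizes come from the 10-20 byte range mentioned above, while the number of layers (three) and the frame rate (50 frames per second, i.e. 20 ms frames) are assumptions made only for illustration.

```python
def extra_overhead_ratio(payload_bytes, extra_bytes=1):
    """Added bytes as a fraction of the layer payload size."""
    return extra_bytes / payload_bytes


def extra_bitrate_bps(extra_bytes, layers, frames_per_second):
    """Additional bit rate caused by per-layer information, in bits per second."""
    return extra_bytes * 8 * layers * frames_per_second


ratio_small = extra_overhead_ratio(10)  # one extra byte on a 10-byte payload
ratio_large = extra_overhead_ratio(20)  # one extra byte on a 20-byte payload

# Assumed scenario: 3 layers, 50 frames per second, 1 extra byte per layer.
added_bps = extra_bitrate_bps(extra_bytes=1, layers=3, frames_per_second=50)
```

Under these assumptions a single added byte corresponds to 5-10% of each layer payload and to 1200 bit/s of additional traffic.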