In a conference call, a group of terminal users is connected together in such a way that when one of the participating users talks, all other participating users are able to hear the voice of the talking participant. In this kind of communication, normally only one of the participating users is talking at a time, while the other users are listening. In a centralized conference call, the terminals of the participating users are not connected directly with each other, but via a conference call server. A centralized conference call can be realized for instance by a Voice over Internet Protocol (VoIP) conference call application in the Internet or as voice conferencing in the packet switched domain of a Universal Mobile Telecommunications System (UMTS) network.
In a VoIP session, the voice data is typically carried by using the Real-time Transport Protocol (RTP) on top of the Internet Protocol (IP) and the User Datagram Protocol (UDP). RTP has been described in detail in RFC 1889: “RTP: A Transport Protocol for Real-Time Applications”, January 1996, by H. Schulzrinne et al.
An end-to-end VoIP connection is often called a VoIP tunnel. In a typical centralized conference call set-up, VoIP tunnels are formed between each participating terminal and the conference call server.
For illustration, the tunneling of coded voice in a centralized, RTP based conference call is presented in FIG. 1.
FIG. 1 schematically shows a centralized conference call system in a packet switched domain of a UMTS network 11, with a conference call server 12 connected to this network 11 and with a plurality of mobile terminals 13. The mobile terminals 13 are connected to the conference call server 12 via the UMTS network 11 using RTP tunnels 14.
At the terminals 13, voice data produced by the respective user of the terminals 13 is first encoded and then inserted into the payload of RTP packets. There is a multitude of alternative audio coders that can be used to perform the actual voice coding. For example, the Adaptive Multirate (AMR) speech codec, which is specified as the mandatory speech codec for the 3rd generation systems, could be used to compress the speech data carried inside the RTP payload. The coders encode the speech samples to frames, which are then carried over the RTP/UDP/IP protocols via the UMTS network 11 to the conference call server 12.
The conference call server 12 comprises an RTP mixer 15, which receives the incoming RTP packet flows from the connected terminals 13, removes the RTP packaging, combines the flows into a single flow of RTP packets and then sends this flow to each of the terminals 13.
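The combining step performed by the RTP mixer 15 can be illustrated with a minimal sketch. It assumes that the incoming flows have already been decoded to equally long frames of 16-bit linear PCM samples and that combining means a sample-wise sum with saturation; the function name `mix_pcm16` and this clipping strategy are illustrative assumptions and are not taken from RFC 1889 or from any particular server implementation.

```python
def mix_pcm16(frames):
    """Sum equally long frames of 16-bit PCM samples, one frame per terminal.

    Hypothetical helper: the sum is clipped to the signed 16-bit range so
    that several simultaneous talkers cannot overflow a sample.
    """
    if not frames:
        return []
    length = len(frames[0])
    if any(len(frame) != length for frame in frames):
        raise ValueError("all frames must cover the same number of samples")
    # zip(*frames) yields, per sample position, the samples of all terminals.
    return [max(-32768, min(32767, sum(samples))) for samples in zip(*frames)]


# Two terminals, two samples each.
mixed = mix_pcm16([[1000, -2000], [500, -500]])
```

In a real mixer the resulting frame would be re-encoded and packed into a new RTP packet before being sent to the terminals 13.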
To each RTP packet transmitted between the terminals 13 and the conference call server 12, a header is associated. The structure of this header, which is specified in the above cited RFC 1889, is illustrated in FIG. 2. The header comprises a field V which identifies the version of the employed RTP and a field P for a padding bit. If the padding bit is set, the packet contains one or more additional padding octets at the end which are not part of the payload. The header further comprises a field X for an extension bit. If the extension bit is set, the fixed header is followed by exactly one header extension. The header moreover comprises a field CC for a Contributing Source (CSRC) count, which contains the number of CSRC identifiers that follow the fixed header, and a field M for a marker bit, the interpretation of the marker being defined by a profile. In addition, the header comprises a field PT for identifying the format of the payload and a field for a Sequence Number, which increments by one for each RTP data packet sent. The Sequence Number may be used by the receiver to detect a packet loss and to restore the packet sequence. The header also comprises a field for a Timestamp, which reflects the sampling instant of the first octet in the RTP data packet.
Furthermore, the RTP packet headers carry a Synchronization Source (SSRC) identifier and, as mentioned above with reference to the CC field, a list of Contributing Source (CSRC) identifiers.
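For illustration, the header fields described above can be read from a received packet as in the following sketch. It follows the fixed header layout of RFC 1889 (12 octets followed by CC CSRC identifiers of 32 bits each, all in network byte order); the function name `parse_rtp_header` and the dictionary keys are chosen for this example only, and the header extension and the payload are not evaluated.

```python
import struct


def parse_rtp_header(packet: bytes) -> dict:
    """Parse the fixed RTP header and the CSRC list of one packet."""
    if len(packet) < 12:
        raise ValueError("the fixed RTP header is 12 octets long")
    # Two single octets (V/P/X/CC and M/PT), a 16-bit sequence number,
    # a 32-bit timestamp and the 32-bit SSRC identifier.
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F
    end = 12 + 4 * cc
    if len(packet) < end:
        raise ValueError("packet is too short for the announced CSRC list")
    csrc = list(struct.unpack("!%dI" % cc, packet[12:end]))
    return {
        "version": b0 >> 6,
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": cc,
        "marker": bool(b1 & 0x80),
        "payload_type": b1 & 0x7F,
        "sequence_number": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "csrc": csrc,
    }


# Example packet: version 2, CC = 2, payload type 96 (an arbitrary dynamic
# type assumed here), sequence number 7, timestamp 160, two CSRC identifiers.
example = struct.pack("!BBHIIII", 0x82, 96, 7, 160,
                      0x11111111, 0xAAAA0001, 0xAAAA0002)
fields = parse_rtp_header(example)
```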
The SSRC identifier is used to identify the synchronization source that has transmitted the RTP packet in question. An SSRC identifier which is unique for the respective RTP session is assigned randomly to each possible source, i.e. to each of the terminals 13 and to the conference call server 12. Each terminal 13 adds the SSRC identifier assigned to it to the SSRC identifier field in the RTP header of each RTP packet it assembles. Equally, the RTP mixer 15 of the conference call server 12 adds the SSRC identifier assigned to the conference call server 12 to the SSRC identifier field in the RTP header of each RTP packet leaving the server 12.
The CSRC list is used to identify different sources contributing to an RTP packet and is thus only of relevance for the RTP packets assembled in the conference call server 12. The RTP mixer 15 adds the SSRC identifiers of those terminals 13 contributing to the combined outbound VoIP flow to the CSRC fields of outgoing RTP packets.
In order to enable a control of the VoIP connections using RTP, the above cited RFC 1889 in addition defines an RTP Control Protocol (RTCP). RTCP is used for instance to keep both ends of a connection informed about the quality of service they are providing and receiving. This information is sent in RTCP sender report (SR) and receiver report (RR) packet types. In addition, the RTP specification defines an RTCP source description (SDES) packet type. RTCP SDES packets can be used by the source to provide more information about itself. SDES CNAME or NAME packets can be used for example to provide a mapping between the random SSRC identifier and the source identity. SDES CNAME packets are intended for providing canonical end-point identifiers, while SDES NAME packets are intended for providing a real name used to describe the respective source. The RTP mixer 15 is expected to combine SR and RR type RTCP packets from all terminals 13 before forwarding them. The SDES type RTCP packets, in contrast, are forwarded by the RTP mixer 15 to all conference participants 13 without modifications.
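The CSRC handling in the mixer and the use of SDES information at a receiving terminal can be sketched as follows. The sketch assumes a table that maps SSRC identifiers to names learned from forwarded SDES packets; the function names `build_mixed_header` and `names_for`, the table and the example identifiers are hypothetical. Since the CC field is four bits wide, at most 15 contributing sources can be listed in one packet.

```python
import struct

MAX_CSRC = 15  # the 4-bit CC field cannot announce more identifiers


def build_mixed_header(mixer_ssrc, contributor_ssrcs, seq, timestamp, payload_type):
    """Build the RTP header of an outgoing mixed packet (version 2, no padding,
    no extension, marker bit cleared)."""
    csrc = list(contributor_ssrcs)[:MAX_CSRC]
    b0 = (2 << 6) | len(csrc)   # V = 2, P = 0, X = 0, CC = number of CSRCs
    b1 = payload_type & 0x7F    # M = 0, PT
    fixed = struct.pack("!BBHII", b0, b1, seq & 0xFFFF,
                        timestamp & 0xFFFFFFFF, mixer_ssrc)
    # The SSRC field carries the mixer's own identifier; the terminals'
    # SSRC identifiers follow as the CSRC list.
    return fixed + struct.pack("!%dI" % len(csrc), *csrc)


def names_for(csrc_list, sdes_table):
    """Map CSRC identifiers to names learned from SDES CNAME or NAME items."""
    return [sdes_table.get(ssrc, "unknown (0x%08x)" % ssrc) for ssrc in csrc_list]


header = build_mixed_header(0xC0FFEE01, [0x00000001, 0x00000002], 1, 160, 96)
labels = names_for([0x00000001, 0x00000003], {0x00000001: "alice@host.example"})
```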
In a conference call it is sometimes difficult for the participating users to recognize immediately who is speaking. This is a problem in particular when a conference call has many participating users who do not know each other very well.
The above cited RFC 1889 states that an example application is audio conferencing where a mixer indicates all the talkers whose speech was combined to produce the outgoing packet, allowing the receiver to indicate the current talker, even though all the audio packets contain the same SSRC identifier, i.e. that of the mixer.
In any sensible VoIP usage of a speech codec, however, the codec will send out Silence Descriptor (SID) frames enabling a comfort noise generation at the receiving end, as long as the respective conference participant is inactive, i.e. listening. Thus, all sources will always produce a signal that is transmitted to the conference call server 12. The conference call server 12 decodes VoIP flows received from each of the participants back to speech or to SID frames for summation before encoding the outbound speech and SID frames that will be transmitted to the terminals 13. This implies that the SSRC identifiers of all terminals 13 are included by the mixer 15 into the CSRC list of the outgoing mixed RTP packets, and therefore it is impossible for the receiving terminals 13 to distinguish active from inactive participants. It should be noted, however, that including the SSRC identifiers of all participating terminals 13 in the CSRC list also has its benefits, e.g. in order to keep each participating user up to date about the number and identity of all other users participating in the conference.
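The effect described above can be made concrete with a strongly simplified model. It assumes that, for each mixing interval, the mixer knows which frame type it received from each SSRC identifier and that every source which delivered any frame is treated as contributing; the function name `csrc_list_for_interval` and the example identifiers are hypothetical.

```python
def csrc_list_for_interval(frames):
    """Return the CSRC list for one mixing interval.

    `frames` maps an SSRC identifier to the type of frame received from that
    source, either "SPEECH" or "SID". The frame type is deliberately ignored:
    every source that delivered a frame is listed as a contributor.
    """
    return sorted(frames)


# The first terminal talks in interval A, the third terminal in interval B.
interval_a = {0x01: "SPEECH", 0x02: "SID", 0x03: "SID"}
interval_b = {0x01: "SID", 0x02: "SID", 0x03: "SPEECH"}

same_list = csrc_list_for_interval(interval_a) == csrc_list_for_interval(interval_b)
```

Under these assumptions both intervals yield the same CSRC list, so a receiving terminal 13 cannot infer from the list alone which participant is currently talking.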