The present invention relates generally to processing telecommunications signals. More particularly, the invention provides a method and apparatus for voice transmixing of a number of voice compression bitstreams of different data rate encoding methods. Merely by way of example, the invention has been applied to voice transmixing in systems that employ multi-rate or multi-mode CELP based voice compression codecs, but it would be recognized that the invention may also include other applications.
This invention relates to speech conferencing. Conferencing has been a feature of PSTN services for more than two decades. In fact, patents dating back to the early 1970s outline circuits that allow analogue phone signals to be mixed into a total signal and transmitted to the non-speaking participants (U.S. Pat. Nos. 4,022,981, 4,022,991 and 4,031,328 are only three examples of such patents). FIG. 1 illustrates a digital version of such an apparatus. FIG. 2 illustrates a similar apparatus from the prior art (U.S. Pat. No. 6,463,414) that allows each conference channel to use a different voice compression scheme.
The early work focused on summing circuits that would be part of a conference bridge. Large conferences could also be handled in a number of ways, most of which were hardware circuits (see for example U.S. Pat. No. 4,000,377). Much of that work concerned how PCM “coded” speech signals could be extracted from a Time Division Multiplexed (TDM) line, summed without causing any overflow, and then placed back on the line going to the non-speakers. FIG. 3 shows a sample prior art apparatus that can be used to determine which of the contributing conference channels is to be chosen to be passed on to the listener.
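The summing step described above can be sketched in software as follows. This is only an illustrative sketch, not the circuitry of any cited patent: it assumes linear 16-bit PCM samples, and it uses saturation (clamping) as one possible way of avoiding overflow. The function names `mix_pcm16` and `mix_for_listener` are hypothetical.

```python
from typing import List, Sequence

# Assumption: samples are linear 16-bit PCM values.
INT16_MIN, INT16_MAX = -32768, 32767


def mix_pcm16(channels: Sequence[Sequence[int]]) -> List[int]:
    """Sum equal-length PCM frames sample by sample, clamping to the int16 range."""
    if not channels:
        return []
    frame_len = len(channels[0])
    if any(len(ch) != frame_len for ch in channels):
        raise ValueError("all channels must have the same frame length")
    mixed = []
    for samples in zip(*channels):
        total = sum(samples)  # Python ints do not overflow, so clamp explicitly
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed


def mix_for_listener(channels: Sequence[Sequence[int]], listener: int) -> List[int]:
    """Mix every channel except the listener's own, so conferees do not hear themselves."""
    others = [ch for i, ch in enumerate(channels) if i != listener]
    return mix_pcm16(others)
```

Saturation keeps the sum within range at the cost of clipping distortion when several loud channels coincide; scaling the sum down is another option, and the source does not state which approach the early circuits used.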
The method of choosing a speaker has always been a major issue for inventors concerned with the development of conferencing technology (see for example U.S. Pat. Nos. 4,054,755, 4,139,731, 4,257,120, 4,267,593, 4,274,155, 4,387,457 and 4,456,792). It was recognized at an early stage that when there are more than three conferees, people tend to be more conservative in how much they speak, and so it was speculated that in most cases only a single person is speaking. If that assumption holds, it was inferred, the conference can be reduced to a switching circuit that connects a single channel's input to all the other channels' outputs once that channel is determined to belong to a speaker. As such, a number of patented solutions to the conferencing problem included speaker detection using an energy measure. Simply put, the loudest speaker won the floor (see the previously listed U.S. Patents and FIG. 3 for an illustration of such an apparatus).
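The "loudest speaker wins the floor" rule can be illustrated with a minimal sketch. It assumes one frame of linear PCM samples per channel and uses mean squared amplitude as the energy measure; the cited patents may use other measures, and the names `frame_energy`, `select_loudest` and `switch_outputs` are hypothetical.

```python
from typing import List, Sequence


def frame_energy(frame: Sequence[int]) -> float:
    """Mean squared amplitude of one frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def select_loudest(frames: Sequence[Sequence[int]]) -> int:
    """Return the index of the channel with the highest frame energy.

    Ties go to the lowest channel index.
    """
    if not frames:
        raise ValueError("at least one channel is required")
    energies = [frame_energy(f) for f in frames]
    return max(range(len(frames)), key=energies.__getitem__)


def switch_outputs(frames: Sequence[Sequence[int]]) -> List[List[int]]:
    """Connect the loudest channel's input to every other channel's output.

    The selected speaker receives silence in this sketch.
    """
    speaker = select_loudest(frames)
    silence = [0] * len(frames[speaker])
    return [silence if i == speaker else list(frames[speaker])
            for i in range(len(frames))]
```

Because the decision rests on energy alone, a loud noise burst on any channel would be selected just as readily as speech.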
However, a number of inventors in the field also recognized that the single-speaker assumption did not always hold and that people did sometimes interrupt one another. It was also recognized that loud noise can sometimes take the floor from actual speakers. Although this problem has existed for decades, it was only recently that the use of a Voice Activity Detection (VAD) algorithm was proposed to determine whether there is actual speech on the incoming line (such a proposal has been made in U.S. Patent Application Nos. 2003/0135368 and 2005/0102137). A VAD algorithm can take different forms. To be effective, however, it must take into account both the time domain characteristics and the frequency domain characteristics of speech. In this context, the term “characteristics” refers to statistical as well as energy features of the signal.
In the recently proposed work (the two previously listed patent applications, 2005/0102137 and 2003/0135368, as well as U.S. Pat. No. 5,390,177) the VAD used is either an energy-centric approach or a compression domain VAD approach. In either case, no mention is made of error handling. VAD algorithms (like all signal detection algorithms) operate with a margin of error. In some cases the erroneous detection of speech can be as high as 25%. That is, speech is detected where there is none (VAD algorithms are in fact deliberately biased towards speech to ensure that none is missed), which in turn leaves the speech conferencing tool uncertain as to which channels should be given the floor.
In the prior art there has also been concern about the quality of tandeming coders in the conferencing process. In this context, “tandeming” refers to the placement of speech codecs (encoder and decoder) end to end such that speech is coded and decoded using one specified coder and then re-encoded and re-decoded using a different coder, or the same coder. An apparatus that utilizes such an operation is illustrated in FIG. 2, where the conferees are accessing the same conference from a number of different networks and so encoders and decoders must be used on each channel. The concern is that once decoding has occurred, re-encoding the speech compounds the loss of quality. That is why a number of proposed solutions have focused on the use of switching rather than tandeming (see for example U.S. Pat. Nos. 4,022,981, 4,054,757, 4,271,502 as well as U.S. Patent Application Nos. 2003/0135368 and 2005/0102137). In such solutions, a single speaker would be heard by the listening channels (with a number of variations on the same theme). However, in such cases the other conferees' input is lost or not heard by all the listening participants. It is also apparent that when different compression standards are used by the input channels, the conversion from input standard to output standard must also be handled. In short, a switching solution cannot handle a situation where the input standard differs from the output standard and still maintain the claimed quality advantage.
Recently, some prior art has been published that proposes solutions for such cases based on compression level transcoding (such proposals have been made in U.S. Patent Application Nos. 2003/0135368 and 2005/0102137). Yet even in such cases restrictions are placed on the user equipment: specifically, the end user needs to be able to receive multiple bit-streams in order to hear more than a single speaker.