Historically, conversational telephony has been implemented with handsets having a single transducer that outputs sound to only one of the user's ears. In the last decade, users have started to use their portable handsets in conjunction with headphones to receive sound at both ears, mainly to listen to music but also, sometimes, to listen to speech. Nevertheless, when a portable handset is used to transmit and receive conversational speech, the content is still monophonic, even though it is presented to both of the user's ears when headphones are used.
With the newest 3GPP speech coding standard as described in Reference [1], of which the full content is incorporated herein by reference, the quality of the coded sound, for example speech and/or audio, that is transmitted and received through a portable handset has been significantly improved. The next natural step is to transmit stereo information such that the receiver gets as close as possible to the real-life audio scene that is captured at the other end of the communication link.
In audio codecs, for example as described in Reference [2], of which the full content is incorporated herein by reference, stereo information is normally transmitted.
For conversational speech codecs, a monophonic signal is the norm. When a stereophonic signal is transmitted, the bit-rate often needs to be doubled since the left and right channels are each coded using a monophonic codec. This works well in most scenarios, but has the drawbacks of doubling the bit-rate and failing to exploit any potential redundancy between the left and right channels. Furthermore, to keep the overall bit-rate at a reasonable level, a very low bit-rate is used for each channel, which degrades the overall sound quality.
A possible alternative is to use so-called parametric stereo as described in Reference [6], of which the full content is incorporated herein by reference. Parametric stereo sends information such as the inter-aural time difference (ITD) or the inter-aural intensity difference (IID), for example. The latter information is sent per frequency band and, at low bit-rates, the bit budget allocated to stereo transmission is not sufficiently high to allow these parameters to work efficiently.
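By way of illustration only, the two parameters named above can be sketched as follows. This is a minimal sketch and not the method of Reference [6]: it computes a single broadband value per frame rather than one value per frequency band, and the function names `iid_db` and `itd_samples`, the `max_lag` search range and the test signal are all assumptions introduced for the example.

```python
import math


def iid_db(left, right, eps=1e-12):
    """Intensity difference between channels in dB (positive: left is louder)."""
    energy_left = sum(x * x for x in left) + eps
    energy_right = sum(x * x for x in right) + eps
    return 10.0 * math.log10(energy_left / energy_right)


def itd_samples(left, right, max_lag):
    """Lag, in samples, that maximises the cross-correlation of the channels.

    A positive result means the right channel lags the left channel.
    """
    n = len(left)
    best_lag, best_corr = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        corr = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                corr += left[i] * right[j]
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag


# Hypothetical frame: a short sine burst padded with silence on the left
# channel; the right channel is the same burst delayed by 3 samples and
# attenuated by a factor of 0.5.
burst = [math.sin(2.0 * math.pi * i / 16.0) for i in range(32)]
left = [0.0] * 8 + burst + [0.0] * 8
right = [0.0] * 3 + [0.5 * x for x in left[:-3]]

itd = itd_samples(left, right, max_lag=8)  # 3 samples
iid = iid_db(left, right)                  # about 6.02 dB
```

Even in this simplified form, each frame yields two values to quantize and transmit; sending the intensity difference for every frequency band multiplies that cost, which is consistent with the bit-budget limitation noted above.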
Transmitting a panning factor could help to create a basic stereo effect at low bit-rate, but such a technique does nothing to preserve the ambiance and presents inherent limitations. Too fast an adaptation of the panning factor is disturbing to the listener, while too slow an adaptation does not reflect the real position of the speakers, which makes it difficult to obtain good quality in the case of interfering talkers or when the fluctuation of the background noise is significant. Currently, encoding conversational stereo speech with a decent quality for all possible audio scenes requires a minimum bit-rate of around 24 kb/s for wideband (WB) signals; below that bit-rate, the speech quality starts to suffer.
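The adaptation trade-off described above can be illustrated with a minimal sketch. The energy-ratio definition of the panning factor, the first-order recursive smoothing, and the names `panning_factor`, `smooth` and `alpha` are assumptions made for this example only; they are not taken from any cited reference.

```python
def panning_factor(left, right):
    """Hypothetical per-frame panning factor in [0, 1].

    1.0 means all energy is in the left channel, 0.0 means all energy is in
    the right channel. A silent frame is treated as centred (0.5).
    """
    energy_left = sum(x * x for x in left)
    energy_right = sum(x * x for x in right)
    total = energy_left + energy_right
    if total <= 0.0:
        return 0.5
    return energy_left / total


def smooth(factors, alpha):
    """First-order recursive smoothing; alpha near 1.0 adapts slowly."""
    state = factors[0]
    smoothed = []
    for value in factors:
        state = alpha * state + (1.0 - alpha) * value
        smoothed.append(state)
    return smoothed


# Hypothetical scene: the active talker moves from the left side to the
# right side after five frames.
raw = [0.9] * 5 + [0.1] * 5

fast = smooth(raw, alpha=0.1)  # follows the change almost immediately
slow = smooth(raw, alpha=0.9)  # still above 0.5 five frames after the change
```

With the small `alpha`, the factor reaches the new position within a frame or two, but it would follow frame-to-frame fluctuations just as quickly. With the large `alpha`, the rendered position is still on the left side of centre five frames after the talker has moved.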
With the ever-increasing globalization of the workforce and the splitting of work teams across the globe, there is a need for improved communications. For example, participants in a teleconference may be in different and distant locations. Some participants could be in their cars, others could be in a large anechoic room or even in their living room. In fact, all participants wish to feel as if they are having a face-to-face discussion. Implementing stereo speech, or more generally stereo sound, in portable devices would be a great step in this direction.