A few hours of face-to-face meeting between parties located at different geographical locations has proven to be a very effective way of building lasting business relations, getting a project group up to speed, exchanging ideas and information, and much more. The drawback of such meetings is the large overhead of travel and possibly overnight lodging, which often makes them too expensive and cumbersome to arrange. Much would be gained if a meeting could be arranged so that each party could participate from its own geographical location, and the different parties could communicate as easily with each other as if they were all gathered together in a face-to-face meeting. This vision of telepresence has breathed new life into the research and development of video-teleconferencing systems, where great effort is being put into the development of methods for creating a perceived spatial awareness that resembles that of an actual face-to-face meeting.
One important factor of a real-life conversation is the human ability to locate participants using sound information alone. Spatial audio, which is explained in more detail below, is sound that contains the binaural cues used to locate sound sources. In a teleconference that uses spatial audio, the participants can be arranged in a virtual meeting room, where every participant's voice is perceived as originating from a specific direction. When a participant can locate the other participants in the stereo image, it is easier to focus on a particular voice and to determine who is saying what.
In a teleconference application that supports spatial audio, a conference bridge in the network is able to deliver spatialized (3D) audio rendering of a virtual meeting room to each of the participants. The spatialization enhances the perception of a face-to-face meeting and allows each participant to localize the other participants at different places in the virtual audio space rendered around him or her, which in turn makes it easier for the participant to keep track of who is saying what.
A teleconference can be created in many different ways. The conversation may be listened to through headphones or loudspeakers, using either stereo or mono signals, and the sound may likewise be captured by either a stereo or a mono microphone. A stereo microphone can be used when several participants share the same physical room and the stereo image of that room is to be conveyed to the other participants located elsewhere: the people sitting to the left are then perceived as being located to the left in the stereo image. If the microphone signal is mono, it can be transformed into a stereo signal in which the mono sound is given a placement in the stereo image; by using spatialized audio rendering of a virtual meeting room, the sound is then perceived as coming from that position.
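The placement of a mono signal in a stereo image can be sketched as follows. This is a minimal illustration using constant-power amplitude panning rather than full binaural (HRTF-based) rendering; the function name and the azimuth convention are assumptions made for the example, not taken from the document.

```python
import numpy as np

def pan_mono_to_stereo(mono, azimuth_deg):
    """Place a mono signal in the stereo image with constant-power panning.

    azimuth_deg runs from -45 (hard left) to +45 (hard right); this
    convention, like the function itself, is illustrative only.
    """
    theta = np.deg2rad(azimuth_deg + 45.0)   # map [-45, 45] to [0, 90] degrees
    left = np.cos(theta) * mono              # louder on the left for negative azimuths
    right = np.sin(theta) * mono
    return np.stack([left, right], axis=-1)

# A short 440 Hz tone panned halfway to the left.
t = np.linspace(0.0, 0.1, 800, endpoint=False)
stereo = pan_mono_to_stereo(np.sin(2 * np.pi * 440.0 * t), azimuth_deg=-22.5)
```

Because cos^2 + sin^2 = 1, the total power of the mono source is preserved regardless of the chosen position, which is why this panning law is called constant-power.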
For participants with advanced multimedia terminals the spatial rendering can be done in the terminal, while for participants with simpler terminals the rendering must be done by the conference application in the network and delivered to the end user as a coded binaural stereo signal. In that case, it would be beneficial if the standard speech decoders already available on such terminals could be used to decode the coded binaural signal.
A codec of particular interest is the so-called Algebraic Code Excited Linear Prediction (ACELP) based Adaptive Multi-Rate Wideband (AMR-WB) codec [1-2]. It is a mono codec, but it could potentially be used to code the left and right channels of the stereo signal independently of each other.
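The dual-mono arrangement just described can be sketched as below. The `encode` and `decode` callables stand in for any mono speech codec such as AMR-WB; they are parameters of this sketch, not a real codec API, and the coarse quantizer used in the example is a deliberately crude stand-in.

```python
import numpy as np

def code_stereo_dual_mono(left, right, encode, decode):
    """Code a stereo pair by applying a mono codec to each channel independently.

    Because the two channels never see each other, nothing in this scheme
    constrains the codec to preserve the inter-channel (binaural) cues.
    """
    return decode(encode(left)), decode(encode(right))

# Crude stand-in "codec": coarse uniform quantization of the samples.
encode = lambda channel: np.round(channel * 8.0)
decode = lambda levels: levels / 8.0

t = np.linspace(0.0, 0.02, 320, endpoint=False)
left_in = np.sin(2 * np.pi * 200.0 * t)
right_in = 0.5 * np.sin(2 * np.pi * 200.0 * t)   # quieter right channel (a level cue)
left_out, right_out = code_stereo_dual_mono(left_in, right_in, encode, decode)
```

The structural point is that each channel is quantized in isolation, so the quantization errors in the two channels are uncorrelated; it is exactly this independence that can corrupt the binaural cues, as discussed next.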
Listening tests of AMR-WB coded, teleconference-related stereo recordings and synthetically rendered binaural signals have shown that the codec often introduces coding artifacts that are quite disturbing and distort the spatial image of the sound signal. The problem is more severe for modes operating at low bit rates, such as 12.65 kbit/s, but is found even in modes operating at higher bit rates. When the stereo speech signal is coded with a mono speech codec, the left and right channels being coded separately, it is important that the codec preserve the binaural cues needed to locate sounds. When stereo sounds are coded in this manner, strange artifacts can sometimes be heard when listening to both channels simultaneously; when the left and right channels are played separately, the artifacts are not as disturbing. The artifacts can be described as spatial noise, because the noise is not perceived as being inside the head. It is furthermore difficult to decide where in the stereo image the spatial noise originates, which is disturbing for the listener.
More careful listening to the AMR-WB coded material has revealed that the problems mainly arise when there is a strong high-pitched vowel in the signal, or when there are two or more simultaneous vowels and the encoder has problems estimating the main pitch frequency. Further signal analysis has also revealed that the main part of the above-mentioned signal distortion lies in the low-frequency region from 0 Hz to just below the lowest pitch frequency in the signal.
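The kind of analysis described above can be sketched with a small helper that measures how much of the coding-error energy falls below the lowest pitch frequency. The function name and the synthetic test signal are assumptions made for this illustration; no such helper appears in the document.

```python
import numpy as np

def low_band_error_fraction(original, coded, fs, lowest_pitch_hz):
    """Fraction of the coding-error energy lying below the lowest pitch
    frequency, i.e. in the 0 Hz .. pitch band named in the text.
    Illustrative analysis helper only."""
    error = np.asarray(coded) - np.asarray(original)
    spectrum = np.abs(np.fft.rfft(error)) ** 2
    freqs = np.fft.rfftfreq(len(error), d=1.0 / fs)
    return spectrum[freqs < lowest_pitch_hz].sum() / spectrum.sum()

# Synthetic check: a 200 Hz "voiced" tone plus a 50 Hz disturbance added by a
# hypothetical coder; essentially all of the error energy falls below the pitch.
fs = 8000
t = np.arange(fs) / fs
original = np.sin(2 * np.pi * 200.0 * t)
coded = original + 0.1 * np.sin(2 * np.pi * 50.0 * t)
fraction = low_band_error_fraction(original, coded, fs, lowest_pitch_hz=200.0)
```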
If the AMR-WB codec is to be used as described above, it is necessary to enhance the coded signal in the low-frequency range identified above.
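Purely as an illustration of operating on that band, and emphatically not as the enhancement method this document is concerned with, a first-order high-pass filter with its cutoff at the lowest pitch frequency would attenuate the sub-pitch region where the distortion was found:

```python
import numpy as np

def suppress_sub_pitch_band(x, fs, lowest_pitch_hz):
    """One-pole high-pass filter with cutoff at the lowest pitch frequency.

    Only an illustration of attenuating the 0 Hz .. pitch band; it is not
    the enhancement method discussed in the surrounding text.
    """
    rc = 1.0 / (2.0 * np.pi * lowest_pitch_hz)
    alpha = rc / (rc + 1.0 / fs)
    y = np.empty_like(x)
    y[0] = x[0]
    for n in range(1, len(x)):
        y[n] = alpha * (y[n - 1] + x[n] - x[n - 1])
    return y

fs = 8000
t = np.arange(fs) / fs
# 50 Hz "spatial noise" underneath a 400 Hz voiced component.
x = 0.5 * np.sin(2 * np.pi * 50.0 * t) + np.sin(2 * np.pi * 400.0 * t)
y = suppress_sub_pitch_band(x, fs, lowest_pitch_hz=200.0)
```

A first-order filter has a gentle 6 dB/octave slope; with the cutoff at 200 Hz the 50 Hz component is attenuated far more than the 400 Hz component, which passes nearly unchanged.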
Voiceage Corporation has developed a frequency-selective pitch enhancement of synthesized speech [3-4]. However, listening tests have revealed that this method does not enhance the coded signals satisfactorily, as most of the distortion can still be heard. Recent signal analysis of the method has shown that it enhances only the frequency range immediately around the lowest pitch frequency and leaves the major part of the distortion, which lies in the frequency range from 0 Hz to just below the lowest pitch frequency, untouched.
In view of the above, there is a need for methods and arrangements enabling enhancement of ACELP-encoded signals to reduce the spatial noise.