Various systems are known which enable a user to send spoken audio to another user over a network, e.g. over a packet-based network such as the Internet or a private intranet, for instance in a live voice or video call such as a VoIP (voice over Internet Protocol) call. To enable this, typically each of the users installs a respective instance of a communication client application (e.g. VoIP client) on his or her respective user terminal. Alternatively, one or both of the users may use a web-hosted instance of the communication client. In some scenarios the call or other such communication session can also comprise video.
Calls and other audio or video sessions use networks that often have significant packet loss and jitter, which impair audio quality. Poor networks are the top reason why approximately 5-20% of all audio calls (depending on region) are rated poor or very poor. Loss and jitter can be mitigated and concealed but not eliminated. Previous solutions use forward error correction (FEC), audio concealment, or multi-path transmission techniques to mitigate network loss. However, significant loss can still result in unintelligible audio, which makes communication difficult or impossible. Many calls with loss have large bursts of packet loss, which make FEC ineffective. Jitter can be mitigated using a jitter buffer at the receive side. Increasing the length of the jitter buffer increases tolerance to jitter, but at the cost of increased delay. In short, all techniques for dealing with imperfect network conditions come with a limit or trade-off of one sort or another.
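The jitter-buffer trade-off described above can be illustrated with a minimal sketch. It assumes a simplified model in which a packet is discarded as late when its transit delay exceeds the smallest observed transit delay by more than the buffer depth. The function name `late_fraction` and the delay values are hypothetical and chosen only for illustration; they are not taken from any particular system.

```python
def late_fraction(transit_ms, buffer_ms):
    """Fraction of packets that miss their playout deadline.

    Assumed model: the receiver schedules playout at the minimum observed
    transit delay plus buffer_ms, so a packet is late when its extra delay
    over that minimum is greater than buffer_ms.
    """
    base = min(transit_ms)
    late = sum(1 for delay in transit_ms if delay - base > buffer_ms)
    return late / len(transit_ms)


# Hypothetical one-way transit delays (milliseconds) for ten packets.
transit_ms = [40, 42, 41, 95, 43, 40, 70, 41, 44, 120]

for depth_ms in (10, 40, 80):
    # A deeper buffer adds depth_ms of delay to every packet but discards fewer.
    print(depth_ms, late_fraction(transit_ms, depth_ms))
```

With these illustrative values, a 10 ms buffer discards three of the ten packets, a 40 ms buffer discards two, and an 80 ms buffer discards none, at the price of 80 ms of added playout delay on every packet.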
In one known alternative solution, a transmitting device captures voice information from a user, uses speech recognition to convert the voice information into text, and communicates packets encoding the voice information and the text to a receiving device at a remote location. The voice and text are sent in separate streams with different service levels. The receiving device receives and decodes the packets containing the voice and text information, outputs the voice information through a speaker, and outputs the text information on a visual display. Thus the system provides both real-time voice communications and a reliable stream of text encoding those voice communications in case of poor network conditions. In this way, communications equipment can display a substantially real-time transcript of a voice communications session for reference during the conversation, to supplement the voice communications during periods of reduced transmission quality, or to save for future reference.
Furthermore, a text-to-speech module at the receiving device is also able to detect a degradation in a quality of the packet-based voice communications session, and to output the transmitting user's voice information using speech synthesis to convert the remote text into an audio output. Thus the text-to-speech module is able to supplement poor quality voice communications with synthesized speech.
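The receive-side fallback described above might be sketched as follows. The description does not say how degradation is detected, so this sketch assumes a simple rule: switch to synthesized speech when the fraction of lost voice frames in a short sliding window reaches a threshold. The name `choose_output`, the window size, and the threshold are all hypothetical.

```python
from typing import List, Optional


def choose_output(voice_frames: List[Optional[bytes]],
                  window: int = 3,
                  loss_threshold: float = 0.5) -> List[str]:
    """Pick 'voice' or 'tts' for each frame interval.

    voice_frames holds the decoded audio for each interval, or None when the
    voice packet was lost. The text stream is assumed to arrive reliably, so
    synthesized speech is always available as a fallback.
    """
    decisions = []
    lost_history = []
    for frame in voice_frames:
        lost_history.append(frame is None)
        recent = lost_history[-window:]
        loss_rate = sum(recent) / len(recent)
        # Assumed rule: fall back to synthesis while recent loss is high.
        decisions.append("tts" if loss_rate >= loss_threshold else "voice")
    return decisions


# Hypothetical session: a burst of three lost voice frames in the middle.
ok = b"\x00"
frames = [ok, ok, None, None, None, ok, ok, ok, ok, ok]
print(choose_output(frames))
```

In this sketch the output switches to synthesized speech one frame after the burst begins and returns to the received voice once the window is again mostly intact. A single isolated loss stays below the threshold and would be left to ordinary concealment.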