Users often encounter problems with the transmission of data, in particular speech data, which are caused by network problems such as a high bit error rate (BER) or packet loss. These problems occur particularly often in wireless audio communication. As a result, the quality of communication may degrade drastically. If these errors result from problems with the whole network and not just from one particular communication channel, even redialing will not help to establish a call of better quality. The call may nevertheless be important, for instance in the case of an emergency call.
Consequently, telecommunication providers should offer a good solution to save the audio communication even in case of severe network problems.
Saving the communication with the negotiated and established codecs and/or bearer channels may not be possible due to poor bandwidth, high packet delay, too many packet losses or high BER.
According to U.S. Pat. No. 7,617,106 B2, in order to check a correct speech-to-text (STT) conversion, the converted text is converted into speech again. Both the original speech and the speech created from the text representation are then reproduced via a stereo headset. It is easy for a proof-reader (who in this case is a proof-listener) to find differences between the original and the converted speech. U.S. Pat. No. 7,697,551 B2 teaches interconnecting telephone and instant messaging (IM) via a system. This system converts IM text into speech and speech back into IM text. US 2002/123892 A1 discloses an embedded system for converting speech into text, which is presented to the user on an interface. In case of an error, the user provides a misrecognition error indication to the system. In response, the audio input along with a reference to the active language model is forwarded to a training process. According to CN 201440733 U, a sign language image is captured by a camera of a mobile communication device. A picture track is built from the images and converted into vague text information. This text is further refined by means of a grammar and word combination parameters. JP 2006005440 A teaches that, in a noisy environment, the camera of a mobile phone takes pictures of the lip movements and transmits them. At the receiver side, these pictures are displayed as moving pictures. As an alternative, only lip movement parameters are transmitted. According to US 2005/049868 A1, words or phrases are passed to a text-to-speech application. The created speech is then passed to one or more speech-to-text engines. A confidence level is assigned to the derived words or phrases.
The problem mentioned above may be solved by a method according to claim 1. Advantageous embodiments of the invention are subject-matter of the dependent claims.
According to the invention, the method of maintaining audio communication in a congested communication channel currently bearing the transmission of speech in the audio communication between a sender side and a receiver side, wherein the communication channel comprises at least one signaling channel and at least one payload channel having a (variable) quality of service, comprises the following steps: the quality of service of the payload channel is monitored, the sending of speech from the sender side over the payload channel is interrupted while at the same time at least the signaling channel of the communication channel is retained in case the quality of service of the payload channel is below a specified threshold. In other words, the method provides that the sending of audio data is stopped without dropping the communication channel, i. e. while maintaining at least the signaling portion of the communication channel. It goes without saying that this interruption of the sending of speech data (also briefly called “speech”) may be carried out maintaining the “complete” communication channel, i. e. also the payload channel thereof. Instead of transmitting the speech from the sender to the receiver side, the speech is converted to text and sent as text data to the receiver side. Unless otherwise instructed by a user or by the control center of the communication method, the speech produced on the receiver side will be converted to text and sent to the (former) sender side which is now the receiver side. In other words, after the switch-over to transmitting only text data, the speech at the respective sender side is converted to text and transmitted to the respective receiver side.
As explained, by using the method of the invention, a call can be saved even under condition of poor quality of service.
According to one aspect of the invention, the transmission of the text data occurs over the payload channel.
The present invention is based on the reasoning that the bandwidth of a congested communication channel may still be sufficient to transmit the necessary information as text data, so that audio streaming can be avoided and a channel with a low quality of service or bandwidth can still be used. The quality of service can be sensed by means of existing metrics in all types of communications. The quality of service of a voice stream in a payload channel or a real-time transport protocol (RTP) channel may be detected inter alia as follows:
1) RTP packets (which are transported in IP/UDP (User Datagram Protocol) packets) in a stream are sequentially numbered. Packet loss can be easily detected when one or more packets are missing. Packets which are out of sequence can also be detected. This may happen when IP packets take different routes to the destination.
2) Packets with a bit error are bad packets. Although RTP itself may not have a bit error detection mechanism, some encodings and RTP payload formats offer a possibility to detect bit errors (according to RFC 4867). In this context, codecs such as G.722 and AMR-type codecs may be used.
3) Packet delay and the jitter buffer set up in receivers can be used to detect poor transmission quality as well. The RTP packet interval is determined by the codec in use. Since the packet delay varies over the period of transmission, a dedicated buffer is usually used to hold a few packets and smooth out the jitter in their arrival. This buffer causes a delayed play-out of the stream. Since the person on the receiver side does not see the transmitter, a certain amount of delay (also called “lag”) is tolerated. The size of this buffer is, however, finite, and when the arrival of packets is delayed beyond the buffer size, pauses in speech will be noticed on the receiver side. A jitter buffer underrun can therefore also be an indication of bad voice quality (quality of service).
4) Analyzing the audio after stream reconstruction can also be used to detect bad audio quality. Based on abrupt audio changes it is possible to detect irregularities.
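By way of illustration only, checks 1) and 3) above may be sketched as follows. The function names are hypothetical and not part of the described method. The sequence check assumes 16-bit RTP sequence numbers with wrap-around and, as a simplification, counts duplicates as out-of-sequence packets. The jitter estimate uses the running interarrival jitter formula of RFC 3550, which is one possible metric among others; arrival times and RTP timestamps are assumed to be given in the same unit.

```python
def analyze_sequence(seq_numbers):
    """Return (lost, out_of_order) for a list of received RTP sequence numbers."""
    lost = 0
    out_of_order = 0
    expected = None
    for seq in seq_numbers:
        if expected is not None:
            # Signed distance to the expected number, tolerant of 16-bit wrap-around.
            delta = ((seq - expected + 32768) % 65536) - 32768
            if delta < 0:
                # A late packet: it was provisionally counted as lost before.
                out_of_order += 1
                lost = max(0, lost - 1)
                continue
            lost += delta  # packets skipped over are provisionally lost
        expected = (seq + 1) % 65536
    return lost, out_of_order


def interarrival_jitter(arrival_times, rtp_timestamps):
    """Running jitter estimate: J = J + (|D| - J) / 16 (as in RFC 3550)."""
    jitter = 0.0
    for i in range(1, len(arrival_times)):
        # D: difference between the receive spacing and the send spacing.
        d = (arrival_times[i] - arrival_times[i - 1]) - (
            rtp_timestamps[i] - rtp_timestamps[i - 1]
        )
        jitter += (abs(d) - jitter) / 16.0
    return jitter
```

A monitoring component could compare the resulting loss count and jitter value with specified thresholds in order to decide whether the quality of service of the payload channel is still sufficient.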
It may be advantageous for the respective receiver if the method of the invention comprises a step of converting the received text back to speech. In this case, the users involved in a telephone call or audio communication may continue their communication on an aural basis and are not forced to read the transmitted texts which were previously converted from speech to text. It is of course possible that the respective users at their end may force the system to continue to display the text transmitted by the communication channel instead of getting that text re-converted to speech.
In case the quality of service of the payload channel is continuously monitored it may be advantageous to switch back to transmit speech over the retained payload channel as soon as a sufficient quality of service has been detected in order to re-establish a “normal” audio communication or telephone call.
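The switch-over and the switch-back may, purely as an example, be controlled as sketched below. The class name, the normalized quality value between 0 and 1, the two thresholds and the required number of consecutive good measurements are assumptions made for this illustration. Using a higher threshold for switching back than for switching over is one possible design choice to avoid rapid toggling between the two modes; it is not mandated by the method.

```python
class ModeController:
    """Decides whether speech or text is sent over the retained channel."""

    def __init__(self, low=0.5, high=0.7, good_samples_needed=3):
        self.low = low        # below this value, speech sending is interrupted
        self.high = high      # at or above this value, quality counts as sufficient
        self.needed = good_samples_needed
        self.mode = "speech"
        self._good = 0        # consecutive sufficient measurements while in text mode

    def update(self, qos):
        """Feed one quality-of-service measurement and return the current mode."""
        if self.mode == "speech":
            if qos < self.low:
                self.mode = "text"
                self._good = 0
        elif qos >= self.high:
            self._good += 1
            if self._good >= self.needed:
                self.mode = "speech"
        else:
            self._good = 0
        return self.mode
```

In this sketch, a single measurement below the lower threshold triggers the transmission of text, whereas speech transmission is resumed only after several consecutive measurements at or above the upper threshold.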
It may be advantageous that an alarm message is sent to the respective receiver side, as soon as the sending of speech is interrupted and text data resulting from the speech-to-text conversion is transmitted instead. This may help the respective receiver to be better prepared for the imminent change of the current communication.
In case the current audio communication is being encrypted using a certain key and a specific algorithm, it is advantageous to use the same key and the same algorithm for encrypting also the transmitted text. In this manner, the character of the secure connection may be maintained although a change to a transmission of only text data has occurred.
According to one aspect of the present invention, only the signaling channel may be used for transmitting text. Thereby, it is possible to drop the payload channel of the current communication channel, e. g. in a case where the quality of service (transmission quality) becomes too low, or in order to save the charges for using the payload channel. In this instance, the data may be in any format such as RAW, XML or other formats. The communication partners should, however, be signaled that data other than those previously agreed/negotiated will be arriving, and which type and format of text is to be expected.
According to a further aspect of the invention, a step of detecting the language of the speech may be included in order to convert the speech into text of the appropriate language. Since the technology of speech-to-text is quite advanced, this solution may be well used for the present invention. In case the STT cannot detect the language, the language to be used should be indicated from the setup of the communication device at that end of the current communication at which the change to transmission of text instead of speech has been initiated.
In order to improve the handling, it is advantageous when the imminent change from speech transmission to text transmission by the party which is the sender at that time is negotiated with the other involved party, e. g. the receiver at that point in time. While negotiating the switch-over to text, the sender may also indicate which default language is being used for the text transmission.
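The language fallback and the negotiation described above may be illustrated by the following minimal sketch. The function names, the dictionary-based offer and answer, and the language tags are hypothetical; an actual system would carry this information in its own signaling messages. How the parties proceed after a rejected offer is left open here.

```python
def resolve_language(detected, device_default):
    """Use the detected language, or the device setup language if detection failed."""
    return detected if detected is not None else device_default


def negotiate_switch(detected, device_default, receiver_languages):
    """Build the sender's offer and the receiver's answer for a switch-over to text."""
    language = resolve_language(detected, device_default)
    # The sender announces the switch-over together with the language of the text.
    offer = {"request": "switch-to-text", "lang": language}
    # The receiver accepts only if it can interpret the indicated language.
    accepted = language in receiver_languages
    answer = {"accepted": accepted, "lang": language if accepted else None}
    return offer, answer
```

For instance, if the STT cannot detect the language and the device setup indicates German, the offer carries the German language tag, and a receiver supporting German accepts it.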
Some STT and TTS (text-to-speech) solutions allow the users to determine further parameters such as category of voice type and a pre-defined voice character which is to be used in TTS at the receiver side. The sender for instance may indicate in its text payload that the language is US English and “voice=Mike”. Some prior art TTS solutions use these pre-defined voice characters like Mike (for male persons) or Mary (for female persons). The receiver may accept such a choice or overrule by making an own choice or by using a default value.
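As a non-limiting example, a text payload carrying the language and the voice character may be built and evaluated as sketched below. The XML element name "text-payload", its attributes and the function names are assumptions chosen for this illustration; XML is merely one of the possible formats.

```python
import xml.etree.ElementTree as ET


def build_text_payload(text, language, voice=None):
    """Sender side: wrap the converted text together with language and voice hints."""
    root = ET.Element("text-payload", {"lang": language})
    if voice is not None:
        root.set("voice", voice)
    root.text = text
    return ET.tostring(root, encoding="unicode")


def parse_text_payload(payload):
    """Receiver side: extract the text and the hints from the payload."""
    root = ET.fromstring(payload)
    return {"text": root.text or "", "lang": root.get("lang"), "voice": root.get("voice")}


def choose_voice(requested, available, default, override=None):
    """Accept the sender's choice, overrule it, or fall back to a default value."""
    if override is not None:
        return override          # the receiver makes an own choice
    if requested in available:
        return requested         # the receiver accepts the sender's choice
    return default               # unknown or missing voice: use the default


payload = build_text_payload("Help is on the way", "en-US", voice="Mike")
parsed = parse_text_payload(payload)
voice = choose_voice(parsed["voice"], {"Mike", "Mary"}, default="Mary")
```

In this example, the receiver accepts the voice character “Mike” indicated by the sender because it is available in the local TTS solution.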
In order to ease the text-to-speech process on the receiver side, it may be advantageous to use a step of converting the speech at the respective sender side to a phonetic type of text.
According to a further aspect of the present invention, users may on-demand force the telecommunication system to switch over from speech transmission to text transmission by inputting a respective command. A user may for instance want to use a voice other than his or her own voice for a specific communication. Another example is the reduction of disturbing noise in the background that may be obtained by switching over to a text transmission. This works well in case the communication device is sufficiently advanced to recognize the respective user's voice and optimally convert it to text, whereupon the output will increase the clarity on the receiver side.
The problem mentioned above is solved also by a non-transitory computer-readable medium on which a respective application is stored which is able to carry out the method as described above. It goes without saying that the application has to be designed such that it may be executed on a processor of a respective communication device.
The above problem is also solved by a computer program or computer program product, for a processor of a communication device, the program being designed for carrying out a method as described above.
According to a further aspect of the present invention, the above problem may also be solved by a communication system which comprises a first communication device, a second communication device, at least one communication channel for connecting the first communication device with the second communication device and a processor for controlling the communication between the first communication device and the second communication device in a manner that a method as described above can be carried out. The first and the second communication device may be e. g. a desktop telephone, a PDA, a smart phone or a computer equipped with a microphone and connected to a telephone network.
It goes without saying that the communication system according to the present invention may comprise any of the features as described in connection with the method of the invention, and that any advantage or particularity as described above with respect to the method may be present in the system as well.
It may be advantageous that the communication system further comprises language detecting means for detecting the language of the speech and converting it to text in the appropriate language. The languages used by the two users at the sender side and the receiver side need not be the same, so that for example each user may use his/her own native language, which will then be converted into text of the same language.
In case there is no language negotiation, the party at the receiver side may ignore the message if it cannot interpret the language indicated by the sender side. Furthermore, one user may notice the lack of proper communication, which may result in silence. In this case the respective user may continue the call, terminate the call or simply report the communication problem to the other side by speaking into the microphone.
If the user at the receiver side cannot handle the TTS in general or in the current format, the respective user can ignore this fact or try to communicate the problem to the other side.
As indicated above, the switch-over to transmission of text instead of speech occurs upon sensing that the quality of service is insufficient to maintain the audio communication without alteration.
Advantageous embodiments of the present invention are shown in the drawing in an exemplary manner which is not to be construed in a restrictive way.