The invention relates to a method and apparatus for a voice communication system that obtains greater speech correlation between input and output by utilizing a speech post-processor.
In voice telecommunications and speech storage systems, segments of speech information are lost as a result of channel impairments, perturbations or imperfections. Sometimes these losses are caused by the storage media. For wireless or packet-based voice communications, these impairments or perturbations are primarily due to additive noise, interference, fading or network congestion. Digital communications in particular use source coding, which consists of speech compression algorithms whose performance relies heavily on accurate reception of the compressed information so that high-quality reproductions can be achieved at the receiver. To this end, channel coding consisting of forward error correcting (FEC) codes coupled with interleaving methods is applied. In addition to FEC, an error mitigation method is applied that consists of replaying previous good frames in place of bad frames, or of attenuation. In spite of the advances in this technology, channel disturbances frequently result in audible speech that is only partially intelligible. Customarily, the listener must mentally piece together the voice components heard in order to make sense of a sentence or phrase. If the listener cannot do so, the meaning is usually lost. The distortions of speech most frequently observed are missing speech segments or noisy, unintelligible sounds.
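The conventional error mitigation mentioned above (replaying a previous good frame in place of a bad one, with attenuation) can be illustrated by the following minimal sketch. It is not taken from any particular codec: the function name `conceal_bad_frames`, the per-frame `ok_flags` input and the `attenuation` factor are hypothetical, and combining replay with progressive attenuation is an assumption made for illustration.

```python
def conceal_bad_frames(frames, ok_flags, attenuation=0.5):
    """Replace each bad frame with the last good frame, attenuated further
    for every consecutive loss (illustrative assumption, not a codec spec)."""
    out = []
    last_good = None
    gain = 1.0
    for frame, ok in zip(frames, ok_flags):
        if ok:
            # Good frame: pass it through and reset the attenuation.
            last_good = frame
            gain = 1.0
            out.append(list(frame))
        elif last_good is None:
            # No good frame seen yet: output silence.
            out.append([0.0] * len(frame))
        else:
            # Bad frame: replay the last good frame at reduced gain.
            gain *= attenuation
            out.append([sample * gain for sample in last_good])
    return out
```

Such a scheme only repeats or fades what was already received; it carries no knowledge of the language being spoken, which is the limitation the speech post-processor described below is meant to address.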
This invention is a method and apparatus for voice communication in which the receiver of the system includes a novel language-dependent speech post-processor which seeks to correct for many of the speech distortions caused by channel errors.
This invention seeks to post-process speech information that was digitally transmitted and might have been corrupted by channel impairments. In the short term, the system is very often unable to recover the lost or corrupted information by the standard processing method of error control coding. Moreover, these channel-error-induced disturbances are very often not well mitigated by the known error mitigation techniques that are applied to the decompressed speech on the receiver side.
In the situations described above, the present invention recovers speech information by applying a novel speech post-processor treatment to the speech that the receiver would otherwise have delivered to the speaker. The speech post-processor treatment uses a novel interpolation between signal segments corresponding to the phonemes of a selected sequence that contains unrecognized phonemes, and employs a most-likely-sequence estimation technique, implemented by the Viterbi algorithm, for preselected speech sequences. The method and apparatus operate via the speech post-processor to develop the most likely sequence estimation for the selected sequence in which phonemes were unrecognized, and substitute the estimations, appropriately modified to conform with the speaker's voice characteristics, for the unrecognized phonemes in the input sequence. In this manner, the invention reconstructs the selected sequence to account for the phonemes that were lost or degraded due to channel impairments. The end result is that speech quality is enhanced compared with the case where the voice signals receive no speech post-processing.
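The most likely sequence estimation described above can be sketched as a Viterbi search over a small phoneme inventory. This is a simplified illustration, not the patented implementation: the function name `viterbi_fill`, the use of `None` to mark an unrecognized phoneme, the start and transition probability tables and the `match_p` confidence are all assumptions. The sketch covers only the sequence estimation step, and omits the interpolation between signal segments and the adaptation to the speaker's voice characteristics.

```python
import math


def viterbi_fill(observed, phonemes, start_p, trans_p, match_p=0.9):
    """Return the most likely phoneme sequence for `observed`, where None
    marks an unrecognized phoneme. Assumes at least two phonemes in the
    inventory and a non-empty observation sequence."""

    def emit(state, obs):
        if obs is None:
            return 1.0 / len(phonemes)  # no evidence: all states equally likely
        if state == obs:
            return match_p              # recognizer output trusted with match_p
        return (1.0 - match_p) / (len(phonemes) - 1)

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # scores[s] = best log-probability of any path ending in state s.
    scores = {s: log(start_p[s]) + log(emit(s, observed[0])) for s in phonemes}
    back = []  # one dict of back-pointers per trellis step
    for obs in observed[1:]:
        new_scores, pointers = {}, {}
        for s in phonemes:
            prev = max(phonemes, key=lambda r: scores[r] + log(trans_p[r][s]))
            new_scores[s] = scores[prev] + log(trans_p[prev][s]) + log(emit(s, obs))
            pointers[s] = prev
        scores = new_scores
        back.append(pointers)

    # Trace back from the best final state.
    state = max(phonemes, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With a toy three-phoneme inventory in which "k" is usually followed by "ae" and "ae" by "t", the observation `["k", None, "t"]` is resolved to `["k", "ae", "t"]`. In a language-dependent post-processor, the transition table would presumably be derived from the statistics of the target language.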
In a particular embodiment of the invention, a telecommunication system and method having a transmitter and receiver, for individual devices, are provided with a speech post-processor connected as the final element before conversion of the speech to aural form and delivery of the speech to a listener. The speech post-processor processes speech signals in digital form, and obtains the most likely estimation of a speech sequence that contains unrecognized phonemes. The speech post-processor has a recognizer and parser that receives speech signals and parses them into corresponding phonemes or unrecognized phonemes. Speech sequences of preselected duration are selected, and processed through an execution trellis implemented by a Viterbi algorithm to obtain a most likely sequence estimation for sequences that contain unrecognized phonemes. Only speech sequences with unrecognized phonemes are directed to the execution trellis. Following processing, the speech sequences may be recombined in time order, or directed to D/A conversion and output to a listener via a conventional device, e.g., a speaker.
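The routing described in this embodiment can be sketched as follows: the parsed phoneme stream is cut into sequences of a preselected length, only sequences containing an unrecognized phoneme are passed to the estimator, and the results are recombined in time order. The names `post_process`, `seq_len` and `estimate` are hypothetical, `None` is assumed to mark an unrecognized phoneme, and `estimate` stands in for the execution trellis; recognition, parsing and D/A conversion are outside the sketch.

```python
def post_process(phoneme_stream, seq_len, estimate):
    """Split the stream into sequences of seq_len phonemes, send only the
    sequences that contain an unrecognized phoneme (None) to `estimate`,
    and recombine everything in the original time order."""
    out = []
    for start in range(0, len(phoneme_stream), seq_len):
        seq = phoneme_stream[start:start + seq_len]
        if any(p is None for p in seq):
            # Only damaged sequences reach the (hypothetical) trellis.
            seq = estimate(seq)
        out.extend(seq)
    return out
```

Sequences that were fully recognized bypass the estimator unchanged, so the cost of the trellis search is paid only where the channel actually damaged the speech.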