The present invention relates generally to the field of symbol reconstruction and in particular to a new and useful method for accurately reconstructing missing portions of a sequence of symbols. The invention is particularly useful for reconstructing portions of oral conversations which are made unintelligible to a listener by surrounding ambient noise, and sudden, loud “staccato” noise, among other noise sources.
Most forms of communication rely upon transmission of groups of discrete elements arranged in a manner which is understood by both the transmitting party and the receiving party. Accurate communication between the transmitter and the receiver depends on the message formed by the groups of discrete elements, or symbols, being transmitted uncorrupted and intact between the two parties.
A corrupted stream of symbols, or discrete elements, belonging to a communication system having a known structure and known probabilities can sometimes still accurately convey a message to a person capable of reconstructing the stream without the corruption. That is, the communication system has known characteristics, or language parameters. For example, when a non-native speaker of a language attempts to say something to a native speaker, the native speaker can often determine the meaning even though the message is not spoken the way the native speaker would speak it, because the native speaker can apply known language parameters. Similarly, when two persons communicating in the same language over a telephone have their conversation interrupted by noises, their knowledge of the language parameters of their speech sometimes allows them to “fill in” or reconstruct missing sections of the conversation and understand the intended message despite the corrupting noises.
Confidence windowing is the basis for many known reconstruction methods and employs the probability of an unknown phoneme conditioned on its relationship with other symbols in the same communication stream. Confidence windows are discussed in greater detail in Christopher W. Scoville, “Spatially Dependent Probabilistic Events”, Master's Thesis, RPI, Troy, N.Y. 1998.
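The idea of conditioning an unknown symbol on its known neighbors can be illustrated with a toy example. The sketch below is not the confidence-window formulation of the cited thesis; it assumes a simple bigram model over characters, and the names `train_bigrams`, `fill_gap`, and the three-phrase corpus are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigrams(sequences):
    """Count how often each symbol is followed by each other symbol."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for left, right in zip(seq, seq[1:]):
            counts[left][right] += 1
    return counts

def fill_gap(counts, left, right, alphabet):
    """Return the symbol x maximizing P(x | left) * P(right | x)."""
    def cond(a, b):
        total = sum(counts[a].values())
        return counts[a][b] / total if total else 0.0
    return max(alphabet, key=lambda x: cond(left, x) * cond(x, right))

# Illustrative training data (an assumption, not from the source).
corpus = ["the cat", "the hat", "then that"]
counts = train_bigrams(corpus)
alphabet = sorted({ch for s in corpus for ch in s})

# Reconstruct the missing middle symbol of "t?e".
guess = fill_gap(counts, "t", "e", alphabet)
```

Here the known symbols on either side of the gap jointly determine the most probable replacement; a wider window of known symbols would simply add more conditioning terms.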
However, many times when a communication is corrupted, or damaged, by external noises, the message cannot be easily ascertained, even when both parties know the general language of communication or when other symbols in the communication are known. For example, if the outdoor performance of a symphony playing a new composition for an audience is corrupted by external noises like wind, traffic, etc., the audience will not likely be able to accurately determine what specific notes should have been heard. And, as well, when a communication is transmitted for reception by a large group of receivers, like a group of attendees at a seminar, some of the receivers of a corrupted portion of the communication may be able to reconstruct the intended message, while others cannot. This is due in part to a lack of knowledge of language rules that can be applied to the communication in these instances.
Many types of communication require accurate transmission and reception of uncorrupted messages. Computer voice recognition, for example, relies upon accurate speech communications from a person using the voice recognition. External interference with the transmission of a voice command to a computer can corrupt the command and result in no action or the wrong action being taken because the voice recognition capability cannot accurately reconstruct the command. Recognition alone is not sufficient in such cases; reconstruction is a further step beyond recognition.
There are many other instances where it is advantageous to be able to reconstruct a corrupted message quickly and accurately. Speech reconstruction in particular is of great interest and has a wide range of applications, including interaction or communication with a computerized entity, law enforcement interception of communications relating to illegal activities, and assistance to persons with deficient hearing.
Different methods for recognizing sequences of symbols, such as speech recognition, are found in the prior art. As shown by their prevalence in modeling speech for recognition, hidden Markov models (HMMs) are a preferred modeling tool for this application.
Several patents disclose word recognition using hidden Markov models (HMM), including U.S. Pat. No. 5,608,840, which discloses a method and apparatus for pattern recognition using a hidden Markov model. HMMs are developed from signal samples for use in the recognition system. The HMM equations are weighted to reflect different state transition probabilities.
U.S. Pat. No. 5,794,198 teaches a speech recognition technique which reduces the necessary number of HMM parameters by tying similar parameters of distributions in each dimension across different HMMs or states.
Other patents disclosing speech recognition using HMMs include U.S. Pat. No. 5,822,731, U.S. Pat. No. 5,903,865 and U.S. Pat. No. 5,937,384. However, none of these three or the other patents teaching speech recognition disclose speech reconstruction. If a portion of received speech is not recognizable, the prior systems cannot determine the missing speech.
The ability to extrapolate and accurately replace missing pieces from a stream of symbols is what distinguishes reconstruction from recognition. Recognition assumes perfect or near-perfect communications, with no missing pieces. Recognition is effectively a conversion of a complete, uncorrupted communication from one medium to another, such as voice to computer text. Reconstruction may include recognition for determining surrounding states, but is a further step beyond recognition. Reconstruction is a process of determining missing pieces of a communication and replacing each missing piece with the correct piece, or symbol, in the communication.
Hidden Markov models have been used by researchers in many speech processing applications such as automatic speech recognition, speaker verification, and language identification. An HMM is a doubly stochastic process where the underlying stochastic process for the model, usually described by a stochastic finite-state automaton, is not directly observable. The underlying stochastic process is only observed through a sequence of observed symbols, hence the term “hidden” Markov model.
A characteristic of the HMM is that the probability of time spent in a particular state, called “state occupancy”, is geometrically distributed. The geometric distribution, however, becomes a serious limitation and results in inaccurate modeling when the HMMs are used for phoneme recognition, which is essential to speech recognition.
The output of an HMM for each discrete time depends on the observation probability distribution of the current state. A discrete observation hidden Markov model, where the number of possible observation symbols is finite, can be completely described by a) the transition probability matrix describing the probability of transition between states of the finite-state automaton, b) the observation probability matrix describing the probability distribution of the observation symbols given the current state, and c) the probability of being in a particular state at zero time.
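The three components a), b) and c) can be written down directly. The following is a minimal sketch, assuming a hypothetical two-state, three-symbol model; the names `A`, `B`, `PI`, and `sample_hmm`, and all numeric values, are illustrative and do not come from the cited references.

```python
import random

A = [[0.7, 0.3],
     [0.4, 0.6]]          # a) A[i][j] = P(next state j | current state i)
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]     # b) B[i][k] = P(observed symbol k | state i)
PI = [0.6, 0.4]           # c) PI[i] = P(state i at time zero)

def sample_hmm(A, B, pi, length, rng):
    """Draw a hidden state sequence and the symbols it emits."""
    states, symbols = [], []
    state = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(length):
        states.append(state)
        # The emitted symbol depends only on the current state.
        symbols.append(rng.choices(range(len(B[state])), weights=B[state])[0])
        # The next state depends only on the current state.
        state = rng.choices(range(len(A[state])), weights=A[state])[0]
    return states, symbols

states, symbols = sample_hmm(A, B, PI, 10, random.Random(0))
```

Only `symbols` would be visible to an observer; `states` is the hidden part of the model.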
Thus, the HMM output signal for each clock period depends on the observation probability distribution for the current state. With each clock pulse, a state transition is made depending on the state transition probability matrix. If transitions to the same state are allowed, then the state occupancy duration for a particular state is a random variable with a geometric probability distribution.
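The geometric occupancy follows from the self-transition probability alone: remaining in a state for exactly d clock periods requires d - 1 self-transitions followed by one exit. A short sketch, assuming an illustrative self-transition probability of 0.8 (the function name `geometric_occupancy_pmf` is hypothetical):

```python
def geometric_occupancy_pmf(self_loop_prob, max_duration):
    """P(occupy the state for exactly d periods), d = 1..max_duration."""
    return [self_loop_prob ** (d - 1) * (1.0 - self_loop_prob)
            for d in range(1, max_duration + 1)]

pmf = geometric_occupancy_pmf(0.8, 200)

# Expected occupancy, which approaches 1 / (1 - self_loop_prob).
mean_duration = sum(d * p for d, p in enumerate(pmf, start=1))
```

Whatever the self-transition probability, the most probable duration under this distribution is always a single clock period, and the probability decays monotonically for longer durations.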
A semi-Markov model (SMM) belongs to a more general class than Markov chains, in which the state occupancy can be explicitly modeled by an arbitrary probability mass distribution. Semi-Markov models avoid the unrealistic implicit modeling of the state occupancy by replacing the underlying strictly Markov chain with a semi-Markov chain to explicitly model the state occupancy. As a result, semi-Markov chains do not necessarily satisfy the Markov property. While knowledge of the current state is sufficient to determine the future states in a Markov chain, in a semi-Markov chain the future also depends on the past up to the last state change. Since the state occupancy durations are explicitly modeled, transition to the same state is not allowed. Although the semi-Markov model does not satisfy the strict Markov property, it retains many of the main properties of Markov chains.
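A semi-Markov chain of this kind can be sketched by drawing an explicit duration each time a state is entered. The example below assumes a hypothetical three-state chain; `TRANSITIONS`, `DURATIONS`, and `sample_semi_markov` are illustrative names, and the probabilities are arbitrary.

```python
import random

# Self-transitions have zero probability (zero diagonal).
TRANSITIONS = [[0.0, 0.5, 0.5],
               [0.7, 0.0, 0.3],
               [0.4, 0.6, 0.0]]

# Explicit occupancy distribution per state: {duration: probability}.
DURATIONS = [{2: 0.2, 3: 0.6, 4: 0.2},
             {1: 0.5, 5: 0.5},
             {6: 1.0}]

def sample_semi_markov(transitions, durations, start, n_segments, rng):
    """Return a state path made of n_segments constant-state runs."""
    path = []
    state = start
    for _ in range(n_segments):
        lengths = list(durations[state].keys())
        weights = list(durations[state].values())
        # The duration is drawn once, on entry to the state.
        d = rng.choices(lengths, weights=weights)[0]
        path.extend([state] * d)
        # The next state is always different from the current one.
        state = rng.choices(range(len(transitions)),
                            weights=transitions[state])[0]
    return path

path = sample_semi_markov(TRANSITIONS, DURATIONS, 0, 8, random.Random(1))
```

Unlike the geometric case, the occupancy distributions here can take any shape, such as the bimodal distribution assigned to state 1.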
Thus, there are drawbacks to using both HMMs and SMMs when reconstructing sequences of symbols, such as phonemes in a spoken communication.
A modification of the hidden Markov model, called a hidden semi-Markov model (HSMM), provides increased modeling accuracy over both SMMs and HMMs. The complete formulation of the HSMM and its training algorithms allow the HSMM to be used for any application currently modeled by an HMM by making appropriate modifications. Algorithms such as the forward-backward procedure, the Baum-Welch reestimation formula and the Viterbi algorithm can all be modified for use with an HSMM.
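For reference, the unmodified Viterbi algorithm for an ordinary discrete HMM can be sketched as follows. This is the standard textbook form only; the HSMM modification, which would additionally account for explicit state durations, is not shown. The model values and the name `viterbi` are illustrative assumptions.

```python
def viterbi(A, B, pi, observations):
    """Return the most probable hidden state path for the observations."""
    n = len(pi)
    # delta[i]: probability of the best path that ends in state i.
    delta = [pi[i] * B[i][observations[0]] for i in range(n)]
    backpointers = []
    for obs in observations[1:]:
        prev = delta
        step, delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] * A[i][j])
            step.append(best_i)
            delta.append(prev[best_i] * A[best_i][j] * B[j][obs])
        backpointers.append(step)
    # Trace the best path backward from the most probable final state.
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for step in reversed(backpointers):
        state = step[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical two-state, two-symbol model.
A = [[0.8, 0.2], [0.2, 0.8]]
B = [[0.9, 0.1], [0.1, 0.9]]
PI = [0.5, 0.5]
best_path = viterbi(A, B, PI, [0, 0, 1, 1])
```

For long sequences a practical implementation would work with log probabilities to avoid numerical underflow.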
It should be noted that hidden semi-Markov models are different from hidden Markov models. HSMMs add a computational layer of complexity over HMMs which can increase the time to solve the equations and provide results.
Techniques have been developed at Rensselaer Polytechnic Institute to decrease the computational load while maintaining the desirable modeling characteristics of HSMMs. See N. Ratnayake, “Phoneme Recognition Using a New Version of the Hidden Markov Model”, PhD Thesis, RPI, Troy, N.Y. 1992. Although these techniques are useful, further simplification while maintaining the accuracy of the HSMM is needed to improve it as a symbol sequence reconstruction method.
A method and system for reconstructing sequences of symbols using language parameters and a statistical assessment of the effects of known symbols on unknown symbols are needed to improve symbol sequence reconstruction accuracy.