Conventional speech recognition systems are highly complex and operate by matching an acoustic signature of an utterance against acoustic signatures of words stored in a language model. As an example, in a conventional speech recognition process, a resource such as a microphone converts a received acoustic signal into an electrical signal. Typically, an A/D (analog-to-digital) converter then converts the electrical signal into a digital representation, and a digital signal processor converts the digitized signal from the time domain to the frequency domain.
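The A/D step can be pictured as sampling the electrical signal at fixed instants and rounding each sample to one of a finite set of integer codes. The sketch below is a simplified model under stated assumptions: the hypothetical `signal` argument stands in for the microphone output as a function of time normalized to [-1.0, 1.0], and the sample rate and bit depth are illustrative parameters, not values taken from any particular system.

```python
import math

def sample_and_quantize(signal, sample_rate_hz, duration_s, bits=16):
    """Simplified A/D model: sample `signal` uniformly and quantize to signed ints.

    `signal` is a hypothetical stand-in for the microphone's electrical output,
    given as a function from time (seconds) to a voltage normalized to [-1.0, 1.0].
    """
    num_samples = int(round(sample_rate_hz * duration_s))
    max_code = 2 ** (bits - 1) - 1  # e.g. 32767 for 16-bit samples
    samples = []
    for n in range(num_samples):
        t = n / sample_rate_hz                   # sampling instant
        v = max(-1.0, min(1.0, signal(t)))       # clip to the converter's input range
        samples.append(round(v * max_code))      # uniform quantization to an integer code
    return samples

# Usage: 10 ms of a 440 Hz tone at half amplitude, sampled at 8 kHz.
tone = lambda t: 0.5 * math.sin(2 * math.pi * 440 * t)
pcm = sample_and_quantize(tone, 8000, 0.01)
print(len(pcm))  # 80
```

Real converters measure a physical voltage rather than evaluating a function, but the two essential operations, discrete-time sampling and amplitude quantization, are the same.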
Generally, as another part of the speech recognition process, the digital signal processor breaks down the detected utterance into its spectral components. The amplitude or intensity of the digital signal at various frequencies and temporal locations can be compared to a language model to determine the word that was uttered.
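The spectral breakdown and comparison step can be illustrated with a discrete Fourier transform over one frame of samples, followed by a nearest-template lookup. This is a toy sketch, not how a conventional recognizer scores words: production systems use statistical acoustic and language models rather than raw spectrum distances, and the function names, frame length, and template labels here are hypothetical.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive O(N^2) DFT returning the magnitude of each frequency bin of one frame."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)  # bins 0..N/2 are the unique ones for real input
    ]

def closest_template(spectrum, templates):
    """Return the label whose stored spectrum has the smallest squared distance."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(templates, key=lambda label: distance(spectrum, templates[label]))

# Usage: a 32-sample frame containing a sinusoid that completes 4 cycles.
frame = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
spectrum = magnitude_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
print(peak_bin)  # 4
```

Each bin's magnitude corresponds to the "amplitude or intensity at a given frequency"; repeating the transform over successive frames adds the temporal dimension.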
In certain cases, it is desirable to convert a received utterance spoken in a first language into text in a second language. In such an instance, a conventional two-stage process can be deployed.
For example, a first stage of the conventional two-stage process can include a speech recognition system as discussed above. More specifically, the speech recognition system in the first stage applies a speech-to-text algorithm to convert an uttered sentence into one or more sentences of text in the first language that likely represent the utterance. Thereafter, a second stage, such as a language translator stage, applies a language translation algorithm to convert the text in the first language into corresponding text in the second language. Converting an utterance spoken in the first language into text in the second language can relieve the speaker of the need to know multiple languages and to communicate in the second language.
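The two-stage structure can be expressed as a composition of two functions: a recognizer that maps audio to candidate first-language sentences, and a translator that maps each sentence to candidate second-language sentences. This is a minimal sketch assuming such interfaces; `two_stage_translate`, `toy_recognizer`, and `toy_translator` are hypothetical names, and the toy stages are lookup stand-ins rather than real recognition or translation engines.

```python
from typing import Callable, List

# Hypothetical stage interfaces; a real system would wrap a speech recognizer
# and a machine-translation engine behind functions with these shapes.
Recognizer = Callable[[bytes], List[str]]   # audio -> first-language sentences
Translator = Callable[[str], List[str]]     # first-language text -> second-language text

def two_stage_translate(audio: bytes, recognize: Recognizer,
                        translate: Translator) -> List[str]:
    """Stage 1: speech-to-text in the first language; stage 2: text translation."""
    results: List[str] = []
    for sentence in recognize(audio):        # one or more candidate transcriptions
        results.extend(translate(sentence))  # one or more translations per candidate
    return results

# Toy stand-ins for the two stages, for illustration only.
def toy_recognizer(audio: bytes) -> List[str]:
    return ["hola mundo"] if audio else []

def toy_translator(text: str) -> List[str]:
    table = {"hola mundo": ["hello world", "hi world"]}
    return table.get(text, [])

print(two_stage_translate(b"\x01\x02", toy_recognizer, toy_translator))
# ['hello world', 'hi world']
```

Because the stages only communicate through text, either one can be replaced independently, which is the main practical appeal of the two-stage design.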
As mentioned, conventional translation of an uttered sentence of words in a first language can produce many possible translations of the sentence in a second language. For example, a single uttered sentence in the first language can be converted into multiple candidate text transcriptions in the first language. Each of these candidate transcriptions can in turn be converted into one or more possible textual translations in the second language. In general, the most likely translation of the uttered sentence can be selected from among the multiple possible translations based on so-called confidence values generated for each possible translation.
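The selection step can be sketched as expanding every candidate transcription into its candidate translations and keeping the one with the highest score. The text does not say how the confidence values are combined; this sketch assumes, purely for illustration, that a translation's overall score is the product of the recognition confidence and the translation confidence. The function name and example data are hypothetical.

```python
from typing import Dict, List, Tuple

def best_translation(
    recognition_hyps: List[Tuple[str, float]],
    translations: Dict[str, List[Tuple[str, float]]],
) -> Tuple[str, float]:
    """Pick the highest-scoring second-language sentence across all hypotheses.

    Assumption: overall score = recognition confidence * translation confidence.
    """
    candidates = [
        (target, src_conf * tgt_conf)
        for source, src_conf in recognition_hyps
        for target, tgt_conf in translations.get(source, [])
    ]
    if not candidates:
        raise ValueError("no candidate translations")
    return max(candidates, key=lambda c: c[1])

hyps = [("recognize speech", 0.6), ("wreck a nice beach", 0.4)]
table = {
    "recognize speech": [("reconocer el habla", 0.9), ("reconocer discurso", 0.5)],
    "wreck a nice beach": [("destrozar una bonita playa", 0.95)],
}
print(best_translation(hyps, table)[0])  # reconocer el habla
```

In this example the single most confident translation (0.95) loses because its source transcription was less likely, giving a combined score of 0.38 versus 0.54, which shows why confidence from both stages matters.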