Speech recognition technology converts words spoken by arbitrary speakers into written text. This technology has many uses, including voice control applications such as voice dialing, call routing and appliance control, as well as searching audio recordings, data entry, document preparation and speech-to-text processing.
Speech synthesis technology produces audible human speech by artificial means, such as a speech synthesizer. A variant of this technology, called a “text-to-speech system”, converts written normal language text into audible speech. Synthesized speech can be created by concatenating sound patterns, representations of which are stored in a database. The representations are retrieved from the database and combined in different patterns to drive a speaker system that produces the audible speech. Alternatively, a speech synthesizer can incorporate a model of the vocal tract and other human voice characteristics which can be driven by the stored representations to create a completely “synthetic” speech output.
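The concatenative approach described above can be illustrated with a minimal sketch. The unit names and sample values below are hypothetical and stand in for the stored representations; a real system would store and retrieve actual waveform or parameter data.

```python
# Toy "database" of stored sound patterns (hypothetical short lists of
# waveform samples keyed by unit name).
UNIT_DB = {
    "h":  [0.0, 0.1, 0.2],
    "eh": [0.3, 0.4, 0.3],
    "l":  [0.2, 0.1, 0.2],
    "ow": [0.4, 0.5, 0.4],
}

def synthesize(units):
    """Retrieve each stored pattern and concatenate them into one signal."""
    waveform = []
    for name in units:
        waveform.extend(UNIT_DB[name])  # look up the stored representation
    return waveform

# A word rendered by combining four stored units in sequence:
signal = synthesize(["h", "eh", "l", "ow"])
```

In practice the concatenated signal would then drive a speaker system; the sketch stops at assembling the sample sequence.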
Speech recognition and speech synthesis technologies are based on underlying models of human speech. Current prior art speech synthesis and speech recognition systems are built upon one of two theoretical models or an eclectic combination of the two models. In accordance with the first or “segmental” model, speech can be produced by linearly arranging short sound segments called “phones” or “phonemes” to form spoken words or sentences. Therefore, it should be possible to exhaustively pair a particular sound segment arrangement with a corresponding chain of alphabetic letters. However, this goal has proven to be elusive; when such sound segments are stored in a database and retrieved in accordance with alphabetic chains to synthesize speech, the resulting speech is often unclear and “artificial” sounding. Similarly, breaking speech into these segments and combining them to look for a corresponding word in a database produces many incorrect words. Accordingly, other approaches statistically exploit correlations of scattered features and interspersed (nonexhaustive) acoustic segments for speech recognition and synthesis.
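The segmental model's goal of exhaustively pairing sound segment arrangements with letter chains can be sketched as a lookup table. The lexicon and phone labels below are hypothetical; the point is that an exact pairing succeeds only when the segmentation matches a stored chain, which illustrates the brittleness noted above.

```python
# Hypothetical lexicon pairing linear phone chains with written words.
LEXICON = {
    ("k", "ae", "t"): "cat",
    ("b", "ae", "t"): "bat",
}

def recognize(phones):
    """Exact lookup of a linearly segmented phone chain."""
    return LEXICON.get(tuple(phones))  # None if the chain is not paired

recognize(["k", "ae", "t"])  # a correctly segmented chain finds its word
recognize(["g", "ae", "t"])  # a mis-segmented chain pairs with nothing
```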
In accordance with the second prior art model, the “articulatory phonology” model, speech is modeled as the result of a series of ongoing and simultaneous “gestures”. Each gesture is a modification of the human vocal tract produced by specific neuro-muscular systems and is classified by the anatomical structures that together produce that gesture. These structures are the lips, tongue tip, tongue body, velum (together with the nasal cavities) and glottis. Since these gestures may have different temporal spans, the challenge for this approach has been to systematically account for their synchronization. This is typically done by defining “phases” between gestures, but the exact determination of these phases has only been achieved on an ad hoc basis. Hence, “constellation” or “molecule” metaphors are used to bundle gestures together as a basis for speech synthesis.
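The gestural representation can be sketched with a simple data structure. The field names, time units and the example "constellation" below are hypothetical; the sketch only shows gestures produced by named anatomical structures, each with its own temporal span, being bundled and checked for overlap.

```python
from dataclasses import dataclass

@dataclass
class Gesture:
    articulator: str   # e.g. lips, tongue tip, tongue body, velum, glottis
    onset: float       # start time (arbitrary units)
    duration: float    # temporal span, which may differ per gesture

    @property
    def offset(self):
        return self.onset + self.duration

def overlaps(a, b):
    """Two gestures are simultaneous when their temporal spans intersect."""
    return a.onset < b.offset and b.onset < a.offset

# A hypothetical "constellation" bundling two overlapping gestures:
constellation = [
    Gesture("lips", onset=0.0, duration=0.5),
    Gesture("velum", onset=0.2, duration=0.6),
]
```

Note that the relative timing (the onsets) is fixed by hand here, which mirrors the ad hoc determination of inter-gesture phases described above.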
None of the prior art approaches has provided a systematic and accurate model from which speech synthesis and speech recognition systems can be developed.