In a large vocabulary continuous speech recognition system, highly accurate speech recognition requires a word dictionary in which the words and phrases included in the speech are recorded, and a language model from which the appearance frequency of each word or phrase may be derived. Because of limits on both the capacity of current storage devices for holding a dictionary and the CPU performance available for calculating frequency values, it is desirable that the word dictionary and the language model be kept as small as possible.
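As an illustration of how an appearance frequency may be derived for each word, the following minimal sketch computes relative (unigram) frequencies from a small tokenized corpus. The function name `unigram_frequencies` and the sample phrases are assumptions made for this example only; they are not taken from any system described here.

```python
from collections import Counter

def unigram_frequencies(sentences):
    """Return the relative frequency of each word in a list of tokenized sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    total = sum(counts.values())
    # An empty corpus yields an empty dictionary, so no division by zero occurs.
    return {word: n / total for word, n in counts.items()}

# Hypothetical sample corpus (illustrative phrases only).
corpus = [
    ["descend", "to", "flight", "level", "one", "zero", "zero"],
    ["climb", "to", "flight", "level", "two", "zero", "zero"],
]
freqs = unigram_frequencies(corpus)
```

A table of this kind grows with the vocabulary, which is one reason the storage and CPU limits noted above favor a compact dictionary and language model.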
Moreover, enormous amounts of time, effort, and expense are required to manually construct a dictionary containing even a minimum number of words and phrases. More specifically, when a dictionary is constructed from text, it is necessary first to analyze the segmentation of words and then to assign a correct pronunciation to each segmented word. Since a pronunciation is information on how a word is read, expressed with phonetic symbols and the like, expert linguistic knowledge is in many cases necessary to assign it. Such work and expense can be a problem because previously accumulated information, such as a general dictionary, may not be useful.
Studies have been made of techniques for automatically detecting, to some extent, character strings in a text that should be recognized as words. The present invention relates to a system, method and program that automatically generate an ergodic character string that is newly recognized as the pronunciation of a word. Some prior art techniques merely support manual detection work, while others require time-intensive manual correction because the detected character strings contain many unnecessary words, even though the character strings and pronunciations may be only partially detected. None of the prior art techniques provides a language model and a statistical model that yield high voice recognition accuracy in an elevated noise and vibration environment, owing to deficiencies in the language and statistical models they propose.
Voice recognition systems are becoming more and more widely used as an alternative man-machine interface. However, in aircraft flight conditions they have found limited use because of the unique challenges presented by elevated noise levels, unique grammar rules, unique vocabulary, and/or hardware limitations, all associated with the cockpit environment. Meanwhile, command recognition and selection from address book entries are standard functions in mobile devices such as mobile phones. In automobiles, speech recognition systems are used to record, e.g., a starting point and an end point in a navigation or GPS system, with a low accuracy of 60-70% to date. These voice recognition solutions are inadequate for applications that require a high degree of accuracy for safety purposes, such as the cockpit flight environment.
Voice recognition algorithms rely upon grammar and semantics to determine the best possible text match(es) for the uttered phrase(s). Conventionally they are based on hidden Markov models, which enable recognition but require high computing time. Since embedded systems with limited computing and storage resources are often employed as the computing entities, the application of voice recognition to the cockpit environment has been further limited to date, and simplified speech recognition has resulted. Constraining the search space and saving resources come at the cost of less reliable speech recognition and/or less comfortable handling for the user, in addition to the specific limitations imposed by the cockpit environment.
The aircraft operating environment is unique in the grammar rules that are followed and the vocabulary that is used. The grammar suite is rather extensive, including “words” that represent unusual collections of characters (e.g., intersection or fix names). The same goes for the vocabulary, with specific code “words” that engender particular sequences of actions in the cockpit, are known only to professionally trained pilots, and are not available in colloquial language. Lengthening the expression to be recognized, even within colloquial language and without the complexity of the pilotage grammar and vocabulary, leads to extremely high requirements in memory and computing power. These factors make it difficult to develop a comprehensive grammar and vocabulary set for use on an aircraft, and this has represented one of several significant challenges to bringing voice recognition to the cockpit. The noise level in the cockpit in flight can rise to 6-7 times the general room noise level found on the ground, which adds to the complexity of the task. To overcome these challenges, specialized hardware and an architecture of interdisciplinary algorithms that engender voice recognition in high noise and vibration environments are required.
Others have attempted to use dynamic grammar to enhance voice recognition systems. For example, U.S. Pat. No. 6,125,341, entitled “Speech Recognition System and Method,” issued to H. F. Raud et al., discloses a speech recognition system having multiple recognition vocabularies, and a method of selecting an optimal working vocabulary used by the system. Each vocabulary is particularly suited for recognizing speech in a particular language, or with a particular accent or dialect. The system prompts a speaker for an initial spoken response, receives the initial spoken response, and compares the response to sets of possible responses in an initial speech recognition vocabulary to determine the response best matched in the initial vocabulary. A working speech recognition vocabulary is selected from a plurality of speech recognition vocabularies, based on the best matched response.
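The vocabulary selection step summarized above can be illustrated with a minimal sketch. It is not the patented implementation: text similarity from Python's `difflib.SequenceMatcher` stands in for an acoustic match score, and the function `select_working_vocabulary`, the expected responses, and the vocabulary names are all hypothetical.

```python
from difflib import SequenceMatcher

def select_working_vocabulary(response, initial_vocabulary):
    """Pick the working vocabulary tied to the best matched initial response.

    initial_vocabulary maps each expected initial response to the name of the
    working vocabulary it implies (hypothetical data for this example).
    """
    best = max(
        initial_vocabulary,
        # Text similarity is an assumption standing in for acoustic scoring.
        key=lambda candidate: SequenceMatcher(None, response, candidate).ratio(),
    )
    return initial_vocabulary[best]

initial = {"hello": "english", "bonjour": "french", "hola": "spanish"}
working = select_working_vocabulary("bonjur", initial)
```

Here the imperfect response "bonjur" is closest to "bonjour", so the vocabulary labeled "french" is selected for subsequent recognition.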
U.S. Pat. No. 6,745,165, entitled “Method and Apparatus For Recognizing From Here To Here Voice Command Structures in a Finite Grammar Speech Recognition System,” issued to J. R. Lewis et al, discloses a method and system that uses a finite state command grammar coordinated with application scripting to recognize voice command structures for performing an event from an initial location to a new location. The method involves a series of steps, including: recognizing an enabling voice command specifying the event to be performed from the initial location; determining a functional expression for the enabling voice command defined by one or more actions and objects; storing the action and object in a memory location; receiving input specifying the new location; recognizing an activating voice command for performing the event up to the new location; retrieving the stored action and object from the memory location; and performing the event from the initial location to the new location according to the retrieved action and object. Preferably, the enabling-activating command is phrased as “from here . . . to here.” The user specifies the new location with voice commands issued subsequent to the enabling command. To reduce the occurrence of unintended events, these voice commands are counted so that if they exceed a predetermined limit, the action and object content is cleared from memory.
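The sequence of steps described above (store the action and object, count the intervening commands, then perform or discard the event) can be sketched as a small state holder. This is an illustrative reading of the summary only; the class name `FromHereToHere`, its methods, and the default limit of three commands are assumptions, not details of the cited patent.

```python
class FromHereToHere:
    """Minimal sketch of an enabling/activating command pair with a command limit."""

    def __init__(self, max_intervening=3):
        self.max_intervening = max_intervening  # hypothetical predetermined limit
        self.pending = None                     # stored (action, object, start)
        self.count = 0

    def enable(self, action, obj, start):
        """Handle the enabling command ("from here"): store action and object."""
        self.pending = (action, obj, start)
        self.count = 0

    def intervening_command(self):
        """Count a command issued after enabling; clear memory past the limit."""
        if self.pending is None:
            return
        self.count += 1
        if self.count > self.max_intervening:
            self.pending = None

    def activate(self, end):
        """Handle the activating command ("to here"): return the event or None."""
        if self.pending is None:
            return None
        action, obj, start = self.pending
        self.pending = None
        return (action, obj, start, end)

session = FromHereToHere()
session.enable("select", "text", 10)
session.intervening_command()
event = session.activate(42)
```

In this sketch, exceeding the limit before the activating command leaves nothing stored, so `activate` returns `None` and no unintended event is performed.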
U.S. Pat. No. 7,010,490, entitled “Method, System, and Apparatus for Limiting Available Selections in a Speech Recognition System,” issued to L. A. Brocious et al, discloses a method and system for completing user input in a speech recognition system. The method can include a series of steps which can include receiving a user input. The user input can specify an attribute of a selection. The method can include comparing the user input with a set of selections in the speech recognition system. Also, the method can include limiting the set of selections to an available set of selections which can correspond to the received user input. The step of matching a received user spoken utterance with the selection in the available set of selections also can be included.
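The idea of limiting the set of selections before matching an utterance can be shown with a short sketch. The data layout (dictionaries with `name` and `type` keys) and the function names `limit_selections` and `match_utterance` are hypothetical, and exact string comparison stands in for speech matching.

```python
def limit_selections(selections, attribute, value):
    """Keep only the selections whose attribute equals the user-specified value."""
    return [s for s in selections if s.get(attribute) == value]

def match_utterance(utterance, available, key="name"):
    """Match a recognized utterance against the available set only."""
    for selection in available:
        if selection[key].lower() == utterance.lower():
            return selection
    return None

# Hypothetical selection set for illustration.
selections = [
    {"name": "Boston", "type": "city"},
    {"name": "Boston Logan", "type": "airport"},
    {"name": "Denver", "type": "city"},
]
available = limit_selections(selections, "type", "city")
match = match_utterance("denver", available)
```

Because the airport entry is removed before matching, an utterance can only resolve to a selection consistent with the attribute the user supplied.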
Generally, any variation in the grammar implemented in a voice recognition system is based upon previous commands or states computed within the voice recognition system. Such systems would have limited applicability in an avionics environment because the grammar in cockpit management systems is highly fragmented across specific cockpit procedural functions. In addition, all the language and statistical models for voice recognition solutions to date require voice sample training and speaker sample averaging for accurate performance. This is a further obstacle to expanding these voice solutions into the flight environment, because the unique, non-repetitive conditions of flight make such training impractical and the cost of recording the hundreds of thousands of samples needed for sufficiency is prohibitive.
The method proposed here for automated lexicon generation for use in voice recognition provides a natural and synthetic allophone character matrix and uses an ergodic Markov model to parse natural and synthetic allophones into a salient utterance. The result is a library and a dictionary that allow the voice commands of these cockpit procedural functions to be recognized from the pronunciation or phonemic syntax of allophones, rather than by translating the specific text words that engender display of the procedures available to date in operational cockpits in hard copy or on visual displays.
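For orientation, an ergodic Markov model is one in which every state can be reached from every other state, so its transition matrix has no zero entries. The following minimal sketch builds such a matrix over a handful of allophone labels and scores candidate sequences. It is not the method of this disclosure: the labels, the add-one smoothing used to keep every transition possible, and the function names `train_ergodic_transitions` and `log_score` are all assumptions for illustration.

```python
import math

def train_ergodic_transitions(sequences, symbols, smoothing=1.0):
    """Estimate a fully connected (ergodic) transition matrix over symbols.

    Smoothing gives every transition a nonzero probability (an assumption
    made here so that every state remains reachable from every other state).
    """
    counts = {a: {b: smoothing for b in symbols} for a in symbols}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1.0
    return {
        a: {b: c / sum(row.values()) for b, c in row.items()}
        for a, row in counts.items()
    }

def log_score(seq, transitions):
    """Log-probability of a symbol sequence under the transition matrix."""
    return sum(math.log(transitions[a][b]) for a, b in zip(seq, seq[1:]))

# Hypothetical allophone labels for a single command word.
symbols = ["f", "l", "ae", "p", "s"]
transitions = train_ergodic_transitions([["f", "l", "ae", "p", "s"]], symbols)
forward = log_score(["f", "l", "ae", "p", "s"], transitions)
backward = log_score(["s", "p", "ae", "l", "f"], transitions)
```

In this toy setting the observed ordering scores higher than its reversal, while the reversal still receives a finite score because no transition is ruled out.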