The primary means of communication between people is speech. Since the early 1980s, significant progress has been made to allow people to interface with machines using speech through interfaces such as speech to text engines and text to speech engines. The former converts speech to a machine (and user) readable format; the latter converts machine readable code to audio signals for people to hear.
Early speech to text engines operated on a theory of pattern matching. Generally, these machines would record utterances spoken by a person, convert them into phoneme sequences, and match these sequences to known words or phrases. For example, the audio of “cat” might produce the phoneme sequence “k ae t”, which matches the standard pronunciation of the word “cat”. Thus, the pattern matching speech recognition machine converts the audio to a machine readable version “cat.” Similarly, a text to speech engine would read the word “cat” and convert it into a sequence of phonemes, each of which has a known audio signal; when concatenated (and appropriately shaped), these signals produce the sound of “cat” (phonetically: “k ae t”). Pattern matching machines, however, are not particularly robust. Generally, pattern matching machines either operate with a high number of recognizable utterances for a limited variation of voice or operate with a broader variation of voice but a more limited number of recognizable utterances.
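The pattern matching approach described above can be illustrated with a minimal sketch. The lexicon below is hypothetical; only the entry mapping “cat” to “k ae t” comes from the text, and the function names `recognize` and `to_phonemes` are illustrative assumptions rather than part of any particular engine.

```python
# Hypothetical pronunciation lexicon. Only "cat" -> "k ae t" is taken from
# the text; the other entries are invented for illustration.
LEXICON = {
    "cat": ("k", "ae", "t"),
    "cab": ("k", "ae", "b"),
    "bat": ("b", "ae", "t"),
}

# Reverse index used for the speech to text direction.
PHONEMES_TO_WORD = {phonemes: word for word, phonemes in LEXICON.items()}


def recognize(phonemes):
    """Return the known word whose pronunciation matches exactly, else None."""
    return PHONEMES_TO_WORD.get(tuple(phonemes))


def to_phonemes(word):
    """Return the phoneme sequence for a known word, else None."""
    return LEXICON.get(word.lower())
```

Because the lookup requires an exact match, any phoneme sequence outside the lexicon is simply not recognized, which reflects the limited robustness noted above.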
More recently, speech recognition engines (sometimes generically referred to as the processor for convenience) have moved to continuous or natural language speech recognition. The focus of natural language systems is to match the utterance to a likely vocabulary and phraseology and determine how likely the sequence of language symbols would appear in speech. Generally, a natural language speech recognizer converts audio (or speech) to text in a series of processing steps. First, the audio stream is segmented into frames, which consist of short time-slices of the audio stream. Second, each frame is matched to one or more possible phonemes, or sounds as discussed above, and the processor selects the best phoneme, which generally correlates to the strongest match. Third, the processor translates the selected phonemes into words. Fourth, the processor uses a language model to determine the sentence, or sequence of words, that best matches the translated words. Finally, the sentence, or sequence of words, is normalized into a visually acceptable format of text. For example, a sequence of words that includes “nineteen dollars and thirty six cents” would be normalized to “$19.36”.
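The final normalization step can be sketched as follows. This is a simplified illustration that assumes lowercase recognizer output and handles only spoken dollar amounts; the names `words_to_int` and `normalize_currency` are hypothetical, and a production normalizer would cover many more formats (dates, times, units, and the like).

```python
import re

# Number words needed for this illustration (values below one thousand).
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}

# One or more number words, each followed by whitespace.
_WORD = "|".join(sorted([*UNITS, *TENS, "hundred"], key=len, reverse=True))
_NUMBER = rf"(?:(?:{_WORD})\s+)+"
# "<number> dollars", optionally followed by "and <number> cents".
_MONEY = re.compile(rf"\b({_NUMBER})dollars(?:\s+and\s+({_NUMBER})cents)?\b")


def words_to_int(words):
    """Convert a list of number words (e.g. ["thirty", "six"]) to an int."""
    total = 0
    for word in words:
        if word == "hundred":
            total = max(total, 1) * 100
        elif word in UNITS:
            total += UNITS[word]
        else:
            total += TENS[word]
    return total


def normalize_currency(text):
    """Rewrite spoken dollar amounts in text as "$D.CC"."""
    def replace(match):
        dollars = words_to_int(match.group(1).split())
        cents = words_to_int(match.group(2).split()) if match.group(2) else 0
        return f"${dollars}.{cents:02d}"
    return _MONEY.sub(replace, text)
```

Applied to the example above, `normalize_currency("nineteen dollars and thirty six cents")` returns `"$19.36"`.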
The statistical model used to determine the likelihood of a particular sequence of language symbols or words is generally called a language model, which is used as outlined briefly above. The language model provides a powerful statistical model to direct a word search based on predecessor words for a span of “n” words. Thus, for similar sounding utterances, the language model uses probability to select the statistically more likely word. For example, the words “see” and “sea” are pronounced substantially the same in the United States of America. Using a language model, the speech recognition engine would populate the phrase “Ships sail on the sea” correctly because the probability indicates the word “sea” is more likely to follow the earlier words “ships” and “sail” in the sentence. The mathematics behind the natural language speech recognition system is conventionally known as a hidden Markov model. The hidden Markov model is a system that predicts the next state based on the previous states in the system and the limited number of choices available. The details of the hidden Markov model are reasonably well known in the industry of speech recognition and will not be further described herein.
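The “see” versus “sea” example can be sketched with a small n-gram language model. The training corpus below is invented for illustration, the span is assumed to be n = 3, and add-one smoothing is an assumption chosen for simplicity; actual engines are trained on far larger corpora and typically use more refined smoothing.

```python
from collections import Counter

# Hypothetical toy corpus; a real language model uses a much larger one.
CORPUS = [
    "ships sail on the sea",
    "boats sail on the sea",
    "i see the ships",
    "we see the boats sail",
    "the sea is calm",
]


def train_ngrams(sentences, n):
    """Count n-grams and their (n-1)-word contexts over the sentences."""
    ngram_counts, context_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.lower().split()
        vocab.update(tokens)
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            ngram_counts[context + (tokens[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts, vocab


def probability(word, context, model):
    """P(word | context) with add-one smoothing (an assumed choice)."""
    ngram_counts, context_counts, vocab = model
    context = tuple(context)
    seen = ngram_counts[context + (word,)]
    return (seen + 1) / (context_counts[context] + len(vocab))


def pick_homophone(candidates, previous_words, model, n=3):
    """Choose the candidate most likely to follow the last n-1 words."""
    padded = ["<s>"] * (n - 1) + [w.lower() for w in previous_words]
    context = padded[len(padded) - (n - 1):]
    return max(candidates, key=lambda w: probability(w, context, model))


MODEL = train_ngrams(CORPUS, n=3)
```

With this toy corpus, `pick_homophone(["see", "sea"], ["ships", "sail", "on", "the"], MODEL)` selects “sea” because the context “on the” was followed by “sea” in training and never by “see”. A span of three words does not reach back to “ships” and “sail”; a larger n would be needed to condition on those words directly.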
Speech recognition engines using natural language may have users register with an account. More often than not, the user's device downloads the recognition application, database, and user audio profile, making the device a fat or thick client. A user audio profile supplies speaker-dependent parameters required to convert the audio signal of the user's voice into a sequence of phonemes, which are subsequently converted into a sequence of words using the combination of a phonetic dictionary (words spelled out in their phonetic representations) and a language model (expected phraseology). In some instances, the user has a thin client device where the audio is recorded (or received if not necessarily recorded) on the client and routed to a server. The server has the recognition application, database, and user audio profile that allow speech recognition to occur. The client account provides a user audio profile and language model. The audio profile is tuned to the user's voice, vocabulary, and language. The language model provides data regarding the sequence of known words in the corpus, which corpus may be generated from conversational English, medical specialties, accounting, legal, or the like. During the initial training of a natural language speech recognition engine, the engine generally digitally records the audio signal of a user dictating a number of “known” words and phrases to tune the user audio profile. The known words and phrases are designed to capture the possible range of phonemes present in the user's speech. A statistical model that maps the user's speech audio signal to phonemes is modified to match the user's specific dialect, accent, or the like. These statistical model modifications are stored in a user audio profile for future recall and use. Subsequent training of the speech recognition engine may be individualized by corrections entered by a user to transcripts when the transcribed speech is incorrect.
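The idea of tuning a user audio profile from known words and phrases can be sketched in a deliberately simplified form. The class `UserAudioProfile`, its fields, and the use of a single scalar acoustic feature per phoneme are all hypothetical assumptions made for illustration; actual engines adapt far richer statistical acoustic models.

```python
from dataclasses import dataclass, field


@dataclass
class UserAudioProfile:
    """Toy speaker-dependent profile (hypothetical structure)."""
    user_id: str
    # phoneme -> (running mean of a scalar feature, samples seen so far)
    phoneme_stats: dict = field(default_factory=dict)

    def update(self, phoneme, feature):
        """Fold one labeled observation into the running mean."""
        mean, count = self.phoneme_stats.get(phoneme, (0.0, 0))
        count += 1
        mean += (feature - mean) / count
        self.phoneme_stats[phoneme] = (mean, count)

    def closest_phoneme(self, feature):
        """Return the phoneme whose stored mean is nearest the feature."""
        return min(
            self.phoneme_stats,
            key=lambda p: abs(self.phoneme_stats[p][0] - feature),
        )


def enroll(profile, labeled_observations):
    """Initial training: the phoneme labels come from the known phrases."""
    for phoneme, feature in labeled_observations:
        profile.update(phoneme, feature)
    return profile
```

In this sketch, the same `update` method could also serve subsequent training, where a user's correction to a transcript supplies the correct phoneme label for audio that was previously misrecognized.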
As can be appreciated, setting up a natural language speech recognition engine requires individualizing the processor to the specific speaker. The user audio profile improves the accuracy of speech recognition as it optimizes the system for a user's specific dialect, pronunciations, or the like. However, the user audio profile training process can be tedious, time consuming, and cumbersome for the user. This is especially true in a technical service profession, such as, for example, healthcare services, financial services, legal services, and the like. The user audio profile for the technical service professions may require more extensive training due to the many technical terms associated with the profession that may not be common in the conventional language of the user. In part due to the initial time commitment, some service providers may elect not to use a speech recognition system because the time invested is not recovered quickly enough to justify it when less efficient alternatives are immediately available. For example, healthcare service providers (e.g., doctors) can dictate medical notes to a recording that may be subsequently transcribed. Many medical notes are dictated over telephone based systems where the microphone in the telephone handset is used to record the audio, the speaker in the telephone handset is used to replay the audio, and the touch pad is used to control features of the recording. Other mechanisms for capturing dictated audio are a desktop computer, a workstation, a laptop computer, a tablet, a smartphone, a cellular telephone, a portable audio recorder, a personal digital assistant, or the like, to name but a few exemplary devices. The recording of the dictated medical notes is transcribed into the medical file by a trained technician (e.g., a live person) and returned to the provider for correction, if any.
Thus, against this background, it is desirable to develop improved apparatuses and methods to initially train a user audio profile for a user of a natural language speech recognition system to reduce or eliminate the need for the user to invest an initial time commitment to use the natural language speech recognition system.