The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions.
1.1—Voice Recognition Nomenclature
As used herein, the terms “Voice Recognition” (VR), “Speech Recognition” (SR), “Automatic Speech Recognition” (ASR), “Computer Speech Recognition” (CSR), and “Speech To Text” (STT) are used interchangeably. Throughout this specification, wherever one of these terms occurs, any of the other terms may be substituted to obtain different embodiments.
1.2—Different Scientific Approaches
There are at least two widely used scientific approaches for implementing voice recognition today: (1) hidden Markov models and (2) neural networks. The methods and systems disclosed herein are approach-independent, and may incorporate either of the above approaches or any other underlying scientific approach used to implement voice recognition.
1.3—Evolution of Voice Recognition
Earlier versions of voice recognition software were limited to Navigation and Discrete Dictation programs. Speech recognition software used for “Navigation” is limited to commands that control an application. “Discrete Dictation” systems identify each individual word that is spoken, thus requiring the speaker to pause between words so that the computer can identify each word.
Later software uses “Continuous Dictation” systems. With continuous systems, users speak at a natural pace. When words are spoken at a natural pace, they blur together, and the acoustics of each word (that is, the way each word sounds and/or is pronounced) change depending on the preceding and subsequent words.
1.4—Some Principles of Voice Recognition
Understanding how voice recognition software works is helpful for understanding the causes of voice recognition errors and the basic problem associated with voice recognition technology.
Speech may be converted to digital text based on vocabulary models and language models as follows:
1.4.1—The Vocabulary Model (Which May Also Be Referred to as the “Vocabulary Dictionary”)
The “Vocabulary Model” is a database that stores matches between multiple samples of the acoustics of a spoken word and the digital text of that word in a pre-defined dictionary (e.g., a vocabulary dictionary).
The vocabulary model can be created by the cumulative input of all previously spoken words (word acoustics), associated with the digital text of the word, where the spoken words have been previously correctly recognized by the voice recognition software.
The vocabulary model will also include voice recognition errors that were corrected. In other words, it includes recordings of words that were previously incorrectly recognized, or that the software was not able to recognize (e.g., when the acoustics of a spoken word could not be definitively associated with any acoustic word samples in the vocabulary dictionary), and that have subsequently been corrected (e.g., the acoustics of the word, as spoken by the user, are added to the vocabulary dictionary and associated with the correct digital text of the word in the vocabulary dictionary). As a result, in the future the same word in the same context, and/or when pronounced the same way for other reasons (and therefore having the same acoustics), will be recognized.
The vocabulary model may be constructed (in whole or in part) by extracting the acoustics of spoken words in the language model (which is described below), associated with the correct digital text of each word from the language model.
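By way of a non-limiting illustration, the association between the digital text of a word and multiple acoustic samples of that word may be sketched as follows. The class name `VocabularyModel`, its methods, and the three-number acoustic feature vectors are hypothetical and are used only for illustration; an actual embodiment may represent acoustics and storage in any other manner.

```python
from collections import defaultdict


class VocabularyModel:
    """Hypothetical in-memory stand-in for the vocabulary dictionary."""

    def __init__(self):
        # digital text of a word -> list of acoustic samples of that word
        self._samples = defaultdict(list)

    def add_sample(self, text, acoustics):
        """Associate one more acoustic sample with the digital text of a word."""
        self._samples[text].append(tuple(acoustics))

    def samples_for(self, text):
        """Return every stored acoustic sample for a word (empty if unknown)."""
        return list(self._samples.get(text, []))

    def words(self):
        """Return the digital text of every word that has at least one sample."""
        return sorted(self._samples)


vocab = VocabularyModel()
vocab.add_sample("fracture", [0.12, 0.80, 0.33])  # e.g., correctly recognized earlier
vocab.add_sample("fracture", [0.15, 0.76, 0.30])  # e.g., added after an error was corrected
vocab.add_sample("femur", [0.90, 0.10, 0.42])
```

In this sketch a single word may accumulate any number of acoustic samples, which mirrors the cumulative nature of the vocabulary model described above.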
1.4.2—Language Model (Also Known as the “Language Dictionary”)
When users talk at a natural pace (continuous speech), words are blurred together and the acoustics of each word change depending on the preceding and subsequent words. The function of the language model is to choose sentences that contain the specific preceding and subsequent words that appear in the sentence being processed by the vocabulary module (which is used to identify the digital text associated with the word being recognized).
The function of the language model is to assist the vocabulary model to choose both preceding and subsequent words in a sentence, or part of a sentence, that are likely to occur in a sentence that is being processed by the vocabulary module.
The language model can be created and/or augmented by the cumulative input of the acoustics of all words previously spoken by users (e.g., the corresponding user spoken sentence and/or word acoustics and the correct digital text spelling of the words) that have been correctly recognized by the voice recognition software.
It should be noted that the language model will include previously spoken sentences in which the voice recognition software was initially not able to identify a word being spoken, where those word voice recognition errors have subsequently been corrected.
It is the purpose of the Language model that the accumulated sentences contained therein (and corresponding sentence and/or word acoustics) may be the same, or at least have the same previous and subsequent words that appear in the sentence being processed by the Vocabulary module.
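By way of a non-limiting illustration, the use of preceding and subsequent words to narrow the choice of a word may be sketched as follows. The class name `LanguageModel` and its methods are hypothetical, and this sketch stores only digital text; the sentence and word acoustics that the language model may also hold are omitted for brevity.

```python
from collections import defaultdict


class LanguageModel:
    """Hypothetical store of previously recognized (and corrected) sentences."""

    def __init__(self):
        # (previous word, subsequent word) -> middle words seen between them
        self._contexts = defaultdict(set)

    def add_sentence(self, sentence):
        """Index every word of a sentence by its two neighbouring words."""
        words = sentence.lower().rstrip(".").split()
        for prev, middle, nxt in zip(words, words[1:], words[2:]):
            self._contexts[(prev, nxt)].add(middle)

    def candidates(self, previous_word, subsequent_word):
        """Return the words known to occur between the given neighbours."""
        return sorted(self._contexts.get((previous_word, subsequent_word), set()))


lm = LanguageModel()
lm.add_sentence("The patient has a fractured femur.")
lm.add_sentence("The patient has a broken femur.")

middle_words = lm.candidates("a", "femur")
```

Here `middle_words` holds the words that previously accumulated sentences placed between “a” and “femur”, which is the kind of context the language model may supply while a word is being recognized.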
1.5—Subject Specific Speech Environment
The term “subject-specific speech” simply means that when “everybody” is talking about precisely the same subject (e.g., an industry, professional, or government job-specific function), the meaning of words becomes more clear and precise, and it is the norm that the same and similar sentences and phrases are used repetitively on a regular basis.
The subject-specific approach is the only scenario in which the speech recognition system's vocabulary dictionary can realistically contain the required words, with the same previous and subsequent words and the corresponding acoustic properties of each of the words, in the vocabulary model (i.e., vocabulary dictionary).
The subject-specific approach is the only scenario in which the speech recognition system's language model can realistically and effectively assist the vocabulary model, by having a high probability of containing sentences (and corresponding sentence/word acoustics) that include preceding and subsequent words that are likely to occur in a sentence being processed by voice recognition software utilizing the vocabulary model.
1.6—Voice Recognition Errors
Voice recognition errors occur when the acoustics of the spoken word do not definitively (that is, do not statistically definitively) match any of the stored acoustical samples (for example, when a value representing how good a match was found does not reach a particular threshold that characterizes a good match). The relevant samples and circumstances are as follows:
1.6.1—Any of the acoustical samples of the pronunciation of a word associated with the digital text of said word in the Vocabulary Dictionary.
1.6.2—As previously mentioned (see: 1.4.2 above), when users talk at a natural pace (continuous speech), words are blurred together and the acoustics of each word changes depending on the preceding and subsequent words.
The above problem is complex due to the way people speak, as follows: A person will pronounce words differently depending on the time of day, as well as in accordance with their emotional state. Also, during a single presentation or conversation, a person will pronounce precisely the same word, located in different sentences, differently.
1.6.3—Thus, in the case that a spoken word within a spoken sentence is being processed by the voice recognition software (which examines words in the Vocabulary Dictionary as described above), and said spoken word has previous and subsequent words in said spoken sentence that are also located in a sentence in the language model, the acoustic pronunciation of the “middle word” (preceded by the previous word and followed by the subsequent word), together with the digital text spelling of the word, as located in said language dictionary, is provided to the voice recognition module to aid in the examination of said spoken word.
“New words” refers to word acoustic pronunciations and associated digital text that are not contained in the Vocabulary Dictionary. In addition to new words and the issues referenced above, some causes of word voice recognition errors are:
1—Ambient background noise or mispronunciation of a word changes the acoustics of the word.
2—As mentioned above, continuous speech changes the acoustics of individual words due to effects from the preceding and subsequent words.
3—Thus, it is advantageous that the vocabulary dictionary contain multiple acoustic versions of a single word. The more acoustic versions of a word, the better. All of the acoustic versions of the word are associated with a digital text spelling of said word. The more acoustic versions of the words that are absent from the vocabulary dictionary, the higher the probability that voice recognition errors will occur.
4—Thus, it is advantageous that the language dictionary contain multiple digital text sentences stored in the language model, together with the acoustic properties of each word in the sentence—the more, the better. The fewer digital text sentences in the language model, the higher the probability that voice recognition errors will occur.
5—In the case that the language model is domain-independent, meaning that the language model is derived from (e.g., includes) sentences relating to multiple subjects (e.g., any subject), the language model is less able to effectively assist the vocabulary model to choose both preceding and subsequent words in a sentence contained in the language model, that also appears in the sentence being processed by the vocabulary module.
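By way of a non-limiting illustration, the threshold test described at the beginning of this section (a voice recognition error occurs when the best available match is not as great as a particular threshold) may be sketched as follows. The function names, the distance-based `similarity` score, the three-number feature vectors, and the threshold value of 0.8 are all assumptions made for illustration and are not taken from any particular voice recognition product.

```python
import math


def similarity(a, b):
    """Illustrative score in (0, 1]; 1.0 means identical feature vectors."""
    return 1.0 / (1.0 + math.dist(a, b))


def recognize(acoustics, vocabulary, threshold=0.8):
    """Return (text, score) for the best match, or (None, score) on an error.

    `vocabulary` maps the digital text of a word to its acoustic samples.
    """
    best_text, best_score = None, 0.0
    for text, samples in vocabulary.items():
        for sample in samples:
            score = similarity(acoustics, sample)
            if score > best_score:
                best_text, best_score = text, score
    if best_score < threshold:
        return None, best_score  # no definitive match: voice recognition error
    return best_text, best_score


vocabulary = {
    "fracture": [(0.12, 0.80, 0.33), (0.15, 0.76, 0.30)],
    "femur": [(0.90, 0.10, 0.42)],
}

known_text, known_score = recognize((0.13, 0.79, 0.33), vocabulary)
error_text, error_score = recognize((0.50, 0.50, 0.90), vocabulary)
```

In this sketch, storing more acoustic versions of a word gives a spoken word more chances to clear the threshold, which is consistent with the observation above that additional acoustic versions reduce the probability of voice recognition errors.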
1.7—Different Voice Recognition “Modes” & “Types”
1.7.1—Voice Recognition Modes
Speaker-Dependent Speaker Mode
In order to increase recognition accuracy, many voice recognition systems require the user to undergo a voice recognition training process to enable the system to “get to know” the general characteristics of how the specific user pronounces words. While there are several types of training, typically text sentences are presented to the user, and the user reads these text sentences out loud into a microphone. Of course, the more sentences and paragraphs read by the user, the bigger the sampling of how the user pronounces words, and the better the voice training that results. The problem with voice recognition training is that the level of voice recognition accuracy is limited by the amount of voice recognition training, which for commercial purposes (acceptance by the user) is usually limited to one hour or less.
In an embodiment, “Speaker-Dependent training never stops,” meaning that as the user uses the system, more of the user's input is used for training.
Speaker-Dependent Training
In an embodiment of Speaker-Dependent Training (training attuned to a single speaker's voice), every pronunciation of every word in every sentence spoken during every voice recognition session ever conducted by every user is captured, on a cumulative ongoing (post error-correction) basis, and is stored in a knowledge base. The knowledge base may be a relational database (or other database) that may be located remotely from the user (e.g., stored in “the cloud”) and that stores a recording of the acoustics and digital text associated with each word, along with subject-specific vocabularies and language dictionaries for each of a collection of specific subjects. Although a relational database (RDB) is referred to throughout this specification, any other type of database may be substituted for a relational database to obtain different embodiments.
During Voice Recognition session processing, the Voice Recognition system will access and search the cumulative central remote subject-specific Vocabulary Dictionary to determine if the acoustics of each word that is being processed is either a “known word” or a “voice recognition error”.
During the voice recognition error-correction process (described below), voice recognition errors will be corrected (using the actual voice of the speaker), and thereby the acoustics of each voice recognition error word and the associated digital text spelling of the word will be added to the cumulative central remote subject-specific RDB and the remote subject-specific Vocabulary and Language Dictionaries. Thus, the error-correction process cumulatively improves the voice recognition accuracy of “all users” on an ongoing basis.
Alternatively, in order to reduce the search processing to only one specific “speaker-dependent” user's words and sentences, the RDB containing data relating to the speaker's “user-id” and “speaker-mode” (i.e., speaker-dependent) may be used to periodically download mini vocabulary dictionaries, each containing only one speaker-dependent user's cumulative data, to the PC of each and every speaker-dependent user of the voice recognition system.
During Voice Recognition session processing for a specific speaker-dependent user, the Voice Recognition system first searches the speaker-dependent user's PC mini vocabulary dictionary to determine if the acoustics of the word being processed are a “known word”. Only in the case that the word being processed by the voice recognition system is found to be “not known” to the speaker-dependent user's PC mini vocabulary dictionary will the cumulative central remote subject-specific Vocabulary Dictionary be searched to determine if the acoustics of the word being processed are either a “known word” or a “voice recognition error”.
During the voice recognition error-correction process (described below), voice recognition errors will be corrected, and thereby the acoustics of each voice recognition error word (and the associated digital text spelling of the word) will be added to the cumulative central remote subject-specific RDB and remote subject-specific Vocabulary Dictionary. Thus, the error-correction process cumulatively improves the voice recognition accuracy of “all users” on an ongoing basis.
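By way of a non-limiting illustration, the two-stage search described above (the user's PC mini vocabulary dictionary first, and the central remote dictionary only if the word is not known locally), followed by error correction, may be sketched as follows. The function names and the string keys such as `"ak-001"` are hypothetical. In particular, looking up acoustics by an exact key is a simplifying assumption that stands in for whatever acoustic matching an actual embodiment performs.

```python
def find_word(acoustics, local_dictionary, central_dictionary):
    """Search the user's mini dictionary first, then the central dictionary.

    Both dictionaries map an acoustic key to the digital text of a word.
    Returns (text, source); text is None for a voice recognition error.
    """
    if acoustics in local_dictionary:
        return local_dictionary[acoustics], "local"
    if acoustics in central_dictionary:
        return central_dictionary[acoustics], "central"
    return None, "error"


def correct_error(acoustics, correct_text, central_dictionary):
    """Record a corrected error so the same acoustics are known in future."""
    central_dictionary[acoustics] = correct_text


local = {"ak-001": "femur"}
central = {"ak-001": "femur", "ak-002": "fracture"}

first = find_word("ak-003", local, central)   # unknown in both dictionaries
correct_error("ak-003", "tibia", central)     # error-correction step
second = find_word("ak-003", local, central)  # now known centrally
```

In this sketch the corrected word reaches the user's local mini dictionary only at the next periodic download, which is why `second` is resolved from the central dictionary.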
Speaker-Independent Speaker Mode
There are many applications, such as inputting an audio recording of one or more people talking (e.g., “any person” talking), in which voice recognition has no sampling of the speakers' voices; this is inherently less accurate than “Speaker-Specific Speech”. The only training the voice recognition system has is the preloaded (background) samples of user speech that come with the product.
Here too, even with speaker-independent speech, “User-Independent training never stops”.
Speaker-Independent Training
With speaker-independent training (training attuned to any speaker's voice), every pronunciation of every word in every sentence spoken during every voice recognition session ever conducted by each and every user is captured, on a cumulative ongoing (post error-correction) basis, and is stored in the knowledge base (e.g., a central remote subject-specific RDB and the remote subject-specific Vocabulary and Language Dictionaries).
While processing a session, during voice recognition, the voice recognition system may access and search the cumulative central remote subject-specific Vocabulary Dictionary to determine if the acoustics of each word that is being processed is either a known word (a pronunciation-of-a-word already in the knowledge base) or a voice recognition error (a pronunciation-of-a-word not in the knowledge base).
During the voice recognition error-correction process (described below), voice recognition errors are corrected (using the actual voice of the speaker), and thereby the acoustics of each voice recognition error word and the associated digital text spelling of the word are added to the cumulative central remote subject-specific RDB and the remote subject-specific vocabulary and language dictionaries. Thus, the error-correction process cumulatively improves the voice recognition accuracy of “all users” on an ongoing basis.
During the processing of a voice recognition session for a specific speaker-independent user, the cumulative central remote subject-specific vocabulary dictionary is searched to determine if the acoustics of a word being processed are either a known word or a voice recognition error.
During the voice recognition error-correction process (described below), voice recognition errors are corrected, and then the acoustics of each voice recognition error word (and the associated digital text spelling of the word) are added to the cumulative central remote subject-specific RDB and remote subject-specific vocabulary dictionary. Thus, the error-correction process cumulatively improves the voice recognition accuracy of “all users” on an ongoing basis.
1.7.2—Voice Recognition Types
Sentences & Continuous Unedited Text
There are basically two ways in which voice recognition systems are used (i.e., two “Types”).
Sentences
First, user dictation systems are provided that include a Graphical User Interface (GUI) and/or a voice command interface that enables the user, during the voice recognition session, to edit each spoken sentence with grammatical punctuation, such as a capital letter for the beginning of the sentence, commas, semicolons, and a period at the end of each sentence. In an embodiment, the minimum requirement for a sentence is a capitalized letter in the first word of the sentence and a period at the end of the sentence.
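By way of a non-limiting illustration, the minimum requirement for a sentence stated above (a capitalized letter in the first word and a period at the end) may be checked as follows. The function name is hypothetical, and this sketch deliberately ignores all other punctuation.

```python
def meets_minimum_sentence_requirement(text):
    """True if the text starts with a capital letter and ends with a period."""
    stripped = text.strip()
    return bool(stripped) and stripped[0].isupper() and stripped.endswith(".")


edited = meets_minimum_sentence_requirement("The femur is fractured.")
unedited = meets_minimum_sentence_requirement("the femur is fractured")
```

A string of lower case words with no closing period, as in the second call, does not satisfy the requirement.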
Continuous Unedited Text
A second type of voice recognition will be referred to as continuous unedited text, which refers to voice recognition systems that can capture the voice of one or more people talking, without the use of a structured text dictation system (a structured text dictation system, as described above, enables user-initiated grammatical punctuation). With this use of voice recognition, the voice recognition system captures a person or people talking on-the-fly and receives no indication of where a sentence begins, where a sentence ends (i.e., a period), or any other grammatical information. As a result, the voice recognition output for continuous unedited text is a continuing string of individual lower case text words, including voice recognition errors.
In this specification, the term “continuous unedited text” is used interchangeably with the term “continuous unedited speech”; either may be substituted for the other to obtain different embodiments.
Continuous unedited speech may be used in either the speaker-dependent mode or the speaker-independent mode.
1.8—Technologies that Improve the Performance of Voice Recognition:
1—Speech Enhancement: (Existing Technology)
Speech Enhancement technology aims to improve speech quality by using various algorithms. The objective of enhancement is to improve the intelligibility and/or overall perceptual quality of a degraded speech signal using audio signal processing techniques.
Enhancement of speech degraded by noise, or noise reduction, is a field of speech enhancement, and is used in many applications such as mobile phones, VoIP, teleconferencing systems, speech recognition, and hearing aids.
Without specific mention, and by way of inclusion, the above detailed speech enhancement technology may be included in any embodiment of this specification, such as the embodiments disclosed in the “Summary of the Invention” and “Detailed Description of the Invention” section of this specification.
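By way of a non-limiting illustration of noise reduction, a noise gate (one very simple technique, which silences any sample whose magnitude falls below a threshold) may be sketched as follows. This specification does not name a particular enhancement algorithm; the noise gate, the function name, the sample values, and the threshold of 0.05 are assumptions chosen only to keep the example short. Practical speech enhancement typically relies on more capable methods, such as spectral subtraction or Wiener filtering.

```python
def noise_gate(samples, threshold):
    """Silence every sample whose magnitude falls below `threshold`."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


# Hypothetical audio samples: low-level background noise around louder speech.
noisy = [0.02, -0.01, 0.60, -0.75, 0.03, 0.40, -0.02]
cleaned = noise_gate(noisy, threshold=0.05)
```

In this sketch the louder samples pass through unchanged, while the low-level samples are replaced with silence.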