In the last few years, Large Vocabulary Continuous Speech Recognition (LVCSR) has improved to the point where it is more or less speaker independent. For example, Apple, Google, and Microsoft have deployed speech recognition systems to transcribe voice mail, provide driving directions, act as personal assistants, and the like. The focus has been on creating systems that work acceptably for a large number of users (consumers) who are willing to tolerate some errors in exchange for the convenience of speaking rather than entering text manually via keyboards and touch screens.
Due to the requirement that systems work “well enough” for millions of users, the focus has been on raising the average accuracy across many users, rather than the accuracy for one particular user. “Well enough” in the context of the present application means an acceptable level of accuracy (typically below 100%) for a majority of the consumers. For example, well enough may in certain instances mean a word/sentence recognition accuracy of 85%, which is acceptable to 90% of the consumers. To capture 80% of the consumers, for example, an accuracy of 80% may be sufficient, and to capture 95% of the consumers, an accuracy of 97% may be required. The percentages are exemplary only and not intended to be limiting.
Consumer grade systems do not work as well for professionals, since they lack domain specific vocabulary and phraseology, and thus would either have too high an error rate or not return text with the requisite formatting (capitalization, abbreviations, symbols, and the like). In certain aspects, consumer grade systems do not have the lexicon that professionals use because those words do not exist in the corpus of material used to generate the models. Professionals usually have very high unit labor costs, and it does not make economic sense for them to fix recognition and formatting errors; they would be better off sending the whole job to an offshore transcriptionist, whose labor rate is (typically) a small fraction of the professional's. Professionals tend to have specialized phraseology and vocabulary, which is important to them but of limited or no utility to a wider audience. Thus, it would be desirable to provide speech recognition that has been customized to recognize their specific vocabulary and phraseology, without deploying it to other users, for whom it may create confusion if unexpected words or phrases appear in their recognition results.
An example of specialized vocabulary is that of a medical professional who wishes to use the terminology of the International Classification of Diseases from the World Health Organization, of which the tenth edition is currently being implemented (hereinafter ICD10 terminology), when documenting her patient consultations, so that she can be compensated for the services actually rendered. In a similar vein, a customer care agent may wish to use product or service specific language when documenting a particular customer's issues during a call, chat, or e-mail. Similarly, accountants may have a standard vocabulary associated with generally accepted accounting principles (GAAP) or the like. Other professionals who use specialized language may include lawyers (as in, for example, this patent application), mechanics working on particular car models, etc.
Previous generations of speech recognition, such as those developed by Microsoft, Nuance, SRI, BBN, and others, have had their own proprietary ways to allow users to include new words (and their pronunciations) and to extend the language modeling to reinforce recognition of particular phrases or combinations of words. An oft-repeated cycle in technology is that innovations which begin as proprietary implementations later emerge as open source distributions. Speech recognition is no exception. In the last few years we have seen the emergence of open source systems using neural nets and finite state transducers. For instance, the Kaldi open source speech recognition project can be found at kaldi-asr.org. Similarly, Carnegie Mellon University has maintained an open source recognizer, “Sphinx” (cmusphinx.sourceforge.net), for many years. These newer systems are superior to their predecessors in both public and proprietary domains, although they have been aimed mainly at researchers and not so much at commercial applications. Although the technology components have changed over the years, the basic order of operations (“phases”) in recognition remains the same: there is an initial acoustic analysis, followed by a first pass of decoding to produce a list of candidate transcriptions, followed by a second pass of rescoring to determine the best transcription.
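The three phases named above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the decoder of Kaldi, Sphinx, or any other system mentioned here: the function names, the fixed candidate list, and every score are hypothetical and chosen only to show how a second pass of rescoring can overturn the ranking produced by the first pass.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    acoustic_logp: float  # log-score assigned during first pass decoding


def acoustic_analysis(samples, frame_size=4):
    """Phase 1 (toy): cut the signal into frames and return mean energy per frame."""
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    return [sum(x * x for x in frame) / len(frame) for frame in frames]


def first_pass_decode(features):
    """Phase 2 (toy): return a fixed candidate list with made-up scores.

    A real decoder would search a decoding graph using the features.
    """
    if not features:
        return []
    return [
        Hypothesis("patient has a cute bronchitis", -10.0),
        Hypothesis("patient has acute bronchitis", -10.5),
    ]


# Hypothetical language-model log-scores, used only for this illustration.
TOY_LM_LOGP = {
    "patient has a cute bronchitis": -9.0,
    "patient has acute bronchitis": -4.0,
}


def second_pass_rescore(candidates, lm_logp, lm_weight=1.0, unseen_logp=-20.0):
    """Phase 3: add a weighted language-model score and keep the best candidate."""
    def total(h):
        return h.acoustic_logp + lm_weight * lm_logp.get(h.text, unseen_logp)
    return max(candidates, key=total)


def recognize(samples):
    """Run the three phases in order on a list of audio samples."""
    features = acoustic_analysis(samples)
    candidates = first_pass_decode(features)
    return second_pass_rescore(candidates, TOY_LM_LOGP)
```

In this sketch the first pass slightly prefers “a cute”, while the domain-aware scores in the second pass favor “acute”, so `recognize([0.1] * 16)` selects the second candidate.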
Based on the above, it is desirable to be able to combine a general purpose recognizer with a domain specific one on a user-by-user basis. The technology of the present application focuses on how to customize a recognition system which uses Finite State Transducers (FSTs) in its decoding and/or rescoring phases by combining separate FSTs, each of which handles a different recognition and/or rescoring scenario. One application is to enable an end user (or customer as described above) to leverage the benefits of both a general purpose recognition system catering to a group of similarly situated users and user specific vocabulary and phraseology, which may be of value only to that one user or to a limited number of end users.
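One possible way to picture the combination of separate FSTs is a union of two weighted acceptors, sketched below in plain Python. This is an assumption made for illustration only: the text above does not specify which combining operation is used, the dictionary-based representation stands in for a real FST library, and the helper names (`linear_acceptor`, `union`, `best_cost`), phrases, and costs are all hypothetical.

```python
import math


def linear_acceptor(phrases):
    """Build a toy weighted acceptor from {phrase: cost}; state 0 is the start."""
    arcs, finals, next_state = {0: []}, {}, 1
    for phrase, cost in phrases.items():
        state = 0
        for word in phrase.split():
            arcs[state].append((word, 0.0, next_state))
            arcs[next_state] = []
            state, next_state = next_state, next_state + 1
        finals[state] = cost  # the phrase cost is paid at its final state
    return {"start": 0, "arcs": arcs, "finals": finals}


def union(a, b):
    """Union of two acceptors: a new start state with epsilon arcs into each."""
    offset = max(a["arcs"]) + 1  # shift b's state numbers past a's
    arcs = {s: list(out) for s, out in a["arcs"].items()}
    for s, out in b["arcs"].items():
        arcs[s + offset] = [(lab, c, n + offset) for lab, c, n in out]
    finals = dict(a["finals"])
    finals.update({s + offset: c for s, c in b["finals"].items()})
    start = offset + max(b["arcs"]) + 1
    arcs[start] = [(None, 0.0, a["start"]), (None, 0.0, b["start"] + offset)]
    return {"start": start, "arcs": arcs, "finals": finals}


def epsilon_closure(fsa, costs):
    """Extend {state: cost} along epsilon (None-labelled) arcs, keeping minima."""
    stack = list(costs)
    while stack:
        s = stack.pop()
        for lab, c, n in fsa["arcs"][s]:
            if lab is None and costs[s] + c < costs.get(n, math.inf):
                costs[n] = costs[s] + c
                stack.append(n)
    return costs


def best_cost(fsa, words):
    """Lowest total cost of accepting `words`, or infinity if they are rejected."""
    costs = epsilon_closure(fsa, {fsa["start"]: 0.0})
    for word in words:
        nxt = {}
        for s, c in costs.items():
            for lab, arc_cost, n in fsa["arcs"][s]:
                if lab == word and c + arc_cost < nxt.get(n, math.inf):
                    nxt[n] = c + arc_cost
        costs = epsilon_closure(fsa, nxt)
    return min((c + fsa["finals"][s] for s, c in costs.items() if s in fsa["finals"]),
               default=math.inf)


# Hypothetical general purpose and user specific models.
general = linear_acceptor({"the patient has a cold": 2.0})
user = linear_acceptor({"acute bronchitis unspecified": 1.0})
combined = union(general, user)
```

Here `general` alone rejects the user specific phrase (its cost is infinite), while `combined` accepts phrases from both models. Because `user` is a separate object, it could be combined only for the user who needs it, leaving the general purpose model unchanged for everyone else.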