When spoken words are dictated, they may be converted into text using various software applications. Components of these software applications may take input text and may manipulate that input text. The goal of this process is to turn spoken words into a final written document with as few errors as possible. A wide variety of terminology must be recognizable. In the field of medicine, for example, doctors or other practitioners may dictate patient records. These doctors or medical practitioners may practice in a wide variety of specialties ranging from radiology to mental health.
Speech recognition, in its simplest form, identifies the elementary units of input text. These elementary units may be called tokens, which are part of a larger string of text. Typically, speech recognition systems need to be “trained” to recognize text; therefore, to properly implement a speech recognition system, it is desirable to define these tokens as accurately as possible. If the tokens are improperly defined, there may be errors in the recognition of the text, resulting in a bad “translation” of the text.
A major component in speech recognition systems is the tokenizer. Generally speaking, a tokenizer is a component in a speech recognition system that receives input text, which may be, for example, in human-readable form, and matches that input text to a particular lexicon or language model (“LM”). Tokenizers generally cannot use audio input; therefore, a tokenizer must use other means to hypothesize the tokens that were dictated to produce the output text. Thus, a tokenizer may have the ability to distinguish among separate usages of the same text. One possible problem with tokenizers is that various tokens in the output text may have any of the following characteristics: (1) different spellings from terms used in the language model (e.g., “w/o” versus “without”); (2) numeric forms (e.g., “2” versus “two”); (3) multiple different spellings in the LM (i.e., variant token forms such as “grey” versus “gray”); (4) boundaries that do not correspond to the component tokens; and (5) internal punctuation (e.g., “obstetrics/gynecology”). For example, a tokenizer may be configured to draw distinctions between various uses of the abbreviation “St.”, which may abbreviate the word “saint” on one hand or the word “street” on the other. A tokenizer may be configured to make a distinction between the two usages of the same string of text.
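The mismatches listed above can be illustrated with a minimal sketch. The variant table, the function names, and the capitalization heuristic for “St.” are all assumptions introduced here for illustration; they are not taken from any particular tokenizer or language model.

```python
# Hypothetical variant table: surface form in the text -> token in the language model.
VARIANTS = {
    "w/o": "without",   # different spelling from the LM term
    "2": "two",         # numeric form
    "grey": "gray",     # variant spelling collapsed to one LM form
}


def normalize(token):
    """Map a surface form to its assumed LM token; unknown forms pass through unchanged."""
    return VARIANTS.get(token.lower(), token)


def expand_st(tokens, i):
    """Guess what "St." at position i abbreviates, using only its neighbours.

    Assumed heuristic: a capitalized word after "St." suggests "saint";
    otherwise a capitalized word before it suggests "street".
    """
    following = tokens[i + 1] if i + 1 < len(tokens) else ""
    preceding = tokens[i - 1] if i > 0 else ""
    if following[:1].isupper():
        return "saint"
    if preceding[:1].isupper():
        return "street"
    return "St."  # no evidence either way; leave the abbreviation as written
```

Under these assumptions, `expand_st(["St.", "Mary", "Hospital"], 0)` yields "saint" and `expand_st(["12", "Main", "St."], 2)` yields "street". A string such as “obstetrics/gynecology” passes through `normalize` unchanged, which shows that internal punctuation would need separate handling.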
Development of a tokenizer, however, can be very complex and tedious. To develop a competent tokenizer, the tokenizer needs to have substantial contextual information regarding the input text. Because the given lexicon for a particular language model is finite, there is an inherent problem in creating a tokenizer that can assist in accurately identifying particular strings of characters, such as, for example, words, numbers, abbreviations, or acronyms. In everyday speech, individuals use the lexicon in such a manner that the variants of its terms are potentially infinite. That is, one problem that arises is that a finite set of tokens is used to define an open set of tokens that may appear in everyday usage.
Currently, tokenizers are rule-based programs. This means that programmers write individual code-like rules to address various usages or various combinations of a string of text. For example, a rule-based operation for a tokenizer may include a line of code that instructs the tokenizer to recognize that a three-digit number preceded by white-space and followed by a hyphen is part of a phone number or a social security number. The code may also include an instruction that if the hyphen is followed by two digits and another hyphen, the string is a social security number, whereas if the hyphen is followed by three digits and another hyphen, the string is a phone number. The complexity of a rule-based system becomes readily apparent even from this simple example. Many of these tokenizers require multiple lines of code to recognize each token. Debugging these rule-based tokenizers is extremely tedious, and updating the tokenizers or adding to the lexicon is similarly tedious.
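The phone number and social security number rule described above might be written along the following lines. This is a sketch only: the rule table, the labels, and the function name are hypothetical, and the patterns assume the U.S.-style hyphenated formats 3-2-4 digits and 3-3-4 digits.

```python
import re

# Hypothetical hand-written rules, one pattern per token class.
# (?<!\S) and (?!\S) require white-space or a string boundary on each side.
RULES = [
    ("ssn", re.compile(r"(?<!\S)\d{3}-\d{2}-\d{4}(?!\S)")),    # 3 digits, 2 digits, 4 digits
    ("phone", re.compile(r"(?<!\S)\d{3}-\d{3}-\d{4}(?!\S)")),  # 3 digits, 3 digits, 4 digits
]


def classify_numeric_spans(text):
    """Return (label, matched_text) pairs for every rule that fires, in text order."""
    found = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    found.sort()  # order hits by their position in the text
    return [(label, span) for _, label, span in found]
```

Even this small case needs two patterns and explicit boundary checks, and each additional format (parentheses around an area code, spaces instead of hyphens, extensions) would need its own rule, which illustrates why such rule sets are tedious to debug and extend.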
Modern tokenizers may simply receive input text and look at the string of characters included in the input string. The tokenizer then runs through its various rules in an attempt to classify the particular token in question. Some tokenizers may then output a variety of different possible output tokens, for example, two or three candidates. These tokenizers, however, may fail to select a “best” token or candidate, or may select the incorrect token.
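The failure mode described above can be sketched as follows. The candidate table and both function names are hypothetical; the point is only that a tokenizer which lists candidates without ranking them by context has no principled way to choose among them.

```python
# Hypothetical candidate table: one surface string, several possible output tokens.
CANDIDATES = {
    "St.": ["saint", "street"],
    "Dr.": ["doctor", "drive"],
}


def candidate_tokens(surface):
    """Return every candidate output token for a surface string, unranked."""
    return list(CANDIDATES.get(surface, [surface]))


def pick_first(surface):
    """Naive selection: take the first candidate and ignore the surrounding context."""
    return candidate_tokens(surface)[0]
```

Here `candidate_tokens("St.")` returns both "saint" and "street", while `pick_first("St.")` always returns "saint", which would be the incorrect token for a phrase such as “12 Main St.”.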
The present invention seeks to solve some of these shortcomings of prior art systems by utilizing a data-driven empirical tokenizer rather than a rule-based tokenizer. Such a data-driven empirical tokenizer can be achieved by implementing the various embodiments of the invention described herein.