1. Technical Field
This invention relates to the field of text tokenization and more particularly to a method and system for supporting customized tokenization of domain-specific text.
2. Description of the Related Art
Tokenization is the process of separating text into words, punctuation and, optionally, phrases. Tokenization can include case folding of words at the beginning of a sentence and special formatting modifications to an input string, such as is sometimes done with numbers. Tokenization plays a critical role in the building of speech recognition vocabularies. Tokenization can also be used in coordination with other components of a speech recognition system, for instance with a speech correction tool or a speech analysis tool for updating a system language model. To ensure consistency, it is essential to have one common tokenizer for all applications needing a particular type of token processing, so that every application agrees on what constitutes a word.
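By way of illustration only, the separation of text into words and punctuation, together with case folding of sentence-initial words, can be sketched as follows. The pattern `TOKEN_RE` and the function `tokenize` are hypothetical names chosen for this sketch and do not describe any particular speech recognition system.

```python
import re

# Hypothetical token pattern: a word (optionally with internal apostrophes),
# a run of digits, or a single punctuation character.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|\d+|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens, folding the case of
    the first word of each sentence."""
    tokens = []
    sentence_start = True
    for match in TOKEN_RE.finditer(text):
        tok = match.group()
        # Case-fold only alphabetic tokens that begin a sentence.
        if sentence_start and tok[0].isalpha():
            tok = tok.lower()
        tokens.append(tok)
        # A sentence-final mark means the next word starts a new sentence.
        sentence_start = tok in {".", "!", "?"}
    return tokens

print(tokenize("The dog barked. It didn't stop!"))
# ['the', 'dog', 'barked', '.', 'it', "didn't", 'stop', '!']
```

Even this small sketch must decide how to treat the apostrophe and the period, which suggests why a complete tokenizer grows considerably larger.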
The inherent difficulty associated with processing a variety of electronic text can greatly increase the complexity of the source code forming the tokenization program. Typically, several hundred lines of source code are needed to form a tokenization program able to convert written forms to spoken forms and to divide character streams at logical word boundaries. Tokenization code can become particularly complex and troublesome in view of the multiple uses for common symbols, such as the apostrophe, comma, period and numerals. Since general-purpose tokenizers cannot correctly process text in all domains, it is essential that vocabulary builders have the flexibility to customize this process. Specifically, when building a new vocabulary, it is common to make minor modifications to the general-purpose tokenizer in order to correctly tokenize domain-specific strings.
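As a hedged illustration of why a general-purpose tokenizer can fail on domain-specific strings, the following sketch compares a small set of general rules with the same rules preceded by domain-specific rules. The rule lists `GENERAL_RULES` and `MEDICAL_RULES`, the function `build_tokenizer`, and the medical example text are all assumptions made for this sketch only.

```python
import re

# Hypothetical general-purpose rules: words, digit runs, single punctuation marks.
GENERAL_RULES = [r"[A-Za-z]+(?:'[A-Za-z]+)*", r"\d+", r"[^\w\s]"]

# Hypothetical domain-specific rules for a medical vocabulary:
# decimal dosages such as "2.5" and letter-digit names such as "B12".
MEDICAL_RULES = [r"\d+\.\d+", r"[A-Z]\d+"]

def build_tokenizer(rules):
    """Return a tokenizer that tries the rules in order at each position."""
    pattern = re.compile("|".join(rules))
    def tokenize(text):
        return pattern.findall(text)
    return tokenize

general = build_tokenizer(GENERAL_RULES)
# Domain rules are listed first so that they take precedence.
medical = build_tokenizer(MEDICAL_RULES + GENERAL_RULES)

text = "Take 2.5 mg of vitamin B12."
print(general(text))
# ['Take', '2', '.', '5', 'mg', 'of', 'vitamin', 'B', '12', '.']
print(medical(text))
# ['Take', '2.5', 'mg', 'of', 'vitamin', 'B12', '.']
```

With the general rules alone, the period in "2.5" is treated as punctuation and "B12" is split in two; the customized tokenizer keeps both strings intact.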
Present systems address the need for context-specific tokenization, also referred to as domain-specific tokenization, in two ways. First, a vocabulary requiring special tokenization can be distributed without a vocabulary-specific tokenizer. The general-purpose tokenizer is then used by the speech recognition system when the vocabulary becomes active, for instance during correction. Consequently, the tokenization used in building the vocabulary can differ from the tokenization used for updating the system language model. Second, as an alternative, a vocabulary requiring special tokenization can be distributed with a vocabulary-specific tokenizer which includes general-purpose rules in addition to domain-specific rules.
Where a vocabulary requiring special tokenization is distributed without a vocabulary-specific tokenizer, inconsistencies can arise between the vocabulary and the personal language model. In the alternative case, where a vocabulary is distributed with a vocabulary-specific tokenizer, improvements or bug fixes directed toward future versions of the general-purpose tokenizer will require rebuilding and redistributing the vocabulary-specific tokenizer of the domain-specific vocabulary. Furthermore, because a vocabulary-specific tokenizer must include the general-purpose rules, external software developers building vocabularies will not be able to develop vocabulary-specific tokenizers, since the external developers lack the proprietary knowledge of the speech recognition system necessary for the development of the general-purpose tokenizer. Thus, no present system provides flexible, customized tokenization capable of processing vocabulary-specific tokenization schemes while treating all vocabularies uniformly. Accordingly, there is a long-felt need for such a system.