1. Field of the Invention
The present invention relates generally to language recognition and more particularly relates to a system and method for creating and using electronically encoded lexicons that include regular expressions for augmenting the lexicon to cover large sets of entries such as fractions, dates, and real numbers.
2. Background of the Invention
Many software applications, such as spelling verifiers or handwriting recognizers, employ lexicons (or dictionaries) in order to quickly and efficiently derive valid words or qualifying character sequences for comparative or generative purposes. In spelling verifiers and some recognizers, input data (unqualified character sequences) are compared with accepted lexicon entries and are thereby verified or invalidated. In some recognizers, a handwritten word is transcribed through a process that aligns the features of the handwritten word to the best-fitting lexicon entry. In both cases, no valid result is produced unless it is represented in the lexicon. Spelling verification software invalidates misspellings and the lexicon-restricted handwriting recognizer cannot correctly transcribe (unusual) words not found in the lexicon. For example, since the word ‘clads’ is not typically found in most lexicons, it would likely be invalidated by spelling verifiers and incorrectly transcribed (perhaps as ‘dads’) if written and machine-recognized.
Large lexicons are desired to provide complete coverage over the legitimate range of items (words, numbers, abbreviations, codes, mnemonics, etc.) that may be entered. Unfortunately, no general-purpose lexicon can even approach completeness, since the inclusion of even a reasonable subset of common fractions, dates, and real numbers, for example, would increase the lexicon size by many orders of magnitude. As a result, the word coverage of the recognizer or spelling verifier either suffers from incompleteness or must be augmented by additional, external logic that is integrated with the lexicon. Methods that depend on external logic are generally awkward and incomplete. Rule-based methods can be used for generating or evaluating character sequences, but such methods generally require that software be prepared by or with an expert before the lexicon and/or system is released for common use.
Such expert or rule-based systems are generally non-extensible, or are only extensible by highly qualified experts. Consequently, a general method for encoding the vast number of more common numbers, fractions, and dates within a lexicon is highly desirable.
A method for using regular expressions to represent (typically large) subsets of entries in a lexicon is described. Meta-characters are used to represent a class of characters. One or more meta-characters and (optionally) standard characters are used to create a regular expression string. Each such string represents or encodes a class or set of one or more words or lexicon entries. Regular expression strings are entered into the lexicon with all other lexicon word entries.
In accordance with a method for creating an encoded lexicon, a meta-character is defined as representing a set of at least two symbols. An unencoded lexicon is then read and a substitution process is performed where the meta-character is substituted into entries in the lexicon which comply with the meta-character definition. Preferably, the meta-character is a user defined entry in a meta-character definition table having a plurality of meta-characters.
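The substitution step described above can be sketched as follows. This is only an illustration under stated assumptions: the `#` meta-character, the `META_TABLE` name, and the helper functions are hypothetical and do not come from the specification. The sketch also substitutes every covered symbol unconditionally, so the encoded lexicon may accept strings (such as "9/9") that were not in the unencoded list.

```python
# Hypothetical meta-character definition table: each meta-character stands
# for a set of at least two symbols. '#' for digits is an illustrative choice.
META_TABLE = {"#": set("0123456789")}


def encode_entry(entry, meta_table=META_TABLE):
    """Replace each symbol covered by a meta-character with that meta-character."""
    out = []
    for ch in entry:
        for meta, symbols in meta_table.items():
            if ch in symbols:
                out.append(meta)
                break
        else:
            # No meta-character covers this symbol; keep it unchanged.
            out.append(ch)
    return "".join(out)


def encode_lexicon(entries, meta_table=META_TABLE):
    """Encode every entry; entries that collapse to the same string are merged."""
    return sorted({encode_entry(e, meta_table) for e in entries})


unencoded = ["cat", "1/2", "3/4", "7/8", "12/25/99", "01/01/00"]
encoded = encode_lexicon(unencoded)
```

In this toy run the six unencoded entries collapse to three encoded entries, which is the size reduction the method is aiming for.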
A method of searching an encoded lexicon encoded with a meta-character representing at least two symbols includes the steps of submitting a search string having a plurality of symbols; substituting said meta-character for the symbols it represents in the search string to generate a meta-string; and comparing the entries of the encoded lexicon for at least partial matches to the meta-string. The at least partial matches are then expandable by replacing the meta-characters in the partial matches with the symbols represented by the meta-character.
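A minimal sketch of this search procedure is given below. It assumes that an "at least partial match" means a prefix match, which the text does not specify, and all names (`META_TABLE`, `ENCODED_LEXICON`, `to_meta_string`, `search`, `expand`) are hypothetical.

```python
from itertools import product

# Hypothetical table and encoded lexicon, for illustration only.
META_TABLE = {"#": "0123456789"}
ENCODED_LEXICON = ["#/#", "##/##/##", "cat", "cats"]


def to_meta_string(text, meta_table=META_TABLE):
    """Map each symbol of the search string to its meta-character, if it has one."""
    reverse = {sym: meta for meta, syms in meta_table.items() for sym in syms}
    return "".join(reverse.get(ch, ch) for ch in text)


def search(query, lexicon=ENCODED_LEXICON):
    """Return encoded entries that the meta-string matches as a prefix (assumed)."""
    meta = to_meta_string(query)
    return [entry for entry in lexicon if entry.startswith(meta)]


def expand(entry, meta_table=META_TABLE):
    """Yield every unencoded string represented by an encoded entry."""
    choices = [meta_table.get(ch, ch) for ch in entry]
    for combo in product(*choices):
        yield "".join(combo)


partial = search("3/")            # meta-string is "#/"
fractions = list(expand("#/#"))   # all single-digit fractions
```

Here the search string "3/" becomes the meta-string "#/", which partially matches the single encoded entry "#/#"; expanding that entry yields the 100 digit/digit strings it stands for.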
In accordance with the present invention, a method for generating unencoded word hypotheses to test against submitted data uses an encoded lexicon having a tree-based structure. The tree-based encoded lexicon includes a plurality of entries which have a plurality of linked character nodes, including a first node and a terminal node. At least one of the nodes is a meta-character representing at least two other characters. The generative method includes the steps of generating first hypotheses including first node characters of all entries of the encoded lexicon. The first hypotheses are tested against the submitted data to determine whether the hypotheses are probable partial matches. For each probable branch identified, those hypotheses are refined by adding characters from subsequent linked character nodes. The refined hypotheses are then tested against the submitted data to determine whether the refined hypotheses are probable partial matches. For each further probable branch which is identified where the current node is not a terminal node, the steps of refining and testing the hypotheses are repeated. For each further probable branch reaching a terminal node where the current refined hypothesis is a probable match to the submitted data, the refined hypothesis is provided as an output indicative of a probable match.
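The generate, test, and refine loop described above might look like the following sketch. It is not the specification's implementation: the nested-dictionary trie, the `END` terminal marker, and especially the mismatch-count test in `is_probable` are assumptions standing in for whatever scoring a real recognizer would apply to the submitted data.

```python
# Hypothetical meta-character table and terminal marker, for illustration.
META_TABLE = {"#": "0123456789"}
END = "$"


def build_trie(entries):
    """Store encoded entries as linked character nodes (nested dictionaries)."""
    root = {}
    for entry in entries:
        node = root
        for ch in entry:
            node = node.setdefault(ch, {})
        node[END] = {}  # marks a terminal node
    return root


def is_probable(hypothesis, data, max_errors):
    """Toy stand-in for a recognizer score: count position-wise mismatches."""
    if len(hypothesis) > len(data):
        return False
    errors = sum(1 for a, b in zip(hypothesis, data) if a != b)
    return errors <= max_errors


def generate(trie, data, max_errors=1):
    """Grow unencoded hypotheses node by node, pruning improbable branches."""
    results = []
    stack = [(trie, "")]
    while stack:
        node, hypothesis = stack.pop()
        for ch, child in node.items():
            if ch == END:
                # Terminal node: output only hypotheses covering all the data.
                if len(hypothesis) == len(data):
                    results.append(hypothesis)
                continue
            # A meta-character node contributes every symbol it represents.
            for symbol in META_TABLE.get(ch, ch):
                refined = hypothesis + symbol
                if is_probable(refined, data, max_errors):
                    stack.append((child, refined))
    return sorted(results)


trie = build_trie(["#/#", "cat", "dads"])
exact = generate(trie, "3/4", max_errors=0)
noisy = generate(trie, "dods", max_errors=1)
```

Because improbable branches are dropped as soon as they fail the test, most of the expansions implied by a meta-character node are never enumerated.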
In accordance with one form of the present invention, a computer-based language verification system using a lexicon is formed with a first data file having an unencoded lexicon and a second data file having a meta-character definition table. The meta-character definition table is a many-to-one mapping between a meta-character and a plurality of language characters. The system includes a computer processor which is coupled to the first and second data files and generates a third data file therefrom, which is an encoded lexicon. The processor generates the encoded lexicon by reading entries from the unencoded lexicon data file and substituting meta-characters from the second data file into the lexicon entries in accordance with the meta-character definition table.
Preferably, the computer-based language verification system includes an input device coupled to the processor for submitting a language string for verification. The processor receives the language string, performs meta-character substitution on the string in accordance with the meta-character definition table to generate a meta-string, searches the third data file for entries which at least partially match the meta-string, and expands the at least partially matching entries in accordance with the meta-character definition table.
In a further embodiment of the computer-based language verification system, the input device includes a pen-based interface for receiving handwriting input strings. Alternatively, the input device includes a spelling verification program operating on a text document. In yet another alternate embodiment, the input device includes a speech recognition system having an audio input system and a speech processor to form the language string from spoken utterances.
Preferably, the second data file is a user-extensible data structure which has a user-friendly syntax. Meta-characters can be defined to represent a class of multi-character words or strings. In addition, meta-characters can be nested, i.e., one meta-character definition can include a previously defined meta-character.
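As an illustration of nested and multi-character definitions, the sketch below parses a small definition table. The `name = member member ...` syntax and the angle-bracket names are purely hypothetical; this section does not give the actual syntax of the meta-character definition table.

```python
# Hypothetical definition-file contents; the syntax is an assumption.
DEFINITIONS = """
<odd> = 1 3 5 7 9
<even> = 0 2 4 6 8
<digit> = <odd> <even>
<day> = Mon Tue Wed
"""


def parse_definitions(text):
    """Build a table mapping each meta-character name to the strings it covers.

    A member that names a previously defined meta-character is replaced by
    that meta-character's members, which gives the nesting behavior.
    """
    table = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        name, _, rhs = line.partition("=")
        members = []
        for token in rhs.split():
            # Nested reference if already defined, otherwise a literal member.
            members.extend(table.get(token, [token]))
        table[name.strip()] = members
    return table


table = parse_definitions(DEFINITIONS)
```

In this sketch `<digit>` is built from the two earlier definitions, and `<day>` shows a meta-character whose members are multi-character strings rather than single symbols.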
These and other features, objects and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.