The present invention deals with natural language processing. More specifically, the present invention relates to a tokenizer for tokenizing an input string to accommodate further natural language processing of the input string.
Natural language processing systems are used to develop machine understanding of human expressions. For example, some computer systems attempt to take action based on natural language inputs to the computer system. In order to do this, the computer system must develop an “understanding” of the natural language expression input to the computer. The natural language expressions can be input, for example, as a query in an information retrieval system. The natural language processing system attempts to understand the meaning of the query input by the user, in order to improve precision and recall in the information retrieval task. Of course, there are many other applications for natural language processing, including command and control, document clustering, machine translation, etc.
One of the principal challenges of a natural language processing system is to identify the boundaries of words in written text. Once the words are identified, the grammar of the particular language being used arranges the words into linguistically significant chunks, or constituents. The grammar identifies the hierarchy connecting those constituents and thus constructs a representation for the input sentence.
Many words in written text can be thought of as character strings surrounded by spaces. For example, using this heuristic, it can be seen that the preceding sentence contains 15 words and that the last word in the sentence is “spaces”. We are able to recognize “spaces” even though that word was written with a period attached (i.e., “spaces.”). However, if the sentence were to end in an abbreviation (say, the string “TENN.”, for Tennessee), the period would form an integral part of the word. Thus, recognizing when to treat punctuation characters as part of a word and when not to do so is a major challenge for a natural language processing system.
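The white-space heuristic and the trailing-period ambiguity described above can be illustrated with a minimal sketch. The function names and the small abbreviation list are hypothetical and serve only as an illustration; they are not taken from any particular system.

```python
# Hypothetical list of abbreviations whose final period belongs to the word.
ABBREVIATIONS = {"TENN.", "A.M.", "P.M."}


def whitespace_tokens(text: str) -> list[str]:
    """Split on runs of white space; punctuation stays attached to each token."""
    return text.split()


def strip_final_period(token: str) -> list[str]:
    """Detach a trailing period unless the token is a listed abbreviation."""
    if len(token) > 1 and token.endswith(".") and token.upper() not in ABBREVIATIONS:
        return [token[:-1], "."]
    return [token]


sentence = ("Many words in written text can be thought of as "
            "character strings surrounded by spaces.")
tokens = whitespace_tokens(sentence)

print(len(tokens))                    # number of space-delimited tokens
print(strip_final_period(tokens[-1])) # the period is detached from "spaces."
print(strip_final_period("TENN."))    # the period is kept as part of the word
```

The sketch resolves the ambiguity only for abbreviations that happen to be listed, which is exactly the kind of knowledge a purely character-based rule does not have.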
Now consider the string “15MB”, which is normally interpreted as “15 megabytes”. There are no spaces between the “15” and “MB” and yet the string is normally analyzed as being composed of two separate words, pronounced “15” and “megabytes”, respectively. Also, consider the fact that the grammar will want to know that this construction is the same as the construction for a version written with a space, namely, “15 MB”, so that it can treat both versions (with or without a space) identically. Thus, recognizing when to separate digits from alphabetical characters poses another significant challenge for word recognition components of natural language processing systems.
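One naive way to make the two spellings equivalent is to split any token made up of digits followed by letters. The following sketch assumes a simple regular expression and hypothetical function names; it also shows why such a blanket rule is risky.

```python
import re

# Digits immediately followed by letters, e.g. "15MB".
_DIGITS_THEN_LETTERS = re.compile(r"(\d+)([A-Za-z]+)")


def split_number_unit(token: str) -> list[str]:
    """Split a token such as "15MB" into its numeric and alphabetic parts."""
    match = _DIGITS_THEN_LETTERS.fullmatch(token)
    return [match.group(1), match.group(2)] if match else [token]


def words(text: str) -> list[str]:
    """Apply the digit/letter split to every space-delimited token."""
    result: list[str] = []
    for token in text.split():
        result.extend(split_number_unit(token))
    return result


print(words("15MB") == words("15 MB"))  # both spellings yield the same words
print(split_number_unit("3rd"))         # the same rule wrongly splits an ordinal
```

Because the rule has no linguistic knowledge, it separates “3rd” just as readily as “15MB”, which illustrates why deciding when to separate digits from letters is difficult.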
Further consider strings consisting of multiple punctuation characters. Emoticons belong in this class, but so do less glamorous items such as the less than or equal to sign <=, the greater than or equal to sign >=, and the arrow sign: ==> to name a few. It is likely that a natural language processing system will want to treat these as single items. However, distinguishing these from other sequences of multiple punctuation characters, such as, for example, the sequence !)” in the sentence “This is a test (really!)”, is also a task that needs to be addressed.
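The distinction between multi-character symbols and incidental runs of punctuation can be sketched as follows. The set of known symbols and the function name are assumptions made for illustration only.

```python
import string

# Hypothetical inventory of punctuation sequences to keep as single items.
KNOWN_SYMBOLS = {"<=", ">=", "==>", ":-)"}


def split_punctuation(token: str) -> list[str]:
    """Keep known symbols whole; otherwise peel off each leading and
    trailing ASCII punctuation character as its own item."""
    if token in KNOWN_SYMBOLS:
        return [token]
    start, end = 0, len(token)
    while start < end and token[start] in string.punctuation:
        start += 1
    while end > start and token[end - 1] in string.punctuation:
        end -= 1
    core = [token[start:end]] if start < end else []
    return list(token[:start]) + core + list(token[end:])


print(split_punctuation("==>"))        # kept as a single item
print(split_punctuation("(really!)"))  # punctuation separated character by character
```

The sketch works only because the symbols are listed in advance; a sequence missing from the inventory is broken into individual characters.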
Many other similar difficulties must be addressed as well. For instance, the above examples do not even mention expressions which are highly difficult to interpret, such as “12:00a.m.-4:00p.m.”. Further, the above examples do not address other difficult issues, such as electronic mail addresses, drive path names, and uniform resource locators (URLs).
For the purpose of the present application, the term “token” will refer to any input text flanked by white spaces or by the beginning and/or end of the input string. The term “word” is used to identify the linguistic unit (or units) into which a given token is broken or segmented after undergoing the tokenization process.
Prior tokenizers have suffered from two major problems. First, the tokenization process was performed independently of any knowledge contained in the system's lexical layer. Second, all the knowledge that the tokenizer needed for tokenizing an input string was hard-coded in system code. Therefore, prior systems simply implemented the rules used to break input strings apart into tokens without caring whether the segmentation or tokenization made any sense, given the lexical or linguistic knowledge in the lexical layer.
For example, prior systems typically had a hard-coded rule which required colons to be separated from surrounding text. Therefore, the example mentioned above, written as “12:00am-4:00pm”, would be separated into the following 5 tokens:
12 : 00am-4 : 00pm
Of course, when this expression is handed to the later natural language processing components, it is basically undecipherable.
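The effect of such a hard-coded colon rule can be reproduced with a short sketch. The function name is hypothetical, and the regular expression is merely one assumed way of expressing the rule.

```python
import re


def separate_colons(text: str) -> list[str]:
    """Surround every colon with spaces, then split on white space."""
    return re.sub(r"\s*:\s*", " : ", text).split()


tokens = separate_colons("12:00am-4:00pm")
print(tokens)       # the colons are isolated; "00am-4" is left as one token
print(len(tokens))  # five tokens in total
```

The middle token, “00am-4”, joins the end of one time expression to the start of the next, which is why later components cannot interpret the result.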
Prior systems also had such rules as deleting the period at the end of an input string. Therefore, an input string which ended with the terms “ . . . at 9:00 A.M.” would have its final period deleted, resulting in the token “A.M”. In order to recognize this token as a valid lexical word, the lexicon or system dictionary against which the tokens are validated was required to include “A.M” as an entry.
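The consequence of the period-deletion rule can be sketched as follows, assuming a hypothetical lexicon that lists the abbreviation in its normal form.

```python
# Hypothetical lexicon containing the abbreviation in its normal spelling.
LEXICON = {"at", "9:00", "A.M."}


def drop_final_period(text: str) -> str:
    """Delete a period at the very end of the input string."""
    return text[:-1] if text.endswith(".") else text


tokens = drop_final_period("at 9:00 A.M.").split()
print(tokens[-1])             # the abbreviation has lost its final period
print(tokens[-1] in LEXICON)  # lookup fails unless "A.M" is also listed
```

To make the lookup succeed, the lexicon would need an additional entry for the truncated form “A.M”.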
Similarly, such prior systems were not language-independent, by any means. For example, the fact that the English contraction 'll (as in the word they'll) constitutes a separate word was handled by scanning for a single quote and then scanning for ll following that quote. In effect, this was the logical equivalent of regular expression matching. This meant that separate code had to be added to the system for French, for example, to handle the myriad contractions present in that language which also use a single quote. The code written had to reflect the content of the lexicon, and had to anticipate what forms were lexicalized. For instance, lexicalized elided forms such as m' and l' had to be listed in system code, so that they would be separated from the following words, while allowing forms such as aujourd'hui, where the quote is part of the word, to stay together.
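The language-specific scanning described above can be sketched as two separate routines, one per language. The lists and function names are hypothetical and deliberately incomplete.

```python
# Forms that had to be enumerated in code, per language (illustrative only).
ENGLISH_CLITICS = ("'ll",)
FRENCH_ELISIONS = ("m'", "l'")


def split_english(token: str) -> list[str]:
    """Detach a listed clitic such as 'll from the end of a token."""
    for clitic in ENGLISH_CLITICS:
        if token.lower().endswith(clitic) and len(token) > len(clitic):
            return [token[:-len(clitic)], token[-len(clitic):]]
    return [token]


def split_french(token: str) -> list[str]:
    """Detach a listed elided form such as l' from the start of a token."""
    for elision in FRENCH_ELISIONS:
        if token.lower().startswith(elision) and len(token) > len(elision):
            return [token[:len(elision)], token[len(elision):]]
    return [token]


print(split_english("they'll"))
print(split_french("l'ami"))
print(split_french("aujourd'hui"))  # stays together: it starts with no listed form
```

Each new language requires another routine and another list that duplicates what the lexicon already knows, which is the maintenance burden described above.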
Prior tokenizers exhibited still other problems. For example, prior tokenizers required that hyphenated strings be kept together as a single word. Such a heuristic was adequate for many cases, such as “baby-sitter” or “blue-green”, which are most probably single words. However, for a substantial number of cases this approach is inadequate. The hyphen is often used instead of a dash. Therefore, in sentences such as “This is a test-and a tough one at that.” the token “test-and” should be presented to the grammar as two separate words, rather than a single word. Because prior implementations of tokenizers did not have access to the lexicon of the language or to any other linguistic knowledge, resolving this at tokenization time was virtually impossible, leading to inappropriate syntactic analysis later in the natural language processing system.
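To illustrate why access to a lexicon helps with hyphenated tokens, the following sketch consults a small hypothetical lexicon before deciding whether to split. It is only an illustration of the problem under stated assumptions, not a description of any particular tokenizer.

```python
# Hypothetical lexicon: two hyphenated entries and two ordinary words.
LEXICON = {"baby-sitter", "blue-green", "test", "and"}


def split_hyphenated(token: str, lexicon: set[str] = LEXICON) -> list[str]:
    """Keep a hyphenated token that is itself a lexical entry; split it
    at the first hyphen only when both halves are lexical entries."""
    if "-" not in token or token.lower() in lexicon:
        return [token]
    left, _, right = token.partition("-")
    if left.lower() in lexicon and right.lower() in lexicon:
        return [left, "-", right]
    return [token]


print(split_hyphenated("baby-sitter"))  # listed as a single word, so kept whole
print(split_hyphenated("test-and"))     # both halves are words, so it is split
```

Without the lexicon lookup, the two cases are indistinguishable at the character level, so a tokenizer limited to hard-coded rules must treat them the same way.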