1. Technical Field
The invention disclosed broadly relates to data processing methods and more particularly relates to an improved computer method for language-independent text tokenization.
2. Background Art
The identification of words and their delimiters is a necessary step for any natural language processing task. Word isolation is also called "tokenization" because the process generates word and punctuation tokens. Since many linguistic tasks depend on dictionary or database lookup, it is imperative to be able to isolate words in a way that will consistently match against a dictionary or a search database.
The technique for isolating words affects the content of the dictionary or database and the applications which use it. Should the dictionary include contractions like "can't" or hyphenated words like "mother-in-law"? If capitalized words are allowed, will there be a distinction between "Victor" and "victor", "Bill" and "bill"? Will the dictionary contain multiple word entries such as "hot dog" or abbreviations like "etc."? Will words with numeric characters such as "42nd" or "B-52" be allowed? How should telephone numbers with hyphens and area codes in parentheses be tokenized?
If the dictionary is designed for languages other than English, word isolation needs to accommodate language-specific conventions. In French, contracted prefixes are found on many words, e.g., "l'enveloppe" (the envelope); these prefixes are generally not included as part of the dictionary entries. Similarly, the parts of some hyphenated French words, such as "permettez-moi" (permit me), need to be recognized as separate words even though they are joined by a hyphen.
The objective of the current invention is to provide a means of isolating words from a stream of natural language text in a consistent way across a plurality of computer hardware platforms and natural languages by use of a character categorization table. The categorization table simplifies the definition of tokens relative to the prior art and makes it possible to customize the process for specific computers or for specific linguistic or database applications.
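By way of illustration only, the idea of table-driven word isolation can be sketched as follows in Python. The sketch is not the disclosed method: the class names (LETTER, DIGIT, DELIMITER, PUNCT, JOINER), the helper functions build_table and tokenize, and the rule that a joiner stays inside a word only when flanked by word characters are all assumptions made for this example. It shows how swapping the table alone, without changing the scanning loop, can yield English-style or French-style behavior for apostrophes and hyphens.

```python
from enum import Enum, auto


class CharClass(Enum):
    """Hypothetical character categories; the names are assumptions."""
    LETTER = auto()
    DIGIT = auto()
    DELIMITER = auto()  # whitespace: ends a word, never emitted
    PUNCT = auto()      # ends a word and is emitted as its own token
    JOINER = auto()     # kept inside a word when flanked by word characters


WORD_CLASSES = (CharClass.LETTER, CharClass.DIGIT)


def build_table(joiners=""):
    """Build a categorization table for the first 256 code points.

    `joiners` lists the characters that may appear inside a word for a
    given language or application (an assumed customization point).
    """
    table = {}
    for code in range(256):
        ch = chr(code)
        if ch.isalpha():
            table[ch] = CharClass.LETTER
        elif ch.isdigit():
            table[ch] = CharClass.DIGIT
        elif ch.isspace():
            table[ch] = CharClass.DELIMITER
        else:
            table[ch] = CharClass.PUNCT
    for ch in joiners:
        table[ch] = CharClass.JOINER
    return table


def tokenize(text, table):
    """Split text into word and punctuation tokens using only the table."""
    tokens, word = [], []

    def flush():
        if word:
            tokens.append("".join(word))
            word.clear()

    for i, ch in enumerate(text):
        # Characters missing from the table are treated as punctuation.
        cls = table.get(ch, CharClass.PUNCT)
        nxt = table.get(text[i + 1], CharClass.PUNCT) if i + 1 < len(text) else None
        if cls in WORD_CLASSES:
            word.append(ch)
        elif cls is CharClass.JOINER and word and nxt in WORD_CLASSES:
            word.append(ch)  # e.g. the apostrophe in "can't"
        else:
            flush()
            if cls is not CharClass.DELIMITER:
                tokens.append(ch)
    flush()
    return tokens


# Assumed English-style table: apostrophe and hyphen may join word parts.
english = build_table(joiners="'-")
# Assumed French-style table: apostrophe and hyphen always split.
french = build_table(joiners="")

print(tokenize("I can't find B-52, etc.", english))
print(tokenize("l'enveloppe", french))
print(tokenize("permettez-moi", french))
```

In this sketch the scanning loop contains no language-specific logic; the difference between keeping "can't" whole and splitting "l'enveloppe" comes entirely from which characters the table marks as joiners. Whether an abbreviation such as "etc." keeps its period, or how telephone numbers are handled, would require further categories or rules that the sketch does not attempt.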