The present invention relates to automated language analysis systems, including such systems embodied in a computer for receiving digitally encoded text composed in a natural language. In particular, it relates to systems for tokenizing and analyzing lexical matter found in a stream of natural language text. The invention can be used, for example, to provide text processing on word processors and microprocessor-controlled typewriters.
Automated language analysis systems embedded in a computer typically include a lexicon module and a processing module. The lexicon module is a "dictionary" or database containing words and semantic knowledge related to each word. The processing module includes a plurality of analysis modules which operate upon the input text and the lexicon module in order to process the text and generate a computer-understandable semantic representation of the natural language text. Automated natural language analysis systems designed in this manner can provide efficient language analysis for tasks such as information retrieval.
Typically, the processing of natural language text begins with the processing module fetching a continuous stream of electronic text from the input buffer. The processing module then decomposes the stream of natural language text into individual words, sentences, and messages. For instance, individual words can be identified by joining together a string of adjacent character codes between two consecutive occurrences of a white space code (i.e., a space, tab, or carriage return). These individual words identified by the processing module are actually just "tokens" that may be found as entries in the lexicon module. This first stage of processing by the processing module is referred to as tokenization, and the processing module at this stage is referred to as a tokenizer.
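The white space delimited tokenization described above can be illustrated with a minimal sketch. The function name `whitespace_tokenize` and the constant `WHITE_SPACE` are hypothetical names chosen for illustration, and treating a line feed as white space is an added assumption; the text itself names only the space, tab, and carriage return.

```python
# Illustrative sketch only; not taken from any particular prior art system.
# White space codes named in the text: space, tab, carriage return.
# The line feed ("\n") is included here as an assumption.
WHITE_SPACE = {" ", "\t", "\r", "\n"}


def whitespace_tokenize(text):
    """Join adjacent characters lying between white space codes into tokens."""
    tokens = []
    current = []  # characters of the token currently being assembled
    for ch in text:
        if ch in WHITE_SPACE:
            # A white space code closes the current token, if one is open.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    # Flush a trailing token that is not followed by white space.
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    print(whitespace_tokenize("The cat\tsat - on 3 mats.\r"))
```

Note that a tokenizer of this kind returns the hyphen and the numeral as tokens alongside the words, since it applies no criterion other than white space.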
Following the tokenization phase, the entire incoming stream of natural language text may be subjected to further higher level linguistic processing. For instance, the entire incoming stream of text might be parsed into sentences having the subject, the main verb, the direct and indirect objects (if any), prepositional phrases, relative clauses, adverbials, etc., identified for each sentence in the stream of incoming natural language text.
Tokenizers currently used in the art encounter problems regarding selective storage and processing of information found in the stream of text. In particular, prior art tokenizers store and process all white space delimited character strings (i.e., "tokens") found in the stream of text. But it is not desirable, from an information processing standpoint, to process and store numbers, hyphens, and other forms of punctuation that are characterized as "tokens" by the prior art tokenizers. Rather, it is preferable to design a tokenizer that identifies as tokens only those character strings forming words that are relevant to information processing.
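The distinction drawn above can be illustrated with a short sketch that discards white space delimited strings containing no alphabetic character. This is only one possible selection criterion, assumed here for illustration; it is not presented as the criterion used by the invention, and the names `is_word_like` and `select_tokens` are hypothetical.

```python
# Illustrative sketch only; the selection rule below is an assumption.
def is_word_like(token):
    """Hypothetical criterion: keep a token only if it has a letter in it."""
    return any(ch.isalpha() for ch in token)


def select_tokens(text):
    """Split on white space, then keep only the word-like strings."""
    # str.split() with no arguments splits on any run of white space.
    return [tok for tok in text.split() if is_word_like(tok)]


if __name__ == "__main__":
    # Numbers, the bare hyphen, and the bare period are discarded.
    print(select_tokens("Order 66 - shipped on 12/04 ."))
```

Under this assumed rule, numerals and stand-alone punctuation never reach storage or later processing stages, whereas a purely white space delimited tokenizer would pass all of them along.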
Prior art tokenizers have the additional drawback that each token extracted from the stream of text must be processed by each higher level linguistic processor in the automated language analysis system. For instance, each token must be processed by a noun phrase analysis module to determine whether the token is part of a noun phrase. This system results in an extensive amount of unnecessary higher level linguistic processing on inappropriate tokens.
Clearly, there is a need in the art for a tokenizer capable of more advanced processing that reduces the overall amount of data being processed by higher level linguistic processors and increases the overall system throughput.
Accordingly, an object of the invention is to provide an improved tokenizer that identifies a selected group of tokens appropriate for higher level linguistic processing.
Other general and specific objects of the invention will be apparent and evident from the accompanying drawings and the following description.