Applications such as text analysis, data mining, and query processing involve detecting features of a given piece of text. Features may take the form of words, sequences of words, specific parts of speech, etc. For example, “heart attack” might be a textual feature that is associated with a specific medical condition. That feature might have variations, such as different names for the same condition (e.g., “heart failure,” “cardiac arrest,” etc.), or misspellings (e.g., “hart attack,” “heart atack,” etc.), which are to be treated in the same way when analyzing the text.
Various models are used for text processing. For example, regular expressions may be used to match input against certain types of patterns. Or, input text can be matched against a dictionary of specific words and/or phrases. Tries, prefix trees, and suffix trees are other structures that may be used to analyze and recognize input text. Text analyzers are normally written using an ad hoc combination of these (or other) approaches. Such text analyzers are normally written from scratch, with a specific text recognition task in mind.
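As an illustration of the kind of ad hoc combination described above, the following minimal Python sketch pairs a word-level trie built from a phrase dictionary with a single regular expression. The phrase list, the feature names (CARDIAC_EVENT, DOSAGE), the dosage pattern, and the function names are all hypothetical and chosen only for this example; they are not drawn from any particular system.

```python
import re

# Hypothetical dictionary mapping surface phrases to a canonical feature name.
PHRASES = {
    "heart attack": "CARDIAC_EVENT",
    "cardiac arrest": "CARDIAC_EVENT",
    "heart failure": "CARDIAC_EVENT",
}

# Hypothetical regular expression for a dosage-like pattern such as "50 mg".
DOSAGE = re.compile(r"\b\d+\s?mg\b")


def build_trie(phrases):
    """Build a word-level trie; a node's "$" key stores the feature name."""
    root = {}
    for phrase, feature in phrases.items():
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node["$"] = feature
    return root


def find_features(text, trie):
    """Return (feature, matched_text) pairs found by the trie and the regex."""
    # Lowercased alphabetic words only, so "$" can never appear as a word.
    words = re.findall(r"[a-z]+", text.lower())
    found = []
    for start in range(len(words)):
        node = trie
        for end in range(start, len(words)):
            node = node.get(words[end])
            if node is None:
                break
            if "$" in node:
                found.append((node["$"], " ".join(words[start:end + 1])))
    # The regular expression runs over the raw text, independently of the trie.
    found.extend(("DOSAGE", m.group(0)) for m in DOSAGE.finditer(text))
    return found
```

Calling find_features("Patient had a heart attack; given 50 mg aspirin.", build_trie(PHRASES)) yields one trie match and one regular-expression match. Every new recognition task of this kind would call for its own dictionary, patterns, and glue code, which is the from-scratch effort noted above.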
Certain types of text analyzer systems have been created to deal with specific situations. The Lex and Flex systems are lexical analyzer generators; they generate programs that take character streams as input and generate token streams as output, by recognizing user-specified regular expressions in the character stream. The Yacc and Bison systems are parser generators; they generate programs that take token streams as input, and that perform user-specified actions, such as building parse trees, based on recognition of certain grammatical structures in the token stream. These systems all focus on processing input that meets a narrow formal language specification. Lex and Flex generate lexical analyzers whose text analysis abilities are mainly limited to recognizing input in the regular language class (i.e., those languages that can be described by regular expressions). Yacc and Bison generate parsers whose analysis abilities are largely limited to recognizing input in a very narrowly defined subset of the context-free language class. Since unstructured text (e.g., web pages, journal articles, books, etc.) is written in natural language, these systems may be unsuited to analysis of unstructured text. In theory, it may be possible to use regular expressions to define the rules for analysis of unstructured text. However, doing so may be prohibitively difficult.
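The character-stream-to-token-stream behavior described above can be approximated in a few lines of Python. The sketch below is not Lex or Flex and does not use their rule-file syntax; it only imitates the general idea with Python's re module, and the token names and rules (NUMBER, IDENT, OP, SKIP) are hypothetical.

```python
import re

# Hypothetical token rules, in priority order: (token name, regular expression).
TOKEN_RULES = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=]"),
    ("SKIP", r"\s+"),
]

# One combined pattern with a named group per rule.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES))


def tokenize(chars):
    """Turn a character stream into a list of (token name, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(chars):
        match = MASTER.match(chars, pos)
        if match is None:
            # Input outside the small formal language is rejected outright.
            raise ValueError(f"unexpected character {chars[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

An input such as "x = 42 + y" fits the rules and produces a clean token stream, whereas a fragment of ordinary prose such as "heart attack!" is rejected at the first character that no rule covers. This suggests why tools built around narrow formal language specifications may be a poor fit for unstructured text.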