1. Field of the Invention
This invention relates generally to structures used to locate patterns in computer text. More particularly, the invention relates to a system and method for creating computer structures from example strings of text.
2. Description of the Background Art
Much text that appears in a computer user's day-to-day activities contains computer-recognizable patterns of text that have semantic significance, including phone numbers, e-mail addresses, post-office addresses, zip codes and dates. In a typical day, for example, a user may receive extensive word-processing and e-mail text files which contain several of these patterns. A language analysis program referred to as an extractor can then be used to search any of the text documents for recognizable patterns. The extractor accesses a data file referred to as a library, which contains computer data referred to as structures. These structures are what the extractor follows when searching computer text to recognize a pattern. A structure comprises one or more definitions, such that the extractor must prove true at least one of the definitions in order to identify a pattern in the computer text. The application of a structure to computer text by an extractor is termed "parsing".
A conventional notation for defining structures is the Backus-Naur Form (BNF), which is both difficult to understand and difficult to write. A definition using BNF consists of the name of the structure (such as the name "Date"), followed by the symbols ":=", further followed by a sequence of definition items. Each definition item in a BNF definition specifies an element of the pattern of text that the structure recognizes. A definition item may be a specific string, which causes the extractor to recognize only that specific string; or the definition item may refer to a structure with a plurality of alternative definitions, causing the extractor to recognize any one of a plurality of specific strings. For example, a definition item may specify a lexical category structure which enables the extractor to recognize a particular kind of string such as numbers, letters, punctuation, spaces, tabs, carriage returns or the like.
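By way of illustration only, lexical category structures of the kind described above can be sketched in Python. The category names and the regular expressions below are assumptions chosen for this sketch; they are not taken from any particular extractor or structure library.

```python
import re

# Hypothetical lexical categories. Both the names and the regular
# expressions are illustrative assumptions, not an extractor's real set.
LEXICAL_CATEGORIES = [
    ("Number", r"\d+"),
    ("Letters", r"[A-Za-z]+"),
    ("Space", r"[ \t]+"),
    ("Newline", r"\r\n|\r|\n"),
    ("Punctuation", r"[^\w\s]"),
]

# One alternation with a named group per category, tried in list order.
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in LEXICAL_CATEGORIES)
)


def categorize(text):
    """Return (category, string) pairs for each run of text that falls
    into one of the lexical categories; other characters are skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    for category, string in categorize("May 5, 1996"):
        print(category, repr(string))
```

In this sketch a definition item that names a lexical category, such as Number, would accept any string that `categorize` labels with that category, whereas a definition item that is a specific string would accept only that exact string.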
A BNF structure that recognizes, for example, a date pattern might have a definition written as: <Date> := <Month> Number "," Number.
In this example, the definition comprises four definition items. The definition item "<Month>" refers to another structure which may be defined as: <Month> := "January" | "February" | ... | "December";
where the symbol "|" separates alternative definitions. Thus, "Month" is the name of the structure and contains twelve definitions; each definition contains a single definition item consisting of the name of a month. It will be appreciated that both the "Date" and "Month" structures are reducible to a specific expression of lexical category structures and/or exact strings of text.
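The "Date" and "Month" structures above, and the way an extractor might apply them to text, can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any actual extractor: the in-memory layout of `LIBRARY`, the item tags `"lit"`, `"ref"` and `"lex"`, the function names, and the choice to skip spaces and tabs between definition items are all hypothetical.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Hypothetical structure library: each structure name maps to a list of
# alternative definitions, and each definition is a list of definition items.
# An item is ("lit", exact_string), ("ref", structure_name) or
# ("lex", lexical_category_name).
LIBRARY = {
    "Date": [[("ref", "Month"), ("lex", "Number"),
              ("lit", ","), ("lex", "Number")]],
    "Month": [[("lit", name)] for name in MONTHS],  # twelve definitions
}

# Only the lexical category needed by this example (an assumption).
LEXICAL = {"Number": re.compile(r"\d+")}


def skip_spaces(text, pos):
    """Assumption of this sketch: spaces and tabs between items are ignored."""
    while pos < len(text) and text[pos] in " \t":
        pos += 1
    return pos


def match_item(item, text, pos):
    """Return the offset just past the item if it matches at pos, else None."""
    kind, value = item
    pos = skip_spaces(text, pos)
    if kind == "lit":
        return pos + len(value) if text.startswith(value, pos) else None
    if kind == "lex":
        m = LEXICAL[value].match(text, pos)
        return m.end() if m else None
    return match_structure(value, text, pos)  # "ref": recurse into a structure


def match_structure(name, text, pos=0):
    """Try each alternative definition in turn; the structure matches as
    soon as every item of one definition matches in sequence."""
    for definition in LIBRARY[name]:
        end = pos
        for item in definition:
            end = match_item(item, text, end)
            if end is None:
                break
        else:
            return end
    return None


if __name__ == "__main__":
    print(match_structure("Date", "December 25, 1996"))
    print(match_structure("Date", "Someday 25, 1996"))
```

The sketch takes the first definition that matches and does not backtrack, which is sufficient for these two structures but would not be for every grammar. It also shows why extending such a library by hand requires the user to think in terms of definitions and definition items.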
In a given extractor, a programmer explicitly stores within the structure library appropriate structures for recognizing patterns valuable to a user. Since it is impossible for a programmer to anticipate all possible valuable patterns, it is desirable to provide the user with a means for extending the structure library. Some previous parsing programs enable the user to access the structure library and to add new structures. The drawback of this approach is that many users find it difficult to understand and write formal structure-defining notations such as BNF. Therefore, a system and method are needed to create pattern-recognizing structures using a simple user interface which does not require specialized programming skills or familiarity with formal structure notations.