1. Field of the Invention
The invention relates generally to a system and method for extracting relevant information from raw text data. More particularly, the invention concerns itself with a system and method for identifying patterns in text using structures defining types of patterns. In this context a “pattern” is to be understood as a part of a written text of arbitrary length. Thus, a pattern may be any series of alphanumeric characters within a text. Particular examples of patterns that might be identified in a text, such as a word-processor document or an email-text, are dates, events, numbers such as telephone numbers, addresses or names.
2. Description of the Background Art
Technologies for searching for interesting patterns in a text presented by a computer to a user (hereinafter “computer text”) are well-known. U.S. Pat. No. 5,864,789 is one example of a document describing such a technology.
A system that searches for patterns in a computer text and offers the user actions based on the kind of pattern identified is described in two variants. The first variant is an application termed “AppleDataDetectors” and the second variant is an application termed “LiveDoc”.
Both variants use the same method to find patterns in an unstructured text. The engine performing the pattern search refers to a library containing a collection of structures, each structure defining a pattern that is to be recognized. FIG. 1 gives an example of seven different structures (#1 to #7), which may be contained in such a structure library. Each of the seven structures shown in FIG. 1 defines a pattern worth recognizing in a computer text. The definition of a pattern is a sequence of so-called definition items. Each definition item specifies an element of the text pattern that the structure recognizes. A definition item may be a specific string or a structure defining another pattern using definition items in the form of strings or structures. For example, structure #1 gives the definition of what is to be identified as a US state code, the definition following the “:=” sign. According to this definition, a pattern in a text will be identified as a US state code if it corresponds to one of the strings between quotation marks, i.e. one of the definition items, such as “AL”, “AK” or “WY” (the symbol “|” means “OR”).
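The idea of a structure library whose definition items are either strings or references to other structures can be sketched as follows. This is only an illustrative sketch: FIG. 1 is not reproduced in this text, so the state list is abbreviated, and the `Ref` class, the `to_regex` helper and the `state_with_country` structure are hypothetical names introduced here, not part of the described applications.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """A definition item that refers to another structure by name."""
    name: str


# Each structure is a list of alternatives (the "|" of the text); each
# alternative is a sequence of definition items (plain strings or Refs).
LIBRARY = {
    # Structure #1, with the list of state codes abbreviated.
    "us_state_code": [["AL"], ["AK"], ["MA"], ["WY"]],
    # Hypothetical structure, included only to show a nested reference.
    "state_with_country": [[Ref("us_state_code"), ", USA"]],
}


def to_regex(name: str, library=LIBRARY) -> str:
    """Expand a structure into a regular expression, following Refs."""
    alternatives = []
    for sequence in library[name]:
        parts = [
            to_regex(item.name, library) if isinstance(item, Ref)
            else re.escape(item)
            for item in sequence
        ]
        alternatives.append("".join(parts))
    return "(?:" + "|".join(alternatives) + ")"


print(re.fullmatch(to_regex("us_state_code"), "WY") is not None)
print(re.fullmatch(to_regex("state_with_country"), "MA, USA") is not None)
```

Expanding references recursively keeps each structure small while allowing larger patterns to reuse it, which is the property the text attributes to the structure library.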
The structure #7 gives a definition of what is to be identified as a street address. In this context, a street address is to be understood as a postal address excluding the name of the recipient. A typical example of a street address is: 225 Franklin Street, 02110 MA Boston. According to the definition given by structure #7, a pattern is a street address if it has elements matching the following sequence of definition items:
    a number as defined by structure #4, followed by
    some spaces, followed by
    some capitalized words, followed by,
    optionally, a known street type as defined by structure #5 (the optional nature being indicated by the question mark behind the brackets surrounding “known_street_type”), followed by
    a comma or spaces, followed by,
    optionally, a postal code as defined by structure #3, followed by
    some spaces, followed by
    a city as defined by structure #6.
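The sequence of definition items listed above might be approximated with a regular expression built from smaller pieces. The sketch below rests on several assumptions, since FIG. 1 is not reproduced in this text: the definitions standing in for structures #3 to #6 are guesses, the state and street-type lists are abbreviated, the city is assumed to allow a leading state code, and the spaces after the optional postal code are assumed to allow zero occurrences.

```python
import re

# Stand-ins for structures #1 and #3 to #6. Every definition below is an
# assumption made for illustration, not the definition shown in FIG. 1.
US_STATE_CODE = r"(?:AL|AK|MA|WY)"              # abbreviated list
POSTAL_CODE = r"\d{5}"                          # assumed: five digits
NUMBER = r"\d+"                                 # assumed: a run of digits
KNOWN_STREET_TYPE = r"(?:Street|Avenue|Road)"   # assumed, abbreviated
CAPITALIZED_WORDS = r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"
CITY = rf"(?:{US_STATE_CODE}\s+)?{CAPITALIZED_WORDS}"  # assumed shape

# Structure #7: one entry per definition item, in the order listed above.
STREET_ADDRESS = re.compile("".join([
    NUMBER,
    r"\s+",                            # some spaces
    CAPITALIZED_WORDS,
    rf"(?:\s+{KNOWN_STREET_TYPE})?",   # optional known street type
    r"(?:,\s*|\s+)",                   # a comma or spaces
    rf"(?:{POSTAL_CODE})?",            # optional postal code
    r"\s*",                            # some spaces (assumed: may be none)
    CITY,
]))


def is_street_address(text: str) -> bool:
    """Return True if the whole text matches the sketched structure #7."""
    return STREET_ADDRESS.fullmatch(text) is not None


print(is_street_address("225 Franklin Street, 02110 MA Boston"))
```

Keeping one list entry per definition item makes it easy to see which items are optional, and therefore how much variation in notation the definition tolerates.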
This definition of a street address is deliberately broad in order to ensure that the application is able to identify not only street addresses written according to a single specific notation but also addresses written according to differing notations.
However, an application using such a broad definition is prone to the detection of a large number of false positives. For example, with the definition of a street address given above, the pattern “4 Apple Pies” will be wrongly recognized as a street address. The obvious solution to reduce the number of false positives is to make the structure definitions narrower. Yet, with narrow definitions there is an increased risk of missing interesting patterns.
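The trade-off between broad and narrow definitions can be illustrated with two simplified patterns. This is a sketch under stated assumptions: both regular expressions are stand-ins invented here (they are not the definitions of FIG. 1), the narrow variant simply makes the street type and postal code mandatory, and the sample string “12 Broadway, Boston” is a made-up address used only to show a missed match.

```python
import re

# Simplified, assumed building blocks (not the definitions of FIG. 1).
WORDS = r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*"
STREET_TYPE = r"(?:Street|Avenue|Road)"
CITY = rf"(?:(?:AL|AK|MA|WY)\s+)?{WORDS}"

# Broad: street type and postal code are optional.
BROAD = re.compile(
    rf"\d+\s+{WORDS}(?:\s+{STREET_TYPE})?(?:,\s*|\s+)(?:\d{{5}}\s+)?{CITY}"
)
# Narrow: street type, comma and postal code are all required.
NARROW = re.compile(
    rf"\d+\s+{WORDS}\s+{STREET_TYPE},\s*\d{{5}}\s+{CITY}"
)

SAMPLES = [
    "225 Franklin Street, 02110 MA Boston",  # example from the text
    "12 Broadway, Boston",                   # hypothetical address
    "4 Apple Pies",                          # false positive from the text
]

for sample in SAMPLES:
    broad_hit = BROAD.fullmatch(sample) is not None
    narrow_hit = NARROW.fullmatch(sample) is not None
    print(f"{sample!r}: broad={broad_hit}, narrow={narrow_hit}")
```

Under these assumptions the broad pattern accepts all three strings, including “4 Apple Pies”, while the narrow pattern rejects “4 Apple Pies” but also misses the address written without a street type or postal code.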
At least certain embodiments of the present invention provide a method and system for identifying patterns in text using structures, which increase the flexibility of structure definitions and which, in particular, permit the formulation of structure definitions that lead to more accurate results during pattern identification.