1. Field of the Invention
The present invention relates to methods, systems and articles of manufacture for detecting useful data from blocks of text or sequences of characters.
2. Description of the Background Art
Various methods of detecting data in text are well known. For example, such methods can be used to analyse bodies of text, such as e-mails or other data received by or input to a computer, to extract information such as e-mail addresses, telephone and fax numbers, physical addresses, IP addresses, days, dates, times, names, places and so forth. In one implementation, a so-called data detector routinely analyses incoming e-mails to detect such information. The detected information can then be extracted to update the user's address book or other records.
Known methods of detecting data include pattern detection methods. Such a method may analyse a body of text to find patterns in the grammar of the text that match the typical grammar pattern of a data type that the method seeks to identify. In general, in such a method, a grammatical function is assigned to each block, such as a word, in the text. The method then compares sequences of grammatical functions in the text to predetermined patterns of functions, which typically make up the types of data to be detected. If a match is found, the method outputs the blocks corresponding to the sequence of grammatical functions as the detected data.
As an example, such a method may assign the function DIGIT to a single digit from 0 to 9 followed by a space; the function NUMBER to two or more digits; the function WORD to two or more adjacent letters; and so forth. Once the functions have been assigned, patterns can be detected. For example, an associated name and address may have the pattern of neighbouring functions: NAME, COMPANY, STREET, POSTAL_CODE, STATE, where some of the functions may be optional.
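The two stages described above, assigning a function to each block and then comparing runs of functions against a predetermined pattern, can be sketched as follows. This is a minimal illustration only and is not taken from any particular known implementation: the names TOKEN_RULES, assign_functions and find_pattern, the OTHER fallback function, and the use of whitespace to delimit blocks are all assumptions made for the example, and optional functions within a pattern are not handled.

```python
import re

# Hypothetical rules following the DIGIT / NUMBER / WORD example in the text.
# Blocks are assumed to be whitespace-delimited, which approximates
# "a single digit followed by a space".
TOKEN_RULES = [
    ("NUMBER", re.compile(r"\d{2,}")),      # two or more digits
    ("DIGIT", re.compile(r"\d")),           # a single digit from 0 to 9
    ("WORD", re.compile(r"[A-Za-z]{2,}")),  # two or more adjacent letters
]


def assign_functions(text):
    """Return a list of (function, block) pairs, one per block of text."""
    tagged = []
    for block in text.split():
        for name, rule in TOKEN_RULES:
            if rule.fullmatch(block):
                tagged.append((name, block))
                break
        else:
            # No rule matched the whole block (assumed fallback function).
            tagged.append(("OTHER", block))
    return tagged


def find_pattern(tagged, pattern):
    """Return the blocks of every run whose functions equal `pattern`."""
    n = len(pattern)
    matches = []
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if [function for function, _ in window] == pattern:
            matches.append([block for _, block in window])
    return matches


tagged = assign_functions("Meet at 12 Elm Street at 5 today")
street_lines = find_pattern(tagged, ["NUMBER", "WORD", "WORD"])
# street_lines == [["12", "Elm", "Street"]]
```

Because the match is made on a sequence of functions, each output block remains associated with its own function, so the individual elements of a match are available separately.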
Such pattern detection methods have generally proven highly effective. However, there remain difficulties in correctly picking out names of organisations and some addresses from bodies of text, as well as in matching all names to an address.
Known methods of detecting data also include statistical learning methods. In general, in such a method a computer program is trained to locate and classify atomic elements in text into predefined categories based on a large corpus of manually annotated training data. Typically, the training data consists of several hundred pages of text, carefully annotated to identify desired categories of data. Thus, in the corpus, each person name, organisation name, address, telephone number, e-mail address, etc., must be tagged. The program then scans the annotated text and learns how to identify each category of data. Following this training stage, the program may process different bodies of unannotated text and pick out data of the desired categories.
Such methods are heavily reliant on the text chosen for the training corpus, on the accuracy with which it is annotated, and on the algorithm by which the program learns. In addition, such programs output, as their result, all of the data matching a particular category as a whole. For example, although such programs are particularly successful in identifying complete addresses, they cannot then output the individual elements of a detected address. Accordingly, they are unable to output the street line of an address as a distinct component of the whole address.
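The train-then-classify workflow of a statistical learning method, and its dependence on the training corpus, can be illustrated with a deliberately tiny sketch. It is a toy under stated assumptions, not a representation of any actual known system: the shape feature, the functions train and predict, the category labels and the five-item corpus are all hypothetical, and practical systems use far larger corpora and richer models.

```python
from collections import Counter, defaultdict


def shape(token):
    """Reduce a token to a coarse shape: digit -> 9, letter -> A or a."""
    out = []
    for ch in token:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)  # punctuation is kept unchanged
    return "".join(out)


def train(annotated):
    """Count how often each shape carries each category in the corpus."""
    counts = defaultdict(Counter)
    for token, label in annotated:
        counts[shape(token)][label] += 1
    return counts


def predict(model, token, default="OTHER"):
    """Return the most frequent category seen for the token's shape."""
    seen = model.get(shape(token))
    return seen.most_common(1)[0][0] if seen else default


# Hypothetical manually annotated corpus (token, category).
CORPUS = [
    ("555-1234", "PHONE"),
    ("555-9876", "PHONE"),
    ("Monday", "DAY"),
    ("Friday", "DAY"),
    ("call", "OTHER"),
]

model = train(CORPUS)
label_a = predict(model, "444-0000")  # "PHONE": shape seen in training
label_b = predict(model, "Tuesday")   # "OTHER": seven-letter shape never seen
```

The second prediction fails only because no seven-letter day name appears in the toy corpus, which illustrates how strongly the result depends on the text chosen for training.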