Many prior methods of identifying and highlighting text in a document rely on syntactic or semantic parsing, on probabilistic models, or on string matching.
Generally, parsing methods are slow and fail, or degrade dramatically, when given text that is not highly structured. Many parsers work well on the highly structured prose found in newspapers but do not work well on less structured text.
Probabilistic methods, such as those based on Hidden Markov Models, require substantial training sets and, generally, are not very accurate.
String matching against large lists generally requires substantial storage capacity and is either limited to recognizing specific spellings or is slow if not so limited.
Therefore, there is a need for a more accurate and faster method of identifying text and highlighting the same. The present invention is such a method.
U.S. Pat. No. 5,287,278, entitled “METHOD FOR EXTRACTING COMPANY NAMES FROM TEXT,” discloses a method of identifying a name of a company by first locating its suffix (e.g., Company, Corporation) and then locating the beginning of the company's name. The present invention is not limited to using suffixes, and uses additional steps not disclosed in U.S. Pat. No. 5,287,278. U.S. Pat. No. 5,287,278 is hereby incorporated by reference into the specification of the present invention.
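By way of illustration only, the suffix-first approach described above may be sketched as follows. The sketch is not the method of U.S. Pat. No. 5,287,278. The suffix list, the function name `extract_company_names`, and the rule that a company name begins at the earliest word in an unbroken run of capitalized words preceding the suffix are assumptions made for this example.

```python
# Hypothetical suffix list; the cited patent's actual list is not reproduced here.
COMPANY_SUFFIXES = {"Company", "Corporation"}

# Punctuation that is assumed to end a name when it trails a word.
TRAILING_PUNCTUATION = (".", ",", ";", ":")


def extract_company_names(text):
    """Locate each company suffix, then scan backward for the start of the name."""
    tokens = text.split()
    names = []
    for i, token in enumerate(tokens):
        word = token.rstrip(".,;:")
        if word not in COMPANY_SUFFIXES:
            continue
        # Assumed rule: the name extends backward over consecutive
        # capitalized words that do not end in punctuation.
        start = i
        while (
            start > 0
            and tokens[start - 1][0].isupper()
            and not tokens[start - 1].endswith(TRAILING_PUNCTUATION)
        ):
            start -= 1
        if start < i:  # require at least one word before the suffix
            names.append(" ".join(tokens[start:i] + [word]))
    return names


if __name__ == "__main__":
    sample = "Shares of Acme Widget Corporation rose, while Globex Company fell."
    print(extract_company_names(sample))
```

Such a sketch finds nothing when a company name carries no listed suffix, which is the limitation noted above.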
U.S. Pat. No. 5,819,265, entitled “PROCESSING NAMES IN A TEXT,” discloses a device for and a method of identifying a proper name in text by identifying capitalized words and specially designated words. Then, leading and trailing substrings (e.g., spaces, punctuation) are removed. Then, the identified word is split, if possible, until it cannot be split any further. The result is a list of possible proper names. The present invention does not use the same method as does U.S. Pat. No. 5,819,265 and includes steps that are not disclosed in U.S. Pat. No. 5,819,265. U.S. Pat. No. 5,819,265 is hereby incorporated by reference into the specification of the present invention.
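By way of illustration only, the sequence of steps described above (identify capitalized words, remove leading and trailing substrings, split until no further split is possible) may be sketched as follows. The sketch is not the method of U.S. Pat. No. 5,819,265. The separators used for splitting and the function names `split_fully` and `candidate_proper_names` are assumptions made for this example, since the split rule is not stated above, and specially designated words are omitted.

```python
import string

# Assumed separators for the splitting step; the actual split rule is not given.
SPLIT_SEPARATORS = ("/", "&")


def split_fully(candidate):
    """Split a candidate on the assumed separators until none remain."""
    for separator in SPLIT_SEPARATORS:
        if separator in candidate:
            parts = []
            for piece in candidate.split(separator):
                piece = piece.strip()
                if piece:
                    parts.extend(split_fully(piece))
            return parts
    return [candidate]


def candidate_proper_names(text):
    """Return a list of possible proper names found in the text."""
    candidates = []
    for token in text.split():
        # Remove leading and trailing punctuation and whitespace.
        word = token.strip(string.punctuation + string.whitespace)
        # Keep only words that begin with a capital letter.
        if word and word[0].isupper():
            candidates.extend(split_fully(word))
    return candidates


if __name__ == "__main__":
    print(candidate_proper_names('He said "Smith/Jones" met Brown.'))
```

The capitalized word that begins a sentence ("He" in the sample) is returned as a candidate, which is consistent with the result being a list of possible, rather than confirmed, proper names.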
U.S. Pat. No. 5,832,480, entitled “USING CANONICAL FORMS TO DEVELOP A DICTIONARY OF NAMES IN A TEXT,” discloses a device for and a method of creating canonical forms of a proper name in text by first establishing an equivalence group where each name, which was identified using the method of U.S. Pat. No. 5,819,265, shares an attribute (e.g., professional title, suffix, last name, personal title, first name, prefix, nickname, organization place, organization tag, organization name). Then, the name with a high confidence score is selected as an anchor. Then, one or more names that share an attribute with the anchor are designated as variants of the anchor. The present invention does not use the same method as does U.S. Pat. No. 5,832,480 and includes steps that are not disclosed in U.S. Pat. No. 5,832,480. U.S. Pat. No. 5,832,480 is hereby incorporated by reference into the specification of the present invention.
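By way of illustration only, the grouping of names into an anchor and its variants may be sketched as follows. The sketch is not the method of U.S. Pat. No. 5,832,480. The record layout, the field names `name`, `last_name`, and `confidence`, the function name `canonicalize`, the sample values, and the choice of the single highest-scoring name as the anchor are assumptions made for this example.

```python
from collections import defaultdict


def canonicalize(records, attribute="last_name"):
    """Group names sharing an attribute, then map each anchor to its variants."""
    # Establish equivalence groups keyed by the shared attribute value.
    groups = defaultdict(list)
    for record in records:
        key = record.get(attribute)
        if key is not None:
            groups[key].append(record)

    anchors = {}
    for members in groups.values():
        # Assumed rule: the anchor is the member with the highest confidence.
        anchor = max(members, key=lambda r: r["confidence"])
        # Every other member of the group is a variant of the anchor.
        anchors[anchor["name"]] = [m["name"] for m in members if m is not anchor]
    return anchors


if __name__ == "__main__":
    # Hypothetical records; the confidence values are invented for illustration.
    sample = [
        {"name": "Dr. Jane Smith", "last_name": "Smith", "confidence": 0.9},
        {"name": "Smith", "last_name": "Smith", "confidence": 0.4},
        {"name": "J. Smith", "last_name": "Smith", "confidence": 0.6},
        {"name": "Bob Jones", "last_name": "Jones", "confidence": 0.8},
    ]
    print(canonicalize(sample))
```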