Devices, typically suitably programmed computing devices, that perform automated document analysis are well known in the art. Examples of products that perform automated document analysis include Early Case Assessment software provided by Complete Discovery Source, Inc., Redact-It software by Open Text Corp, Intelligent Data Extraction software by Extract Systems and automated redaction software by Adlibs Software. Among other features, some of these products perform company name recognition analysis and provide enhanced man-machine user interfaces in which occurrences of company names in document text are displayed and highlighted. Ideally, the processing performed to implement such company name recognition analysis will lead to few, if any, false positives and few false negatives (misses), either of which would otherwise lead to an inaccurate representation of the document text presented by such user interfaces. However, this is not always the case.
Various machine-implemented techniques for performing company name recognition analysis are known in the art. For example, the analysis devices may be equipped with predefined lists of company names and perform simple comparisons to identify occurrences of matches to entries in the predefined lists. However, such predefined lists are invariably incomplete and, in any event, constantly changing due to companies changing names, new companies coming into existence, old names falling into disuse, etc. Consequently, company name recognition analysis that relies solely on list-based comparisons tends to be relatively inaccurate.
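By way of illustration only, a list-based comparison of the kind described above might be sketched as follows. The list contents, the function name find_listed_companies and the sample text are hypothetical and are not taken from any of the products mentioned.

```python
import re

# Hypothetical predefined list; real lists are far larger and still incomplete.
KNOWN_COMPANIES = ["Open Text Corp", "Extract Systems"]


def find_listed_companies(text, names=KNOWN_COMPANIES):
    """Return (start, end, name) for each whole-word occurrence of a listed name."""
    hits = []
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            hits.append((match.start(), match.end(), name))
    return sorted(hits)


sample = "Open Text Corp and Acme Widgets LLC signed the agreement."
listed_hits = find_listed_companies(sample)
```

In this sketch only the listed name is reported. The unlisted "Acme Widgets LLC" is silently missed, which is the kind of false negative that an incomplete list produces.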
In another technique, the analysis device identifies as names all capitalized words in the text that do not start sentences, that are not found in a dictionary and that are not found in a list of people's names. However, extracting all capitalized words meeting these criteria results in many false positives for company names, to the extent that various other names (e.g., product names, professional/legal terms, etc.) are likely to be identified as company names. It also results in many false negatives, because companies are often named for people or things and company names often start sentences (e.g., “Apple shall . . . ”).
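A minimal sketch of such a capitalized-word heuristic is given below. It assumes that the dictionary and the list of people names act as exclusion filters, and it uses a naive sentence splitter. The word lists, the function name candidate_names and the sample text are hypothetical.

```python
import re

# Hypothetical stand-ins for a real dictionary and a real list of people names.
DICTIONARY = {"apple", "shall", "the", "agreement", "deliver", "goods"}
PERSON_NAMES = {"ford", "john"}


def candidate_names(text):
    """Capitalized words that do not start a sentence and are in neither list."""
    found = []
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        for word in words[1:]:  # skip the sentence-initial word
            if not word[0].isupper():
                continue
            key = word.lower()
            if key in DICTIONARY or key in PERSON_NAMES:
                continue
            found.append(word)
    return found


sample = ("Apple shall deliver goods to Ford. "
          "The Agreement covers Acme and Photoshop.")
candidates = candidate_names(sample)
```

Under these assumptions the sketch reports "Acme" and "Photoshop". The product name is a false positive, while "Apple" (sentence-initial and a dictionary word) and "Ford" (a person's name) are false negatives.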
Further still, regular expression pattern matching is a well-known technique for recognizing the occurrence of well-defined patterns in text. Such pattern recognition techniques generally work well for recognizing phone numbers, currencies, and social security numbers, for example, but do not fare well with company names, which do not always follow a well-defined letter/digit sequence and are often used inconsistently (e.g., shortened) even within a single document. While it may be possible to design a regular expression that could work for company names in some instances, the resulting regular expression would likely be unwieldy and poorly performing.
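The contrast can be illustrated with a short sketch. The patterns below are simplified, US-style examples chosen for illustration, and the COMPANY pattern in particular is a hypothetical attempt rather than a pattern used by any known product.

```python
import re

# Illustrative patterns for well-defined formats; real formats vary more than this.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

# Hypothetical company pattern: capitalized words followed by a corporate suffix.
COMPANY = re.compile(r"\b(?:[A-Z][a-z]+ )+(?:Inc\.|Corp\.?|LLC)")

sample = ("Call (555) 010-2345 regarding SSN 123-45-6789. "
          "Acme Widgets Inc. signed; later, Acme agreed to pay.")

ssn_hits = SSN.findall(sample)
phone_hits = PHONE.findall(sample)
company_hits = COMPANY.findall(sample)
```

The phone number and social security number are each found once. The company pattern finds the full form "Acme Widgets Inc." but not the later shortened reference "Acme", and widening the pattern to catch bare capitalized words would reintroduce the false positives described above.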
Thus, techniques and devices that overcome the operational shortcomings of prior art devices/products and improve operation of the man-machine interface (to at least the extent that fewer errors are presented) would represent a welcome advancement in the art.