The present invention relates generally to automated information retrieval, and more particularly to a system and method for extracting company names from text.
One of the major problems in the accurate analysis of natural language is the presence of unknown words, especially names. While names account for a large percentage of the unknown words in a text, they can also be the most important pieces of information in it: for determining what the text is about (topic analysis), for extracting information from the text (database generation), and for indexing the text for full-text retrieval.
Company names are particularly important for knowledge-based financial applications. With mixed case input, a program can easily extract company names by looking backward from a company name indicator (e.g., Incorporated or Corporation) to the first non-capitalized word. This simple heuristic fails to correctly identify approximately 10% of real company names, and it fails entirely with upper case input, where capitalization no longer distinguishes the words of a name from the surrounding text.
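By way of illustration only, the simple backward-looking heuristic described above might be sketched in Python as follows. The function name `extract_company_names`, the tokenizing pattern, and the contents of `INDICATORS` are assumptions introduced for this sketch; only Incorporated and Corporation are named as indicators in the text, and the abbreviated forms are added here as hypothetical examples.

```python
import re

# Hypothetical indicator list. Only "Incorporated" and "Corporation" come from
# the text; "inc" and "corp" are assumed additions for illustration.
INDICATORS = {"incorporated", "corporation", "inc", "corp"}


def extract_company_names(text):
    """Return candidate company names found by scanning backward from each
    company name indicator to the first non-capitalized word."""
    # Assumed tokenization: runs of letters, digits, and a few joining characters.
    tokens = re.findall(r"[A-Za-z0-9&'-]+", text)
    names = []
    for i, token in enumerate(tokens):
        if token.lower() not in INDICATORS:
            continue
        start = i
        # Walk backward while the preceding word begins with a capital letter.
        while start > 0 and tokens[start - 1][0].isupper():
            start -= 1
        # Keep the candidate only if at least one word precedes the indicator.
        if start < i:
            names.append(" ".join(tokens[start:i + 1]))
    return names


print(extract_company_names("Shares of Acme Widget Corporation rose sharply."))
print(extract_company_names("Yesterday Acme Widget Corporation rose."))
print(extract_company_names("SHARES OF ACME WIDGET CORPORATION ROSE"))
```

In this sketch the first call yields the intended name, the second also absorbs the capitalized sentence-initial word "Yesterday", and the third, given upper case input, runs backward to the start of the text because every word appears capitalized. These are examples of the kinds of failure noted above, not an exhaustive account of them.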
What is needed is a more accurate method that extracts company names from mixed case text and that also works for upper case text.