Human language is not always precise. It often relies on terms and phrases that, by themselves, may be ambiguous in meaning or may fail to distinguish and uniquely identify a particular person, place or thing. A word or phrase can be ambiguous because it may be associated with a plurality of different subjects or entities. A reference to “Paris,” for instance, could refer to a city in the country of France, cities in the States of Texas, Tennessee or Illinois, or even a person (e.g., “Paris Hilton”).
Ambiguity may also arise when a single entity, such as a person, organization or place, is routinely identified by or associated with a multitude of different words, phrases and/or abbreviations. For example, companies and organizations often have multiple trade names, abbreviations, nicknames or acronyms, and some company names are frequently misspelled. Still more ambiguity can arise, for example, when a large number of people share the same name (e.g., “Mr. John Smith”), when a famous individual shares a name with non-famous individuals (e.g., “Mr. Michael Jackson”), when a single individual is associated with potentially many different organizations simultaneously or consecutively over time, or when an organization has a large number of well-known heterogeneous parts, sub-organizations or subsidiaries (as in “The Smithsonian Institution,” which has 19 museums, 9 research centers and more than 140 affiliate museums around the world).
Entity mention disambiguation is the process of resolving which unique entities (e.g., persons, organizations or places) are the intended subjects of references to certain names, words or phrases (such references typically being referred to in the art as “mentions”) in the documents of a given corpus. Humans are reasonably good at resolving ambiguous entity mentions in written and spoken language by using the context in which the ambiguous words or phrases appear. Conventional automated systems and processes, however, have heretofore failed to achieve adequate levels of performance and reliability in disambiguating entity mentions in electronic documents, especially when the sources of the electronic documents comprise very large collections, such as the National Library of Medicine's “PubMed” online database, or the United States Patent and Trademark Office's online patent database.