1. Field of the Invention
This invention relates generally to reducing ambiguity in entity name spotting by using additional information available from the context of a data subset.
2. Description of the Related Art
Entity name spotting is a term used in data mining and natural language processing for the process of finding the names of objects of interest in an underlying data set. Examples include spotting the names of geographic locations (cities, countries, etc.), organizations (companies, etc.), or people (politicians, CEOs, artists, etc.).
Entity name spotting is a difficult problem in domains where ambiguity is common (or even embraced).
Typically, a sentence will contain multiple, potentially overlapping entity names that need to be spotted. In these instances, a decision must be made as to which entity name is the correct one. Most data sets are composed of individual data objects, and subsets of data objects are usually related to each other by structure, topic, or domain. For example, data sets derived from Internet discussion forums consist of individual user comments (the individual data objects), and all comments from a specific thread form a subset of related data objects. Moreover, all threads related to a specific topic or domain may form another subset of related data objects.
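The overlap problem described above can be illustrated with a minimal sketch. The gazetteer, the function names, and the example sentence are hypothetical and chosen only for illustration, and the lowercase whitespace tokenization is a simplifying assumption. The sketch only enumerates the competing candidates; it does not decide which one is correct.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    start: int  # index of the first token (inclusive)
    end: int    # index one past the last token (exclusive)
    name: str
    kind: str


# Hypothetical gazetteer mapping known entity names to entity types.
GAZETTEER = {
    "new york": "location",
    "york": "location",
    "new york times": "organization",
}


def spot_candidates(text, gazetteer=GAZETTEER, max_len=3):
    """Return every token n-gram (up to max_len tokens) found in the gazetteer."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    found = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            phrase = " ".join(tokens[start:end])
            if phrase in gazetteer:
                found.append(Candidate(start, end, phrase, gazetteer[phrase]))
    return found


def overlapping_pairs(candidates):
    """Return all pairs of candidates whose token spans intersect."""
    return [
        (a, b)
        for i, a in enumerate(candidates)
        for b in candidates[i + 1:]
        if a.start < b.end and b.start < a.end
    ]


candidates = spot_candidates("I read the New York Times today")
conflicts = overlapping_pairs(candidates)
for a, b in conflicts:
    print(f"{a.name!r} ({a.kind}) overlaps {b.name!r} ({b.kind})")
```

In this example a single mention yields three candidate names, each of which overlaps the other two, so some further source of information is needed to choose among them.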
In the music domain, data objects may belong to a subset as specific as a band's MySpace page or as broad as a Usenet discussion group on country music. These subsets provide the overall context in which more focused entity name spotting can occur.
The entity name itself can be of multiple types. In the music domain, the entity name being spotted can be that of an individual artist, a band, an individual track, a record album, etc. Spotting multiple types of entity names within discussions poses interesting challenges, such as spotting track, album, or band names made up of typical stopwords or common words (e.g., "new," "hello," "yesterday," etc.) and determining which band or artist a track or album belongs to (e.g., in the case of a cover of a popular track).
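The difficulty with common-word names, and the way a related subset of data objects can narrow the candidates, can be sketched as follows. This is an illustrative sketch only and is not a description of any particular prior system: the catalog entries, the placeholder artist names, and the `spot` function are hypothetical, and only single-word titles are handled.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    title: str
    kind: str    # "track" or "album"
    artist: str


# Hypothetical catalog; the artist names are placeholders, not real data.
CATALOG = [
    Entry("hello", "track", "Artist A"),
    Entry("hello", "track", "Artist B"),
    Entry("yesterday", "track", "Artist C"),
    Entry("new", "album", "Artist D"),
]


def spot(comment, catalog=CATALOG, context_artist=None):
    """Return catalog entries whose single-word title occurs in the comment.

    If context_artist is given (for example, because the comment was posted
    on that artist's page), only entries by that artist are considered.
    """
    words = set(re.findall(r"[a-z']+", comment.lower()))
    return [
        entry
        for entry in catalog
        if entry.title in words
        and (context_artist is None or entry.artist == context_artist)
    ]


comment = "Hello is the best new thing I heard yesterday"
unrestricted = spot(comment)
restricted = spot(comment, context_artist="Artist A")
print(len(unrestricted), "candidates without context")
print(len(restricted), "candidate with context")
```

Without context, every common word that happens to be a title is reported, including two tracks that share the same name. Restricting the catalog to the artist associated with the subset leaves a single candidate, although whether that word is actually used as a track name in the comment still has to be determined.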
The current state of the art focuses on (i) named entity disambiguation by mapping entities in an input document against a predefined set of category tags; (ii) disambiguation of names in web data using clustering algorithms and linguistically derived features; (iii) entity (or name) disambiguation using ontologies for background information; and (iv) filtering of unstructured content in a web-service database using query constraints at runtime, where the constraints include name spotting constraints.