With the large (and ever-growing) amount of information available on the World Wide Web, it can be desirable to assign deep structure and understanding to otherwise unstructured documents (e.g., textual documents) by determining content of the documents based on existing knowledge bases. For instance, given a document and an entity knowledge base, it can be desirable to determine which entity in the entity knowledge base is mentioned in the document, or alternatively to indicate that no entity from the entity knowledge base is mentioned in the document. However, in a web-scale entity knowledge base, it is highly likely that different entities share a common name. Thus, different entities may have the same surface form (e.g., mention), which can lead to an inherent ambiguity when attempting to resolve a mention of an ambiguous entity to a matching entry in the entity knowledge base. By way of illustration, the entity knowledge base can include more than one “Joe Smith” (e.g., a doctor, a football player, etc.); following this illustration, a mention of “Joe Smith” in a document can be ambiguous since such mention may refer to “Joe Smith” the doctor or “Joe Smith” the football player from the entity knowledge base, or to an unknown “Joe Smith” outside of the entity knowledge base.
To resolve a mention of an entity in a document using an entity knowledge base, conventional approaches commonly extract features from the document and compare those features to feature representations of the entities in the entity knowledge base. Traditional approaches for entity disambiguation oftentimes employ a type of similarity measure that evaluates a distance between a context of a mention in a document and a context of an entity in an entity knowledge base. Examples of conventional similarity measures include cosine similarity (with term frequency-inverse document frequency (TF-IDF) weighting), Kullback-Leibler (KL) divergence (with a language model), and Jaccard similarity. Other conventional approaches have explored the co-occurrence of multiple entities and performed joint recognition.
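By way of a hedged illustration, the three similarity measures named above can be sketched in a few lines of Python. The sketch is not taken from any particular conventional system: the function names, the whitespace tokenizer, the smoothed inverse document frequency formula, the add-alpha smoothing of the unigram language model, and the toy “Joe Smith” contexts are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a toy example.
    return text.lower().split()


def jaccard(a_tokens, b_tokens):
    # Jaccard similarity: size of the intersection over size of the union.
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if (a or b) else 0.0


def tfidf_vectors(docs):
    # docs is a list of token lists; returns one {term: weight} dict per doc.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumption: smoothed IDF, so terms present in every doc keep a weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    # Cosine similarity between two sparse {term: weight} vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def kl_divergence(p_tokens, q_tokens, vocab, alpha=1.0):
    # KL(P || Q) between add-alpha smoothed unigram language models.
    # Lower values mean the two contexts are closer.
    p_counts, q_counts = Counter(p_tokens), Counter(q_tokens)
    p_total = len(p_tokens) + alpha * len(vocab)
    q_total = len(q_tokens) + alpha * len(vocab)
    total = 0.0
    for term in vocab:
        p = (p_counts[term] + alpha) / p_total
        q = (q_counts[term] + alpha) / q_total
        total += p * math.log(p / q)
    return total


# Hypothetical mention context and knowledge-base entries.
mention = tokenize("Joe Smith threw a touchdown pass in the football game")
candidates = {
    "Joe Smith (doctor)": tokenize(
        "Joe Smith is a doctor who treats patients at a hospital"),
    "Joe Smith (football player)": tokenize(
        "Joe Smith is a football player who threw a record touchdown"),
}

names = list(candidates)
vectors = tfidf_vectors([mention] + [candidates[n] for n in names])
vocab = set(mention).union(*candidates.values())

cosine_scores = {n: cosine(vectors[0], vectors[i + 1])
                 for i, n in enumerate(names)}
jaccard_scores = {n: jaccard(mention, candidates[n]) for n in names}
kl_scores = {n: kl_divergence(mention, candidates[n], vocab) for n in names}

best_by_cosine = max(cosine_scores, key=cosine_scores.get)
best_by_jaccard = max(jaccard_scores, key=jaccard_scores.get)
best_by_kl = min(kl_scores, key=kl_scores.get)  # divergence: smaller is closer
print(best_by_cosine, best_by_jaccard, best_by_kl)
```

In this toy setting, all three measures favor the football player entry, because the mention context shares more content words with that entry. Note that none of these measures, as sketched, can indicate that the mention refers to an entity outside of the entity knowledge base; a threshold on the score would be one possible (assumed) way to handle that case.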
However, conventional entity disambiguation approaches can be problematic. For instance, various conventional approaches can build feature vectors out of the mention and the candidate entities, and perform similarity comparisons among the feature vectors. Yet, some features can be missing, and features from different facets of the entity can be missing for different occurrences of the mention. Moreover, a fixed weight is commonly learned for each dimension of the feature vector, even though many of the features can be language components such as “the”, “and”, “a”, “an”, etc., which may not be discriminative for entity disambiguation. According to another example, for many conventional approaches, the features are designed and engineered for a particular entity knowledge base. Accordingly, such approaches can lack scalability and extensibility when applied to different domains where the entity knowledge bases have different schemas.
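The concern about non-discriminative language components can be illustrated with a minimal sketch. The two contexts below are hand-picked, hypothetical sentences about unrelated entities that share only the words “the”, “and”, “a”, and “an”; the helper names and the use of raw term counts as features are likewise assumptions made for illustration.

```python
import math
from collections import Counter

# The language components named in the text above.
STOP_WORDS = {"the", "and", "a", "an"}


def count_cosine(a_tokens, b_tokens):
    # Cosine similarity over raw term counts (every dimension weighted equally).
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(count * b[term] for term, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


def drop_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


# Hypothetical contexts that have no content word in common.
mention_ctx = "the doctor and the nurse saw a patient and an intern".split()
entity_ctx = "the player and the coach won a game and an award".split()

with_stop = count_cosine(mention_ctx, entity_ctx)
without_stop = count_cosine(drop_stop_words(mention_ctx),
                            drop_stop_words(entity_ctx))
print(with_stop, without_stop)
```

Here the similarity computed over all features is substantial even though the contexts describe different entities, and it drops to zero once the four language components are removed, which suggests how such features can dominate a comparison without helping to disambiguate.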