Natural Language Processing (NLP) is a well-known field of technology that deals with the analysis and understanding of natural language texts by computers. Extensive research has been going on in this field for several decades and good progress has been achieved, even though much remains to be improved before computers come close to humans in understanding free, general-purpose texts.
A specific topic within the NLP field is Named Entity Recognition (NER). A named entity may be a real-world object, such as a person, a location, an organization, a product, etc., that can be denoted with a proper name. It can be abstract or have a physical existence. Examples of named entities include Barack Obama, New York City, Volkswagen Golf, and anything else that can be named. Named entities can simply be viewed as instances of classes of entities (e.g., New York City is an instance of a city). Many natural language sentences contain named entities, and understanding such a sentence requires (i) identifying the existence of named entities, and (ii) for each term identified as a named entity, finding what that named entity refers to.
For example, in order to correctly understand the sentence “I am familiar with New Jersey”, a computer must detect that the word “Jersey” is part of the named entity “New Jersey” (a state of the US) and does not refer to its stand-alone meaning of “knitted clothing”. One may argue that the capital letter at the start of “Jersey” in the above example makes the task trivial. However, it should be remembered that in many cases the text to be analyzed is received from an automatic speech-to-text converter, where capitalization cannot be known. Additionally, not all languages capitalize the first letters of names as English does. If the above sentence were written in Hebrew, there would be no detectable difference between the two interpretations of “Jersey”. Similarly, in a language such as German, in which all nouns are capitalized (and not just proper nouns as in English), there would be no detectable difference between the two interpretations of any given noun that has an everyday meaning in addition to its meaning as a named entity.
Much research has gone into solving the Named Entity Recognition task and reasonably good solutions exist. Typically, NER implementations make extensive use of a dictionary, an encyclopedia, a database or a knowledge base for identifying a term as a named entity and for extracting its meaning. Wikipedia is the source most commonly used for that purpose, because of its large size and the diversity of the fields it covers.
However, recognizing a term in a sentence to be a named entity does not always bring with it immediate understanding of what it refers to. Consider the sentence “I admire Washington”. Prior art Named Entity Recognition systems should have no difficulty in identifying “Washington” to be a named entity, but will face difficulties when having to determine which entity the sentence refers to—(i) Washington D.C., (ii) Washington State, (iii) Washington Irving, (iv) George Washington, or (v) another person called Washington. The problem of distinguishing between multiple candidate interpretations of a given named entity is called Named Entity Disambiguation (NED).
As with the NER problem, much research has been done on solving the NED problem. Several examples of such research papers are:
1. Gentile, A. L., Zhang, Z., Xia, L. and Iria, J., 2010, January. Semantic relatedness approach for named entity disambiguation. In Italian Research Conference on Digital Libraries (pp. 137-148). Springer, Berlin, Heidelberg.
2. Hoffart, J., Seufert, S., Nguyen, D. B., Theobald, M. and Weikum, G., 2012, October. KORE: keyphrase overlap relatedness for entity disambiguation. In Proceedings of the 21st ACM international conference on Information and knowledge management (pp. 545-554).
3. Hoffart, J., 2015. Discovering and disambiguating named entities in text. Ph.D. thesis.
4. Mann, G. S. and Yarowsky, D., 2003, May. Unsupervised personal name disambiguation. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4 (pp. 33-40). Association for Computational Linguistics.
All four of the above documents are incorporated herein by reference in their entirety.
The issue of identifying the existence of a NED problem is straightforward—for example, a NER implementation that uses Wikipedia for identifying named entities can easily detect the existence of multiple Wikipedia pages referring to the same name. (Wikipedia actually provides an explicit “disambiguation page” for every name having multiple pages matching it, with the disambiguation page pointing to the different candidates for the name). The really difficult issue is determining which of the multiple candidates matching the same name is the one referred to in the analyzed text.
Prior art NED implementations use a variety of approaches. The simplest ones take their decision by looking only at the competing Wikipedia entries, without referring to additional inputs. For example, some NED systems determine which of the potential Wikipedia pages is most “popular” and pick it to be the intended meaning. Popularity of a given Wikipedia page can be determined, for example, by counting how many other Wikipedia pages contain a link pointing to the given page. Thus, in such NED systems the candidate Wikipedia page having the highest number of incoming links from other Wikipedia pages will always be selected for a given named entity.
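By way of non-limiting illustration, the popularity-based selection described above may be sketched as follows. The function name `pick_most_popular`, the page titles and the link data are hypothetical and are not taken from any particular NED system or from real Wikipedia statistics; the sketch only assumes that incoming links are available as (source page, target page) pairs.

```python
from collections import Counter

def pick_most_popular(candidates, links):
    """Return the candidate page that has the most incoming links.

    `links` is an iterable of (source_page, target_page) pairs.
    Duplicate pairs and self-links are ignored.
    """
    incoming = Counter(
        target for source, target in set(links) if source != target
    )
    # Counter returns 0 for pages that have no incoming links at all.
    return max(candidates, key=lambda page: incoming[page])

# Toy, made-up link data (assumption for illustration only).
toy_links = [
    ("United States", "Washington, D.C."),
    ("White House", "Washington, D.C."),
    ("Seattle", "Washington (state)"),
    ("American Revolution", "George Washington"),
    ("Mount Vernon", "George Washington"),
    ("United States", "George Washington"),
]
candidates = ["Washington, D.C.", "Washington (state)", "George Washington"]

print(pick_most_popular(candidates, toy_links))  # George Washington
```

Because the score depends only on the link structure and not on the analyzed text, such a selector returns the same page for every occurrence of the name.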
More advanced NED systems do not always select the same candidate for all occurrences of a given named entity, but use the context in which the named entity appears in the analyzed text to customize the selection for each specific occurrence. The context of an occurrence of a named entity is typically taken to be the full sentence in which it appears, the full paragraph in which it appears, the full chapter in which it appears, or even the full document in which it appears. The competing Wikipedia pages are analyzed against the context of the named entity, and the “relatedness” between each of the competing pages and the context text is evaluated. The page found to be the most “related” to the context text is picked to be the intended interpretation of the named entity in its current occurrence.
“Relatedness” between two blocks of text obviously does not have a clear-cut measurement scale. Consequently, many algorithms were proposed in the research literature for measuring relatedness.
Some relatedness measuring algorithms are based on counting words that appear in both blocks.
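As a minimal sketch of such a word-counting measure, the following example counts the distinct words shared by two blocks of text and uses that count to pick the candidate most related to a context. The tokenization rule, the function names and the sample texts are assumptions made for illustration; practical systems would typically also remove stop words and normalize word forms.

```python
import re

def words(text):
    """Set of lowercase alphanumeric words in a block of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_relatedness(block_a, block_b):
    """Number of distinct words that appear in both blocks."""
    return len(words(block_a) & words(block_b))

def pick_most_related(context, candidate_texts):
    """Return the candidate name whose text shares the most words with the context."""
    return max(
        candidate_texts,
        key=lambda name: overlap_relatedness(context, candidate_texts[name]),
    )

# Made-up sample texts (not actual encyclopedia content).
context = "I admire Washington, the general who became the first president."
candidate_texts = {
    "George Washington":
        "George Washington was a general and the first president of the United States.",
    "Washington (state)":
        "Washington is a state in the Pacific Northwest region of the United States.",
}

print(pick_most_related(context, candidate_texts))  # George Washington
```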
Other relatedness algorithms are based on counting links from other blocks of text into the blocks whose relatedness is evaluated, working under the assumption that a third block containing links pointing to the two blocks whose relatedness is evaluated is a proof of a relation between them.
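One possible reading of this link-based idea is sketched below: the relatedness of two pages is taken to be the number of third pages that link to both of them. The function name `cocitation_relatedness` and the link data are hypothetical, and published link-based measures usually normalize this raw count in some way.

```python
def cocitation_relatedness(page_a, page_b, outgoing_links):
    """Count third pages that contain links to both page_a and page_b.

    `outgoing_links` maps each page title to the set of titles it links to.
    """
    return sum(
        1
        for source, targets in outgoing_links.items()
        if source not in (page_a, page_b)
        and page_a in targets
        and page_b in targets
    )

# Toy, made-up link structure (assumption for illustration only).
toy_outgoing = {
    "La Liga": {"FC Barcelona", "Real Madrid"},
    "El Clasico": {"FC Barcelona", "Real Madrid"},
    "Spain": {"Barcelona", "Madrid", "Real Madrid"},
    "Real Madrid": {"FC Barcelona", "Madrid"},
}

print(cocitation_relatedness("FC Barcelona", "Real Madrid", toy_outgoing))  # 2
print(cocitation_relatedness("Barcelona", "Madrid", toy_outgoing))          # 1
```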
Other relatedness algorithms allocate different weights to different words, with some words contributing more to the relatedness when appearing in both text blocks. For example, words that appear in headers or in links may be considered more “important” and should contribute more, or words that appear multiple times in one or two of the compared text blocks should contribute more than words that appear only once in each block.
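The weighting idea may be illustrated with the following sketch, in which a shared word contributes more when it belongs to a caller-supplied set of “important” words (for example, words taken from headers or links) and when it is repeated in either block. The weight of 3.0 and the use of the larger of the two occurrence counts are arbitrary assumptions chosen for illustration, not values taken from any cited work.

```python
import re
from collections import Counter

def tokenize(text):
    """List of lowercase alphanumeric words in a block of text."""
    return re.findall(r"[a-z0-9]+", text.lower())

def weighted_relatedness(block_a, block_b, important_words=(), important_weight=3.0):
    """Sum per-word contributions over the words shared by both blocks."""
    counts_a = Counter(tokenize(block_a))
    counts_b = Counter(tokenize(block_b))
    important = {word.lower() for word in important_words}
    score = 0.0
    for word in counts_a.keys() & counts_b.keys():
        weight = important_weight if word in important else 1.0
        # A word repeated in either block contributes more than a word
        # that appears only once in each block.
        score += weight * max(counts_a[word], counts_b[word])
    return score

block_a = "football football club"
block_b = "The football club plays in the stadium"

print(weighted_relatedness(block_a, block_b))                              # 3.0
print(weighted_relatedness(block_a, block_b, important_words={"club"}))    # 5.0
```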
Returning to the disambiguation problem, the most advanced disambiguation algorithms currently available use a semantic approach that takes into account the fact that there is usually some inter-dependence between multiple named entities appearing in the same context. When an analyzed block of text contains multiple ambiguous named entities, each having multiple candidates to select from, it is reasonable to assume that the correct selections for the multiple named entities depend on each other. This is in contrast to all the previously described disambiguation algorithms, which deal separately with each ambiguous named entity and pick the most reasonable candidate for it independently of the disambiguation selections made for the other ambiguous named entities.
Ignoring inter-relations between adjacent ambiguous named entities may result in a clearly incorrect interpretation of the analyzed sentence, even when each one of the named entities in question is assigned its most reasonable interpretation. For example, a text about a football game may say “The game between Amsterdam and Barcelona will take place in Madrid”. Each of the three named entities (“Amsterdam”, “Barcelona” and “Madrid”) is ambiguous—it may refer either to a city or to a football club associated with a city. As the text is known to be about football, most NED algorithms disambiguating each ambiguous named entity on its own will resolve all three named entities as referring to a football club, which is indeed the most reasonable decision for each of the three ambiguities. The analyzed sentence would then be assumed to mean “The game between Ajax Amsterdam and FC Barcelona will take place in Real Madrid,” which is obviously wrong. Thus, ignoring the semantic relations between named entities might lead to easy-to-detect failures in disambiguation.
The semantic approach to NED resolves the multiple ambiguities of the above example by searching for solutions in a single joint vector space combining the possible selections for all three named entities. In this case, a solution vector has a length of three and the solution space has 2×2×2=8 possible values from which we can choose.
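The joint solution space of the football example may be enumerated as in the sketch below, which lists all 2×2×2=8 combinations and keeps the one with the highest joint score. The scoring function `toy_score` is a hand-written, hypothetical stand-in for a real joint relatedness score: it simply rewards interpreting the two opponents as clubs and the venue as a city.

```python
from itertools import product

# Two candidates for each of the three ambiguous named entities.
candidates = {
    "Amsterdam": ["Amsterdam (city)", "Ajax Amsterdam"],
    "Barcelona": ["Barcelona (city)", "FC Barcelona"],
    "Madrid": ["Madrid (city)", "Real Madrid"],
}

def enumerate_solutions(candidates):
    """Return every combination of one candidate per named entity."""
    mentions = list(candidates)
    return [
        dict(zip(mentions, choice))
        for choice in product(*(candidates[m] for m in mentions))
    ]

def toy_score(solution):
    """Hypothetical joint score: opponents should be clubs, the venue a city."""
    clubs = {"Ajax Amsterdam", "FC Barcelona", "Real Madrid"}
    return (
        (solution["Amsterdam"] in clubs)
        + (solution["Barcelona"] in clubs)
        + (solution["Madrid"] not in clubs)
    )

solutions = enumerate_solutions(candidates)
best = max(solutions, key=toy_score)

print(len(solutions))  # 8
print(best)
# {'Amsterdam': 'Ajax Amsterdam', 'Barcelona': 'FC Barcelona', 'Madrid': 'Madrid (city)'}
```

Exhaustive enumeration is workable here only because the space is tiny; the number of combinations grows multiplicatively with each additional ambiguous entity.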
In some implementations of the semantic approach, the disambiguation is achieved using semantic relatedness scores obtained with a graph-based model, taking into account the semantic relationships between all named entities. For each Wikipedia page that is a candidate for one of the ambiguous named entities, a list of features is extracted: words in the page title, the most frequently used words in the page, words from the categories of the page, words from outgoing links in the page, etc. A graph is then constructed whose nodes are the candidates and the features, and the graph is used for determining semantic relatedness. The easier it is to move on the graph between two nodes, the more related the nodes are. The entity disambiguation algorithm is then based on a random walk on the graph. Applying such a semantic NED algorithm to the above example should produce the correct interpretation of “The game between Ajax Amsterdam and FC Barcelona will take place in Madrid.”
Most prior art NED implementations rely for the disambiguation task only on (i) the entries of the dictionary, encyclopedia, database or knowledge base in use (e.g., Wikipedia) that correspond to potential candidates for the ambiguous named entities, and (ii) the textual context of the ambiguous named entities. Usually only the other named entities appearing within the textual context are used, but some implementations also make use of words in the textual context that are not named entities. At least one prior art NED implementation (KORE, see the second research paper listed above) adds another source of information for disambiguating an ambiguous named entity: Internet websites that are associated with the candidates for the ambiguous named entity. For example, if the candidate entity is a person, then his or her Internet home page may be used. If the candidate entity is a company, then its Internet website may be used. If the candidate entity is a performer, then his or her fan website may be used.
The use of NLP is widespread and the technology is applied in many fields of use. Consequently, the use of NER and NED is also widespread, as practically all NLP implementations require named entity recognition and disambiguation.
For example, NED is widely used in understanding search queries. When a user asks Google's search engine “What is the height of Washington?”, the search engine needs to determine the meaning of the named entity “Washington” in the query: does the question refer to the height of a person or to the elevation above sea level of a city?
Another use of NED is in the field of content enrichment for video content consumers. When a user watches video content (a movie, a program, a news broadcast, etc.) on a viewing device (a TV set, a laptop, a smartphone, a tablet, etc.) it is common to present to him recommendations for related video content or other related information he may be interested in watching. The related content may be other movies or programs dealing with similar topics, biographical information about people mentioned or seen in the watched content, etc.
In many cases, the determination of what enriching content to recommend to the user is derived from the text heard in the sound track of the currently watched content. That text is obtained either from the subtitles of the movie or program that are provided in the stream of the watched content, or from an automatic conversion of the spoken text as it is heard in the sound track into written text using a speech-to-text conversion engine. In the case of subtitles in old movies, the text might be burned into the video, in which case extracting it from the video may require OCR technology. Regardless of the way by which the analyzed text is obtained, NED may be required. For example, if the text contains the named entity “Washington” there is a need to know if this refers to Washington State, Washington D.C. or George Washington. This determination will decide whether the TV system will recommend to the user the movie “Disclosure” (which was filmed in Washington State and takes place in Washington State), the TV series House of Cards (which takes place in Washington D.C.) or a documentary about George Washington.
The success rate of prior art NED implementations is not satisfactory. Even a success rate of 85% is considered to be very good (see the Hoffart Ph.D. thesis mentioned above). This is certainly not good enough for many real-world applications. A TV user may become highly frustrated when 15% of the recommendations he gets from his TV system turn out to be completely unrelated to what he is currently watching.
Therefore, there is clearly a need for NED implementations that provide better success rates than what is achievable with prior art NED solutions.
The following United States published patent applications are incorporated herein by reference in their entirety: United States Patent Publication 20170161367, United States Patent Publication 20170153782, United States Patent Publication 20170147924, United States Patent Publication 20170147635, United States Patent Publication 20170147557, United States Patent Publication 20170124065, United States Patent Publication 20170060835, United States Patent Publication 20170039272, United States Patent Publication 20170011092, United States Patent Publication 20160335234, United States Patent Publication 20160306984, United States Patent Publication 20160306789, United States Patent Publication 20160275148, United States Patent Publication 20160203130, United States Patent Publication 20160188597, United States Patent Publication 20160124937, United States Patent Publication 20160117360, United States Patent Publication 20160110350, United States Patent Publication 20160085740, United States Patent Publication 20160078245, United States Patent Publication 20160055845, United States Patent Publication 20160048655, United States Patent Publication 20160012040, United States Patent Publication 20160012021, United States Patent Publication 20160012020, United States Patent Publication 20150332049, United States Patent Publication 20150331850, United States Patent Publication 20150286629, United States Patent Publication 20150269139, United States Patent Publication 20150161237, United States Patent Publication 20150095306, United States Patent Publication 20150081281, United States Patent Publication 20140337372, United States Patent Publication 20140316768, United States Patent Publication 20140297252, United States Patent Publication 20140282219, United States Patent Publication 20140214820, United States Patent Publication 20140195532, United States Patent Publication 20140142922, United States Patent Publication 20140136184, United States Patent Publication 20140101542, 
United States Patent Publication 20140074886, United States Patent Publication 20140046653, United States Patent Publication 20140039879, United States Patent Publication 20130346421, United States Patent Publication 20130311467, United States Patent Publication 20130275438, United States Patent Publication 20130238312, United States Patent Publication 20130198268, United States Patent Publication 20130173604, United States Patent Publication 20130166303, United States Patent Publication 20130080152, United States Patent Publication 20120324350, United States Patent Publication 20120271624, United States Patent Publication 20120203772, United States Patent Publication 20120117078, United States Patent Publication 20120102045, United States Patent Publication 20110258556, United States Patent Publication 20110246442, United States Patent Publication 20110246076, United States Patent Publication 20110225155, United States Patent Publication 20110125735, United States Patent Publication 20100235313, United States Patent Publication 20100185689, United States Patent Publication 20100145902, United States Patent Publication 20100145678, United States Patent Publication 20100076972, United States Patent Publication 20100004925, United States Patent Publication 20090319257, United States Patent Publication 20090204596, United States Patent Publication 20090192968, United States Patent Publication 20090164431, United States Patent Publication 20090157705, United States Patent Publication 20090144609, United States Patent Publication 20080319978, United States Patent Publication 20080301112, United States Patent Publication 20080208864, United States Patent Publication 20080154871, United States Patent Publication 20080126076, United States Patent Publication 20080071519, United States Patent Publication 20080065621, United States Patent Publication 20080040352, United States Patent Publication 20070233656, United States Patent Publication 20070214189, United States Patent 
Publication 20070106493, United States Patent Publication 20070067285, United States Patent Publication 20070016580, United States Patent Publication 20060247983, United States Patent Publication 20060149555, United States Patent Publication 20060136385, United States Patent Publication 20060136208, United States Patent Publication 20060136196, United States Patent Publication 20060010138, United States Patent Publication 20050251382, United States Patent Publication 20050216443, United States Patent Publication 20050080613, United States Patent Publication 20050049852, and United States Patent Publication 20030217052.