The present invention relates to a technique for automatically generating training data for disambiguation of a word or word string (hereinafter referred to as “entity”) related to a topic to be analyzed.
Analyzing what users say about major spots in a city (e.g., sightseeing spots) and about an event (e.g., a motor show) is important for the local government and the event organizer in understanding the reputation of, and needs for, the city and the event. To collect users' voices for analysis, the use of social media has been considered in recent years. Social media, and microblogging in particular, has more immediacy than traditional blogging. Therefore, what users feel at event sites and sightseeing spots is expected to be reflected more directly in social media messages.
Messages related to a topic (e.g., city or event) to be analyzed can be collected from social media sites by predefining a set of entities related to the topic and extracting messages that include at least one entity contained in the set. However, if any of the entities is ambiguous, the messages collected by the method described above may include messages unrelated to the topic. Therefore, it is necessary to disambiguate the entity and eliminate messages unrelated to the topic.
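The collection step described above can be illustrated by the following minimal sketch. The function name `collect_messages`, the case-insensitive substring matching, and the sample entities and messages are all hypothetical and are given only for illustration; they are not taken from any of the cited publications.

```python
def collect_messages(messages, entities):
    """Return the messages that mention at least one entity.

    Matching is a case-insensitive substring test, which is an
    assumption made for this sketch only.
    """
    lowered = [entity.lower() for entity in entities]
    return [m for m in messages if any(e in m.lower() for e in lowered)]


# Hypothetical entity set for a motor show topic.
entities = {"motor show", "jaguar"}

# Hypothetical messages; the second one uses "jaguar" in an unrelated sense.
messages = [
    "Queue at the motor show entrance is huge",
    "Saw a jaguar at the zoo today",
    "Lunch was great",
]

collected = collect_messages(messages, entities)
for message in collected:
    print(message)
```

In this sketch the second message is collected even though it is unrelated to the topic, which is the situation that makes disambiguation of the entity necessary.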
Many conventional semantic disambiguation algorithms have been based on supervised learning using a tagged corpus (see, e.g., A. Davis, et al., “Named Entity Disambiguation in Streaming Data.” Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 2012). In the example described above, the tagged corpus or training data is a set of messages in which each entity is assigned a binary label indicating whether the entity is related to the topic to be analyzed. However, in social media, where various topics are created every day, it is not realistic to generate training data manually. It is thus necessary to develop a technique for automatically acquiring training data for disambiguation.
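One possible representation of such training data is sketched below. The record type `LabeledMention`, its field names, and the sample messages are hypothetical; the sketch only illustrates a set of messages in which each entity carries a binary label, and does not reproduce the corpus format of any cited publication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledMention:
    """One training example: an entity occurrence in a message, with a binary label."""
    message: str   # text of the social media message
    entity: str    # the (possibly ambiguous) entity found in the message
    related: bool  # True if the entity is used in the sense of the topic


# Hypothetical, manually labeled examples for a motor show topic.
training_data = [
    LabeledMention("The new jaguar on the main stage looks fast", "jaguar", True),
    LabeledMention("Saw a jaguar at the zoo today", "jaguar", False),
]

positives = [ex for ex in training_data if ex.related]
negatives = [ex for ex in training_data if not ex.related]
print(len(positives), len(negatives))
```

Every label in such a corpus has to be assigned by hand, which is why manual generation does not scale to topics that change daily.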
The publications D. Spina, et al., “Discovering Filter Keywords for Company Name Disambiguation in Twitter.” Expert Systems with Applications 40.12 (2013): 4986-5003 and Z. Miklos, et al., “Entity-based classification of twitter messages.” International Journal of Computer Science & Applications 9. EPFL-ARTICLE-174746 (2012): 88-115, disclose techniques for automatically acquiring such training data. Specifically, the techniques disclosed in these publications use company websites and Wikipedia to acquire training data for disambiguation of ambiguous company names.
The publication E. L. Murnane, et al., “RESLVE: leveraging user interest to improve entity disambiguation on short text.” Proceedings of the 22nd international conference on World Wide Web companion. International World Wide Web Conferences Steering Committee, 2013, discloses a technique in which an interest model is built for users who write articles for Wikipedia, and the model is used to disambiguate entities included in messages that those users send via social media.
Japanese Patent Application Publication No. 2003-22275 discloses a technique in which, upon receipt of a user's search request including a search term and a user's selection of a field that matches a search purpose, a document search is performed by referring to a field-specific co-occurrence term DB and adding one or more co-occurrence terms.
Japanese Patent Application Publication No. 2014-002653 discloses a technique in which, from among morphemes acquired within a predetermined period, a morpheme that occurs in the same document as a search keyword is identified as a co-occurrence term.
Japanese Patent Application Publication No. 2014-032536 discloses a technique that extracts a document including a default topic tag from a plurality of documents, calculates a frequency of occurrence of a word in the extracted document, and extracts a document related to a topic from documents other than the document including the default topic tag by using the calculated frequency of occurrence.
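The general flow of such a frequency-based extraction can be sketched as follows. This is only an illustration under stated assumptions: the function name `extract_related`, the whitespace tokenization, the scoring rule (average relative frequency of a document's words), the threshold value, and the sample documents are all hypothetical and are not taken from the cited publication.

```python
from collections import Counter


def extract_related(documents, topic_tag, threshold):
    """Return untagged documents whose words occur frequently in tagged documents.

    Tokenization and scoring are assumptions made for this sketch only.
    """
    tag = topic_tag.lower()
    tokenized = [(doc, doc.lower().split()) for doc in documents]

    # Step 1: documents that include the default topic tag.
    tagged = [words for _, words in tokenized if tag in words]

    # Step 2: frequency of each word in the tagged documents (tag itself excluded).
    freq = Counter(w for words in tagged for w in words if w != tag)
    total = sum(freq.values())
    if total == 0:
        return []

    # Step 3: score the remaining documents by the average relative
    # frequency of their words and keep those at or above the threshold.
    related = []
    for doc, words in tokenized:
        if tag in words or not words:
            continue
        score = sum(freq[w] for w in words) / (total * len(words))
        if score >= threshold:
            related.append(doc)
    return related


# Hypothetical documents; "#motorshow" plays the role of the default topic tag.
documents = [
    "#motorshow new concept car unveiled",
    "#motorshow concept car looks amazing",
    "the concept car was stunning",
    "rainy day at home",
]
related = extract_related(documents, "#motorshow", threshold=0.05)
print(related)
```

Note that the sketch treats every document containing the tag as being on the topic, so the word frequencies are only as reliable as that assumption.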
The techniques disclosed in Spina, et al., Miklos, et al., and Murnane, et al. use external knowledge, such as Wikipedia. This means that these techniques are highly dependent on the extensiveness of the external knowledge. However, even Wikipedia, which is expected to serve as one of the most extensive knowledge sources, cannot fully cover information about entities which are generally unknown, and it is not always easy to deal with the diversity of topics discussed in social media.
The technique disclosed in Japanese Patent Application Publication No. 2003-22275 registers a document that has been used or is determined to be usable, and extracts co-occurrence terms from the registered document. However, to allow the extracted co-occurrence terms to effectively function as training data for disambiguation, it is essential to register an associated field together with the document. This involves manual work and is costly. Moreover, the document that has been used often cannot fully cover information about entities which are generally unknown.
In the technique disclosed in Japanese Patent Application Publication No. 2014-002653, a morpheme that occurs in the same document as a search keyword is identified as a co-occurrence term. However, the search keyword is not always used in the sense intended in the document. Therefore, even if a morpheme occurring in the same document as the search keyword is extracted as a co-occurrence term, the extracted morpheme may not function as training data for disambiguation of the search keyword.
The technique disclosed in Japanese Patent Application Publication No. 2014-032536 uses the frequency of occurrence of a word in a document which includes a default topic tag indicating a topic, to extract a document related to the topic. However, for example, if the document includes only one default topic tag, the document is not necessarily a document on the topic. This is all the more so if the only default topic tag included in the document is ambiguous. Therefore, a word that frequently occurs in a document including a default topic tag indicating a topic cannot reliably be used as training data for disambiguation.