1. Field of the Invention
The present invention relates to a technique for extracting predicate expressions representing defects from text data related to use of products belonging to a specific product area.
2. Description of Related Art
Recently, defect detection techniques that detect defects occurring in a company's products by applying text mining to data reflecting users' voices about product use have attracted attention. Such data includes, for example, user reports posted on bulletin boards, complaint sites, and the like, and inquiry data collected at customer support centers. Early detection of defects enables an earlier response and can thus improve a company's competitiveness by preventing losses and damage to its reputation.
In defect detection techniques based on text mining, expressions about defects are extracted from the huge number of expressions occurring in a huge amount of text data, and, for example, a deviation or a change in the distribution of the extracted expressions is captured to finally detect the defects that require attention. In general, the dictionary of defect-related expressions to be extracted is created manually. However, expressions about defects vary widely and differ from one product area to another. It is therefore difficult to create the dictionary manually, and it is desirable that the dictionary be created by a computer.
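As a rough illustration of capturing a change in the distribution of extracted expressions, the following minimal Python sketch counts dictionary expressions in a baseline set and a recent set of texts and flags those whose smoothed rate has risen. The function names, the substring matching, the add-one smoothing, and the `min_ratio` threshold are all assumptions made for this example and are not taken from any of the cited literature.

```python
from collections import Counter

def count_hits(texts, dictionary):
    """Count, per dictionary expression, the texts that contain it."""
    counts = Counter()
    for text in texts:
        for expr in dictionary:
            if expr in text:  # naive substring match (assumption)
                counts[expr] += 1
    return counts

def rising_expressions(baseline, recent, dictionary, min_ratio=2.0):
    """Return expressions whose smoothed rate grew by at least min_ratio."""
    base = count_hits(baseline, dictionary)
    now = count_hits(recent, dictionary)
    flagged = []
    for expr in sorted(dictionary):
        # Add-one smoothing avoids division by zero (assumption).
        base_rate = (base[expr] + 1) / (len(baseline) + 1)
        now_rate = (now[expr] + 1) / (len(recent) + 1)
        if now_rate / base_rate >= min_ratio:
            flagged.append(expr)
    return flagged

if __name__ == "__main__":
    dictionary = {"overheats", "freezes"}
    baseline = ["works fine", "freezes sometimes", "great product", "nice"]
    recent = ["it overheats", "overheats badly", "freezes sometimes", "overheats again"]
    print(rising_expressions(baseline, recent, dictionary))
```

The sketch only works if the dictionary already contains the relevant defect expressions, which is why the quality of the dictionary matters.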
The following patent literature and non patent literature relate to the present invention and are discussed below:
    [PTL 1] Japanese Unexamined Patent Application Publication No. 2005-235014.
    [NPL 1] Sakai, Umemura, Masuyama, “Kootsuu-jiko-rei ni fukumareru jiko-gen'in-hyoogen no shimbun-kiji kara no chuushutsu (Extraction of Expressions concerning Accident Cause contained in Articles on Traffic Accidents)”, Shizen-gengo-shori (Journal of natural language processing) Vol. 13, No. 2, April 2006.
    [NPL 2] S. D. Saeger, K. Torisawa, J. Kazama, “Looking for Trouble”, Proceedings of the 22nd International Conference on Computational Linguistics (Coling2008), pages 185-192, Manchester, August 2008.
    [NPL 3] Kakimoto, Yamamoto, “Koobun-hen o mochiita nippoo kara no shoogai-joohoo chuushutsu (Extraction of trouble information from daily reports by using syntactic pieces)”, Gengo-shori-gakkai (The Association for Natural Language Processing), Dai 14-kai nenji taikai, Happyoo-rombun-shuu (In Proceedings of the 14th Annual Meeting of The Association of Natural Language Processing), March 2008.
    [NPL 4] Kurita Mitsuharu, and three others, “Web-fooramu no koobun-joohoo o mochiita toraburu-shuuto bunsho chuushutsu (Troubleshoot Document Extraction Using Sentence Structures of Web Forums)”, Joohoo-shori-gakkai (Information Processing Society of Japan), Zenkoku taikai kooen-rombun-shuu, Dai 70-kai (In Proceedings of the 70th IPSJ National Convention), March 2008.
Patent Literature 1 discloses a technique for automatically creating a dictionary used in mining. Specifically, Patent Literature 1 discloses an expression extraction device that extracts evaluation expressions from text describing the evaluation of a specific object to be evaluated, each evaluation expression indicating an evaluation of that object. The expression extraction device includes a registered expression storage unit that registers, as a registered expression, an evaluation expression whose polarity is predetermined, where a positive polarity represents a positive evaluation and a negative polarity represents a negative evaluation. The device further includes: an expression extraction unit for extracting, from the text, a plurality of evaluation expressions and a conjunctive expression indicating the conjunctive relationship between the evaluation expressions; a registered expression detection unit for detecting, out of the plurality of evaluation expressions, the evaluation expression that includes a registered expression stored in the registered expression storage unit; and a polarity determination unit for determining that the evaluation expression has the same polarity as the registered expression.
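The general idea of propagating polarity from registered expressions through conjunctive relationships can be illustrated with the following minimal Python sketch. It is not the device of Patent Literature 1: the seed dictionary, the two conjunctive classes, and in particular the rule that an adversative conjunctive flips the polarity are assumptions made only for this example.

```python
from typing import Dict, List, Optional

# Hypothetical registered expressions: +1 is positive, -1 is negative.
REGISTERED: Dict[str, int] = {"easy to use": 1, "noisy": -1}

# Hypothetical conjunctive classes; flipping on adversatives is an assumption.
ADDITIVE = {"and", "moreover"}
ADVERSATIVE = {"but", "however"}

def propagate_polarity(expressions: List[str],
                       conjunctives: List[str],
                       registered: Dict[str, int]) -> Dict[str, Optional[int]]:
    """expressions[i] and expressions[i + 1] are joined by conjunctives[i]."""
    polarity: List[Optional[int]] = [None] * len(expressions)
    # Step 1: detect expressions that contain a registered expression.
    for i, expr in enumerate(expressions):
        for reg, pol in registered.items():
            if reg in expr:
                polarity[i] = pol
                break
    # Step 2: spread polarity to neighbours until nothing changes.
    changed = True
    while changed:
        changed = False
        for i, conj in enumerate(conjunctives):
            sign = 1 if conj in ADDITIVE else -1 if conj in ADVERSATIVE else 0
            if sign == 0:
                continue  # unknown conjunctive: do not propagate
            left, right = polarity[i], polarity[i + 1]
            if left is not None and right is None:
                polarity[i + 1] = left * sign
                changed = True
            elif right is not None and left is None:
                polarity[i] = right * sign
                changed = True
    return dict(zip(expressions, polarity))

if __name__ == "__main__":
    print(propagate_polarity(["light", "easy to use", "battery drains fast"],
                             ["and", "but"], REGISTERED))
```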
Moreover, techniques for extracting expressions related to defects include those described in Non Patent Literatures 1 to 4. Non Patent Literature 1 defines, as seed expressions, expressions that are modified by expressions representing accident causes, and discloses a method for acquiring both accident cause expressions and seed expressions by repeating the following process: a seed expression is given manually, and accident cause expressions modifying the seed expression are acquired automatically; seed expressions are then acquired automatically from the acquired accident cause expressions; and further accident cause expressions are acquired from the acquired seed expressions.
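The alternating acquisition described for Non Patent Literature 1 follows a bootstrapping pattern, which can be sketched in Python as follows. The input representation (a list of `(modifier, modified)` pairs), the function name, and the fixed number of rounds are assumptions for illustration; the sketch also omits any scoring or filtering of candidates, which a practical method would likely need.

```python
def bootstrap(pairs, initial_seeds, rounds=2):
    """Alternately grow cause expressions and seed expressions.

    pairs: iterable of (modifier, modified) tuples, meaning that the
    modifier expression modifies the modified expression (hypothetical format).
    """
    pairs = list(pairs)
    seeds = set(initial_seeds)
    causes = set()
    for _ in range(rounds):
        # Cause expressions are those that modify a known seed expression.
        new_causes = {m for m, h in pairs if h in seeds} - causes
        causes |= new_causes
        # New seed expressions are those modified by a known cause expression.
        new_seeds = {h for m, h in pairs if m in causes} - seeds
        if not new_causes and not new_seeds:
            break  # nothing new was acquired
        seeds |= new_seeds
    return causes, seeds

if __name__ == "__main__":
    pairs = [("speeding", "caused"), ("speeding", "led to"),
             ("drowsy driving", "led to"), ("rain", "fell")]
    print(bootstrap(pairs, {"caused"}))
```

Starting from the single seed "caused", the first round acquires "speeding" and the new seed "led to", and the second round acquires "drowsy driving" through that new seed.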
Non Patent Literature 2 discloses a method for collecting expressions generally likely to be related to troubles by supervised learning. More specifically, Non Patent Literature 2 discloses a technique for collecting expressions generally likely to be related to troubles using, as positive evidence: (1) structural pattern information on hyponyms of “trouble” (lexico-syntactic patterns for hyponymy relations) and (2) dependency relations between negated verbs and objects (dependency relations between expressions and negated verbs) and using, as negative evidence, (3) dependency relations between non-negated verbs and objects (dependency relations between expressions and non-negated verbs).
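As a small illustration of how the three kinds of evidence might be tallied per candidate noun before being passed to a learner, consider the following Python sketch. The evidence labels, the input format, and the function name are hypothetical, and the supervised learning step itself is not shown.

```python
from collections import Counter, defaultdict

# Hypothetical labels for the three kinds of evidence described above.
KINDS = ("hyponym_pattern", "negated_verb", "plain_verb")

def evidence_features(observations):
    """Turn (noun, evidence_kind) observations into per-noun count vectors.

    The vector order follows KINDS: two positive evidence counts,
    then the negative evidence count.
    """
    counts = defaultdict(Counter)
    for noun, kind in observations:
        if kind not in KINDS:
            raise ValueError(f"unknown evidence kind: {kind}")
        counts[noun][kind] += 1
    return {noun: [c[k] for k in KINDS] for noun, c in counts.items()}

if __name__ == "__main__":
    observations = [("crash", "hyponym_pattern"), ("crash", "negated_verb"),
                    ("crash", "negated_verb"), ("window", "plain_verb")]
    print(evidence_features(observations))
```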
Non Patent Literature 3 discloses, as a method for expanding a trouble information dictionary: searching a syntactic piece list for the preceding section of the trouble information to be expanded; acquiring the ten most frequent subsequent sections taken by that preceding section as a high-ranking subsequent section list; searching the syntactic piece list using the ten subsequent sections in the high-ranking subsequent section list; acquiring the ten most frequent preceding sections taken by those subsequent sections as a high-ranking preceding section list; connecting the subsequent section subjected to expansion to each of the preceding sections in the high-ranking preceding section list; and then adding the results to the trouble information dictionary.
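The expansion steps described for Non Patent Literature 3 can be approximated by the following Python sketch. It assumes that the syntactic piece list is a list of `(preceding, subsequent)` pairs with one entry per occurrence, and it reads the final step as pairing the original subsequent section with each high-ranking preceding section; both points, as well as the function name, are assumptions of this example.

```python
from collections import Counter

def expand_trouble_info(pieces, trouble, k=10):
    """Expand one (preceding, subsequent) trouble entry into new candidates."""
    preceding, subsequent = trouble
    # Top-k subsequent sections taken by the given preceding section.
    sub_counts = Counter(s for p, s in pieces if p == preceding)
    top_subs = {s for s, _ in sub_counts.most_common(k)}
    # Top-k preceding sections that take any of those subsequent sections.
    pre_counts = Counter(p for p, s in pieces if s in top_subs)
    top_pres = [p for p, _ in pre_counts.most_common(k)]
    # Connect the original subsequent section to each high-ranking preceding section.
    return [(p, subsequent) for p in top_pres]

if __name__ == "__main__":
    pieces = [("screen", "froze"), ("screen", "froze"), ("screen", "went black"),
              ("display", "went black"), ("display", "went black"),
              ("keyboard", "is new")]
    print(expand_trouble_info(pieces, ("screen", "froze"), k=2))
```

In this toy data, "display" shares the subsequent section "went black" with "screen", so ("display", "froze") becomes a candidate entry, while "keyboard" does not qualify.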
Non Patent Literature 4 discloses a technique for extracting, from known troubleshoot documents, constructions that frequently occur in the text; more specifically, a technique for extracting constructions that frequently occur in known troubleshoot documents by converting sentences included in the troubleshoot documents to undirected graphs and acquiring a sub-graph common to the graphs.
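A heavily simplified Python sketch of the graph-based idea is given below. It assumes that each sentence is already available as a list of word pairs (for example, dependency pairs) and treats the common sub-graph as the set of undirected edges shared by all sentence graphs. This edge intersection over identically labeled nodes is a stand-in chosen for brevity; it is not general sub-graph mining and is not claimed to be the procedure of Non Patent Literature 4.

```python
def to_undirected_graph(word_pairs):
    """Represent a sentence as a set of undirected edges between words."""
    return {frozenset(pair) for pair in word_pairs}

def common_subgraph(graphs):
    """Return the edges present in every graph (empty set if no graphs)."""
    graphs = list(graphs)
    if not graphs:
        return set()
    common = set(graphs[0])
    for graph in graphs[1:]:
        common &= graph
    return common

if __name__ == "__main__":
    s1 = to_undirected_graph([("cannot", "connect"), ("connect", "network"),
                              ("laptop", "connect")])
    s2 = to_undirected_graph([("connect", "cannot"), ("network", "connect"),
                              ("phone", "connect")])
    for edge in common_subgraph([s1, s2]):
        print(sorted(edge))
```

Because the edges are undirected, ("cannot", "connect") and ("connect", "cannot") are treated as the same edge.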
However, the technique for automatically creating a dictionary disclosed in Patent Literature 1 targets evaluation expressions and relies on tendencies specific to them: in many cases, evaluation expressions occur in succession, a positive evaluation expression is preceded and followed by positive evaluation expressions, and a negative evaluation expression is preceded and followed by negative evaluation expressions. Thus, the technique in Patent Literature 1 cannot be applied to expressions related to defects, in which such tendencies are not observed.
Moreover, the method disclosed in Non Patent Literature 1 extracts accident cause expressions, and the expressions likely to be related to troubles that are extracted by the method disclosed in Non Patent Literature 2 are nouns. In general, such nouns represent, for example, the entities in which defects have occurred or the causes of the defects. Thus, expressions representing defect phenomena occurring in products cannot be extracted by the methods disclosed in Non Patent Literatures 1 and 2.
Moreover, a syntactic piece acquired as trouble information by the method disclosed in Non Patent Literature 3 represents a dependency relation or a series of phrases, and the method disclosed in Non Patent Literature 4 acquires constructions that frequently occur in troubleshoot documents. In defect detection techniques based on text mining, it is important to capture, for example, a deviation or a change in the distribution of extracted expressions, as described above. To this end, the extracted expressions need to occur with sufficient frequency in the data to be analyzed. Since long extraction targets such as syntactic pieces and constructions occur infrequently, they are inappropriate as expressions to be registered in a dictionary of expressions related to defects.