Aspects of the exemplary embodiment relate to discourse analysis and find particular application in connection with a system and method for categorizing particular types of issues in technical forum posts.
Organizations often provide automated and semi-automated question answering systems to assist customers with a variety of tasks, such as selecting a suitable product to meet the user's criteria, troubleshooting a problem with a device or a medical condition, and the like. Developing such systems from scratch often involves the creation of a structured knowledge base in which questions are associated with possible answers. However, this can be extremely time consuming and often still leaves gaps, particularly when products are modified or new ones introduced.
There is a wealth of knowledge available from a disparate range of online resources, such as web forums. Forum users often introduce new problems and solutions, which may relate to new devices, and they describe first-hand user experiences with rich information in terms of which solutions are better than others and why. This creates new opportunities for organizations seeking to automate parts of user support and customer care services. However, it also creates challenges in being able to transform such noisy and, frequently, unstructured data into a form that is useful for the enterprise. Mining frequently discussed problems, identifying trends, and enriching a corresponding knowledge base, for example, prove difficult with this type of data.
Various attempts at mining forum posts have been made for various purposes. Raghavan, et al., “Extracting Problem and Resolution Information from Online Discussion Forums,” COMAD, p. 77, 2010, hereinafter, Raghavan 2010, describes a method that distinguishes between problem and solution posts where the forum structure does not indicate it. A CRF classifier is trained on technical forum corpora annotated with discourse moves. The classifier distinguishes between relevant discourse moves, which describe problems, pose problem queries, or suggest solutions and resolution steps, and those that are irrelevant for the classification, such as greetings and messages to the author.
Others have used techniques for dialogue act tagging and coherence-based discourse analysis to identify and link problem and solution pairs in troubleshooting forum posts. See, Kim, et al., “Tagging and linking web forum posts,” Proc. 14th Conf. on Computational Natural Language Learning, pp. 192-202, 2010; and Wang, et al., “Predicting thread discourse structure over technical web forums,” Proc. Conf. on Empirical Methods in Natural Language Processing, pp. 13-25, 2011. Links are labeled according to their relationship to the previous discourse act, e.g., as ADD, CONFIRMATION, CORRECTION, etc. These discourse markers are then used in the detection of resolved problems. See, Wang, et al., “The Utility of Discourse Structure in Identifying Resolved Threads in Technical User Forums,” COLING, pp. 2739-2756, 2012.
The identification and characterization of forum threads have been studied for the classification of troubleshooting threads. One approach distinguishes between specific and general problems, between complete and incomplete initial posts in the thread, and between resolved and unresolved threads. See, Baldwin, et al., “Automatic thread classification for Linux user forum information access,” Proc. 12th Australasian Document Computing Symp. (ADCS 2007), pp. 72-9, 2007 (hereinafter, Baldwin 2007). Another performs clustering of similar troubleshooting posts and builds hierarchies among post types. See, Medem, et al., “Troubleminer: Mining network trouble tickets,” IFIP/IEEE Intl Symp. on Integrated Network Management-Workshops (IM'09), pp. 113-119, 2009.
Investigations that aim at identifying and typing sentences/sections in forum posts are described in Sondhi, et al., “Shallow information extraction from medical forum data,” Proc. 23rd Intl Conf. on Computational Linguistics: Posters, pp. 1158-1166, 2010. CRF and SVM classifiers are used to distinguish between sentences describing physical examination and those describing medication. Mukherjee, et al., “Help Yourself: A Virtual Self-Assist Agent,” Proc. Companion Publication of the 23rd Intl Conf. on World Wide Web (WWW '14 Companion), pp. 171-174, 2014, extracts segments from documents. Here, each segment corresponding to a different topic found in the document is defined as a basic intent.
A review of question classification in question answering systems is provided in Loni, “A survey of state-of-the-art methods on question classification. Literature Survey,” TU Delft Repository, pp. 1-39, 2011. Loni defines question classification as the task of predicting the entity type or category of the expected answer. However, traditional question answering does not deal with identifying and extracting the questions from unstructured text, but only with typing them. Li, et al., “Learning question classifiers,” Proc. 19th Int'l Conf. on Computational Linguistics (COLING '02), Vol. 1, pp. 1-7, 2002, proposes a taxonomy, but this is oriented towards open domain information retrieval and the categories are not necessarily useful in other domains.
There remains a need for a categorization framework for issues which focuses on the type of answer being sought.