The exemplary embodiment relates to the linguistic processing arts and finds particular application in connection with a system and a method for enriching documents, such as how-to guides, with links from actionable phrases to relevant information elsewhere in a document corpus.
How-to guides are widely used for providing instructions on how to accomplish a specific task, e.g., how to choose a PC, how to install an application on a smartphone, or how to cook spaghetti. There are numerous websites which allow users to post how-to guides so that others can search them. Most of the guides are written by enthusiastic, unpaid contributors, who generally do not consider that there may be relationships between the newly created content and previously created guides. Knowledge bases (KBs) that contain such content are valuable as the sum of their individual entries, but because each entry is created largely independently, users (software, agents, and managers) cannot take advantage of the accumulated knowledge that could be developed by aggregating related entries.
This is also the case in many commercial settings. Customer care departments managing KBs containing how-to guides for troubleshooting and implementation do not always follow rigorous processes for their creation. Business pressure and short iteration time frames do not allow time to re-organize and optimize the KBs regularly. In addition to making customer care sessions longer, this can cause problems where troubleshooting sessions are handled by software (e.g., a virtual agent) with little or no human supervision. For example, the virtual agent may be designed to provide a user with a pointer to a single entry in the KB, which may contain one or more instructions. If the user has a problem with one of the provided instructions, the only way of solving it may be to start another interaction/session.
It would therefore be desirable to be able to inter-link KB entries, allowing relevant information to be acquired from other parts of the KB.
While there have been studies on the organization of web forums, how-to knowledge, sometimes referred to as procedural knowledge, is often still poorly organized. See, e.g., Zhang, et al., “Automatically extracting procedural knowledge from instructional texts using natural language processing,” LREC'12, 2012, hereinafter, “Zhang 2012.” Methods have been proposed for linking part of one document to other document(s). The objective is to link a step, such as “Install an operating system,” to its sub-steps, e.g., “format a disc,” “create disc partitions,” “install drivers for a video card,” and so forth. See, Pareti, et al., “Integrating know-how into the linked data cloud,” Knowledge Engineering and Knowledge Management, pp. 385-396, 2014 (hereinafter, “Pareti 2014”). In that approach, the text is first segmented into steps and then a text search engine is used to find a set of candidate links for each step. A trained classifier is used to filter out irrelevant results. However, the results can still be quite noisy.
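By way of illustration only, the three stages of such an approach (segmentation into steps, retrieval of candidate links, and filtering) may be sketched as follows. This is a minimal sketch and not the implementation of Pareti 2014: the function names are hypothetical, a token-overlap ranking stands in for the text search engine, and a fixed score threshold stands in for the trained classifier.

```python
import re

def segment_steps(guide_text):
    """Split a guide into steps, one per non-empty line (a simplifying assumption)."""
    return [line.strip() for line in guide_text.splitlines() if line.strip()]

def tokens(text):
    """Lowercase alphabetic tokens of a text, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))

def search_candidates(step, corpus, top_k=3):
    """Stand-in for a text search engine: rank corpus titles by Jaccard overlap."""
    step_tokens = tokens(step)
    scored = []
    for title in corpus:
        title_tokens = tokens(title)
        union = step_tokens | title_tokens
        score = len(step_tokens & title_tokens) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, title))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]

def filter_candidates(candidates, threshold=0.3):
    """Stand-in for the trained classifier: keep candidates at or above a fixed score."""
    return [title for score, title in candidates if score >= threshold]

def link_steps(guide_text, corpus):
    """Map each step of a guide to the titles of the entries it is linked to."""
    return {
        step: filter_candidates(search_candidates(step, corpus))
        for step in segment_steps(guide_text)
    }

# Hypothetical guide and corpus, used only to exercise the sketch.
guide = "Install an operating system\nConnect the monitor"
corpus = [
    "How to install an operating system",
    "How to format a disc",
    "How to cook spaghetti",
]
links = link_steps(guide, corpus)
```

In this sketch, a step is linked only to entries that share surface vocabulary with it, so a sub-step such as “format a disc” would not be retrieved for the step “Install an operating system.” A lexical stand-in of this kind therefore illustrates the structure of the pipeline rather than the quality of its results.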
Others have developed methods for extraction of text spans. See, Zhang 2012, and Cécile Paris, et al., “Automated knowledge acquisition for instructional text generation,” Proc. 20th Annual Int'l Conf. on Computer Documentation, SIGDOC '02, pp. 142-151, 2002, hereinafter, “Paris 2002.”
U.S. Pub. No. 20120150920 describes a method for linking parts of a physical device shown in a graphical interface to corresponding noun phrases in a knowledge base that refer to the parts of the device. Verbs linked to the noun phrases are also identified using a lexicon of verbs that refer to physical actions on a device. However, the problem of linking extracted spans of text to other documents has not been addressed.