1. Field of the Invention
The present invention generally relates to a method of analyzing contents of two electronic databases, typically in text form, as a form of data mining. Specifically, a first database contains data on problems and a second database contains data on solutions. A method is presented to discover knowledge gaps, that is, problems in the first database for which an appropriate corresponding solution is missing from the second database.
2. Description of the Related Art
A typical example of electronic databases assisting in solving real world problems is a scenario involving the helpdesk operator. Human helpdesk operation is very labor intensive and therefore expensive. Consequently, automation of helpdesk problem solving represents a key objective for providers of electronic customer services.
In a typical conventional system, "free form" computer helpdesk data sets consist primarily of short text descriptions, composed by the helpdesk operator for the purpose of summarizing what problem a user had and what was done by the helpdesk operator to solve the problem. A typical text document (known as a problem ticket) from such a data set consists of a series of exchanges between an end user and an expert helpdesk advisor, for example:
1836853 User calling in with WORD BASIC error when opening files in word. Had user delete NORMAL.DOT and had her reenter Word. She was fine at that point. 00:04:17 ducar May 2:07:05:656PM
Such problem tickets may comprise only a symptom and resolution pair, as in the above example, or they may span multiple questions, symptoms, answers, attempted fixes, and resolutions, all pertaining to the same basic issue. Problem tickets are opened when the user makes the first call to the helpdesk and closed when all user problems documented in the first call are finally resolved in some way. Helpdesk operators enter problem tickets directly into the database. Spelling, grammar, and punctuation are inconsistent. The style is terse and the vocabulary is very specialized. Such problem tickets are normally saved in some kind of database which maintains a record of all interactions between users and helpdesk operators over a given time period. This record is referred to as a "helpdesk log".
In addition to a log of problem tickets, most helpdesk support units have some repository of solutions that document how to solve the most commonly occurring problems. In the present application, this repository of solutions is referred to as a "Solutions Knowledge Base" (SKB). While the implementation of an SKB may vary, at its most fundamental level an SKB most often consists of a set of electronic text documents, each of which solves one or more specific user problems.
The problem that this invention addresses is that of rapidly discovering the areas or categories of problems in the helpdesk logs that are not well represented in the Solutions Knowledge Base. In the present application, such areas of poor representation are referred to as "knowledge gaps". The more rapidly and accurately these knowledge gaps are discovered, the better that engineering or other resources can be applied to write new solutions that will have the most beneficial impact.
Past approaches to finding knowledge gaps relied primarily either on expert, comprehensive knowledge of both the problem space and the Solutions Knowledge Base, or on a manual perusal of text documents in the helpdesk log and the SKB. The first approach relies too heavily on scarce expert resources, while the second is impractical for large helpdesk logs and SKBs.
In view of the foregoing and other problems, it is, therefore, an object of the present invention to provide a structure and method for discovering and isolating knowledge gaps between two databases.
It is another object of the present invention to provide a method of discovering a class of documents that are most unlike a known set of document classes.
It is yet another object of the present invention to provide a method of determining where to best apply resources for finding solutions to problems.
It is yet another object of the present invention to provide a method to cross correlate two databases in a way that identifies possible content deficiencies in one of the two databases.
It is yet another object of the present invention to provide a method of improving knowledge base quality.
It is yet another object of the present invention to decrease the cost of knowledge base maintenance.
A main idea of this invention is to analyze, data mine, and summarize the text data sets of problem reports (problem tickets) using an automated unsupervised clustering algorithm in concert with a human data analyst. A goal is to discover those classes of problem tickets that are not well represented in a set of solution documents.
Generally, with the invention, one solution to the above problems is based on the following procedure, which has been successfully implemented in a computer program. In this description it is assumed that an initial helpdesk log text data set, i.e., a problem database P, and a solution knowledge base text data set S have been developed. To identify knowledge gaps, the following steps are executed:
1. Identify a dictionary D of frequently-used words in the problems database P.
2. Count the occurrences of dictionary words in documents of the problems database P.
3. Develop a set of problems categories C in problems database P.
4. For solutions database S, generate a new vector space model, by counting occurrences of the words in D in each document in S.
5. Calculate the distance between every document in S and the mean (centroid) of every problems category C.
6. For each category Cj, find the distance to the nearest document in S. Call this the category gap score.
7. Sort the categories in order of decreasing gap score.
8. List the first N categories, i.e., those with the highest gap scores.
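The eight steps above can be sketched in Python as follows. This is a minimal illustration only, not the computer program referred to above. Several details are assumptions that the text does not specify: the tokenizer, the document-frequency threshold `min_docs` used to select dictionary words, the use of a simple k-means procedure (seeded with the first k tickets) for the unsupervised clustering of step 3, and cosine distance as the distance measure. All function and parameter names are hypothetical, and the solutions database S is assumed to be non-empty.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_dictionary(problems, min_docs=2):
    # Step 1: keep words that appear in at least min_docs problem tickets
    # (the threshold is an assumption).
    doc_freq = Counter()
    for doc in problems:
        doc_freq.update(set(tokenize(doc)))
    return sorted(w for w, n in doc_freq.items() if n >= min_docs)

def vectorize(doc, dictionary):
    # Steps 2 and 4: count occurrences of each dictionary word in a document.
    counts = Counter(tokenize(doc))
    return [float(counts[w]) for w in dictionary]

def cosine_distance(a, b):
    # Assumed distance measure; the text does not name one.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(vectors, k, iterations=20):
    # Step 3: a simple k-means stand-in for the clustering algorithm,
    # seeded deterministically with the first k vectors.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda j: cosine_distance(v, centroids[j]))
            for v in vectors
        ]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = centroid(members)
    return labels, centroids

def knowledge_gaps(problems, solutions, k, top_n):
    dictionary = build_dictionary(problems)
    p_vecs = [vectorize(d, dictionary) for d in problems]
    labels, centroids = kmeans(p_vecs, k)
    s_vecs = [vectorize(d, dictionary) for d in solutions]
    # Steps 5 and 6: gap score = distance from a category centroid
    # to the nearest solution document.
    scores = [
        (j, min(cosine_distance(c, s) for s in s_vecs))
        for j, c in enumerate(centroids)
    ]
    # Steps 7 and 8: sort by decreasing gap score and keep the first N.
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_n], labels

if __name__ == "__main__":
    P = [
        "printer jam paper tray",
        "password reset locked account",
        "printer paper jam again",
        "account locked password reset needed",
    ]
    S = ["To reset a locked account password, open the admin console."]
    top, labels = knowledge_gaps(P, S, k=2, top_n=1)
    print(top, labels)
```

In this toy data the solutions database covers only password problems, so the printer category is the one reported as the largest gap. In practice, a human data analyst would review and refine the categories before the gap scores are interpreted.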
Although the following discussion continues with the example of a helpdesk operation, this is only one of various possibilities. For example, other organizations that could benefit from this invention might include an airline maintenance organization or an automotive workshop. A Patent Office could use it to develop and routinely update patent categories, based on correlating a database of issued patents and/or pending applications with a database containing patent categories. A customer service organization or sales organization could use it by setting up a first database to document sales requests or customer complaints and a second database to document the solutions ultimately resolving the request or complaint. Similarly, an organization developing a maintenance manual or a procedures manual could use this method to identify and address gaps in their coverage, either as an initial pre-release screening or as part of a routine update process.
In a first aspect of the present invention, a method of determining knowledge gaps between a first database P containing a set of problems records and a second database S containing solutions documents is disclosed, including developing a set of clusters of the problems records of the first database P, each cluster having a centroid, developing a dictionary D having entries based on lexicographical patterns in the problems records in the first database P, developing a vector space correlated to the solutions documents in the second database S, where the vector space is based on the dictionary D entries, developing a listing of distances between the cluster centroids and the vector space, and determining a knowledge gap for each cluster, where the knowledge gap is defined as the minimum distance in the listing.
In a second aspect of the present invention, an apparatus for discovering a class of documents most unlike a known set of document classes is disclosed, including a computer having at least one CPU, a first database P containing a set of problems records and accessible to the computer, and a second database S containing solutions documents also accessible to the computer, wherein the computer contains a program providing instructions described above.
In a third aspect of the present invention, a system for determining knowledge gaps between a first database P containing a set of problems records and a second database S containing solutions documents is disclosed, including a computer having at least one CPU, a first database P containing a set of problems records accessible to the computer, and a second database S containing solutions documents accessible to the computer, wherein the computer contains a program providing instructions described above.
In a fourth aspect of the present invention, a signal-bearing medium is provided that tangibly embodies a program of machine-readable instructions executable by a digital processing apparatus to perform the above-mentioned method of discovering knowledge gaps between a first database P containing a set of problems records and a second database S containing solutions documents.
With the unique and unobvious aspects of the invention, it is possible in any general information retrieval problem to discover a class of documents that are most unlike a known set of document classes. The invention also provides an improvement in knowledge base quality and a decreased cost of knowledge base maintenance.