The present disclosure relates generally to case-based reasoning (CBR), and in particular, to a method of using archive data to retrieve case-based information.
CBR attempts to model human reasoning by recalling previous experiences and adapting them to a current situation. In CBR, the primary knowledge source is a memory of stored cases recording prior specific episodes, or issues. A common application of CBR is in diagnostic systems. The notion is that similar prior issues are a useful starting point for solving new issues. Thus, the success of a CBR knowledge system depends both on the case base (stored memory) and on the system's ability to recognize and retrieve past cases that will be helpful to the current issue. Typically, a case is produced after an issue is solved and includes a description of the symptoms (e.g., fault codes, data plots, visual inspections), a description of the root cause of the issue, and a resolution of the issue (e.g., repairs). In addition to structured case records, useful diagnostic information may be stored in archive records that have not been converted into structured case records. Archive records may not have been converted for a variety of reasons, such as the cost of conversion when compared to the expected number of accesses to the records.
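As a minimal sketch of the structured case record and retrieval step described above, the following illustrates one possible representation. The field names (symptoms, root cause, resolution) follow the description in this disclosure, but the class layout and the symptom-overlap similarity measure are illustrative assumptions, not a prescribed implementation.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical structure for a CBR case record; field names follow the
# description above (symptoms, root cause, resolution) and are illustrative.
@dataclass
class Case:
    case_id: str
    symptoms: List[str]   # e.g., fault codes, observations from inspections
    root_cause: str       # diagnosed cause of the issue
    resolution: str       # repair or action that resolved the issue

def similarity(case: Case, new_symptoms: List[str]) -> float:
    """Fraction of the new issue's symptoms that match a stored case."""
    if not new_symptoms:
        return 0.0
    matches = sum(1 for s in new_symptoms if s in case.symptoms)
    return matches / len(new_symptoms)

# Retrieval: rank stored cases by symptom overlap with the current issue.
case_base = [
    Case("C1", ["P0301", "rough idle"], "faulty ignition coil", "replace coil 1"),
    Case("C2", ["P0171", "stalling"], "vacuum leak", "replace intake gasket"),
]
best = max(case_base, key=lambda c: similarity(c, ["P0301", "rough idle"]))
```

In a deployed system the similarity measure would typically weight symptoms and handle partial matches; the simple overlap fraction here is only meant to show the retrieve-by-similarity idea.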
A company that provides diagnostic services may store a large volume of archive records containing data describing solutions, or “solved problems.” For example, call centers usually have a mechanism for logging calls with some kind of summary of the problem and proposed solutions. Warranty databases generally keep records of labor operations for products under warranty. Medical insurance systems similarly keep a record of each billable medical expense. From such archives and databases it is possible to extract problems and solutions to create a case base for a diagnostic CBR system. In practice, however, this historical data (archives) is difficult to use and implement efficiently. The archives often contain “dirty,” incomplete, inconsistent, and/or out-of-date data. In addition, the archive data may include unstructured data. These and other issues require resolution before the case base is created.
A data record is “dirty” if it is defective in some way. In particular, a dirty data record may have fields that are blank but should be filled in, fields that are filled in with incorrect data, and/or a record format that is corrupted in some manner. If the record includes blocks of text, such blocks are considered “dirty” if the text is defective in some way, including nonstandard spellings and abbreviations, incorrect grammar, sentence fragments, and any other characteristics that make the text block difficult to read or process.
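The notion of a defective record described above can be sketched as a simple validity check. The required fields and the fault-code pattern below are assumptions made for illustration; an actual system would use whatever schema its archive records follow.

```python
import re

# Illustrative "dirty record" check: blank required fields or incorrectly
# formatted structured data. The field names and the OBD-II-style fault-code
# pattern are assumptions for this sketch, not part of the disclosure.
REQUIRED_FIELDS = ["symptoms", "root_cause", "resolution"]
FAULT_CODE = re.compile(r"^[PBCU]\d{4}$")  # e.g., "P0301"

def is_dirty(record: dict) -> bool:
    # Blank fields that should be filled in
    for f in REQUIRED_FIELDS:
        if not record.get(f, "").strip():
            return True
    # Fields filled in with incorrectly formatted data
    codes = record.get("fault_codes", [])
    if any(not FAULT_CODE.match(c) for c in codes):
        return True
    return False

clean_rec = {"symptoms": "rough idle", "root_cause": "coil",
             "resolution": "replace coil", "fault_codes": ["P0301"]}
dirty_rec = {"symptoms": "", "root_cause": "coil",
             "resolution": "replace coil", "fault_codes": []}
```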
The concept of “dirty data” (also referred to herein as “corrupted data”) is context and system dependent. Dirty data is capable of being cleaned, repaired, or filtered. Cleaning and repairing, in this context, are essentially synonyms. The two terms refer to filling in blank values, correcting incorrect values, and fixing faulty formatting. In short, a correct record replaces the defective record. There are many methods for cleaning data, some requiring little human supervision and others performed entirely by a human. With text blocks, for example, the block could be scanned by a spell checker, edited by a human editor, or both. However, both of these methods are error-prone and time-consuming.
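The automated end of the cleaning spectrum described above can be sketched as follows: fill in blank values with a placeholder and normalize nonstandard abbreviations in text fields. The abbreviation table and placeholder value are assumptions for this sketch.

```python
# Minimal automated-cleaning sketch: normalize nonstandard abbreviations
# (a common text-block defect) and fill blank values so the record can be
# processed. The abbreviation table below is an illustrative assumption.
ABBREVIATIONS = {"repl": "replaced", "cust": "customer", "eng": "engine"}

def clean_text(text: str) -> str:
    """Expand known nonstandard abbreviations in a text block."""
    words = [ABBREVIATIONS.get(w.lower(), w) for w in text.split()]
    return " ".join(words)

def clean_record(record: dict, required=("symptoms", "resolution")) -> dict:
    """Return a cleaned copy: blanks filled, abbreviations expanded."""
    cleaned = dict(record)
    for f in required:
        value = cleaned.get(f) or "UNKNOWN"  # fill in blank values
        cleaned[f] = clean_text(value)       # fix nonstandard abbreviations
    return cleaned
```

Fully automated cleaning of this kind is itself error-prone, which is why the disclosure notes that both machine and human cleaning carry costs.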
Lazy learning is a branch of machine learning that (1) defers processing until there is a request for information, (2) when there is such a request, generates the answer by processing only the stored data needed to do so, and (3) when the query has been answered, discards the answer instead of saving it. The idea behind lazy learning is to generate knowledge “just in time” from raw data rather than to precompute the knowledge for future use. Lazy learning has emerged from the need to save, or redistribute in time, the computational and storage burdens of machine learning methods.
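The three steps above can be sketched as follows: the raw archive is stored untouched, work happens only when a query arrives and touches only the records that match, and the computed answer is returned without being cached. The archive contents and matching rule are hypothetical.

```python
# Sketch of the lazy-learning pattern: (1) no precomputation over the raw
# archive, (2) on a query, process only the records needed to answer it,
# (3) return the answer without saving it for reuse.
raw_archive = [
    {"problem": "printer offline", "solution": "power-cycle printer"},
    {"problem": "printer jam", "solution": "clear paper path"},
]

def answer_query(query: str):
    # (2) Scan only for records relevant to this request.
    matches = [r for r in raw_archive if query in r["problem"]]
    best = matches[0]["solution"] if matches else None
    # (3) The answer is discarded by the caller, not stored in the archive.
    return best
```

A precomputed (eager) approach would instead index or convert the whole archive up front; lazy learning trades repeated query-time work for avoiding that up-front cost, which is the trade-off this disclosure exploits for archive records.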
Many companies maintain large archives of “solved problems,” with significant portions of the records stored as text. In an environment where neither time nor resources are constrained, these large archives can be analyzed to create a case base following standard CBR methodologies. However, resources are typically constrained. Therefore, solved-problem archives are often abandoned, and new cases are created from new information (e.g., analytical cases might be fabricated by interviewing experts, and/or a new CBR-friendly process for capturing cases might be established in the call-center workflow). However, valuable information about how issues were resolved in the past will be lost if the archive records are abandoned. Therefore, there is a need for a process to use the information about past issues (and their symptoms and solutions) without requiring the conversion of all of the archive data records into a CBR format. This process would allow for accessing solutions that have worked in the past, while avoiding the cost of converting a large volume of potentially infrequently accessed archive records into a CBR format.