Electronic data is available from a myriad of sources. The data can be structured, as is the case with coded fields and timestamps, or unstructured, as found in free-form natural-language text fields (such as those often labeled “comments” or “description”).
Rule-based systems, data mining systems, data analytics systems, and many other systems work on structured data. However, there is important information in unstructured data, and processing and analysis of this information can provide great business value. To enable this processing and analysis, the unstructured data needs to be converted into the structured type of data on which these systems work.
A glossary is a collection of specialized terms or phrases and their meanings. Many methodologies for processing text documents and fields are based on using such glossaries. For example, one use for glossaries is in processing (and performing text analytics on) post-sales product data and failures, such as warranty claims and field service reports. The glossaries are used to pull out symptoms, causes, actions, and components from failure reports.
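As a rough illustration of how a glossary can turn free text into structured fields, the following minimal sketch matches glossary phrases against a failure report and returns the category and canonical term for each match. The glossary entries, the category names, the sample report, and the function name `extract_terms` are all hypothetical and chosen only for illustration; they are not taken from any particular glossary or system.

```python
import re

# Hypothetical glossary: surface phrase -> (category, canonical term).
# Several surface phrases may map to the same canonical term.
GLOSSARY = {
    "won't start": ("symptom", "no start"),
    "no start": ("symptom", "no start"),
    "corroded": ("cause", "corrosion"),
    "replaced": ("action", "replace"),
    "battery cable": ("component", "battery cable"),
}


def extract_terms(text, glossary=GLOSSARY):
    """Return sorted (category, canonical term) pairs for phrases found in text."""
    lowered = text.lower()
    found = set()
    for phrase, entry in glossary.items():
        # Require non-word characters (or the text boundary) on both sides so
        # that a phrase does not match inside a longer word.
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(entry)
    return sorted(found)


# Hypothetical free-text field from a field service report.
report = "Customer states engine won't start. Found corroded battery cable, replaced."
print(extract_terms(report))
```

A lookup of this kind only recognizes phrases that already appear in the glossary, which is why the coverage of the glossary matters so much.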
In many instances, a glossary can be extremely large. For instance, in the area of automotive warranty claims, a car has several thousand parts, and each part typically has several types of failures. Further, failures can be caused by interdependencies between different parts. These factors alone create a very large number of entries, and the number increases significantly once variation in the language used to describe these failures is considered, because there are many ways to say the same thing, even in perfectly written English. Typographical errors, improper grammar, abbreviations, and other factors add further complexity.
It is impractical for a human to manually create all of the possible variations of each canonical form, let alone all of the possible canonical forms. There are procedures that assist manual creation by partially automating the glossary creation process. FIG. 1 shows a sample flow for a manual glossary creation process that includes initial automated phrase extraction. The flow begins with the creation of variations at the simple word level. Then, phrase variations that correspond to one specified phrase are automatically built. However, such a process still requires human intervention, at least for glossary administration and rule administration. So while these tools aid in glossary creation, the human effort they still require is too great for the text analytics to be justified.
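The two steps mentioned above, word-level variations followed by phrase variations built for one specified phrase, can be sketched as follows. This is only an assumed illustration, not the process of FIG. 1: the word variants, the dictionary `WORD_VARIANTS`, and the function `phrase_variations` are hypothetical, and combining the variants with a Cartesian product is one simple way such phrase variations could be built.

```python
from itertools import product

# Hypothetical word-level variants (abbreviations and misspellings) that a
# person would have to supply and maintain.
WORD_VARIANTS = {
    "battery": ["battery", "batt", "batery"],
    "cable": ["cable", "cbl"],
}


def phrase_variations(canonical, word_variants=WORD_VARIANTS):
    """Build every phrase variation by combining the variants of each word."""
    # Words without known variants contribute only themselves.
    options = [word_variants.get(word, [word]) for word in canonical.split()]
    return [" ".join(combo) for combo in product(*options)]


variations = phrase_variations("battery cable")
print(len(variations))  # 3 variants of "battery" x 2 of "cable" = 6
for variation in variations:
    print(variation)
```

Because the number of phrase variations is the product of the per-word variant counts, it grows quickly with phrase length, and the word-level variant lists themselves still have to be curated by hand.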