1. Field of the Invention
This invention relates generally to rule-based parsing of text documents and, more particularly, to a method for establishing a co-relationship between terms in a document—including terms such as parts, service actions, symptoms, and failure modes—which uses sentence boundary detection to establish tentative term pairings, and then analyzes each tentative term pairing for validity.
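By way of illustration only, the two stages named above (sentence boundary detection to form tentative term pairings, followed by a validity check on each pairing) can be sketched as follows. This is a minimal sketch under stated assumptions and not the claimed method: the `LEXICON` contents, the punctuation-based sentence splitter, and the category-pair rule in `ALLOWED_CATEGORY_PAIRS` are all hypothetical stand-ins chosen for the example.

```python
import re
from itertools import combinations

# Hypothetical term lexicon (assumption): maps a term to its category.
LEXICON = {
    "fuel pump": "part",
    "battery": "part",
    "replaced": "action",
    "inoperative": "symptom",
    "dead": "symptom",
    "internal short": "failure_mode",
}

# Hypothetical validity rule (assumption): only these category
# combinations are accepted as valid co-relationships.
ALLOWED_CATEGORY_PAIRS = {
    frozenset(("part", "symptom")),
    frozenset(("symptom", "action")),
    frozenset(("part", "failure_mode")),
}


def split_sentences(text):
    """Naive sentence boundary detection: split on ., !, ? and ;.

    A simplification; abbreviations and fragments in real technician
    verbatims would need a more careful detector.
    """
    return [s.strip() for s in re.split(r"[.!?;]+", text) if s.strip()]


def find_terms(sentence):
    """Return (term, category) tuples for lexicon terms found in the sentence."""
    lowered = sentence.lower()
    return [
        (term, category)
        for term, category in LEXICON.items()
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    ]


def tentative_pairs(text):
    """Pair every two terms that occur within the same sentence."""
    pairs = []
    for sentence in split_sentences(text):
        pairs.extend(combinations(find_terms(sentence), 2))
    return pairs


def is_valid(pair):
    """Accept a tentative pair only if its categories form an allowed combination."""
    (_, category_a), (_, category_b) = pair
    return frozenset((category_a, category_b)) in ALLOWED_CATEGORY_PAIRS


def valid_pairs(text):
    """Keep only the tentative pairs that pass the validity check."""
    return [pair for pair in tentative_pairs(text) if is_valid(pair)]
```

For example, calling `valid_pairs("Customer states battery dead. Replaced fuel pump.")` forms one tentative pairing per sentence, and under the hypothetical rule above only the part and symptom pairing from the first sentence is retained. Terms in different sentences are never paired, which is the role the sentence boundary plays in this sketch.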
2. Discussion of the Related Art
Modern vehicles are complex electro-mechanical systems that employ many sub-systems, components, devices, and modules, which pass operating information among one another using sophisticated algorithms and data buses. As with anything, these types of devices and algorithms are susceptible to errors, failures, and faults that can affect the operation of the vehicle. To help manage this complexity, vehicle manufacturers develop fault models, which match the various failure modes with the symptoms exhibited by the vehicle.
Vehicle manufacturers commonly develop fault models from a variety of different data sources. These data sources include engineering data, service procedure documents, text verbatim records from customers and repair technicians, warranty data, and others. Also, given the enormous volume of warranty data available in electronic format, a need arises to automatically classify and cluster these documents in order to identify best-practice diagnostic knowledge from them. While all of these types of data sources can be useful for creating fault models, or for classifying or clustering documents, performing these activities manually can be time-consuming, labor-intensive, and in some cases somewhat subjective. In addition, manually created fault models may not consistently capture all of the failure modes, symptoms, and correlations which exist in the vehicle systems. Similarly, documents clustered or classified without taking into account term co-relationships, such as a part and a symptom, a symptom and a service action, or a part and a failure mode, may not support accurate discovery of best-practice diagnosis knowledge. Therefore, methods have been developed to automatically extract diagnosis data that can be used for fault model construction or for classifying/clustering documents by establishing correct co-relationships between the terms extracted from various types of documents. It is particularly challenging to extract diagnosis data from unstructured documents, such as those containing text verbatim data from repair technicians, as these documents typically contain sentence fragments, abbreviations, misspellings, and other shorthand notation which make analysis difficult. Nonetheless, these unstructured text documents may contain a wealth of service history information which can be valuable to include in fault models or can be used to classify/cluster documents correctly.
There is a need for a methodology which enables the extraction of diagnosis data from unstructured text documents, such as service technician text verbatim documents, by establishing valid term co-relationships. The term co-relationship data can be used in an overall fault model development methodology to improve the efficiency and accuracy of fault model creation from unstructured text documents. This data can also be used to classify/cluster documents correctly and meaningfully, so that best-practice diagnosis knowledge can be discovered.