Extracting, filtering, and evaluating information from large electronic text documents can be difficult because the documents lack an inherently defined structure and the information they contain is high-dimensional. For example, an Aircraft Maintenance and Operations Support System (AMOSS) requires knowledge from airline maintenance manuals to make intelligent decisions. These manuals are written in a human-readable, unstructured format and include content such as flowcharts for fault isolation, repair procedures for fault rectification, observations, and lists of fault codes for various possible faults.
Current information-mining techniques, such as hierarchical keyword searches, statistical and probabilistic techniques, and summarization using linguistic processing, clustering, and indexing, dominate the unstructured text-processing arena. The most prominent and successful of these techniques require large databases that include domain-specific keywords, comprehensive domain-specific thesauruses, and computationally intensive processing.
For example, when classifying and identifying fault information from complex unstructured data, airline manuals such as maintenance manuals, fault isolation procedures, troubleshooting manuals, repair manuals, and wiring diagram manuals can be read and understood only by humans. Currently, document-to-knowledge (D2K) tools are used to convert the unstructured, otherwise meaningless data in the manuals into meaningful data. A D2K tool uses a text extraction module as its main processor for extracting intelligent information from the manuals, including from unstructured text. This module uses a regular expression-based search engine to identify and classify fault information. Regular expressions are sometimes referred to as regex, grep, or pattern matching. Writing regular expressions is cumbersome and time consuming, and it requires domain knowledge, expertise in the field, and significant human effort. For aircraft manuals consisting of large text documents, it can take nearly four months to classify and identify fault information using the regular expression search engine. Whenever the unstructured text documents are updated, extracting the desired information from the updated documents using regular expressions can again be very time consuming. In addition, the accuracy of the identified and classified information is only about 70%.
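The regular expression-based approach described above can be illustrated with a minimal sketch. It is not the D2K tool's actual implementation: the sample manual text, the fault-code format, and the names `SAMPLE_MANUAL`, `FAULT_PATTERN`, and `extract_faults` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical manual excerpt. The "FAULT CODE nn-nn-nn:" layout is an
# assumption made for this example, not a format taken from a real manual.
SAMPLE_MANUAL = """\
FAULT CODE 21-31-05: Cabin pressure controller failure.
Observation: CABIN ALT warning light illuminated.
FAULT CODE 29-11-02: Hydraulic pump low pressure.
Observation: HYD PRESS caution message displayed.
"""

# A hand-written pattern that matches only lines in the assumed layout.
FAULT_PATTERN = re.compile(
    r"^FAULT CODE (?P<code>\d{2}-\d{2}-\d{2}):\s*(?P<description>.+)$",
    re.MULTILINE,
)


def extract_faults(text):
    """Return (fault code, description) pairs found by the pattern."""
    return [
        (match.group("code"), match.group("description").strip())
        for match in FAULT_PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    for code, description in extract_faults(SAMPLE_MANUAL):
        print(code, "|", description)
```

A pattern like this matches only the exact layout it was written for. A line worded as `Fault 21-31-05 - Cabin pressure controller failure.` would be missed, which suggests why such expressions need domain expertise to write and must be revisited whenever the manuals are updated.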