Data mining broadly seeks to expose patterns and trends in data, and most data mining techniques are sophisticated methods for analyzing relationships among highly formatted data, such as numerical data or data with a relatively small fixed number of possible values. However, a vast amount of information is expressed as text, including many database fields, reports, memos, e-mail, web pages, product descriptions, social media, and external news articles of interest to managers, market analysts, and researchers.
Text mining is an extension of the general notion of data mining in the area of free or semi-structured text. In contrast to data mining of highly formatted data, text data analysis (also referred to as “text mining” or simply “text analysis”) operates on the text itself, and may involve such functions as text summarization, text visualization, document classification, document clustering, and document cross-referencing. Thus, text data analysis may help a knowledge worker find relationships between individual unstructured or semi-structured text documents and semantic patterns across large collections of such documents.
Research in the area of text mining has its roots in information retrieval, which began around 1960, when researchers started to systematically explore methods to match user queries to documents in a database. However, recent advances in computer storage capacity and processing power, coupled with massive increases in the amount of text available on-line, have resulted in a new emphasis on applying techniques learned from information retrieval to a wider range of text mining problems. Generally speaking, text mining requires the ability to automatically assess and characterize the similarity between two or more sources of text.
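One common way to characterize the similarity between two sources of text is cosine similarity over term counts. The following is a minimal sketch under stated assumptions, not the method of this disclosure: the function names `term_counts` and `cosine_similarity` and the simple word tokenizer are illustrative choices.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-count vectors (0.0 to 1.0)."""
    a, b = term_counts(text_a), term_counts(text_b)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_a * norm_b)
```

Two texts that use the same words in the same proportions score near 1.0, and texts with no words in common score 0.0. Raw counts are used here for brevity; weighting schemes are a separate design choice.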
In general, text mining depends on the twin concepts of “document” and “term.” As used in this disclosure, a “document” refers to a body of free or semi-structured text. The text can include the entire content of a document in its general sense, such as a book, an article, a paper, a data record or the like, or a portion of a traditional document, such as an abstract, a paragraph, a sentence, or a phrase, for example, a title. Ideally, a document describes a coherent topic. In addition, a document can encompass text generated from an image or other graphics, as well as text recovered from audio or video formats.
On the other hand, a document can be represented as a collection of “terms,” each of which can appear in multiple documents. In some cases, a term can consist of an individual word used in the text. However, a term can also include multiple words that are commonly used together, for example, the part name “landing gear.” This type of term is at times referred to as a “multiword term.”
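The representation of a document as terms, including multiword terms, can be sketched as follows. This is an illustrative example only: the list `MULTIWORD_TERMS` and the function `extract_terms` are hypothetical names, and the second entry in the list is an assumed example that does not come from the text above.

```python
import re

# Hypothetical list of known multiword terms. "landing gear" is the example
# from the text; "fuel pump" is an assumed extra entry for illustration.
MULTIWORD_TERMS = [("landing", "gear"), ("fuel", "pump")]


def extract_terms(document, multiword_terms=MULTIWORD_TERMS):
    """Split a document into terms, merging known word sequences into one term."""
    words = re.findall(r"[a-z0-9]+", document.lower())
    # Try longer multiword terms first so the longest match wins.
    candidates = sorted(multiword_terms, key=len, reverse=True)
    terms = []
    i = 0
    while i < len(words):
        for candidate in candidates:
            n = len(candidate)
            if tuple(words[i:i + n]) == tuple(candidate):
                terms.append(" ".join(candidate))
                i += n
                break
        else:
            # No multiword term starts here, so the single word is the term.
            terms.append(words[i])
            i += 1
    return terms
```

For instance, `extract_terms("Inspected landing gear")` keeps “landing gear” together as one term, whereas the words of “landing light” remain separate because that sequence is not in the assumed list.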
Documents such as data records are created in many different applications, for example to provide a product description, a record of observations, a record of actions taken, or the like. In many instances, the data records are populated by free-form text that is entered by an author in order to document a particular event or activity. In order to sort, interpret, process or otherwise perform data analytics over the data records, it is often desirable to perform data or text mining to identify particular terms or multiword terms, such as part names, within the data records, from which particular information may then be identified. For example, it may be desirable to identify every data record that includes a particular part name so as to identify trends or issues or to otherwise discern the current status. Since data records are commonly populated with free-form text, it may be difficult to consistently identify particular part names within the data records. In this regard, different expressions may be utilized to represent the same concept, such as in the case of synonymous terms. Additionally, certain information, such as part names, within a data record may be abbreviated or misspelled, or acronyms may be employed, which further complicates efforts to consistently identify particular information within the data records.
By way of example, the airline industry relies upon data records entered by personnel in support of their engineering activities and engineering activities of industrial robots during pre-production, production and post-production of an aircraft or other manufactured product. In a more particular example, mechanics create data records relating to the results of inspections, repairs that have been undertaken and the like. The principal job of these mechanics is to maintain the aircraft in conformance with a schedule, such as a flight schedule or a maintenance schedule. These duties typically leave only limited time for documentation of the activities undertaken by the mechanics. As such, the mechanics may create data records in a relatively expedited fashion including, for example, the liberal use of abbreviations and acronyms, some of which are widely understood and some of which are developed ad hoc by the mechanics based upon, for example, the working conditions. As with the creation of any written record, the resulting data records may include spelling errors, erroneous spaces in words, omissions of spaces between words, or other typographical errors. Such misspellings and abbreviations may make it somewhat difficult to identify a particular word within a data record. By way of example, a computer may be referenced within a data record as a “computer,” a “comptr,” a “compter,” a “computor” or a “computo.” Complicating the situation, “comp” within a data record may reference a computer; however, it may, instead, reference a compressor, compartment, or a compensator.
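The difficulty described above can be made concrete with a small sketch that maps a token from a data record to candidate part names. It is an assumption-laden illustration rather than the technique of this disclosure: the vocabulary `PART_NAMES`, the function `candidate_part_names`, and the 0.75 similarity cutoff are all hypothetical, and the fuzzy comparison relies on `difflib.get_close_matches` from the Python standard library.

```python
import difflib

# Hypothetical vocabulary built from the part names mentioned in the text.
PART_NAMES = ["computer", "compressor", "compartment", "compensator"]


def candidate_part_names(token, vocabulary=PART_NAMES, cutoff=0.75):
    """Return the part names a free-text token might refer to."""
    token = token.lower()
    if token in vocabulary:
        return [token]
    # An abbreviation may be a prefix of several part names; return them all
    # so that the ambiguity stays visible to the caller.
    prefix_hits = [name for name in vocabulary if name.startswith(token)]
    if prefix_hits:
        return prefix_hits
    # Otherwise fall back to approximate matching to absorb misspellings.
    # The cutoff of 0.75 is an assumed value, not one given in the source.
    return difflib.get_close_matches(token, vocabulary, n=3, cutoff=cutoff)
```

Under these assumptions, misspellings such as “comptr” or “computor” resolve to a single candidate, while “comp” returns every part name that begins with it, which is precisely the ambiguity that a practical technique must resolve from context.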
The inconsistencies within data records as to the manner in which part names are referenced therefore make any subsequent identification of part names within the data records a challenge. This challenge is exacerbated by the large number of different part names, such as several thousand part names in the airline industry, with some of the part names only varying slightly from other part names. The challenge may also lead to inaccurate or incomplete data on which engineering or other activities of personnel and industrial robots are performed on a manufactured product, or in some instances on which personnel or industrial robots fail to perform such activities. Within the airline industry, the terminology, including the part names, may vary from airline to airline, from model to model, from fleet to fleet and/or from location to location, thereby further increasing the complexity of any subsequent efforts to analyze the data records. Furthermore, the number of data records may also be substantial and, in some instances, may number in the hundreds of thousands, thereby requiring that any technique for analyzing the data records be quite efficient if it is to be practical.
Therefore, it would be desirable to have a system and method that take into account at least some of the issues discussed above, as well as other possible issues.