Data records are created in many different applications, such as to provide a record of observations, actions taken or the like. In many instances, the data records are populated by free-form text that is entered by an author in order to document a particular event or activity. In order to sort, interpret or otherwise process the data records, it is oftentimes desirable to identify particular information, such as a part name, within the data records. For example, it may be desirable to identify every data record that includes a particular part name so as to identify trends or issues or to otherwise discern the current status. Since data records are commonly populated with free-form text, it may be difficult to consistently identify particular part names within the data records. In this regard, as the data records are frequently authored by different people, different expressions may be utilized to represent the same concepts. Additionally, certain information, such as part names, within a data record may be abbreviated or misspelled or acronyms may be employed which further complicate efforts to consistently identify particular information within the data records.
By way of example, the airline industry relies upon data records entered by mechanics relating to the results of inspections, repairs that have been undertaken and the like. The principal job of these mechanics is to maintain the aircraft in conformance with a schedule, such as a flight schedule or a maintenance schedule. These duties typically leave only limited time for documentation of the activities undertaken by the mechanics. As such, the mechanics may create data records in a relatively expedited fashion including, for example, the liberal use of abbreviations and acronyms, some of which are widely understood and some of which are developed ad hoc by the mechanics based upon, for example, the working conditions. As with the creation of any written record, the resulting data records may include spelling errors, erroneous spaces in words, omissions of spaces between words, or other typographical errors. Such misspellings and abbreviations may make it somewhat difficult to identify a particular word within a data record. By way of example, a computer may be referenced within a data record as a computer, a comptr, a compter, a computor, or a computo. Complicating the situation, “comp” within a data record may reference a computer; however, it may, instead, reference a compressor, compartment, or a compensator.
As a more particular example, one standard part name is: Overhead Panel Bus Controller (L). Within a data record, however, the Overhead Panel Bus Controller (L) may be differently referenced, such as follows:
    COMPLAINT: REF ADD 913 STS MSG LEFT O/H PNL BUS CONTROLLER INTERMITENT TAGS OFF K02648Y
    RESOLUTION: FIM ACTIONED AS PER MSG 23-48802 OPBC REPLACED IAW MM 23-93-01 GRND CHKS AND TESTS C/OUT SATIS TAGS ON B25092G
As demonstrated in the foregoing example, the Overhead Panel Bus Controller (L) may be referenced as an “O/H PNL BUS CONTROLLER” and an “OPBC”.
The inconsistencies within data records as to the manner in which part names are referenced therefore make any subsequent identification of part names within the data records a challenge. This challenge is exacerbated by the large number of different part names, such as several thousand part names in the airline industry, with some of the part names varying only slightly from other part names. Within the airline industry, the terminology, including the part names, may vary from airline to airline, from model to model, from fleet to fleet and/or from location to location, thereby further increasing the complexity of any subsequent efforts to analyze the data records. Furthermore, the number of data records may also be substantial and, in some instances, may number in the hundreds of thousands, thereby requiring that any technique for analyzing the data records be quite efficient if it is to be practical.
Techniques have been developed to identify information within data records that include free-form text. For example, efforts have been made to construct a knowledge base including lists of synonyms for at least some of the part names that appear within the data records. In this regard, the list of synonyms may include spelling variations including common misspellings as well as different names for the same part that are employed by different airlines. The data records may then be searched to identify data records that include one or more words or phrases as well as data records that include one or more synonyms for the words or phrases. Because of the substantial number of variations for any one word or phrase and further because of the challenges associated with handling ambiguities within a list of synonyms as a result of the absence of any context, the development of lists of synonyms for a number of words or phrases may be impractical such that efforts to develop a knowledge base including a synonym list for various words or phrases may prove to be less effective than desired.
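By way of illustration only, a synonym-list knowledge base of the type described above may be sketched as follows. The sketch is not drawn from any particular prior system; the table contents, the name SYNONYMS and the function find_parts are hypothetical and are populated solely from the variants mentioned in the foregoing examples.

```python
from typing import Dict, Set

# Hypothetical knowledge base: standard part name -> known variants.
# The entries are assumptions taken from the examples in the text.
SYNONYMS: Dict[str, Set[str]] = {
    "COMPUTER": {"COMPUTER", "COMPTR", "COMPTER", "COMPUTOR", "COMPUTO"},
    "OVERHEAD PANEL BUS CONTROLLER": {"O/H PNL BUS CONTROLLER", "OPBC"},
}


def find_parts(record: str) -> Set[str]:
    """Return the standard part names whose listed variants appear in the record."""
    # Normalize case and whitespace, then pad so that variants only match
    # when they are delimited by spaces on both sides.
    padded = " " + " ".join(record.upper().split()) + " "
    found = set()
    for standard_name, variants in SYNONYMS.items():
        if any(f" {variant} " in padded for variant in variants):
            found.add(standard_name)
    return found
```

In such a sketch, a record containing “OPBC” or “O/H PNL BUS CONTROLLER” is associated with the standard part name, whereas a record containing a variant that has not been listed, such as a new misspelling, is not identified at all. Likewise, an ambiguous abbreviation such as “COMP” cannot safely be added to any single list, since the lookup takes no account of context.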
Pattern recognition tools have also been developed to identify information within data records containing free-form text. In this regard, text mining algorithms and statistical methods have been developed to derive patterns based on context words with varying levels of success. However, in instances in which the context words have a large number of variations, it has proven somewhat difficult for pattern recognition tools to achieve as high a rate of success as would be desired.
Natural language processing techniques have also been developed in which each sentence in a data record is parsed as subject, verb, object, etc. and semantic meaning is attached thereto. Such natural language processing techniques have proven to be a challenge as the large number of ad hoc spellings and incorrect spellings make the identification of lexical items difficult, while the ungrammatical style of writing that is employed within some data records may increase the difficulty of parsing.
Spell checkers have also been suggested in conjunction with the authoring and processing of data records. In this regard, a spell checking tool would ask an author or other user to select the correctly spelled version of a word if a word were determined to be misspelled. In addition to being relatively impractical given the large number of ad hoc and wrong spellings, such spell checking tools generally do not address acronyms and abbreviations which are frequently included within data records.
Another approach is to manually write detailed patterns based on regular expressions. This approach provides a great deal of power and flexibility in dealing with many variants and misspellings. However, most users are not particularly adept at writing regular expressions, even with the use of tools that build basic regular expressions for the words in a part name and help the users assess the results against the data. Furthermore, regular expressions cannot deal with certain types of common errors, such as character transpositions, and different patterns for the same word may be required depending upon the context, which further complicates the analysis. Finally, building adequate regular expressions is very time consuming, making it difficult to extend the list of part names covered to new models or customers.
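By way of illustration only, a hand-written pattern of the type described above may be sketched as follows. The pattern, the name COMPUTER_PATTERN and the function mentions_computer are hypothetical; the pattern is written to cover only the variants of “computer” listed in the earlier example (computer, comptr, compter, computor and computo).

```python
import re

# Hypothetical hand-written pattern for the listed variants of "computer".
# COMP, an optional U, a T, then either E/O with an optional R, or a bare R.
COMPUTER_PATTERN = re.compile(r"\bCOMPU?T(?:[EO]R?|R)\b", re.IGNORECASE)


def mentions_computer(record: str) -> bool:
    """Return True if the record contains one of the anticipated variants."""
    return COMPUTER_PATTERN.search(record) is not None
```

Such a pattern matches each of the listed variants, but a transposed spelling such as “COMPUTRE” is not matched unless the author of the pattern anticipates that spelling and extends the pattern accordingly, and the ambiguous abbreviation “COMP” is not matched at all.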
As such, it would be desirable to provide an improved technique for identifying words or phrases within data records. In this regard, it would be desirable to provide an improved technique for identifying words or phrases within data records consisting of free-form text, such as that entered by mechanics or other authors.