The present invention relates to an apparatus, system, and method for predicting and accurately reproducing linguistic properties of character and word sequences using techniques involving affix data preparation, generation, and prediction.
Automated document preparation systems have been available for some time. These systems allow a plurality of individuals to dictate information to a transcription center where the dictated information is stored, transcribed and processed for distribution in accordance with a predetermined arrangement. Such systems are commonly employed in the healthcare industry where physicians, nurses and other medical professionals are required to maintain detailed records relating to the status of the many patients they see during the course of their daily routine.
Like virtually all industries, the healthcare industry is beset by a need for readily available information. For physicians and patients alike, information is less readily available than it is in other fields. While much of the known scientific information relating to medicine is available via public and/or private databases, the manner in which the data is gathered and analyzed is very similar to methods that have been used since the development of the printing press.
That is, physicians typically conduct research on an individual basis and publish reports describing the information they have found through their research. The basis for their research is, however, usually information of which they have first-hand knowledge or information which has been previously published by other physicians.
In addition to the limited availability of information for use by physicians, the available information regarding the practice of medicine is stored and prepared in an arcane manner not readily understandable by the conventional patient. As such, medical patients are often forced to rely entirely upon information given to them by their personal physicians, and consequently overlook alternate procedures which may be preferable to those suggested by their personal physician.
Automated document preparation systems have for some time incorporated natural language processing to enhance document processing and information retrieval. For example, a natural language processor linked with a text normalization processor may be configured to compile relevant information from reports generated by an automated document preparation system. The relevant information may relate to diagnosis of diseases, treatment protocols, billing codes and the like, and may be compiled and indexed for later retrieval and research.
In conventional natural language processors, morphological analysis and stemming techniques have been implemented to enhance natural language processing and information retrieval. Morphological analysis may include inflectional and derivational analysis of natural language text. More particularly, inflectional analysis may involve determining patterns in paradigms, and derivational analysis may involve the process of word formation. Computational methods have been applied to morphological analysis and generation in natural language parsing; text generation; machine translation; dictionary tools; text-to-speech and speech recognition; word processing; spelling checking; text input; information retrieval, summarization, and classification; and information extraction.
However, drawbacks and disadvantages are associated with conventional text processing engines. For example, the conventional information extraction engine is typically constructed using databases or tables of terms. In the medical fields, these tables often encompass several million terms (words and phrases). The size of these tables not only encumbers computer memory resources, but also encumbers the performance of the normalization engine. More specifically, as the tables grow larger, the time required to search them grows as well. It would also be desirable to apply the same generation and prediction methods to a number of information extraction processing steps, such as uninflection, underivation, and part-of-speech prediction, and for these methods to work equally well over words and phrases. The problem of processing text is compounded by the fact that it is not possible to list all possible terms. Consequently, prediction technology should not only provide precise information about the terms of which it has direct knowledge, but also be able to accurately predict information for novel or out-of-vocabulary terms.
The present invention addresses several shortcomings of the prior art by: (a) enforcing the requirement that the prediction method be capable of perfectly rendering the information supplied by the data set used to generate the predictor; (b) providing a method of excluding data from the generation process; (c) providing a method of incorporating exceptional data into the generation process; and, thereby, (d) providing the ability either to replace the original data set completely or to combine perfect rendition of the information in a data set with highly accurate prediction for novel or out-of-vocabulary terms.
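By way of illustration only, the four properties listed above may be sketched for the uninflection step as follows. The sketch is not the claimed method. It assumes a simple majority vote over word-final suffixes, and every name in it (`suffix_rule`, `build_predictor`, `TRAINING`, the `excluded` and `max_suffix` parameters) and every term pair in the training table is hypothetical. Training terms that the learned suffix rules would render incorrectly are retained as exceptions, so the predictor reproduces its own data set exactly while still producing a prediction for terms it has never seen.

```python
from collections import Counter, defaultdict

def suffix_rule(word, base):
    """Return (strip, add): remove `strip` from the end of word, append `add`."""
    i = 0
    while i < min(len(word), len(base)) and word[i] == base[i]:
        i += 1
    return word[i:], base[i:]

def build_predictor(pairs, excluded=(), max_suffix=3):
    """Build an uninflection predictor from {inflected: base} pairs.

    Hypothetical sketch: each word-final suffix (up to max_suffix characters)
    votes for the rewrite rule seen most often with it. Entries in `excluded`
    are left out of generation entirely.
    """
    votes = defaultdict(Counter)
    for word, base in pairs.items():
        if word in excluded:
            continue
        rule = suffix_rule(word, base)
        for n in range(1, min(max_suffix, len(word)) + 1):
            votes[word[-n:]][rule] += 1
    rules = {sfx: counts.most_common(1)[0][0] for sfx, counts in votes.items()}

    def apply_rules(word):
        # Try the longest known suffix first, then fall back to shorter ones.
        for n in range(min(max_suffix, len(word)), 0, -1):
            rule = rules.get(word[-n:])
            if rule is not None:
                strip, add = rule
                if word.endswith(strip):
                    return word[:len(word) - len(strip)] + add
        return word  # no applicable rule: leave the term unchanged

    # Training terms the rules get wrong are stored verbatim as exceptions,
    # which guarantees perfect rendition of the (non-excluded) data set.
    exceptions = {w: b for w, b in pairs.items()
                  if w not in excluded and apply_rules(w) != b}

    def predict(word):
        if word in exceptions:
            return exceptions[word]
        return apply_rules(word)

    return predict, exceptions

# Hypothetical training table (inflected form -> base form).
TRAINING = {
    "lesions": "lesion", "tumors": "tumor", "patients": "patient",
    "nodules": "nodule", "biopsies": "biopsy", "therapies": "therapy",
    "allergies": "allergy", "diagnoses": "diagnosis",
    "prognoses": "prognosis", "abscesses": "abscess",
}

predict, exceptions = build_predictor(TRAINING)
print(exceptions)             # {'abscesses': 'abscess'}
print(predict("abscesses"))   # abscess   (stored exception)
print(predict("stenoses"))    # stenosis  (out-of-vocabulary, predicted)
print(predict("fractures"))   # fracture  (falls back to a shorter suffix)
```

In this sketch only the suffix rules and the exceptions need to be retained, which is much smaller than the full term table whenever most terms follow a regular pattern. Passing `excluded={"abscesses"}` removes that entry from generation, after which no exceptions are stored and "abscesses" is handled by the learned rules alone.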