The invention disclosed herein relates generally to natural language data processing and, more particularly, to a computer-based method for enhancing natural language data for subsequent processing, and for retrieving natural language data from the enhanced natural language data using morphological and part-of-speech information.
Sophisticated techniques for archival of electrical signals representative of natural language data allow most business organizations and libraries to store vast amounts of information in their computer systems. However, regardless of how sophisticated the archival techniques become, the stored information is virtually worthless unless such information can be retrieved when requested by an individual user.
Typical techniques for retrieving a desired text use "keyword" and "contextual" searches. Each of these techniques requires the user to provide a fairly precise query or else retrieval of the desired text can be greatly compromised. As an example, a user may attempt to retrieve information on "heat seeking missiles", and request every textual fragment containing the word "seeking". Unfortunately, this technique would fail to retrieve fragments such as "successful missiles sought heat sources" or "smart missile seeks heat".
Alternatives have been suggested wherein a predetermined root of the keyword to be searched and a truncation mask are combined to increase the probability of matching various word endings or inflections. For example, if the exclamation point symbol "!" represents the truncation mask, then "seek!" would match "seeks" or "seeking" but still would fail to match "sought". Thus, this approach does not fully resolve problems created by differences in word inflection.
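The limitation of keyword and truncation-mask searching described above can be illustrated with a minimal sketch. The function names `mask_to_regex` and `matching_fragments` are hypothetical and chosen only for illustration; the sketch assumes the mask "!" stands for any run of trailing word characters and is not drawn from any particular prior-art system.

```python
import re

def mask_to_regex(query, mask="!"):
    """Translate a query such as 'seek!' into a whole-word regular expression.

    Assumption: a trailing mask character matches zero or more word characters.
    """
    root = query.rstrip(mask)
    suffix = r"\w*" if query.endswith(mask) else ""
    return re.compile(r"\b" + re.escape(root) + suffix + r"\b", re.IGNORECASE)

def matching_fragments(query, fragments):
    """Return the fragments containing at least one word matched by the query."""
    pattern = mask_to_regex(query)
    return [fragment for fragment in fragments if pattern.search(fragment)]

# The three fragments used as examples in the text.
fragments = [
    "heat seeking missiles",
    "successful missiles sought heat sources",
    "smart missile seeks heat",
]

exact_hits = matching_fragments("seeking", fragments)  # plain keyword search
masked_hits = matching_fragments("seek!", fragments)   # root plus truncation mask
```

Under these assumptions the plain keyword finds only the first fragment, the masked query additionally finds "smart missile seeks heat", and neither query finds the fragment containing the irregular form "sought".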
A method for using morphological information to cross reference keywords used for information retrieval is disclosed in U.S. Pat. No. 5,099,426. The method described therein is primarily concerned with generating a compressed text and then searching for information in the compressed text using intermediate indexes and a compiled cross reference table. Although the method of the present invention also uses morphological information, the present invention has no requirements either for text compression or for any such intermediate indexes and cross reference table. Further, the method described in the foregoing patent does not employ word sense disambiguation or part-of-speech (POS) information to refine the search. Accordingly, if the method described therein were utilized to search a text for "recording", it would likely find occurrences such as "record", "records", "recorded", "recording", and possibly "prerecorded" and "rerecorded"; however, that method offers no provision to refine the search so as to retrieve only those occurrences where "recording" is used as a noun, for example. Thus, it is desirable for the retrieval method to allow the user to specify word usage as part of the search strategy. In this manner, the user may request that occurrences of a predetermined word be retrieved only when the predetermined word is specifically used, for example, either as a verb, adjective, or noun.
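By way of illustration only, the kind of usage-restricted search described above can be sketched as follows. The sketch assumes that each word of the text has already been annotated with a base form (lemma) and a part-of-speech tag; the `Token` record, the hand-annotated `SENTENCES` table, the tag names, and the `search` function are all hypothetical and do not represent the method of the invention or of the foregoing patent.

```python
from typing import NamedTuple

class Token(NamedTuple):
    surface: str  # the word as it appears in the text
    lemma: str    # its base form (assumed to be supplied by prior analysis)
    pos: str      # its part-of-speech tag (assumed tag set: NOUN, VERB, ...)

# Toy text, annotated by hand for this example.
SENTENCES = {
    "s1": [Token("The", "the", "DET"), Token("recording", "recording", "NOUN"),
           Token("was", "be", "VERB"), Token("lost", "lose", "VERB")],
    "s2": [Token("She", "she", "PRON"), Token("is", "be", "VERB"),
           Token("recording", "record", "VERB"), Token("the", "the", "DET"),
           Token("concert", "concert", "NOUN")],
    "s3": [Token("He", "he", "PRON"), Token("sought", "seek", "VERB"),
           Token("old", "old", "ADJ"), Token("records", "record", "NOUN")],
}

def search(sentences, *, surface=None, lemma=None, pos=None):
    """Return ids of sentences having a token that meets every given criterion."""
    hits = []
    for sentence_id, tokens in sentences.items():
        for token in tokens:
            if surface is not None and token.surface.lower() != surface.lower():
                continue
            if lemma is not None and token.lemma != lemma:
                continue
            if pos is not None and token.pos != pos:
                continue
            hits.append(sentence_id)
            break  # one matching token is enough for this sentence
    return hits

any_usage = search(SENTENCES, surface="recording")              # noun and verb uses
noun_only = search(SENTENCES, surface="recording", pos="NOUN")  # noun use only
by_lemma = search(SENTENCES, lemma="seek")                      # finds "sought"
```

In this toy data, searching for "recording" without a usage restriction returns both the noun and the verb occurrence, adding the noun restriction keeps only the first sentence, and searching by base form retrieves the irregular inflection "sought".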
It is therefore an object of the invention to provide an improved natural language data retrieval method which is not subject to the foregoing disadvantages of existing information retrieval methods.
It is a further object of the invention to provide a method for enhancement of natural language data such that the enhanced data may be conveniently used in a subsequent natural language processing scheme such as natural language data retrieval.
It is yet a further object of the invention to provide a natural language data retrieval method which uses morphological and part-of-speech information to increase the probability of retrieving selected textual information.