1. Technical Field
The invention disclosed broadly relates to data processing and more particularly relates to linguistic applications in data processing.
2. Background Art
Text processing and word processing systems have been developed for both stand-alone applications and distributed processing applications. The terms text processing and word processing will be used interchangeably herein to refer to data processing systems primarily used for the creation, editing, communication, and/or printing of alphanumeric character strings composing written text. A particular distributed processing system for word processing is disclosed in the copending U.S. patent application Ser. No. 781,862 filed Sept. 30, 1985 entitled "Multilingual Processing for Screen Image Build and Command Decode in a Word Processor, with Full Command, Message and Help Support," by K. W. Borgendale, et al., assigned to IBM Corporation. The figures and specification of the Borgendale, et al. patent application are incorporated herein by reference, as an example of a host system within which the subject invention herein can be applied.
Document retrieval is the function of finding stored documents which contain information relevant to a user's query. Prior art computer methods for document retrieval are logically divided into a first component process for creating a document retrieval data base and a second component process for interrogating that data base with the user's queries. In the process of creating the data base, each document to be entered into the data base is associated with a unique document number. Then the words comprising the text of the document are scanned and are compiled into an inverted file index. The inverted file index is the accumulation of each unique word encountered in all of the documents scanned. As each word of a document is scanned, the corresponding document number is associated with that word and a search is made through the inverted file index to determine whether that particular word has been previously encountered in either the current document or previous documents entered into the data base. If the word has not been previously encountered, then the word is entered as a new word in the inverted file index and the document number is associated therewith. If, instead, the word has been previously encountered, either in the current document or in a previous document, then the location of the word in the inverted file index is found and the current document number is added to the collection of previous document numbers in which the word has been found. As additional documents are added to the data base, each respective unique word in the inverted file index accumulates additional document numbers for those documents containing the particular word. The inverted file index is stored in the memory of the data processor in the document retrieval system.
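By way of illustration only, the prior art data base creation process described above can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from any of the referenced systems: the names `tokenize` and `build_inverted_index`, the rule that a word is a lowercased run of letters or digits, and the sample documents are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Assumption: a "word" is a lowercased run of letters or digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each unique word to the document numbers in which it occurs."""
    inverted_index = defaultdict(list)
    # Each document is associated with a unique document number (1, 2, ...).
    for doc_number, text in enumerate(documents, start=1):
        for word in tokenize(text):
            postings = inverted_index[word]  # creates the entry for a new word
            # Record the current document number only once per word, even if
            # the word was already encountered earlier in the same document.
            if not postings or postings[-1] != doc_number:
                postings.append(doc_number)
    return dict(inverted_index)


# Hypothetical sample documents, for illustration only.
documents = [
    "Invoice for the March shipment",
    "Shipment delayed until April",
]
index = build_inverted_index(documents)
```

In this sketch a word found in both sample documents, such as "shipment", accumulates both document numbers, while a word found in only one document carries a single document number.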
A document table can also be stored in the memory, containing each respective document number and the corresponding document identification such as its title, location, or other identifying attributes. Typically, prior art techniques for creating a document retrieval data base required a scanning of the entire document in the compilation of the inverted file index. After the inverted file index and the document table have been created in the computer memory, the second stage in the prior art computer methods for document retrieval can take place, namely the input by the user of query words or expressions selected by the user to characterize the types of documents he is seeking in a particular retrieval application. When the user inputs his query words, each word is compared with the inverted file index to determine whether that word matches with any words previously entered in the inverted file index. Upon making a successful match with the query word, the corresponding document numbers for the matched entry in the inverted file index are noted. If additional words are present in the user's input query, each respective word is subjected to the matching operation with the words in the inverted file index and the corresponding document numbers for matched words are noted. Then, a scoring technique is employed to identify those documents having the largest number of matching words to the words in the user's input query. The highest scoring documents can then have their titles or other identifying attributes displayed on the display monitor for the computer in the retrieval system. An example of such a prior art document retrieval system is the IBM System/370 Storage and Information Retrieval System (STAIRS) which is described in IBM publication GH12-5123-1 entitled "IBM System/370 Storage and Information Retrieval System/Virtual Storage--Thesaurus and Linguistic Integrated System," November 1976. Another such system is described in U.S. Pat. No. 4,358,824 to Glickman, et al. entitled "Office Correspondence Storage and Retrieval System," assigned to the IBM Corporation.
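By way of illustration only, the prior art query stage described above, in which each query word is matched against the inverted file index and the documents are scored by their number of matching words, can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce the scoring of STAIRS or of the Glickman, et al. system: the name `score_query`, the `top_n` parameter, the treatment of repeated query words as a single word, the tie-breaking by document number, and the sample index and document table are all hypothetical.

```python
import re
from collections import Counter


def score_query(query, inverted_index, document_table, top_n=3):
    """Rank documents by how many distinct query words they contain."""
    scores = Counter()
    # Assumption: repeated query words are counted once.
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    for word in query_words:
        # Note the document numbers of every matched index entry.
        for doc_number in inverted_index.get(word, []):
            scores[doc_number] += 1
    # Highest score first; ties broken by lower document number (assumption).
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    # Look up each document's identifying attribute in the document table.
    return [(document_table[number], score) for number, score in ranked[:top_n]]


# Hypothetical inverted file index and document table, for illustration only.
inverted_index = {
    "invoice": [1],
    "march": [1],
    "shipment": [1, 2],
    "delayed": [2],
    "april": [2],
}
document_table = {1: "Invoice letter", 2: "Delay notice"}

results = score_query("march shipment invoice", inverted_index, document_table)
```

With the sample data, the first document matches all three query words and the second matches only one, so the first document's title is listed first.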
Although these prior art document retrieval systems work well, documents have different topics and are written by different authors at different times, so the user may seek only the particular document of a certain author and/or a certain subject or date. This retrieval-related information is referred to as the retrieval parameters. This is particularly true of business correspondence, where the user desiring to retrieve a document may remember only the author, date, recipient, address, subject statement, or other document parameter. It would therefore be desirable to have a document retrieval system which isolates the business correspondence parameters in the process of data base creation, thereby facilitating the retrieval of business correspondence through the use of queries comprising such business correspondence parameters. The problem of reliably retrieving business correspondence is further compounded when the user compiles a query containing terms which are not exactly the same as the terms in the parameters compiled into the data base during the data base creation phase. It would be desirable to have a document retrieval system suitable for retrieving business correspondence using terms in a query which are different in their linguistic structure, syntax or semantics from the terms employed in the compilation of the data base.