This invention relates to a method of processing data, and more particularly to a method of processing stored electronic documents to facilitate subsequent retrieval.
It is known to search text-based documents electronically using keywords linked through Boolean logic. This technique has been used for many years to search patent literature, for example, and more recently documents on the Internet. The problem with such conventional searches is that if the search criteria are made broad, the search engine will often produce thousands of "hits", many of which are of no interest to the searcher. If the criteria are made too narrow, there is a risk that relevant documents will be missed.
There is a real need to provide a search engine that will filter out unwanted results while retaining results of interest to the user. An object of the invention is to provide such a system.
According to the present invention there is provided a method of processing electronic documents for subsequent retrieval, comprising the steps of storing in memory a summary structure database describing the structure of summary records associated with each document, each structured summary record having at least one descriptor field with predefined allowed entries identifying a characteristic of the document; storing in memory predefined keyword criteria associated with said allowed field entries; analyzing each document to build a text index listing the occurrence of unique significant words in the document; and comparing said text index with said keyword criteria to determine the appropriate field entry for the associated descriptor field.
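By way of illustration only, the analyzing step might be sketched as follows. The stop-word list, the tokenizing rule and the name `build_text_index` are assumptions made for this example; the method itself does not prescribe how "significant" words are selected.

```python
import re
from collections import Counter

# Assumed stop-word list: words treated here as not "significant".
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for", "by", "with"}
)


def build_text_index(text: str) -> Counter:
    """Count the occurrences of each unique significant word in a document."""
    # Lower-case the text and split it into simple word tokens.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Discard stop words; everything else is counted.
    return Counter(w for w in words if w not in STOP_WORDS)


text_index = build_text_index("The investor sold shares. Shares of the fund rose.")
```

In this sketch the text index is simply a mapping from each word to its count, which is the form assumed when the index is later compared with the keyword criteria.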
Examples of descriptor fields with limited allowed field entries are category and location. The category field might have as possible field entries: Finance, Sports, Politics. The location field might have as possible entries: Africa, Canada, Europe.
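As a minimal sketch, the summary structure for these two descriptor fields could be held as a table of allowed entries against which every field entry is checked. The names `SUMMARY_STRUCTURE`, `SummaryRecord` and `set_descriptor` are hypothetical and are used only for this example.

```python
from dataclasses import dataclass, field

# Assumed in-memory form of the summary structure: each descriptor
# field maps to its predefined allowed entries.
SUMMARY_STRUCTURE = {
    "category": {"Finance", "Sports", "Politics"},
    "location": {"Africa", "Canada", "Europe"},
}


@dataclass
class SummaryRecord:
    """A structured summary record holding descriptor field entries."""

    descriptors: dict = field(default_factory=dict)

    def set_descriptor(self, name: str, entry: str) -> None:
        allowed = SUMMARY_STRUCTURE.get(name)
        if allowed is None:
            raise KeyError(f"unknown descriptor field: {name}")
        if entry not in allowed:
            raise ValueError(f"{entry!r} is not an allowed entry for {name}")
        self.descriptors[name] = entry


record = SummaryRecord()
record.set_descriptor("category", "Finance")
record.set_descriptor("location", "Canada")
```

An attempt to store an entry outside the allowed set, such as a category of "Weather", raises an error in this sketch rather than being silently accepted.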
The individual field entries are in turn associated with certain keyword criteria. For example, the criteria for the financial field entry might be: shares, public, bankrupt, market, profit, investor, stock, IPO, quarter, "fund manager". The criteria for the sports field entry might be: football, ball, basketball, hockey, bat, score, soccer, run, baseball, "Wayne Gretzky", "Chicago Bulls", "Michael Jordan".
It will be appreciated that the keyword criteria are chosen in view of the likelihood that any document containing those keywords will be associated with the particular category.
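The comparison of a text index with such keyword criteria might, for example, be sketched as below. The scoring rule (the sum of the counts of matching keywords, with a minimum score of one) is an assumption made for illustration; the description does not fix a particular rule. Multi-word criteria such as "fund manager" are omitted here because they would require phrase matching rather than a word-count lookup. The names `KEYWORD_CRITERIA` and `choose_field_entry` are hypothetical.

```python
from collections import Counter

# Single-word keyword criteria taken from the examples above.
KEYWORD_CRITERIA = {
    "Finance": {"shares", "public", "bankrupt", "market", "profit",
                "investor", "stock", "ipo", "quarter"},
    "Sports": {"football", "ball", "basketball", "hockey", "bat",
               "score", "soccer", "run", "baseball"},
}


def choose_field_entry(text_index, criteria, min_score=1):
    """Return the allowed entry whose keywords occur most often, or None."""
    # Assumed scoring rule: add up the index counts of each entry's keywords.
    scores = {
        entry: sum(text_index.get(word, 0) for word in keywords)
        for entry, keywords in criteria.items()
    }
    best = max(scores, key=scores.get)
    # Leave the field empty when no keyword criteria are met.
    return best if scores[best] >= min_score else None


# A small text index as might be built from a financial news item.
text_index = Counter({"shares": 2, "investor": 1, "score": 1, "rose": 1})
category = choose_field_entry(text_index, KEYWORD_CRITERIA)
```

Here the Finance keywords account for three occurrences and the Sports keywords for one, so the category field would receive the entry Finance.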
In a preferred embodiment, the structured summary also includes fields having unlimited entries. Examples of such fields are a keyword field and an excerpt field. The keyword field may list the words having the highest count in the text index. The excerpt field may list the sentences containing the highest occurrence of keywords.
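One possible way of filling the keyword and excerpt fields is sketched below. The sentence-splitting rule, the number of keywords and excerpts retained, and the function names are assumptions made for this example only.

```python
import re
from collections import Counter


def keyword_field(text_index: Counter, n: int = 3) -> list:
    """List the n words having the highest count in the text index."""
    return [word for word, _ in text_index.most_common(n)]


def excerpt_field(text: str, keywords, n: int = 1) -> list:
    """List the n sentences containing the most keyword occurrences."""
    # Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    wanted = set(keywords)

    def hits(sentence: str) -> int:
        words = re.findall(r"[a-z0-9']+", sentence.lower())
        return sum(1 for w in words if w in wanted)

    # sorted() is stable, so tied sentences keep their document order.
    return sorted(sentences, key=hits, reverse=True)[:n]


text = ("The market fell. Investors sold shares as the market slid, "
        "and shares in the market leader dropped. Rain is expected.")
# Abbreviated text index for the passage above.
text_index = Counter({"market": 3, "shares": 2, "fell": 1, "rain": 1})

keywords = keyword_field(text_index, n=2)
excerpts = excerpt_field(text, keywords, n=1)
```

With these inputs the keyword field holds "market" and "shares", and the excerpt field holds the second sentence, which contains four keyword occurrences.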
The structured summary can be established according to a standard profile that is the same for all users, or in one embodiment the profile can change in accordance with a particular user's needs. In this case, a user profile is stored in a profile database.
The structured summaries normally include pointers to the memory locations of the associated documents so that, during a subsequent search, a user can view relevant summaries and quickly locate the associated document as required.
The invention also extends to a system for processing electronic documents for subsequent retrieval, comprising a memory storing a summary structure describing the structure of summary records associated with each document, each structured summary record having at least one descriptor field with predefined allowed entries identifying a characteristic of the document; a memory storing predetermined keyword criteria associated with said allowed field entries; means for analyzing each document to build a text index listing the occurrence of unique significant words in the document; and means for comparing said text index with said keyword criteria to determine the appropriate field entry for the associated descriptor field.
The invention still further provides a method of retrieving electronic documents which are associated with a structured summary record containing a pointer to the document and having at least one descriptor field with predefined allowed field entries identifying a characteristic of the document, comprising searching through the summary records for records having specific field entries, and identifying the documents associated with the records matching the search criteria.
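The retrieval method may be illustrated by the following minimal sketch, in which the search runs over the summary records rather than over the documents themselves. The record layout, the file paths used as pointers, and the names `StoredSummary` and `find_documents` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredSummary:
    """A summary record: a pointer to the document plus descriptor fields."""

    pointer: str   # assumed here to be a file path
    category: str
    location: str


def find_documents(records, **criteria):
    """Return pointers of records whose field entries match all criteria."""
    return [
        record.pointer
        for record in records
        if all(getattr(record, name) == entry for name, entry in criteria.items())
    ]


# Hypothetical summary records built during earlier processing.
RECORDS = [
    StoredSummary("docs/0001.txt", "Finance", "Canada"),
    StoredSummary("docs/0002.txt", "Sports", "Canada"),
    StoredSummary("docs/0003.txt", "Finance", "Europe"),
]

matches = find_documents(RECORDS, category="Finance", location="Canada")
```

Because every descriptor field holds one of a small set of allowed entries, the search in this sketch reduces to exact comparisons of field entries, and the matching pointers identify the documents to be retrieved.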