The invention relates generally to text management systems.
Each year organizations spend countless hours searching through documents and images and organizing filing systems and databases. Even with large information retrieval systems, considerable resources are needed to index documents, guess which key words will locate needed information, search through pages one query at a time, and sort through all the irrelevant data that the search actually yields.
A number of studies evaluating large information retrieval systems show that these systems retrieve less than 20 percent of the documents relevant to a particular search, and that at the same time only 30 percent of the retrieved information is actually relevant to the intended meaning of the search request. One of the key reasons for poor retrieval results is that the people who perform retrieval know only the general topics of their interest and do not know the exact words used in the texts or in the keyword descriptors used to index the documents.
Another study analyzed how long it would take to index 5000 reports. It was assumed that each user was allowed 10 minutes to review each report, make indexing decisions by selecting the keywords, and record the information. At this rate, it would take 833 hours or 21 weeks for one full-time person (at 40 hours per week) to process the documents. The users would also need extra time to verify and correct the data. Under such an approach, the user must index incoming documents on a daily basis to keep the system from falling hopelessly behind. In addition, since the user chooses the relevant search terms, all unspecified terms are eliminated for search purposes. This creates a significant risk that documents containing pertinent information may not show up during a search because of the user's subjective judgments in selecting keywords.
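The figures quoted from this study can be checked with a short calculation. The following is only an illustrative sketch; the function name and its parameters are hypothetical and do not come from the study itself.

```python
def indexing_effort(num_reports, minutes_per_report, hours_per_week=40):
    """Return (total hours, working weeks) needed to index the reports by hand."""
    total_hours = num_reports * minutes_per_report / 60
    weeks = total_hours / hours_per_week
    return total_hours, weeks

# 5000 reports at 10 minutes each, one person working 40 hours per week
hours, weeks = indexing_effort(5000, 10)
print(f"{int(hours)} hours, {round(weeks)} weeks")  # 833 hours, 21 weeks
```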
Many text retrieval systems utilize index files which contain words in the documents with the location within the documents for each word. The indexes provide significant advantages in the speed of retrieval. One major disadvantage of this approach is that for most of the systems the overhead of the index is 50 to 100 percent of the document database. This means that a 100 Mbyte document database will require an index ranging from 50 to 100 Mbytes. This adds mass storage costs and overhead to the system.
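As a rough illustration of why such an index can approach the size of the documents themselves, the sketch below builds a simple word-to-location index. It is a minimal example under assumed conventions (whitespace tokenization, lowercase words, hypothetical document identifiers) and does not represent any particular retrieval system.

```python
from collections import defaultdict

def build_positional_index(documents):
    """Map each word to a list of (document id, word position) entries."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index

# Hypothetical two-document database
docs = {
    "doc1": "the index speeds retrieval",
    "doc2": "the index adds storage overhead",
}
index = build_positional_index(docs)
print(index["index"])  # [('doc1', 1), ('doc2', 1)]
```

Because every word occurrence contributes one entry, the index in this sketch grows in step with the text it describes, which is consistent with the overhead noted above.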
Automated indexing processes have been proposed. For example, in the book, INTRODUCTION TO MODERN INFORMATION RETRIEVAL, by Salton and McGill (McGraw Hill, 1983) a process for automatically indexing a document is presented. First, all the words of the document are compared to a stop list. Any words which are in the stop list are automatically excluded from the index. Then, the stems of the remaining words are generated by removing suffixes and prefixes. The generated stems are then processed to determine which will be most useful in the search process. The inverse document frequency function is an example of such a process. The resulting index of this document, and other documents, may then be searched for articles relevant to the user.
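The steps of such an automated indexing process can be sketched as follows. This is an illustrative sketch only, not the procedure given by Salton and McGill: the stop list, the prefix and suffix lists, and the function names are all hypothetical, the affix stripping is deliberately crude, and the weight log(N/df) is assumed here as one common form of the inverse document frequency function.

```python
import math

# Hypothetical, deliberately tiny lists chosen for illustration
STOP_LIST = {"the", "a", "of", "and", "is", "in", "to", "are"}
PREFIXES = ("re", "un")
SUFFIXES = ("ing", "ed", "s")

def stem(word):
    """Strip at most one prefix and one suffix, keeping at least 3 letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word

def index_terms(text):
    """Drop stop-list words, then reduce the remaining words to stems."""
    words = [w for w in text.lower().split() if w not in STOP_LIST]
    return {stem(w) for w in words}

def inverse_document_frequency(documents):
    """Weight each stem by log(N / number of documents containing it)."""
    term_sets = [index_terms(text) for text in documents]
    n = len(term_sets)
    doc_freq = {}
    for terms in term_sets:
        for term in terms:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n / df) for term, df in doc_freq.items()}

# Hypothetical three-document collection
documents = [
    "the indexing of documents",
    "searching indexed documents",
    "the user reindexed files",
]
idf = inverse_document_frequency(documents)
```

In this sketch the stem "index" appears in every document and so receives a weight of zero, while "search" appears in only one document and receives the highest weight, reflecting the idea that rarer stems are more useful for distinguishing documents during a search.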
The technique of truncating words by deleting prefixes and suffixes has also been applied to reduce storage requirements and accessing times in a text processing machine for automatic spelling verification and hyphenation functions. U.S. Pat. No. 4,342,085, issued Jul. 27, 1982 to Glickman et al., describes a method for storing a word list file and accessing the word list file such that legal prefixes and suffixes are truncated and only the unique root element, or "stem", of a word is stored. A set of unique rules is provided for prefix/suffix removal during compilation of the word list file and subsequent accessing of the word list file. Spelling verification is accomplished by applying the rules to the words whose spelling is to be verified. Application of the rules also provides, under most circumstances, a natural hyphenation break point at the prefix-stem and stem-suffix junctions.