The invention relates generally to morphological analyzers and to text management systems.
Each year organizations spend countless hours searching through documents and images, organizing filing systems and databases. Even with large information retrieval systems, considerable resources are needed to index documents, guess which key words will locate needed information, search through pages one query at a time, and sort through all the irrelevant data that the search actually yields.
A number of studies evaluating large information retrieval systems show that these systems retrieve less than 20 percent of the documents relevant to a particular search, and that at the same time only 30 percent of the retrieved information is actually relevant to the intended meaning of the search request. One of the key reasons for poor retrieval results is that the people who perform retrieval know only the general topics of their interest and do not know the exact words used in the texts or in the keyword descriptors used to index the documents.
Another study analyzed how long it would take to index 5000 reports. It was assumed that each user was allowed 10 minutes to review each report, make indexing decisions by selecting the keywords, and record the information. At this rate, it would take 833 hours or 21 weeks for one full-time person (at 40 hours per week) to process the documents. The users would also need extra time to verify and correct the data. Under such an approach, the user must index incoming documents on a daily basis to keep the system from falling hopelessly behind. In addition, since the user chooses the relevant search terms, all unspecified terms are eliminated for search purposes. This creates a significant risk that documents containing pertinent information may not show up during a search because of the user's subjective judgments in selecting keywords.
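The time estimate above can be checked with a short calculation. The following is a minimal sketch; the function name `indexing_effort` and its parameters are illustrative only and are not taken from the study.

```python
def indexing_effort(num_reports, minutes_per_report, hours_per_week=40):
    """Return (total_hours, weeks) needed for one person to index the reports."""
    total_hours = num_reports * minutes_per_report / 60
    weeks = total_hours / hours_per_week
    return total_hours, weeks


# Figures from the study: 5000 reports at 10 minutes each, 40-hour work week.
hours, weeks = indexing_effort(5000, 10)
print(f"{hours:.0f} hours, about {weeks:.0f} weeks")
```

The result is roughly 833 hours, or a little under 21 weeks, before any extra time for verification and correction is counted.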
Many text retrieval systems use index files that list the words in the documents together with the location of each word within the documents. Such indexes provide significant advantages in the speed of retrieval. One major disadvantage of this approach is that, for most systems, the index adds an overhead of 50 to 100 percent of the size of the document database. This means that a 100 Mbyte document database will require an index ranging from 50 to 100 Mbytes, which adds mass storage costs and overhead to the system.
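To illustrate the kind of index file described above, the following is a minimal sketch of a word-to-location index, together with the storage estimate for the quoted 50 to 100 percent overhead. It assumes a simple lowercase alphanumeric tokenizer and word-offset locations; the names `build_positional_index` and `index_size_bounds` are hypothetical and do not describe any particular retrieval system.

```python
import re
from collections import defaultdict


def build_positional_index(documents):
    """Map each word to a list of (document id, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # Assumed tokenizer: runs of lowercase letters and digits.
        tokens = re.finditer(r"[a-z0-9]+", text.lower())
        for position, match in enumerate(tokens):
            index[match.group()].append((doc_id, position))
    return dict(index)


def index_size_bounds(database_mbytes, low=0.5, high=1.0):
    """Estimate index size for an overhead of 50 to 100 percent of the database."""
    return database_mbytes * low, database_mbytes * high


docs = {
    "d1": "The index speeds retrieval",
    "d2": "An index adds storage overhead",
}
index = build_positional_index(docs)
print(index["index"])          # [('d1', 1), ('d2', 1)]
print(index_size_bounds(100))  # (50.0, 100.0)
```

Because every occurrence of every word is recorded with its location, the index grows with the text itself, which is consistent with the overhead figures quoted above.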