The introduction and increasingly wide usage of computers in the past thirty years have made previously unavailable information increasingly accessible. This "information explosion" has accelerated in the past decade with the advent of personal computers (PCs), the large-scale linking of computers via local and wide area networks (LANs and WANs), and related events.
Further, the rapid growth of Internet and intranet technologies has resulted in vast amounts of information which can be accessed on-line. Much of this information is in the form of free format text, such as that found in textual documents. One means of searching such information is to navigate through hypertext links using an Internet browser or to traverse the folders of a file system or a document management repository. However, because of the large amount of information, full text indexes of all of the words in the documents are rapidly becoming an essential tool for finding needed information.
Indexing is the process of cataloging information in a collection of texts in an efficient and coherent manner so that it can be easily accessed. Most traditional indexing and retrieval schemes are ineffective when dealing with large quantities of variable length document text data.
As PCs have risen from their infancy, when relatively small amounts of data (on the order of kilobytes) were accessible by a single PC, to their current state, in which gigabytes of disparate data are accessible from a single PC, old methods for managing and accessing data are no longer effective.
For a collection of texts, the ability to retrieve data is directly related to the amount and quality of information in the index. For example, the index may contain only the titles of the documents. Or it may contain only certain key terms. The recommended solution is to provide indexing and searching on every word in the collection of texts.
The present invention relates to the class of indexing techniques known as full text indexing. A full text index consists of a word list for a collection of texts which resembles the index of a textbook. It can be viewed as a word list with an ascending order list of numbers associated with each word. Like the index of a book, the numbers refer to the indexing unit, or "granule" (e.g., page 6), where the word occurs in the source text. The core of the problem addressed by full text indexing is how to find documents (or parts thereof) when one does not know by whom they were written, when they were written, or what their contents are, yet one has an idea of the words, phrases, ideas, and possibly the dates involved. Thus, there are generally two search modes contemplated by full text indexers: (1) locate mode, i.e., searches for a specific document known to exist, but about which only fragmentary information is known (e.g., the date or author of the document); and (2) research mode, i.e., searches for documents pertaining to a certain category of information, where it is not known whether the documents exist (e.g., documents pertaining to education in the 19th century).
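The word list described above can be illustrated with a minimal sketch. It assumes that each granule is one entry in a list of strings, numbered from 1, and that a word is any run of letters or digits; the function names `build_full_text_index` and `search` are hypothetical and are not taken from any particular indexer.

```python
import re
from collections import defaultdict

def build_full_text_index(granules):
    """Map each word to an ascending list of granule numbers where it occurs."""
    postings = defaultdict(list)
    for number, text in enumerate(granules, start=1):
        # A set ensures each granule is recorded at most once per word.
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            postings[word].append(number)
    # Granules are visited in ascending order, so every list is already sorted.
    return dict(sorted(postings.items()))

def search(index, *words):
    """Return the granule numbers that contain every one of the given words."""
    result = None
    for word in words:
        hits = set(index.get(word.lower(), []))
        result = hits if result is None else result & hits
    return sorted(result or [])

# Hypothetical three-granule collection, for illustration only.
pages = [
    "The optometrist examined the patient.",
    "The patient went home.",
    "The school hired an optometrist.",
]
index = build_full_text_index(pages)
```

In this sketch `index["optometrist"]` is `[1, 3]`, and `search(index, "the", "patient")` returns `[1, 2]`, much as a reader would look up a term in the index of a book and then turn to the listed pages.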
Due to the large quantity of data that must be indexed today, some of the major indexing problems to be addressed are the speed of index creation, the speed of index access, and the size of the index. Regarding the speed of index creation, because the data being indexed is constantly changing, a full text indexer must be able to create a new index quickly when data changes. The indexer must also be able to locate and access information in the index quickly. Also, since storage space is important and the size of the index is closely related to access speed, it is highly desirable that the index be small relative to the data being indexed.
Limited memory availability when quickly building a full text index creates another problem, one relating to the relative frequency of the words being indexed. The DOS environment, for example, is an especially limited environment for indexing. A word like "the" may occur in almost every indexing unit, while a word like "optometrist" might occur in only a few indexing units. If the index is created in a single pass, the word list and the index elements for each word must coexist in the computer's memory. When a new word is encountered, the amount of memory necessary to store the references to that word cannot be known until all documents have been read. Making a series of small memory allocations would make the index for high frequency words inefficient, while making large allocations wastes memory on low frequency words.
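The allocation trade-off can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that references are stored in fixed-size blocks of `block_size` slots; the helper name `allocation_cost` and the reference counts used are hypothetical.

```python
import math

def allocation_cost(reference_count, block_size):
    """Return (blocks allocated, slots left unused) for one word's references."""
    blocks = math.ceil(reference_count / block_size)
    wasted = blocks * block_size - reference_count
    return blocks, wasted

# Hypothetical counts: a very common word and a rare one.
common, rare = 10_000, 3

small_common = allocation_cost(common, 4)     # many tiny allocations
small_rare = allocation_cost(rare, 4)
large_common = allocation_cost(common, 1024)  # few allocations
large_rare = allocation_cost(rare, 1024)      # mostly unused space
```

Under these assumptions a 4-slot block forces 2,500 separate allocations for the common word, while a 1,024-slot block needs only 10 but leaves 1,021 of the rare word's slots unused.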
Full text indexes must be updated as new documents are created and existing documents are changed or deleted, since references to deleted or noncurrent documents can use a significant amount of disk space, and can also decrease the efficiency of retrievals via the full text index. Such space can be reclaimed by reindexing the documents. However, such a reindexing process can be quite time consuming. Another mechanism for reclaiming space utilized by references to deleted documents is to compress the full text index so that the index contains only active references (i.e., references to nondeleted documents). Since reindexing is generally considerably slower than compressing the index, it is preferable to recover wasted space by compressing the index rather than reindexing the documents.
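Compressing an index in this sense can be sketched as a single pass over the existing word list, with no need to reread the documents. The sketch assumes an in-memory index that maps each word to an ascending list of granule numbers, and a set of granule numbers known to be deleted; the function name `compress_index` is hypothetical.

```python
def compress_index(index, deleted):
    """Return a copy of the index holding only references to active granules."""
    compressed = {}
    for word, numbers in index.items():
        # Filtering preserves the ascending order of the surviving references.
        active = [n for n in numbers if n not in deleted]
        if active:  # drop words whose every reference pointed at deleted granules
            compressed[word] = active
    return compressed

# Hypothetical index in which granule 2 has since been deleted.
old_index = {"the": [1, 2, 3], "optometrist": [1, 3], "home": [2]}
new_index = compress_index(old_index, deleted={2})
```

Here `new_index` keeps `"the"` and `"optometrist"` with references `[1, 3]` each and drops `"home"` entirely, because only the index itself is scanned rather than the source texts.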