This patent application discloses an invention which may optionally form a portion of a larger system. Other portions of the larger system are disclosed and described in the following co-pending patent applications, all of which are subject to an obligation of assignment to the same person. The disclosures of these applications are herein incorporated by reference in their entireties.
METHOD AND SYSTEM FOR AUTOMATIC HARVESTING AND QUALIFICATION OF DYNAMIC DATABASE CONTENT, William J. Bushee, Thomas W. Tiahrt, and Michael K. Bergman, Filed Jul. 24, 2001, application Ser. No. 09/911,522 now pending.
SYSTEM AND METHOD FOR EFFICIENT CONTROL AND CAPTURE OF DYNAMIC DATABASE CONTENT, William J. Bushee and Thomas W. Tiahrt, Filed Jul. 24, 2001, application Ser. No. 09/911,434 now pending.
1. Field of the Invention
The present invention relates to radix search tries and more particularly pertains to a new system and method for flexible indexing of document content for facilitating the rapid search and retrieval of large collections of documents.
2. Description of the Prior Art
The use of lexicographic (digital) search trees is known in the prior art. A radix search trie is a digital search tree with a fixed alphabet size. Each edge in the trie represents a character in the alphabet. Each internal node represents a string prefix. Each external node represents a string. The tree records the minimal prefix set of characters required to differentiate all strings in the string set. Strings are found by following an access path defined by the string's characters.
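By way of illustration only, the access path described above can be sketched in Python as follows. The names TrieNode, Trie, insert and contains are hypothetical and do not appear in this disclosure. For simplicity the sketch stores every character of each string rather than only the minimal distinguishing prefix, and it uses a dictionary for the edges of each node.

```python
class TrieNode:
    """One node of the trie; each key in `children` is one edge (one character)."""

    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.terminal = False  # True when the path from the root spells a stored string


class Trie:
    """Simplified trie: stores full strings, not minimal distinguishing prefixes."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            # Follow the edge for this character, creating it if absent.
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False   # the access path breaks off: string not stored
        return node.terminal   # a bare prefix of a stored string does not count


trie = Trie()
for w in ("tree", "trie"):
    trie.insert(w)
```

In this sketch the strings "tree" and "trie" share the internal nodes for the prefix "tr" and diverge at the third character, so a lookup examines at most one edge per character of the query string.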
Trie variations have developed into three broad categories: array based tries, where arrays of pointers are used to access subtrees; binary search tree based tries, where a binary tree is used to traverse the trie; and list based tries, where linked lists provide access linkage.
Array lookup can be relatively fast, but is typically limited to small alphabet sizes, since with a large alphabet most of the pointers in each array are null and the storage they occupy is wasted. Binary search trees are relatively compact, but each bit must be examined, so binary search trees are slower than arrays. Linked lists are more storage efficient than arrays, but have slower access times than arrays.
When extremely large numbers of strings are to be indexed, storage efficiency greater than that of an array trie and access faster than that of a linked list trie or binary search trie are desirable.
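The trade-off between the array based and list based variations can be sketched, under stated assumptions, as follows. ArrayNode, ListNode, insert, contains and count_slots are hypothetical names, the alphabet is assumed to be the 26 lowercase letters a to z, and a Python list scanned sequentially stands in for a linked list. The sketch is not the structure of the present invention.

```python
ALPHABET_SIZE = 26  # assumption: lowercase letters a-z only


class ArrayNode:
    """Array based node: one slot per alphabet character, indexed directly."""

    def __init__(self):
        self.slots = [None] * ALPHABET_SIZE
        self.terminal = False

    def child(self, ch, create=False):
        i = ord(ch) - ord("a")           # direct index, no searching
        if self.slots[i] is None and create:
            self.slots[i] = ArrayNode()
        return self.slots[i]


class ListNode:
    """List based node: only existing edges are stored, but lookup is a scan."""

    def __init__(self):
        self.edges = []                  # (character, child) pairs
        self.terminal = False

    def child(self, ch, create=False):
        for c, node in self.edges:       # sequential scan of the edge list
            if c == ch:
                return node
        if not create:
            return None
        node = ListNode()
        self.edges.append((ch, node))
        return node


def insert(root, word):
    node = root
    for ch in word:
        node = node.child(ch, create=True)
    node.terminal = True


def contains(root, word):
    node = root
    for ch in word:
        node = node.child(ch)
        if node is None:
            return False
    return node.terminal


def count_slots(node):
    """Return (used, empty) pointer slots over a whole ArrayNode trie."""
    used = empty = 0
    for slot in node.slots:
        if slot is None:
            empty += 1
        else:
            sub_used, sub_empty = count_slots(slot)
            used += 1 + sub_used
            empty += sub_empty
    return used, empty
```

Both node types answer the same queries. The array node finds a child with a single index operation but reserves a slot for every character of the alphabet, while the list node stores only the edges that exist and pays for that with a scan at each level.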
In view of the foregoing disadvantages inherent in the known types of radix search tries now present in the prior art, the present invention provides a new system for flexible indexing of document content wherein the same can be utilized for facilitating the rapid search and retrieval of large collections of documents.
The invention contemplates a method for flexible indexing of document content, and includes obtaining a collection of documents to be indexed, storing said collection of documents in a single document information stream, parsing each one of said documents into constituent words to facilitate indexing, creating a plurality of stem words to be indexed by stemming each word into a standard prefix, and indexing each stem word.
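The steps of the method may be sketched as follows, purely by way of example. All function names are hypothetical. The stemming rule shown (truncation to a fixed-length prefix) is a placeholder assumption, since no particular stemming rule is specified at this point in the disclosure, and a Python dictionary stands in for the index structure.

```python
import re
from collections import defaultdict


def build_stream(documents):
    """Store all documents in one stream; return the stream and per-document offsets."""
    parts, offsets, pos = [], [], 0
    for text in documents:
        offsets.append((pos, pos + len(text)))  # (start, end) of this document
        parts.append(text)
        pos += len(text)
    return "".join(parts), offsets


def parse_words(text):
    """Parse a document into its constituent words (lowercase letters and digits)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(word, max_len=5):
    """Placeholder stemmer (assumption): keep at most the first max_len characters."""
    return word[:max_len]


def index_documents(documents):
    """Map each stem word to the set of document numbers that contain it."""
    stream, offsets = build_stream(documents)
    index = defaultdict(set)
    for doc_id, (start, end) in enumerate(offsets):
        for word in parse_words(stream[start:end]):
            index[stem(word)].add(doc_id)
    return index


docs = ["Indexing documents quickly", "An indexed document"]
index = index_documents(docs)
```

With the placeholder stemmer, "indexing" and "indexed" both reduce to the stem "index", so a search on that stem retrieves both example documents.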
There has thus been outlined, rather broadly, the more important features of the invention in order that the detailed description thereof that follows may be better understood, and in order that the present contribution to the art may be better appreciated. There are additional features of the invention that will be described hereinafter and which will form the subject matter of the claims appended hereto.
In this respect, before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.
As such, those skilled in the art will appreciate that the conception, upon which this disclosure is based, may readily be utilized as a basis for the designing of other structures, methods and systems for carrying out the several purposes of the present invention. It is important, therefore, that the claims be regarded as including such equivalent constructions insofar as they do not depart from the spirit and scope of the present invention.