The present invention relates generally to the field of information management, and, more particularly, to the field of full text indexing.
The introduction and increasingly wide usage of computers in the past thirty years has made heretofore unavailable information increasingly accessible. This information or data explosion has accelerated in the past decade with the advent of personal computers and the large scale linking of computers via local and wide area networks. As the amount of available data and information increases, management and retrieval of that information have become an increasingly important and complex problem. An essential element of such management and retrieval is indexing.
Indexing is the process of cataloging information in an efficient and coherent manner so that it can be easily accessed. Traditional indexing and retrieval schemes, however, are ill equipped to accommodate the creation of indexes which store linguistic, phonetic, contextual or other information about the words which are indexed. Indexing of such information can advantageously provide more flexibility in the types of queries which can be run against the index which, in turn, provides a more robust and powerful indexer. Due to the large amount of information which must be managed during the creation of such an index, it is desirable that the processes and apparatuses used in the creation of such an index operate in an efficient manner which conserves resources such as memory yet which still provides acceptable processing times.
The present invention provides a computer system and method for creating a full text index that is able to accommodate linguistic, phonetic, conceptual, contextual and other types of relational or descriptive information. Indexable text can comprise alphabetic, numeric or alphanumeric characters as well as special character sets.
One embodiment of the present invention is a method in a computer system for creating a word list associated with a source text including one or more documents. Each document comprises one or more granules, wherein each granule defines an indexing unit of text including one or more words. The computer system searches at least a portion of one of the documents for a first word. The computer system creates a parent structure which is associated with the first word and which has a location list. The computer system stores the location of the granule containing the first word in the location list of the parent structure for the first word. The computer system creates one or more child structures which are associated with one or more child words, where each child word is associated with the first word and each child structure has a location list associated therewith. The computer system stores the location of the granule containing the first word in the location list of each child structure.
Another embodiment of the present invention is a computer system for creating a word list associated with a source text including one or more documents. Each document comprises one or more granules, in which each granule defines an indexing unit of text including one or more words. The computer system has a parent structure associated with a first word, wherein the first word is located in one of the documents. The parent structure comprises a location array for storing the location of the granule containing the first word. The computer system has a child structure comprising a location array for storing the location of the granule containing the first word, wherein the child structure represents a child word and the child word is an attribute of the first word.
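By way of illustration only, the parent and child structures described in the two embodiments above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the names WordEntry, build_word_list and toy_attributes are hypothetical, granule locations are assumed to be simple sequential granule numbers, child structures are assumed to be held under their parent structure, and the source text is assumed to be supplied as an already-segmented list of granules.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class WordEntry:
    """A parent or child structure: a word plus its location list (hypothetical name)."""
    word: str
    locations: List[int] = field(default_factory=list)
    # Assumption: child structures are stored under the parent they describe.
    children: Dict[str, "WordEntry"] = field(default_factory=dict)

    def add_location(self, granule_no: int) -> None:
        # Granules are visited in ascending order, so checking the last
        # entry is enough to avoid storing the same granule twice.
        if not self.locations or self.locations[-1] != granule_no:
            self.locations.append(granule_no)


def build_word_list(
    granules: List[str],
    attributes_of: Callable[[str], List[str]],
) -> Dict[str, WordEntry]:
    """Build a word list from granules numbered 1..n (numbering is an assumption)."""
    word_list: Dict[str, WordEntry] = {}
    for granule_no, text in enumerate(granules, start=1):
        for raw in text.split():
            word = raw.strip(".,;:!?\"'()").lower()
            if not word:
                continue
            # Parent structure: one per distinct word, with its location list.
            parent = word_list.setdefault(word, WordEntry(word))
            parent.add_location(granule_no)
            # Child structures: one per child word (attribute) of this word.
            # Each stores the location of the granule containing the parent word.
            for child_word in attributes_of(word):
                child = parent.children.setdefault(child_word, WordEntry(child_word))
                child.add_location(granule_no)
    return word_list


def toy_attributes(word: str) -> List[str]:
    """Hypothetical stand-in for a linguistic analyzer; returns child words."""
    table = {"bank": ["root:bank"], "banks": ["root:bank"], "ran": ["root:run"]}
    return table.get(word, [])


granules = ["The bank opened.", "She ran to the banks.", "The bank closed."]
index = build_word_list(granules, toy_attributes)

print(index["bank"].locations)                        # [1, 3]
print(index["bank"].children["root:bank"].locations)  # [1, 3]
print(index["ran"].children["root:run"].locations)    # [2]
```

In this sketch the child structure for a given word always carries the same granule locations as its parent, because the location stored in the child is that of the granule containing the first word. A practical indexer would replace toy_attributes with whatever linguistic, phonetic or contextual analysis supplies the child words.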
Still other aspects of the present invention will become apparent to those skilled in the art from the following description of a preferred embodiment, which is, by way of illustration, one of the best modes contemplated for carrying out the invention. As will be realized, the invention is capable of other different and obvious aspects, all without departing from the invention. Accordingly, the drawings and descriptions are illustrative in nature and are not restrictive.