1. Field of the Invention:
This invention relates to the field of indexing digital data prior to archival storage thereof, the indexing technique facilitating later retrieval of the data from archival storage by the use of a binary search. This invention has particular utility in the storage and retrieval of static digital data; i.e., digital data that is not updated or changed after creation and archival storage thereof.
2. Description of the Related Art:
Archive and report distribution systems generally provide indexed access to both digitally stored statement data and digitally stored report data. Statement data is typically indexed on 1 to 5 fields that occur at the beginning of each statement page. Statement data is exemplified by bills and invoices. Statement index examples are name, account number, and date. Report data is typically indexed on 1 to 5 fields that occur on each row, line or record of each report. Report data is exemplified by freight bills, remittance data, and listings. Report index examples are check number, account number, and date. Depending upon the number of rows per report page, storage of report data and its index may consume 50 to 100 times more index storage overhead per page than does the storage of statement data and its index. Since a typical report may contain over 1,000,000 pages, the report index overhead cost can be significant.
Using conventional relational database techniques to digitally store report data and its index often requires more disk space for storing the index than is used for storing the report data itself. Since the general purpose of a statement/report archive system is to store large volumes of statement/report data on low cost optical disks, and to store the statement/report indexes on higher cost magnetic disks, the use of relational database techniques is not cost efficient. In addition, relational database techniques generally provide for the possibility of record insertion and deletion, and these insert/delete functions are not used relative to static statement and report data that does not change once the data is created.
A number of known solutions exist for this high storage overhead problem. One such solution is the IBM/R/DARS product wherein multiple versions of the report are stored, each version being sorted by a field of the report that can be later used for data retrieval. In this approach, a relational database is used to store an index of every 100 or so report pages. The retrieval system then uses a relational database search query that resolves a search key to a 100 page group, this being followed by a sequential search of the data within that 100 page group. While this approach is more efficient in terms of storage than is a fully relational database technique, it has several drawbacks: it involves storing a complete copy of the report for each field that can be used for retrieval; it relies on a significant amount of relatively slow sequential searching; it involves numerous CPU intensive search key comparisons; and it is not well suited to multiple key search queries, such as the search query name=Smith, account number=123-456, and date=01/05/94.
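For illustration only, the two-stage lookup of this prior approach (resolve a key to a page group, then search that group sequentially) can be sketched as follows. This is a minimal sketch and not the cited product's implementation: the names `GROUP`, `build_group_index`, and `lookup` are hypothetical, an in-memory sorted list stands in for the relational database, each page is reduced to one (key, payload) pair, and keys are assumed to be unique.

```python
from bisect import bisect_right

GROUP = 100  # pages per group, following the "100 or so" figure in the text


def build_group_index(pages):
    """Coarse index: the first key of every GROUP-page block.

    `pages` is a list of (key, payload) pairs already sorted by key;
    it stands in for one sorted copy of the report (assumption).
    """
    return [pages[i][0] for i in range(0, len(pages), GROUP)]


def lookup(pages, group_index, key):
    """Resolve `key` to a page group, then scan that group sequentially.

    Returns (hits, comparisons) so the cost of the sequential stage is visible.
    """
    g = bisect_right(group_index, key) - 1  # last group whose first key <= key
    if g < 0:
        return [], 0  # key sorts before every page in the report
    hits, comparisons = [], 0
    for k, payload in pages[g * GROUP:(g + 1) * GROUP]:
        comparisons += 1
        if k == key:
            hits.append(payload)
            break  # keys are assumed unique, so the scan can stop
        if k > key:
            break  # passed the position where the key would be
    return hits, comparisons
```

Under these assumptions, a key at the end of a group costs on the order of 100 comparisons after the group has been resolved, and a separate sorted copy of `pages` would be needed for every searchable field.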
U.S. Pat. No. 5,303,361 is of general interest in that it describes a digital search and retrieval system wherein an index file is created, this index file representing the approximate position and relative frequency of every word in every file on a given storage unit. Later, when searching for a word, search of the index ranks the files based upon the relative strength of match with the search request. This index comprises distinct word records that include a unique digital representation for each word along with one or more file records that include a file code for each file, a density code indicating the relative frequency of occurrence of each word in a file, and a position code indicating the approximate location of the word within a file. When two or more words are included in a search request, the rating is based, in part, upon a combination of the words' density fields, and on whether the multiple words appear in approximately the same location in the file based upon the position fields of the words. The index file of this patent utilizes a random 4-byte hashing code for each data file word, and does not teach use of a binary search technique, as in the present invention. In addition, search of the index file of this patent is a hash table sequential search, and this patent does not teach a binary search with optimized resolution of multiple search keys.
U.S. Pat. No. 5,237,678 describes a system for storing and manipulating information in an information base wherein records in an information base comprise one or more fields that have an orderable value, meaning that the fields have a value that is capable of being evaluated and being placed in some order in relation to the value of the field for other records in the information base. This may include numbers, characters of the alphabet, symbols, codes, etc. Topographic maps of these fields of information are stored for use by an output subsystem query, this query being a reference to the information on the basis of a specification of the values of one or more fields. The topographic maps of the fields referenced in the specification are then retrieved and manipulated in accordance with the query, the end result being one or more output maps indicating information base records which do meet the specifications, may meet the specifications, and do not meet the specifications. This patent teaches sequential search of data once range inclusion is determined, rather than the use of binary search techniques as in the present invention.
Published European Patent Application 0 583 108 A2 describes an entity-relation database wherein a plurality of entity or data-receiving fields contain arrays of data elements, the data elements being related to each other in predefined sets, each predefined set including data elements in two data-receiving fields that are called key fields and item fields. Key fields contain an array of data entries each of which is unique; for example, a list of the serial numbers assigned to articles. Key fields are sorted or indexed as entries are made into the field. Thus, the entries of a key field form an ordered array similar to a flat file that can be searched using a binary search process to locate the desired entry.
While prior devices as exemplified above have been generally useful for their limited intended purposes, the need remains for a method and apparatus for storing and retrieving digital data wherein an ordered index file is created for the data, each index file containing a series of multi-byte offsets into the data (described herein are 4-byte offsets that are capable of addressing up to approximately 4 billion characters, but the spirit and scope of the invention is not to be limited thereto since 2-byte, 4-byte and 8-byte offsets are all of similar utility), each offset pointing to a field within a row of the data; wherein the total index overhead is minimized by storing only the data offsets and using the data from the statement or report itself for comparison; wherein upon retrieval a binary search is performed for a key that is contained in a search query, using the index field offsets to determine the order in which to compare fields in the report data, the binary search resolving each field in the search query to a range of rows that match the query; wherein the search technique optimizes the final filtering of matches by using the search query field that matched the smallest range as the controlling field for resolving overlap of fields in the search query; and wherein a range check is performed before the binary search to determine whether the search key is outside the range of keys that are in the index.
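By way of illustration, the offset-only index, the range check, the binary search, and the smallest-range filtering described above can be sketched as follows. This is a minimal sketch under stated assumptions and not a definitive implementation: the function names (`build_index`, `find_range`, `search`), the fixed-width row layout, the little-endian packing, and the requirement that each search key be padded to the full field width are all hypothetical choices made for the example.

```python
import struct


def build_index(data, row_len, field_start, field_len):
    """Return packed 4-byte offsets of one field, ordered by the field's value.

    Only offsets are stored; the field values stay in the report data itself.
    Fixed-width rows of `row_len` bytes are an assumption of this sketch.
    """
    offsets = [row + field_start for row in range(0, len(data), row_len)]
    offsets.sort(key=lambda off: data[off:off + field_len])
    return struct.pack("<%dI" % len(offsets), *offsets)


def _key_at(data, index, slot, field_len):
    """Read the field value that index slot `slot` points to."""
    (off,) = struct.unpack_from("<I", index, 4 * slot)
    return data[off:off + field_len]


def find_range(data, index, field_len, key):
    """Half-open range [first, last) of index slots whose field equals `key`."""
    n = len(index) // 4
    # Range check before the binary search: key outside the indexed keys.
    if n == 0 or key < _key_at(data, index, 0, field_len) \
            or key > _key_at(data, index, n - 1, field_len):
        return (0, 0)
    lo, hi = 0, n
    while lo < hi:  # binary search for the first slot with value >= key
        mid = (lo + hi) // 2
        if _key_at(data, index, mid, field_len) < key:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, n
    while lo < hi:  # binary search for the first slot with value > key
        mid = (lo + hi) // 2
        if _key_at(data, index, mid, field_len) <= key:
            lo = mid + 1
        else:
            hi = mid
    return (first, lo)


def search(data, row_len, fields, query):
    """Return the row numbers that match every field in `query`.

    fields: name -> (index, field_start, field_len); query: name -> key bytes.
    """
    if not query:
        return []
    ranges = {name: find_range(data, fields[name][0], fields[name][2], key)
              for name, key in query.items()}
    # The field that matched the smallest range controls the final filtering.
    ctrl = min(ranges, key=lambda name: ranges[name][1] - ranges[name][0])
    index, start, _ = fields[ctrl]
    rows = []
    for slot in range(*ranges[ctrl]):
        (off,) = struct.unpack_from("<I", index, 4 * slot)
        row = off - start  # byte offset of the start of this row
        if all(data[row + fields[n][1]:row + fields[n][1] + fields[n][2]] == k
               for n, k in query.items()):
            rows.append(row // row_len)
    return sorted(rows)
```

In this sketch each index costs 4 bytes per row regardless of field width, and the remaining query fields are checked directly against the report data only for the rows in the controlling field's range.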