As set forth in U.S. Pat. No. 4,358,824, state-of-the-art document storage and retrieval is based on manually selecting keywords to represent a document in the system's catalog or index, and then effecting retrieval by recalling appropriate keyword terms from memory and searching the index, either automatically or manually, for an "appropriate" level of match against the prestored keywords. Procedures have been developed in the prior art for abstracting documents and retrieving them based on keyword matching. Each document or record has a unique identifier or key. Scanning a file of records to retrieve one particular document or record requires comparing its keyword or key with the key of one record after another. When a key match is obtained the record can be retrieved. This type of search requires a careful organization of the records.
Direct addressing involves assigning each record to a specific location which is large enough to contain one or more records and which can be thought of as a bin or bucket. Since locating a record by its assigned address usually requires much less time than a scan or other type of search, direct addressing is preferred as the most rapid means of accessing a single record in a file.
Although preferred, the process of transforming a document keyword or record key to a corresponding main or external memory storage address, thought of as the bin or bucket location where the record can be found, is subject to certain disadvantages. Clusters and gaps frequently occur because of the manner in which addresses are assigned. Accordingly, the memory or bin address is frequently derived by means of a key conversion or transformation that randomizes the address. This key-to-address transformation is intended to disperse the clusters, making the distribution of storage addresses more nearly uniform, and hence is known as hashing or randomizing. Ideally, a key transformation technique should yield a unique address for each document or record and 100 percent utilization of the allocated memory storage space. The distribution should be uniform, and all the storage spaces should be filled.
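By way of illustration only, a key-to-address transformation of the general kind discussed above may be sketched as follows. The sketch assumes the division-remainder method, a common textbook hashing technique. It is not the transformation of this invention or of any cited reference, and the names key_to_address, utilization and num_bins are hypothetical.

```python
def key_to_address(key: str, num_bins: int) -> int:
    """Map a record key to a bin address in the range 0 .. num_bins - 1.

    Assumption: division-remainder hashing. The key's bytes are read as one
    large integer, and the remainder after division by num_bins is the address.
    """
    if num_bins <= 0:
        raise ValueError("num_bins must be positive")
    value = int.from_bytes(key.encode("utf-8"), "big")
    return value % num_bins


def utilization(keys, num_bins: int) -> float:
    """Fraction of the allocated bins that receive at least one record."""
    used = {key_to_address(k, num_bins) for k in keys}
    return len(used) / num_bins


# The single-character key "A" is byte value 65, and 65 mod 7 is 2.
print(key_to_address("A", 7))  # 2

# Hypothetical keys; a utilization below 1.0 means some bins stay empty
# while others receive more than one record.
sample_keys = ["A1001", "A1002", "A1003", "B2001", "B2002", "C3001", "C3002"]
print(utilization(sample_keys, 7))
```

A utilization of 1.0 with exactly one record per bin corresponds to the ideal described above; in practice a simple transformation of this kind seldom achieves it.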
Unfortunately, neither complete randomization nor a completely uniform distribution results when keys are converted to addresses by the usual random conversion transformations or hashing techniques. Rather, known key-to-address transformations attempt, with limited success, to produce addresses intermediate between random and uniform. The results are often unpredictable and frequently include undesirable overflows, that is, the assignment of more records to a location or bin than it can hold.
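The notion of overflow can be made concrete with a short sketch. It assumes that every bin has the same fixed capacity and that the bin addresses have already been produced by some key-to-address transformation. The name count_overflows and the sample addresses are hypothetical.

```python
from collections import Counter


def count_overflows(addresses, bin_capacity: int) -> int:
    """Count the records assigned to a bin beyond what that bin can hold.

    addresses    -- one bin address per record, as produced by a transformation
    bin_capacity -- number of records each bin can hold (assumed equal for all bins)
    """
    load = Counter(addresses)  # records assigned to each bin
    return sum(max(0, n - bin_capacity) for n in load.values())


# Bin 0 receives three records but holds only two, so one record overflows.
print(count_overflows([0, 0, 0, 1, 2, 2], bin_capacity=2))  # 1
```

Such a count, taken over the addresses a given transformation produces for a real file, is one simple way to compare transformations.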
The storage and retrieval of information is the subject of various patents such as U.S. Pat. Nos. 3,350,695, 3,614,744, 3,681,781, 3,702,010 and 4,079,447. However, of these, only 3,681,781 pertains to hash addressing or hashing. That invention is not directed to the distribution of addresses, but to techniques for hashing which allow information to be found from approximate key values. For more specific descriptions of hashing techniques, such articles as "Key-to-Address Transform Techniques: A Fundamental Performance Study on Large Existing Formatted Files", V. Y. Lum et al., Communications of the ACM, April 1971, and "Hashing Functions", G. D. Knott, The Computer Journal, Vol. 18, No. 3, 1975 are noted. The Art of Computer Programming, Vol. 3: Searching and Sorting, D. E. Knuth, 1973, Addison-Wesley also contains pertinent material. As described in these sources, various key-to-address transformations have been developed in the form of conversions which randomize the addresses. Since the object of such transformations is statistical, the terms randomization and hashing have become associated with them.
Obviously, no single transformation method can satisfy all of the speed, uniformity and simplicity requirements. A great deal of effort, with some remarkable results, has been expended in developing transformations producing a high degree of uniformity in the distribution of records throughout their storage space. However, in many cases the conversion method was tailored to the characters, range and length of the keys. Therefore, such methods lack generality. In accordance with the practice of this invention, a transformation or hashing process is provided which not only leads to a greater degree of randomness than those heretofore known, but which does not lack generality. It is effective for both static and volatile files, and it works well with all types, ranges, and lengths of keys. The procedure need not be modified for keys with irregular lengths, regular or irregular separations, and different sets of characters. Hence this transformation method can be used for all files and different sets of keys at one or many data processing installations. In effect it takes whatever order may or may not exist in the keys and produces a state of chaos or randomization. The numbers of records which will be assigned to memory locations or bins can be calculated before the transformation procedure is actually used. Thus file space can be properly allocated, overflow problems minimized, and other file utilization characteristics optimized.
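The manner of this advance calculation is not stated at this point. As one hedged illustration, assume an idealized transformation that assigns each record to a bin independently and uniformly at random. The number of records in any one bin is then binomially distributed, and the expected number of bins holding exactly k records, as well as the expected overflow, can be computed before any record is stored. The function names below are hypothetical, and the random-assignment model is an assumption rather than a description of the invention.

```python
from math import comb


def expected_bins_with(k: int, num_records: int, num_bins: int) -> float:
    """Expected number of bins receiving exactly k records.

    Assumption: each record lands in each bin with probability 1/num_bins,
    independently of every other record (binomial model).
    """
    p = 1.0 / num_bins
    prob_k = comb(num_records, k) * p**k * (1.0 - p) ** (num_records - k)
    return num_bins * prob_k


def expected_overflow(num_records: int, num_bins: int, bin_capacity: int) -> float:
    """Expected number of records exceeding bin capacity, summed over all bins."""
    return sum(
        (k - bin_capacity) * expected_bins_with(k, num_records, num_bins)
        for k in range(bin_capacity + 1, num_records + 1)
    )


# Two records, two bins, one record per bin: both records share a bin half
# the time, so the expected overflow is 0.5 records.
print(expected_overflow(2, 2, 1))  # 0.5

# Expected number of empty bins for 100 records in 50 bins (hypothetical sizes).
print(expected_bins_with(0, 100, 50))
```

Under this model, varying num_bins and bin_capacity before the file is loaded shows how much storage must be allocated to keep the expected overflow acceptably small.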