The present invention relates to managing data on computer systems. More specifically, the present invention relates to the retrieval, insertion, or update of data records in a particular file within a set of related files.
Data managed by data processing systems is growing at a phenomenal rate. This is due to many factors, including the trend of placing new types of data under the control of computers, such as e-mail, multimedia, photos, music and video, as well as the proliferation of data generated by the mass migration of new applications to the Internet. In addition, there is typically a need for data owners to retain and preserve data for long periods of time, if not indefinitely. Accordingly, this explosion of new data is typically not displacing other data but is generally cumulative to most of the data previously generated.
Another form of data growth is due to the creation of multiple related files, where each related file within a set of related files contains information related to, or corresponding to, information in other files within the set. These additional related files are motivated by one of several possible needs. First, data records in a file may need to be rearranged to achieve a physical clustering of the data records that are frequently accessed together. These data records may have been physically clustered together initially, but as new data records are inserted and existing data records are deleted or updated, the adjacency characteristic may be greatly diminished over time. Reorganizing these data records, by physically clustering them close together, will minimize hardware delays when these records are retrieved together, thereby achieving enhanced Input/Output (I/O) performance and improved response time for the user of the data processing system.
Typically, this type of reorganization may take long periods of time and users of the data processing system may not be able to wait for this operation to complete before accessing needed data records. Therefore, advanced data processing systems may provide for concurrent access to data simultaneous with a data reorganization operation. To achieve this feature, data records may be replicated from one file to a related file such that during the reorganization operation two related files exist and serve to satisfy data access requests during the ongoing reorganization. Multiple related files may exist for other reasons as well. Another example of utilizing multiple related files is multiple version support. When data records are updated, it is sometimes required that the previous versions of the record be made available for access as well as the most recently created version. Multiple related files may be used to achieve this type of functionality as well. Prior art data processing systems create an additional index for each related file such that each file in the set of related files has its own independent index. Alternatively, prior art computer systems may use a single index; however, in this case the single index is comprised of index entries that include additional information in order to identify a particular one of a set of related files to be used.
Exponential rates of data growth present problems related to storage capacity and performance. Users of computer systems demand timely responses to their queries, independent of whether there are a few thousand data records or billions of data records. Frequently, hardware capabilities within an enterprise lag behind these user requirements, placing an ever larger burden upon the software engineer to stretch the capabilities of the enterprise's hardware resources. Accordingly, computer software engineers look for ways to use the hardware capacity of existing storage devices as efficiently as possible. Example storage devices include magnetic disk, magnetic tape, electronic flash memory, optical devices, etc. Many techniques are known in the art for extending the logical capacity of storage devices. For example, data compression algorithms such as Lempel-Ziv-Welch (LZW) are commonly used to increase the amount of data that can be stored on a fixed capacity storage device. Indexes stored on storage devices are also compressed by utilizing various algorithms to compress each key field within the index. These and many other algorithms and programming techniques are known in the art for increasing the logical capacity and improving the performance characteristics of various storage devices; however, to keep pace with insatiable storage demands, even more techniques are needed.
Indexes are of special significance to a software engineer because, in addition to consuming space on a storage device, they are closely tied to the performance capabilities achievable by a given data processing system. For example, if each index entry within an index is consuming extra space, then that index may increase in the number of index levels. Indexes are typically hierarchical tree structures and the number of index levels refers to the number of levels within the index hierarchy beginning with the root node and ending with the final leaf node. Searching an index to find a specific data record typically involves reading one index record from each level of the index, so extra index levels within the index may result in a significant degradation of performance during searching operations of the data processing system.
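The relationship between index entry size and the number of index levels can be illustrated with a rough calculation. The following is a minimal sketch only, assuming a simplified tree in which every index record (node) is completely full and holds as many fixed-size entries as fit in one page; the function name `index_levels`, the 4096-byte page size, and the 16-byte and 32-byte entry sizes are hypothetical values chosen for illustration and do not come from any particular system.

```python
def index_levels(num_records: int, page_size: int, entry_size: int) -> int:
    """Estimate the number of levels in a fully packed hierarchical index.

    Assumption: each index record occupies one page and holds
    page_size // entry_size entries (the fan-out).
    """
    fanout = page_size // entry_size
    if fanout < 2:
        raise ValueError("each index record must hold at least two entries")

    levels = 1
    capacity = fanout  # data records reachable through a one-level index
    while capacity < num_records:
        capacity *= fanout  # each added level multiplies the reach by the fan-out
        levels += 1
    return levels


if __name__ == "__main__":
    one_billion = 1_000_000_000
    # Compact entries: 256 entries per index record.
    print(index_levels(one_billion, page_size=4096, entry_size=16))  # prints 4
    # Entries widened to carry extra file identification: 128 per index record.
    print(index_levels(one_billion, page_size=4096, entry_size=32))  # prints 5
```

Under these assumptions, doubling the entry size adds one index level, and therefore roughly one additional index record read per search.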
Accordingly, there is a need for even more innovative ways for an enterprise to cope with exploding data growth. In an environment where multiple related files are present, it is desirable to increase logical data storage capacity by providing additional software solutions for storing more data in the same amount of space. It is highly desirable to eliminate extra indexes for related files such that a single index is utilized, thereby substantially reducing the space used for the collection of all related files. Further, there is a need to enhance search performance and reduce user response time in this environment by minimizing the size of index entries within an index and, accordingly, minimizing the number of index levels to be traversed during search operations.
To overcome the limitations in the prior art briefly described above, the present invention provides a method, computer program product, and system for utilizing storage efficiently, and improving search performance, in an environment comprising a plurality of related files. Specifically, the invention utilizes a fuzzy data record pointer ("fuzzy", as used herein, means that the data record pointer need not be coincident with the actual data record address) for identification of both a target file and a target data record within the target file.
A target data record is accessed from a target file, selected from a set of N related files, utilizing a fuzzy data record pointer (hereinafter referred to as the data record pointer). The modulus, that is, the remainder of the data record pointer divided by N, is computed. This modulus value is utilized to select the target file. The data record pointer and the modulus are also used to compute a data record address for the target data record.
In this manner, a data record pointer is utilized to determine both the target file from a set of N related files and the target data record within the target file to be accessed. This novel technique facilitates the use of a single index for searching a plurality of related files. Further, individual index entries within the index are not expanded to accommodate the additional target file identification information. Accordingly, computer storage is used more efficiently and search performance is improved.
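The pointer decoding summarized above can be sketched as follows. This is an illustrative sketch only, not the claimed implementation: the text states that the pointer and the modulus are used to compute the data record address but does not give the formula here, so the sketch assumes that actual data record addresses are always multiples of N and that the address is recovered by subtracting the modulus from the pointer. The function names `make_pointer` and `resolve_pointer` are hypothetical.

```python
from typing import Tuple


def make_pointer(record_address: int, file_index: int, n_files: int) -> int:
    """Encode a target file into a fuzzy data record pointer.

    Assumption (not stated in the text): record addresses are multiples of
    n_files, so the low-order remainder is free to carry the file index.
    """
    if record_address % n_files != 0:
        raise ValueError("record address must be a multiple of n_files")
    if not 0 <= file_index < n_files:
        raise ValueError("file index must be in the range 0..n_files-1")
    return record_address + file_index


def resolve_pointer(pointer: int, n_files: int) -> Tuple[int, int]:
    """Decode a fuzzy data record pointer into (file_index, record_address)."""
    file_index = pointer % n_files         # the modulus selects the target file
    record_address = pointer - file_index  # assumed address computation
    return file_index, record_address


if __name__ == "__main__":
    n_files = 4
    pointer = make_pointer(record_address=4096, file_index=2, n_files=n_files)
    print(pointer)                            # prints 4098
    print(resolve_pointer(pointer, n_files))  # prints (2, 4096)
```

Because the file selection is carried in the remainder of a value that the index entry already stores, the index entry in this sketch needs no additional field to identify the target file.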