1. Field of the Invention
The present invention relates to a novel method for storing digital data in a storage system comprising a plurality of data storage devices and for managing the storage of digital data among the storage devices to achieve optimum utilization of the data storage devices in the system.
2. State of the Art
Devices used in storing digital data in computer systems have the capability of storing sets of data and the means to access the data sets. Depending upon the sophistication of the technology and design of the individual storage devices, each device can sustain a certain number of accesses each second. Each device is also capable of storing a prescribed amount of data. A significant problem in the storage of digital data on direct access storage devices is characterized as "device bottleneck". That is, when a storage device receives access requests from a processor at a faster rate than the device can sustain, the processor is forced to wait. This results, in many cases, in a catastrophic degradation of throughput, i.e. the rate at which data records are stored on or retrieved from data sets on the storage device.
The growth and diversification of data processing systems has resulted in the need for storing data collections having a wide range of access characteristics. At one extreme, it is necessary to access the data as much as 100 or more times per second per megabyte of data stored, while at the other extreme, the data is stored only in case it is needed for reference. When closed, a data collection is not accessed at all. If that data collection occupies space on a high performance storage device which is inactive, the use of that storage capability is inefficient and the cost is high. Conversely, if data requires an access frequency beyond the capability of the device on which it is stored, there is a significant degradation of the data processing system in terms of response time and/or throughput, reducing the cost performance of the system.
Another characteristic of the use of a data collection is that it is opened during certain periods of time and can be referenced by more than one application program sharing the same data collection. Because the current techniques of data collection placement are unable to effectively consider the time domain of use, there is a significant skew in the utilization of the components of the data storage system such as the channel, controller, and storage devices. This causes degradation of the data processing system performance and the inefficient use of the components of the system.
In data storage systems, the storage devices are connected to the central computer through controllers and channels. While the principal function of the channel and controller is to execute control logic, they also provide the path over which the data is transmitted. The processing power of these two components determines the number of control sequences that can be executed in a given unit of time. Because the demand for data is not a synchronous process, queuing characteristics become apparent. There is an accelerated lengthening of wait time and a rapid increase in queue lengths when utilization goes beyond 30%.
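The nonlinear growth of wait time with utilization can be illustrated with a simple queuing calculation. The sketch below assumes a single-server M/M/1 queue (random arrivals, exponentially distributed service times); this model is an assumption chosen for illustration and is not stated in the text above. The function name `mm1_wait_in_queue` and its parameters are hypothetical.

```python
def mm1_wait_in_queue(utilization: float, service_time: float = 1.0) -> float:
    """Mean time a request waits in queue before service begins.

    Assumes an M/M/1 queue (an illustrative assumption, not taken from the
    text): mean wait = utilization / (1 - utilization) * service_time.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in the range [0, 1)")
    return utilization / (1.0 - utilization) * service_time


if __name__ == "__main__":
    # Print the mean wait, in multiples of the service time, at several
    # utilization levels to show how quickly the wait grows.
    for rho in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(f"utilization {rho:.0%}: mean wait = {mm1_wait_in_queue(rho):.2f} service times")
```

Under this assumed model, the mean wait equals one full service time at 50% utilization and nine service times at 90% utilization, which is consistent with the accelerating growth described above.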
One strategy implemented at some computer installations for managing device bottleneck is to have a person, usually a data base administrator, manually examine the device (that is, scan the contents of the device), and select for removal to other devices sufficient data sets so that accesses to the data sets remaining on the device will not exceed the device capability. Another strategy for managing device bottleneck implemented by some data base administrators is to allocate to a data set much more physical space on a device than is required. The space allocated in excess of requirements is, therefore, not written with data by this or other programs, with a resulting decrease in access demand on the device from other applications which would, except for the excess allocation, be accessing additional data on the device.
When a plurality of data storage devices are available, having different access characteristics (such as disk and tape storage devices), it is known to allocate frequently used data to the disks which are faster and to allocate less frequently used data to the slower devices. A further approach to the problem of bottlenecks provides an interface unit, including buffer storage and a control processor, which queues data for transfer with respect to a central processor and a plurality of storage devices of varying characteristics.
Generally, however, the storage of data is defined by the user/operator of the system. The user/operator prescribes where the data will be placed. Unable to predict the use of data with sufficient precision, the user/operator frequently places data in such a way that the storage device is unable to provide the required number of accesses during periods of peak demand by the central processor, resulting in excessive response times and overall system degradation. This results in considerable effort on the part of the user to tune his system or, in other words, to rearrange his stored data to relieve the excessive access burden on a particular device or set of devices. In the case where the user has several different kinds of storage devices that have different access rates and/or storage capacities, the user will attempt, on an intuitive basis, to place the data sets on devices that more nearly provide the performance required, and when using devices that are not well matched to this requirement, the user will over-allocate space in an attempt to assure a sufficient access rate to a particular data set. Thus, even though placement is not effectively accomplished, a great deal of effort by highly skilled personnel is required to place data sets and to monitor system performance. If data storage capacity is not used effectively (resulting in wasted storage capacity and/or access capability), the data storage system generally operates in a degraded mode with human intervention occurring when degradation becomes intolerable, and data sets frequently are not placed on devices that most nearly meet the access and storage characteristics of the data.
Furthermore, data sets are generally placed on storage devices without sufficient understanding of the time when the data will be used. As a result, the data being used at any given time may be concentrated on one storage device or collection of storage devices being controlled by a single controller and/or a single channel. This significant skewing of the utilization of storage system components such as the storage devices, controllers, channels, and data paths results in significant wait times and the attendant lower performance of the data processing system.
As a result of the inability to place data sets effectively, it is not uncommon to observe average channel utilization of 15%, controller utilization of 5% to 10% and data storage device access utilization of 5% to 7%. This evidences an inefficient mix of data storage system components and the resultant cost to store the data is significantly higher than is warranted.
There are many factors that can vary the work load of a data storage system, such as the introduction of additional applications, a change in external procedures which alters times of processing, introduction of increased processing capability, a modification in the number of instructions executed per access in an existing application, etc. These influences occur in real time and cause significant changes in access requirements and component utilization that are not detected until a serious problem develops. This lack of system responsiveness to the changing demands requires significant time and effort to analyze the problem and to attempt a solution which at present is at best a cut-and-try effort.
In my previous U.S. Pat. No. 4,607,346, the entire contents of which are incorporated herein by reference, a method is provided for operating computing apparatus to automatically allocate data sets among storage devices to minimize system degradation by reducing device bottlenecks. In accordance with my previous patent, the access density of each data set being stored is determined, wherein access density is defined as the number of accesses per unit time to the particular data set divided by the volume of that data set. The access density of each of the storage devices in the system is calculated, wherein the access density is the number of accesses that the particular device can sustain per unit of time divided by the data storage capacity of the device. Data sets are then allocated and reallocated as a result of continuous monitoring of the storage characteristics of the data sets and the storage devices such that the data sets are stored on a storage device having an access density most nearly matching the access density of that data set.
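The access density definitions above can be sketched in a few lines of code. The following is a minimal illustration only, not the procedure of the referenced patent: the class names `DataSet` and `Device`, the function `closest_device`, and all sample figures are hypothetical, and the sketch ignores remaining free space and the continuous monitoring and reallocation described above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DataSet:
    name: str
    accesses_per_sec: float  # observed accesses per unit time
    size_mb: float           # volume of the data set

    @property
    def access_density(self) -> float:
        # Accesses per unit time divided by the volume of the data set.
        return self.accesses_per_sec / self.size_mb


@dataclass
class Device:
    name: str
    max_accesses_per_sec: float  # accesses the device can sustain
    capacity_mb: float           # data storage capacity of the device

    @property
    def access_density(self) -> float:
        # Sustainable accesses per unit time divided by storage capacity.
        return self.max_accesses_per_sec / self.capacity_mb


def closest_device(data_set: DataSet, devices: Sequence[Device]) -> Device:
    """Return the device whose access density most nearly matches the data set's."""
    return min(devices, key=lambda d: abs(d.access_density - data_set.access_density))


if __name__ == "__main__":
    # Hypothetical devices and data sets, for illustration only.
    devices = [
        Device("fast_disk", max_accesses_per_sec=50.0, capacity_mb=500.0),
        Device("slow_tape", max_accesses_per_sec=2.0, capacity_mb=2000.0),
    ]
    for ds in (DataSet("orders", 20.0, 100.0), DataSet("archive", 0.1, 400.0)):
        print(ds.name, "->", closest_device(ds, devices).name)
```

In this hypothetical example the frequently accessed data set is matched to the device with the higher access density and the rarely accessed data set to the device with the lower one. Each data set is matched independently, so nothing in the sketch balances load across the group of devices.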
The method of my previous U.S. Pat. No. 4,607,346 does a good job of managing files on a single storage device or on individual storage devices that are part of a group of such storage devices used in combination with the computer apparatus. However, there is no provision for characterizing the unused capacity of a storage device or for managing a plurality of such storage devices in such a manner that balancing of requirements, allocations and characteristics of all the storage devices as a group is achieved. Thus, while the method of my previous U.S. Pat. No. 4,607,346 may achieve good utilization of an individual storage device, it may overwork individual storage devices, especially at high use periods in the operational day, and it fails to fully utilize other storage devices to the best advantage of the storage system, i.e., the entire group of storage devices.
Considerable effort is expended by highly skilled personnel to place data on the storage devices such that device bottlenecks can be avoided and/or the data store can sustain a reasonable throughput. In spite of the effort applied, the problems of system bottleneck and system queuing remain.