Computer clusters, or groups of linked computers, have been broadly used to improve performance over that provided by a single computer, especially in extended computations such as simulations of complex physical phenomena. As shown in FIG. 1, in a computer cluster 10, computer nodes (also referred to herein as client nodes) 12 are linked by a high speed network 14 which permits the sharing of computer resources and memory. Data transfers to or from the computer nodes 12 are performed through the high speed network 14 and are managed by additional computer devices, referred to herein as File Servers 16.
The computer cluster 10 may assume either the compute state (or compute cycle) or the input/output (I/O) state (or I/O cycle), which are typically mutually exclusive. The process of moving data is carried out during the I/O cycle of the computer nodes 12, i.e., the data transfers are executed during time intervals when the computing activity has ceased.
In general, simulations of physical phenomena, and other complex computations, may run for an extended period of time lasting, for example, hours or even days. During the execution of a simulation, “checkpoint” data is written into the program so that, if the application software or hardware fails, the simulation may be restored from the “checkpoint”. The “checkpoint” changes the state of the computer cluster from the compute state to the I/O state in order to “write” the cache of the computer nodes to the attached File Servers, which place the data in an orderly file system for subsequent retrieval. Since no actual computations occur during the I/O cycle, it is important to keep the I/O cycle as short as possible to maximize the overall compute duty cycle of the computer cluster.
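The effect of the I/O cycle on the compute duty cycle can be illustrated with a short calculation. The following is a minimal sketch only; the function name and the example durations are hypothetical and are not taken from this description. It assumes that the compute and I/O cycles are strictly mutually exclusive, so that the duty cycle is the compute time divided by the sum of the compute time and the I/O time.

```python
def compute_duty_cycle(compute_seconds: float, io_seconds: float) -> float:
    """Fraction of total time the cluster spends computing.

    Assumes the compute cycle and the I/O cycle never overlap.
    """
    if compute_seconds < 0 or io_seconds < 0:
        raise ValueError("durations must be non-negative")
    total = compute_seconds + io_seconds
    if total == 0:
        raise ValueError("total time must be positive")
    return compute_seconds / total


# Hypothetical figures: 3600 s of computation between checkpoints,
# 400 s to write each checkpoint.
baseline = compute_duty_cycle(3600, 400)   # 0.9
# Halving the checkpoint write time raises the duty cycle.
improved = compute_duty_cycle(3600, 200)
print(baseline, improved)
```

Under these assumed figures, any reduction in the checkpoint write time translates directly into a larger share of time spent on computation.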
The ratio of computer nodes to File Servers is often very large and may exceed 1000 in some implementations. The File Servers 16 file data from multiple computer nodes 12 and assign a unique location for each computer node in the overall file system. Typically, the data migrates from the File Servers 16 to be stored on rotating media such as common disk drives arranged in storage disk arrays 18.
Since the File Servers 16 satisfy the requests of the computer nodes 12 in the order that the requests are received, the disk drives appear to be accessed randomly, as multiple computer nodes may require access to the servers at random times. In this scheme, disk drives operate as “push” devices and store the data on demand. Disk drives are poorly suited to satisfying random requests, since the recording heads have to be moved to various sectors of the drive (aka “seeking”), and this “seeking” movement takes considerably more time than the actual “write” or “read” operation. To “work around” this problem, a large number of disk drives may be utilized that are accessed by a control system (storage controller) 19 which schedules disk operations in an attempt to spread the random activity over a large number of disk drives to diminish the effects of the disk head movement.
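The cost of servicing requests in arrival order can be illustrated with a toy model of head movement. This is a simplified sketch and not a description of any particular drive or controller: the function name, the track numbers, and the assumption that seek cost is proportional to the distance the head travels are all hypothetical. It compares the total head travel when requests are served in the order received with the travel when the same requests are served in position order.

```python
def head_travel(positions, start=0):
    """Total distance the head moves when visiting positions in the given order.

    Toy model: seek cost is assumed proportional to the distance travelled.
    """
    travel = 0
    head = start
    for position in positions:
        travel += abs(position - head)
        head = position
    return travel


# Hypothetical track numbers requested by different nodes, in arrival order.
arrival_order = [900, 10, 750, 40, 820, 5]

random_travel = head_travel(arrival_order)            # 4835
ordered_travel = head_travel(sorted(arrival_order))   # 900
print(random_travel, ordered_travel)
```

In this assumed example the arrival-order service moves the head more than five times as far as the ordered service, which is the kind of overhead that spreading requests over many drives attempts to mask.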
The size of computer clusters and the aggregate I/O bandwidth that is to be supported may require thousands of disk drives for servicing the computing architecture in order to minimize the duration of the I/O cycle. The I/O activity itself occupies only a short period of the overall “active” time of the disk system. Even though the duty cycle of write activity may occupy only a portion of the cluster's total operational time, all the disk drives nevertheless are powered in expectation of the I/O activity.
It would therefore be beneficial to provide a technique for migrating data between computer cluster architectures and disk drives which attains a shortened I/O cycle for high performance computer clusters and an effective aggregate I/O bandwidth of the disk drive operation, with a reduced number of disk drives activated for data storage, without excessive power consumption, and while maintaining data reliability as well as data integrity.