1. Field
The present invention relates to data staging between temporary storage and permanent storage, particularly but not necessarily exclusively in high-performance computing (HPC) systems. More particularly, the present invention provides a data staging management system and a method of managing data staging in a computer system.
2. Description of the Related Art
HPC systems or clusters are widely used nowadays for computer simulation in various fields. However, it is often the case that end users with applications (or “jobs”) to be executed have to wait a long time to acquire the necessary computing resources (“resources” here denoting CPUs, cores, memory and so forth), which means that computing resources are not efficiently used. One of the major causes of this problem is a process called “data staging”.
Batch job execution is often employed in HPC systems. A batch job consists of a predefined group of processing actions that require little or no interaction between users and the system. When a job is submitted by a user, it is placed in a queue where it waits until the system is ready to process the job. A job queue may contain many jobs, each specific job being made to wait until the system has processed jobs submitted earlier or having a higher priority. Since user interaction is not required, batch jobs can be accumulated during working hours and then execution can take place overnight. This form of execution can be contrasted with “interactive jobs” in which the user interacts with the system and user requests are acted on in real time.
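By way of illustration only, the queueing behaviour described above can be sketched as follows. The class name JobQueue, its methods and the numeric priority scale are hypothetical and are not taken from any particular job scheduler; the sketch merely assumes that a job with a higher priority is dispatched first, and that jobs of equal priority are dispatched in order of submission.

```python
import heapq
import itertools


class JobQueue:
    """Toy batch queue (hypothetical): highest priority first,
    earliest submission first among jobs of equal priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # records submission order

    def submit(self, name, priority=0):
        # heapq is a min-heap, so the priority is negated to pop the
        # highest-priority job first; the counter breaks ties.
        heapq.heappush(self._heap, (-priority, next(self._counter), name))

    def next_job(self):
        _, _, name = heapq.heappop(self._heap)
        return name


queue = JobQueue()
queue.submit("job_a")
queue.submit("job_b")
queue.submit("job_c", priority=5)  # submitted last, but higher priority

order = [queue.next_job() for _ in range(3)]
print(order)
```

In this sketch, job_c overtakes the two earlier jobs, while job_a and job_b keep their submission order.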
In job execution in an HPC cluster environment, it is common practice to transfer input data from a permanent storage area to temporary storage, whose I/O speed is very fast, before a job starts to run. Similarly, it is also common to transfer output data (that is, the results of executing the job) from temporary storage to a permanent storage area after a job terminates. This data transfer process is called “data staging” in the high performance computing field. Input and output data staging are referred to as “stage-in” and “stage-out” respectively.
FIG. 1 shows a system overview of an HPC system employing data staging. A workstation 1 of an end user is connected via a network 5 to an HPC system 10 comprising a head node 11 and a possibly large number of computing nodes 12 for executing a job. The computing nodes 12 may be collectively referred to as a “solver” for the simulation task to be run, since simulation models generally involve the solution of many equations. The workstation 1 is just a representative example of a potentially very large number of users of system 10.
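The stage-in and stage-out steps may be pictured, purely as a sketch, as file copies made before and after the job. In the following example, the directory layout, the file names and the helper functions stage and run_with_staging are all assumptions made for illustration, and the "job" is a trivial stand-in that converts text to upper case; an actual system would move data between separate storage systems via their I/O servers.

```python
import shutil
import tempfile
from pathlib import Path


def stage(src: Path, dst_dir: Path) -> Path:
    """Copy one file into dst_dir and return the path of the copy."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, dst_dir / src.name))


def run_with_staging(input_file: Path, temporary: Path, permanent_out: Path) -> Path:
    # Stage-in: permanent storage -> temporary storage.
    staged_in = stage(input_file, temporary / "in")

    # Stand-in for the job itself: read the staged input and write a
    # result into temporary storage.
    result = temporary / "out" / "result.txt"
    result.parent.mkdir(parents=True, exist_ok=True)
    result.write_text(staged_in.read_text().upper())

    # Stage-out: temporary storage -> permanent storage.
    return stage(result, permanent_out)


with tempfile.TemporaryDirectory() as root_name:
    root = Path(root_name)
    permanent = root / "permanent"
    temporary = root / "temporary"
    permanent.mkdir()
    (permanent / "input.txt").write_text("mesh data")

    final = run_with_staging(permanent / "input.txt", temporary, permanent / "results")
    staged_output = final.read_text()

print(staged_output)
```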
Simulation often involves the use of commercially-available models of physical phenomena, called “ISV applications” where ISV stands for Independent Software Vendor. Examples of such ISV applications employed by manufacturing companies and research institutes include MSC, CD-Adapco, and SIMULIA.
The HPC system 10 is linked via the network 5 with a temporary storage 20 equipped with its own I/O server 21, and to permanent storage 30 having an I/O server 31. The temporary storage 20 will normally be faster, of smaller capacity, and more local to the HPC system 10 compared with the permanent storage 30. Different storage technologies may also be involved: for example the temporary storage 20 may be solid-state whilst the permanent storage 30 may employ magnetic disks. Transfer between the two types of storage may be initiated, as indicated by dot-dash lines in FIG. 1, by commands from the head node 11, and managed by the I/O servers 21 and 31.
The head node 11 contains a job scheduler 110 and is responsible for communications with the user workstation 1 and with I/O servers 21 and 31; it may, but need not, also provide one of the computing nodes.
In general, the job scheduler 110 manages both job execution order of a queue of batch jobs, and data staging associated with each batch job. There may be more than one “data staging job” associated with the same batch job: in particular, a first data staging job for stage-in, and a second data staging job for stage-out. Multiple batch jobs may be executed simultaneously, depending on the computing resources demanded, and those available.
Firstly, the job scheduler 110 receives a job execution request from the user workstation 1 as indicated by the dot-dash line extending between the two, and places the job in a queue. The execution request takes the form of a “batch job script” specifying inter alia at least one target data file in the permanent storage 30, where the data for staging can be found. The job scheduler then checks the status of the availability of computing resources among the computing nodes 12. Data staging between the storages 20 and 30, upon command from the job scheduler and managed by the respective I/O servers 21 and 31, is conventionally carried out only after specified computing resources have been allocated to the job in the job scheduler 110. The data flow is indicated by dashed lines at the left-hand side of the Figure, and respective storage areas 22, 32 and 23, 33 may be defined for each of the input (pre-execution) and output (post-execution) phases. Then, only after stage-in of the input data to the temporary storage 20 has completed without an error, the staged-in data can be supplied from temporary storage 20 to the computing nodes 12. The stage-in is performed under instruction from the head node 11, the data being transferred directly from the storage area 22 of the temporary storage 20 via the I/O server 21 to the computing nodes 12 which require the data, as determined by the job scheduler 110. Then, the main job starts to run.
However, the data staging processes carried out before and after solver job execution take a lot of time, leaving computing resources almost idle, which causes low machine efficiency and wastes electrical power. This kind of data staging is called “synchronous” data staging. Because the data staging is carried out after securing the necessary computing resource, the time for I/O processing (stage-in) increases as the target data size increases, which causes inefficient use of the computing resource. The time taken up by I/O processing has to be added on to the time the job is queued awaiting execution. During I/O processing, as already mentioned, the computing nodes are almost idle.
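The cost of synchronous data staging can be made concrete with a simple calculation. The following sketch assumes, for illustration only, that the allocated computing nodes are held for the whole of stage-in, execution and stage-out, and that they do no useful work during the two staging phases. The function name and the durations used are hypothetical.

```python
def synchronous_node_hold(stage_in_s, run_s, stage_out_s):
    """Return (total seconds the nodes are held, fraction of that time
    spent on staging) under the simplifying assumptions stated above."""
    total = stage_in_s + run_s + stage_out_s
    idle_fraction = (stage_in_s + stage_out_s) / total
    return total, idle_fraction


# Hypothetical durations: 10 min stage-in, 60 min execution, 5 min stage-out.
total, idle = synchronous_node_hold(600.0, 3600.0, 300.0)
print(total, idle)
```

With these illustrative figures the nodes are held for 4500 seconds, of which one fifth is spent on staging rather than computation.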
In an attempt to ameliorate the above problem, so-called “asynchronous” data staging has been devised. In asynchronous data staging, before the necessary computing resource is allocated to a submitted job in the job scheduler, data staging is independently carried out in order to reduce inefficient use of computing resources during I/O processing. This allows at least some of the time while the job is queued to be put to use for stage-in of the input data. However, the following problems still exist.
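The difference between the two approaches can be sketched with a simple timing model. The model assumes, for illustration only, that in the asynchronous case stage-in begins as soon as the job is submitted and proceeds in parallel with the wait in the queue, whereas in the synchronous case stage-in begins only once the computing resources have been allocated. The function name and the durations are hypothetical.

```python
def time_to_start(queue_wait_s, stage_in_s, asynchronous):
    """Seconds from job submission until the main job can start."""
    if asynchronous:
        # Stage-in overlaps the queue wait; the job starts when both
        # the resources and the staged-in data are ready.
        return max(queue_wait_s, stage_in_s)
    # Stage-in follows resource allocation, so the two times add up.
    return queue_wait_s + stage_in_s


# Hypothetical figures: 30 min queue wait, 10 min stage-in.
sync_start = time_to_start(1800.0, 600.0, asynchronous=False)
async_start = time_to_start(1800.0, 600.0, asynchronous=True)
print(sync_start, async_start)
```

Under this model the stage-in time is hidden entirely whenever it is shorter than the queue wait, and the computing nodes are no longer held during stage-in.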
Firstly, whilst a user waits for a job to be executed, he or she may wish to change the input data. In the case of asynchronous data staging, because it is difficult for a user to know the start-time of the input data staging, changing the input data during this waiting period is effectively impossible. Secondly, if input data staging is scheduled using a first-in-first-out algorithm, the time taken for data staging increases and the allocation of compute nodes gets backed up as the amount of staging data increases. These problems arise both during stage-in and stage-out.
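The second problem can be illustrated by a sketch in which staging requests are served strictly in order of arrival. The sketch assumes, for illustration only, a single transfer channel of fixed bandwidth serving one request at a time; the function name, the data sizes and the bandwidth are hypothetical.

```python
def fifo_completion_times(sizes_gb, bandwidth_gb_per_s):
    """Completion time in seconds of each staging request when the
    requests are transferred one after another in arrival order."""
    finish = []
    elapsed = 0.0
    for size in sizes_gb:
        elapsed += size / bandwidth_gb_per_s
        finish.append(elapsed)
    return finish


# Hypothetical requests: one large data set followed by two small ones.
times = fifo_completion_times([500.0, 1.0, 2.0], bandwidth_gb_per_s=1.0)
print(times)
```

In this example the two small requests need only 1 and 2 seconds of transfer, yet they complete after 501 and 503 seconds because they queue behind the large one, and the jobs that depend on them are delayed accordingly.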
As the sophistication of simulation models grows, the amount of input and output data is rapidly increasing, with the result that the time taken up by stage-in and stage-out may account for a significant proportion of the overall run-time. Consequently, there is a need for a more efficient data staging mechanism between temporary storage and permanent storage.