A batch process is a process that starts a desired process at a predetermined timing and then repeatedly applies the desired process to predetermined data.
Distributed parallel processing is a method for processing a large volume of data at high speed by making a plurality of servers (hereinafter also referred to as a computer, a calculation processing apparatus, an information processing apparatus, a calculation processing system, an information processing system or the like) cooperate with each other. Consistent hashing is one example of a method for realizing distributed parallel processing efficiently.
Consistent hashing, or a method using a distributed hash table, is one method for distributing data to each of plural calculation processing nodes (information processing nodes, hereinafter referred to as “nodes”) that a computer includes. For example, according to consistent hashing, hash keys that are used when data is arranged in the nodes are virtually assigned to each node, and specific data is thereby arranged in one of the nodes of a distributed parallel computer.
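The arrangement of data by consistent hashing can be illustrated by the following minimal sketch. The class name `ConsistentHashRing`, the number of virtual keys per node, the node names and the use of MD5 as the hash function are all assumptions made for illustration and are not taken from the description above.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the hash ring (first 8 bytes of MD5)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ConsistentHashRing:
    """Hypothetical ring in which each node is virtually assigned many hash keys."""

    def __init__(self, nodes, virtual_keys=100):
        # Sorted list of (point on the ring, node name).
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(virtual_keys)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, data_key: str) -> str:
        """Return the node owning the first ring point after the data's hash."""
        idx = bisect.bisect_right(self._points, _hash(data_key))
        return self._ring[idx % len(self._ring)][1]


ring = ConsistentHashRing(["node1", "node2", "node3"])
placement = {key: ring.node_for(key) for key in ("X", "Y", "Z")}
```

Under these assumptions, adding a node to the ring moves only the data whose hash falls just before one of the new node's virtual keys; all other data stays on the node that already holds it.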
Meanwhile, a multi-tasking operating system (hereinafter referred to as “OS”) that operates on a distributed parallel computer has, for each process, a function to convert addresses between a virtual memory space and a physical memory space of a main memory apparatus, a function to control paging, a function to manage a memory (hereinafter referred to as “memory managing function”) or the like. The multi-tasking OS realizes the memory managing function by using a memory management unit (hereinafter referred to as “MMU”) that an OS kernel and a processor provide.
The memory managing function manages access from a process to the main memory apparatus (hereinafter abbreviated as “memory”). By virtue of the memory managing function, a programmer can create a program without considering whether the destination of an access from the process to the memory is in the physical memory space or in the virtual memory space.
Meanwhile, garbage collection (hereinafter referred to as “garbage collector” or “GC”) is a mechanism, or a kind of programming technique, for preventing memory leaks. GC reduces the time and effort that a programmer spends on explicitly freeing memory areas, and reduces the load of issuing system calls for reserving and freeing memory areas.
GC detects an unnecessary memory area among the memory areas that the process reserves. Then, GC collects the unnecessary memory area. Consequently, another process can use the memory area that GC collects. As a result, the number of times that the process issues the system call for reserving a memory area and the system call for freeing a memory area is reduced.
The mark and sweep model is one example of a model which realizes GC. The mark and sweep model includes a mark phase and a sweep phase.
In the mark phase, GC checks, for each object in the memory area, whether the process or the like refers to the object. If the process or the like refers to the object, GC marks the object. GC stores the marking result in a predetermined memory area.
In the sweep phase, GC frees a memory area that is assigned to each object not marked in the mark phase (that is, an object to which the process or the like does not refer).
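The two phases can be sketched as follows. This is a simplified illustration and not the implementation of any particular runtime; the class `Obj`, the functions `mark` and `sweep`, and the use of a Python list as the heap are assumptions made for the example.

```python
class Obj:
    """Hypothetical heap object that may refer to other objects."""

    def __init__(self, name):
        self.name = name
        self.refs = []       # objects to which this object refers
        self.marked = False  # marking result of the mark phase


def mark(roots):
    """Mark phase: mark every object reachable from the roots."""
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if not obj.marked:
            obj.marked = True
            stack.extend(obj.refs)


def sweep(heap):
    """Sweep phase: keep marked objects and drop (free) unmarked ones."""
    survivors = [obj for obj in heap if obj.marked]
    for obj in survivors:
        obj.marked = False   # reset the mark for the next GC cycle
    return survivors


a, b, c, d = Obj("a"), Obj("b"), Obj("c"), Obj("d")
a.refs.append(b)             # a -> b is reachable from the root a
c.refs.append(d)
d.refs.append(c)             # c <-> d refer to each other but are unreachable
heap = sweep_input = [a, b, c, d]
mark([a])
heap = sweep(heap)
names = [obj.name for obj in heap]
```

In this example only `a` and `b` survive; `c` and `d` refer to each other but cannot be reached from the root, so they are not marked and are freed in the sweep phase.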
Since processes are executed independently of each other, under GC as described above, memory areas to which no process or the like refers arise in fragments within the memory space. In some cases, the memory managing function includes a function to defragment the memory areas that are fragmented after the sweep phase.
In the case of a programming language such as Java (registered trademark), .NET (hereinafter referred to as “.Net”) or the like, an execution environment that provides the memory managing function, such as the Java Virtual Machine (hereinafter referred to as “JVM”), the .NET Framework runtime or the like, has a function to carry out GC. For example, a GC function in JVM monitors the heap memory. On the basis of the mark and sweep model mentioned above, GC collects, out of the heap memory that is assigned to JVM, the memory area to which the process or the like does not refer. Afterward, the memory managing function in JVM defragments the fragmented memory area.
In the description above, the process of collecting a memory area does not always return management of the memory area assigned to JVM to the OS. For example, in a memory management model that uses the malloc function to assign a memory area and the free function to free the assigned memory area, management of a memory area that the process frees is not necessarily returned to the OS. Under this memory management model, the malloc function assigns a memory area to the process by reusing a memory area that was freed by the free function.
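The reuse of freed memory areas within a process can be illustrated by the following simplified simulation. It does not reproduce the behavior of any actual malloc implementation; the class `PoolAllocator`, its counter `os_requests` and the representation of a block as an `(id, size)` tuple are assumptions made for the example.

```python
class PoolAllocator:
    """Hypothetical allocator that keeps freed blocks instead of returning them to the OS."""

    def __init__(self):
        self.os_requests = 0  # times memory was requested from the (simulated) OS
        self._free = {}       # block size -> list of freed blocks of that size
        self._next_id = 0

    def malloc(self, size):
        """Reuse a freed block of the same size if one exists; otherwise ask the OS."""
        free_blocks = self._free.get(size)
        if free_blocks:
            return free_blocks.pop()
        self.os_requests += 1
        self._next_id += 1
        return (self._next_id, size)

    def free(self, block):
        """Keep the block under the allocator's own management for later reuse."""
        self._free.setdefault(block[1], []).append(block)


allocator = PoolAllocator()
first = allocator.malloc(64)
allocator.free(first)
second = allocator.malloc(64)   # served from the freed block, not from the OS
```

Here the second request is satisfied by the block that was freed, so the simulated OS is asked for memory only once.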
Next, a method for realizing the batch process based on distributed parallel processing will be described.
A batch executing base holds, in advance, control information such as the date and time at which the batch process is to be executed. The batch executing base starts the batch process at the predetermined date and time according to the control information. In addition to the method mentioned above, the batch executing base may also execute the batch process upon an instruction from a client.
A process which is executed by a distributed batch executing base 30 will be described with reference to FIG. 15. FIG. 15 is a block diagram showing a system configuration of the distributed batch executing base 30 which is related to the present invention.
A distributed parallel system 32 includes the distributed batch executing base 30 and a distributed data store 31. The distributed batch executing base 30 has a function to process at least one job. The distributed batch executing base 30 processes a plurality of jobs in parallel or in pseudo-parallel. It is also possible that the distributed batch executing base 30 is realized by a plurality of computers which are connected to each other through a communication network (hereinafter, abbreviated as “network”).
A batch execution managing mechanism unit 34 of each node has a function by which the node itself processes the job (hereinafter referred to as “batch executing function”). In the distributed parallel system 32, the batch execution managing mechanism units 34 of the nodes share configuration information and information on computer resources. By the nodes communicating the configuration information or the like with each other, the batch execution managing mechanism units 34 control the batch processing function as a whole.
By communicating with the batch execution managing mechanism unit 34 of the distributed batch executing base 30, a job control unit 35 controls the job from the start to the end of its execution. The job control unit 35 controls the job with reference to a job repository 38. The job repository 38 can store, in association with each other, information on control of executing the job, information on a history of executing the job and information on a state of executing the job.
A batch application includes a definition of at least one job and a batch program that is executed in the job. The job definition includes a definition of the content of the batch process and a definition of the data to be processed. Moreover, the batch program includes a method for arranging the data to be processed in the distributed parallel system 32. The method for arranging the data defines an arrangement of the data that reduces the overhead caused when data is exchanged in the distributed data store 31. The job definition need not always include the batch program and the information on the method for arranging the data.
For example, the job definition includes a definition of a step, which indicates a part of the processing of the job, the order in which the steps are executed, the data to be processed by the job, a path name that indicates the storage area of the data, information on the format of the data, information on properties of the data and the like.
The job definition may include a pre-process, a post-process and the like for each step. The step definition mentioned above may include information on a policy (processing method) for the distributed batch executing function and a policy (processing method) for a synchronous process that is executed after the distributed parallel process. The job definition need not always include all the items mentioned above.
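The items that a job definition may contain can be pictured as a data structure such as the following sketch. The class names, field names and example values are hypothetical and serve only to organize the items listed above; every optional item is given a default so that it may be omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class StepDefinition:
    """Hypothetical definition of one step, a part of the processing of a job."""
    name: str
    pre_process: Optional[str] = None             # executed before the step
    post_process: Optional[str] = None            # executed after the step
    distribution_policy: Optional[str] = None     # policy for distributed batch execution
    synchronization_policy: Optional[str] = None  # policy for the subsequent synchronous process


@dataclass
class JobDefinition:
    """Hypothetical job definition; the list order of steps is the execution order."""
    name: str
    steps: List[StepDefinition] = field(default_factory=list)
    data_path: Optional[str] = None               # path name of the storage area of the data
    data_format: Optional[str] = None
    data_properties: Dict[str, str] = field(default_factory=dict)


job = JobDefinition(
    name="example_job",
    steps=[
        StepDefinition(name="read", pre_process="validate_input"),
        StepDefinition(name="aggregate", distribution_policy="by_key"),
    ],
    data_path="/example/input",
    data_format="csv",
)
step_order = [step.name for step in job.steps]
```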
The batch execution managing mechanism unit 34 arranges the batch application in a plurality of the distributed batch executing bases 30 through a management interface 36. An application managing unit 33 manages the batch application arranged by the batch execution managing mechanism unit 34 by using an application repository 37. The application repository 37 stores the batch application and management information on the batch application (that is, records in which the person who arranged the application, the time when the application was arranged, the batch application that was selected, and the classification of the settings for arranging the application are associated with each other). A plurality of batch applications may exist in the application repository 37.
Furthermore, the application managing unit 33 may include a function to analyze the batch application and a function to check its validity.
The job is a batch processing program that can execute the batch application in the distributed batch executing base 30. The job may include a plurality of processes in one step.
Next, a method for realizing the distributed data store will be described.
The step in the job defines a reading process, a writing process, an updating process, a deleting process and the like that are executed on data of the distributed data store 31 through an input/output interface of the distributed data store 31.
The distributed data store 31 includes at least one data store in a plurality of computers which are connected to each other through the network. The data of the distributed data store 31 is associated with metadata. For example, the metadata includes information on a key which is necessary to access data, information on a storage location at which data is stored, access information which indicates how the data is used, and the like. The at least one node included in the distributed data store 31 shares the metadata. As a result, a client can access data held by a local node or a remote node through the input/output interface without considering which node stores the data.
A data managing unit 39 manages the metadata associated with the data that the distributed data store 31 of the local node stores. A process in which the data managing unit 39 manages the metadata will be described below with reference to FIG. 16. FIG. 16 is a conceptual diagram, related to the present invention, showing an example of the metadata that the data managing unit 39 manages. The metadata associates the data that the metadata indicates, information on the arrangement of the data, and the access information that is referred to when the data is accessed.
The information on the arrangement of the data includes information on a master node which has original data, and a copy node which has a copy of the original data. For example, the access information includes information on “priority” which indicates a degree of priority, “count” which indicates the number of times the data is referred to, and “time” which indicates the length of time for processing the data. For example, in FIG. 16, a node “2” has data which “Y” indicates, and a node “1” has a copy of the data. The priority of the data is “Mid.” (that is, middle), the data has been referred to one hundred times, and the time for processing the data is “long”.
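The metadata entry described for FIG. 16 can be represented, for example, by the following sketch. The class `Metadata`, the dictionary `catalog` and the helper functions are hypothetical; only the field values for the data “Y” are taken from the description above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Metadata:
    """Hypothetical metadata record: arrangement information plus access information."""
    key: str                # key necessary to access the data
    master_node: int        # node that has the original data
    copy_nodes: List[int]   # nodes that have a copy of the original data
    priority: str           # degree of priority
    count: int              # number of times the data has been referred to
    time: str               # length of time for processing the data


# Entry corresponding to the data "Y" described for FIG. 16.
catalog: Dict[str, Metadata] = {
    "Y": Metadata(key="Y", master_node=2, copy_nodes=[1],
                  priority="Mid.", count=100, time="long"),
}


def nodes_holding(key: str) -> List[int]:
    """Return the master node followed by the copy nodes for the given key."""
    meta = catalog[key]
    return [meta.master_node] + meta.copy_nodes


def record_access(key: str) -> None:
    """Update the access information each time the data is referred to."""
    catalog[key].count += 1


holders = nodes_holding("Y")
record_access("Y")
```

Because every node shares such records, a client needs only the key to find out which nodes hold the data.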
The data managing unit 39 cooperates (interlocks) with the data managing unit 39 of another node (hereinafter also referred to as “remote node”) in the distributed parallel system. For this reason, a client can access the data through the input/output interface of the distributed data store without considering the node in which the data exists. For example, JVM has a function related to the data managing unit 39 mentioned above.
The distributed data store 31 will be described. The distributed data store 31 stores data which is processed in the batch process. For example, the distributed data store 31 includes computer resources, file systems, databases, and data managing software such as an on-memory type data store of the local node, as well as computer resources of other nodes such as a hard disk and a memory (hereinafter referred to as “main storage apparatus” or “main memory”). A client can process the data without depending on the storage location at which the data is stored. Hereinafter, it is assumed that the distributed data store 31 also includes a data store which is realized in one calculation processing system.
An on-memory type data store is a data store whose storage location of data is a memory or the like. Moreover, a disk type data store is a data store whose storage location of data is a hard disk or the like. Data is processed at a higher speed in the on-memory type data store than in the disk type data store.
Furthermore, in the distributed data store 31, the data stores of plural computers cooperate with each other through the network. Therefore, a client can handle the distributed data store 31 as a single data store.
A system disclosed in patent document 1 estimates the time consumed for executing a job based on job characteristics and the number of input data items, and estimates the load that will be placed on each server within the range of the estimated time. The system selects a server to execute the job based on the estimated load. By equalizing the load on the servers, the system reduces the time required for processing the job.
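The general idea of selecting a server from an estimated load can be sketched as follows. This is not the method of patent document 1, whose details are not given here; the linear time estimate, the greedy least-loaded selection, and all names and figures are assumptions made for illustration.

```python
def estimate_time(cost_per_item, num_items):
    """Hypothetical estimate: time grows linearly with the number of input data items."""
    return cost_per_item * num_items


def assign_jobs(jobs, servers):
    """Assign each job to the server whose estimated load is currently lowest."""
    load = {server: 0.0 for server in servers}
    assignment = {}
    for name, cost_per_item, num_items in jobs:
        target = min(servers, key=lambda server: load[server])
        assignment[name] = target
        load[target] += estimate_time(cost_per_item, num_items)
    return assignment, load


# (job name, assumed cost per input item, number of input items)
jobs = [("job_a", 0.5, 100), ("job_b", 0.25, 100), ("job_c", 0.25, 120)]
assignment, load = assign_jobs(jobs, ["server1", "server2"])
```

With these figures, `job_a` goes to `server1` and the two lighter jobs go to `server2`, leaving estimated loads of 50.0 and 55.0, which are closer to equal than any assignment that places `job_a` together with another job.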
According to a GC method disclosed in patent document 2, an unnecessary memory area in a loop is collected according to the state of the data that a pointer designates.
A device disclosed in patent document 3 creates, in GC, a profile that indicates the state of use of a memory area, and estimates, based on the profile, the possibility that the memory area will run short.