1. Field of the Invention
This invention relates to data processing in general, and specifically to arrangements for optimizing job scheduling and execution in a distributed computing grid.
2. Related Art
In traditional data processing environments, a set of servers (i.e., computers, such as mainframes, midrange processors, blade servers, and the like) interact with storage (such as disk, tape, or network-attached storage) on a dedicated basis to process compute jobs such as payroll, e-commerce, billing, and so forth. Additional elements, such as firewalls, load balancers, Local Area Networks, Storage Area Networks, and the like, are also typically engaged. For example, servers A, B, and C may be dedicated to a payroll application, servers D, E, and F may be dedicated to customer technical support, and servers G, H, and I may be dedicated to web applications.
A limitation of this conventional approach is that the capacity dedicated to individual applications may be too little or too great at any given time. If the capacity is too great, it means that the owner of this infrastructure may have overpaid for the hardware and software comprising the data processing environment. If it is too little, it means that the application may not meet performance objectives such as the number of simultaneous users supported, throughput, response time and latency, or the like.
An emerging approach is called “grid computing.” Grid computing typically involves a number of geographically dispersed compute nodes. If an application needs to be run, and capacity of the appropriate type and configuration is available at one of the nodes, the job is scheduled to run at that node. A problem with this conventional job scheduling approach is that it ignores network considerations, except for the availability of a basic connectivity path to the node.
Consequently, the inventor has realized that, even though a node may have available capacity, it may not be the best node for the job, because of the total costs involved in moving the application and the data it requires, as well as any additional data, packets, or transactions exchanged as the job runs. These costs include the cost of transport, as well as the cost of delays caused by bandwidth that may be insufficient to move the data to the node on a timely basis. These delay costs can be quantified in terms of user dissatisfaction, regulatory requirements and financial penalties, competitive needs, job deadlines, and so forth.
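By way of illustration only, the following minimal sketch shows how such a total cost might be compared across candidate nodes. It assumes a simple additive model in which total cost is the transport cost plus a delay cost proportional to the transfer time. The names `Node`, `total_cost`, and `pick_node`, and all parameters and figures, are hypothetical and do not appear in any existing scheduler.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical description of a candidate compute node."""
    name: str
    has_capacity: bool            # node can accept the job at all
    bandwidth_mbps: float         # usable bandwidth to the node, megabits/s
    transport_cost_per_gb: float  # assumed price to move one gigabyte


def total_cost(node: Node, data_gb: float, delay_cost_per_hour: float) -> float:
    """Transport cost plus delay cost under the assumed additive model."""
    # 1 GB is taken as 8000 megabits (decimal units).
    transfer_hours = data_gb * 8000 / node.bandwidth_mbps / 3600
    transport = data_gb * node.transport_cost_per_gb
    delay = transfer_hours * delay_cost_per_hour
    return transport + delay


def pick_node(nodes: List[Node], data_gb: float,
              delay_cost_per_hour: float) -> Optional[Node]:
    """Return the node with capacity that has the lowest total cost, if any."""
    candidates = [n for n in nodes if n.has_capacity]
    if not candidates:
        return None
    return min(candidates,
               key=lambda n: total_cost(n, data_gb, delay_cost_per_hour))


if __name__ == "__main__":
    nodes = [
        Node("near", True, bandwidth_mbps=1000, transport_cost_per_gb=0.10),
        Node("far", True, bandwidth_mbps=100, transport_cost_per_gb=0.02),
    ]
    # With a significant delay cost, the faster link is preferred even
    # though its transport price per gigabyte is higher.
    print(pick_node(nodes, data_gb=450, delay_cost_per_hour=50).name)
    # With no delay cost, the cheaper transport is preferred.
    print(pick_node(nodes, data_gb=450, delay_cost_per_hour=0).name)
```

In this sketch both nodes have available capacity, yet the preferred node changes with the cost of delay, which is the consideration that capacity-only scheduling omits.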
Moreover, conventional grid environments typically provision large fixed-bandwidth connections between nodes. For example, several Gigabit Ethernet or even several 10 Gigabit Ethernet connections are used in the TeraGrid backbone sponsored by the National Science Foundation. Much of the time, most of this capacity is unused, and consequently users or firms may overpay for it.
It would be economical to utilize a switched line and pay only for needed service, rather than lease an expensive dedicated but underused line. In this regard, emerging technologies permit bandwidth to be allocated “on demand” on a link or end-to-end basis. Bandwidth on demand (BoD) is sometimes called (or is closely related to) dynamic bandwidth allocation, load balancing, committed information rates, rate shaping, quality of service (QoS) management, traffic management, traffic engineering, bandwidth minimums, bandwidth maximums, and the like. As generally understood and broadly used here, BoD flexibly provides capacity on a link on a temporary basis to accommodate changes in the volume (e.g., packets or megabits per second) or characteristics (e.g., jitter, packet loss) of demand, the capacity being dynamically increased or decreased as specified through a control interface. A typical BoD implementation involves a router (or other network element, such as a switch, optical add/drop multiplexer, and the like) with the capability to perform the bandwidth allocation. Such routers can also be directed to establish, or otherwise support the establishment of, logical links on demand to provide more capacity (subject to the ultimate physical capacity of a link), and then be directed to dissolve the link as the traffic demand subsides. The network element is typically coupled with a higher level entity, such as a software policy management layer, that tells the network element what to do. Various ways are known in the art to implement BoD, but none appears to be linked to or combined with scheduling jobs on nodes on the network.
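The BoD behavior described above can be sketched, purely for illustration, as a link whose allocated capacity is raised or lowered through a control call and is bounded by the physical capacity of the link. The class `BoDLink`, its methods, and the notion of a baseline (minimum) allocation are assumptions made for this sketch and do not correspond to any particular router or vendor interface.

```python
class BoDLink:
    """Hypothetical model of a link with bandwidth on demand."""

    def __init__(self, physical_capacity_mbps: float, baseline_mbps: float):
        if not 0 <= baseline_mbps <= physical_capacity_mbps:
            raise ValueError("baseline must lie within the physical capacity")
        self.physical_capacity_mbps = physical_capacity_mbps
        self.baseline_mbps = baseline_mbps   # assumed bandwidth minimum
        self.allocated_mbps = baseline_mbps  # current allocation

    def request(self, demand_mbps: float) -> float:
        """Adjust the allocation to track demand and return what was granted.

        The grant never exceeds the physical capacity and never falls
        below the assumed baseline.
        """
        capped = min(demand_mbps, self.physical_capacity_mbps)
        self.allocated_mbps = max(self.baseline_mbps, capped)
        return self.allocated_mbps

    def release(self) -> None:
        """Return to the baseline once the traffic demand subsides."""
        self.allocated_mbps = self.baseline_mbps


if __name__ == "__main__":
    link = BoDLink(physical_capacity_mbps=10_000, baseline_mbps=1_000)
    print(link.request(4_000))   # demand fits within the physical capacity
    print(link.request(50_000))  # demand is capped at the physical capacity
    link.release()
    print(link.allocated_mbps)   # allocation is back at the baseline
```

In this sketch the caller of `request` and `release` plays the role of the higher level entity, such as a policy management layer, that directs the network element.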
What is needed in the art is a way for a grid computing environment job scheduler to synergistically interoperate with such network functionality to optimize the overall performance and cost of distributed computing.