1. Field of the Invention
The present invention relates, generally, to the scheduling of tasks in non-trivial processing systems and, in one embodiment, to the prerequisite-based scheduling of tasks wherein scheduling decisions are made based on the priority of tasks and/or the presence or absence of resources needed for a particular task to execute.
2. Description of Related Art
Hardware context. In modern, non-trivial processing systems, the operating system creates a new task when a program is to be executed by a processor. Although many tasks may be created, only one task may have access to the processor at any one time. Schedulers are therefore required to identify the next task to dispatch from a list of potentially dispatchable tasks.
A host bus adapter (HBA) is one example of a non-trivial processing system in which schedulers play a vital role. HBAs are well-known peripheral devices that handle data input/output (I/O) operations for host devices and systems (e.g., servers). A HBA provides I/O processing and physical connectivity between a host device and external data storage devices. The storage may be connected using a variety of known “direct attached” or storage networking protocols, including but not limited to fibre channel (FC), internet Small Computer System Interface (iSCSI), Serial Attached SCSI (SAS) and Serial ATA (SATA). HBAs provide critical server central processing unit (CPU) off-load, freeing servers to perform application processing. HBAs also provide a critical link between storage area networks (SANs) and the operating system and application software residing within the server. In this role, the HBA enables a range of high-availability and storage management capabilities, including load balancing, SAN administration, and storage management.
FIG. 1 illustrates an exemplary block diagram of a conventional host system 100 including a HBA 102. The host system 100 includes a conventional host server 104 that executes application programs 106 in accordance with an operating system program 108. The server 104 also includes necessary driver software 110 for communicating with peripheral devices. The server 104 further includes conventional hardware components 112 such as a CPU and host memory such as read-only memory (ROM), hard disk storage, random access memory (RAM), cache and the like, which are well known in the art. The server 104 communicates via a host bus (such as a peripheral component interconnect (PCI or PCIX) bus) 114 with the HBA 102, which handles the I/O operations for transmitting and receiving data to and from remote storage devices 116 via a storage area network (SAN) 118.
In order to further meet the increasing demands of I/O processing applications, multi-processor HBA architectures have been developed to provide multi-channel and/or parallel processing capability, thereby increasing the processing power and speed of HBAs. These multiple processors may be located within the controller chip. FIG. 2 illustrates an exemplary block diagram of a HBA 200 including a multi-processor interface controller chip 202. The interface controller chip 202 controls the transfer of data between devices connected to a host bus 204 and one or more storage devices in one or more SANs. In the example embodiment illustrated in FIG. 2, the controller chip 202 supports up to two channels A and B, and is divided into three general areas, one area 232 for channel A specific logic, another area 206 for channel B specific logic, and a third area 208 for logic common to both channels.
Each channel on the controller chip 202 includes a serializer/deserializer (SerDes) 210 and a protocol core engine (PCENG) 212 coupled to the SerDes 210. Each SerDes 210 provides a port or link 214 to a storage area network. These links may be connected to the same or different storage area networks. The PCENG 212 may be specific to a particular protocol (e.g., FC), and is controlled by a processor 216, which is coupled to tightly coupled memory (TCM) 218 and cache 220. Interface circuitry 222 specific to each channel and interface circuitry 224 common to both channels couple the processor 216 to the host (e.g., PCI/PCIX) bus 204 and to devices external to the controller chip 202 such as flash memory 226 or quad data rate (QDR) SRAM 228.
When data is transferred from a device on the host bus 204 to a storage device on the link 214, the data is first placed in the QDR SRAM 228 under the control of the processor 216 that controls the link. Next, the data is transferred from the QDR SRAM 228 to the link 214 via the common interface circuitry 224 and channel-specific interface circuitry 222, PCENG 212 and SerDes 210 under the control of the processor 216. Similarly, when data is transferred from the link to the device on the host bus 204, the data is first transferred into the QDR SRAM 228 before being transferred to the device on the host bus.
Messages and tasks. A HBA receives messages to be communicated between devices connected to the host bus and devices in the SAN, and messages destined for elements within the HBA. The messages are processed by one or more tasks within the HBA. For example, the HBA may receive control commands from a device on the host bus, translate the commands into control messages, and process the control messages within the HBA to perform a particular function. In another example, the host interface may receive data commands from a device on the host bus, translate the commands into data messages, and send these data messages to an I/O interface such as one of the PCENGs for further transmission to an external target.
A message is therefore a generic construct, an encapsulator for transport through the system. A message is a communication mechanism that can communicate a state, data, an indication of an event, or information to be passed between one or more tasks. Messages are meaningful to the HBA architecture because they can modify the system state.
FIG. 3 is an exemplary task flow diagram presented for purposes of illustration only. In the task flow diagram of FIG. 3, a message 300 may initially be placed in a Port Queue 302, such as one found in a HBA. A Proto Router 304 reads the message from the Port Queue 302 and sends the message either to the SAS Cmd Queue 306 or the SMP Cmd Queue 308.
Two additional tasks consume messages from the SAS or SMP Cmd Queues 306 and 308, the SAS Handler 310 and SMP Handler 312, respectively. Once the appropriate task is executed, the results are placed in the Phy Queue 314, which is then read by the Phy Layer task 316.
Note that FIG. 3 could have been serialized to employ one SAS/SMP Cmd Queue and one SAS/SMP Handler, but by splitting the processing into two parallel paths, rules or priorities can be applied differently to the task in each path. In the example of FIG. 3, the SAS Handler task 310 is the higher priority task (indicated by priority path 318), and the SMP Handler task 312 is the lower priority task. In other words, in the example of FIG. 3, SAS messages have a higher priority than SMP messages. By assigning priorities, SAS messages placed in the Port Queue 302 will be sent to the SAS Cmd Queue 306, then to the Phy Queue 314, and then to the Phy Layer 316 ahead of SMP messages, while SMP messages having a lower priority will be processed when appropriate.
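The task flow of FIG. 3 can be illustrated with a minimal sketch. The message format (a protocol tag plus a payload), the function names, and the simplification of routing every message before any handler runs are assumptions made for illustration only; they are not taken from the figure.

```python
from collections import deque

# Hypothetical message format: (protocol, payload). Queue names mirror FIG. 3.
port_queue = deque([("SMP", "m1"), ("SAS", "m2"), ("SMP", "m3"), ("SAS", "m4")])
sas_cmd_queue, smp_cmd_queue, phy_queue = deque(), deque(), deque()

def proto_router():
    """Move one message from the Port Queue to the matching Cmd Queue."""
    protocol, payload = port_queue.popleft()
    target = sas_cmd_queue if protocol == "SAS" else smp_cmd_queue
    target.append((protocol, payload))

def handler(cmd_queue):
    """Stand-in for the SAS or SMP Handler: forward one command to the Phy Queue."""
    phy_queue.append(cmd_queue.popleft())

# Simplification: route everything first, then drain the handlers by priority.
while port_queue:
    proto_router()
while sas_cmd_queue or smp_cmd_queue:
    # The SAS Handler is the higher priority task, so it runs whenever it has work.
    handler(sas_cmd_queue if sas_cmd_queue else smp_cmd_queue)

phy_order = [payload for _, payload in phy_queue]
print(phy_order)  # SAS payloads reach the Phy Queue ahead of SMP payloads
```

Because the two protocols travel on separate paths, the priority rule is a single comparison in the dispatch loop; with one combined queue, the handler would have to reorder messages itself.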
As is evident in FIG. 3, the Proto Router task 304 requires that a message have been placed in the Port Queue 302 before it can execute. The Proto Router task 304 also requires that space be available in the SAS Cmd Queue 306 and the SMP Cmd Queue 308 to be schedulable. Note that both the SAS Cmd Queue 306 and the SMP Cmd Queue 308 must be available in order to ensure that the processed message can be sent downstream regardless of the protocol of the message.
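The prerequisites of the Proto Router task can be expressed as a single predicate. The following is a sketch only; the `BoundedQueue` class, its capacity values, and the function name are hypothetical and chosen for illustration.

```python
from collections import deque

class BoundedQueue:
    """Fixed-capacity FIFO; the capacity is an illustrative assumption."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def has_message(self):
        return bool(self.items)

    def has_space(self):
        return len(self.items) < self.capacity

def proto_router_schedulable(port_q, sas_cmd_q, smp_cmd_q):
    """True only if an input message exists AND both output queues have space."""
    return port_q.has_message() and sas_cmd_q.has_space() and smp_cmd_q.has_space()

port_q, sas_cmd_q, smp_cmd_q = BoundedQueue(4), BoundedQueue(1), BoundedQueue(1)
port_q.items.append(("SMP", "m1"))
ready_with_space = proto_router_schedulable(port_q, sas_cmd_q, smp_cmd_q)

sas_cmd_q.items.append(("SAS", "m0"))  # the SAS Cmd Queue is now full
ready_when_sas_full = proto_router_schedulable(port_q, sas_cmd_q, smp_cmd_q)
```

In the second check the waiting message happens to be an SMP message, yet the task is still not schedulable: the protocol is unknown until the router reads the message, so space must be available in both queues.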
Scheduling of tasks. FIG. 3 illustrates that the processing of a message may involve the execution of multiple tasks. The purpose of a scheduler is to identify the next task to launch or dispatch from a list of other potentially dispatchable tasks, because not all tasks are immediately schedulable. For example, if a task required the presence of a message in a message queue, the task would be blocked until a message appeared in the message queue.
Several conventional scheduling algorithms are known in the art. In preemptive scheduling, tasks can be preempted to allow other tasks to run when a higher priority task is released by an interrupt raised due to a system event. This preemption is called a context switch, as the context for the current task must be saved off and the context of a new task migrated into the processor. Preemption ensures that the highest priority task is always executing. However, the penalty is increased overhead and a loss of efficiency created by the context switching.
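Preemption can be illustrated with a tick-based simulation. This is a minimal sketch under stated assumptions: the tuple layout, the convention that a lower number means a higher priority, and the one-tick time unit are all hypothetical.

```python
def run_preemptive(tasks):
    """Simulate fixed-priority preemptive scheduling.

    tasks: list of (name, priority, release_tick, work_ticks);
    a lower priority number means a higher priority (assumed convention).
    Returns the per-tick timeline and the number of preemptions.
    """
    remaining = {name: work for name, _, _, work in tasks}
    timeline, preemptions, current, tick = [], 0, None, 0
    while any(remaining.values()):
        ready = [(prio, name) for name, prio, release, _ in tasks
                 if release <= tick and remaining[name] > 0]
        if ready:
            _, chosen = min(ready)  # highest priority released task
            if current is not None and chosen != current and remaining[current] > 0:
                preemptions += 1  # the running task is switched out unfinished
            current = chosen
            remaining[chosen] -= 1
            timeline.append(chosen)
        else:
            timeline.append(None)  # nothing released yet: the processor idles
        tick += 1
    return timeline, preemptions

timeline, preemptions = run_preemptive([("low", 2, 0, 3), ("high", 1, 1, 2)])
print(timeline, preemptions)
```

In this run the low priority task starts first, is switched out when the high priority task is released at tick 1, and resumes only after the high priority task finishes. Each such switch requires a context save and restore, which is the overhead described above.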
Most conventional schedulers look at the resources needed by a task in a one-dimensional manner. If there are multiple tasks ready to run, the scheduler uses a policy to determine the order in which tasks are dispatched. Typically, the order is determined according to a priority scheme, but conventional priority schemes do not take into account all the resources that are required for a task to run to completion. Because conventional schedulers do not account for all the resources that a task may need to fully execute, a dispatched task becomes blocked when a required resource is unavailable. Once blocked, the task must revert to a “pending state” and give up the processor which it had been granted. Note that the task is not reset; it is just paused (having yielded the CPU) and waiting for the resource to become available. When a task becomes blocked, the scheduler must run again and dispatch another task that is in a “ready queue” of other potentially dispatchable tasks. Note that tasks that appear to require no other resources except the processor to run are known as being in the “ready state” and are placed in the ready queue. However, even the newly dispatched task may become blocked, because tasks in the ready state may eventually need a resource that is not available. Eventually, the first blocked task may become unblocked when the required resource becomes available. For example, one of the tasks that was dispatched while the first task was in the pending state may have created the resource needed by the first task.
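The dispatch, block, and re-dispatch cycle described above can be sketched as follows. The task tuples, the resource names, and the rule that an unblocked task rejoins the back of the ready queue are illustrative assumptions, not details of any particular conventional scheduler.

```python
from collections import deque

def run_conventional(tasks, available=()):
    """Dispatch tasks in priority order without checking their resources.

    tasks: list of (name, needs, produces), highest priority first;
    `needs` and `produces` are hypothetical resource names or None.
    Returns a log of what happened on each dispatch.
    """
    ready, pending, log = deque(tasks), [], []
    available = set(available)
    while ready:
        name, needs, produces = ready.popleft()
        if needs is not None and needs not in available:
            # The task was granted the processor but must yield it again.
            log.append(f"{name}: blocked on {needs}")
            pending.append((name, needs, produces))
            continue
        log.append(f"{name}: ran")
        if produces is not None:
            available.add(produces)
            still_pending = []
            for task in pending:
                # Tasks whose resource now exists return to the ready queue.
                (ready if task[1] in available else still_pending).append(task)
            pending = still_pending
    return log

log = run_conventional([("A", "msg", None), ("B", None, "msg")])
print(log)
```

Task A is dispatched first because of its priority, blocks immediately, and completes only after task B has created the message it needs. The first dispatch of A accomplishes nothing except a pair of context switches.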
Because conventional schedulers do not take into account the additional dimension of the effect of resources other than the CPU resource, programmers must either write application code to ensure that resources are available prior to the dispatch of a task, or accept the overhead of intermediate blocking and the associated context switches, which degrades throughput.
Another conventional scheduling method is task-level polling. Task-level polling is a mechanism whereby tasks are scheduled for execution and attempt to do as much work as possible in a state-machine architecture. If resources are not available for the task, the task is rescheduled and tried again later. A disadvantage of this architecture is that polling wastes time (CPU resources) that could be used for more productive work.
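The cost of task-level polling can be made concrete with a small state-machine sketch. The two-step task, its state names, and the resource names `msg` and `space` are hypothetical and serve only to illustrate the mechanism.

```python
def polling_task(state, resources):
    """One scheduling pass of a hypothetical two-step state machine.

    The task advances as far as the available resources allow and
    returns its new state; the state is unchanged if nothing could be done.
    """
    if state == "WAIT_MSG" and "msg" in resources:
        state = "WAIT_SPACE"
    if state == "WAIT_SPACE" and "space" in resources:
        state = "DONE"
    return state

def run_polling(arrivals):
    """arrivals[i] is the set of resources that appear before pass i."""
    resources, state, passes, wasted = set(), "WAIT_MSG", 0, 0
    for new_resources in arrivals:
        resources |= new_resources
        before = state
        state = polling_task(state, resources)
        passes += 1
        if state == before:
            wasted += 1  # the task was scheduled but made no progress
        if state == "DONE":
            break
    return state, passes, wasted

result = run_polling([set(), set(), {"msg"}, set(), {"space"}])
print(result)
```

In this run the task is scheduled five times but makes progress on only two of those passes; the other three consume processor time without doing useful work.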
Thus, a need exists for a scheduler that eliminates intermediate blocking between releases and reduces or removes task-level polling altogether. Removing the possibility of blocking means that common issues such as priority inversion will not occur, leading to better system performance.