1. Field of the Invention
The invention relates to input/output (I/O) tasks and data transfer between storage devices, and more particularly relates to maintaining task prioritization and load balancing of I/O tasks with consistently intermittent failures.
2. Description of the Related Art
The explosion of information created by e-business is making storage a strategic investment priority for companies of all sizes. The nature of e-business requires storage that supports data availability so that employees, customers and trading partners can access the data at any time during any day through reliable, disaster-tolerant systems. In the event of a disaster, high data availability and recovery are essential to maintaining business continuity.
In order to prevent data loss during a disaster, such as a system failure or natural disaster, many companies rely on storage backups. A backup of data may be stored on removable media, such as tapes or writable optical disks. While removable media may be suitable for small companies, large corporations require immense amounts of storage capacity and therefore removable media is not a viable option for data backup. One solution for large corporations is storage servers. Storage servers are typically located on a common business network and configured to share data with nodes on the network. One such implementation is a storage area network (SAN). A SAN is a high-speed subnetwork of shared storage devices. A storage device is a machine that contains a disk or disks for storing data. Additionally, storage servers may be located remotely in order to provide data redundancy in the event of a complete site failure.
Storage servers support many data copy options. One widely used method of data transfer from a primary storage server to a remote storage server is peer-to-peer remote copy (PPRC). The PPRC function is a hardware-based solution for mirroring logical volumes from a primary site (the application site) onto the volumes of a secondary site (the recovery or remote site). PPRC can be managed using a Web browser to interface with the storage server or using commands for selected open systems servers that are supported by the storage server command-line interface. Storage servers also implement other copy functions, such as FlashCopy, PPRC extended distance copy, and extended remote copy.
Copy functions are processed over a variety of transmission media, and storage servers support a corresponding variety of connection interfaces. Such interfaces include Fibre Channel, 2 Gigabit Fibre Channel/FICON™, Ultra SCSI and ESCON®. Typically, multiple paths or channels exist between the primary site and the recovery site. Although multiple paths couple the primary and recovery sites, I/O tasks far outnumber the paths and therefore a task queue is necessary. Not all queued tasks are equally urgent, because certain types of data, such as customer data, are of greater importance than other types. Priority I/O queuing defines levels of prioritization for different types of data. For example, customer data would be assigned a high priority while a background data copy would be assigned a low priority.
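By way of illustration only, priority I/O queuing of the kind described above might be sketched as follows. The three priority levels, the class name TaskQueue, and the task names are assumptions chosen for the example and are not taken from any particular storage server.

```python
import heapq
import itertools

# Hypothetical priority levels; a lower number is served first.
HIGH, MEDIUM, LOW = 0, 1, 2


class TaskQueue:
    """Minimal priority queue for I/O tasks (illustrative only)."""

    def __init__(self):
        self._heap = []
        # Sequence number breaks ties so tasks of equal priority stay FIFO.
        self._seq = itertools.count()

    def put(self, priority, name):
        heapq.heappush(self._heap, (priority, next(self._seq), name))

    def get(self):
        _priority, _seq, name = heapq.heappop(self._heap)
        return name

    def __len__(self):
        return len(self._heap)


def drain_example():
    """Queue four tasks out of order and return them in service order."""
    q = TaskQueue()
    q.put(LOW, "background copy")
    q.put(HIGH, "customer write A")
    q.put(MEDIUM, "metadata sync")
    q.put(HIGH, "customer write B")
    return [q.get() for _ in range(len(q))]


if __name__ == "__main__":
    print(drain_example())
```

In this sketch the two customer writes are dequeued first, in arrival order, followed by the medium priority task and finally the background copy.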
In order to balance I/O tasks across the multiple paths that connect the primary and recovery sites, a prioritization algorithm is implemented. One example of a prioritization algorithm is to dedicate 70% of the resources of the path to high priority tasks, 20% to medium priority, and 10% to low priority. When a task is received at the primary site or server, a path to the recovery site is selected. Typically, the path with the lowest bandwidth usage is selected for the transfer. Once selected, a counter indicating system resource usage is incremented and the I/O task is started. Upon completion, the counter is decremented.
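The selection and counting steps described above may be sketched as follows. This is a simplified model under stated assumptions: each task consumes one unit of a path's resources, the counter is a plain integer, and the names Path, select_path, start_task, complete_task and slots_per_priority are hypothetical.

```python
# Example split from the text: 70% high, 20% medium, 10% low (integer percents).
SHARES = {"high": 70, "medium": 20, "low": 10}


def slots_per_priority(path_capacity):
    """Divide a path's task slots among priority levels by the example split."""
    return {level: path_capacity * pct // 100 for level, pct in SHARES.items()}


class Path:
    """A path between the primary and recovery sites (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.in_use = 0  # counter indicating resource usage on this path


def select_path(paths):
    """Pick the path with the lowest usage; ties go to the earliest path."""
    return min(paths, key=lambda p: p.in_use)


def start_task(paths, cost=1):
    """Select a path, increment its counter, and return the chosen path."""
    path = select_path(paths)
    path.in_use += cost
    return path


def complete_task(path, cost=1):
    """Decrement the counter when the task finishes."""
    path.in_use -= cost


if __name__ == "__main__":
    paths = [Path("A"), Path("B")]
    chosen = [start_task(paths) for _ in range(3)]
    print([p.name for p in chosen])   # paths chosen for three tasks
    complete_task(chosen[0])
    print({p.name: p.in_use for p in paths})
    print(slots_per_priority(10))
```

Because selection depends only on the counter, any event that lowers a path's counter makes that path more attractive to later tasks, whether or not the path is actually usable.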
However, if a port or path fails, all tasks that have selected the failed path will fail. Upon failure, tasks are retried on a different path, but the failed tasks are placed in the queue ahead of tasks already assigned to the new path. Commonly, once a task is retried the counter on the failed path is decremented. Since the path selection process is generally based upon the bandwidth usage of each path, the failed path, whose counter remains low, is seen as the most available for subsequent tasks. A situation may arise where the failed path is selected for all low priority tasks because of its low apparent bandwidth usage, and subsequently the low priority tasks fail and are retried on different paths ahead of high priority tasks on those paths. This defeats the purpose of task prioritization.
What is needed is an apparatus, system, and method that maintain task prioritization and load balancing. Beneficially, such an apparatus, system, and method would maintain the count on a bandwidth counter while a task is retried on a different path in order to prevent subsequent I/O tasks from being processed on the failed path.
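The difference between decrementing the counter on failure and maintaining the count during the retry can be illustrated with a toy simulation. The following is a sketch under simplifying assumptions, not an implementation of any particular system: three paths exist, one of which always fails; every task costs one unit; no task completes during the run; and the names Path, least_loaded and run are hypothetical.

```python
class Path:
    """A path with a usage counter and a health flag (illustrative only)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.in_use = 0


def least_loaded(paths, exclude=None):
    """Return the path with the lowest counter, optionally skipping one path."""
    candidates = [p for p in paths if p is not exclude]
    return min(candidates, key=lambda p: p.in_use)


def run(num_tasks, hold_count_on_failure):
    """Dispatch tasks; return (times the failed path was chosen, final counters)."""
    paths = [Path("A"), Path("B"), Path("C", healthy=False)]
    chosen_failed = 0
    for _ in range(num_tasks):
        path = least_loaded(paths)
        path.in_use += 1
        if not path.healthy:
            chosen_failed += 1
            if not hold_count_on_failure:
                # Common behavior: release the count as soon as the task is retried.
                path.in_use -= 1
            # Retry on a different path; that path's counter is incremented.
            retry = least_loaded(paths, exclude=path)
            retry.in_use += 1
    return chosen_failed, {p.name: p.in_use for p in paths}


if __name__ == "__main__":
    print("decrement on failure:", run(6, hold_count_on_failure=False))
    print("hold count on failure:", run(6, hold_count_on_failure=True))
```

In this model, releasing the count leaves the failed path's counter at zero, so it is chosen for four of the six tasks. Holding the count makes the failed path appear progressively busier, and it is chosen for only two of the six tasks.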