1. Technical Field
This application relates to managing software errors in storage systems.
2. Description of Related Art
Computers, computer networks, and other computer-based systems are becoming increasingly important as part of the infrastructure of everyday life. Networks are used for sharing peripherals and files. In such systems, complex components are the most common sources of failure or instability. The proliferation of multiple interacting components leads to problems that are difficult or impossible to predict or prevent. The problems are compounded by the use of networks, which introduce the added complexity of multiple machines interacting in obscure and unforeseen ways.
Additionally, the need for high performance, high capacity information technology systems is driven by several factors. In many industries, critical information technology applications require outstanding levels of service. At the same time, the world is experiencing an information explosion as more and more users demand timely access to a huge and steadily growing mass of data including high quality multimedia content. The users also demand that information technology solutions protect data and perform under harsh conditions with minimal data loss and minimum data unavailability. Computing systems of all types are not only accommodating more data but are also becoming more and more interconnected, raising the amounts of data exchanged at a geometric rate.
To address this demand, modern data storage systems (“storage systems”) are put to a variety of commercial uses. For example, they are coupled with host systems to store data for purposes of product development, and large storage systems are used by financial institutions to store critical data in large databases. For many uses to which such storage systems are put, it is highly important that they be highly reliable so that critical data is not lost or unavailable.
Computer systems may include different resources used by one or more host processors. Resources and host processors in a computer system may be interconnected by one or more communication connections. These resources may include, for example, data storage devices such as those included in the data storage systems manufactured by EMC Corporation. These data storage systems may be coupled to one or more servers or host processors and provide storage services to each host processor. Multiple data storage systems from one or more different vendors may be connected and may provide common data storage for one or more host processors in a computer system.
A host processor may perform a variety of data processing tasks and operations using the data storage system. For example, a host processor may perform basic system I/O operations in connection with data requests, such as data read and write operations.
Host processor systems may store and retrieve data using a storage device containing a plurality of host interface units, disk drives, and disk interface units. The host systems access the storage device through a plurality of channels provided therewith. Host systems provide data and access control information through the channels to the storage device and the storage device provides data to the host systems also through the channels. The host systems do not address the disk drives of the storage device directly, but rather, access what appears to the host systems as a plurality of logical disk units. The logical disk units may or may not correspond to the actual disk drives. Allowing multiple host systems to access the single storage device unit allows the host systems to share data in the device. In order to facilitate sharing of the data on the device, additional software on the data storage systems may also be used.
Such a data storage system typically includes processing circuitry and a set of disk drives (disk drives are also referred to herein as simply “disks” or “drives”). In general, the processing circuitry performs load and store operations on the set of disk drives on behalf of the host devices. In certain data storage systems, the disk drives of the data storage system are distributed among one or more separate disk drive enclosures (disk drive enclosures are also referred to herein as “disk arrays” or “storage arrays”) and processing circuitry serves as a front-end to the disk drive enclosures. The processing circuitry presents the disk drive enclosures to the host device as a single, logical storage location and allows the host device to access the disk drives such that the individual disk drives and disk drive enclosures are transparent to the host device.
Further, disk arrays included in a data storage system may use a variety of storage devices with various characteristics for providing storage to a user. Each disk array may logically operate as a unified storage device. A data storage system may also include one or more storage array processors (SPs), for handling requests for storage allocation and input/output (I/O) requests. A storage processor (SP) in a disk array is the controller for and primary interface to the disk array. Disk arrays are typically used to provide storage space for one or more computer file systems, databases, applications, and the like. For this and other reasons, it is common for disk arrays to be structured into logical partitions of storage space, called logical units (also referred to herein as LUs or LUNs). For example, at LUN creation time, storage system may allocate storage space of various storage devices in a disk array to be presented as a logical volume for use by an external host device. This allows a unified disk array to appear as a collection of separate file systems, network drives, and/or volumes.
Disk arrays may also include groups of physical disks that are logically bound together to represent contiguous data storage space for applications. For example, disk arrays may be divided into redundant array of inexpensive disks (RAID) groups, which are disk arrays created by logically binding individual physical disks together to form the RAID groups. RAID groups represent a logically contiguous address space distributed across a set of physical disks. Each physical disk is subdivided into pieces used to spread the address space of the RAID group across the group (along with parity information if applicable to the RAID level). The physically contiguous pieces of the physical disks that are joined together to create the logically contiguous address space of the RAID group are called stripes. Stripes may form blocks and blocks may be allocated to create logical representations of storage space for use by applications within a data storage system.
As described above, applications access and store data incrementally by use of logical storage array partitions, known as logical units (LUNs). LUNs are made up of collections of storage blocks of a RAID array and are exported from the RAID array for use at the application level. LUNs are managed for use at the application level by paired storage processors (SPs). Ownership of a LUN is determined when the LUN is mounted by the application, with one of the paired SPs designated as the owner SP and the other SP acting as a backup processing device for the owner SP.
Ownership of a LUN may change under a variety of circumstances. For example, ownership of a LUN may migrate from one SP to another SP for host load balancing reasons, for host failover events, for SP failures, and for manual trespass operations initiated by a user at an application level. The term “trespass,” as used herein, refers to a change of ownership of a LUN from one SP to another SP. Host failover is a process by which a storage processor is eliminated as a single point of failure by providing hosts the ability to move the ownership of a LUN from one storage processor to another storage processor.