1. Field of the Invention
The present invention relates to data storage systems, and in particular, to a method and apparatus for utilizing a number of cache storage nodes in a cluster storage subsystem.
2. Description of the Related Art
The ability to manage massive amounts of information in large scale databases has become of increasing importance in recent years. Increasingly, data analysts are faced with ever larger data sets, some of which measure in gigabytes or even terabytes. To access the large amount of data, two or more systems that work together may be clustered. Clustering generally refers to multiple computer systems or nodes (that comprise a central processing unit (CPU), memory, and adapter) that are linked together in order to handle variable workloads or to provide continued operation in the event one computer system or node fails. Clustering also provides a way to improve throughput performance through proper load balancing techniques. Each node in a cluster may be a multiprocessor system itself. For example, a cluster of four nodes, each with four CPUs, would provide a total of 16 CPUs processing simultaneously. Practical applications of clustering include unsupervised classification and taxonomy generation, nearest neighbor searching, scientific discovery, vector quantization, time series analysis, multidimensional visualization, and text analysis and navigation. Further, many practical applications are write-intensive with a high amount of transaction processing. Such applications include fraud determination in credit card processing or investment house account updating.
In a clustered environment, the data may be distributed across multiple nodes that communicate with each other. Each node maintains a data storage device, processor, etc. to manage and access a portion of the data that may or may not be shared. When a device is shared, all the nodes can access the shared device. However, such a distributed system requires a mechanism for managing the data across the system and communicating between the nodes.
In order to increase data delivery and access for the nodes, cache may be utilized. Cache provides a mechanism to store frequently used data in a location that is more quickly accessed. Cache speeds up data transfer and may be either temporary or permanent. Memory and disk caches are utilized in most computers to speed up instruction execution and data retrieval. These temporary caches serve as staging areas, and their contents can be changed in seconds or milliseconds.
In the prior art, a mainframe or centralized storage model provides for a single global cache for a storage cluster. Such a model provides a single pipeline into a disk drive. Having data in one central location is easier to manage. However, to share data stored in a centralized location, multiple copies of the data must be made.
In another prior art model, the disk is separated from its controller and a storage area network (SAN) is utilized to store the global cache. In a SAN, a back-end network connects multiple storage devices via peripheral channels such as SCSI (small computer system interface), SSA (serial storage architecture), ESCON (enterprise systems connection), and Fibre Channel. A centralized SAN ties multiple nodes into a single storage system that is a RAID (redundant array of independent disks) device with large amounts of cache and redundant power supplies. A centralized storage topology, wherein data is stored in one central location, is commonly employed to tie a server cluster together for failover and to provide better overall throughput. In addition, some storage systems can copy data for testing, routine backup, and transfer between databases without burdening the hosts they serve.
In a decentralized SAN, multiple hosts are connected to multiple storage systems to create a distributed system.
In both decentralized and centralized SAN systems, nodes can be added, and data can be scaled and managed better because the data does not have to be replicated.
Typically, in the prior art, there are two nodes in SAN storage products. Such storage products are referred to as "active-passive": one node in the storage product is active and one is passive. When utilizing a passive node, there are no input/output (I/O) operations between the nodes unless requested (i.e., the node is passive). Such a request is primarily invoked when there is an error on the node the user is currently communicating with and recovery is required. Further, I/O can only occur in one direction, up/down the active channel. Such one-way communication results in the inability to share information. Thus, with an active-passive storage product, the lack of active bi-directional communication between the nodes slows performance.
Storage subsystems, such as a storage cluster, are widely used to serve "shared" data in an enterprise computing environment that has high performance, fault tolerance, and storage capacity requirements. As described above, in a prior art clustered environment, one or more nodes are used to access data. However, the prior art systems do not provide high availability for cache coherency in a distributed cache environment. Accordingly, what is needed is a storage system and synchronization sequence method for providing high availability cache coherency in a distributed cache environment for a storage cluster. Further, what is needed is a storage system that provides fault tolerance, the minimization of the number of messages exchanged between nodes, the ability to maintain a single modified image of the data (with multiple copies), and the ability to maintain both local and remote locking states.
To address the requirements described above, the present invention discloses a method, apparatus, article of manufacture, and a memory structure that provide high availability cache coherency in a distributed cache environment for a storage cluster.
A number of hosts communicate with a storage cluster that manages data access to/from an array of disks. The storage cluster includes a number of nodes. Each node in a cluster may either be an owner of an extent (referred to as an extent owner node) or a client for a given extent (referred to as an extent client node), but each node can be the owner of one extent and the client for another extent simultaneously. The extent owner node controls requests to disk for the extent and the locking and demotion of data in the extent. The extent client node may access the data but does not control destaging or locking of the extent.
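The owner/client roles described above can be illustrated with a minimal sketch. It assumes a simple in-memory model; the names `Node`, `owned_extents`, `role_for`, and `may_control_destage` are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of a storage cluster node (names are assumptions)."""
    name: str
    owned_extents: set = field(default_factory=set)

    def role_for(self, extent_id: int) -> str:
        # A node is the owner of the extents assigned to it and a client
        # for every other extent, so both roles can hold at the same time.
        return "owner" if extent_id in self.owned_extents else "client"

    def may_control_destage(self, extent_id: int) -> bool:
        # Only the extent owner controls disk requests, locking, and demotion.
        return self.role_for(extent_id) == "owner"


node_a = Node("A", owned_extents={0})
node_b = Node("B", owned_extents={1})
```

In this sketch, node A is the owner of extent 0 and a client for extent 1, while node B holds the opposite roles.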
Generally, an extent client node must wait for the proper lock state granted for the extent from the extent owner node prior to completing any I/O request (i.e., the extent client node must wait for permission from the extent owner node). However, one or more embodiments of the invention provide various exceptions to this general rule that increase performance. An extent client node is allowed to get a "head-start" to receive data from a host for a write request and later let the extent owner node sort out the proper order for cache coherency. In another exception, an extent client node is not required to request a lock state change from the extent owner node when a read cache hit occurs in the extent client node. Further, when a read miss occurs in an extent client node, the extent client node can initiate a stage request from disk and request the owner for a lock state change at a later time (but prior to giving the requested data back to the host). In other words, nodes don't need to obtain permission for reading data prior to starting to pull data up from a disk.
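The two read exceptions can be sketched as follows. This is a simplified, synchronous illustration under stated assumptions: `OwnerStub`, `ExtentClient`, `request_lock`, and the `"shared"` lock mode are hypothetical names, the disk is a plain dictionary, and a real system would overlap the stage with the lock request rather than run them one after the other.

```python
class OwnerStub:
    """Hypothetical stand-in for the extent owner node; grants every request."""

    def __init__(self):
        self.lock_requests = []

    def request_lock(self, client_name, extent_id, mode):
        self.lock_requests.append((client_name, extent_id, mode))
        return True


class ExtentClient:
    """Hypothetical extent client node with a local read cache."""

    def __init__(self, name, owner, disk):
        self.name = name
        self.owner = owner
        self.disk = disk      # assumed: dict mapping (extent, block) -> data
        self.cache = {}
        self.log = []         # records the order of steps for illustration

    def read(self, extent_id, block):
        key = (extent_id, block)
        if key in self.cache:
            # Read hit: no lock state change is requested from the owner.
            self.log.append("hit")
            return self.cache[key]
        # Read miss: start staging from disk before asking for permission.
        self.log.append("stage_started")
        data = self.disk[key]
        # The lock state change is requested later, but it must be granted
        # before the data is handed back to the host.
        if not self.owner.request_lock(self.name, extent_id, "shared"):
            raise RuntimeError("lock not granted")
        self.log.append("lock_granted")
        self.cache[key] = data
        self.log.append("data_returned")
        return data
```

A first read of a block logs `stage_started`, `lock_granted`, `data_returned` in that order; a second read of the same block is a cache hit and sends no further message to the owner.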
Additionally, the node assigned to be an extent owner can change: (1) in response to access patterns; (2) to improve performance; and/or (3) due to a failure of the owner node, or due to a failure of the owner to communicate with the "shared" devices. Such node assignment can be performed in a predictive manner.
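One possible reassignment policy is sketched below. The specification does not state a selection rule, so the rule used here (pick the healthy node with the most accesses and keep the current owner on a tie) is purely an assumption, as are the names `choose_owner`, `access_counts`, and `healthy`.

```python
def choose_owner(current_owner, access_counts, healthy):
    """Pick an extent owner (hypothetical policy, not from the specification).

    access_counts: dict mapping node name -> number of accesses to the extent
    healthy: set of node names that are up and can reach the shared devices
    """
    if not healthy:
        raise ValueError("no healthy node available to own the extent")
    counts = {node: access_counts.get(node, 0) for node in healthy}
    # Sorting first makes tie-breaking among candidates deterministic.
    busiest = max(sorted(counts), key=lambda node: counts[node])
    # Keep the current owner if it is healthy and no other node is busier,
    # which avoids moving ownership without a benefit.
    if current_owner in healthy and counts[current_owner] >= counts[busiest]:
        return current_owner
    return busiest
```

For example, with accesses of 5 on node `a` and 9 on node `b`, ownership moves from `a` to `b`; if `a` has failed, ownership moves to a healthy node regardless of the counts.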
The extent owner notifies the client with an invalidation message when the owner knows the client's data is out of date. The extent owner can also notify the client with a "demote" message because the global cache resource is constrained. Further, the extent owner may target only the client's cached data that has not been accessed for a while. In response, the client may be required to immediately "demote" (throw away) its copy of the data upon receiving instruction from the owner. Alternatively, the client may choose not to demote the data at all, based on the client's prediction of the access pattern to the data. Nodes may be required to tell the owner when they have complied, but these messages don't have to be sent immediately (they can be batched and sent later if the system is busy).
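The client side of this message handling can be sketched as follows. The class and method names (`CacheClient`, `on_invalidate`, `on_demote`, `flush_acks`), the `predicted_hot` set standing in for the client's access-pattern prediction, and the fixed batch size are all assumptions made for illustration.

```python
class CacheClient:
    """Hypothetical extent client that handles owner messages."""

    def __init__(self, batch_size=2):
        self.cache = {}
        self.predicted_hot = set()   # assumed: keys the client expects to reuse
        self.pending_acks = []       # compliance messages not yet sent
        self.sent_batches = []       # batches delivered to the owner
        self.batch_size = batch_size

    def on_invalidate(self, key):
        # The owner knows this copy is out of date, so it is always dropped.
        self.cache.pop(key, None)
        self._queue_ack(("invalidated", key))

    def on_demote(self, key, mandatory):
        # An optional demote may be declined if the client predicts reuse.
        if not mandatory and key in self.predicted_hot:
            return False
        self.cache.pop(key, None)
        self._queue_ack(("demoted", key))
        return True

    def _queue_ack(self, ack):
        # Acknowledgements are batched rather than sent one at a time.
        self.pending_acks.append(ack)
        if len(self.pending_acks) >= self.batch_size:
            self.flush_acks()

    def flush_acks(self):
        if self.pending_acks:
            self.sent_batches.append(list(self.pending_acks))
            self.pending_acks.clear()
```

Batching the acknowledgements reduces the number of messages exchanged between nodes, at the cost of the owner learning about compliance slightly later.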