1. Field of the Invention
The present invention relates generally to the field of storage networks, and more specifically to file switching and switched file systems.
2. Description of the Related Art
Since the birth of computer networking, access to storage has remained among the most important network applications. The reason is simple: the purpose of networks was and is to share data and content, and most of the data worth sharing resides on some form of storage.
Despite the importance of storage applications in networks, their usefulness has, until recently, been greatly limited by the insufficient bandwidth provided by networks. Even at 100 Megabits/second (Mbps) (the most common maximum speed in existing local area networks, also known as Fast Ethernet), accessing data through a network is several times slower than reading it from a hard disk attached locally to a computer. For this reason, historically most of the data accessed by a networked computer (workstation or application server—often referred to as a “client”) has resided on local storage and only data that has to be shared has resided on network servers.
The introduction of Gigabit network technology, however, is changing the rules of the game. A single Gigabit Ethernet or FibreChannel connection is capable of transporting data at aggregate rates of up to 240 Megabytes/second (MB/s), which is much greater than the performance of most locally attached storage devices. This means that in Gigabit networks, data can be accessed through the network much faster than from local storage. As a result, we have now reached the beginning of a fundamental trend in which the majority of useful data is being moved to the network.
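The arithmetic behind this comparison can be made explicit. The sketch below is illustrative only: the conversion from link speed to byte throughput is standard (8 bits per byte; a full-duplex Gigabit link carries traffic in both directions), while the local-disk figure is an assumed sustained rate for a typical drive of the period, not a number taken from the text.

```python
# Back-of-the-envelope throughput comparison; local-disk rate is an assumption.

def mbps_to_mb_per_s(megabits_per_second):
    """Convert a link speed in Megabits/s to Megabytes/s (8 bits per byte)."""
    return megabits_per_second / 8

fast_ethernet = mbps_to_mb_per_s(100)        # 12.5 MB/s
gigabit_one_way = mbps_to_mb_per_s(1000)     # 125 MB/s in one direction
gigabit_full_duplex = 2 * gigabit_one_way    # ~250 MB/s aggregate, both directions

local_disk = 40  # MB/s -- assumed sustained rate for a locally attached disk

print(f"Fast Ethernet:         {fast_ethernet} MB/s")
print(f"Assumed local disk:    ~{local_disk} MB/s")
print(f"Gigabit, full duplex:  ~{gigabit_full_duplex:.0f} MB/s aggregate")
```

Under these assumptions a Fast Ethernet client sees roughly a third of local-disk throughput, while a full-duplex Gigabit link comfortably exceeds it, consistent with the aggregate figure cited above.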
Storage Networks
The ability to store terabytes of data on the network and make that data accessible to tens or even hundreds of thousands of users is extremely attractive. At the same time, creating storage and network systems capable of adequately handling such amounts of information and usage loads is not a simple task. As a result, storage networking—the discipline that deals with designing, building and managing such systems—is rapidly becoming recognized as a separate, specialized field of computer networking.
An introduction to the field of storage networking can be found in “Introducing Storage Area Networks”, Michael Peterson, in the February 1998 issue of InfoStor magazine, PennWell Corp., and in “Building Storage Networks”, Marc Farley, January 2000, McGraw-Hill, ISBN 0072120509. For an excellent overview of storage network architectures, please see “Auspex Storage Architecture Guide”, Second Edition, 2001, Auspex Systems, Inc. For a thorough background on the field and its business implications, as well as for a comparative analysis of current technologies, see “System Area Networks: The Next Generation of Scale in the Data Center”, Robert M. Montague, et al., Jul. 26, 2001, Dain Rauscher Wessels Equity Capital Markets.
The key promise of storage networking is in delivering network systems that enable the sharing of huge amounts of information and content among geographically dispersed users. To deliver on this promise, the storage network systems have to be extremely scalable while providing a high degree of availability comparable to that of the public telephone system. In addition, any system of this scale has to be designed so that it can be managed effectively.
In general, there are two distinct ways of providing storage services to the network: “network disk” and “network file”. In the first approach, network clients are given access to “raw” storage resources generally addressable as arrays of fixed-size data blocks. In the second approach, clients are provided with a protocol, such as NFS and CIFS, for accessing file system services. The NFS protocol is described in “NFS Version 3 Protocol Specification” (RFC 1813), B. Callaghan, et al., June 1995, The Internet Engineering Task Force (IETF). The CIFS protocol is described in “CIFS Protocol Version CIFS-Spec 0.9”, Jim Norton, et al., March 2001, Storage Networking Industry Association (SNIA). These protocols typically provide security, hierarchical directories, the ability to create, open and close individual named files, and access to data within such files. In most systems, clients can view each file as a separate, extensible array of bytes. File access protocols also provide the ability to mediate concurrent access to the same file by multiple clients.
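The file-service semantics described above—named files in hierarchical directories, each viewed as a separate, extensible array of bytes—can be illustrated locally with ordinary POSIX file operations, which is what a protocol such as NFS ultimately exposes to clients. The paths and contents below are hypothetical; this is a local demonstration of the semantics, not an NFS or CIFS client.

```python
# Local illustration of "file as extensible byte array" semantics.
import os
import tempfile

root = tempfile.mkdtemp()                       # stands in for an exported share
os.makedirs(os.path.join(root, "projects", "docs"))

path = os.path.join(root, "projects", "docs", "report.txt")
with open(path, "wb") as f:                     # create a named file
    f.write(b"hello")                           # bytes 0..4

with open(path, "r+b") as f:                    # reopen for update
    f.seek(8)                                   # position past the current end
    f.write(b"world")                           # the file extends; the gap reads as zeros

with open(path, "rb") as f:
    data = f.read()

assert data == b"hello\x00\x00\x00world"        # extensible byte-array view
```

Writing past end-of-file simply grows the array, and unwritten bytes in between read back as zeros—exactly the per-file view that clients of a network file protocol see.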
The “network disk” approach is the foundation of the storage networking architecture commonly known as SAN (Storage Area Networks), while the “network file” approach is in the core of several architectures, including NAS (Network Attached Storage) and distributed file systems.
Currently, none of the available architectures is capable of adequately delivering on the promise of storage networking. Storage area networks scale relatively well; however, sharing data in SAN remains extremely difficult for reasons described below. NAS and distributed file systems, also discussed further below, are excellent in sharing data but have proven very difficult to scale in terms of performance and bandwidth. Because of these limitations, both types of systems are very cumbersome to manage. As a result, today a single storage administrator faces a challenge in managing only 80 gigabytes of data, and the cost and complexity of managing terabyte systems is astronomical.
Early applications of storage networking were focused on database storage such as on-line transaction processing, data mining, customer data, etc. In these applications, SAN remains the most common architecture. Today's applications, such as e-mail, document repositories, CAD/CAM, digital still images and digital video production, streaming high-resolution (HDTV) video, XML-based “soft-structured” data, and many others, are increasingly file-based. As a result, high-performance, high-availability network file services are becoming increasingly important.
Available Approaches to Scaling File Systems
The primary function of every file system is to enable shared access to storage resources. In fact, file systems were originally created to facilitate sharing of then-expensive storage between multiple applications and multiple users. As a result, when exposed as a network service, file systems provide a complete and mature solution to the problem of sharing data.
The flip side is that file systems are complex and very processing-intensive, which substantially increases the performance requirements for any computer that provides file services over a fast network. Serving files to hundreds or thousands of users simultaneously requires tremendous amounts of processing power, memory and bus bandwidth.
Because of the importance and magnitude of the problem, over the last fifteen years a number of different approaches have been tried. An excellent overview of the key issues associated with building network file systems and the various file system architectures can be found in “The Zebra Striped Network File System”, John Henry Hartman, 1994, Ph.D. dissertation submitted in the Graduate Division of the University of California at Berkeley. The known available approaches generally each fall into one of three broad categories: single box solutions, cluster file systems and distributed file systems.
FIG. 17 illustrates a typical application of presently available, commonly used network file systems. The system consists of a local area network 1700, which connects a large number of client workstations 1701 and a number of application servers 1702, connected to various file servers. The file servers typically include standalone servers such as 1703 and 1704, as well as file servers, such as 1705 and 1706, configured as a cluster 1710 with shared storage 1707. The servers 1705 and 1706 are connected together through a high-speed, low-latency intra-cluster connection 1709, and are also connected to the shared storage 1707 through a SAN, typically using optical (FibreChannel) interconnect, such as 1708. In addition, clients 1701, application servers 1702 and file servers 1703 through 1706 may be configured to be part of a distributed file system with the appropriate software services installed on all of those machines.
Single Box Solutions
Single box solutions provide a simple and straightforward approach to the problem of increasing the performance of file servers. Traditionally, the fastest available computers were used to serve files; when even these became insufficient, specialized architectures were built to extend the capabilities of the server. Where one processor was not enough, more processors were added; where the bandwidth of a standard bus was not sufficient, additional busses or even custom-designed wider busses were introduced, and so on.
The result of this approach is that high-end file servers are essentially massively multiprocessing supercomputers, with all the associated costs and complexity. Examples of single box solutions are the EMC Celerra/Symmetrix, SGI Origin, HP Superdome, Intel Paragon and IBM SP, the trademarks of which are hereby acknowledged.
However, high-performance multiprocessing file servers quickly run into the performance limits of their storage subsystems. The approach to resolving this bottleneck is to spread the load among multiple hard disks and data paths operating in parallel. RAID and parallel file systems such as the Cray T3E I/O system and PVFS are the results of this approach. RAID is described in “A case for redundant arrays of inexpensive disks (RAID)”, D. Patterson, et al., in Proceedings of the ACM SIGMOD conference on the Management of Data, pp. 109-116, Chicago, Ill., Jun. 1-3, 1988, Association for Computing Machinery, Inc. The Cray T3E I/O system is described in “NERSC Tutorials: I/O on the Cray T3E”, chapter 8, “Disk Striping”, National Energy Research Scientific Computing Center (NERSC). PVFS is described in “PVFS: A Parallel File System for Linux Clusters”, Philip H. Carns, et al., in Proceedings of the 4th Annual Linux Showcase and Conference, pages 317-327, Atlanta, Ga., October 2000, USENIX Association.
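The core idea of striping—spreading the load among multiple disks operating in parallel—reduces to a simple address mapping. The following is a minimal sketch of RAID-0-style placement; the disk count and block numbering are illustrative assumptions, and real RAID levels add parity or mirroring on top of this.

```python
# Minimal RAID-0-style striping: logical blocks go round-robin across N disks,
# so consecutive blocks can be read or written in parallel on different spindles.

def locate(logical_block, num_disks):
    """Map a logical block number to (disk index, block offset on that disk)."""
    return logical_block % num_disks, logical_block // num_disks

# With 4 disks, blocks 0..3 land on disks 0..3 at offset 0,
# blocks 4..7 land on disks 0..3 at offset 1, and so on:
placements = [locate(b, 4) for b in range(8)]
```

A sequential read of eight blocks thus touches all four disks twice each, which is where the parallel bandwidth comes from.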
Single-box solutions are subject to several serious problems. First, because of the extremely high complexity and the need to develop custom silicon in order to satisfy performance requirements, single box solutions are hideously expensive. Worse, their development cycles are exceedingly long, virtually guaranteeing that they will be “behind the curve” in many important aspects, such as software technologies, protocols, etc., by the time they are generally commercially available. Since storage requirements effectively double every year or so, these boxes often become obsolete long before the customers manage to depreciate their high cost.
Cluster File Systems
An alternative to scaling the server architecture within the box is to put together multiple servers accessing the same pool of storage over a fast interconnect such as HIPPI or FibreChannel. The result is a “cluster” of computers that acts in many aspects similarly to a multiprocessing supercomputer but can be assembled from generally available components.
Since all computers in a cluster access the same set of hard disks, the file system software in each of them has to cooperate with the other members of the cluster in coordinating the access and allocation of the storage space. The simplest way to approach this problem is to section the storage pool and divide it among the different computers in the cluster; this approach is implemented in Windows clustering described in “Windows Clustering Technologies—An Overview”, November 2000, Microsoft Corp. More sophisticated approaches result in specialized cluster file systems, such as PVFS [see Carns, et al. above], Tigershark, GFS, VERITAS SANPoint Foundation, xFS, and Frangipani. Tigershark is described in “The Tiger Shark File System”, Roger L. Haskin, Frank B. Schmuck, in Proceedings of IEEE 1996, Spring COMPCON, Santa Clara, Calif., February 1996. GFS is described in “The Global File System”, Steven Soltis, et al., in Proceedings of the Fifth NASA Goddard Space Flight Center Conference on Mass Storage Systems and Technologies, Sep. 17-19, 1996, College Park, Md., and “The Design and Performance of a Shared Disk File System for IRIX”, Steve Soltis, et al., in Sixth NASA Goddard Space Flight Center Conference on Mass Storage and Technologies in cooperation with the Fifteenth IEEE Symposium on Mass Storage Systems, Mar. 23-26, 1998. VERITAS SANPoint Foundation is described in “VERITAS SANPoint Foundation Suite(tm) and SANPoint Foundation(tm) Suite HA: New VERITAS Volume Management and File System Technology for Cluster Environments”, September 2001, VERITAS Software Corp. xFS is described in “Serverless Network File System”, Thomas E. Anderson, et al., in the 15th Symposium on Operating Systems Principles, December 1995, Association for Computing Machinery, Inc. Frangipani is described in “Frangipani: A Scalable Distributed File System”, Chandramohan A. Thekkath, et al., in Proceedings of the 16th ACM Symposium on Operating Systems Principles, October 1997, Association for Computing Machinery, Inc.
The benefits of cluster file systems, as well as their key problems and approaches to solving them, are described in “Scalability and Failure Recovery in a Linux Cluster File System”, Kenneth W. Preslan, et al., in Proceedings of the 4th Annual Linux Showcase & Conference, Atlanta, Ga., Oct. 10-14, 2000, in “Benefits of SAN-based file system sharing”, Chris Stakutis, in the July 2000 issue of InfoStor magazine, PennWell Corp., and in “Designing SSI Clusters with Hierarchical Checkpointing and Single I/O Space”, Kai Hwang, et al., IEEE Concurrency, pp. 60-69, January-March 1999.
The main challenge in all of the above-mentioned file systems comes from the need to frequently synchronize and coordinate access to the storage among all members of the cluster. This requires a centralized lock manager and/or a file manager that controls the allocation of disk space to different files and other shared metadata (data describing data). These components quickly become the major bottleneck that prevents scaling of the cluster file systems beyond 16 or so nodes.
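The bottleneck described above can be sketched in miniature: every node must obtain a lock from a single coordinator before touching shared metadata, so all metadata traffic in the cluster funnels through one component. This is a toy model with invented names, not the protocol of any particular cluster file system.

```python
# Toy centralized lock manager: a single coordinator serializes all
# metadata updates, modeling the scalability bottleneck described above.
import threading

class LockManager:
    def __init__(self):
        self._guard = threading.Lock()      # the single point of coordination
        self._held = set()                  # resources currently locked

    def acquire(self, resource):
        while True:                         # spin until the resource is free
            with self._guard:
                if resource not in self._held:
                    self._held.add(resource)
                    return

    def release(self, resource):
        with self._guard:
            self._held.discard(resource)

manager = LockManager()
allocated = []                              # shared "metadata": allocation log

def node_allocates(node_id):
    manager.acquire("free-block-map")       # every node contends for one lock
    allocated.append(node_id)               # safe only while holding the lock
    manager.release("free-block-map")

threads = [threading.Thread(target=node_allocates, args=(n,)) for n in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every node completes its update, but strictly one at a time: adding nodes adds contention on the coordinator rather than throughput, which is why such designs stop scaling at modest cluster sizes.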
To relieve this problem, several designs move substantial portions of the file manager functionality either into the storage subsystem or into the client software. Examples of this approach are Zebra [see Hartman, above], Swift, and NASD. Swift is described in “Swift: A Storage Architecture for Large Objects”, Luis-Felipe Cabrera, Darrell D. E. Long, In Proceedings of the Eleventh IEEE Symposium on Mass Storage Systems, pages 123-128, October 1991, and in “Swift/RAID: A distributed RAID system”, D. D. E. Long, et al., Computing Systems, vol. 7, pp. 333-359, Summer 1994. NASD is described in “NASD Scalable Storage Systems”, Garth A. Gibson, et al., June 1999, USENIX99, Extreme Linux Workshop, Monterey, Calif., and in “File Server Scaling with Network-Attached Secure Disks”, Garth A. Gibson, et al., in Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (Sigmetrics '97), 1997, Association for Computing Machinery, Inc. Although none of these approaches has been successfully commercialized as of the time of this writing, the available data suggests that they would scale better than the traditional cluster file systems.
The reliance on centralized resource coordination is the primary weak point of cluster file systems and severely limits their scalability. Solutions that partially relieve this problem introduce other problems, including custom functionality in storage subsystems and specialized client-side software. If any of these approaches is commercialized, the requirement for using proprietary storage subsystems will have a substantial negative effect on both adoption and price, while the need to rely on proprietary client-side software that has to be installed in every client accessing the system makes the system fragile, prone to security breaches, and hard to deploy and support.
Distributed File Systems
Both single box solutions and cluster file systems are tightly coupled systems that exhibit serious scalability limitations. Creating distributed file systems is an approach attempting to combine hundreds of file servers in a unified system that can be accessed and managed as a single file system. Examples of distributed file systems are the Andrew File System and its derivatives AFS and Coda, Tricord, and the Microsoft Distributed File System (DFS). AFS and Coda are described in “The AFS File System in Distributed Computing Environment”, May 1, 1996, Transarc Corp., and in “AFS-3 Programmer's Reference: Architectural Overview”, Edward R. Zayas, Transarc Corp., version 1.0 of Sep. 2, 1991, doc. number FS-00-D160. Tricord is described in U.S. Pat. No. 6,029,168 to Frey, issued Feb. 22, 2000, and entitled “Decentralized file mapping in a striped network file system in a distributed computing environment.” The Microsoft DFS is described in “Distributed File System: A Logical View of Physical Storage: White Paper”, 1999, Microsoft Corp.
Distributed file systems are loosely coupled collections of file servers that can be located in diverse geographical locations. They provide a unified view of the file namespace, allowing clients to access files without regard to where in the system those files reside. In addition, the system administrator can move files from one server to another in a transparent fashion and replicate files across multiple servers for increased availability in case of partial system failure.
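The unified-namespace idea can be sketched as a location table consulted on every lookup: clients resolve a path to whichever servers currently hold it, so the administrator can move or replicate files by editing one table entry, transparently to clients. The server names and mappings below are hypothetical.

```python
# Sketch of unified-namespace resolution in a distributed file system.
# Clients resolve paths through a location table instead of fixed servers.

location_table = {
    "/home":  ["server-a"],
    "/media": ["server-b", "server-c"],   # replicated for availability
}

def resolve(path):
    """Return the servers that can satisfy a request for this path
    (longest matching prefix wins)."""
    best = ""
    for prefix in location_table:
        if path.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    return location_table[best] if best else []

# Clients see one namespace, unaware of physical placement:
assert resolve("/home/alice/notes.txt") == ["server-a"]
assert resolve("/media/clip.mov") == ["server-b", "server-c"]

# The administrator "moves" /home by updating a single entry;
# the client-visible path never changes:
location_table["/home"] = ["server-d"]
assert resolve("/home/alice/notes.txt") == ["server-d"]
```

Replication for availability falls out of the same mechanism: a path that maps to several servers can be served by any surviving replica after a partial failure.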
Distributed file systems exhibit excellent scalability in terms of storage capacity. It is easy to add new servers to an existing system without bringing it off-line. In addition, distributed file systems make it possible to connect storage residing in different geographical locations into a single cohesive system.
The main problem with available distributed file systems is that they do not scale in performance nearly as well as they scale in storage capacity. No matter how large the number of servers in the system, each individual file resides on exactly one server. Thus, the performance that the distributed file system can deliver to a single client (workstation or application server) is limited by the performance of the individual file server on which the requested file resides, which, given the large number of servers involved, is not likely to be a very high-performance machine.
Another problem that has great impact in commercial environments is the fact that most distributed file systems require specialized client-side software that has to be installed and configured properly on each and every client that is to access the file system. This tends to create massive versioning and support problems.
Moreover, distributed file systems are very prone to “hotspotting”. Hotspotting occurs when the demand for an individual file or a small set of files residing on a single server increases dramatically over a short period of time, resulting in severe degradation of the performance experienced by a large number of users.
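A small simulation makes the mechanism concrete: because each file lives on exactly one server, a skewed workload concentrates on that file's home server no matter how many servers the system contains. The workload and server names below are invented for illustration.

```python
# Simulating hotspotting: one popular file overloads its single home server.
from collections import Counter

servers = [f"server-{i}" for i in range(10)]

def home_server(filename):
    """Each file has exactly one fixed home server (hash-based placement)."""
    return servers[hash(filename) % len(servers)]

# Assumed skewed workload: 90% of 1000 requests hit one popular file,
# the rest are spread across 100 other files.
requests = ["hot.mpg"] * 900 + [f"file-{i}" for i in range(100)]
load = Counter(home_server(name) for name in requests)

hot = home_server("hot.mpg")
# The hot file's home server absorbs the entire spike; the other nine
# servers stay nearly idle regardless of how many servers exist.
```

Adding servers does not help: the spike follows the file, not the capacity, which is why hotspotting degrades performance for a large user population at once.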
Yet another problem with distributed file systems is their low manageability. Although most aspects of distributed file systems can be managed while the system is on-line, the heterogeneous and distributed nature of these systems effectively precludes any serious automation of the management tasks. As a result, managing distributed file systems requires a large amount of highly qualified labor.