1. Field of the Invention
This invention relates to a distributed file system within a computer system and, more specifically, to providing multiple computers or other computing entities with concurrent access to a file system while maintaining the integrity and coherence of the file system.
2. Description of the Related Art
Historically, a file system has often been accessible to only one computer at a time. For example, most computers have a local disk drive within the computer that contains a file system that historically has only been accessible to that computer. If multiple computers are given concurrent, unrestricted access to a typical file system, the data in the file system will likely become corrupted. For example, suppose that a first computer and a second computer are connected to a disk drive containing a single file system. The computers may be connected to the disk drive by a SCSI interface (Small Computer System Interface), for example. Now if both computers are allowed to read and write file data and file system configuration data at will, a wide variety of conflicts can occur. As an example, suppose both computers are accessing the same file on the disk drive, and they both try to write to the end of the file at the same time. If only one write can actually be performed to the disk at a time, then the two writes will occur one after the other, and the second write will generally overwrite the data that was written in the first write, causing the data of one of the computers to be lost. As another example, suppose that both computers attempt to add a new directory to the file system at the same time. Again, a reference to a first directory created by the first computer may be overwritten by a reference to a second directory created by the second computer. Thus, to provide multiple computers with concurrent access to a common file system on a shared data store, without corrupting the data and the file system, one or more locking and/or coherence mechanisms must generally be implemented.
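The first conflict described above, in which a later write overwrites an earlier write to the end of a file, can be sketched in a short simulation. This is purely illustrative; the file name and helper function are hypothetical and are not part of any system described herein. Each "computer" first observes the same end-of-file offset and then writes at that offset, so the second write lands on top of the first:

```python
import os

# Purely illustrative: two "computers" each read the current end-of-file
# offset once, then each writes its record at that offset, with no locking.
def unsynchronized_appends(path, records):
    offset = os.path.getsize(path)      # both writers observe the same size
    for record in records:
        with open(path, "r+b") as f:
            f.seek(offset)              # both seek to the same offset
            f.write(record)             # the later write overwrites the earlier

with open("shared.dat", "wb") as f:
    f.write(b"HEADER|")

unsynchronized_appends("shared.dat", [b"first", b"secnd"])
with open("shared.dat", "rb") as f:
    result = f.read()
print(result)  # b'HEADER|secnd' -- the first computer's write is lost
```

With a locking mechanism, the second writer would instead re-read the file size after acquiring the lock and append after the first record rather than on top of it.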
This invention may be implemented in any such situation in which it is advantageous for multiple computers or other computing entities to have concurrent access to a common file system, and this invention will improve the integrity and coherence of the file system and the data contained in the file system. The most common such situation in which the invention may be advantageously implemented involves multiple server computers connected to a data storage unit, such as through a data network. Thus, the preferred embodiment of the invention is described as being implemented in a computer system comprising a data storage unit, multiple servers and some means of interconnecting the servers with the data storage unit. In many cases when multiple servers are connected to a data storage unit, however, each server has its own file system within the data storage unit, so that concurrent access to a common file system is not necessary. There are, however, other situations in which it is advantageous for multiple servers or other computers to have concurrent access to a common file system. One such situation is where multiple virtual machines (VMs) execute on multiple physical servers and share the same file system on a shared data store. Implementing the invention in such a system is particularly advantageous for several reasons, as described briefly below.
Various techniques are known for enabling multiple computers to share a common file system. FIGS. 1A and 1B, for example, illustrate two different system configurations that have been used to give multiple servers access to a common file system.
FIG. 1A illustrates a computer system in which multiple servers access a common file system indirectly, by using a file server as an intermediary. FIG. 1A shows a plurality of servers 10, 12 . . . 18 connected together through a local area network (LAN) 20, which also interconnects with a file server 30. The file server 30 is connected to a data storage unit 40.
The data storage unit 40, illustrated in FIG. 1A and in other figures in this application, may be any data storage medium, or any combination of data storage media, that can hold a file system. Thus, the data storage unit 40 may be anything from a simple disk drive device to a complex combination of various data storage devices and/or systems. The data storage unit 40 includes a file system 41, which may be any conventional file system, such as a New Technology File System (NTFS) from Microsoft Corporation or a UNIX or Linux file system. The file server 30 may be a conventional file server, such as a server based on an x86 architecture from Intel Corporation, running a conventional operating system (OS), such as a Linux OS distribution, a Windows OS from Microsoft Corporation or a UNIX OS, along with a standard file server application. The file server 30 may be connected to the data storage unit 40 by any conventional means, such as through a SCSI interface. The local area network 20 may be a conventional computer network, such as an Ethernet network. Also, the servers 10, 12 and 18 may each be a conventional server, such as a server based on the x86 architecture, running a conventional OS.
The servers 10, 12 and 18, however, do not access the file system 41 directly. The server 10 cannot, for example, directly mount the file system 41 or directly open a file within the file system 41. Instead, the servers 10, 12 and 18 must interact with the file server 30 to obtain access to the file system 41. For example, the server 10 may request that a directory be added to the file system 41, but it is the file server 30 that actually accesses the file system 41 to add the directory. Similarly, if the server 12 desires access to a file within the file system 41, it is the file server 30 that actually reads data from the file or writes data to the file, as requested by the server 12. In this configuration, only one server, namely the file server 30, ever has direct access to the file system 41. Thus, this is not a distributed file system, in which multiple computers have concurrent access to a common file system.
The configuration of FIG. 1A is not desirable in many situations because the file server 30 can be a bottleneck that substantially slows down the speed at which the servers 10, 12 and 18 may interact with the file system 41. Interactions with the file system 41 may only proceed as fast as the file server 30 is able to service the requests of the servers 10, 12 and 18 and transfer the data between the servers and the data storage unit 40. Also, the file server 30 represents a single point of failure in the servers' ability to access the file system 41. A distributed file system is generally desirable in such situations, so that each of the servers 10, 12 and 18 may access the data storage unit 40 directly and independently, without having to go through the file server 30 to obtain access.
FIG. 1B illustrates a computer system that implements a prior art distributed file system. In this system, multiple servers access a common file system through a data storage network, and they communicate locking information with each other using a separate computer network. FIG. 1B shows the same plurality of servers 10, 12 . . . 18 connected together through the same local area network 20. FIG. 1B also shows the same data storage unit 40, including the same file system 41. This time, however, the servers 10, 12 and 18 are connected to the data storage unit 40 using a data storage network 32.
The data storage network 32 may be a conventional Storage Area Network (SAN), for example, based on any of a variety of technologies, including possibly Fibre Channel technology or SCSI technology. An important advantage of using a SAN or similar data storage network 32 is that the entire interface between the servers 10, 12 and 18 and the data storage unit 40 may be made very reliable. First, the data storage network 32 may be configured with redundant data paths between each of the servers 10, 12 and 18 and the data storage unit 40. Thus, for example, the data storage network 32 may comprise at least a first path and a second path between the first server 10 and the data storage unit 40. Either the first path or the second path may be used to transfer data between the server 10 and the data storage unit 40. Next, the system may be set up with failover capabilities, so that, if there is a failure in one data path between a server 10, 12 or 18 and the data storage unit 40, the system may switch over and use another, redundant data path. Thus, for example, when there is a first data path and a second data path between the first server 10 and the data storage unit 40, and there is a failure along the first path preventing its use, the system can switch over and use the second data path to maintain a connection between the server 10 and the data storage unit 40.
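The failover behavior described above can be sketched as follows. This is a minimal illustrative model, not an actual HBA driver: the path functions and the exception type are hypothetical stand-ins for real data paths and real link failures.

```python
class PathFailedError(Exception):
    """Raised when a data path to the storage unit cannot be used."""

def transfer(block, paths):
    # Try each redundant data path in order; on a path failure, fail
    # over to the next path rather than losing access to the storage unit.
    for path in paths:
        try:
            return path(block)
        except PathFailedError:
            continue
    raise PathFailedError("no usable path to the data storage unit")

# Hypothetical stand-ins for a healthy path and a failed path.
def good_path(block):
    return ("delivered", block)

def dead_path(block):
    raise PathFailedError("link down")

# A failure along the first path causes a switch to the second path.
result = transfer(b"data", [dead_path, good_path])
print(result)  # ('delivered', b'data')
```

Only when every redundant path has failed does the transfer fail outright, which is why a fully redundant network requires at least two independent paths per server.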
It is often advantageous to have a fully redundant data storage network 32, so that no single failure can prevent any of the servers 10, 12 and 18 from accessing their data on the data storage unit 40. One requirement of a fully redundant data storage network 32 is that each server 10, 12 and 18 must have at least two interface cards for interfacing with the data storage network. Otherwise, if a server only has a single interface card and a failure on that card prevents its use for accessing the data storage network 32, then the respective server 10, 12 or 18 is prevented from accessing the data storage unit 40. Thus, each of the servers 10, 12 and 18 in FIG. 1B is shown as having a pair of data interface cards. Specifically, the server 10 includes a first data interface card 10C and a second data interface card 10D, the server 12 includes a first data interface card 12C and a second data interface card 12D, and the server 18 includes a first data interface card 18C and a second data interface card 18D. Each of the data interface cards 10C, 10D, 12C, 12D, 18C and 18D may be a conventional data interface card for interfacing with the data storage network 32. For example, if the data storage network 32 is a Fibre Channel network, then the data interface cards 10C, 10D, 12C, 12D, 18C and 18D may be Fibre Channel host bus adapter cards (HBAs).
Each of the servers 10, 12 and 18 may use the data storage network 32 to access the file system 41 in the data storage unit 40. Each of the servers 10, 12 and 18 may have full access to the file system 41, including mounting the file system, reading and modifying configuration data for the file system, and reading and writing file data within the file system. Without more, however, the file system 41 would likely become corrupted, as described above. Thus, a distributed file system such as the one illustrated in FIG. 1B must place restrictions on the ability of the servers 10, 12 and 18 to access the file system 41.
Existing distributed file systems use the exchange of locking information to restrict access to the file system. A few examples of such distributed file systems are the Frangipani file system, created by the Digital Equipment Corporation; the xFS file system, created at the University of California at Berkeley; and the Veritas cluster file system, developed by the Veritas Software Corporation. These distributed file systems require that the servers 10, 12 and 18 exchange locking information to ensure that they do not access the file system 41 in conflicting manners. For example, a first file in the file system 41 may have a first lock associated therewith. One of the servers 10, 12 and 18 may be designated as a master server with respect to this first lock. Thus, suppose the server 12 is designated as the master server with respect to the first lock and that the server 10 desires to access the first file. The server 10 must communicate with the server 12 and request the first lock. The server 12 must then communicate with the server 10 to grant it the first lock before the server 10 may access the first file. Thus, for such a distributed file system to work, there must be some means of communication between the servers 10, 12 and 18.
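The request/grant exchange described above can be sketched as a minimal in-process model of a lock master. This is a toy illustration with hypothetical names; it is not the actual protocol of Frangipani, xFS or the Veritas cluster file system, and it omits the network messaging, queuing and failure handling a real system requires.

```python
class LockMaster:
    """Toy model of a server that masters file locks for other servers."""

    def __init__(self):
        self.holders = {}  # lock name -> name of the server holding it

    def request(self, lock, server):
        # Grant the lock only if it is free (or already held by the requester).
        if self.holders.get(lock) in (None, server):
            self.holders[lock] = server
            return True    # grant message returned to the requester
        return False       # lock busy; the requester must wait and retry

    def release(self, lock, server):
        # Only the current holder may release the lock.
        if self.holders.get(lock) == server:
            del self.holders[lock]

# E.g. server 12 masters the first lock; server 10 requests it first.
master = LockMaster()
print(master.request("first_lock", "server10"))  # True: access granted
print(master.request("first_lock", "server18"))  # False: already held
master.release("first_lock", "server10")
print(master.request("first_lock", "server18"))  # True after release
```

Because every access to the locked file must funnel through the master, the master's reachability (and hence the inter-server network) becomes critical, which motivates the discussion that follows.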
Although the data storage network 32 enables the servers 10, 12 and 18 to interface with the data storage unit 40, such networks typically do not enable the servers 10, 12 and 18 to interface with each other. Thus, computer systems that use a distributed file system such as the one illustrated in FIG. 1B typically also include a separate network that may be used by the servers 10, 12 and 18 to communicate with each other. In FIG. 1B, the separate LAN 20 is generally used for this purpose. Thus, in the example described above, the server 10 may send a network packet to the server 12 using the LAN 20, requesting the first lock, so that it may access the first file. The server 12 may then send another network packet back to the server 10 granting it the first lock, and thereby granting it access to the first file.
In the system of FIG. 1B, the LAN 20 is essential to the servers 10, 12 and 18 gaining access to the data storage unit 40, because it enables the servers to communicate with each other to obtain the locks required for that access. As a result, the reliability of the servers' access to the data storage unit 40 is dependent on the reliability of the LAN 20. Put simply, if a server 10, 12 or 18 cannot access the LAN 20 to obtain a lock for using the file system 41, it does not matter how reliable the data storage network 32 is. Thus, to improve the reliability of data access for the servers 10, 12 and 18, redundant paths are preferably also provided for enabling the servers 10, 12 and 18 to interface with each other over the LAN 20. In particular, each of the servers 10, 12 and 18 is preferably provided with two network interface cards (NICs) for connecting to the LAN 20. Otherwise, with just a single NIC, a failure in that NIC could prevent the respective server from obtaining a lock required to access its data in the data storage unit 40. Thus, the server 10 includes a first NIC 10A and a second NIC 10B, the server 12 includes a first NIC 12A and a second NIC 12B, and the server 18 includes a first NIC 18A and a second NIC 18B. Now, for example, if the first NIC 10A fails, the server 10 may still interface with the LAN 20 using the second NIC 10B.
In many situations, a system such as the system of FIG. 1B is not desirable for various reasons. First, it may not be desirable to require a second network such as the LAN 20 to enable the servers 10, 12 and 18 to communicate with each other to access the data storage unit 40. Even if each of the servers 10, 12 and 18 is connected to some other computer network, it may not be desirable to ensure that they are all connected to the same computer network. Second, it may not be desirable to provide each server 10, 12 and 18 with a pair of NICs just to provide full redundancy for their access to the data storage unit 40. Third, configuring a system such as the system of FIG. 1B can be complicated and time-consuming. Each of the servers 10, 12 and 18 must be provided with a substantial amount of information, such as which servers 10, 12 and 18 are permitted to access the file system 41 and individual data entities within the file system 41, which server is to function as the master server for each lock, and the IP (Internet Protocol) addresses or other addresses for each of the other servers. Finally, a system such as the one illustrated in FIG. 1B typically must also employ a complex re-mastering technique that is used whenever a master server of a lock fails, to ensure that another server can become the master server, so that the remaining servers may still access the data entity secured by the lock.
What is needed is a distributed file system that enables multiple computing entities to have concurrent access to a data storage unit, without having to go through a file server, and without all of the complexity, expense and inefficiencies of existing distributed file systems.