This invention is specifically concerned with distributed data processing systems characterized by a plurality of processors interconnected in a network. As actually implemented, the invention runs on a plurality of IBM RT PC.sup.1 systems interconnected by IBM's Systems Network Architecture (SNA), and more specifically SNA LU 6.2 Advanced Program to Program Communication (APPC). SNA uses as its link level Ethernet.sup.2, a local area network (LAN) developed by Xerox Corp., or SDLC (Synchronous Data Link Control). A simplified description of local area networks including the Ethernet local area network may be found in a book by Larry E. Jordan and Bruce Churchill entitled Communications and Networking for the IBM PC, published by Robert J. Brady (a Prentice-Hall company) (1983). A more definitive description of communications systems for computers, particularly of SNA and SDLC, is to be found in a book by R. J. Cypser entitled Communications Architecture for Distributed Systems, published by Addison-Wesley (1978). It will, however, be understood that the invention may be implemented using other and different computers than the IBM RT PC interconnected by other networks than the Ethernet local area network or IBM's SNA.
.sup.1 RT and RT PC are registered trademarks of IBM Corporation. .sup.2 Ethernet is a trademark of Xerox Corporation.
As mentioned, the invention to be described hereinafter is directed to a distributed data processing system in a communication network. In this environment, each processor at a node in the network potentially may access all the files in the network regardless of the nodes at which the files may reside. As shown in FIG. 1, a distributed network environment 1 may consist of two or more nodes A, B and C connected through a communication link or network 3. The network 3 can be a local area network (LAN) as mentioned or a wide area network (WAN), the latter comprising a switched or leased teleprocessing (TP) connection to other nodes or to an SNA network of systems. At any of the nodes A, B or C there may be a processing system 10A, 10B or 10C, such as the aforementioned IBM RT PC. Each of these systems 10A, 10B and 10C may be a single user system or a multi-user system with the ability to use the network 3 to access files located at a remote node in the network. For example, the processing system 10A at local node A is able to access the files 5B and 5C at the remote nodes B and C.
The problems encountered in accessing remote nodes can be better understood by first examining how a standalone system accesses files. In a standalone system, such as 10 shown in FIG. 2, a local buffer 12 in the operating system 11 is used to buffer the data transferred between the permanent storage 2, such as a hard file or a disk in a personal computer, and the user address space 14. The local buffer 12 in the operating system 11 is also referred to as a local cache or kernel buffer. For more information on the UNIX.sup.3 operating system kernel, see the book by Brian W. Kernighan and Rob Pike entitled The Unix Programming Environment, Prentice-Hall (1984). A more detailed description of the design of the UNIX operating system is to be found in the book by Maurice J. Bach, Design of the Unix Operating System, Prentice-Hall (1986). The local cache can be best understood in terms of a memory resident disk. The data retains the physical characteristics that it had on disk; however, the information now resides in a medium that lends itself to faster data transfer rates very close to the rates achieved in main system memory.
.sup.3 Developed and licensed by AT&T. UNIX is a registered trademark of AT&T in the U.S.A. and other countries.
In the standalone system, the kernel buffer 12 is organized into blocks 15, each of which is identified by a device number and a logical block number within the device. When a read system call 16 is issued, it is issued with a file descriptor of the file 5 and a byte range within the file 5, as shown in step 101 in FIG. 3. The operating system 11 takes this information and converts it to a device number and logical block numbers within the device in step 102. Then the operating system 11 reads the cache 12 according to the device number and logical block numbers, step 103.
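Steps 102 and 103 can be illustrated with a minimal sketch. It is not the operating system's actual code. The block size of 4096 bytes, the per-file `block_map` list, the dictionaries standing in for the cache and the disk, and the function names are all assumptions made for illustration only.

```python
BLOCK_SIZE = 4096  # assumed block size; the text does not specify one


def byte_range_to_blocks(block_map, offset, length):
    """Step 102 (sketch): map a byte range within a file to logical block numbers.

    block_map is a hypothetical per-file list whose i-th entry is the logical
    block number, on the device, that holds the i-th block of the file.
    """
    if length <= 0:
        return []
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return [block_map[i] for i in range(first, last + 1)]


def read_blocks(cache, disk, device, blocks):
    """Step 103 (sketch): look up each block by (device, block number).

    The disk is consulted only when the block is not already in the cache.
    """
    data = []
    for block in blocks:
        key = (device, block)
        if key not in cache:
            cache[key] = disk[key]  # cache miss: fetch from permanent storage
        data.append(cache[key])
    return data
```

For example, with a file whose blocks are stored at logical blocks 70, 71 and 95, a read of 200 bytes starting at byte offset 4000 spans the first two file blocks and therefore touches logical blocks 70 and 71.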
Any data read from the disk 2 is kept in the cache block 15 until the cache block 15 is needed for other data. Consequently, any successive read request from an application program 4 running on the processing system 10 for data previously read from the disk is satisfied from the cache 12 and not the disk 2. Reading from the cache is less time consuming than accessing the disk; therefore, by reading from the cache, performance of the application 4 is improved. If the data to be accessed is not in the cache, then a disk access must be made, but this requirement occurs infrequently.
Similarly, data written from the application 4 is not saved immediately on the disk 2 but is written to the cache 12. This again saves time, improving the performance of the application 4. Modified data blocks in the cache 12 are saved on the disk 2 periodically under the control of the operating system 11.
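The read and write behavior of the cache 12 described above can be sketched as follows. This is a simplified model under stated assumptions, not the AIX implementation: the class name `WriteBackCache`, the dictionary standing in for the disk 2, the `sync` method standing in for the periodic save performed by the operating system 11, and the counters are all hypothetical.

```python
class WriteBackCache:
    """Sketch of a kernel buffer cache keyed by (device, logical block)."""

    def __init__(self, disk):
        self.disk = disk        # dict standing in for permanent storage
        self.blocks = {}        # cached blocks: (device, block) -> data
        self.dirty = set()      # keys of blocks modified since the last sync
        self.disk_reads = 0
        self.disk_writes = 0

    def read(self, key):
        # Only a cache miss causes a disk access.
        if key not in self.blocks:
            self.blocks[key] = self.disk[key]
            self.disk_reads += 1
        return self.blocks[key]

    def write(self, key, data):
        # Writes go to the cache only; the disk is updated later by sync().
        self.blocks[key] = data
        self.dirty.add(key)

    def sync(self):
        # Stands in for the periodic save of modified blocks to disk.
        for key in sorted(self.dirty):
            self.disk[key] = self.blocks[key]
            self.disk_writes += 1
        self.dirty.clear()
```

In this sketch, two successive reads of the same block cost one disk read, and a written block reaches the disk only when `sync` is called.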
Use of a cache in a standalone system that utilizes the AIX.sup.4 operating system, which is the environment in which the invention was implemented, improves the overall performance of the system disk and minimizes access time by eliminating the need for successive read and write disk operations.
.sup.4 AIX is a trademark of IBM Corporation.
In the distributed networking environment shown in FIG. 1, there are two ways the processing system 10C in local node C could read the file 5A from node A. In one way, the processing system 10C could copy the whole file 5A and then read it as if it were a local file 5C residing at node C. Reading the file in this way creates a problem if another processing system 10B at node B, for example, modifies the file 5A after the file 5A has been copied at node C. The processing system 10C would not have access to the latest modifications to the file 5A.
Another way for processing system 10C to access a file 5A at node A is to read one block at a time as the processing system at node C requires it. A problem with this method is that every read has to go across the network communications link 3 to the node A where the file resides. Sending the data for every successive read is time consuming.
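The trade-off between the two ways of reading a remote file can be illustrated with a small model. It is a sketch only: the class `FileServer`, its methods, and the message counter are hypothetical, and each request is counted as one network message regardless of how much data it carries.

```python
class FileServer:
    """Stands in for node A, which holds file 5A as a list of blocks."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.messages = 0  # requests that crossed the network

    def fetch_whole_file(self):
        self.messages += 1
        return list(self.blocks)  # the requester receives its own copy

    def fetch_block(self, index):
        self.messages += 1
        return self.blocks[index]

    def write_block(self, index, data):
        # Models a modification made by another node, such as node B.
        self.blocks[index] = data


server = FileServer(["b0", "b1", "b2"])

# First way: copy the whole file once, then read the local copy.
local_copy = server.fetch_whole_file()
server.write_block(1, "b1-modified")
stale = local_copy[1]          # the copy does not see the modification

# Second way: read one block at a time; every read crosses the network.
fresh = server.fetch_block(1)
block_reads = [server.fetch_block(i) for i in range(3)]
```

Here the whole-file copy costs one request but returns outdated data after the modification, whereas the block-at-a-time reads always return current data at the cost of one request per read.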
Accessing files across a network presents two competing problems as illustrated above. One problem involves the time required to transmit data across the network for successive reads and writes. On the other hand, if the file data is stored at the accessing node to reduce network traffic, the file integrity may be lost. For example, if one of the several nodes is also writing to the file, the other nodes accessing the file may not be accessing the latest version of the file that has just been written. As such, the file integrity is lost, and a node may be accessing incorrect and outdated files. Within this document, the term "server" will be used to indicate the processing system where the file is permanently stored, and the term "client" will be used to mean any other processing system having processes accessing the file. The invention to be described hereinafter is part of an operating system which provides a solution to the problem of managing distributed information.
Other approaches to supporting a distributed data processing system in a UNIX operating system environment are known. For example, Sun Microsystems has released a Network File System (NFS) and Bell Laboratories has developed a Remote File System (RFS). The Sun Microsystems NFS has been described in a series of publications including S. R. Kleiman, "Vnodes: An Architecture for Multiple File System Types in Sun UNIX", Conference Proceedings, USENIX 1986 Summer Technical Conference and Exhibition, pp. 238 to 247; Russel Sandberg et al., "Design and Implementation of the Sun Network Filesystem", Conference Proceedings, Usenix 1985, pp. 119 to 130; Dan Walsh et al., "Overview of the Sun Network File System", pp. 117 to 124; JoMei Chang, "Status Monitor Provides Network Locking Service for NFS"; JoMei Chang, "SunNet", pp. 71 to 75; and Bradley Taylor, "Secure Networking in the Sun Environment", pp. 28 to 36. The AT&T RFS has also been described in a series of publications including Andrew P. Rifkin et al., "RFS Architectural Overview", USENIX Conference Proceedings, Atlanta, Ga. (June 1986), pp. 1 to 12; Richard Hamilton et al., "An Administrator's View of Remote File Sharing", pp. 1 to 9; Tom Houghton et al., "File Systems Switch", pp. 1 to 2; and David J. Olander et al., "A Framework for Networking in System V", pp. 1 to 8.
One feature that distinguishes the distributed services system in which the subject invention is implemented from the Sun Microsystems NFS, for example, is that Sun's approach was to design what is essentially a stateless machine. More specifically, the server in a distributed system may be designed to be stateless. This means that the server does not store any information about client nodes, including such information as which client nodes have a server file open, whether client processes have a file open in read_only or read_write modes, or whether a client has locks placed on byte ranges of the file. Such an implementation simplifies the design of the server because the server does not have to deal with error recovery situations which may arise when a client fails or goes off-line without properly informing the server that it is releasing its claim on server resources. An entirely different approach was taken in the design of the distributed services system in which the present invention is implemented. More specifically, the distributed services system may be characterized as a "stateful implementation".
A "stateful" server, such as that described here, does keep information about who is using its files and how the files are being used. This requires that the server have some way to detect the loss of contact with a client so that accumulated state information about that client can be discarded. The cache management strategies described here, however, cannot be implemented unless the server keeps such state information. The management of the cache is affected, as described below, by the number of client nodes which have issued requests to open a server file and the read/write modes of those opens.
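The kind of state such a server keeps can be sketched as a table of opens. This is an illustrative model only, not the data structure of the distributed services system: the class `OpenTable`, its method names, and the `summary` result are hypothetical, and no cache management policy is implied by it.

```python
from collections import defaultdict


class OpenTable:
    """Sketch of per-file state kept by a stateful server."""

    MODES = ("read_only", "read_write")

    def __init__(self):
        self.opens = defaultdict(list)  # file -> list of (client, mode)

    def open(self, file, client, mode):
        if mode not in self.MODES:
            raise ValueError("unknown mode: " + mode)
        self.opens[file].append((client, mode))

    def close(self, file, client, mode):
        self.opens[file].remove((client, mode))
        if not self.opens[file]:
            del self.opens[file]

    def client_lost(self, client):
        # Discard all accumulated state for a client that has lost contact.
        for file in list(self.opens):
            remaining = [(c, m) for c, m in self.opens[file] if c != client]
            if remaining:
                self.opens[file] = remaining
            else:
                del self.opens[file]

    def summary(self, file):
        # Returns (number of client nodes with the file open,
        #          number of those holding a read_write open).
        entries = self.opens.get(file, [])
        clients = {c for c, _ in entries}
        writers = {c for c, m in entries if m == "read_write"}
        return len(clients), len(writers)
```

A stateless server would keep no such table. The cost of keeping it is the `client_lost` path, which must run whenever contact with a client is lost.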
More specifically, because file path name resolution is so frequent, it is important that it be done efficiently. Each system call that uses a file name, for example mount or open, can require that a directory be read and searched for each component of the file name's path. The performance penalties of reading numerous directories each time a file name is used are even more serious in a distributed environment where some of the directories may be in remote nodes.
Some UNIX™ implementations cache directory entries each time they are used in resolving a file's name. Subsequent file name resolution on the same file or files with names that have path pieces in common with the previously cached entries will run faster because directory entries can be found in the cache. Finding directory entries in the cache is faster than reading and searching directories because: (1) the directory cache is a special data structure maintained by the operating system that is optimized for searching; (2) the cache is kept in memory while the directories need to be read from the file system; and (3) the cache will usually have only a limited number of entries to be examined. The directory cache holds the most recently used, and hence the most likely to be useful, directory entries.
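Component-by-component name resolution with a directory cache can be sketched as below. The sketch makes several assumptions for illustration: directories are modeled as a dictionary mapping a directory's inode number to its entries, the root directory is given inode number 2, and the class `NameResolver` and its counter are hypothetical names rather than part of any UNIX implementation.

```python
class NameResolver:
    """Sketch of path name resolution backed by a directory entry cache."""

    def __init__(self, directories, root_inode=2):
        self.directories = directories  # inode -> {entry name: inode}
        self.root = root_inode          # assumed inode number of "/"
        self.cache = {}                 # (parent inode, name) -> inode
        self.directory_reads = 0        # directories read and searched

    def lookup(self, parent, name):
        key = (parent, name)
        if key in self.cache:
            return self.cache[key]      # found without reading the directory
        self.directory_reads += 1       # miss: read and search the directory
        inode = self.directories[parent][name]
        self.cache[key] = inode
        return inode

    def resolve(self, path):
        inode = self.root
        for component in path.strip("/").split("/"):
            if component:
                inode = self.lookup(inode, component)
        return inode
```

With this model, resolving /usr/bin/ls reads three directories the first time and none the second time, and a later resolution of /usr/lib reads only one directory because the entry for usr is already cached.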
There are two major problems that the operating system faces in using a directory cache. The contents of the cache must be kept consistent with the contents of the directories, and the cache must be kept from getting too big. It is important that the cache be kept consistent. If the directory cache indicates that a file's inode number is, say, 45 but the directory has been changed, perhaps due to a mv command, so that the file's real inode number is 62, attempts to resolve the file's name will resolve to the wrong file; an open could open a file different from the one that was specified. If the cache is allowed to grow arbitrarily, it will eventually be so large that the time required to search it will negatively affect performance.
In a stand-alone system, the operating system itself is responsible for all changes to directories, making it possible for the operating system to purge from the directory cache any entries that may have changed, thus always leaving the directory cache with consistent entries. When the cache becomes full, some entries can be purged to make room for new entries. The choice of entries to purge to make room is not critical, but performance will usually be least impacted if the most recently used entries are retained. Since the major problems of directory caching can be handled in this fashion for stand-alone systems, several stand-alone UNIX™ implementations including stand-alone AIX™ do directory caching.
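The two stand-alone remedies, purging entries that may have changed and purging the least recently used entries when the cache is full, can be sketched with an ordered dictionary. The class `DirectoryCache`, its capacity parameter, and the choice to purge all entries of a changed directory are assumptions made for this sketch, not a description of any particular UNIX or AIX implementation.

```python
from collections import OrderedDict


class DirectoryCache:
    """Sketch of a bounded directory cache that retains recently used entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # (parent inode, name) -> inode

    def get(self, parent, name):
        key = (parent, name)
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, parent, name, inode):
        key = (parent, name)
        self.entries[key] = inode
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # purge least recently used

    def purge_directory(self, parent):
        # Called when the operating system changes a directory, so that no
        # entry that may have changed remains in the cache.
        for key in [k for k in self.entries if k[0] == parent]:
            del self.entries[key]
```

In the example of a file whose inode number changes from 45 to 62, the operating system would call `purge_directory` for the affected directory as part of the change, so the next lookup misses the cache and reads the new inode number from the directory itself.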
The solutions available for stand-alone systems do not work in a distributed environment. The directory cache is maintained by client nodes, while the directories themselves are changed at other nodes acting as servers, so that cache entries at a client could become inconsistent without the client's knowledge. Attempts to maintain consistency by communicating every directory change at every server to every client caching directory entries could flood a network with these messages, vitiating any performance advantages from the directory caching.
Operating efficiency in accessing file directories in networks as described above would therefore be greatly improved by the ability to cache file directory information and be assured of its validity, without needlessly and inefficiently updating this information during periods when no changes have been made.