Today's computers require memory to hold or store both the steps or instructions of computer programs and the data that those programs take as input or produce as output. This memory is conventionally divided into two types, primary storage and secondary storage. Primary storage is that which is immediately accessible by the computer or microprocessor, and is typically though not exclusively used as temporary storage. It is, in effect, the short term memory of the computer. Secondary storage can be seen as the long-term computer memory. This form of memory maintains information that must be kept for a long time, and may be orders of magnitude larger and slower. Secondary memory is typically provided by devices such as magnetic disk drives, optical drives, and so forth. These devices present to the computer's operating system a low-level interface in which individual storage subunits may be individually addressed. These subunits are often generalized by the computer's operating system into “blocks,” and such devices are often referred to as “block storage devices.”
Block storage devices are not typically accessed directly by users or (most) programs. Rather, programs or other components of the operating system organize block storage in an abstract fashion and make this higher-level interface available to other software components. The most common higher-level abstraction thus provided is a “file system” (often also written as filesystem). In a file system, the storage resource is organized into directories, files, and other objects. Associated with each file, directory, or other object is typically a name, some explicit/static metadata such as its owner, size, and so on, its contents or data, and an arbitrary and open set of implicit or “dynamic” metadata such as the file's content type, checksum, and so on. Directories are containers that provide a mapping from directory-unique names to other directories and files. Files are containers for arbitrary data. Because directories may contain other directories, the file system client (human user, software application, etc.) perceives the storage to be organized into a quasi-hierarchical structure or “tree” of directories and files. This structure may be navigated by providing the unique names necessary to identify a directory inside another directory at each traversed level of the structure. Hence, the organizational structure of names is sometimes said to constitute a “file system namespace.”
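By way of non-limiting illustration, the traversal of a file system namespace described above may be sketched as follows. The in-memory model (directories as dictionaries, files as byte strings) and the function name `resolve` are assumptions made purely for this example and do not correspond to any particular file system implementation.

```python
# Hypothetical in-memory model: a directory is a dict mapping
# directory-unique names to entries; a file is a bytes object.
def resolve(root, path):
    """Walk the tree one name at a time, as a file system client would."""
    node = root
    for name in [part for part in path.split("/") if part]:
        if not isinstance(node, dict):
            raise NotADirectoryError(name)
        if name not in node:
            raise FileNotFoundError(name)
        node = node[name]
    return node

root = {"home": {"alice": {"notes.txt": b"hello"}}, "etc": {}}
contents = resolve(root, "/home/alice/notes.txt")
```

Each component of the path is looked up only within the directory reached so far, which is why names need to be unique only within their containing directory.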
Conventional file systems support a finite set of operations (such as create, open, read, write, close, delete) on each of the abstract objects which the file system contains. For each of these operations, the file system takes a particular action in accordance with the operation in question and the data provided in the operation. The sequence of these operations over time effects changes to the file system structure, data, and metadata in a predictable way. The set of file system abstractions, operations, and predictable results for particular actions is said to constitute a “semantic” for the file system.
In some cases, a storage resource is accessed by a computer over a network connection. Various mechanisms exist for allowing software or users on one computing device to access storage devices that are located on another remote computer or device. While there are several remote storage access facilities available, they generally fall into one of two classes: block-level and file-level. File-level remote storage access mechanisms extend the file system interface and namespace across the network, enabling clients to access and utilize the files and directories as if they were local. Such systems are therefore typically called “network file systems.” One example of this type of storage access mechanism is the Network File System (“NFS”) originally developed by Sun Microsystems. Note that the term “network file system” is used herein generally to refer to all such systems and the term “NFS” will be used when discussing the Network File System developed by Sun Microsystems.
Networked file systems enable machines to access the file systems that reside on other machines. Architecturally, this leads to the following distinctions. In the context of a given file system, one machine plays the role of a file system “origin server” (alternatively either “fileserver” or simply “server”) and another plays the role of a file system client. The two are connected via a data transmission network. The client and server communicate over this network using standardized network protocols. The high-level protocols which extend the file system namespace and abstractions across the network are referred to as “network file system protocols.” There are many such protocols, including the Common Internet File System or CIFS, the aforementioned NFS, Novell® Netware file sharing system, Apple® AppleShare®, the Andrew File System (AFS), the Coda file system (Coda®), and others. CIFS and NFS are by far the most prevalent. All of these network file system protocols share approximately equivalent semantics and sets of abstractions, but differ in their details and are not interoperable. In order to use a file system from some fileserver, a client must “speak the same language,” i.e., have software that implements the same protocol that the server uses.
A fileserver indicates which portions of its file systems are available to remote clients by defining “exports” or “shares.” In order to access a particular remote fileserver's file systems, a client must then make those exports or shares of interest available by including them by reference as part of their own file system namespace. This process is referred to as “mounting” or “mapping (to)” a remote export or share. By mounting or mapping, a client establishes a tightly coupled relationship with the particular file server. The overall architecture can be characterized as a “two-tier” client-server system, since the client communicates directly with the server which has the resources of interest to the client.
The pressing need to monitor file systems and to report activities related to the file systems presents a challenge of unprecedented scope and scale on many fronts. For example, current network file system architectures suffer several shortcomings. In large network settings (e.g., those with large numbers of clients and servers), the architecture itself creates administrative problems for the management and maintenance of file systems. The inflexibility of the two-tier architecture manifests itself in two distinct ways. First, the tight logical coupling of client and server means that changes to the servers (e.g., moving a directory and its [recursive] contents from one server to another) require changes (e.g., to the definitions of mounts or mappings) on all clients that access that particular resource, and thus must be coordinated and executed with care. This is a manual and error-prone process that must be continuously engaged and monitored by the system administrators that manage and maintain such networked file systems. Second, the overall complexity of the environment grows at a non-linear rate. The complexity of a system of networked file system clients and servers can be characterized by the total number of relationships (mounts, mappings) between clients and servers, i.e., it grows as, or is bounded by:

$$\text{Complexity} \approx \#\,\text{Clients} \times \#\,\text{Servers}$$
Two-tier networked file systems therefore ultimately fail to scale in an important sense: the overall cost of managing a networked file system environment is proportional to this complexity, and as the complexity grows the costs quickly become untenable. This can be referred to as “the mapping problem.” The mapping problem may be understood as the direct result of an architectural deficiency in networked file systems, namely the inflexibility of the two-tier architecture.
Existing attempts to address the problems of unconstrained complexity growth in the networked file system environment generally take one of two forms: automation of management tasks, and minimization of the number of mounts through storage asset virtualization. The automation approach seeks to provide better administrative tools for managing network file storage. The virtualization approach itself takes two forms: abstraction and delegation. The abstraction approach aggregates low-level storage resources across many servers so that they appear to be a single resource from a single server from a client's perspective. The delegation approach designates a single server as “owning” the file system namespace, but upon access by a client the delegation server instructs the client to contact the origin server for the resource in question to carry out the request. None of these approaches alone fully addresses the architectural deficiencies that cause complexity growth.
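As a worked illustration of the mapping problem, the bound on the number of mount relationships may be computed directly. The function name and the client and server counts below are hypothetical and chosen only to show how quickly the product grows.

```python
def mount_relationships(num_clients, num_servers):
    """Upper bound on client-server mounts in a two-tier arrangement."""
    return num_clients * num_servers

# Hypothetical environments, for illustration only.
small = mount_relationships(10, 2)
large = mount_relationships(1000, 50)
print(small, large)
```

In this example, growing the client population by a factor of 100 and the server population by a factor of 25 multiplies the number of relationships to be managed by 2,500.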
“Directory services” can be used to centralize the definition and administration of both lists of server exports and lists of mounts between clients and servers. Automation schemes can then allow clients to automatically lookup the appropriate server for a given file system in a directory service and mount the file system in its own namespace on demand.
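One possible shape of such an automation scheme is sketched below, by way of example only. The class `Automounter`, the dictionary standing in for a directory service, and the server and export names are all assumptions introduced for illustration; no real directory service or mount facility is invoked.

```python
class Automounter:
    """Looks up a file system in a directory and mounts it on first access."""

    def __init__(self, directory):
        self.directory = directory   # logical name -> (server, export)
        self.mounted = {}            # logical name -> "server:export"

    def access(self, name):
        if name not in self.mounted:
            # Raises KeyError if the directory has no entry for this name.
            server, export = self.directory[name]
            self.mounted[name] = f"{server}:{export}"
        return self.mounted[name]

# Hypothetical directory contents.
directory = {"projects": ("fs1", "/export/projects")}
client_a = Automounter(directory)
location = client_a.access("projects")
```

Because the mapping is held centrally, a relocated export may be recorded once in the directory rather than on each client, although clients that have already mounted the old location still hold it until they remount.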
File system virtualization solutions to date have usually taken one of three forms: low-level gateways between networked block-level protocols and file-level protocols; delegation systems; and fully distributed file systems. Low level gateways aggregate storage resources which are made available over the network in block (not file) form, and provide a file system atop the conjunction of block storage devices thus accessed. This provides some benefit in minimizing the number of exports and servers involved from a client perspective, but creates new complexity in that a new set of protocols (block-level storage protocols) is introduced and must be managed.
Delegation systems centralize namespace management in a single system (i.e., they make it appear that all the files are located on a single server) while actually redirecting each client request to a particular origin server. Delegation systems are relatively new and support for them must be enabled in new versions of the various file system protocols. Delegation systems allow a directory service to appear as a file system. One example is Microsoft Corp.'s NT-DFS. Delegation systems typically do not map individual directories to individual directories. In other words, all the directories below a certain point in the file system namespace controlled by the delegation system are mapped to a single top-level directory. Another shortcoming is that prior art delegation systems typically respond to a request for a file or directory with the same response, regardless of the client making the request. As another deficiency, the underlying directory service does not handle requests directly, but redirects the requests to be handled by underlying systems.
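The redirection behavior described above may be sketched, as a simplified and non-limiting example, as a prefix lookup that returns a referral rather than file data. The class name, the referral table, and the server and path names are hypothetical; this is not a model of NT-DFS or of any specific protocol.

```python
class DelegationServer:
    """Owns the namespace but only refers clients to origin servers."""

    def __init__(self, referrals):
        # namespace prefix (no trailing slash) -> (origin server, top-level dir)
        self.referrals = referrals

    def refer(self, path):
        # Longest matching prefix wins; the client's identity is never consulted,
        # so every client receives the same referral for the same path.
        best = None
        for prefix in self.referrals:
            if path == prefix or path.startswith(prefix + "/"):
                if best is None or len(prefix) > len(best):
                    best = prefix
        if best is None:
            raise FileNotFoundError(path)
        server, top = self.referrals[best]
        return server, top + path[len(best):]

dfs = DelegationServer({"/corp/eng": ("fs1", "/export/eng")})
referral = dfs.refer("/corp/eng/specs/a.txt")
```

Everything beneath `/corp/eng` in this sketch lands under the single top-level directory `/export/eng` on one origin server, and the delegation server itself never serves the file contents.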
Fully distributed file systems employ distributed algorithms, caching, and so forth to provide a unified and consistent view of a file system across all participating machines. While addressing mount management to some extent, distributed file systems introduce new and significant challenges in terms of maintaining consistency, increased sensitivity to failures, and increased implementation complexity. It should be noted that fully distributed file systems typically require specialized protocols and software on every participant in the system, in effect making every computer involved both a client and a server. Other distributed file systems seek to support mobile clients which frequently disconnect from the network, and thus focus on techniques for caching files and operations and ensuring consistency of the distributed file system upon reconnection.
Some prior art has focused on mechanisms for taking multiple file systems and producing a merged logical view of those file systems on a given file system client. This is sometimes referred to as “stack mounting.” Stack mounting to date has been seen as a nondistributed mechanism. It is used by a client to organize and structure their own local file system namespace for various purposes, rather than being used to organize and manage a collection of network file systems on an enterprise basis. Existing stacking file systems are limited in an important way—among a collection of logically joined file systems, a single origin file system is designated as the primary or “top” file system “layer” in the stack. All writes are performed on this file system layer. This has incorrectly been perceived as the only way to preserve the “correct” or traditional semantics of file systems.
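For purposes of illustration only, the conventional stack mounting behavior described above, in which reads fall through the layers while all writes land on the top layer, might be modeled as follows. The class `StackMount` and the use of dictionaries as stand-ins for file systems are assumptions of this sketch.

```python
class StackMount:
    """Merged view of several file systems; layers[0] is the 'top' layer."""

    def __init__(self, layers):
        self.layers = layers  # each layer: dict mapping path -> data

    def read(self, path):
        # The first layer that contains the path supplies the data.
        for layer in self.layers:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Conventional limitation: every write goes to the top layer.
        self.layers[0][path] = data

    def listing(self):
        return sorted(set().union(*self.layers))

top, lower = {}, {"/a": b"old", "/b": b"keep"}
stack = StackMount([top, lower])
stack.write("/a", b"new")
```

After the write, the lower layer still holds its original copy of `/a`, but the merged view returns the top layer's version.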
In addition to organizing and maintaining the relationships between file system clients and file servers, additional challenges exist in managing access to and utilization of file systems. While most organizations have and enforce stringent document workflow and retention policies for their paper files, similar policies—while desired and mandated—are rarely enforced for electronic files. As a non-limiting example, many corporations have a policy that prohibits the usage of corporate storage capacity on fileservers for the storage of certain personal files and content types—for instance MP3s, personal digital images, and so on. This “policy” usually takes the form of a memo, email, etc. The administrators in charge of enforcing this policy face significant challenges. Conventional file systems do not provide mechanisms for configuring a file system to only allow particular content types or otherwise automatically make decisions about what should be stored, where, and how. These conventional file systems are static, and the set of semantics for access and other administrative controls are rather limited. Thus any such policy enforcement that happens is done retroactively and in an ad-hoc manner via manual or mostly-manual processes. The net result is that network file storage fills up with old, duplicated, and garbage files that often violate corporate and administrative utilization policies.
File systems are quasi-hierarchical collections of directories and files. The “intelligence” that a file system exhibits with respect to access control is typically restricted to a static set of rules defining file owners, permissions, and access control lists. To the extent even this relatively low level of “intelligence” exists, it is typically statically defined as a part of the file system implementation and may not be extended. Current file systems do not allow arbitrary triggers and associated activities to be programmed outside of the permissions hard coded in the original implementation of the file system.
Additional challenges exist for file system monitoring and reporting. File system activity produces changes to the state of a file system. This activity can effect changes to the structure, the stored metadata, and the stored data of the directories and files. Generally speaking, this activity is not logged in any way. Rather, the file system itself holds its current state. Some file systems, called “journaling” file systems, maintain transient logs of changes for a short duration as a means of implementing the file system itself. These logs, however, are not typically organized in any way conducive to monitoring and reporting on the state of the file system and its evolutionary activity over time. These logs are typically not made available to external programs, but are instead internal artifacts of the file system implementation. Further, these logs are frequently purged and therefore provide a poor basis for reporting of historical and trend data.
The collection, redaction, and analysis of high-level data about what a file system is being used for, what is stored in it, by whom and for what purpose continue to be a significant problem. Solutions today involve software programs or users explicitly walking through the file system structure, gathering the data required, and then analyzing it and/or acting on it, etc. Collection of file system data proactively as operations occur is generally not done as it is generally not supported by the file system itself. Furthermore, the accuracy of such collected data is usually questionable, as it reflects not an instantaneous state of the file system at any given moment, but, rather, an approximate state of the file system over the duration of the run. Without collecting and maintaining the appropriate statistics as file operations occur, it is impossible for the data, at the end of the run, to represent a correct and accurate picture of the contents of the file system at that time.
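The explicit walk-and-gather approach described above may be sketched as follows, by way of example. The function name `scan` and the choice of statistics (file counts and bytes per extension) are assumptions of this illustration; the temporary directory exists only to make the sketch self-contained.

```python
import os
import tempfile
from collections import Counter

def scan(root):
    """Walk a directory tree and tally file counts and bytes by extension."""
    counts, sizes = Counter(), Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(full)
            except OSError:
                continue  # file vanished or became unreadable mid-scan
            ext = os.path.splitext(name)[1].lower() or "(none)"
            counts[ext] += 1
            sizes[ext] += size
    return counts, sizes

# Build a tiny throwaway tree and scan it.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "home", "user"))
    with open(os.path.join(root, "home", "user", "song.MP3"), "wb") as f:
        f.write(b"abc")
    with open(os.path.join(root, "memo.txt"), "wb") as f:
        f.write(b"hello")
    counts, sizes = scan(root)
```

The `except OSError` branch reflects the accuracy concern noted above: files may be created, changed, or removed while the walk is in progress, so the result describes the tree only approximately over the duration of the run.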
The problem of data collection and reporting is further compounded in the network file system environment. Because each server—indeed, each file system on each server—is a separate entity, it is therefore necessary to perform each data collection independently on each server. If reporting or monitoring is to be done across the network file system environment, significant challenges exist; namely, because of the parallel and discrete nature of the collection runs, it becomes difficult or impossible to sensibly merge the collected data into a consistent snapshot of the state of the file system at some time.
It is further the case that collection and storage of all such data as it occurs could be untenably burdensome; such logs would “grow” quickly and consume additional storage capacity at an undesirable rate. The ability to both collect such data as it occurs and dynamically redact or “historize” it would allow ongoing statistics to be maintained while simultaneously constraining the total amount of storage capacity that must be dedicated to such a purpose.
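One hypothetical way to “historize” collected data, offered only as a sketch, is to keep recent events verbatim and to collapse older events into per-interval counts. The function name, the event format of (timestamp, operation) pairs, and the retention and bucket sizes are all assumptions of this example.

```python
from collections import Counter

def historize(events, now, keep_seconds=3600, bucket_seconds=3600):
    """Split events into recent raw entries and summarized older counts.

    events: iterable of (timestamp_in_seconds, operation_name) pairs.
    Returns (recent_events, summary), where summary maps
    (bucket_start, operation_name) to a count.
    """
    recent, summary = [], Counter()
    for ts, op in events:
        if now - ts <= keep_seconds:
            recent.append((ts, op))
        else:
            bucket = ts - ts % bucket_seconds  # start of the containing interval
            summary[(bucket, op)] += 1
    return recent, summary

events = [(0, "write"), (10, "write"), (20, "read"), (7000, "write")]
recent, summary = historize(events, now=7200)
```

The three older events collapse into two summary counters, so the storage needed for history depends on the number of intervals and operation types rather than on the number of operations performed.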
In today's increasingly litigious environment and in the presence of rules and regulations such as the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and the Sarbanes-Oxley Act of 2002, the lack of management, including the inability to enforce policies consistently and effectively, represents a serious risk that corporations and businesses alike must rush to address. Unfortunately, as a direct result of the general lack of innovation and improvement in file system architecture over the last 30 years, viable solutions that could provide practical and effective policy management to enterprises do not seem to exist.
Perhaps a general comparison between typical database systems and typical file systems could provide an insight as to the lack of innovation and improvement in file system architecture. For databases, storage is usually organized into tables arranged in a flat space (i.e., tables may not be contained in other tables) which contain records with generally fixed form. Such database systems often provide a notion of “triggers” and “stored procedures.” Triggers define a set of conditions; when the database is manipulated in a way that matches some condition, the stored procedure associated with that trigger is executed, potentially modifying the transaction or operation. This mechanism is used primarily in two ways in database applications: to ensure data correctness and integrity and to automate certain administrative and application-specific tasks. The analogous facility is not available in file systems because file systems are quasi-hierarchical collections of directories and files. As such, triggers cannot be defined with associated stored procedures that can be automatically activated and enacted synchronous with a file system activity in any extant file system.
In general, implementation of triggers and stored procedures in file systems is significantly more complex than in database systems because of the less regular structure of file systems, their less formally well-defined semantics, and because file data is itself arbitrarily semi-structured and loosely typed. Implementation of programmable procedures which respond to an arbitrary file system operation by modifying the operation is challenging when the correct (i.e., traditional, expected, etc.) semantics of file systems must be preserved. There are existing systems that will generate “events” when operations occur on the file system; these events can then be used to activate arbitrary actions post-facto. However, the actions cannot themselves modify the file operation, since the event which activates them is not generated until the triggering operation completes.
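The database notion of a trigger paired with a stored procedure may be illustrated, in greatly simplified form, as follows. The class `Table` and its methods are hypothetical and model no particular database product; each trigger is a condition together with a procedure that runs before the row is stored and may modify or reject it.

```python
class Table:
    """Toy table whose inserts pass through registered triggers."""

    def __init__(self):
        self.rows = []
        self.triggers = []  # list of (condition, procedure) pairs

    def add_trigger(self, condition, procedure):
        self.triggers.append((condition, procedure))

    def insert(self, row):
        for condition, procedure in self.triggers:
            if condition(row):
                # The procedure may return a modified row or raise to reject.
                row = procedure(row)
        self.rows.append(row)
        return row

def reject(row):
    raise ValueError("amount must not be negative")

orders = Table()
# Integrity rule: refuse negative amounts.
orders.add_trigger(lambda r: r["amount"] < 0, reject)
# Administrative rule: normalize customer names to upper case.
orders.add_trigger(lambda r: True, lambda r: {**r, "customer": r["customer"].upper()})
stored = orders.insert({"customer": "acme", "amount": 5})
```

The two triggers correspond to the two uses named above: the first guards data integrity by rejecting the operation, and the second automates an administrative task by modifying it.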
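The limitation of post-facto events may be seen in the following sketch, which is illustrative only. The class `EventingStore`, the dictionary standing in for file storage, and the listener are hypothetical; the point shown is merely that the listener runs after the write has already taken effect.

```python
class EventingStore:
    """Toy store that notifies listeners only after an operation completes."""

    def __init__(self):
        self.files = {}
        self.listeners = []

    def write(self, path, data):
        self.files[path] = data          # the operation completes first
        for listener in self.listeners:
            listener("write", path)      # listeners are notified afterwards

violations = []

def flag_mp3(operation, path):
    # The listener can record or react, but cannot alter the finished write.
    if path.lower().endswith(".mp3"):
        violations.append(path)

store = EventingStore()
store.listeners.append(flag_mp3)
store.write("/home/user/song.mp3", b"data")
```

The offending file is already present in the store by the time the listener is called, so any remedy must be a separate, later operation.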
Currently, the “intelligence” that a conventional file system exhibits with respect to access control is typically restricted to a static set of rules defining file owners, permissions, and access control lists. To the extent even this relatively low level of “intelligence” exists, it is usually statically defined as a part of the file system implementation and may not be extended.
In a typical enterprise, the files and directories stored in the enterprise file systems represent unstructured or semi-structured business intelligence, which comprises the work product and intellectual property produced by its knowledge workers. The work product may include business-critical assets and may range from Excel spreadsheets representing (collectively) the financial health and state of the enterprise to domain-specific artifacts such as Word documents representing memos to customers. However, in contrast to the data stored in “mission critical” information systems such as logistics systems, inventory systems, order processing systems, customer service systems, and other “glass house” applications, the unstructured and semi-structured information stored in the enterprise file systems is largely “unmanaged.” It is perhaps backed up but little or no effort is made to understand what the information is, what its relevance or importance to the business might be, or even whether it is appropriately secured.
As an example, assume that a user ‘Idunno’ has stored unauthorized and illegal copies of MP3 music files in a “home directory” on a file server that belongs to a corporation, ‘Big Corp,’ where Idunno works. In doing so, Idunno has perhaps violated a corporate policy of Big Corp stating that no MP3 files are to be stored on the network. However, since the “home directory” is not visible to the system managers, the system managers have no knowledge of this violation, nor any automated means of remedying the situation. Even in the event that the system managers are able to episodically inventory the file systems for such violators, they are often loath to automatically take appropriate actions (e.g., deleting) on such offending files. The reason is that, more often than not, while they have the responsibility for enforcing such policies, they do not have the authority to do so. To remedy this, the end-user (i.e., the file owner; in this example, Idunno) or some other responsible party must be brought “into the loop.” Other examples of situations implicating file management policies might include: documents relating to patients' individual medical conditions within a healthcare provider business might be stored in such a way that perhaps would violate the privacy and/or security constraints of HIPAA; or financial documents within the finance operation of a Fortune 2000 company might be stored in such a way that perhaps would violate both regulatory requirements under the Sarbanes-Oxley Act of 2002 and internal corporate governance considerations.