The present invention relates to file systems and, more particularly, to a file system that supports an explicit command to store files in accordance with desired performance characteristics of the files.
Almost all computer systems, whether large mainframes or tiny embedded microcontrollers, need to store data in such a way that the data are not lost when the system is powered down. Those computers therefore usually include some kind of Non-Volatile Memory (NVM), in addition to any volatile memory they may use for running their programs. The NVM may be a magnetic disk, a flash memory chip, or any other non-volatile storage element.
FIG. 1 shows the general structure of how such a storage device is accessed. At the bottom of the Figure is a physical storage medium 10, which is the hardware layer implementing the physical storage. As each storage device may have its own unique interface and peculiarities which make it very inconvenient to work with, it is the common practice to have a software device driver 12 included in the operating system running on the computer (or running on the bare hardware, if no operating system is used), with this device driver 12 providing a simplified and standardized interface for other software components wishing to access the device. For storage devices 10 used for storing files (e.g. disks, diskettes, etc.), but not only for them, the interface provided by their device drivers 12 is usually of the type known as a “block device driver”. Such device drivers 12 interact with their clients using blocks of data rather than single bytes. This applies to both input and output operations, that is, to both reading and writing. The most common example of a block device 10 is the magnetic disk, whose hardware interface is commonly configured for transferring only complete blocks (usually called “sectors” in this context), such as 512 bytes or more. It should be emphasized that it is not necessary for physical storage device 10 to be physically limited to block operations in order to have a device driver 12 presenting a block device interface. For example, a battery-backed RAM disk is not physically limited to blocks and may physically read and write each of its memory bytes. Still, its device driver 12 typically presents a block device interface to the rest of the system, so as to be compatible and interchangeable with magnetic disks. Therefore, for the purpose of the present invention, a block device is any device whose driver 12 presents a block device interface, regardless of its actual physical structure.
A block device appears to its users as a linear array of blocks of a certain fixed size. Each one of these blocks can be read or written independently of the other blocks using its index in the array, as shown in FIG. 2. The common practice (which is also used here) is to number the blocks starting from block number 0 (21) and ending in block number N−1 (22), where N is the number of blocks exported by the device driver. Again it should be emphasized that this linear array structure does not necessarily exist at the physical device level. For example, a flash disk block device driver 12 also presents this linear array image, but internally the physical blocks on flash medium 10 are usually scattered in a random order (such that block number 0 may physically be located in the middle or at the end of flash medium 10) due to the writing limitations of flash memory and the possible existence of bad blocks. It should also be understood that block device driver 12 has no knowledge of the contents put into its blocks by the upper software layers 14 and 16.
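The block device interface described above can be illustrated by the following minimal sketch. The class name RamBlockDevice and the method names read_block and write_block are hypothetical and are chosen only for illustration; they are not taken from any particular device driver. The sketch models a RAM-backed device that could physically address single bytes but nevertheless exposes only whole blocks indexed from 0 to N−1.

```python
class RamBlockDevice:
    """Toy block device: a linear array of N fixed-size blocks, indexed 0..N-1.

    Hypothetical illustration only; names and defaults are assumptions.
    """

    def __init__(self, num_blocks: int, block_size: int = 512) -> None:
        self.num_blocks = num_blocks
        self.block_size = block_size
        # Every block starts out zero-filled.
        self._blocks = [bytes(block_size) for _ in range(num_blocks)]

    def _check_index(self, index: int) -> None:
        if not 0 <= index < self.num_blocks:
            raise IndexError(f"block index {index} out of range")

    def read_block(self, index: int) -> bytes:
        """Return the complete contents of one block."""
        self._check_index(index)
        return self._blocks[index]

    def write_block(self, index: int, data: bytes) -> None:
        """Replace one complete block; partial blocks are rejected."""
        self._check_index(index)
        if len(data) != self.block_size:
            raise ValueError("a block device transfers whole blocks only")
        self._blocks[index] = bytes(data)
```

Note that the sketch neither inspects nor interprets the data it stores, in keeping with the observation that the device driver has no knowledge of the contents placed in its blocks by the upper layers.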
Referring again to FIG. 1, it is seen that there is usually a File System (FS) software layer 14 on top of device driver 12. A FS 14 is a software component which provides further insulation from physical device 10, by enabling the application programs 16 to interact with storage device 10 using only the concept of files, a concept which is much more natural and convenient to the typical programmer or user. FS 14 achieves this abstraction by organizing the user data on block device 10 into some logical structure, and associating the blocks containing a file's data with the file's attributes (e.g. file name, creation time, access permissions, etc.). For that purpose FS 14 stores into device 10 meta-data, which are not directly visible to the user, and which include the FS 14 internal book-keeping information with which FS 14 is able to trace and access the user files. For example, the Microsoft DOS FAT12 file system, which is one of the simplest FS commercially available, stores on storage device 10 a boot sector (which must be in the first block of device 10) containing some basic parameters that allow the other meta-data structures to be located, one or more copies of the File Allocation Table (FAT), which is the allocation map of device 10, and a root directory structure for locating files by name. Application programs 16 interact with FS 14 on the file level, by issuing commands such as “open file”, “delete file”, “write file”, etc. Application programs 16 thus are completely ignorant of the underlying block structure. There are many file systems 14 in use today, differing greatly in their internal structures and characteristics. In many cases (such as with the Linux operating system) an operating system even provides several file systems 14 to its users, who may choose the one most suitable for their needs.
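As a simplified sketch of how such meta-data let a file system trace a file's data, the following fragment follows a chain of entries in an allocation table, starting from the first cluster recorded for a file name. The dictionaries root_directory and fat, the file name, the cluster numbers and the single END_OF_CHAIN marker are all assumptions made for illustration; the actual on-disk FAT12 layout is more involved.

```python
END_OF_CHAIN = 0xFFF  # simplified end marker, assumed for this sketch


def cluster_chain(fat, first_cluster):
    """Follow allocation-table links and return the file's clusters in order."""
    chain = []
    seen = set()
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        if cluster in seen:
            raise ValueError("loop detected in allocation table")
        seen.add(cluster)
        chain.append(cluster)
        cluster = fat[cluster]  # each entry names the next cluster of the file
    return chain


# Hypothetical meta-data: the directory maps a name to the first cluster,
# and the table maps each cluster to the next one.
root_directory = {"NOTES.TXT": 2}
fat = {2: 3, 3: 7, 7: END_OF_CHAIN}

chain = cluster_chain(fat, root_directory["NOTES.TXT"])
```

An application program sees only the file name; the list of clusters in chain is the kind of book-keeping that remains hidden inside the file system.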
While most prior art storage systems follow the layered modular separation between device driver 12 and file system 14 that is illustrated in FIG. 1, there are storage systems in which such clear separation does not exist and in which both device driver and file system functionalities are combined into one software module. Such systems are typically referred to as “monolithic” storage systems. It should be understood that while all explanations are given here using the layered model, the present invention is equally applicable to the monolithic model, where any reference to a “file system” should be interpreted as referring to the file-handling portion of the monolithic software, and any reference to a “device driver” should be interpreted as referring to the device-handling portion of the monolithic software.
More details of the considerations related to the handling of files by a file system are found in Mosek, US Patent Application Publication No. 2005/0193025, filed on May 20, 2004 and entitled “A File System that Manages Files According to Content”. This patent application is incorporated by reference for all purposes as if fully set forth herein.
Prior art file systems 14 and storage device drivers 12 handle storage space allocation the same for all files. One must understand, though, that the way a file is allocated physical space may affect the performance the user will see when accessing the file.
As an example, consider the case of reading a file that spans many sectors (a sector is the minimal chunk of data read from or written to medium 10, and is typically 512 bytes long). Let us assume that storage medium 10 which is used by file system 14 is a solid-state flash disk. Assume that file system 14 is called upon to read the first 16 KB of the file. This translates to reading the first 32 data sectors. Even though the user logically views those 16 KB as if they reside on medium 10 as a continuous stream of data, this is not necessarily the way the data are physically stored on medium 10. File systems 14 typically allocate medium space when there is a need for writing new data, and their allocation policy can be non-sequential. A simple example of this is the very widely used Microsoft FAT file system, the one used in DOS and many Windows operating systems. FAT allocates space in “clusters”, with each cluster containing a few sectors (a typical case is 4 sectors per cluster). Even though the sectors within the same cluster are allocated physically sequential locations on medium 10, there is no necessity that logically adjacent clusters (that is, clusters having adjacent logical addresses) be physically close to each other on medium 10. File system 14 may decide to allocate clusters that were left free in the middle of the space used by another file, thus mapping a continuous range of logically sequential data into a few physically non-sequential chunks of medium 10. (Microsoft does provide a utility for “defragmenting” a disk drive. Such a utility re-arranges the files on the medium so that the files are “continuous”. However, the defragmenting utility is very time-consuming and is typically run by the system's user only once in a few weeks, while in the in-between period the disk gradually returns to its “non-continuous” state.)
(Additionally, one should distinguish between logically continuous storage and physically continuous or contiguous storage. For flash disks these are not the same. Due to the flash limitations of bad blocks and the inability to rewrite without erasing big chunks of space, flash memory management systems employ mapping of logical addresses to physical addresses, so that even though the device driver reports continuous allocation, physically it is not so.) More details of the way FAT-like file systems handle storage allocation are found in Ban et al., US Patent Application Publication No. 2004/0193564, filed on Mar. 27, 2003 and entitled “Robust Self-Maintaining File System”. This patent application is incorporated by reference for all purposes as if fully set forth herein.
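The arithmetic and the fragmentation effect discussed above can be sketched as follows. The function names and the example cluster numbers are hypothetical. The sketch treats cluster numbers as the addresses reported by the device driver; as noted above, on a flash disk the physical layout may differ from these addresses again.

```python
SECTOR_BYTES = 512        # typical sector size given in the text
SECTORS_PER_CLUSTER = 4   # typical FAT case given in the text


def sectors_needed(num_bytes, sector_bytes=SECTOR_BYTES):
    """Number of whole sectors required to hold num_bytes (rounded up)."""
    return -(-num_bytes // sector_bytes)


def contiguous_runs(cluster_chain):
    """Group a file's cluster chain into runs of consecutive cluster numbers."""
    runs = []
    for cluster in cluster_chain:
        if runs and cluster == runs[-1][-1] + 1:
            runs[-1].append(cluster)   # extends the current run
        else:
            runs.append([cluster])     # a jump starts a new run
    return runs


sectors = sectors_needed(16 * 1024)            # first 16 KB of the file
clusters = sectors // SECTORS_PER_CLUSTER      # clusters covering those sectors

# Hypothetical chain for those clusters, with two jumps in the middle.
example_chain = [10, 11, 12, 13, 40, 41, 7, 8]
runs = contiguous_runs(example_chain)
```

With these assumed numbers, the 32 logically sequential sectors occupy 8 clusters that fall into three separate runs, so a sequential read of the file crosses two discontinuities.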
Many NAND-type flash memory devices (typically used for implementing solid-state disks or memory cards) have a feature of “continuous reading”. This means that when the last byte of a flash page (the equivalent of a data sector on the device hardware level) is read out of the device, the device automatically starts loading its internal data buffer from the memory cells of the physically next page. In NAND flash, the time required for loading the data buffer with the stored data is relatively long and has a major effect on the reading performance of the memory system. A typical value for the time it takes to load the buffer is 15 microseconds. For a 16-bit wide device using 100 nanosecond bus cycles, a 512-byte page is transferred in 256 bus cycles of two bytes each, so the time required for transferring the page over the bus is 256×0.1=25.6 microseconds, which is of the same order of magnitude as the buffer loading time. Therefore doing buffer loading in advance can significantly speed up the reading of a sector, and consequently the reading of the file. However, such saving of time using the “continuous reading” feature is only possible when the next page to be read is physically sequential to the previously read page. It follows that reading performance is affected by the storage allocation policy and can benefit from “clever” allocation decisions.
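The figures given above can be combined in a rough timing estimate, sketched below. The function names are hypothetical, and the model is deliberately idealized: it assumes that, for physically sequential pages, every buffer load after the first is completely hidden from the host, which is a best case and not a property guaranteed by any particular device.

```python
PAGE_BYTES = 512         # page size used in the text
BUS_WIDTH_BYTES = 2      # 16-bit wide device
BUS_CYCLE_US = 0.1       # 100 nanosecond bus cycle
BUFFER_LOAD_US = 15.0    # typical buffer loading time given in the text


def transfer_time_us(page_bytes=PAGE_BYTES):
    """Time to move one page over the bus, in microseconds."""
    cycles = page_bytes // BUS_WIDTH_BYTES
    return cycles * BUS_CYCLE_US


def read_time_us(num_pages, loads_hidden):
    """Estimated time to read num_pages pages.

    loads_hidden=True models the idealized best case for physically
    sequential pages: only the first buffer load is visible to the host.
    """
    visible_loads = min(1, num_pages) if loads_hidden else num_pages
    return visible_loads * BUFFER_LOAD_US + num_pages * transfer_time_us()


scattered_us = read_time_us(32, loads_hidden=False)
sequential_us = read_time_us(32, loads_hidden=True)
```

Under these assumptions, reading 32 scattered pages takes roughly 1300 microseconds, while reading 32 physically sequential pages takes roughly 834 microseconds. The actual benefit depends on how much of each buffer load the device and the host can in fact overlap.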
Based on the above, it can be seen that it is advantageous to have files physically continuous on medium 10, so why not allocate all files like that? The reason is that even in a moderately complex system, file system 14 has to support hundreds of files residing in parallel within medium 10. Those files are created, written to, appended to, deleted, or otherwise changed in many complex and unpredictable sequences of events. Keeping all files physically sequential is simply impossible except in the simplest systems. For example, a user adding a few lines to an existing text file forces file system 14 to extend the file length. But if the space immediately following the file is currently used by another file, the only options available are to break the updated file into non-sequential parts, to move the entire other file to another location and use its old space for the updated file, or to move the entire updated file to another location where there is enough room. In almost all cases the preferred decision is to break continuity, as the other options are much more costly in time.
As another example, consider the case of writing a file into a flash disk. Flash memory devices have the characteristic that a page cannot be written into unless the block targeted for the write was previously erased. This implies that while the disk is being continuously used it is necessary to recycle (erase) blocks of the disk so as to make room for new data. Such recycling may be done only when there is an immediate need for free space, or it also may be done periodically or when the system has idle time. For a better understanding of the issues of flash memory allocation and recycling see Ban, U.S. Pat. No. 5,404,485, entitled “Flash File System”, that issued on Apr. 4, 1995. This patent is incorporated by reference for all purposes as if fully set forth herein.
The time it takes a storage system to write a file depends on whether there is free space immediately available for allocation or whether some space first should be recycled in order to generate usable space. Erasing flash blocks is a rather slow operation (a few milliseconds in NAND flash and a few hundred milliseconds in NOR flash), so this might have a major impact on a file's write performance. If it could be guaranteed that the storage required for writing a file is somehow reserved for it in advance so that no recycling will be required, its write performance will be guaranteed to be higher.
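The effect of recycling on write time can be estimated with the following minimal sketch. The function name is hypothetical, and the erase times of 2 milliseconds and 300 milliseconds are illustrative values assumed within the ranges mentioned above, not measured figures. Only the delay added by erasing is modeled; the time needed to program the data itself is ignored.

```python
def recycling_delay_ms(blocks_needed, erased_blocks_available, erase_time_ms):
    """Extra delay caused by erasing blocks before a write can proceed."""
    blocks_to_erase = max(0, blocks_needed - erased_blocks_available)
    return blocks_to_erase * erase_time_ms


NAND_ERASE_MS = 2.0    # assumed value for "a few milliseconds"
NOR_ERASE_MS = 300.0   # assumed value for "a few hundred milliseconds"

# A write needing 8 blocks when only 3 erased blocks are on hand.
nand_delay = recycling_delay_ms(8, 3, NAND_ERASE_MS)
nor_delay = recycling_delay_ms(8, 3, NOR_ERASE_MS)

# The same write when all 8 blocks were reserved (erased) in advance.
reserved_delay = recycling_delay_ms(8, 8, NAND_ERASE_MS)
```

With these assumed values the unreserved write incurs an added delay of 10 milliseconds on NAND flash and 1500 milliseconds on NOR flash, whereas the write with space reserved in advance incurs none.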
Indeed, the idea of reserving flash space in advance of writing is not new—Intel Corporation provides a flash management software package called Flash Data Integrator (or “FDI” for short) that allows its user to “initialize” a “packet” of storage and then fill up the packet with data. The initialization includes the pre-allocation of free space, causing recycling if necessary. However, the FDI use of this idea is rather primitive and limited—it does not apply at all to files but to some type of objects which are primitive structures identified by a numeric identifier. Also, only one packet stream can be open for input at a time.
Thus prior art storage systems, which treat all files the same in terms of space allocation, do not allow a user to obtain optimal performance for a specific file.
There is thus a widely recognized need for, and it would be highly advantageous to have, a file system 14 that allows an application program 16 to request special ad hoc storage of a file on a medium 10, for optimal performance.
It should be clarified here that the term “performance” as employed herein, refers not only to write performance or read performance, as might be suggested by the examples listed above. Performance should also be interpreted to include any other characteristics associated with the execution of storage system operations related to time. For example, the latency of a storage system is also a performance characteristic. By latency is meant the amount of time it takes the storage system to complete a single call. In some so-called “hard” real-time systems, it is a mandatory requirement that no single call should exceed a given duration, so that no delay will be incurred in performing some time-critical operation. So in such systems low latency is a required performance characteristic.
Also, it should be clarified that even though in all the examples presented herein media 10 are solid-state flash disks, the present invention is equally applicable to other types of storage media, including magnetic disks. Even though the considerations for what constitutes optimal allocation for a file generally are different for different media types, the same ideas still apply. For example, in magnetic disks there is the issue of optimizing the mechanical movements of the read/write heads used for accessing the storage area, with different considerations applying to reading and writing. Therefore the present invention as described and claimed below is applicable to this case too.