The disclosure relates generally to information storage systems, and more specifically to improved free space collection in log structured storage systems.
Log structured storage systems, such as log structured file systems (LFSs), have been developed as a form of disk storage management to improve disk access time. LFSs rest on the assumption that files are cached in main memory and that increasing memory sizes will make the caches more effective at responding to read requests. As a result, disk use is dominated by writes. An LFS writes all new information to disk in a sequential structure called a log. New information is stored at the end of the log rather than updated in place, to reduce disk seek activity. This approach increases write performance by eliminating almost all seeks. The sequential nature of the log also permits faster crash recovery. As information is updated, however, portions of data records at intermediate locations of the log become outdated.
In an LFS, data is stored permanently in the log and there is no other structure on disk. For an LFS to operate efficiently, it must ensure that there are always large extents of free space available for writing new data.
Log structured disks (LSD) and log structured arrays (LSA) are disk architectures which use the same approach as the LFS. LSAs combine the LFS architecture and a disk array architecture such as the well-known RAID (redundant arrays of inexpensive disks) architecture with a parity technique to improve reliability and availability. Generally, an LSA includes an array of N+1 physical disks and a program that manages information storage to write updated data into new disk locations rather than updating the data in place. Because the location of data changes whenever it is updated, the LSA keeps a directory which it uses to locate data items in the array.
As an illustration of the N+1 physical disks of the LSA array, an LSA system may include a group of disk drive DASDs (direct access storage devices), each of which includes multiple disk platters stacked into a column. Each disk is divided into large consecutive areas called segment-columns. A segment-column is typically as large as a physical cylinder on a physical disk. Corresponding segment-columns from the N+1 disks constitute a segment. The array has as many segments as there are segment-columns on a disk in the array.
A logical track is stored entirely within some segment-column of some physical disk of the array. The size of a logical track is such that many logical tracks can be stored in the same LSA segment-column. The location of a logical track in an LSA changes over time. A directory, called the LSA directory, indicates the current location of each logical track.
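The role of the LSA directory can be illustrated with a minimal sketch. The class and field names below (`Location`, `LsaDirectory`, `write`, `is_live`) are hypothetical and chosen only for illustration; the sketch assumes a track's location can be described by a segment number, a segment-column (disk) number, and an offset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical description of where one logical track is stored."""
    segment: int  # segment number within the array
    column: int   # which of the N+1 disks (segment-column) holds the track
    offset: int   # position of the track inside the segment-column


class LsaDirectory:
    """Maps each logical track to its current location (illustrative only)."""

    def __init__(self):
        self._current = {}  # logical track id -> Location

    def write(self, track_id, new_location):
        """Record the new location and return the old one, which is now unused."""
        old = self._current.get(track_id)
        self._current[track_id] = new_location
        return old

    def lookup(self, track_id):
        return self._current[track_id]

    def is_live(self, track_id, location):
        """A stored copy is live only if the directory still points to it."""
        return self._current.get(track_id) == location


directory = LsaDirectory()
first = Location(segment=0, column=1, offset=4)
second = Location(segment=7, column=2, offset=0)
directory.write("track-42", first)
stale = directory.write("track-42", second)  # update goes to a new location
```

After the second write, the copy at `first` is no longer referenced by the directory and counts as unused space in its segment.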
In LSAs and LFSs, data to be written is grouped together into relatively large blocks (the segments) which are written out as a unit in a convenient free segment location on disk. When data is rewritten, the previous disk locations of the data become free, creating unused data (or garbage) in the segments on disk. Eventually the disk fills up with segments and it may be necessary to create free segment locations by reading source segments containing at least some unused data and compacting their still-in-use content into a smaller number of destination segments without any unused data. This process is called free space (or garbage) collection.
To ensure that there is always an empty segment to write to, all logical tracks from a segment selected for free space collection that are still in that segment (i.e. are still pointed to by the LSA directory) are typically read from disk and placed in a memory segment. These logical tracks will be written back to disk when the memory segment fills. Free space collected segments are returned to the empty segment pool and are available when needed.
As free space collection proceeds, live data from the various target segments is read into a temporary storage buffer, the buffer fills up, and the live data is stored back into an empty segment of the disk array. After the live data in the temporary storage buffer is written back into the disk array, the segments from which the live data values were read are designated as being empty. In this way, live data is consolidated into a smaller number of completely full segments and new empty segments are created. Typically, free space collection is performed when the number of empty segments in the array drops below a predetermined threshold value.
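The compaction step described above can be sketched as follows. This is a simplified model under stated assumptions, not an actual LSA implementation: a segment is a list of track identifiers, the directory maps a track only to the segment holding its current copy, `SEGMENT_CAPACITY` is a hypothetical segment size, and the buffer is written out at the end of the pass even if it is only partly full.

```python
SEGMENT_CAPACITY = 4  # hypothetical number of logical tracks per segment


def collect(segments, directory, targets, empty_pool):
    """Move live tracks out of the target segments and free those segments.

    segments:   dict of segment id -> list of track ids stored there
    directory:  dict of track id -> segment id holding the current copy
    targets:    segment ids selected for free space collection
    empty_pool: list of empty segment ids (assumed large enough here)
    """
    # Read only the tracks the directory still points to (the live tracks).
    buffer = []
    for seg in targets:
        for track in segments[seg]:
            if directory.get(track) == seg:
                buffer.append(track)

    # Write the buffered live tracks back, one full segment at a time.
    for start in range(0, len(buffer), SEGMENT_CAPACITY):
        dest = empty_pool.pop()
        chunk = buffer[start:start + SEGMENT_CAPACITY]
        segments[dest] = chunk
        for track in chunk:
            directory[track] = dest

    # Only after the write-back are the source segments designated empty.
    for seg in targets:
        segments[seg] = []
        empty_pool.append(seg)


segments = {
    0: ["a", "b", "c", "d"],  # a and b were later rewritten into segment 2
    1: ["e", "f", "g", "h"],  # g and h were later rewritten into segment 2
    2: ["a", "b", "g", "h"],
    3: [],
}
directory = {"a": 2, "b": 2, "c": 0, "d": 0, "e": 1, "f": 1, "g": 2, "h": 2}
empty_pool = [3]
collect(segments, directory, targets=[0, 1], empty_pool=empty_pool)
```

In this example two half-used segments are consolidated into one full segment, so the pool grows from one empty segment to two.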
The way in which target segments are selected for the free space collection process affects the efficiency of LSA system operation. Three algorithms that are well known in the art may be used: the “greedy” algorithm, the “cost-benefit” algorithm, and the “age-threshold” algorithm. The greedy algorithm selects target segments by determining how much free space will be gained from each segment processed and then processing segments in the order that will yield the most free space. The cost-benefit algorithm compares a cost associated with processing each segment against a benefit and selects segments for processing based on the best comparisons. The age-threshold algorithm selects segments for processing only if their age in the storage system exceeds an age-threshold value; once past the age threshold, the segments are selected in order of least utilized segments first.
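As a minimal sketch of the greedy selection order, assume each segment's utilization (the fraction of the segment holding live data) is already known. The function name `greedy_order` and the sample values are hypothetical.

```python
def greedy_order(utilization):
    """Return segment ids ordered so the largest free space yield comes first.

    utilization: dict of segment id -> fraction of the segment holding live data.
    The free space gained from a segment is (1 - utilization), so sorting by
    ascending utilization gives the greedy processing order.
    """
    return sorted(utilization, key=lambda seg: utilization[seg])


utilization = {0: 0.9, 1: 0.25, 2: 0.5}
order = greedy_order(utilization)
```

Here segment 1 would be processed first because collecting it frees three quarters of a segment.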
More particularly, in the cost-benefit algorithm, a target segment is selected based on how much free space is available in the segment and how much time has elapsed since the segment was last filled with new information. The elapsed time is referred to as the age of the segment. In the cost-benefit algorithm, the age of the segment is defined to be the age of the youngest live track in the segment. For example, age might be indicated by a time stamp value associated with a track when it is placed in the LSA input write buffer. A benefit-to-cost ratio is calculated for each segment, such that the ratio is defined to be:
Benefit/Cost = (1 − u)a / (1 + u)
where u is called the utilization of the segment, that is, the fraction of the segment occupied by live data; (1−u) is defined to be the fractional amount of free space in the segment, also called the “dead” fraction; and a is the age of the segment as defined above.
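The benefit-to-cost ratio can be worked through numerically. The sketch below assumes that segments with the highest ratio are collected first, which the text above does not state explicitly; the function names, segment labels, and sample values are hypothetical, and age is expressed in arbitrary time units.

```python
def benefit_to_cost(u, age):
    """Benefit/Cost = (1 - u) * a / (1 + u) for utilization u and age a."""
    return (1.0 - u) * age / (1.0 + u)


def cost_benefit_order(segments):
    """Order segment ids by descending benefit-to-cost ratio (an assumption).

    segments: dict of segment id -> (utilization, age of youngest live track)
    """
    return sorted(
        segments,
        key=lambda seg: benefit_to_cost(*segments[seg]),
        reverse=True,
    )


segments = {
    "A": (0.50, 100),  # ratio = 0.50 * 100 / 1.50, about 33.3
    "B": (0.25, 10),   # ratio = 0.75 * 10 / 1.25 = 6.0
    "C": (0.75, 400),  # ratio = 0.25 * 400 / 1.75, about 57.1
}
order = cost_benefit_order(segments)
```

With these values the old but well-utilized segment C ranks ahead of the young, mostly empty segment B, whereas a purely greedy choice would take B first.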
In the age-threshold algorithm, segments are selected if their age exceeds a threshold value. The system takes the age of a segment to be the amount of time the segment has been located in the storage system, and considers a segment for free space collection only after that time exceeds the selected age-threshold value. From the set of candidate segments, the system chooses one or more segments for free space collection in the order that they will yield the most free space. The free space yield is determined by utilization data, so that the least utilized segments are collected first.
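The two stages of the age-threshold algorithm, filtering by age and then ordering by utilization, can be sketched as below. The function name, segment labels, and sample values are hypothetical, and age is expressed in arbitrary time units.

```python
def age_threshold_order(segments, threshold):
    """Return candidate segment ids, least utilized first.

    segments:  dict of segment id -> (utilization, age in the storage system)
    threshold: minimum age a segment must exceed to be considered at all
    """
    # Stage 1: only segments older than the threshold become candidates.
    candidates = [seg for seg, (_, age) in segments.items() if age > threshold]
    # Stage 2: among candidates, the least utilized yields the most free space.
    return sorted(candidates, key=lambda seg: segments[seg][0])


segments = {
    "A": (0.50, 100),
    "B": (0.25, 10),   # least utilized, but too young to be considered
    "C": (0.75, 400),
}
order = age_threshold_order(segments, threshold=50)
```

Segment B is skipped despite having the most free space, because it has not yet been in the storage system longer than the threshold.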