1. Field of the Invention
This invention relates to a method and program for detecting certain runs of data. More particularly, the invention relates to a method and program for detecting previously specified runs of data or identical consecutive runs of data, either of which may be suitable for data compression.
2. Description of the Related Art
Modern computers require a host processor including one or more central processing units and a memory facility. The processor manipulates data stored in the memory according to instructions provided to it. The memory must therefore be capable of storing data required by the processor and transferring that data to the processor at a rate that makes the overall operation of the computer feasible. The cost and performance of computer memory are thus critical to the commercial success of a computer system.
Because today's computers require large quantities of data storage capacity, computer memory is available in many forms. A fast but expensive form of memory is main memory, typically comprised of microchips. Other available forms of memory are known as peripheral storage devices and include magnetic direct access storage devices (DASD), magnetic tape storage devices, optical recording devices, and magnetic or optical mass storage libraries. Each of these other types of memory has a greater storage density and thus lower cost than main memory. However, these other memory devices do not provide the performance provided by main memory. For example, the time required to properly position the tape or disk beneath the read/write mechanism of the drive cannot compare with the rapid, purely electronic data transfer rate of main memory. It is inefficient to store all of the data in a computer system on but a single type of memory device. Storing all of the data in main memory is too costly and storing all of the data on one of the peripheral storage devices reduces performance.
A typical computer system includes both main memory and one or more types of peripheral storage devices arranged in a data storage hierarchy. The data storage hierarchy arrangement is tailored to the performance and cost requirements of the user. In such a hierarchy, main memory is often referred to as primary data storage, the next level of the hierarchy is often referred to as secondary data storage, and so on. Generally, the highest level of the hierarchy has the lowest storage density capability, highest performance and highest cost. As one proceeds down through the hierarchy, storage density generally increases, performance generally decreases, and cost generally decreases. By transferring data between different levels of the hierarchy as required, the cost of memory is minimized and performance is maximized. Data is thus stored in main memory only so long as it is expected to be required by the processor. The hierarchy may take many forms, include any number of data storage or memory levels, and may be able to transfer data directly between any two distinct memory levels. The transfer of data may employ I/O channels, controllers, or cache memories as is well known in the art.
Data is stored in memory in digital form as either logical 0's or logical 1's known as bits. The bits are stored in a precise order (i.e. encoded) so as to represent that data to the system in a form understood by the system. For example, a certain set (or byte) of eight bits, such as 00000000, might represent the letter "a". Also, 00000001 might represent "A", 00000010 might represent "b", and 11111111 might represent "$" etc. In addition, the bytes could actually represent any binary data. Various encoding schemes are known in the art.
The efficiency of use of a data storage hierarchy is improved if particular data can be further encoded so as to reduce the amount of storage space required in memory to retain that data. Such further encoding is generally referred to as data compaction or data compression (hereinafter simply "compression"). Certain repeated portions of data are usually replaced with coded versions thereof (hereinafter "codes"). Data is typically compressed before it is stored in a particular type of memory and decompressed upon its being read out so as to return it to a form understood by the host processor for processing. Various compression techniques are known in the art. Some background on different types of data compression can be found in U.S. Pat. No. 3,694,813.
One technique for compressing data involves the replacement of identical runs of data with codes. A run of data is any sequential series of units of data (i.e. any single portion of the data), the units of data in the run being consecutive. The units of data within a run need not be identical. For example, 0101 is a run of four bits. Identical runs of data are two or more runs which are identical to each other (i.e. 0101 and 0101 are identical runs of four bits each). A single identical run of data is impossible--there must be at least two such runs of data for them to be identical to each other. Codes are assigned to each individual portion of data which is found to repeat. The codes are shorter than the actual data as otherwise stored and may represent the repeated data in various ways, such as by referencing the memory address of where the run first appears. At each repetition of a run thereafter, the codes are used to represent such runs in memory instead of the actual data, thereby reducing the amount of memory required to store the data. Examples of this type of data compression are found in U.S. Pat. Nos. 4,446,516, 4,491,934, 4,626,824, and 4,701,745.
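By way of illustration only, the replacement of identical runs with codes that reference where a run first appears can be sketched as follows. The sketch is a deliberate simplification and is not the method of any of the cited patents: it assumes fixed-length runs aligned on run boundaries, and the names `encode_repeats`, `decode_repeats`, and `run_len`, together with the `("lit", ...)` and `("ref", ...)` token shapes, are hypothetical.

```python
def encode_repeats(data, run_len=4):
    """Replace each repeated fixed-length run with a reference to its first offset.

    Assumption: runs are aligned blocks of run_len units; real schemes are
    generally more flexible. Tokens are ("lit", bytes) or ("ref", offset).
    """
    first_seen = {}   # run contents -> offset where that run first appeared
    tokens = []
    for pos in range(0, len(data), run_len):
        run = data[pos:pos + run_len]
        if len(run) == run_len and run in first_seen:
            # An identical run was seen earlier: emit a short code instead.
            tokens.append(("ref", first_seen[run]))
        else:
            first_seen.setdefault(run, pos)
            tokens.append(("lit", run))
    return tokens


def decode_repeats(tokens, run_len=4):
    """Rebuild the original data by copying referenced runs from the output so far."""
    out = bytearray()
    for kind, value in tokens:
        if kind == "lit":
            out += value
        else:
            out += out[value:value + run_len]
    return bytes(out)
```

With the runs 0101, 0101, and 1111 from the example above (written here as ASCII characters), the second 0101 is replaced by a reference to offset 0, while 1111 is stored literally because it has not previously appeared.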
Another technique for compressing data involves the replacement of identical consecutive runs of data with codes. An identical consecutive run refers to a run in which each unit of data therein is the same. Thus, "identical consecutive runs" refers to a plurality of runs, each run consisting of a single repeating unit of data. The term does not imply that each of the plurality of runs is identical to each other. For example, 0000 and 1111 are each identical consecutive runs of four bits, but are not identical runs. A single identical consecutive run of data is possible as the term only refers to a run being identical throughout itself. The codes are binary data which indicate the repeated unit of data and the number of times such unit repeats (i.e. the run length). For example, a run of 78 all-zero bytes can be represented by one all-zero byte and a binary representation of the number 78. Data is thus compressed by replacing the identical consecutive runs of data with the codes. An example of this type of data compression is found in U.S. Pat. No. 4,586,027.
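The run-length form of code described above can be sketched as follows. This is a minimal illustration, not the encoding of the cited patent: the function names `rle_encode` and `rle_decode` and the representation of each code as a (unit, run length) pair are assumptions made for clarity.

```python
def rle_encode(data):
    """Collapse each identical consecutive run into a (unit, run_length) pair."""
    pairs = []
    i = 0
    while i < len(data):
        j = i
        # Advance j to the end of the run of units equal to data[i].
        while j < len(data) and data[j] == data[i]:
            j += 1
        pairs.append((data[i], j - i))
        i = j
    return pairs


def rle_decode(pairs):
    """Expand (unit, run_length) pairs back into the original bytes."""
    return b"".join(bytes([unit]) * length for unit, length in pairs)
```

Under these assumptions, a run of 78 all-zero bytes (`bytes(78)` in Python) encodes to the single pair `(0, 78)`, that is, one all-zero byte and the number 78.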
The aforementioned techniques relate to how data can be compressed. To be able to compress data one must first detect the presence of data which is suitable for compression. The detection of compressible data includes the identification of repeated portions of data and the number of times the portion is repeated. The number of repetitions is required to ensure that compression will indeed save memory space without unduly degrading performance. Memory space may not always be saved, as in the case where a portion is repeated but one time and the portion is so short that the code used to represent that portion is larger than the portion itself. Also, modest savings in memory space may not be worthwhile if the overhead (i.e. processing time) to perform the compression and decompression outweighs the benefit of the memory space saved. Because some of the overhead associated with data compression is the result of the detection of compressible data, it is important that such detection occur as efficiently as possible.
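The space-saving condition described above reduces to simple arithmetic, sketched below. The two-byte code size (one byte for the repeated unit and one for the count) and the names `bytes_saved` and `worth_compressing` are assumptions chosen for illustration. Processing overhead is not modeled here beyond an optional minimum saving.

```python
def bytes_saved(run_length, unit_size=1, code_size=2):
    """Storage saved by replacing a run with a code; negative means the data grows.

    Assumption: the code occupies code_size bytes regardless of run length.
    """
    return run_length * unit_size - code_size


def worth_compressing(run_length, unit_size=1, code_size=2, min_saving=1):
    """True only if the replacement saves at least min_saving bytes."""
    return bytes_saved(run_length, unit_size, code_size) >= min_saving
```

Under these assumed sizes, a one-byte unit repeated twice saves nothing, a run of three saves one byte, and a run of 78 saves 76 bytes. Raising `min_saving` is one way to express that a modest saving may not justify the overhead of compression and decompression.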
In the current Multiple Virtual Storage (MVS) operating system environment, a family of computer programs known as Data Facility Storage Management Subsystem can be used to manage the data storage hierarchy. An optional program in the family known as Data Facility Data Set Services (DFDSS) can be used to manage certain aspects of direct access storage devices (DASD). More particularly, DFDSS functions include the movement of data from one type of DASD to another type of DASD, the backup and recovery of data sets, and the reduction or elimination of free-space fragmentation. The backup of selected data sets is accomplished by copying or "dumping" such data sets, typically on DASD, to additional recording media, typically magnetic tape. DFDSS uses data compression in dumping data sets.
To compress data, early versions of DFDSS searched strings of bytes to locate data suitable for compression. The strings of bytes were examined to identify any identical consecutive runs of three or more bytes. The technique for detecting such runs of bytes was by simple comparison of pairs of adjacent bytes using a System/370 COMPARE LOGICAL CHARACTER machine instruction. To examine a given string of bytes for compressibility, the first byte was compared to the adjacent second byte during a first machine instruction. If the first and second bytes were identical, such was noted in a temporary storage location and the second byte and adjacent third byte were then compared during a second machine instruction. If the second and third bytes were also identical, such was again noted and it was therefore known that a compressible run of three or more identical consecutive bytes existed. The procedure then continued, comparing the third byte and adjacent fourth byte, fourth byte and adjacent fifth byte, fifth byte and adjacent sixth byte, etc. during separate machine instructions until the run of identical consecutive bytes was broken. At such time, the run was actually compressed. Searching then started anew for another run of identical consecutive bytes beginning with a comparison of the very same pair of bytes (i.e. repeating that comparison) last compared during the identification of the last run of identical consecutive bytes.
Continuing with the aforementioned example, if the first and second bytes were not identical, the second and third bytes were nevertheless compared. However, no record of identical byte detection was retained as such did not occur. If the second and third bytes were found to be identical, such was noted and the comparisons continued in search of a run of three or more identical consecutive bytes, as previously described. Again, if the second and third bytes were not identical, the comparisons continued, but no record of identical byte detection was retained. The comparisons continued in this manner until the entire string of bytes was examined.
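The pairwise comparison technique described above can be modeled as follows, with each byte comparison standing in for one machine instruction. This is an illustrative sketch and not the DFDSS code: the name `find_runs_pairwise`, the (start, length) result format, and the comparison counter are assumptions. In particular, the sketch repeats the final comparison only after a run of at least `min_run` bytes has been identified, which is one reading of the description.

```python
def find_runs_pairwise(data, min_run=3):
    """Find identical consecutive runs by comparing each pair of adjacent bytes.

    Returns (runs, comparisons): runs is a list of (start, length) tuples and
    comparisons counts every adjacent-pair comparison performed.
    """
    runs = []
    comparisons = 0
    n = len(data)
    i = 0
    while i + 1 < n:
        comparisons += 1
        if data[i] != data[i + 1]:
            i += 1                      # no match noted; move to the next pair
            continue
        start, j = i, i + 1             # data[start..j] are identical so far
        while j + 1 < n:
            comparisons += 1
            if data[j] != data[j + 1]:
                break                   # the run is broken at pair (j, j + 1)
            j += 1
        length = j - start + 1
        if length >= min_run:
            runs.append((start, length))
            i = j                       # restart by repeating the last comparison
        else:
            i = j + 1                   # too short to compress; continue onward
    return runs, comparisons
```

For the six-byte string `b"abbbbc"`, the sketch returns `([(1, 4)], 6)`: one run of four bytes starting at offset 1, found with six comparisons (the five adjacent pairs plus one repeated comparison). The number of comparisons therefore grows at least as fast as the number of bytes examined.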
The aforementioned technique of detecting compressible data by comparing each and every pair of adjacent bytes during separate machine instructions is inefficient. A technique reducing the number of machine instructions required would improve efficiency. There is therefore a need for an improved method of detecting certain runs of data.