1. Field of the Invention
This invention relates to techniques for representing file differences useful in computer file protection systems and other systems, and more particularly to file transfer techniques useful in an electronic data backup system wherein only changes in a file are periodically sent to the backup system and in other systems.
2. Discussion of Prior Information
It is well known to off-load computers at the end of a work day to secure the data file against computer failure. It is also known to transmit the file to an off-site location for additional file security.
What is not known is the generation of a set of representations of the changes in a file, and the periodic relocation of that set of representations and its use to update the previous version of the file.
Accordingly it is an object of the invention to generate a set of representations of the changes made in a computer file during a period of time.
Another object of the invention is to generate a set of representations of the changes made in a computer file which can be used to update an earlier version of the file, or to create a previous version of an updated file.
Still another object of the invention is to generate and to use such a set of representations in a cost and time effective manner.
The objects of the invention are achieved through computer programs designed to run on micro- and mini-computers. A first or SCAN program is designed to create a TOKEN Table (or file signature) of mathematical representations of segments of the file as it exists at the start of a period (earlier file (EF)). The TOKEN Table reflects the indices (ordinal numbers) for all of the segments in the earlier file, and the exclusive-or (XR) and cyclic redundancy check (CRC) products of the set of characters for each segment. Actually, two CRC products are generated for each segment: a sixteen bit one and a thirty-two bit one. The three products, XR and two CRC, are generated for speed in comparisons: the XR product is compared first because it is the fastest comparison; then the slower sixteen bit CRC one if necessary; and finally the still slower thirty-two bit CRC if necessary.
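The SCAN step described above can be sketched as follows. This is a minimal illustration, not the program of the invention: the names Token and build_token_table are hypothetical, the sixteen bit CRC is assumed to be the CRC-CCITT variant computed by Python's binascii.crc_hqx, the thirty-two bit CRC is assumed to be the one computed by zlib.crc32, and a trailing partial segment is simply ignored because the text does not say how it is treated.

```python
import binascii
import zlib
from functools import reduce
from operator import xor
from typing import List, NamedTuple


class Token(NamedTuple):
    """One TOKEN Table row (hypothetical layout)."""
    index: int   # ordinal number of the segment in the earlier file
    xr: int      # exclusive-or of all bytes in the segment
    crc16: int   # sixteen bit CRC (assumed CRC-CCITT)
    crc32: int   # thirty-two bit CRC (assumed zlib polynomial)


def build_token_table(earlier: bytes, seg_size: int) -> List[Token]:
    """Compute the three products for every full segment of the earlier file."""
    if seg_size <= 0:
        raise ValueError("seg_size must be positive")
    table = []
    # Assumption: only whole segments are tokenized; a short tail is skipped.
    for index in range(len(earlier) // seg_size):
        seg = earlier[index * seg_size:(index + 1) * seg_size]
        table.append(Token(index,
                           reduce(xor, seg, 0),
                           binascii.crc_hqx(seg, 0),
                           zlib.crc32(seg)))
    return table
```

For example, build_token_table(b"ABCDEFGH", 4) would yield two rows, one for "ABCD" and one for "EFGH".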
A second program is used at the end of the period to create a MATCH Table setting forth the location of segments in the current file that are identical to those in the earlier file. The MATCH Table lists the indices of all of the segments in the earlier file and the file offsets of the first character of the corresponding identical segment in the updated file. The second program calculates the mathematical representations of the first segment (window) in the updated, revised or current file, first calculating only the XR product and comparing it to the XR product for the first earlier-file segment in the TOKEN Table and noting whether a match exists. If so, it then calculates the sixteen bit CRC product and compares it to the sixteen bit earlier-file CRC product and notes whether a match exists; if so, it finally calculates the more time consuming but more reliable thirty-two bit CRC product and compares it to the thirty-two bit earlier-file CRC product and notes whether a match exists; and if so, it makes an index and offset entry in the MATCH Table for the identical segments, the offset entry being the ordinal number of the first character in the current file segment string of characters. (The earlier file segments are numbered (indexed) sequentially.) If a segment match is obtained, the second program calculates one or more mathematical representations for the next segment in the current file, and compares them to the products associated with the next index in the TOKEN Table, which represent the second segment of the earlier file. However, if a mismatch is obtained, the window (which retains segment size) is bumped along one character, new product(s) are calculated for the window characters, and comparison(s) are again made with the same representations of the earlier file segment in the TOKEN Table.
This continues until a match obtains, at which time the index for the earlier file segment and the offset of the first character in the matching current file window (segment) are recorded in the MATCH Table. The process then continues as above to the end of the current file. Only the XR product is calculated in the event of an XR product mismatch; the sixteen bit and the thirty-two bit CRC products are generated only in the event of earlier matches of the XR and sixteen bit CRC products, respectively.
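The window search described above can be sketched as follows. This is an illustration under stated assumptions rather than the second program itself: the names segment_signatures, window_matches and build_match_table are hypothetical, the CRC variants (binascii.crc_hqx and zlib.crc32) are assumed, and an earlier-file segment that is never found is recorded with an offset of None so that the search for the next index resumes from the last matched position.

```python
import binascii
import zlib
from functools import reduce
from operator import xor
from typing import List, Optional, Tuple

Signature = Tuple[int, int, int]  # (XR, CRC-16, CRC-32)


def segment_signatures(earlier: bytes, seg_size: int) -> List[Signature]:
    """Signatures of each full earlier-file segment, in index order."""
    return [(reduce(xor, s, 0), binascii.crc_hqx(s, 0), zlib.crc32(s))
            for s in (earlier[i:i + seg_size]
                      for i in range(0, len(earlier) - seg_size + 1, seg_size))]


def window_matches(window: bytes, sig: Signature) -> bool:
    """Tiered test: each slower product is computed only if the faster one matched."""
    xr, crc16, crc32 = sig
    if reduce(xor, window, 0) != xr:
        return False
    if binascii.crc_hqx(window, 0) != crc16:
        return False
    return zlib.crc32(window) == crc32


def build_match_table(sigs: List[Signature], current: bytes,
                      seg_size: int) -> List[Tuple[int, Optional[int]]]:
    """Return (index, offset) pairs; offset is None when no window matched."""
    table = []
    pos = 0  # first current-file character not yet covered by a match
    for index, sig in enumerate(sigs):
        found = None
        offset = pos
        while offset + seg_size <= len(current):
            if window_matches(current[offset:offset + seg_size], sig):
                found = offset
                break
            offset += 1  # bump the window along one character
        table.append((index, found))
        if found is not None:
            pos = found + seg_size
    return table
```

With an earlier file of b"ABCDEFGHIJKL", a segment size of 4, and a current file of b"ABCDxyEFGHIJKL", the second segment is found two characters later than it would be in an unchanged file, which is how the insertion shows up in the table.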
A third program creates a TRANSITION Table that reflects what's in the current file that's not in the earlier file, and where. It scrolls through the list of indices and offsets in the MATCH Table, to see if each offset number differs from the previous one by the segment size. When such an offset differs from the previous one by more than the segment size, it adds the segment size to the first offset to determine the file ordinal number of the first character in the nonmatching information, subtracts one from the second offset to determine the last character, goes to the current file and lifts therefrom that set of characters beginning with that ordinal number and stopping with the character preceding the extra-spaced offset, and adds them to the MATCH Table to create with the index a TRANSITION Table.
Thus creation of the TRANSITION Table involves assuring that every character in the current file is accounted for in the TRANSITION Table. The MATCH Table provides all of the information necessary for this accounting. Each entry in the beginning column represents a match of segment characters in the earlier file to the current file characters at the location given by beginning. The matching segment in the earlier file is located at an offset equal to the index times the segment size.
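The accounting described above can be sketched as follows. The table layout here is an assumption made for illustration: instead of index, offset and lifted characters in one row, the hypothetical build_transition_table emits a list of ("copy", index) and ("insert", characters) entries, which carries the same information in the order needed for rebuilding. MATCH Table entries whose offset is None (unmatched segments) are assumed to be skipped.

```python
from typing import List, Optional, Tuple, Union

MatchRow = Tuple[int, Optional[int]]                  # (index, offset or None)
Op = Tuple[str, Union[int, bytes]]                    # ("copy", index) / ("insert", data)


def build_transition_table(match_table: List[MatchRow], current: bytes,
                           seg_size: int) -> List[Op]:
    """Account for every character of the current file as a copy or an insert."""
    ops: List[Op] = []
    cursor = 0  # previous offset plus segment size, i.e. next expected offset
    for index, offset in match_table:
        if offset is None:
            continue  # unmatched earlier-file segment: nothing to copy
        if offset > cursor:
            # Offsets differ by more than the segment size: lift the
            # characters from cursor through offset - 1 out of the current file.
            ops.append(("insert", current[cursor:offset]))
        ops.append(("copy", index))
        cursor = offset + seg_size
    if cursor < len(current):
        # Assumption: characters after the last match are also carried verbatim.
        ops.append(("insert", current[cursor:]))
    return ops
```

For the MATCH Table [(0, 0), (1, 6), (2, 10)] with a segment size of 4, the gap between offset 0 + 4 and offset 6 is lifted from the current file as a two-character insert.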
Essentially the same process is followed for a deletion. The second program, if no match obtained for an earlier file segment by the end of the updated file (or over a predetermined number of segments as conditioned by the character of the file), would have proceeded to endeavor to match the next index mathematical representations in the TOKEN Table with a current file segment, with no offset entry having been made in the MATCH Table for the index of the segment that was unmatched. On proceeding with the index and representations of the next earlier-file segment, the window of the current file would be bumped along, and the index and offset number entered in the MATCH Table when the match of the mathematical representations occurred. The third program on scrolling through the MATCH Table offsets, notes the missing offset, notes the preceding offset, adds the segment size to the previous offset and copies from that number forward the reduced characters if any in the current file before the next offset, into the TRANSITION Table and in association with the index number of the unmatched segment.
The TRANSITION Table is used to update a copy of the earlier file. Typically, a fourth program and the earlier version of the file are on an off-site location and the TRANSITION Table representations are electronically transmitted thereto. The fourth program will examine the indexes and offsets of the TRANSITION Table, copying segments from the earlier file where the succeeding offset just differs by the segment size, into what is to be a duplicate version of the updated file, making additions where the offset numbers differ from the preceding ones by more than the segment size with the information provided in the TRANSITION Table, and substitutions from the TRANSITION Table where the offset numbers are missing.
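The rebuilding step can be sketched as follows, assuming the TRANSITION Table is delivered as a list of ("copy", index) and ("insert", characters) entries. That layout and the name apply_transition_table are hypothetical; the text describes the table in terms of indexes, offsets and lifted characters.

```python
from typing import List, Tuple, Union

Op = Tuple[str, Union[int, bytes]]  # ("copy", index) or ("insert", data)


def apply_transition_table(earlier: bytes, ops: List[Op], seg_size: int) -> bytes:
    """Rebuild the updated file from the earlier file and the transition entries."""
    out = bytearray()
    for kind, value in ops:
        if kind == "copy":
            # The segment sits at index * seg_size in the earlier file.
            start = value * seg_size
            out += earlier[start:start + seg_size]
        elif kind == "insert":
            # Characters present only in the updated file travel in the table.
            out += value
        else:
            raise ValueError(f"unknown entry kind: {kind!r}")
    return bytes(out)
```

Only the insert entries carry file content, so the amount of data sent off-site grows with the size of the changes rather than with the size of the file.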
As observed earlier, the TOKEN Table mathematical representations of file segments may be the products of exclusive-oring of the characters in successive earlier file segments and of generating two cyclic redundancy check (CRC) products for each earlier file segment. Corresponding XR products are most quickly generated, but do not detect differences in character order; a sixteen bit CRC will catch most of these transpositions; a relatively slowly generated thirty-two bit CRC product will detect essentially all of them.
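A small check illustrates why the XR product alone is not sufficient. The segments b"ABCD" and b"ABDC" differ only by a transposition of their last two characters; the CRC variants used (binascii.crc_hqx and zlib.crc32) are assumptions, since the text does not name particular polynomials.

```python
import binascii
import zlib
from functools import reduce
from operator import xor


def xor_product(segment: bytes) -> int:
    """Exclusive-or of every byte; independent of byte order."""
    return reduce(xor, segment, 0)


original = b"ABCD"
transposed = b"ABDC"  # same characters, last two swapped

# XOR is commutative, so the transposition is invisible to the XR product.
same_xr = xor_product(original) == xor_product(transposed)

# Both CRCs depend on byte position, so they tell the segments apart.
same_crc16 = binascii.crc_hqx(original, 0) == binascii.crc_hqx(transposed, 0)
same_crc32 = zlib.crc32(original) == zlib.crc32(transposed)
```

Here same_xr is true while same_crc16 and same_crc32 are both false, so the XR test passes the pair on and the CRC tests reject it.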
As observed earlier, the MATCH Table is generated by the second program generating mathematical representations of the segment sized windows of the current file, and comparing the representations of a window with an index's associated mathematical representations in the TOKEN Table. As long as matches obtain, successive window sized segments of the current file are addressed and a MATCH Table listing reflecting the early file segment index and the current segment first character offset is generated. Normally three mathematical representations of each segment obtain: an exclusive-or (XR) one and sixteen bit and thirty-two bit cyclic redundancy check (CRC) ones. In the interests of speed, the XR products are compared first, and if a mismatch occurs in them, it is clear that the segments are unmatched. However, even if the XR products match, the segments may not match because the XR operation is not sensitive to the transposition of characters. Accordingly, it is also necessary on XR match, to compare the sixteen bit CRC product. On sixteen bit CRC match, it is desirable to do a thirty-two bit CRC match for most applications to achieve practically one hundred percent certainty. The generation of the CRC product is a relatively slow process and is avoided where possible as on XR mismatch. However, the great benefit of avoiding CRC calculations occurs in operations subsequent to segment mismatch.
As observed earlier, upon detection of a mismatch, a segment sized window representing only a one character displacement of the window in the current file is operated upon to determine its mathematical representations and compare them with the just compared TOKEN Table representations; on further mismatch, the same is done for successor windows until a match obtains or the end of file is reached. By generating first the quickly generated exclusive-or (XR) products, and only on match generating the more slowly generated CRC products, a significant amount of time can be saved.
Applicant has further discovered that even the exclusive-oring process can be expedited on a one-character shift of the window under consideration. Thus the new XR product need not involve the exclusive-oring of each of the characters of the new window: rather only the exiting character and the entering character need be exclusive-ored with the existing XR product of the just tested segment. The second exclusive-oring of the exiting character amounts to a subtraction of it from the segment product.
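The one-character update can be sketched as follows. The function names are hypothetical; the point shown is that each new XR product needs two exclusive-or operations regardless of segment size, because exclusive-oring the exiting character a second time cancels its earlier contribution.

```python
from functools import reduce
from operator import xor
from typing import List


def full_xor(window: bytes) -> int:
    """XR product computed from scratch over every byte of the window."""
    return reduce(xor, window, 0)


def roll_xor(prev_xr: int, exiting: int, entering: int) -> int:
    """XR product of the window after it is bumped along one character."""
    return prev_xr ^ exiting ^ entering


def sliding_xor_products(data: bytes, seg_size: int) -> List[int]:
    """XR product of every seg_size window, using the rolling update."""
    if seg_size <= 0:
        raise ValueError("seg_size must be positive")
    if len(data) < seg_size:
        return []
    xr = full_xor(data[:seg_size])
    products = [xr]
    for start in range(1, len(data) - seg_size + 1):
        # data[start - 1] leaves the window, data[start + seg_size - 1] enters.
        xr = roll_xor(xr, data[start - 1], data[start + seg_size - 1])
        products.append(xr)
    return products
```

The rolled products agree with a from-scratch computation at every offset, which can be confirmed by comparing sliding_xor_products(data, n) against full_xor(data[i:i + n]) for each i.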
Another feature of the invention is that the amount of updating material that must be transmitted to the off-site location is minimal, normally being less than five percent (5%) of the current file.
An advantage of the invention is that it provides an easy way to secure a user's data from fire, theft and tampering.
Another advantage is that it provides inexpensive disaster recovery insurance.
A further advantage is that it eliminates the tedious chore of computer backup, and allows the user's office time to be dedicated more fully to the productivity and profitability of his or her business.
Yet another advantage of the invention is that programs embodying the invention can be incorporated in larger programs for handling large model files which are immune to character insertions and deletions and grow in size to accommodate new records. Thus under certain circumstances, it is possible to skip creation of MATCH and TRANSITION Tables by windowing techniques.