This invention generally relates to methods and apparatus for tracking changes to byte streams. More specifically, the present invention relates to methods and apparatus capable of identifying and approximately quantifying changes to Web sites.
More and more consulting services are being delivered via the World Wide Web, and automated tools are being developed to provide these services. One of these tools allows a client to track competitors by monitoring changes at the competitors' Web sites and notifying the client of major changes to the sites.
Traditional ways of tracking changes are (1) to store the Web pages off-line and to compare the stored pages to the current pages, and (2) to calculate a number, called a message digest, that represents the document and to use the message digest as the basis for comparison. The second approach is, at least under many circumstances, more appropriate since the first approach requires large amounts of disk space and processing time. Traditional methods of calculating message digests, however, are computationally intensive and have no sensitivity to the content of the document; in fact these methods were designed to be insensitive to content because their primary purpose is to provide a unique identification of, or to “sign,” documents and to prevent forgeries. A computationally efficient method of representing the document numerically that is a function of the content of the document is needed to detect the degree of change.
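The limitation of a traditional message digest can be illustrated with a short sketch. SHA-256 from the Python standard library is used here only as a stand-in for a conventional digest; the sample strings and the helper names `digest` and `differing_hex_digits` are hypothetical and do not come from the invention.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return a conventional message digest (SHA-256, as a stand-in) in hex."""
    return hashlib.sha256(data).hexdigest()


def differing_hex_digits(a: str, b: str) -> int:
    """Count the positions at which two equal-length hex digests differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Hypothetical page contents: a one-character edit and a complete rewrite.
original = b"Welcome to our site."
small_edit = b"Welcome to our site!"
rewrite = b"Entirely different content about a new product line."

# Both edits produce digests that differ from the original in most positions,
# so the digest reveals that a change occurred but not how large it was.
small_edit_distance = differing_hex_digits(digest(original), digest(small_edit))
rewrite_distance = differing_hex_digits(digest(original), digest(rewrite))
```

The digest answers only the question of whether the page changed; it gives no usable indication of the degree of change.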
An object of this invention is to improve methods and apparatus for tracking changes to Web sites.
Another object of the present invention is to provide an improved procedure for detecting and approximately quantifying changes to a Web site.
These objectives are attained with a method and system for determining whether first and second byte streams are different, comprising the steps of providing a first k1 byte long sequence of characters ci for i values from i=1 to i=k1; providing a second k2 byte long sequence of characters cj for j values from j=1 to j=k2; and computing a modulo arithmetic operation on said i values, and computing said modulo arithmetic operation on said j values. A value N1 is computed according to a formula that combines said modulo arithmetic operation on i and each said character ci for i=1 to i=k1 using arithmetic or logical operations, and a value N2 is computed according to said formula by combining said modulo arithmetic operation on j and each said character cj for j=1 to j=k2. These N1 and N2 values are then compared to determine whether the first and second byte sequences are different.
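The computation summarized above can be sketched as follows. The summary does not fix a particular formula, so the one used here, a sum of each byte value multiplied by ((i mod m) + 1), is only an assumed illustration; the function names and the modulus value 251 are likewise hypothetical.

```python
def byte_sequence_number(data: bytes, modulus: int = 251) -> int:
    """Combine each byte c_i with a modulo operation on its position i.

    Assumed formula (not specified by the source):
        N = sum over i = 1..k of c_i * ((i mod modulus) + 1)
    """
    total = 0
    for i, c in enumerate(data, start=1):
        # (i % modulus) is the modulo operation on the position; adding 1
        # keeps the weight nonzero so every byte contributes to the result.
        total += c * ((i % modulus) + 1)
    return total


def sequences_differ(first: bytes, second: bytes, modulus: int = 251) -> bool:
    """Compute N1 and N2 with the same formula and compare them."""
    n1 = byte_sequence_number(first, modulus)
    n2 = byte_sequence_number(second, modulus)
    return n1 != n2
```

Because the weight depends on position, swapping two bytes changes the result: under this assumed formula `byte_sequence_number(b"ab")` is 97*2 + 98*3 = 488, while `byte_sequence_number(b"ba")` is 98*2 + 97*3 = 487. As with any compact number, two different sequences can still yield the same value, so equal values do not prove that the sequences are identical.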
The step of computing the modulo arithmetic operation on the i values may include additional arithmetic and logical bit operations. Likewise, the step of computing the modulo arithmetic operation on the j values may include additional arithmetic and logical bit operations.
With the preferred embodiment of this invention, the procedure is used to create a number suitable for detecting and approximately quantifying changes to a byte sequence. This procedure is suitable for characterizing arbitrarily large documents in a way suitable for change detection without storing a copy of the document itself. The byte sequence function creates a small number suitable for efficient storage, is computationally efficient, is sensitive to changes in the byte sequence, and is sensitive to the size of the byte sequence. An important advantage of the invention is that the generated number lends itself to methods of arbitrary sensitivity to document changes such as setting clipping levels for changes based on the ratio of the before and after numbers.
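A clipping level based on the ratio of the before and after numbers might be applied as in the following sketch. The function names, the symmetric larger-over-smaller form of the ratio, and the default clipping level of 1.10 are assumptions chosen for illustration, not values taken from the invention.

```python
def change_ratio(before: int, after: int) -> float:
    """Return the ratio of the larger number to the smaller, always >= 1.0."""
    if before == after:
        return 1.0
    if before == 0 or after == 0:
        # One document is empty and the other is not: treat as unbounded change.
        return float("inf")
    return max(before, after) / min(before, after)


def is_major_change(before: int, after: int, clip_level: float = 1.10) -> bool:
    """Report a change only when the ratio exceeds the chosen clipping level."""
    return change_ratio(before, after) > clip_level


# Hypothetical stored and freshly computed numbers for one Web page.
stored_number = 1_000_000
current_number = 1_030_000
notify_client = is_major_change(stored_number, current_number)
```

Raising or lowering `clip_level` adjusts how sensitive the notification is, and only the stored number, not a copy of the page, is needed for the comparison.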
Further benefits and advantages of the invention will become apparent from a consideration of the following detailed description, given with reference to the accompanying drawings, which specify and show preferred embodiments of the invention.