This invention relates generally to a system and method for comparing the differences between two documents and in particular to a system and method for comparing two hypertext markup language (HTML) documents and displaying the changed areas in the HTML documents while retaining the original HTML formatting.
The traditional method of locating document changes within pure text files is accomplished via a technique known as file differencing, or diffing. UNIX has a utility called “diff” that is used for file differencing. It works by comparing each line in a first file (the Right File) with each line in a second file (the Left File). A carriage return character typically separates each line from each other line. After the comparisons are finished, each line in the Right File will be identified as having one of the following states:
1. unmodified: The current line exactly matches another line in the Left File;
2. new: The current line has no match in the Left File; and
3. modified: The current line nearly matches one of the lines in the Left File with some changes.
The line is deliberately chosen as the unit of comparison because it carries an intermediate amount of information. It is somewhat larger than a single character or word, and therefore offers a meaningful context for a detected change. At the same time, a line is small enough that a change typically affects only a few lines, while the remaining lines are compared separately and are often identical in both files.
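The three line states described above can be sketched in a few lines of Python. This is a minimal illustration only, not the algorithm used by the UNIX “diff” utility: the function name `classify_lines`, the use of `difflib.SequenceMatcher` to decide what "nearly matches" means, and the 0.6 similarity threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def classify_lines(left_text, right_text, threshold=0.6):
    """Label each line of the Right File relative to the Left File.

    Hypothetical helper: `threshold` is an assumed cut-off for what
    counts as a "near match"; it is not taken from any diff tool.
    """
    left_lines = left_text.splitlines()
    left_set = set(left_lines)
    result = []
    for line in right_text.splitlines():
        if line in left_set:
            # Exact match with some line in the Left File.
            state = "unmodified"
        elif any(
            SequenceMatcher(None, line, other).ratio() >= threshold
            for other in left_lines
        ):
            # Similar enough to an existing line to count as an edit of it.
            state = "modified"
        else:
            # No exact or near match anywhere in the Left File.
            state = "new"
        result.append((state, line))
    return result


left = "alpha beta gamma\nsecond line here\n"
right = "alpha beta gamma\nsecond line there\nzzz qqq 123\n"
states = classify_lines(left, right)
```

With these inputs, the first line of the Right File is reported as unmodified, the second as modified, and the third as new.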
To better understand this typical differencing technique, consider a “diff” operation, i.e., a line-by-line comparison, of the text on the left and its revised version on the right, where modified lines in the right file are marked with an asterisk for emphasis:

Left File                           | Right File
La Nina continues in the Pacific    |   La Nina continues in the Pacific
Ocean, meaning cooler than          |   Ocean, meaning cooler than
average sea surface temperatures    |   average sea surface temperatures
along the equator north of South    | * along the equator west of South
America. Typically this means a     |   America. Typically this means a
warmer and drier summer for the     |   warmer and drier summer for the
Midwest. The Summer of '99 has      | * Midwest. The Summer of 1999 has
been very hot, with 32 days         |   been very hot, with 32 days
recording highs of 90 degrees or    |   recording highs of 90 degrees or
above, and very dry, with rainfall  |   above, and very dry, with rainfall
deficits exceeding 4.5 feet so far. | * deficits exceeding 4.5 inches so far.

For this example, Table 1 below illustrates what you would see if the basis of comparison was a word (left column) vs. a line (right column).
TABLE 1
Changes Found After Comparing Text on a Word Basis vs. Line Basis

Word Basis | Line Basis
west       | Equator west of South America
1999       | Summer of 1999 has been very hot
inches     | Rainfall deficits exceeding 4.5 inches
As illustrated by the above example, the changes detected in a line-by-line comparison (right column) convey the essence of the revisions better than the changes detected when using a smaller unit of comparison, such as a word.
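The contrast between the two units of comparison can be reproduced with a short Python sketch. It is illustrative only: the helper name `changed_units` is hypothetical, and Python's `difflib.SequenceMatcher` stands in for whatever matching algorithm a given differencing tool actually uses. The same function is applied once to the words and once to the lines of the example text.

```python
from difflib import SequenceMatcher

LEFT = """La Nina continues in the Pacific
Ocean, meaning cooler than
average sea surface temperatures
along the equator north of South
America. Typically this means a
warmer and drier summer for the
Midwest. The Summer of '99 has
been very hot, with 32 days
recording highs of 90 degrees or
above, and very dry, with rainfall
deficits exceeding 4.5 feet so far."""

RIGHT = """La Nina continues in the Pacific
Ocean, meaning cooler than
average sea surface temperatures
along the equator west of South
America. Typically this means a
warmer and drier summer for the
Midwest. The Summer of 1999 has
been very hot, with 32 days
recording highs of 90 degrees or
above, and very dry, with rainfall
deficits exceeding 4.5 inches so far."""


def changed_units(left_units, right_units):
    """Return the units of the right sequence that differ from the left.

    Hypothetical helper: a "unit" is whatever the caller split the text
    into (words or lines in this example).
    """
    matcher = SequenceMatcher(None, left_units, right_units, autojunk=False)
    changed = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "replace" and "insert" are the opcodes that contribute
        # content appearing only on the right-hand side.
        if tag in ("replace", "insert"):
            changed.extend(right_units[j1:j2])
    return changed


word_changes = changed_units(LEFT.split(), RIGHT.split())
line_changes = changed_units(LEFT.splitlines(), RIGHT.splitlines())

print(word_changes)  # ['west', '1999', 'inches']
for line in line_changes:
    print(line)
```

The word-based result is three isolated tokens, while the line-based result is the three complete revised lines, which carry their own context.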
The World Wide Web (Web) is an international network of computers containing a vast amount of information. The hypertext markup language (HTML) is the lingua franca for publishing documents on the Web. The problem is that the typical differencing operations described above do not work well for HTML documents. In particular, unlike in pure text documents or documents created using a word processor, carriage returns in HTML documents are not significant. In more detail, the width of the lines displayed by a viewer is determined by the width of the viewer window, not by where carriage returns are entered in the HTML file. Therefore, a typical differencing operation that uses lines as the unit of comparison does not work well when comparing HTML files, since the operation may identify differences that are insignificant. In addition, HTML treats a contiguous sequence of white space characters as equivalent to a single space character, and a typical differencing operation does not take this into account.
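The effect of these two rules on a line-based comparison can be shown with a small Python sketch. It is a simplification under stated assumptions: the helper name `collapse_whitespace` is hypothetical, the sample paragraphs are invented for the illustration, and the sketch ignores contexts such as preformatted text where white space is significant.

```python
import re

# Space, tab, line feed, form feed, and carriage return: the characters
# treated here as collapsible white space (an assumption of this sketch).
_HTML_WHITESPACE = re.compile(r"[ \t\n\f\r]+")


def collapse_whitespace(text):
    """Replace each run of white space with a single space character."""
    return _HTML_WHITESPACE.sub(" ", text).strip()


# Two invented sources for the same paragraph, differing only in
# where line breaks and extra spaces were entered.
source_a = "<p>The quick brown fox\njumps over the lazy dog.</p>"
source_b = "<p>The   quick brown\nfox jumps over\nthe lazy dog.</p>"

# A line-based comparison sees two different files...
lines_differ = source_a.splitlines() != source_b.splitlines()

# ...although the sources are identical once white space is collapsed.
same_after_collapse = collapse_whitespace(source_a) == collapse_whitespace(source_b)

print(lines_differ, same_after_collapse)  # True True
```

Every line of the second source differs from every line of the first, so a line-based differencing operation would flag the whole paragraph as changed even though a viewer would display the two identically.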
Due to the peculiar rules of the HTML language described above, the following are equivalent representations of the same paragraph in HTML document sources: