1. Field of the Invention
The present invention relates generally to computer applications and programming. More specifically, it relates to creating a user interface that enables a user to quickly evaluate similarities and differences among multiple documents of the same or different type.
2. Discussion of Related Art
A common feature or utility in some word processing programs and operating systems is the ability to compare files and provide information on differences (or similarities) between the files. A variety of file comparison programs are available, with different limitations and capabilities, for example, with regard to how and what comparison data is presented or the number of files that can be compared in one run. Many of these programs are adequate in certain aspects but have drawbacks in others, making them poorly suited for certain applications. This is particularly true given the constantly growing trend to store, submit, transfer, copy, and otherwise manipulate information electronically.
One utility used to compare files in the UNIX operating system is known as diff. This program can compare up to three files or documents. The output of this program is typically two columns of data. One column displays line numbers in one (subject) document across from a second column displaying line numbers in the query document that are different from corresponding line numbers in the subject document. Thus, the diff utility is used when the documents are assumed to be generally similar. The program uses a dynamic programming algorithm that computes the minimal “edit distance” between two documents. An “edit distance” between two documents, or strings, is the length of a minimal sequence of insertions, deletions, and substitutions that transforms one to the other. From information about how the minimal edit distance is derived, diff computes matching passages in the two documents, which are presented to the user in the column format described earlier. The program cannot find differences among sets or large bodies of documents; it typically compares two, or at most three, documents.
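The edit distance defined above can be illustrated with a short sketch. This is the textbook dynamic programming recurrence for that definition, operating on characters; it is not the code of any particular diff implementation, and the function name edit_distance is chosen here for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Length of a minimal sequence of insertions, deletions, and
    substitutions that transforms string a into string b."""
    # prev[j] is the distance between the first i-1 characters of a
    # and the first j characters of b (initially i-1 == 0).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # prints 3
```

The table has one cell per pair of prefixes, so time grows with the product of the two lengths, which is one reason this style of comparison is applied to a pair of documents rather than to a large collection.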
Other methods of comparing files can be broadly categorized as information retrieval methods. These methods compare statistical profiles of documents. For example, one strategy used by these methods is computing a histogram of word frequencies for each document, or a histogram of the frequency of certain pairs or juxtaposition of words in a document. Documents with similar histograms are considered to be similar documents. Refinements of these methods include document preprocessing (e.g. removing unimportant words) prior to computing the statistical profile and applying the same information retrieval method to subsections of documents. Some of the primary drawbacks of these methods include tendencies to provide false positive matches and presenting output or results in a form difficult to quickly evaluate. False positives arise because it is sometimes difficult to prevent dissimilar documents from having similar statistical profiles. With respect to presentation, these methods often simply provide correlations. In sum, these methods can often provide too little information about similarities or differences among documents thus requiring the user to closely evaluate the results and refer back to the files being compared to determine whether meaningful differences or similarities exist.
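A minimal sketch of the word-frequency approach shows how a false positive can arise. Cosine similarity is used here as one common way to compare two histograms; that choice, the tokenizing pattern, and the function names are assumptions made for illustration and are not taken from any specific information retrieval system.

```python
import math
import re
from collections import Counter


def word_histogram(text: str) -> Counter:
    """Count how often each lowercase word occurs in the text."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(h1: Counter, h2: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(h1[w] * h2[w] for w in h1.keys() & h2.keys())
    norm1 = math.sqrt(sum(c * c for c in h1.values()))
    norm2 = math.sqrt(sum(c * c for c in h2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty document matches nothing
    return dot / (norm1 * norm2)


# Two sentences with different meanings but identical word counts.
doc_a = "the dog bit the man"
doc_b = "the man bit the dog"
score = cosine_similarity(word_histogram(doc_a), word_histogram(doc_b))
print(score)  # approximately 1.0, even though the sentences differ
```

The score is a single correlation figure: it says nothing about where in either document the shared material occurs, which is the presentation problem noted above.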
Another method is based on a procedure known as document fingerprinting. Fingerprinting a document involves computing hashes of selected substrings in a document. A particular set of substring hashes chosen to represent a document is the document's fingerprint. The similarity of two documents is defined as a ratio C/T where C is the number of hashes the two documents have in common and T is the total number of hashes taken of one of the documents. Assuming a well-behaved hash function, this ratio is a good estimate of the actual percentage overlap between the two documents. However, this also assumes that a sufficient number of substring hashes are used. Various approaches have been used in determining which substrings in a document are selected for hashing and which of these substring hashes are saved as part of the document fingerprint. One way is to compute hashes of all substrings of a fixed length k and retain those hashes that are 0 mod p for some integer p. Another way is partitioning the document into substrings with hashes that are 0 mod p and saving those hashes. The difference from the first way is that the substrings selected are not of fixed length. In this method, a character is added to a substring until the hash of the substring is 0 mod p, at which point the next substring is formed. In order to reduce memory requirements, the program can set p to 15 or 20 thereby saving, in theory, every 15th or 20th hash value. However, based on probability theory, for a large body of documents, there will be large gaps where no hash value will be saved. This can potentially lead to the situation where an entire document is bypassed without having a single substring hash value saved for a fingerprint. More generally, if gaps between stored hash values are too long, a document's fingerprint will be faint or thin and, thus, ill-suited for comparison to other documents.
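The first selection scheme described above (hash every substring of length k, keep the hashes that are 0 mod p) and the C/T ratio can be sketched as follows. The use of zlib.crc32 as the hash function, the default values of k and p, and the function names are assumptions made for illustration. Retained hashes are held in a set, so repeated substrings count once.

```python
import zlib


def fingerprint(text: str, k: int = 5, p: int = 4) -> set:
    """Hashes of all length-k substrings, keeping those that are 0 mod p."""
    data = text.encode("utf-8")
    kept = set()
    for i in range(len(data) - k + 1):
        h = zlib.crc32(data[i:i + k])
        if h % p == 0:
            kept.add(h)
    return kept


def similarity(query: str, subject: str, k: int = 5, p: int = 4) -> float:
    """C / T, where C is the number of retained hashes the two documents
    share and T is the number of hashes retained for the query document."""
    fq = fingerprint(query, k, p)
    fs = fingerprint(subject, k, p)
    if not fq:
        return 0.0  # thin fingerprint: no hash of the query was retained
    return len(fq & fs) / len(fq)


text = "a document fingerprint is a set of substring hashes"
print(len(fingerprint(text, p=1)), len(fingerprint(text, p=4)))
print(similarity(text, text, p=1))
```

Because whether a hash is retained depends only on its value, nothing bounds the distance between two retained positions. A short document, or a long stretch of a large one, can end up with no retained hash at all, which is the thin-fingerprint case handled by the early return.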
Another drawback of current document comparison programs is the presentation of comparison results to the user. Most user interfaces displaying the comparison data are generally text-based. Some may have simple bar graphs or charts based on percentage of matching content. They do not present to the user more specific data on the location of the matching passages, their lengths, or the ability to immediately focus on matching content between two or more documents. Typically, user interfaces or output data from user comparison programs present information based on the total amount of matching content between documents.
Therefore, it would be desirable to create and present a user interface for a document comparison program that allows a user to easily and meaningfully evaluate the results of the comparison. It would be desirable to present as much information as possible in a simple, intuitive, and visually-appealing manner that also allows a user to quickly access portions of matching text in the documents. It would also be desirable to have the user interface function efficiently in a Web browser context.
To achieve the foregoing, and in accordance with the purpose of the present invention, methods, apparatus, and computer program products are described for monitoring a document being digitally transmitted within a computer network or outside a private network via the Internet. A digitally transmitted document, such as an email message, is received at a monitoring component or station. The document is compared against multiple previously stored documents, typically including confidential and otherwise sensitive content, and a comparison output or result is produced. It is then determined whether to alter the original transmission of the document based on the comparison result. Through this process, the dissemination of the document with respect to the computer network can be regulated if it contains confidential information.
In one embodiment an index, such as a hashtable, for the multiple previously stored documents is created that facilitates the comparison of the document against the previously stored documents, which are stored in a file system or database. In another embodiment it is determined whether the digitally transmitted document should be added to the multiple previously stored documents. The document is examined for predetermined indicators associated with characteristics of the multiple previously stored documents. In another embodiment a rule engine is provided with the comparison output and causes a particular action to occur based on the comparison result. The rule engine performs one of multiple rules based on the comparison output. In yet another embodiment the comparison result indicates whether the document matches any content in the previously stored documents above a predetermined threshold of similarity or percentage of overlap. In yet another embodiment the document is transmitted using an Internet email protocol or a stream-oriented protocol such as HTTP or FTP.
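The combination of a hashtable index, a similarity threshold, and a rule acting on the comparison output can be pictured with a small sketch. Everything concrete in it is a hypothetical stand-in rather than the described embodiment: the substring length K, the use of zlib.crc32, the 0.5 threshold, the function names, and the two-outcome rule ("block" or "deliver").

```python
import zlib
from collections import defaultdict

K = 8  # hypothetical substring length used for hashing


def substring_hashes(text, k=K):
    """Set of hashes of every length-k substring of the text."""
    data = text.encode("utf-8")
    return {zlib.crc32(data[i:i + k]) for i in range(len(data) - k + 1)}


def build_index(stored):
    """Hashtable mapping each substring hash to the ids of the stored
    documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in stored.items():
        for h in substring_hashes(text):
            index[h].add(doc_id)
    return index


def overlap_by_document(index, message):
    """Fraction of the message's hashes found in each stored document."""
    hashes = substring_hashes(message)
    if not hashes:
        return {}
    counts = defaultdict(int)
    for h in hashes:
        for doc_id in index.get(h, ()):
            counts[doc_id] += 1
    return {doc_id: n / len(hashes) for doc_id, n in counts.items()}


def decide(overlaps, threshold=0.5):
    """Toy rule: block when any stored document overlaps above the threshold."""
    return "block" if any(v > threshold for v in overlaps.values()) else "deliver"


stored = {"plan": "The merger will be announced on Friday at noon."}
index = build_index(stored)
print(decide(overlap_by_document(index, stored["plan"])))
print(decide(overlap_by_document(index, "Lunch menu: soup and salad today.")))
```

Building the index once means each incoming message costs one lookup per substring hash, regardless of how many documents are stored.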
In another aspect of the present invention, a user interface having one or more frames for displaying results of a document comparison program is described. The user interface allows a user to efficiently and effectively examine the results of comparing two or more documents for similar or matching passages. One of the frames is an overview frame that contains two or more graphical representations of an equal number of documents that were compared by the program. A graphical representation contains comparison information related to one or more passages in a selected document that closely resemble passages in other documents. The comparison information enables a user to directly access passages in the selected document thereby facilitating examination of similar passages found in the other documents. The direct access is in the form of an internal link between the comparison information in the graphical representation and the passages in the documents. Such a link is an HTML hypertext link.
In one embodiment of the present invention, the user interface contains other frames for displaying content from the documents. One frame contains content from the selected document, or the document being compared to the corpus of documents and the other frames contain content from the corpus of documents. In another embodiment the passages in the selected documents and in the other documents are assigned a color in order to associate matching passages. In yet another embodiment, matching passages in the documents are internally linked, such as through an HTML hypertext link, thereby allowing direct reference to matching corresponding passages among the documents. In yet another embodiment the graphical representation of a document in the overview frame is in the form of an overview bar having typically one or more sub-bands. Each sub-band is assigned a color which matches the color of two or more matching passages. Each sub-band also has a link to its corresponding passage in the document such that activating the link causes the corresponding passage to come up for display in one of the content frames so that the user can examine it. The overview bars are aligned or configured in such a manner that sub-bands in different bars can be compared to see which matching passages might be of interest to a user.
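One way to picture an overview bar is as a horizontal strip in which each sub-band is a colored hypertext link positioned in proportion to where its passage falls in the document. The sketch below generates such a strip as HTML. The tuple layout for passages, the doc_id.html file naming, the frame name "content", and the inline styling are all hypothetical choices made for illustration and are not taken from the described embodiments.

```python
from html import escape


def overview_bar(doc_id, doc_length, passages):
    """Render one document as a bar whose colored sub-bands link to passages.

    passages: list of (start, end, color, anchor) tuples, where start and
    end are character offsets and anchor names the passage in the document.
    """
    parts = ['<div style="position:relative;width:100%;height:16px;background:#ddd">']
    for start, end, color, anchor in passages:
        left = 100.0 * start / doc_length           # sub-band position, percent
        width = 100.0 * (end - start) / doc_length  # sub-band length, percent
        parts.append(
            f'<a href="{escape(doc_id)}.html#{escape(anchor)}" target="content" '
            f'style="position:absolute;left:{left:.1f}%;width:{width:.1f}%;'
            f'height:100%;background:{escape(color)}"></a>'
        )
    parts.append("</div>")
    return "\n".join(parts)


print(overview_bar("doc1", 1000, [(100, 300, "red", "p1"), (600, 650, "blue", "p2")]))
```

Giving the matching passage in another document's bar the same color, and stacking the bars one above another, lets sub-bands be compared across documents at a glance.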
In another aspect of the present invention, a method of interacting with a user interface for displaying results of a document comparison program is described. The user interface includes one or more overview bars where each overview bar represents a document used in the comparison program. Each overview bar includes one or more sub-bands where a sub-band represents a passage in the document that matches a passage(s) in another document. A sub-band from an overview bar is selected thereby activating a link between the sub-band and its associated passage in the document. By activating this link, the associated passage is displayed to the user in another frame in the user interface. Upon being displayed to the user, the user selects the associated matching passage thereby activating another link between that passage and its matching passage in another document. By activating this link, the matching passage in the other document is displayed next to the original passage in yet another frame in the user interface. This allows the user to compare the two matching passages simply by clicking on either the sub-bands in the overview bars or the text in the documents. In one embodiment a sub-band is assigned a color that is the same as the color of its corresponding passage in the document.