1. Field of the Invention
The present invention relates to a computer program product, system, and method for determining linkage metadata of content of a target document to source documents.
2. Description of the Related Art
Often a target document needs to be compared against other source documents to determine whether the target document has content matching or copied from the source documents. For instance, the source documents may comprise sensitive documents that are confidential to an organization, such as a corporation or government body. Despite their confidentiality, sensitive documents tend to propagate freely within an organization and across the organization's network boundaries onto the Internet. One particular problem arises when a sensitive document containing proprietary, competitive, or private information is leaked outside the organization without authorization.
In addition to the organizational and corporate setting, various services are available to determine whether a student paper plagiarizes documents in a database of papers. One such student plagiarism detection service is Turnitin (see turnitin.com).
Known techniques to determine whether a target document has copied content from other source documents, such as sensitive corporate documents or papers, involve comparing the content of the target document with the content of the source documents.
One technique for determining textual relatedness is Koala Document Fingerprinting (KDF), which has been used to determine the relatedness of computer science research documents such as technical reports, conference papers, and journal articles. In a Koala search, the document (URL) is loaded onto the KDF server as a textual representation. Then, using this text, a fingerprint of the document is generated. Finally, this fingerprint is matched against the current document fingerprint database to find related documents.
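The three steps described above (obtain text, generate a fingerprint, match against a fingerprint database) can be illustrated with a minimal sketch. This is not the actual KDF algorithm, whose details are not given here; it substitutes a generic word-shingle hashing scheme. The function names `fingerprint` and `find_related`, the shingle length `k`, the `threshold` value, and the containment score are all assumptions made for illustration.

```python
import hashlib
import re


def fingerprint(text, k=5):
    """Return a set of 64-bit hashes, one per k-word shingle of the text.

    Hypothetical scheme for illustration; not the actual KDF fingerprint.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    hashes = set()
    for i in range(len(words) - k + 1):
        shingle = " ".join(words[i:i + k])
        digest = hashlib.sha1(shingle.encode("utf-8")).digest()
        hashes.add(int.from_bytes(digest[:8], "big"))
    return hashes


def find_related(target_text, database, k=5, threshold=0.2):
    """Match the target's fingerprint against a fingerprint database.

    `database` maps a source document name to its fingerprint set.
    The score is the fraction of target shingles also found in the source
    (an assumed containment measure). Returns (name, score) pairs at or
    above `threshold`, highest score first.
    """
    target_fp = fingerprint(target_text, k)
    if not target_fp:
        return []
    results = []
    for name, source_fp in database.items():
        score = len(target_fp & source_fp) / len(target_fp)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

A database would be built once by calling `fingerprint` on each source document; each target document is then fingerprinted with the same `k` and passed to `find_related`. Containment relative to the target is used here rather than symmetric similarity so that a short copied passage in a long source still scores highly, which is a design choice of this sketch only.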