Similarity computation between documents has many applications, such as plagiarism detection, copyright management, duplicate report detection, document classification and clustering, and search engines, to name a few. Many solutions have been proposed under the assumption that the contents of the documents are public. The general approach is to first compute a feature vector, typically based on word occurrence counts or n-gram frequencies in the document. Those features, or fingerprint values extracted from them, are then compared to compute the similarity score. Besides natural language processing based approaches, the information distance based similarity metric has also been applied in practical cases such as software plagiarism detection.
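As a minimal sketch of the feature-vector approach outlined above, the following illustrative code (the function names, the choice of character trigrams, and cosine similarity are our own assumptions, not taken from any particular cited solution) extracts n-gram frequency vectors from two texts and compares them:

```python
from collections import Counter
import math

def ngram_vector(text, n=3):
    """Count overlapping character n-grams of a document (illustrative feature vector)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(v1, v2):
    """Cosine similarity between two sparse frequency vectors, in [0, 1]."""
    dot = sum(v1[k] * v2[k] for k in v1 if k in v2)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

a = ngram_vector("privacy preserving similarity detection")
b = ngram_vector("privacy preserving similarity computation")
c = ngram_vector("completely unrelated subject matter")
print(cosine_similarity(a, b))  # near-duplicate pair: noticeably higher score
print(cosine_similarity(a, c))  # unrelated pair: score near zero
```

Note that this sketch operates on plaintext; the privacy-preserving setting discussed below requires the same kind of comparison without revealing the documents.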
The need for privacy preserving similarity detection arises when the owners of the documents want to keep their contents secret while still measuring the similarity between them. Example scenarios include, but are not limited to, duplicate submission control between related conferences and journals in academia, and information sharing between insurance companies or intelligence agencies.
In the simultaneous submission detection problem, since the content of each individual submission is private, the venues receiving the submissions cannot share them publicly to check whether very similar copies of the same paper are under review at other venues at the same time. However, if a practical privacy preserving document similarity tool is available, the related venues, such as conferences or journals accepting papers on similar topics at the same time, may use it to control duplicate submissions.
Another example is document comparison between insurance companies to detect fraudulent claims. When damage or loss is claimed from one company, it may be necessary to check for similar claims filed with others. However, since those claims are confidential, public comparison is not possible, and the companies need a privacy preserving similarity method to compare the files without violating their confidentiality. Similar scenarios can be imagined for information sharing during a collaboration between governmental intelligence agencies.
The known methods proposed to solve the privacy preserving document similarity detection problem generally begin by encrypting the features extracted from the documents, mostly with a homomorphic scheme, which allows a limited set of operations to be computed on the encrypted data while keeping the information provably secure. Based on the operations enabled by homomorphic encryption, the comparison of the encrypted feature vectors is achieved via secure multi-party computation protocols. However, the inherent practical difficulties of homomorphic schemes inhibit the use of the previous solutions for privacy preserving document similarity detection.
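The homomorphic property referred to above can be illustrated with a toy Paillier cryptosystem, sketched below with deliberately tiny, insecure parameters chosen for this illustration (real deployments use moduli of thousands of bits and vetted libraries; nothing here is taken from any cited solution). Multiplying two Paillier ciphertexts yields an encryption of the sum of the plaintexts, which is the kind of limited operation the prior-art protocols build upon:

```python
import math
import random

# Toy Paillier parameters (NOT secure; for illustration only).
p, q = 17, 19                                  # assumed toy primes
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
g = n + 1                                      # standard generator choice
mu = pow(lam, -1, n)                           # since L(g^lam mod n^2) = lam mod n

def encrypt(m):
    """Paillier encryption: g^m * r^n mod n^2 with random r coprime to n."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """Paillier decryption via L(x) = (x - 1) // n."""
    L = (pow(c, lam, n2) - 1) // n
    return (L * mu) % n

# Homomorphic property: multiplying ciphertexts adds plaintexts.
c1, c2 = encrypt(12), encrypt(30)
print(decrypt((c1 * c2) % n2))  # prints 42 (= 12 + 30)
```

The practical difficulties mentioned above stem largely from the cost of such modular exponentiations at secure key sizes and from the protocol rounds needed to compare encrypted vectors.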
The Chinese patent document numbered CN1447603 of the known state of the art relates to lossless data compression and decompression in the information technology area. The lossless compression and decompression using information source entropy coding are based on an entropy coding method with non-prefix codes, which builds a binary tree from the frequencies of occurrence of the message source symbols and derives the code of each symbol through a search from the root to the leaves. This method does not aim to provide a method for privacy preserving document similarity detection, and the non-prefix coding is entirely different.
In the paper titled “Non prefix-free codes for constrained sequences” (Marco Dalai and Riccardo Leonardi, Department of Electronics for Automation, University of Brescia, Italy), the use of variable-length non-prefix-free codes for coding constrained sequences of symbols is described. This method does not aim to provide a method for privacy preserving document similarity detection, and the non-prefix coding is entirely different.