The following references have some relevance in this matter.
[1] Mikhail J. Atallah and Wenliang Du, “Secure Multi-party Computational Geometry”, WADS '01: Proceedings of the 7th International Workshop on Algorithms and Data Structures, 2001, pp. 165-179.
[2] Mikhail J. Atallah, Florian Kerschbaum and Wenliang Du, “Secure and private sequence comparisons”, WPES '03: Proceedings of the 2003 ACM Workshop on Privacy in the Electronic Society, Washington D.C., pp. 39-44.
[3] Chris Clifton, Murat Kantarcioglu, Xiaodong Lin, Jaideep Vaidya and Michael Zhu, “Tools for privacy preserving distributed data mining”, SIGKDD Explorations, 4(2), pp. 28-34, January 2003.
[4] Peter Christen, Tim Churches and Markus Hegland, “Febrl - A Parallel Open Source Data Linkage System”, Proceedings of the 8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, May 26-28, 2004, pp. 638-647.
[5] Tim Churches and Peter Christen, “Blind Data Linkage using n-gram Similarity Comparisons”, Proceedings of the 8th PAKDD '04 (Pacific-Asia Conference on Knowledge Discovery and Data Mining), Sydney, May 2004, pp. 121-126.
[6] Tim Churches and Peter Christen, “Some methods for blindfolded record linkage”, BMC Medical Informatics and Decision Making, vol. 4, 2004.
[7] Wenliang Du and Mikhail J. Atallah, “Secure multi-party computation problems and their applications: a review and open problems”, NSPW '01: Proceedings of the 2001 Workshop on New Security Paradigms, 2001, pp. 13-22.
[8] Wenliang Du, Mikhail J. Atallah and Florian Kerschbaum, “Protocols for Secure Remote Database Access with Approximate Matching”, 7th ACM Conference on Computer and Communications Security (ACMCCS 2000), Athens, Greece, November 2000.
[9] Halbert L. Dunn, “Record Linkage”, American Journal of Public Health, Vol. 36 (1946), pp. 1412-1416.
[10] M. Naor and B. Pinkas, “Oblivious Transfer and Polynomial Evaluation”, Proc. of the 31st Symp. on Theory of Computing (STOC), Atlanta, Ga., pp. 245-254, May 1-4, 1999.
[11] M. Naor and B. Pinkas, “Efficient Oblivious Transfer Protocols”, Proceedings of SODA 2001 (SIAM Symposium on Discrete Algorithms), Washington D.C., Jan. 7-9, 2001.
[12] P. Ravikumar, W. W. Cohen and S. E. Fienberg, “A secure protocol for computing string distance metrics”, Proceedings of the Workshop on Privacy and Security Aspects of Data Mining, Brighton, UK, 2004, pp. 40-46.
[13] W. E. Winkler, “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage”, Proceedings of the Section on Survey Research Methods, American Statistical Assn., 1990, pp. 354-359.
[14] William E. Winkler and Yves Thibaudeau, “An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census”, Statistical Research Report Series RR91/09, U.S. Bureau of the Census, Washington, D.C., 1991.
[15] Andrew C. Yao, “Protocols for secure computations”, Proc. 23rd FOCS, New York, 1982, pp. 160-164.
[16] Akihiro Yamamura and Taiichi Saito, “Private Information Retrieval Based on the Subgroup Membership Problem”, in V. Varadharajan and Y. Mu, editors, Proceedings of ACISP 2001, volume 2119 of LNCS, pp. 206-220, Springer-Verlag, 2001.
The term Record Linkage refers to the process of determining records in one database which correspond to the same entity as a record or records in some other database. This is a common process in a number of fields within the health domain, including health data management, health research and disease surveillance. The technique is also commonly applied in preprocessing data for data mining tasks to remove duplicates, in the processing of census and historical data and in a number of other domains.
An example of an application within health surveillance is the elimination of duplicate lab test results. In some cases, more than one lab might report a positive test for some disease for the same patient, in which case it is important to link those two tests so that the number of positive tests is not over-counted. Another example is a case where a positive test for some disease is reported in two different geographic locations. If record linkage can determine that those tests correspond to the same patient, this may be important information for tracking the disease.
Record linkage was first introduced in [9]. Record linkage methods include deterministic algorithms, which use a predefined set of rules to determine the linkage of two records based on the exact agreement or disagreement of various fields, and probabilistic algorithms, which specify a pair of records as a link, no-link, or uncertain based on the value of some likelihood score of the set of agreements and disagreements of various fields.
Errors in first names and surnames in database records are common, often occurring in as many as twenty percent of names, due to typographical errors or errors from optical character recognition. Also, various versions of the same name (such as “John” and “Johnny”) can occur in different records which refer to the same entity. Because of this, it is often useful in record linkage to make use of approximate string comparators, which provide a measure of the similarity of two fields in a record that is more general than the binary exact-match/non-exact-match measure.
A number of approximate string comparators have been proposed or used in practice to date. These include the bi-gram or n-gram string comparators, the Jaro-Winkler string comparator, the longest common sub-string comparator, edit distance such as the Levenshtein edit distance, bag distance, and others. Many of these are implemented in the open-source record linkage project Febrl (Freely extensible biomedical record linkage), by Peter Christen and Tim Churches [4]. There are also a number of other schemes including some which use compression to calculate similarity and some which first use a phonetic coding algorithm to convert strings to some representation of their sound when vocalized and then apply a comparison metric to those representations.
The bigram string comparator, known as the Dice coefficient, is commonly used in a number of applications in computer science. In this scheme, a string of n characters is broken up into n−1 pairs of adjacent characters, each pair of which is called a bigram. The Dice coefficient is given as the ratio of the number of bigrams which the two strings have in common to the average of the number of bigrams in the two strings. An n-gram based string comparator is a generalization of this concept from pairs of adjacent characters to substrings of length n.
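As a rough illustration of the definition above, the following Python sketch computes the bigram Dice coefficient in the clear, with no privacy protection. The function names are hypothetical, and the handling of repeated bigrams (counted as a multiset) is an assumption, since the text does not specify it.

```python
from collections import Counter


def bigrams(s):
    """Return the multiset of adjacent character pairs in s (n - 1 pairs)."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice_coefficient(a, b):
    """Shared bigrams divided by the average number of bigrams in a and b."""
    ba, bb = bigrams(a), bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        # Assumption: strings too short to have any bigram score zero.
        return 0.0
    shared = sum((ba & bb).values())  # multiset intersection
    # shared / (total / 2) is the ratio to the average bigram count.
    return 2.0 * shared / total


print(dice_coefficient("night", "nacht"))
```

Here "night" and "nacht" each have four bigrams and share only "ht", so the ratio is 1 to an average of 4.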
The edit distance between two strings is the minimum number of operations, including insertions, deletions and substitutions, which is required to convert one string into the other. The length of the longest common substring is closely related to the edit distance, in that it is a special case of a weighted edit-distance in which insertions, deletions and substitutions are weighted differently [2].
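For reference, the unit-cost (Levenshtein) case can be computed with the usual dynamic programming recurrence. The sketch below is a plain, non-secure implementation; the function name is chosen only for illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```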
The Jaro-Winkler string comparator was developed by the United States Census Bureau [14]. It finds the number of common characters in the two strings and the number of transpositions of those common characters. It also considers characters which are approximately matched in some sense and the number of consecutive matching characters in the beginning of the string. We outline the Jaro-Winkler measure in detail in Section.
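The following Python sketch shows one commonly cited formulation of the measure, computed in the clear. It is an illustration under stated assumptions rather than the definition used later in this document: the search window of half the longer length minus one, the prefix scale of 0.1 and the prefix cap of 4 are conventional values assumed here, and the adjustments for approximately matching characters are omitted.

```python
def jaro(a, b):
    """Jaro similarity from common characters and transpositions."""
    if not a or not b:
        return 0.0
    # Assumed window: characters count as common only if they are this close.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_flags = [False] * len(a)
    b_flags = [False] * len(b)
    common = 0
    for i, ca in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_flags[j] and b[j] == ca:
                a_flags[i] = b_flags[j] = True
                common += 1
                break
    if common == 0:
        return 0.0
    # Common characters in the order they appear in each string.
    a_common = [c for c, f in zip(a, a_flags) if f]
    b_common = [c for c, f in zip(b, b_flags) if f]
    transpositions = sum(x != y for x, y in zip(a_common, b_common)) // 2
    return (common / len(a) + common / len(b)
            + (common - transpositions) / common) / 3.0


def jaro_winkler(a, b, scale=0.1, max_prefix=4):
    """Boost the Jaro score by the number of matching leading characters."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return score + prefix * scale * (1.0 - score)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

For "MARTHA" and "MARHTA", all six characters are common, one transposition is counted, and the three matching leading characters raise the score from 17/18 to about 0.9611.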
Many different organizations in the health sector and in other domains collect a great deal of data which can benefit from record linkage. It is often not feasible to perform this record linkage when the sets of records which need to be linked belong to different organizations or to different political jurisdictions. In those cases, privacy and confidentiality considerations prohibit the sharing across organizational or jurisdictional boundaries of the information contained in those records. However, if it were possible to compare two records belonging to different databases in such a way that no information about the records, other than a measure of similarity, is learned by either party, then data linkage may be possible without jeopardizing privacy or confidentiality. The algorithms in this document describe one method for accomplishing a Jaro-Winkler string comparison between two strings in two separate databases in such a way that no information, other than some minimal innocuous information, is learned by either party.
The concept of secure multiparty computation was introduced in [15], in which a problem called the Millionaire Problem was described. In this problem, two millionaires must determine which of the two is richer without either being able to learn anything else about the worth of the other. A great deal of literature has been generated since then, describing various methods to solve similar problems, wherein a number of parties, each holding some secret information, wish to collectively compute some function of those secret inputs without any of the parties learning the secrets of any others. A review of some of these results, including a classification of different types of such problems can be found in [7].
In theory, a solution always exists for any secure multiparty computation problem. However, solutions which apply this general result directly are often impractical and inefficient. For this reason, more practical and efficient solutions are developed for specific problems.
Secure multiparty computation problems can differ in the assumptions which are made about the nature of the participants. One special case of particular interest is the assumption of semi-honest participants. A semi-honest participant is one who will learn anything they can from the data to which they gain access, but who will follow the protocol correctly. This generally rules out collusion, so that, for example, a semi-trusted third party can be used in a protocol involving two other participants. The third party can hold information which by itself reveals no secrets, but which could reveal secrets if the third party were to collude and share information with one of the other participants. A more general case relaxes this assumption, so that any participant may fail to follow the protocol and may collude with other parties. Such a participant is generally known as a malicious participant. Other common assumptions include the existence of secure channels of communication. Secure multiparty computation protocols with malicious participants and secure two party protocols, neither of which can make use of a semi-trusted third party, tend to be more computationally intensive and involve the use of elements common in public key cryptography. In this sense, the strength of security of such a protocol is tied to the strength of the public key system to which it is related. For example, the security of the protocols will depend on assumptions related to the difficulty of solving problems such as the discrete logarithm problem, the decision Diffie-Hellman problem, or other similar problems underlying the security of public key systems.
We have identified six methods which have been published for secure computation of string comparators. Four of those consider the semi-honest participant case, and two cover a two party case without any assumptions of honesty. String comparators of the form $\sum_{i=1}^{|a|} f(a_i, b_i)$ are considered in [8] for arbitrary functions $f$ and for some special cases of $f$, where $a_i$ and $b_i$ represent the $i$th characters of strings $a$ and $b$ respectively. Solutions are given for both the semi-honest three party case and the two party case.
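To make the form of these comparators concrete, the sketch below evaluates such a position-wise sum in the clear for a caller-supplied function f. The two example choices of f (a mismatch indicator and a squared difference of character codes) are illustrative assumptions and are not claimed to be the special cases treated in [8]; the sketch also assumes the two strings have equal length.

```python
def positional_score(a, b, f):
    """Sum f(a_i, b_i) over aligned positions of two equal-length strings."""
    if len(a) != len(b):
        # Assumption: the sum runs over |a| positions, so b must match in length.
        raise ValueError("strings must have the same length")
    return sum(f(x, y) for x, y in zip(a, b))


def mismatch(x, y):
    """Hypothetical choice of f: 1 where the characters differ, else 0."""
    return 0 if x == y else 1


def squared_difference(x, y):
    """Hypothetical choice of f: squared difference of character codes."""
    return (ord(x) - ord(y)) ** 2


print(positional_score("karolin", "kathrin", mismatch))
```

With the mismatch indicator, the sum is the number of positions at which the two strings differ.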
The algorithm in [5] is a three party protocol which can be used to securely calculate any n-gram Dice coefficient in the semi-honest case. An example is given for the bigram case. This algorithm is very efficient in the case where all of the strings in one set must be compared to all of the strings in some other set.
In [6], a method is presented for secure computation of the Jaro-Winkler comparator in the semi-honest case. This method was considered impractical by the authors, because of the complexity of the algorithm, and was only included in the work in order to stimulate thought. Probably because of this, the outline of the method appears somewhat hastily written and contains a number of typographical errors and a number of phrases with unclear meaning, so that it is not possible to determine conclusively the steps of the method. In any case, the method is either incorrect, insecure or both.
To see this, we define a binary matrix such that the element in the jth row and kth column of the matrix is a one if and only if the jth character of the first string is the same as the kth character of the second string. It appears that, according to the method, the entity calculating the value of the string comparator gains access to a number of tuples of the form (agreementflag, i), where agreementflag is half the value of one of the elements of the matrix and i is somehow related to the row or column index. It appears, due to similarity of notation, that i represents either the row or column index, but the meaning of i is not given in the discussion. If i represents only one of the row or column indices, then there is not enough information available to the calculator to obtain either the number of characters in common or the number of transpositions correctly, since both of those depend on the relative order of both sets of characters. If i represents information containing both the row and column indices, then the algorithm divulges much more than the length of each string as claimed. Specifically, the entire binary matrix is learned, and from it one can learn, for example, partial information about the Character Group Size Histogram (CGSH) of each of the strings, which we define below.
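The binary matrix described above can be built directly, as in the following Python sketch. The helper that groups identical rows is a hypothetical addition, included only to suggest the kind of structure a holder of the full matrix could read off; it is not part of the method in [6].

```python
def agreement_matrix(first, second):
    """Entry [j][k] is 1 iff character j of first equals character k of second."""
    return [[1 if cj == ck else 0 for ck in second] for cj in first]


def repeated_row_groups(matrix):
    """Group row indices whose nonzero rows are identical.

    Rows that are equal and nonzero correspond to positions of the first
    string holding the same character, which the matrix therefore exposes.
    """
    groups = {}
    for j, row in enumerate(matrix):
        if any(row):
            groups.setdefault(tuple(row), []).append(j)
    return [rows for rows in groups.values() if len(rows) > 1]


m = agreement_matrix("anna", "nan")
print(m)
print(repeated_row_groups(m))
```

For "anna" and "nan", rows 0 and 3 coincide, as do rows 1 and 2, so the matrix alone shows which positions of the first string hold equal characters.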
The modifications to the original Jaro string comparator, such as the one by Winkler, which considers the number of consecutive matching leading characters, and others which consider the effect of approximate matching of characters and adjustments for long strings, are not considered in this work.
In [12], the authors apply a secure approximate inner product protocol to the calculation of two string distance metrics: TFIDF and SoftTFIDF. These are distance metrics which are primarily used for long strings, although they can also be applied to shorter strings. The TFIDF metric consists of the inner product of two vectors, one defined for each string. The vectors, in turn, contain information about the frequency of each sub-string in a string and the frequency of each sub-string in the database of possible strings as a whole. The SoftTFIDF metric is similar to the TFIDF metric, except that it employs some additional metric, which measures the similarity of sub-strings. The authors of [12] claim that they have achieved good results using the SoftTFIDF metric with the Jaro-Winkler metric as the additional metric. The method assumes semi-honest participants in that the secure inner product used is based on a secure cardinality of intersection protocol. The intersection protocol requires a semi-trusted third party when there are only two parties with input, as is the case here.
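The inner product structure of the TFIDF metric can be sketched in Python as follows, computed in the clear. The particular weighting, log(1 + tf) times log(N / df) followed by normalization to unit length, is one common variant and is an assumption here; the text does not give the exact weights used in [12]. All names are hypothetical, and tokens are assumed to occur somewhere in the reference collection.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, n_docs):
    """Unit-length sparse vector of TF-IDF weights for one token list."""
    tf = Counter(tokens)
    # Assumed weighting: log(1 + term frequency) * log(N / document frequency).
    weights = {t: math.log(1 + c) * math.log(n_docs / doc_freq[t])
               for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def tfidf_similarity(u, v):
    """Inner product of two sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


corpus = [["john", "smith"], ["jon", "smith"], ["mary", "jones"]]
doc_freq = Counter(t for doc in corpus for t in set(doc))
vectors = [tfidf_vector(doc, doc_freq, len(corpus)) for doc in corpus]
print(tfidf_similarity(vectors[0], vectors[1]))
```

Records sharing only a frequent token such as "smith" receive a lower score than records sharing a rare one, which is the intended effect of the database-wide frequencies.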
In [2], a secure protocol is given for the edit distance between two strings. This is a generalization of the Levenshtein edit distance and is given by the minimum cost of transforming one string into another by applying insertions, deletions and substitutions. This work considers the general case where insertion and deletion costs depend arbitrarily on the character deleted or inserted and where substitution costs depend arbitrarily on the pair of characters involved in the substitution. The authors also provide somewhat more efficient versions of the algorithm for two special cases. The general algorithm and the special cases all require $O(n^2)$ homomorphic encryptions, where n is an upper bound on the length of both strings involved. The encryptions typically require at least one multiplicative group exponentiation operation (via a square-and-multiply or double-and-add algorithm) per encryption, and these operations tend to be computationally very expensive.
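The generalized edit distance itself, without any of the cryptographic machinery of [2], can be sketched as a dynamic program over a table with (|a|+1)(|b|+1) cells, which is quadratic in the string length. The cost functions below are caller-supplied, and their names are hypothetical.

```python
def weighted_edit_distance(a, b, ins_cost, del_cost, sub_cost):
    """Minimum total cost of turning a into b with character-dependent costs."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + del_cost(a[i - 1])   # delete all of a[:i]
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + ins_cost(b[j - 1])   # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + del_cost(a[i - 1]),
                d[i][j - 1] + ins_cost(b[j - 1]),
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),
            )
    return d[-1][-1]


def unit(_ch):
    """Unit insertion or deletion cost, regardless of the character."""
    return 1.0


def unit_sub(x, y):
    """Unit substitution cost; matching characters cost nothing."""
    return 0.0 if x == y else 1.0


print(weighted_edit_distance("kitten", "sitting", unit, unit, unit_sub))
```

With unit costs this reduces to the Levenshtein distance; other cost functions give the weighted variants.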
The method in [5] is efficient in that it requires only O(1) encryptions for each string. However, it suffers from weaker security in that it assumes semi-honest participants. In addition, the length of each participant's strings is leaked, and this information can be important for very long or very short strings.
The method in [12] uses a secure set intersection protocol. The most computationally expensive part of this algorithm is in the encryption of all of the elements in the set, which must be done with some commutative encryption scheme. The complexity of this algorithm is thus O(s) group exponentiations, where s is the number of samples used. In [12], s is quite large, on the order of 10,000. This would be quite time consuming, although a smaller number of samples may be appropriate for shorter strings, such as personal names or surnames on the order of 10 characters. This method appears to be designed for much longer strings. In addition, this method also suffers from the semi-honest requirement.
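The role of commutative encryption in an intersection computation can be illustrated with a toy Python sketch. It is not the protocol of [12]: the message flow, the semi-trusted third party and the sampling step are all omitted, both keys are held in one process, and the exponentiation cipher, the 127-bit prime and the hashing into the group are assumptions chosen for brevity, with no claim of security. It shows only that each element costs one modular exponentiation per key and that doubly encrypted values can be compared without being decrypted.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; toy size, far too small for real use


def keygen():
    """Pick an exponent coprime to P - 1 so that encryption is a bijection."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def to_group(item):
    """Hash a string to a nonzero residue modulo P."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def encrypt(values, key):
    """Exponentiation cipher: (x^ka)^kb equals (x^kb)^ka modulo P."""
    return {pow(v, key, P) for v in values}


def intersection_size(set_a, set_b):
    """Count common elements by comparing doubly encrypted values."""
    ka, kb = keygen(), keygen()
    a_once = encrypt({to_group(x) for x in set_a}, ka)
    b_once = encrypt({to_group(y) for y in set_b}, kb)
    a_twice = encrypt(a_once, kb)  # in a protocol, done by the other party
    b_twice = encrypt(b_once, ka)
    return len(a_twice & b_twice)


print(intersection_size({"jo", "oh", "hn"}, {"jo", "on"}))
```

Each set element is exponentiated once under each key, so the number of exponentiations grows linearly with the number of elements, in line with the O(s) cost noted above.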