An edit distance, also called Levenshtein distance, between two symbol sequences is typically defined as the minimum number of edit operations required to transform a first sequence into a second sequence, with the allowable edit operations being insertion, deletion, or substitution of a single symbol at a time. The symbols can be bits, characters, numbers, glyphs, deoxyribonucleic acid (DNA) nucleotides, optical character recognition (OCR) characters, or fingerprint minutia maps, to name but a few examples.
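The edit distance is an important measure in a number of bioinformatics and data-mining applications. Further, the length of the longest subsequence that is common to two input sequences is closely related to the edit distance: when only insertions and deletions are allowed, the distance equals the sum of the two sequence lengths minus twice the length of the longest common subsequence. For example, the longest common subsequence in two input character strings “JANUARY” and “FEBRUARY” is “UARY”.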
The edit distance and the length of the longest common subsequence can both be determined via dynamic programming, e.g., using Wagner-Fischer or Needleman-Wunsch methods. If the lengths of the symbol sequences are n and m respectively, the dynamic programming solution involves determining the entries of a matrix of size n×m.
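As an illustration of the dynamic programming approach for the longest common subsequence, the following is a minimal sketch in Python. The function name lcs_length and the table layout, which adds one extra row and one extra column for the empty prefixes, are assumptions made for this example and are not taken from the figures.

```python
def lcs_length(s: str, t: str) -> int:
    """Return the length of the longest common subsequence of s and t."""
    m, n = len(s), len(t)
    # table[i][j] holds the LCS length of the prefixes s[:i] and t[:j].
    # Row 0 and column 0 correspond to an empty prefix and stay at zero.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                # Matching symbols extend the best subsequence of the
                # two shorter prefixes by one.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise drop one symbol from either sequence.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]


print(lcs_length("JANUARY", "FEBRUARY"))  # prints 4, the length of "UARY"
```

Each table entry depends only on its upper, left, and upper-left neighbors, which is why the number of entries to be determined grows with the product of the two sequence lengths.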
FIG. 1 shows pseudocode for a function EditDistance that takes two input character strings, s of length m, and t of length n, and computes the edit distance between those two strings.
FIG. 2 shows a matrix 210 used by the dynamic programming solution for two character strings “FAST” 230 and “FIRST” 240. Elements of the matrix are determined recursively based on values of previously determined elements. At the end, the bottom-right element of the matrix provides the edit distance 220.
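The recursion described above can be sketched in Python as follows. This is a hedged illustration in the style of the Wagner-Fischer method, not a transcription of the pseudocode of FIG. 1. The function name edit_distance_matrix and the parameters sub_cost and indel_cost are assumptions introduced for this example, and the sketch uses an extra row and column for the empty prefixes, which may differ from the layout of the matrix in FIG. 2.

```python
def edit_distance_matrix(s: str, t: str, sub_cost: int = 1, indel_cost: int = 1):
    """Return the full dynamic programming matrix for strings s and t.

    d[i][j] is the edit distance between the prefixes s[:i] and t[:j].
    sub_cost and indel_cost are hypothetical parameters for this sketch.
    """
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    # Turning a prefix into the empty sequence takes only deletions,
    # and building a prefix from the empty sequence only insertions.
    for i in range(m + 1):
        d[i][0] = i * indel_cost
    for j in range(n + 1):
        d[0][j] = j * indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else sub_cost
            d[i][j] = min(
                d[i - 1][j] + indel_cost,  # delete s[i-1]
                d[i][j - 1] + indel_cost,  # insert t[j-1]
                d[i - 1][j - 1] + cost,    # match or substitute
            )
    return d


matrix = edit_distance_matrix("FAST", "FIRST")
print(matrix[-1][-1])  # prints 2: substitute A with I, then insert R
```

The bottom-right entry is the edit distance. Note that the computation reads every symbol of both strings in the clear, which is the privacy limitation discussed next.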
However, the dynamic programming solution does not provide privacy. If two processors use the dynamic programming solution to find the edit distance between the sequences, each processor knows all the symbols from each string. In some applications, it is necessary to determine the edit distance without disclosing the symbols from the sequences.
In one example of an application that requires privacy, a first processor stores information about a patient whose DNA sequence is to be examined for susceptibility to genetic disorders, using for example the Smith-Waterman algorithm. A second processor has a database of sequences corresponding to these disorders. The patient information is not to be revealed to the second processor in order to preserve the privacy of the patient. Similarly, the second processor does not want to reveal the database, in order to protect its business, since isolating the sequences in the database may have required significant research, time and investment. Nevertheless, the first processor and the second processor need to determine the edit distance between the DNA sequence of the patient and the DNA sequences stored in the database to determine whether the DNA sequence of the patient approximately matches any of the diagnostic DNA sequences in the database of the second processor. DNA profiling in forensic sciences has similar privacy requirements.
Determining the edit distance under privacy constraints can use secure multiparty computation (SMC). In SMC, two parties can securely compute any function of their inputs as long as that function can be expressed as an algebraic circuit. However, solutions to the two-party computation problem, in which outputs and inputs are related by an algebraic circuit, are complicated to implement, even for a very small number of input sequences. Specifically, those methods rely on oblivious transfer protocols, which have high computational complexity and a high communication overhead. For a practical solution to problems such as the DNA matching problems described above, it is necessary to devise secure protocols that have manageable computational complexity and require a relatively small number of encrypted transmissions between the two parties.
For example, one method determines the edit distance securely using a third semi-honest but trusted processor. However, the third processor is not always available. Furthermore, the third processor can collude with one of the processors and thereby discover the sequences processed by the other processor.
Another method discloses a two-party symmetric protocol, i.e., the two processors incur exactly equal protocol overhead in terms of computational complexity and communication. At the end of the protocol, the two processors each possess additive shares of the n×m matrix described above, from which the edit distance is obtained. However, that method requires a large number of encrypted transmissions between the processors. Moreover, the method works efficiently only for a very limited variety of substitution costs.
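To make the notion of additive shares concrete, the following Python sketch splits a single matrix entry into two shares that sum to the entry modulo a fixed modulus. It is a generic illustration of additive sharing only and does not reproduce the protocol of the method referred to above. The names MODULUS, split_into_shares, and reconstruct, as well as the choice of modulus, are assumptions made for this example.

```python
import secrets

MODULUS = 2**32  # hypothetical modulus, chosen only for illustration


def split_into_shares(value: int, modulus: int = MODULUS):
    """Split value into two additive shares modulo modulus."""
    # The first share is uniformly random, so on its own it reveals
    # nothing about value.
    share_a = secrets.randbelow(modulus)
    # The second share is chosen so that the two shares sum to value.
    share_b = (value - share_a) % modulus
    return share_a, share_b


def reconstruct(share_a: int, share_b: int, modulus: int = MODULUS) -> int:
    """Recover the shared value by adding both shares."""
    return (share_a + share_b) % modulus


# Example: share one matrix entry, such as an edit distance of 2.
a, b = split_into_shares(2)
print(reconstruct(a, b))  # prints 2
```

Under this scheme, neither processor learns a matrix entry from its own share alone; the entry is recovered only when both shares are combined.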
Accordingly, there is a need for a computationally efficient method for securely determining an encrypted edit distance between two symbol sequences.