The present invention relates to a method of identifying matches in data strings and square matrices and, more particularly, to a method of identifying non-exact matches in data strings and square matrices.
It is well known that maintenance of large computer programs is a problem because programmers having different programming styles tend to work on the same program. Many times when programmers revise large sections of the program for the purpose of adding new features or revising old features, there is a tendency to duplicate sections of the program and revise the duplicated sections. Such duplication occurs even though it is known that duplicating sections of the program typically makes the program larger, more complex and more difficult to maintain. The programmer usually modifies the duplicated section while still maintaining the old section in the program. The duplicated section may be further copied and modified as the program is revised. In time, the amount of duplication in the program can become substantial, thereby resulting in significant inconsistencies between different sections of the program.
Many times, the ongoing revisions of a program result in sections of the program that are not identical, but are similar in content except for a systematic change in parameter names such as identifiers and constants. For example, in one section of the code the parameters first, last, 0 and fin may be used and in another section of the code these parameters may be replaced by init, final, 1 and g. Identifying these types of approximate matches is difficult since there is no way to identify whether symbols are different because of renaming or because the symbols represent different values.
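The kind of systematic renaming described above can be illustrated with a minimal sketch. It assumes that each code segment has already been split into tokens and that a hypothetical helper, is_parameter, decides which tokens count as parameters (identifiers and numeric constants here). Two segments are treated as matching when their fixed tokens are identical and their parameter tokens correspond one-to-one. The function names, the keyword set and the sample segments are illustrative assumptions, not part of any method described here.

```python
# Illustrative assumption: a tiny keyword set; real languages have many more.
KEYWORDS = {"for", "if", "while", "return"}


def is_parameter(token):
    """Hypothetical rule: identifiers and integer constants are parameters."""
    return (token.isidentifier() or token.isdigit()) and token not in KEYWORDS


def parameterized_match(a, b):
    """Return True if token lists a and b are equal up to a consistent,
    one-to-one renaming of their parameter tokens."""
    if len(a) != len(b):
        return False
    forward, backward = {}, {}  # renaming a -> b and its inverse b -> a
    for x, y in zip(a, b):
        if is_parameter(x) != is_parameter(y):
            return False
        if not is_parameter(x):
            if x != y:  # fixed tokens (keywords, operators) must be identical
                return False
            continue
        # Each parameter must always map to the same partner, in both directions.
        if forward.setdefault(x, y) != y or backward.setdefault(y, x) != x:
            return False
    return True


seg1 = "for ( i = first ; i < last ; i ++ ) fin [ i ] = 0 ;".split()
seg2 = "for ( i = init ; i < final ; i ++ ) g [ i ] = 1 ;".split()
seg3 = "for ( i = init ; i < init ; i ++ ) g [ i ] = 1 ;".split()

print(parameterized_match(seg1, seg2))  # consistent renaming
print(parameterized_match(seg1, seg3))  # first and last both renamed to init
```

In this sketch seg1 and seg2 differ only by the renaming first to init, last to final, 0 to 1 and fin to g, so they are reported as matching, whereas seg3 maps two different parameters to the same name and is rejected. The sketch compares two segments of equal length only; it does not search a whole program for such segments.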
One way to identify similar code segments is to identify a pattern comprised of parameter names and constant names which represent a particular code segment to be identified. The pattern is compared to a program to identify code segments which match the pattern. However, such a method does not address the situation in which two code segments which perform the same function include parameters which have different names.
A method for identifying exact matches is the Boyer-Moore (BM) method, described in R. S. Boyer et al., "A Fast String Searching Algorithm", Commun. ACM, Vol. 20, No. 10, Oct. 1977, pp. 762-772 and D. Knuth et al., "Fast Pattern Matching in Strings," SIAM J. Comput., 6 (1977), pp. 323-350. Based on information contained in two tables, the BM method calculates the number of positions the text pointer can be shifted to avoid a known mismatch. The first table indicates the smallest number of positions the text pointer can be shifted which will cause a mismatched text symbol in the current alignment to be aligned with a like symbol in the pattern string. The second table indicates the smallest number of positions the text pointer can be shifted which will cause text symbols which match a particular portion of the pattern string to match a different analogous portion of the pattern string after the pattern string has been shifted by a predetermined number of positions. The BM method examines the text symbols in a right-to-left direction. While this method can detect exact matches, it is unable to detect non-exact matches.
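The role of the first table can be illustrated with a simplified sketch. It is not the full BM method: it builds only a table of the rightmost position of each pattern symbol (the first table) and omits the second table entirely, so its shifts are sometimes smaller than those of the complete method. The function names are illustrative assumptions.

```python
def build_last_occurrence(pattern):
    """Map each symbol to the index of its rightmost occurrence in pattern."""
    return {symbol: index for index, symbol in enumerate(pattern)}


def bm_search_bad_character(text, pattern):
    """Return the start index of every exact occurrence of pattern in text,
    using only the mismatched-symbol (first) table to choose shifts."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    last = build_last_occurrence(pattern)
    matches = []
    start = 0  # index in text aligned with pattern[0]
    while start <= n - m:
        j = m - 1
        # Compare right to left within the current alignment.
        while j >= 0 and pattern[j] == text[start + j]:
            j -= 1
        if j < 0:
            matches.append(start)
            start += 1  # conservative shift after a full match
        else:
            # Align the mismatched text symbol with its rightmost occurrence
            # in the pattern, or move past it if the pattern lacks the symbol.
            shift = j - last.get(text[start + j], -1)
            start += max(1, shift)  # never shift backwards
    return matches


print(bm_search_bad_character("abracadabra", "abra"))  # [0, 7]
```

Because every comparison tests symbols for equality, a single renamed symbol breaks the match, which is why a search of this kind finds exact occurrences only.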
Similar problems exist for identifying non-exact matches in two dimensional cases, such as square matrices. The ability to identify two dimensional matches is useful in low-level image processing and in conjunction with visual databases which are used in multimedia systems. However, because of the amount of data which must be stored to identify matches in a square matrix and the time involved in making the necessary comparisons, such matching is normally inefficient and difficult to perform.