The present invention relates to a method of identifying pattern matches and duplication in coded structures and square matrices and, more particularly, to a method of identifying pattern matches in coded structures and square matrices via encodings as special kinds of strings, namely parameterized strings and linear strings respectively.
In a large ongoing systems project, introduction of new features and code maintenance by large staffs of programmers may result in code that includes many duplicated sections. Such duplication occurs even though it is known that copying code may make the code larger, more complex and more difficult to maintain. Many times when a revision is made to a large software system, the programmer copies and modifies old code while still maintaining the old code in the system. The copies may be further copied and modified as the system is revised. In time, the amount of duplication in the system can become substantial and significantly complicate maintenance.
Various methods are known for finding duplication in symbol strings. One such method is calculating a matrix A such that A[i,j] is 1 if symbol i matches symbol j and 0 otherwise, followed by searching the diagonals of the matrix for maximal matches. However, this method takes quadratic time and quadratic space with respect to the length of the string. For large strings, such as computer programs containing millions of lines of code, such a method is impractical. Another method that can be used for such large strings makes use of data structure models known as suffix trees, for example as described in E. McCreight, "A Space-Economical Suffix Tree Construction Algorithm", Vol. 23, No. 2, Journal of A.C.M., 262-272 (April 1976). Such suffix trees can be used for finding code duplications in large computer programs.
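By way of illustration only, the matrix method described above may be sketched as follows. The function name `maximal_matches_by_diagonal` and the `min_len` parameter are hypothetical and are introduced solely for this example. The sketch builds the full matrix A and then scans each diagonal above the main diagonal for runs of matching symbols, which makes the quadratic time and space cost apparent.

```python
def maximal_matches_by_diagonal(s, min_len=2):
    """Return (i, j, length) triples for maximal matching runs with i < j.

    Illustrative sketch only: uses O(n^2) time and O(n^2) space.
    """
    n = len(s)
    # A[i][j] is 1 if symbol i matches symbol j, and 0 otherwise.
    A = [[1 if s[i] == s[j] else 0 for j in range(n)] for i in range(n)]
    matches = []
    # Each diagonal above the main one has a fixed offset d = j - i.
    for d in range(1, n):
        run = 0
        for i in range(n - d):
            if A[i][i + d]:
                run += 1
            else:
                if run >= min_len:
                    matches.append((i - run, i - run + d, run))
                run = 0
        # A run may extend to the end of the diagonal.
        if run >= min_len:
            end = n - d
            matches.append((end - run, end - run + d, run))
    return matches


if __name__ == "__main__":
    print(maximal_matches_by_diagonal("abcxabcy"))
```

For a string of n symbols the matrix holds n*n entries, so an input of a million lines would require on the order of 10^12 entries, which is why the method is impractical for large programs.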
Many times the ongoing revisions of the code result in sections of code that are not identical, but are similar in content except for a systematic change in parameter names, such as identifiers and constants. For example, in one section of the code the parameters first, last, 0 and fin may be used and in another section of the code these parameters may be replaced by init, final, 1 and g. The correspondence between sections of code which are similar except for labeling of parameters is referred to as a parameterized match. These "parameterized" matches cannot be found using the methods described above.
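The notion of a parameterized match may be illustrated by the following minimal sketch, which tests whether two equal-length token sequences agree except for a consistent, one-to-one renaming of parameters. The function name `parameterized_match`, the predicate `is_param`, and the sample token sequences are assumptions made for this example. The sketch only checks a given pair of sections and does not search a body of code for such pairs.

```python
def parameterized_match(a, b, is_param):
    """Return True if a and b differ only by a one-to-one parameter renaming."""
    if len(a) != len(b):
        return False
    fwd, bwd = {}, {}  # renaming a -> b and its inverse b -> a
    for x, y in zip(a, b):
        if is_param(x) != is_param(y):
            return False
        if not is_param(x):
            # Non-parameter tokens must be identical.
            if x != y:
                return False
            continue
        # Parameter tokens must follow one consistent, invertible renaming.
        if fwd.setdefault(x, y) != y or bwd.setdefault(y, x) != x:
            return False
    return True


# Hypothetical token sequences using the parameter names from the text.
PARAMS = {"first", "last", "0", "fin", "init", "final", "1", "g"}
section_a = ["copy", "(", "first", ",", "last", ")", ";",
             "set", "(", "fin", ",", "0", ")"]
section_b = ["copy", "(", "init", ",", "final", ")", ";",
             "set", "(", "g", ",", "1", ")"]

if __name__ == "__main__":
    print(parameterized_match(section_a, section_b, lambda t: t in PARAMS))
```

Here first, last, 0 and fin correspond to init, final, 1 and g respectively. A plain symbol-by-symbol comparison of the two sequences would fail at the first parameter.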
Clearly, the amount of code can be reduced if sections of code that are identical except for the labeling of parameters are replaced by a single subroutine. Thus it is desirable to find parameterized matches in bodies of code in which sections of code are identical except for parameters. Further, it is desirable to be able to find parameterized matches using suffix trees. By identifying parameterized matches in a text of code, problems such as inconsistent code and plagiarism of code can be detected.
A problem also exists in designing a data structure model which efficiently represents a two-dimensional analog of a suffix tree for square text matrices. Such an application is useful in low-level image processing, and in conjunction with visual databases for use in multimedia systems. The suffix tree must represent all substrings of the text in an index which can be directly queried and which is efficient in both its representation and storage requirements.
There is a need for a method of identifying pattern matches in parameterized strings and square matrices which is more efficient in both time and space. In addition, the method should be capable of identifying only those pattern matches which exceed a threshold length, thereby maximizing the information content of the matches reported. There is also a need for being able to provide a linear representation of a square matrix for addressing problems which arise in low-level image processing and visual databases.