A composition structure for a large software program is an "organization chart" that groups procedures into modules, modules into subsystems, subsystems into bigger subsystems, and so on. The composition structure chart is used for project management, integration planning, design, impact analysis, and almost every other part of software development and maintenance. In a well-designed system, each module or subsystem contains a set of software units (procedures, data structures, types, modules, subsystems) that collectively serve a common purpose in the system. However, because the purposes and roles of software units frequently overlap, it is not always easy to decide how a system should be divided up. Furthermore, during the evolution of a large software system over several years, many software units are added, deleted, and changed. The resulting organization chart may no longer have any technical rationale behind it, but may instead be the result of economic expediency, or simple neglect. Its poor quality then increases the cost of software maintenance, by impeding technical analysis.
Generally, when a software system that has been developed by a large team of programmers has matured over several years, changes to the code may introduce unexpected interactions between diverse parts of the system. This can occur because the system has become too large for one person to fully understand, and the original design documentation has become obsolete as the system has evolved. Symptoms of structural problems include too many unnecessary recompilations, unintended cyclic dependency chains, and some types of difficulties with understanding, modifying, and testing the system. Most structural problems cannot be solved by making a few "small" changes, and most require the programmer to understand the overall pattern of interactions in order to solve the problem.
A field of application of the present invention is in the implementation of a software architect's "assistant" for a software maintenance environment. The "assistant" is a computer program for helping the software architect to analyze the structure of the system, specify an architecture or chart for it, and determine whether the actual software is consistent with the specification. Since the system's structural architecture may never have been formally specified, it is also desirable to be able to "discover" the architecture by automatically analyzing the existing source code. It should also be possible to critique an architecture by comparing it to the existing code and suggesting changes that would produce a more modular specification.
A common approach to structural analysis is to treat cross-reference information as a graph in which software units appear as nodes and cross-references appear as edges. Various methods, both manual and automatic, may then be used to analyze the graph. Recent work has used clustering methods to summarize cross-reference graphs, by clustering nodes into groups, and then analyzing edges between groups. See, e.g., Yoelle S. Maarek and Gail E. Kaiser, "Change Management in Very Large Software Systems", Phoenix Conference on Computer Systems and Communications, IEEE, March 1988, pp. 280-285, and Richard W. Selby and Victor R. Basili, "Error Localization During Software Maintenance: Generating Hierarchical System Descriptions from the Source Code Alone", Conference on Software Maintenance--1988, IEEE, Oct. 1988. Other currently available methods for recovering, restructuring, or improving the composition structure chart are "manual", involving much reading and trial and error.
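The graph view described above can be illustrated with a minimal sketch. The unit names, the grouping, and the helper `group_edges` are hypothetical and chosen only for illustration; they are not taken from the cited work. The sketch treats each cross-reference as a directed edge and summarizes the graph by counting the edges that cross group boundaries.

```python
from collections import Counter

# Hypothetical cross-references: (referencing unit, referenced unit).
cross_refs = [
    ("parse_expr", "next_token"),
    ("parse_stmt", "parse_expr"),
    ("parse_stmt", "emit_code"),
    ("emit_code", "write_bytes"),
    ("next_token", "read_char"),
]

# Hypothetical assignment of software units (nodes) to groups.
group_of = {
    "read_char": "lexer", "next_token": "lexer",
    "parse_expr": "parser", "parse_stmt": "parser",
    "emit_code": "backend", "write_bytes": "backend",
}

def group_edges(cross_refs, group_of):
    """Count cross-references whose endpoints lie in different groups."""
    counts = Counter()
    for src, dst in cross_refs:
        g_src, g_dst = group_of[src], group_of[dst]
        if g_src != g_dst:  # edges inside one group are not counted
            counts[(g_src, g_dst)] += 1
    return counts

summary = group_edges(cross_refs, group_of)
for (g_src, g_dst), n in sorted(summary.items()):
    print(f"{g_src} -> {g_dst}: {n} cross-reference(s)")
```

With this data, only two group-level edges remain (parser to lexer and parser to backend), which is the kind of summary a reader could inspect in place of the full node-level graph.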
Clustering algorithms may be either batch or incremental. A batch algorithm looks at all of the data on all objects before beginning to cluster any of them. An incremental algorithm typically looks at one object at a time, clustering it with the objects it has already looked at before looking at the next object. The heart of a batch algorithm is the similarity measure, which is a function that measures how "similar" two groups of objects are (each group can have one or more members). The batch algorithm takes a large set of individual objects and places them together in groups, by repeatedly finding the two most similar objects (or groups of objects) and putting them together. Because each step joins exactly two items, the batch algorithm typically produces groups with two subgroups. This is unnatural for most purposes; instead it is preferable to merge some sub-groups to make larger groups.
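The batch procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the method of the invention or of the cited work: the similarity measure is assumed to be the Jaccard coefficient over sets of feature names, a group's features are assumed to be the union of its members' features, and the function names and sample data are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Assumed similarity measure: shared features / all features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def batch_cluster(features):
    """Repeatedly merge the two most similar groups until one remains.

    `features` maps each object name to a set of feature names.
    Returns a nested tuple in which every group has two subgroups.
    """
    # Each cluster is (tree, feature set); sorting keeps runs repeatable.
    clusters = [(name, frozenset(f)) for name, f in sorted(features.items())]
    while len(clusters) > 1:
        # Batch step: examine every pair before choosing one to merge.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: jaccard(clusters[p[0]][1], clusters[p[1]][1]),
        )
        (tree_i, feat_i), (tree_j, feat_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), feat_i | feat_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical objects described by the names they reference.
features = {
    "push": {"stack", "top"},
    "pop": {"stack", "top"},
    "enqueue": {"queue", "tail"},
    "dequeue": {"queue", "head"},
}
tree = batch_cluster(features)
print(tree)
```

The result is a binary tree, which shows the two-subgroup shape noted above; flattening selected levels of such a tree is one way larger groups could be formed.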
Prior art applications of clustering to software analysis have generally fallen into two categories. One category is conceptual clustering for re-use, as discussed in Maarek and Kaiser, referred to above. This work finds a way to specify the external interface of a software unit, including its function, and then classifies units drawn from many different systems to place them in a library where they can be found and re-used. This approach cannot use shared names for classification because two similar units drawn from different systems would use different names.
Another category is statistical clustering, which attempts to predict errors and the impact of changes, as discussed in Selby and Basili, referred to above. This work classifies the software units according to the number of "connections" between them, which may be procedure calls, data flow paths, or names used in one group that are the names of units in the other group. The resulting groups can be used to plan integration sequences for large software systems, and can be measured to predict the likelihood of errors in them. However, the groups do not have lists of shared characteristics that would explain to the programmer why they were grouped together. There is no evidence yet that the groups computed this way would be appropriate for describing the structure of the whole system.
It is therefore a general object of the present invention to automate the task of analyzing the composition structure of a large software program by using computerized clustering methods for grouping objects into groups according to similar attributes. It is a particular object of the invention to provide feedback on classification decisions that can lead to improved classification. Specifically, it is desired to provide a method for estimating the optimal coefficients for a similarity function which accounts well for the classification of objects in a category.