Retrieving similar or related web pages is a feature of popular search engines (e.g., GOOGLE™, ASK.COM™, HOTBOT™). For example, after a user submits a search query, GOOGLE™ displays links to relevant web pages along with a link labeled “Similar” next to each result. These “Similar” links point to web pages that GOOGLE™'s algorithm judges to be similar by aggregating various factors that may include target link association (as when one webpage contains links to each of the “Similar” web pages), topical similarities, and popularity scores of the retrieved pages. One benefit of the “Similar” link is that it lets users find pages similar to a known web page without the need to determine the proper keyword search to achieve this result.
One technical area where a similarity search function would be desirable is application development. A software application is a collection of all source code modules, libraries, and programs that, when compiled, result in the final deliverable that customers install and use to accomplish certain business functions. Detecting similarity between applications, however, is a notoriously difficult problem, in part because it means automatically detecting that the high-level requirements of these applications match semantically. Such detection is difficult for a variety of reasons. For example, many application repositories are polluted with poorly functioning projects, which could lead to non-functioning projects being misidentified as “similar” to functioning projects. Further, keyword searching may also lead to erroneous results because, for example, a keyword match between words in a requirements document and words in the descriptions or source code of an application does not guarantee relevance between the two corresponding applications. Also, applications may be highly similar to one another at a low level even if they do not perform the same high-level functionality, which could result in the misidentification of “similar” applications that perform dissimilar functions. Moreover, it may be difficult to recognize similarity between software artifacts belonging to different applications because programmers rarely record traceability links between different applications.
Knowing the similarity between applications plays an important role in assessing reusability of applications, improving understanding of source code, prototyping for rapid development, and discovering code theft and plagiarism. Allowing programmers to compare how different applications implement the same requirements may contribute to their knowledge about application requirements and to the efficient reuse of code. Retrieving a list of similar applications may allow programmers to concentrate on the new aspects of the requirements, thus saving time and resources. Programmers could instead spend this time understanding the functionality of similar applications and seeing the complete context in which that functionality is used.
Consider a typical project in a large-scale software development enterprise in which company programmers engage in several hundred software projects at the same time. The enterprise may have previously delivered thousands of applications, many of which may have had similar requirements and implementations to the project at hand.
A typical project starts with writing a proposal in response to a bid request from a company that needs an application. A winning bid proposal has many components: well-written requirements, preliminary models and design documents, and proof of experience in building and delivering similar applications in the past. A company that submits a bid proposal that contains these components with the closest correlation to a desired application will likely win the bid. Reusing components from applications that were successfully delivered in the past will save time and resources and further increase the chances of winning the bid. Thus, recognizing similarities between past and present applications is important for preserving knowledge, leveraging experience, winning bids on future projects, and successfully building new applications.
The process of finding similar applications may start with code search engines that return code fragments and documents in response to queries that contain keywords from elicited requirements. However, returned code fragments are of little help when many other non-code artifacts (e.g., different functional and non-functional requirements documents, UML models, or design documents) are required. Matching words in queries against words in documents and source code may be a good starting point, but keyword search results do not establish how applications are similar at a high level.
One problem in detecting closely related applications is the mismatch between the high-level intent reflected in the descriptions of these applications and the low-level details of their implementation. This problem is known as the concept assignment problem. For any two applications, simply matching words in the descriptions of the applications, comments in their source code, and the names of program variables and types (e.g., names of classes and functions as well as identifiers) is too imprecise a way to establish their similarity. Thus, existing code search engines do not effectively detect similar applications, and programmers must typically invest significant intellectual effort to analyze and understand the functional similarity of retrieved applications.
Similarities between documents can be found using syntagmatic associations by considering documents similar when terms in these documents occur together in each document. This technique is used by the MUDABlue similarity engine. Alternatively, similarities between documents can be found using semantic anchors and by developing paradigmatic associations where documents contain terms with high semantic similarities. Semantic anchors are elements of documents that precisely define the documents' semantic characteristics. Semantic anchors may take many forms. For example, they can be expressed as links to web sites that have high integrity and well-known semantics (e.g., cnn.com or whitehouse.gov) or they can refer to elements of semantic ontologies that are precisely defined and agreed upon by different stakeholders. Without semantic anchors, documents (or applications) are considered as collections of words with no semantics, and the relevance of these documents to user queries (and to one another) is determined by matches between words. Using semantics represents the essence of paradigmatic associations between documents, whereas using word matching represents the essence of syntagmatic associations.
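The limitation of pure word matching described above can be illustrated with a minimal sketch. It assumes a simple Jaccard overlap of word sets as the matching measure. This is an illustrative choice, not the algorithm of MUDABlue or of any particular search engine, and the three application descriptions are invented for the example.

```python
import re

def tokens(text):
    """Lowercase a text and return the set of alphabetic words it contains."""
    return set(re.findall(r"[a-z]+", text.lower()))

def word_overlap(doc_a, doc_b):
    """Jaccard overlap between the word sets of two documents (0.0 to 1.0)."""
    a, b = tokens(doc_a), tokens(doc_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical application descriptions, invented for illustration only.
player_one = "Plays audio files and manages music playlists"
player_two = "Jukebox for listening to MP3 songs"
file_tool = "Manages files and folders and plays nicely with backups"

# Two music players described with different vocabulary share no words,
# while an unrelated file utility shares several words with the first player.
same_purpose = word_overlap(player_one, player_two)
different_purpose = word_overlap(player_one, file_tool)

print(f"same purpose, different words: {same_purpose:.2f}")
print(f"different purpose, shared words: {different_purpose:.2f}")
```

In this sketch the two music players score lower than the unrelated pair, because the measure sees only words and has no anchor to their meaning.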
Programmers routinely use Application Programming Interface (API) calls from third-party packages (e.g., the Java Development Kit (JDK)) to implement various requirements. Unlike names of program variables, types, and words used in comments, API calls from well-known and widely used libraries have precisely defined semantics. Since programs contain API calls with precisely defined semantics, the API calls may serve as semantic anchors to compute the degree of similarity between applications by matching the semantics of applications as expressed by the API calls. Using the API calls to compute similarities among applications may result in better precision than syntagmatic associations among applications.
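As a rough sketch of how API calls might serve as semantic anchors, the example below represents each application as a bag of fully qualified API call names and compares applications by cosine similarity. Both the cosine measure and the per-application call lists are assumptions made for illustration. The source does not specify a particular similarity formula, and the application names and call lists are hypothetical.

```python
import math
from collections import Counter

def api_similarity(calls_a, calls_b):
    """Cosine similarity between two applications' API call frequency vectors."""
    a, b = Counter(calls_a), Counter(calls_b)
    dot = sum(a[name] * b[name] for name in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical call lists, as if extracted from three applications.
player_one = [
    "javax.sound.sampled.AudioSystem.getAudioInputStream",
    "javax.sound.sampled.AudioSystem.getClip",
    "javax.sound.sampled.Clip.open",
    "java.io.File.listFiles",
]
player_two = [
    "javax.sound.sampled.AudioSystem.getAudioInputStream",
    "javax.sound.sampled.AudioSystem.getClip",
    "javax.sound.sampled.Clip.open",
]
backup_tool = [
    "java.io.File.listFiles",
    "java.nio.file.Files.copy",
    "java.nio.file.Files.copy",
]

players = api_similarity(player_one, player_two)
player_vs_backup = api_similarity(player_one, backup_tool)

print(f"player_one vs player_two:  {players:.3f}")
print(f"player_one vs backup_tool: {player_vs_backup:.3f}")
```

Because the audio API calls carry the same meaning in every application that uses them, the two players score as similar regardless of how their authors named variables or worded their descriptions.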
Therefore, a method of finding similarities in applications based on the underlying semantics of the applications would be useful, allowing programmers who need to find similar applications to do so with less intellectual and manual effort than current search methods require. A method of finding a similar application based on underlying semantics would also be useful to help preserve the knowledge base and to correlate supporting software documentation across similar applications.