Many organizational users of protectable content (e.g., open source software code, proprietary software code, freeware) are concerned with identifying the composition of the protectable content included in their code libraries as well as the license provisions associated with that protectable content. Understanding the components of a user's protectable content helps the user determine whether the user's protectable content and/or particular usages of that protectable content are in compliance with all applicable software license provisions and/or the user's associated use policies. For example, a user may seek to identify whether a snippet or portion of open source code has inadvertently or intentionally been introduced into an item of protectable content within the user's code library and whether the introduced code renders the user's protectable content noncompliant with applicable license provisions and/or the user's protectable content use policies.
To determine the composition of an item of protectable content, some have proposed a partially automated discovery process that involves analyzing the item of protectable content for snippets or segments of code that are similar to or that “match” snippets or segments of code contained within known open source or proprietary content that has been assembled within a library of existing content for comparison (“comparison content”). After a comparison between the user's protectable content and the comparison content is complete, the user receives a list of match results that identifies all or substantially all of the items of comparison content that were found to contain snippets of code that match snippets of code from the user's item of protectable content. This process is generally known as deep discovery, deep dive, deep source scanning, or content matching (hereinafter “deep discovery” or “content matching”).
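By way of illustration only, the comparison step described above might be sketched as follows. The sketch assumes that a "snippet" is a run of three consecutive non-blank lines and that snippets "match" when their whitespace-normalized text hashes to the same value; neither assumption comes from any particular deep discovery tool. All names (`WINDOW`, `snippet_hashes`, `build_index`, `find_matches`) and the sample content are hypothetical.

```python
import hashlib
from collections import defaultdict

WINDOW = 3  # hypothetical snippet size, in lines (an assumption)


def snippet_hashes(text, window=WINDOW):
    """Yield a hash for each run of `window` consecutive non-blank lines."""
    # Collapse whitespace so indentation differences do not hide a match.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    lines = [line for line in lines if line]
    for i in range(len(lines) - window + 1):
        chunk = "\n".join(lines[i:i + window])
        yield hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def build_index(comparison_content):
    """Map each snippet hash to the items of comparison content containing it."""
    index = defaultdict(set)
    for name, text in comparison_content.items():
        for digest in snippet_hashes(text):
            index[digest].add(name)
    return index


def find_matches(user_text, index):
    """Return the sorted names of comparison items sharing any snippet."""
    hits = set()
    for digest in snippet_hashes(user_text):
        hits |= index.get(digest, set())
    return sorted(hits)


# Hypothetical comparison content: an original project, a fork of it,
# and an unrelated project.
LIBFOO = (
    "def area(w, h):\n"
    "    total = w * h\n"
    "    return total\n"
    "\n"
    "def perimeter(w, h):\n"
    "    return 2 * (w + h)\n"
)
LIBFOO_FORK = LIBFOO + "def diagonal(w, h):\n    return (w * w + h * h) ** 0.5\n"
UNRELATED = (
    "def greet(name):\n"
    "    message = 'hi ' + name\n"
    "    print(message)\n"
    "    return message\n"
)

# Hypothetical user content that reuses three lines of LIBFOO.
USER_CODE = (
    "import sys\n"
    "\n"
    "def area(w, h):\n"
    "  total = w * h\n"
    "  return total\n"
    "\n"
    "print(area(2, 3))\n"
)

index = build_index(
    {"libfoo": LIBFOO, "libfoo-fork": LIBFOO_FORK, "unrelated": UNRELATED}
)
match_results = find_matches(USER_CODE, index)
print(match_results)
```

In this sketch the match results name every item of comparison content that contains the shared snippet, so a single copied snippet can be reported against more than one item.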
While content matching is useful in determining the composition of a user's protectable content, current deep discovery techniques exhibit several deficiencies, including unacceptable levels of noise in the match results. More specifically, the nature of the open source software concept encourages software developers to access and make use of existing open source software code when developing new open source software code. As a result, items of protectable content (e.g., portions or snippets of code, code files, directory structures and/or trees, open source software projects and packages, and proprietary software applications) often exist as part of a complex network of interdependencies and interrelationships. Conventional content matching analysis methods lack the ability to differentiate between an original source of a snippet of code and various other items of comparison content that contain the snippet of code but are merely related to the original source and are therefore duplicative or redundant. Consequently, conventional content matching analysis techniques generally produce match results that include “false positives,” that is, matches from among the items of comparison content that are inaccurate or erroneous, redundant, and/or unnecessary. The user must then review the match results to determine which of the match results represent original sources of the copied snippets, which are incorrectly identified, and which are correctly identified but are redundant and/or duplicative. This manual process of elimination is time-consuming and generally requires extensive knowledge of the interrelationships between the various items of comparison content identified in the match results.
Other deficiencies in current content matching analysis methods include: inefficiencies in the process of performing content matching analysis, including unreasonably lengthy analysis times; an inability to customize and/or optimize deep discovery analyses; difficulty identifying all matches, especially when interchangeable and/or nonfunctional elements have been removed or altered for the purpose of the content matching analysis; and difficulty securing or protecting the confidentiality of the user's protectable content during the course of a content matching analysis.