Software applications are often made up of many individual files written by various software developers, with each file containing software code that performs certain functions. As a software project grows in scope, the number and size of the individual files grow, as does the number of software developers working on the project. Many organizations that create software desire to organize the various files, keep track of versions, control access to the files to protect the software from being accidentally modified, and protect the software from being copied or distributed. Accordingly, organizations often use a code storage system to store software code. Such a code storage system often provides version control and is sometimes referred to as a code repository. The code storage system may be a designated directory on a computing system, or the storage system may include software, for example a version control system, that keeps track of the various versions of the software, the identity of a person who has modified each file, the identity of a person who has checked a file out of the repository, etc. Often, a storage system with version control allows an organization to track versions of the code files and requires authorization to access or update the files. In conjunction with the code storage system, an organization may create policies that define non-compliant use or copying of the source code, such as policies that prevent source code, or any part of the source code, from being copied to laptop computers, attached to emails, copied to directories visible outside the organization over a network such as the Internet, or copied to any other unsecure location. Organizations may designate the files in the code storage system, or part of the code storage system, as protected, and the non-compliant use policies may apply to any portion of the source code files stored in the protected locations.
Source code stored in a central location may be searched for copies of code snippets. A code snippet is a portion of code, for example a few lines of source code, that often performs some function. Software developers may desire to locate certain code snippets to determine which source code files contain the snippet, to use the code snippet as a template, or to make changes to each occurrence of the snippet in the code base. To facilitate the search process, a code search engine that works with the storage system may include a pre-calculated clone detection structure, such as an index, that can be used to detect code clones between files in the code base. A code clone may be a code snippet that appears in more than one file. Thus, a source code file may share one or more code clones with other source code files.
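As an illustration only, a pre-calculated clone detection structure of the kind mentioned above could be sketched as an index that maps a hash of each short run of lines to the files containing that run. The window size, the function names (`build_index`, `files_sharing_clones`), and the whitespace-only normalization are assumptions made for this sketch; the text does not specify how the index is built.

```python
import hashlib
from collections import defaultdict

WINDOW = 3  # hypothetical snippet size, in lines (an assumption of this sketch)


def normalized_lines(text):
    """Strip surrounding whitespace and drop blank lines so layout is ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def window_hashes(text, window=WINDOW):
    """Yield a hash for every run of `window` consecutive normalized lines."""
    lines = normalized_lines(text)
    for i in range(len(lines) - window + 1):
        chunk = "\n".join(lines[i:i + window])
        yield hashlib.sha1(chunk.encode("utf-8")).hexdigest()


def build_index(files):
    """Map each window hash to the set of file names that contain it."""
    index = defaultdict(set)
    for name, text in files.items():
        for digest in window_hashes(text):
            index[digest].add(name)
    return index


def files_sharing_clones(index, name):
    """Return the other files that share at least one window with `name`."""
    shared = set()
    for names in index.values():
        if name in names:
            shared |= names - {name}
    return shared


# Hypothetical code base: a.py and b.py share a three-line snippet.
files = {
    "a.py": "x = 1\ny = 2\nz = x + y\nprint(z)\n",
    "b.py": "  x = 1\n  y = 2\n  z = x + y\n",
    "c.py": "print('hi')\n",
}
index = build_index(files)
print(files_sharing_clones(index, "a.py"))  # {'b.py'}
```

Because the index is computed ahead of time, answering which files share a clone with a given file becomes a lookup rather than a pairwise comparison of every file in the code base.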
While such a pre-calculated clone detection structure may be beneficial for detecting code clones within a code storage system, some organizations may desire to determine whether code clones exist between a file not included in the code storage system and the files in the code storage system, or between files located in a location designated secure or secret and a location designated less protected or even public. As discussed above, some organizations may desire to protect source code stored in a protected location within the code storage system by limiting the places that the code or portions of the code exist. In some instances, such as with a version control system, the entire code storage system may be considered a protected code location. In other instances only certain directories or locations may be considered protected locations.
Organizational policies may prohibit snippets of source code, whether a few lines or an entire file, from being copied to or existing in unsecure or unauthorized locations, such as a mobile device (e.g., a laptop, tablet, smartphone, or USB drive), an email attachment, a server or directory visible to the public via the Internet, or a data center located in a particular country or region. While text-based comparison of the source code files can detect exact copies of code snippets, text-based comparisons cannot detect snippets that retain the same functionality but have been refactored or modified. Therefore, a challenge remains to identify clones between source code in a protected code location and files located in some other location, such as on a mobile device, attached to an email, or in some other unsecure or unauthorized location.
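The limitation of text-based comparison can be seen in a small sketch. The two hypothetical snippets below perform the same function but use different identifier names and one adds a comment, so an exact text comparison reports no match. Replacing every identifier with a placeholder before comparing, shown here with Python's standard `tokenize` module, is one common normalization that catches this kind of renaming; it is offered as an illustration and is not necessarily the technique any particular system uses. The function name `normalized_tokens` and both snippets are assumptions of the sketch.

```python
import io
import keyword
import tokenize


def normalized_tokens(source):
    """Tokenize Python source, dropping comments and hiding identifier names."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # comments and blank lines do not affect behavior
        if tok.type in (tokenize.INDENT, tokenize.DEDENT, tokenize.NEWLINE):
            tokens.append(tokenize.tok_name[tok.type])  # keep structure only
        elif tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")  # every identifier becomes the same placeholder
        else:
            tokens.append(tok.string)  # keywords, operators, literals
    return tokens


# Hypothetical protected snippet and a renamed copy found elsewhere.
original = (
    "def total(items):\n"
    "    result = 0\n"
    "    for item in items:\n"
    "        result += item\n"
    "    return result\n"
)
renamed = (
    "def sum_all(values):\n"
    "    acc = 0  # running sum\n"
    "    for v in values:\n"
    "        acc += v\n"
    "    return acc\n"
)

print(original == renamed)                                        # False
print(normalized_tokens(original) == normalized_tokens(renamed))  # True
```

This normalization only handles renamed identifiers and changed comments or layout. Copies that have been restructured more heavily, for example by reordering statements or replacing a loop with a library call, would still go undetected by this sketch.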