A codebase can be extremely large, for example, on the order of 800,000 source files. Running static analysis on the code in those source files can be very time-consuming (e.g., a single run could easily take a matter of days). To reduce this time, the static analysis may be run in a distributed environment.
In the case of auto-forward-declaring languages, such as Java, JavaScript, Python, C#, and Visual Basic, the source files are compiled in bunches, which have cyclic dependencies. The sizes of these bunches differ greatly from compilation target to compilation target. Thus, even if thousands of machines are used to perform the static analysis, some machines will complete almost instantly while others will take a long time to perform the analysis. For example, in the case of Java source files, some targets complete much faster than others. Consequently, machine resources are wasted and the analysis still takes a long time to complete.
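The imbalance described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each compilation target is analyzed on its own machine and that every machine stays reserved until the slowest target finishes. The function names and the per-target costs are hypothetical and are not taken from any measurement.

```python
def wall_clock_time(target_costs):
    """Time until the whole analysis completes when each target runs on its own machine."""
    return max(target_costs)


def idle_fraction(target_costs):
    """Fraction of reserved machine time spent idle while waiting for the slowest target."""
    reserved = wall_clock_time(target_costs) * len(target_costs)
    return 1 - sum(target_costs) / reserved


# Hypothetical analysis costs in minutes: four small targets and one very large one.
costs = [1, 2, 2, 3, 240]

print(f"wall-clock time: {wall_clock_time(costs)} minutes")
print(f"idle fraction:   {idle_fraction(costs):.1%}")
```

Under these assumed costs, adding machines does not shorten the run, because the largest target alone determines when the analysis completes.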
These auto-forward-declaring languages utilize libraries of code contained in modules. Library modules can contain predefined classes or other data structures that can be referenced by source files.
In the case of the Java programming language, Java classes (.class) are typically stored in JAR files (.jar). A JAR file is an archive file format used to aggregate many Java class files and associated metadata and resources (for example, text and images) into one file to distribute application software or libraries. Java source files (.java) that reference classes contained in JAR files are compiled into class files. Source files are files that contain source code to be compiled and that can be analyzed by an analyzer.
Library modules, or in the case of Java, JAR files, often contain more code than is required by references in source files. The amount of extra, unreferenced code varies from one JAR file to another. As software systems scale up, it has been determined that an extremely large amount of data is downloaded when performing static analysis (for example, in ad-hoc testing it was found that more than 200 MB of data was unnecessarily downloaded for an average target). Downloading data that is not required substantially increases the time needed to perform the analysis and also wastes computing resources.
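As a rough illustration of how the unreferenced portion of a library archive might be estimated, the sketch below builds a small in-memory JAR (a JAR is a ZIP archive) and sums the uncompressed sizes of the class entries that a given set of referenced class names does not cover. The helper names, the entry names, and the placeholder contents are hypothetical; the entries are not valid class files, and this is not presented as the method of any particular analyzer.

```python
import io
import zipfile


def build_demo_jar(entries):
    """Create an in-memory JAR (ZIP archive) from a mapping of entry name to bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as jar:
        for name, data in entries.items():
            jar.writestr(name, data)
    buf.seek(0)
    return buf


def class_bytes_by_usage(jar_file, referenced_classes):
    """Return (used, unused) uncompressed byte totals over the .class entries."""
    used = unused = 0
    with zipfile.ZipFile(jar_file) as jar:
        for info in jar.infolist():
            if not info.filename.endswith(".class"):
                continue  # skip the manifest and other resources
            # Convert an entry path such as com/example/Foo.class to com.example.Foo
            class_name = info.filename[: -len(".class")].replace("/", ".")
            if class_name in referenced_classes:
                used += info.file_size
            else:
                unused += info.file_size
    return used, unused


# Hypothetical archive: placeholder bytes stand in for real class files.
demo_jar = build_demo_jar({
    "META-INF/MANIFEST.MF": b"Manifest-Version: 1.0\n",
    "com/example/Used.class": b"\x00" * 100,
    "com/example/Unused1.class": b"\x00" * 300,
    "com/example/Unused2.class": b"\x00" * 600,
})

used, unused = class_bytes_by_usage(demo_jar, {"com.example.Used"})
print(f"referenced class bytes:   {used}")
print(f"unreferenced class bytes: {unused}")
```

In this assumed example only one of the three classes is referenced, so most of the archive's class data would be downloaded without being needed.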