The present application generally relates to computer-implemented methods and systems for analyzing large software systems. More particularly, it relates to an interrelated set of tools and methods for recording the identity of software components responsible for creating files, recording the identity of software components that access software files, reasoning about the dependency relationships between software components, identifying and reporting undesirable dependencies between them, and reporting other useful information about a large-scale software architecture by instrumenting a software build process or test process. The term “software component” as used herein is intended to mean a software package or a software module of a software system.
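By way of non-limiting illustration only, the recording and reasoning steps mentioned above might be sketched as follows. The sketch assumes, hypothetically, that the instrumented build or test process has already produced two records: a mapping from each file to the software component that created it, and a list of (component, file) access events. The rule that an accessing component depends on the creating component, and the names `created_by`, `accesses`, and `infer_dependencies`, are assumptions made for this example and are not taken from the application.

```python
def infer_dependencies(created_by, accesses):
    """Infer component-level dependencies from file creation and access records.

    created_by: dict mapping a file path to the component that created it
                (hypothetical record format).
    accesses:   iterable of (component, file path) access events
                (hypothetical record format).

    Assumed rule: a component that accesses a file created by a different
    component depends on that component.
    """
    dependencies = set()
    for component, path in accesses:
        owner = created_by.get(path)
        # Skip files with no recorded creator and a component's own files.
        if owner is not None and owner != component:
            dependencies.add((component, owner))
    return sorted(dependencies)


if __name__ == "__main__":
    created_by = {"build/libcore.a": "core", "build/ui.o": "ui"}
    accesses = [
        ("ui", "build/libcore.a"),
        ("ui", "build/ui.o"),
        ("app", "build/libcore.a"),
        ("app", "src/unknown.h"),
    ]
    for dependent, provider in infer_dependencies(created_by, accesses):
        print(f"{dependent} depends on {provider}")
```

Files with no recorded creator are ignored here; an actual system might instead report them separately.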
Software systems today can be composed of millions of entities (files, functions, classes, methods, data structures, web services, databases, etc.) that are connected in many ways. These systems can be heterogeneous: made up of code written in dozens of languages, compiled or interpreted and used on multiple operating systems, and incorporating many third-party technologies including open-source and proprietary tools and libraries. Designing and maintaining these systems is difficult, and keeping the complexity in the system under control is an important concern. When complexity causes different elements of a system to interact in unanticipated ways, or when parts of a system are so complex that they move beyond the bounds of human cognitive capacities, a host of interconnected problems can begin to occur. When engineers lose control of complexity in a system's design, it can lead to project failure, business failure, and/or man-made disaster. Even systems of high quality with a sustainable level of overall complexity may have some sub-systems or cross-cutting concerns that are unmanageable and incomprehensible.
In order to maintain long-term health in large systems, engineers often employ patterns in their designs to keep architectural complexity in check. From a macro-perspective, well-architected systems are structured as hierarchies of modules, have APIs, employ abstraction-layering schemes, and have reusable components. When carefully applied, such patterns can aid developer comprehension and enable independence of action between people and teams in large organizations. They can also endow systems with a variety of beneficial properties including comprehensibility, reliability, evolvability, scalability, and flexibility, just to name a few.
Modular architectures are composed of distinct semi-autonomous structures with formal boundaries that separate their internal environment from the outside world. Robust modules have the property of “homeostasis”—their internal functioning is not easily disrupted by fluctuations in the external environment. Modular systems contain many independent components, each of which can change or evolve separately with minimal impact on each other or on the system as a whole. Modules hide information in the sense that the use of one only requires a client to understand its public interface, not its complex internals.
A hierarchical system is composed of elements whose dependency relationships form a directed acyclic graph (DAG). While a hierarchy may not contain cycles, it can contain multiple source and sink nodes, and can both diverge and converge. A tree is a common type of hierarchy that fans out from a single root (or controller node) and never converges. A layered system is also a kind of hierarchy. Hierarchies are pervasive organizing patterns in many real-world systems. Hierarchical organization assists designers by reducing the cognitive burden placed on the human mind when examining a system from any one vantage point. Hierarchies also facilitate top-down control and the imposition of safety constraints. They are useful structures for classifying, storing, and searching for information. Finally, the requirement that a hierarchy contain no cyclic connections reduces the possibility that feedback loops will be formed between widely separated components. These feedback loops or cycles can hinder change or lead to undesirable change propagation during the design process.
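By way of non-limiting illustration only, whether a set of dependency relationships forms a hierarchy in the above sense can be checked with a standard topological-sort procedure (Kahn's algorithm). The sketch below assumes dependencies are given as (dependent, provider) pairs; the function name `analyze_hierarchy` and the component names in the example are hypothetical.

```python
from collections import defaultdict, deque


def analyze_hierarchy(edges):
    """Report whether a dependency graph is acyclic, and list its end nodes.

    edges: iterable of (dependent, provider) pairs, read as
           "dependent depends on provider" (assumed input format).
    """
    successors = defaultdict(set)
    in_degree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        if dst not in successors[src]:  # ignore duplicate edges
            successors[src].add(dst)
            in_degree[dst] += 1

    # Source nodes have no incoming edges; sink nodes have no outgoing edges.
    sources = sorted(n for n in nodes if in_degree[n] == 0)
    sinks = sorted(n for n in nodes if not successors[n])

    # Kahn's algorithm: repeatedly remove nodes with no remaining incoming
    # edges. Every node can be removed only if the graph has no cycle.
    remaining = {n: in_degree[n] for n in nodes}
    queue = deque(sources)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for nxt in successors[node]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                queue.append(nxt)

    return {
        "is_hierarchy": removed == len(nodes),
        "sources": sources,
        "sinks": sinks,
    }


if __name__ == "__main__":
    deps = [("app", "ui"), ("app", "db"), ("ui", "core"),
            ("db", "core"), ("tool", "core")]
    print(analyze_hierarchy(deps))
    print(analyze_hierarchy([("a", "b"), ("b", "a")]))
```

The first example graph both diverges (from `app`) and converges (on `core`) and has two source nodes, yet remains a hierarchy; the second contains a cycle and does not.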
Layers combine the notion of hierarchy and modularity in a manner that serves to contain complexity and endow a system with a variety of beneficial properties. Layers in systems provide services to components above them while relying on services provided by those below. They combine the notion of directionality found in hierarchies with the notion of information hiding found in modules. Conceptual layers in a design are sometimes called abstractions. Layering hides information in a stronger manner than modularity does because it partitions a complex network of components into two distinct regions that may be considered independently. In addition to hiding details, abstraction layers may embody new higher-level concepts by aggregating diverse facilities into a useful coherent whole. Abstraction layers can also partition systems by engineering discipline or be responsible for defining the boundaries between disciplines. The transistor, for instance, creates a useful barrier that allows electrical engineers to study quantum mechanics while computer engineers can study Boolean logic. The creation of new abstraction layers is an important way reuse is achieved in software.
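By way of non-limiting illustration only, the directionality of a layered design can be checked mechanically once each component has been assigned to a layer. The sketch below assumes layers are numbered from 0 (lowest) upward and treats only a dependency from a lower layer on a higher layer as a violation; dependencies within the same layer are permitted here, which is an assumption of this example. The names `layer_of` and `find_layering_violations` are hypothetical.

```python
def find_layering_violations(layer_of, dependencies):
    """Return dependencies that point from a lower layer to a higher one.

    layer_of:     dict mapping a component to its layer number, where 0 is
                  the lowest layer (assumed numbering).
    dependencies: iterable of (dependent, provider) pairs.
    """
    return [
        (dependent, provider)
        for dependent, provider in dependencies
        if layer_of[dependent] < layer_of[provider]
    ]


if __name__ == "__main__":
    layer_of = {"ui": 2, "service": 1, "storage": 0}
    dependencies = [
        ("ui", "service"),
        ("service", "storage"),
        ("storage", "ui"),  # a lower layer relying on a higher one
    ]
    for dependent, provider in find_layering_violations(layer_of, dependencies):
        print(f"violation: {dependent} (layer {layer_of[dependent]}) "
              f"depends on {provider} (layer {layer_of[provider]})")
```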
Some new empirical and quantitative research suggests that code that adheres to these principles costs less to develop, adapt, and maintain. An MIT dissertation published in February 2013 titled “System Design and the Cost of Architectural Complexity” by Daniel J. Sturtevant finds that modular, hierarchical, and layered code has fewer defects than code in which those properties are absent or have degraded, and that software engineers working in architecturally sound code are also more productive and have higher morale. This dissertation built upon a prior body of work done by Alan MacCormack, Carliss Baldwin, and John Rusnak in which they explored software codebases using static analysis tools to extract dependencies between software elements and then used network analysis and design structure matrix (DSM) techniques to examine modular and hierarchical properties of those software systems.