When digital computers were first invented, programmers wrote programs in binary numbers (1's and 0's) or hexadecimal numbers (0-F) that the computer's CPU could execute directly. Throughout the 1960s and beyond, many computer front panels included banks of switches a programmer could use to enter binary instructions directly into the computer. However, writing code in binary was challenging and prone to errors. Therefore, beginning in the 1950s, programmers began writing in so-called “assembly language.” Assembly language provided a one-to-one correspondence between binary machine code instructions and human-written assembly language instructions, but substituted mnemonics (short, easy-to-remember symbol strings) for the machine language operation codes (op codes) and used symbols instead of machine addresses. Writing in assembly language still required the programmer to have an intimate understanding of the hardware she was writing the program for, but it simplified the programmer's tasks and eliminated sources of error.
Because assembly code is so intimately intertwined with hardware functionality, it was possible for a good assembly language programmer to optimize execution efficiency, reliability, speed and program size. This was especially important if the programmer wanted to squeeze every last ounce of performance out of a resource-constrained computing platform.
More than a half-century later, modern-day computers have comparatively vast resources in terms of processor speed, memory storage and other performance characteristics. Furthermore, there are now vast numbers of computing platforms with a wide variety of different processor types and instruction sets. Therefore, most modern-day computer programs are written in higher level languages such as Java, C++ or the like. Such higher level languages have many advantages, including portability across multiple computing platforms and insulating the human programmer from the detailed operation of particular computing devices. Typical computing platforms cannot execute the higher level languages directly. Rather, the higher level instructions are first compiled or interpreted to produce lower level “machine code” that the particular processor can execute. There is an entire industry devoted to developing compilers, linkers, and other tools to provide optimal transformation from higher level languages to machine code on various platforms.
Generally speaking, it would be impractical or impossible to provide the rich, cross-platform software functionality we take for granted today if programmers were still writing all code in assembly language. In fact, it is relatively rare to find a modern-day programmer who is proficient in assembly. However, there are certain instances when assembly language or machine code instructions retain their importance.
For example, there are some game programming situations where extreme optimization is required. As one example, an inner loop of a processor-intensive algorithm within a game program that may be executed repeatedly hundreds or thousands of times each second may need to be maximally optimized. Some argue that a modern-day optimizing compiler can do a better job of optimizing than can any human programmer, but others disagree and believe there is no substitute for a human being who fully understands the target hardware and can apply creativity to minimize the number of instructions needed to perform a given function.
Additionally, sometimes legacy game code originally written in assembly now has to be understood and/or modified. Another interesting situation arises where a programmer does not have access to the source code for a particular game, but needs to understand how a particular part of the game code executes. It is possible to disassemble the machine code to provide mnemonic-oriented assembly language, but such code can be difficult to understand—especially by programmers who are used to writing in higher level languages.
Normally, programmers do not deal directly with machine code. However, sometimes there is a need to code these instructions by hand, or inspect compiler-generated machine code.
Thus, in these instances, it would be beneficial if the machine code could be visualized in a manner that more easily facilitates human comprehension. The technology provided herein is directed to specific techniques that work together to enhance and transform machine code, allowing a human to more easily understand the flow of control in a function or a program.
When reading any code, the flow of the program is very important to overall comprehension. Normally, the flow is from one machine instruction to the next. However, conditional or unconditional branches complicate the flow, causing execution to conditionally or unconditionally skip to an arbitrary instruction. Sometimes the flow can branch to previously executed instructions, providing what is known as a “loop,” where a sequence of instructions may be executed an arbitrary number of times.
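The distinction between forward branches and backward (loop-forming) branches can be illustrated with a minimal Python sketch. The toy instruction format (a mnemonic plus an optional target line index), the function name `classify_branches`, and the mnemonics themselves are hypothetical and chosen only for illustration; they are not taken from any particular instruction set or from the technology described herein.

```python
from typing import List, Optional, Tuple

# Hypothetical toy instruction: (mnemonic text, branch target line index or None).
Instruction = Tuple[str, Optional[int]]


def classify_branches(program: List[Instruction]) -> List[Tuple[int, int, str]]:
    """Return (source line, target line, kind) for every branching instruction.

    A branch whose target is at or before its own line is 'backward' and
    therefore closes a potential loop; any other branch is 'forward'.
    """
    result = []
    for line, (_mnemonic, target) in enumerate(program):
        if target is None:
            continue  # ordinary instruction: flow simply continues to the next line
        kind = "backward" if target <= line else "forward"
        result.append((line, target, kind))
    return result


# Illustrative listing only; the mnemonics are placeholders.
program: List[Instruction] = [
    ("li   r3, 0", None),      # line 0
    ("addi r3, r3, 1", None),  # line 1: loop body starts here
    ("cmpwi r3, 10", None),    # line 2
    ("blt  1", 1),             # line 3: conditional branch back to line 1
    ("b    6", 6),             # line 4: unconditional branch forward to line 6
    ("nop", None),             # line 5: skipped by the branch on line 4
    ("blr", None),             # line 6
]

branches = classify_branches(program)
```

In this listing, the branch on line 3 is classified as backward (a loop over lines 1 through 3), while the branch on line 4 is classified as forward.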
In mathematics, computer science, and related fields, big O notation (also known as Big Oh notation, Landau notation, Bachmann-Landau notation, and asymptotic notation) describes the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions. Big O notation allows its users to simplify functions in order to concentrate on their growth rates: different functions with the same growth rate may be represented using the same O notation. Although developed as a part of pure mathematics, this notation is now frequently also used in the analysis of algorithms to describe an algorithm's usage of computational resources: the worst case or average case running time or memory usage of an algorithm is often expressed as a function of the length of its input using big O notation. This allows algorithm designers to predict the behavior of their algorithms and to determine which of multiple algorithms to use, in a way that is independent of computer architecture or clock rate. Because Big O notation discards multiplicative constants on the running time, and ignores efficiency for low input sizes, it does not always reveal the fastest algorithm in practice or for practically-sized data sets. But the approach is still very effective for comparing the scalability of various algorithms as input sizes become large. A description of a function in terms of big O notation usually only provides an upper bound on the growth rate of the function. Associated with big O notation are several related notations, using the symbols o, Ω, ω, and Θ, to describe other kinds of bounds on asymptotic growth rates. Big O notation is also used in many other fields to provide similar estimates. See Wikipedia “Big O Notation.”
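For reference, the standard formal definition of big O notation as the argument tends to infinity can be written as follows. The symbols f, g, M and x_0 are the conventional ones and do not appear elsewhere in this description.

```latex
f(x) = O\bigl(g(x)\bigr) \text{ as } x \to \infty
\quad\Longleftrightarrow\quad
\exists\, M > 0,\ \exists\, x_0 \ \text{such that}\ 
|f(x)| \le M\,|g(x)| \ \text{for all } x \ge x_0 .

% Worked example: 3n^2 + 5n + 2 = O(n^2), taking M = 10 and x_0 = 1, because
% for every n >= 1:
3n^2 + 5n + 2 \;\le\; 3n^2 + 5n^2 + 2n^2 \;=\; 10\,n^2 .
```

The worked example shows how the multiplicative constant and the lower-order terms are absorbed, leaving only the growth rate.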
There are alternative techniques for analyzing program flow and complexity. One exemplary illustrative non-limiting technique is based on “indentation level” signifying branching. Generally speaking, branching or looping in a computer program is often indicated by indenting code portions that are within the branch or loop. Hence, four levels of looping can be indicated by four-level indentation. An exemplary illustrative non-limiting implementation takes advantage of such typographic presentation to analyze program complexity based on indentation level. Transitions from one indentation level to another can indicate a change in complexity. The more indentation levels, generally speaking, the more complex the program structure. Automatic program analysis based on indentation level can thus be one useful measure of program complexity.
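As a rough illustration of measuring complexity from typographic indentation, the following Python sketch computes an indentation level for each line of an already-indented listing, together with the deepest level reached and the number of level transitions. The function names, the assumed indent width of four spaces, and the sample listing are all hypothetical; this is one simple reading of the idea, not a definitive implementation.

```python
from typing import List


def indentation_levels(lines: List[str], indent_width: int = 4) -> List[int]:
    """Return the indentation level of every non-blank line.

    Assumes indentation uses spaces only, `indent_width` spaces per level.
    """
    levels = []
    for line in lines:
        if not line.strip():
            continue  # blank lines carry no structural information
        leading_spaces = len(line) - len(line.lstrip(" "))
        levels.append(leading_spaces // indent_width)
    return levels


def count_transitions(levels: List[int]) -> int:
    """Count how many times the level changes between consecutive lines."""
    return sum(1 for a, b in zip(levels, levels[1:]) if a != b)


# Illustrative listing only: two nested loops shown by two levels of indentation.
listing = [
    "li r3, 0",
    "    addi r3, r3, 1",
    "        mullw r4, r3, r3",
    "        bne inner",
    "    blt outer",
    "blr",
]

levels = indentation_levels(listing)
deepest_level = max(levels)
transitions = count_transitions(levels)
```

Here the deepest level is 2, corresponding to two levels of looping, and the level changes four times across the listing.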
In one exemplary illustrative non-limiting implementation, branching arrows either facing forward/downward or backward/upward are automatically generated and used to visually indicate the flow of the program executed by the machine code. When presented in this manner, the branching destination of each machine code instruction is emphasized, so that the human reader, for example, a programmer, can grasp the flow of the program more easily and quickly.
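One plain-text way such branching arrows might be generated is sketched below in Python. Each branch receives its own gutter column: `v` marks a downward (forward) branch source, `^` an upward (backward) branch source, `|` the lines the arrow passes, and `>` the destination. The instruction format, the glyphs, and the function name `render_arrows` are assumptions made for illustration; an actual implementation could draw graphical arrows instead.

```python
from typing import List, Optional, Tuple

# Hypothetical toy instruction: (mnemonic text, branch target line index or None).
Instruction = Tuple[str, Optional[int]]


def render_arrows(program: List[Instruction]) -> List[str]:
    """Prefix every line with a gutter holding one arrow column per branch."""
    branches = [(i, t) for i, (_m, t) in enumerate(program) if t is not None]
    rows = []
    for line, (mnemonic, _target) in enumerate(program):
        gutter = ""
        for src, dst in branches:
            low, high = min(src, dst), max(src, dst)
            if line == dst:
                gutter += ">"                        # arrow head at the destination
            elif line == src:
                gutter += "v" if dst > src else "^"  # arrow tail shows the direction
            elif low < line < high:
                gutter += "|"                        # arrow shaft passing this line
            else:
                gutter += " "
        rows.append(f"{gutter} {mnemonic}")
    return rows


# Illustrative listing only; the mnemonics are placeholders.
program: List[Instruction] = [
    ("li   r3, 0", None),
    ("addi r3, r3, 1", None),
    ("cmpwi r3, 10", None),
    ("blt  1", 1),   # backward branch to line 1
    ("b    6", 6),   # forward branch to line 6
    ("nop", None),
    ("blr", None),
]

rows = render_arrows(program)
for row in rows:
    print(row)
```

Giving each branch its own column keeps overlapping arrows from colliding, at the cost of a wider gutter for code with many branches.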
In another exemplary illustrative non-limiting implementation, indentation of various machine code instruction lines is automatically generated and used to visually indicate the natural division of the machine code into blocks due to branching from a specific machine code instruction to other instructions. In some implementations, the destination of a branching command is visually noted in the machine code branching instruction, and the indentation continues until a machine code instruction with a visual mark is encountered.
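A minimal Python sketch of this kind of automatic indentation follows. It handles forward branches only, and it interprets "indentation continues until a marked instruction is encountered" as: every line strictly between a forward branch and its destination is indented one additional level. That interpretation, the instruction format, and the function name `indent_forward_blocks` are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical toy instruction: (mnemonic text, branch target line index or None).
Instruction = Tuple[str, Optional[int]]


def indent_forward_blocks(program: List[Instruction], indent: str = "    ") -> List[str]:
    """Indent the lines lying between each forward branch and its destination."""
    depth = [0] * len(program)
    for src, (_mnemonic, target) in enumerate(program):
        if target is not None and target > src:
            # Lines after the branch and before its destination form one block.
            for line in range(src + 1, min(target, len(program))):
                depth[line] += 1
    return [indent * d + mnemonic for d, (mnemonic, _t) in zip(depth, program)]


# Illustrative listing only: the branch on line 1 skips lines 2 and 3.
program: List[Instruction] = [
    ("cmpwi r3, 0", None),
    ("beq 4", 4),
    ("addi r4, r4, 1", None),
    ("mullw r4, r4, r4", None),
    ("blr", None),
]

indented = indent_forward_blocks(program)
for text in indented:
    print(text)
```

The two instructions that the conditional branch may skip are printed one level deeper than their neighbours, making the block boundary visible at a glance.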
In yet another exemplary illustrative non-limiting implementation, the flow and the complexity of the machine code are automatically analyzed and visually noted by computing the time complexity associated with each machine code block, or each machine code instruction, and marking the various components of the machine code accordingly.
An alternative way to analyze program complexity can be referred to as the “strongly connected subgraph method,” and can be used to detect worst case general time complexity of program assembly instructions. In one exemplary illustrative non-limiting implementation, time complexity may be different from program structural complexity. Such a technique can be based on the code flow and destination of branch instructions and may detect time complexity in terms of nesting level. Such techniques, for example, create a code flow graph, identify strongly connected subgraphs, create collections of strongly connected subgraphs, create a collection graph, analyze for longest path in a collection graph, sort collections and assign time complexity for interpretation, display, etc.
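The first two steps listed above (creating a code flow graph and identifying strongly connected subgraphs) can be sketched in Python as follows. The sketch uses Tarjan's algorithm, a well-known way to find strongly connected components; the source does not state which algorithm is used, so this is a substitution made for illustration. The instruction format, which adds a hypothetical `falls_through` flag, and all function names are likewise assumptions, and the later steps (collections, collection graph, longest path) are not covered.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy instruction: (mnemonic, branch target or None, falls_through).
# falls_through is False for unconditional branches and returns.
Instruction = Tuple[str, Optional[int], bool]


def build_flow_graph(program: List[Instruction]) -> Dict[int, List[int]]:
    """Map each line index to the line indices that may execute next."""
    graph: Dict[int, List[int]] = {i: [] for i in range(len(program))}
    for i, (_mnemonic, target, falls_through) in enumerate(program):
        if falls_through and i + 1 < len(program):
            graph[i].append(i + 1)
        if target is not None:
            graph[i].append(target)
    return graph


def strongly_connected_components(graph: Dict[int, List[int]]) -> List[List[int]]:
    """Tarjan's algorithm (recursive form, adequate for short listings)."""
    index: Dict[int, int] = {}
    lowlink: Dict[int, int] = {}
    stack: List[int] = []
    on_stack = set()
    components: List[List[int]] = []
    counter = 0

    def visit(node: int) -> None:
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for succ in graph[node]:
            if succ not in index:
                visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index[succ])
        if lowlink[node] == index[node]:  # node is the root of a component
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(sorted(component))

    for node in graph:
        if node not in index:
            visit(node)
    return components


def cyclic_components(graph: Dict[int, List[int]]) -> List[List[int]]:
    """Keep only components containing a cycle (size > 1 or a self-branch)."""
    return sorted(
        c for c in strongly_connected_components(graph)
        if len(c) > 1 or c[0] in graph[c[0]]
    )


# Illustrative listing only: two separate loops, one after the other.
program: List[Instruction] = [
    ("li   r3, 0", None, True),      # line 0
    ("addi r3, r3, 1", None, True),  # line 1: first loop body
    ("blt  1", 1, True),             # line 2: conditional branch back to line 1
    ("addi r4, r4, 1", None, True),  # line 3: second loop body
    ("blt  3", 3, True),             # line 4: conditional branch back to line 3
    ("blr", None, False),            # line 5: return, no fall-through
]

loops = cyclic_components(build_flow_graph(program))
```

For this listing the sketch reports two cyclic subgraphs, one covering lines 1 and 2 and one covering lines 3 and 4; the straight-line instructions on lines 0 and 5 belong to no cycle.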
The strongly connected subgraph method examines the dynamic flow of the program by exhaustively following the line-to-line sequence of instructions and identifying cyclic transitions (which represent strongly connected subgraphs). Detecting nesting levels in this way allows detection of how the various portions of the program are actually used.
The indentation level or “up branch method” of computing time complexity and the strongly connected subgraph method of computing time complexity each have strengths and weaknesses, and they differ from one another. For example, the up branch method may sometimes incorrectly attribute a higher time complexity to a line of code because the algorithm groups lines by line number, when in fact the line may not be a true part of the cycle. The up branch method is also conservative: it does not recognize cycle containment when the first or last lines of code are shared between cycles, or when one cycle only partially overlaps another with regard to line numbering. On the other hand, the strongly connected subgraph method tends to recognize cycle containment aggressively, even when the source code or assembly code might not actually result in that time complexity, because the computation performed by each line of code that causes a branch to be taken is not analyzed. The strongly connected subgraph method never attributes to a line of code a time complexity lower than is theoretically possible; thus every time complexity it computes is a maximum.
Given the strengths and weaknesses of the up branch method and the strongly connected subgraph method, combination strategies can be employed. For example, the strongly connected subgraph method may be used to determine the worst case time complexity that is theoretically possible (e.g., based on naive code flow and branching flow). Then, for each line of code, the minimum of the time complexities computed by the up branch method and the strongly connected subgraph method can be taken, providing a conservative estimate of time complexity while eliminating the incorrectly high time complexities that the up branch method can produce.
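The per-line minimum described above can be sketched in a few lines of Python. The sketch assumes each method has already produced, for every line of code, an integer nesting level k standing for a time complexity of O(n^k); that encoding, the function names, and the sample values are hypothetical and serve only to illustrate the combination step.

```python
from typing import List


def combine_estimates(up_branch: List[int], subgraph: List[int]) -> List[int]:
    """Take, line by line, the lower of the two nesting-level estimates."""
    if len(up_branch) != len(subgraph):
        raise ValueError("both methods must report one value per line of code")
    return [min(a, b) for a, b in zip(up_branch, subgraph)]


def format_complexity(level: int) -> str:
    """Render a nesting level k as O(1), O(n) or O(n^k)."""
    if level == 0:
        return "O(1)"
    if level == 1:
        return "O(n)"
    return f"O(n^{level})"


# Hypothetical per-line results: the up branch method over-attributes line 2.
up_branch_levels = [0, 1, 2, 2, 1, 0]
subgraph_levels = [0, 1, 1, 2, 1, 0]

combined = combine_estimates(up_branch_levels, subgraph_levels)
labels = [format_complexity(level) for level in combined]
```

In this example line 2 is reported as O(n) rather than O(n^2), because the strongly connected subgraph estimate bounds the value that the up branch method would otherwise have assigned.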