The present invention relates generally to computer software, and more specifically, to a method and system of making computer software resistant to tampering and reverse-engineering.
The market for computer software in all of its various forms is recognized to be very large and is growing every day. In industrialized nations, hardly a business exists that does not rely on computers and software, either directly or indirectly, in its daily operations. As well, with the expansion of powerful communication networks such as the Internet, the ease with which computer software may be exchanged, copied and distributed is also growing daily.
With this growth of computing power and communication networks, a user's ability to obtain and use unauthorized or unlicensed software is increasing, and a practical means of protecting such computer software has yet to be devised.
As well, personal computers are found in a substantial percentage of all households in America, and in Canada. Home computing is also common in Europe and various parts of the SEATO region. However, when home computer users access banking or other online services via the World Wide Web or the like, much of the processing must be in the physically protected servers rather than in the home computers, because the applications running on home computers are vulnerable to tampering.
Any bank or other service institution must protect itself by ensuring that operations whose results must be trusted are performed in secured environments. The result is that service support tends to be centralized at a relatively small number of sites. Servers must therefore be massive to support heavy transaction loads. If the clientele doubles, the centralized support must likewise be doubled.
This need for centralized physical security requires centralized implementations of services. Centralized services are undesirable as they are inherently vulnerable to localized attacks and provide a single point of failure.
A possible, but impractical, solution would be to house centralized servers in fortresses, and to implement them on fault-tolerant architectures with fault-tolerant tools. However, both aspects of this approach are quite costly.
Before describing the difficulties and failed approaches, the general language used in the art will be outlined.
Computer software is generally written by software developers in a high-level language which must be compiled into low-level object code in order to execute on a computer or other processor.
High-level computer languages use command wording that closely mirrors plain language, so they can be easily read by one skilled in the art. Typically, source code files have a suffix that identifies the corresponding language. For example, Java is a currently popular high-level language and its source code typically carries a name such as "prog1.java". Such files are generally referred to as ".java" files. Other examples include the high-level computer languages C and C++, where source files are typically ".c" or ".cpp" (for "C plus plus") files, respectively, as in "prog1.c" or "prog1.cpp", respectively.
High-level structure refers to, for example, the class hierarchy of object oriented programs such as those in Java™, or the module structure of module-based languages such as Ada™ and Modula-2™ programs.
Object-code generally refers to machine-executable code, which is the output of a software compiler that translates source code from human-readable to machine-executable code. In the case of Java™, these files are typically named "filename.class", where the "filename" may be any valid identifier string, and are referred to as ".class files".
The low-level structure of object code refers to the actual details of how the program works, including scalar data flow and detailed control flow including inter-routine call-return linkages. Low-level analysis usually focuses on, or at least begins with, one routine at a time. This routine may be variously called, for example, a procedure, function or method; in C or C++ all routines are called "functions", whereas in Java, they are all called "methods". The conventions vary with the source high-level software language. Analysis of individual routines may be followed by analyses of wider scope in some compilation tool sets.
The low-level structure of a software program is usually described in terms of its data flow and control flow. Data flow is a description of the variables together with the operations performed on them, and the way information flows from one variable to another. Control flow is a description of how control jumps from place to place in the program during execution, and the tests that are performed to determine those jumps.
Instructions which potentially transfer control to another instruction are referred to as branches. A conditional branch is a branch whose destination is determined by its input value or values. A boolean branch is a conditional branch which takes a single input value and chooses between two destinations, one associated with the input value "true", and the other with the input value "false".
Tampering refers to changing computer software in a manner contrary to the wishes of the original author. In the past, computer software programs had limitations encoded into them, such as requiring password access, preventing copying, or allowing the software only to execute a predetermined number of times or for a certain duration. However, because the user has complete access to the software code, techniques have been found to identify the code administering these limitations. Once this coding has been identified, the user is able to overcome these programmed limitations by modifying the software code.
To protect a program from hostile attackers, both the behaviour of the program and the knowledge which it embodies must be protected. That is, one must prevent changes to its behaviour, and one must conceal its embedded knowledge. The prevention of behavioural changes is referred to as "tamper-proofing", and the concealment of embedded knowledge as "obscuring".
When an attacker seeks to subvert the behaviour of a program, for example, by removing password checking or eliminating a date-check on a time-limited trial version of a software package, the attack is generally directed at control flow, rather than the data flow. Changing behaviour through an attack on the data flow generally requires substantial insight into the way the program operates, whereas an attack on control flow can often succeed with almost no knowledge of how the application functions.
Although these two aspects of program protection are related, they are not the same. For example, it is possible to conceal almost all of the knowledge embedded in a program, but still leave it vulnerable to tampering.
Consider, for example, an application program which is password-protected to prevent unauthorized use. When an attempt is made to use it, it asks for a password to determine whether its use is valid. To subvert this password checking, it is not necessary to understand the application or how it operates, or even how the password is stored or checked. Typically, all that is needed is to find the particular conditional branch instruction whose execution results in either refusal or acceptance of the user. Replacing this single instruction with an unconditional branch to the location leading to "acceptance" completely subverts the password checking. Discovering the accept/reject conditional branch can be done by low-level tracing of the initial phases of execution. No knowledge of anything else about the program, other than what is needed to find the crucial branch-point, is required. This remains true irrespective of how obscure any other information, whether in algorithms or in data, might be.
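By way of illustration only, the attack described above can be sketched in Python against a toy interpreter. The instruction names (`CMP`, `BR_TRUE`, `BR`, `HALT`), the stored password and the program layout are hypothetical and chosen purely for the example; they do not correspond to any real instruction set.

```python
# Hypothetical toy instruction set, for illustration only.
def run(program, entered, stored="s3cret"):
    """Interpret a tiny program and return the outcome it halts with."""
    pc = 0          # program counter
    flag = False    # result of the most recent comparison
    while True:
        op, *args = program[pc]
        if op == "CMP":          # compare the entered password with the stored one
            flag = (entered == stored)
            pc += 1
        elif op == "BR_TRUE":    # boolean branch: jump if flag is true
            pc = args[0] if flag else pc + 1
        elif op == "BR":         # unconditional branch
            pc = args[0]
        elif op == "HALT":
            return args[0]


ORIGINAL = [
    ("CMP",),              # 0: password comparison
    ("BR_TRUE", 3),        # 1: the single accept/reject branch
    ("HALT", "refused"),   # 2
    ("HALT", "accepted"),  # 3
]

# The attack: overwrite the one conditional branch with an unconditional
# branch to the "accepted" location. Nothing else needs to be understood.
PATCHED = list(ORIGINAL)
PATCHED[1] = ("BR", 3)
```

In this sketch, `run(ORIGINAL, "wrong")` ends in refusal, whereas `run(PATCHED, "wrong")` ends in acceptance, although the comparison itself and the stored password are untouched.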
Obscurity, in and of itself, does not necessarily prevent tampering. In fact, obscuring is necessary, but not sufficient, for tamper-proofing.
There are many tools on the market whose purpose is to obfuscate the algorithms in programs. Since Java™ is used for transmission of programs over the World Wide Web and the Internet, obfuscators for Java™ are especially in demand.
With two exceptions, all of these are very weak. For example, Obfuscate™, J-shrink™, HashJava™, SourceGuard™, and DashO™ all attempt to obfuscate Java™ code by a common set of modifications involving renaming, removal of attached debug information, and other 'de-structuring' operations. The theme of these approaches is to apply the rules of good software engineering in reverse, and to remove information used to observe program behaviour during execution. The rationale is that if doing something makes code easier to understand, doing the opposite may be expected to make it more difficult to understand.
J-shrink™, HashJava™, SourceGuard™, and DashO™ also perform code optimization, which tends to make Java™ object code (.class files) more difficult to decompile into source code.
DashO™ also introduces irreducible flow graphs, which have no direct representation as Java™ source, although conversion to Java™ source is still possible using node-splitting to re-establish flow graph reducibility. Algorithms for removal of irreducible flow-graphs from programs are well-known, for example, combining node splitting with T1-T2 analysis. Such a method is presented in "Compilers: Principles, Techniques, and Tools", by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, ISBN 0-201-10088-6: pp. 666-668. Hence, very limited protection is provided by the introduction of irreducible flow-graphs.
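As a minimal sketch of why irreducibility gives little protection, the following Python function applies the T1 transformation (remove a self-loop) and the T2 transformation (merge a node into its unique predecessor) until neither applies. The function name `is_reducible`, the edge-list representation and the node labels are assumptions made for this example, and every node is assumed to be reachable from the entry node.

```python
def is_reducible(edges, entry):
    """Return True if the flow graph collapses to one node under T1/T2.

    edges: iterable of (source, destination) pairs.
    Assumes every node is reachable from `entry`.
    """
    nodes = {entry}
    for a, b in edges:
        nodes.update((a, b))
    succ = {n: set() for n in nodes}
    for a, b in edges:
        succ[a].add(b)

    changed = True
    while changed:
        changed = False
        # T1: delete any edge from a node to itself.
        for n in succ:
            if n in succ[n]:
                succ[n].discard(n)
                changed = True
        # T2: a non-entry node with exactly one predecessor is merged into it.
        for n in list(succ):
            if n == entry:
                continue
            preds = [p for p in succ if n in succ[p]]
            if len(preds) == 1:
                p = preds[0]
                succ[p].discard(n)
                succ[p] |= succ.pop(n)   # predecessor inherits n's successors
                changed = True
                break                    # graph changed; restart the scan
    return len(succ) == 1


# A loop between B and C that can be entered at either node: irreducible.
IRREDUCIBLE = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "B")]

# Node splitting: C is duplicated as C2 so that each copy has one predecessor.
SPLIT = [("A", "B"), ("A", "C"), ("B", "C2"), ("C", "B"), ("C2", "B")]
```

Here `is_reducible(IRREDUCIBLE, "A")` is false, while the same graph after splitting node `C` is reducible, which is the mechanical step that restores a structure expressible in source form.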
In their paper "A tentative approach to constructing tamper-resistant software", 1997 New Security Paradigms Workshop, ACM publication 0-89791-986-6/97/19, M. Mambo, T. Murayama, and E. Okamoto propose a tool for making software code tamper-resistant which they designate "a0/f1/f2/f3". Aside from optimization of the code, which is standard in obscuring tools, they propose to:
(a0) analyze the program;
(f1) replace complex instructions with simpler, more elementary ones;
(f2) shuffle the instruction stream; and
(f3) insert dummy instructions.
At most, this approach adds weak obscurity and no tamper-resistance in the context defined herein, so that code treated using this technique is easily decoded. Steps f1 and f2 make no significant changes to the data flow graph, and no changes at all to the control flow graph. The dummy instructions added at step f3 can be removed using existing program slicing tools and code optimisers. As a result, this technique offers no protection against a concerted or sophisticated attack.
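The ease with which dummy instructions can be stripped may be sketched as follows. This is a simple backward liveness pass over straight-line code, a standard optimizer technique standing in for the slicing tools mentioned above, and it assumes instructions have no side effects. The tuple format `(dest, op, src1, src2)` and the name `remove_dead` are hypothetical.

```python
def remove_dead(instrs, outputs):
    """Drop instructions whose results never reach the program outputs.

    instrs:  list of (dest, op, src1, src2); string operands are variable
             names, anything else is a constant.
    outputs: names of the variables that are observable at the end.
    Assumes straight-line code with no side effects.
    """
    live = set(outputs)
    kept = []
    for dest, op, a, b in reversed(instrs):
        if dest in live:
            live.discard(dest)   # this definition satisfies the use
            live.update(x for x in (a, b) if isinstance(x, str))
            kept.append((dest, op, a, b))
        # otherwise the result is never used, so the instruction is dropped
    kept.reverse()
    return kept


PROGRAM = [
    ("t1", "add", "x", "y"),
    ("d1", "mul", "x", 7),       # dummy
    ("t2", "mul", "t1", 2),
    ("d2", "add", "d1", "t1"),   # dummy that depends on another dummy
    ("r", "sub", "t2", "x"),
]
```

Calling `remove_dead(PROGRAM, {"r"})` leaves only the three instructions that contribute to `r`; both dummies disappear in a single pass.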
None of the above tools or proposed techniques provides tamper-proofing. While tamper-resistance appears in the title of the paper by M. Mambo et al., the body of the paper contains only proposals for weakly obscuring software, and in actuality proposes no technique which can achieve significant resistance to tampering.
Moreover, the obscurity provided by the above techniques is weak. Except for DashO™, none makes significant changes to the control flow and data flow graphs after optimization is applied. In other words, the resulting programs' computational graphs are either unmodified or little modified. The data are not protected at all.
Tamper-proofing has traditionally been done by means which cannot stand up to a concerted attack. For example, one method is to obtain some hash value from the code dynamically, for an internal test. If the hash value changes, the code has been modified, and the program causes itself to fail or trap. Such protection is, of course, vulnerable to discovery by low-level tracing, and once the code to implement such checking is discovered, removing or disabling it is straightforward.
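A minimal sketch of such a hash-based self-check, and of its weakness, follows. The `Program` class, the use of SHA-256 and the `check_enabled` flag are assumptions for illustration; the flag merely models an attacker who has located the checking code and patched it out.

```python
import hashlib


def digest(code: bytes) -> str:
    """Hash value of a code image (SHA-256 chosen only for the example)."""
    return hashlib.sha256(code).hexdigest()


class Program:
    def __init__(self, code: bytes):
        self.code = bytearray(code)
        self.expected = digest(code)   # reference hash recorded at build time
        self.check_enabled = True      # stands in for the checking code itself

    def run(self) -> str:
        # Internal test: trap if the code image no longer matches its hash.
        if self.check_enabled and digest(bytes(self.code)) != self.expected:
            return "trap"
        return "ran " + self.code.decode("ascii")


p = Program(b"limit=10")
first = p.run()            # untouched code passes the check
p.code[6:8] = b"99"        # tampering with the limit...
second = p.run()           # ...is detected by the hash comparison
p.check_enabled = False    # attacker disables the discovered check
third = p.run()            # tampered code now runs unhindered
```

The check works only for as long as the check itself survives; once found, disabling it is a one-step change.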
Tamper-proofing may also rely on obscure aspects of the platform as a reference, such as contents of unused portions of disk, or the 'signature' provided by attached peripheral hardware. For example, Megaload Inc. has developed a technology for 'finger-printing' PC installations, and limiting access to applications via a key related to the finger-print. This approach is inflexible in that changes in the installation induce changes in the finger-print, with resulting administrative overhead to obtain a new registered key. Moreover, such finger-printing does not prevent tampering to remove finger-print checking.
Another approach is to use a "dongle", a special piece of plug-in hardware, such as a smart card, which implements part of the algorithm to be protected. The program will then not work correctly unless the dongle is plugged in. Obviously, this is a high-cost approach, and does not work on a standard platform. Indeed, it requires the platform to be changed to include the dongle whenever the program to be protected is to be run.
There are also various approaches based on encryption, such as decrypting software immediately prior to execution. Such protection can be penetrated by copying the image of the decrypted executable code from memory or by hacking out the key of the software and then simply running the resulting decryption over the encrypted software.
In general, existing schemes for making software tamper-proof are either quite weak or involve specialized hardware and/or other high-cost or high-maintenance methods. Other schemes, such as that of U.S. Pat. No. 5,748,741, are very restrictive in the kinds of programs they can protect.
U.S. Pat. No. 5,748,741 obscures computation by encoding via intertwining, cascades, checking codes, clocking, and appended trap codes. These techniques may only be applied to intraprocedural scalar computations, and not:
1. large scale arrays and structures, or arrays of dynamically determined size required for programs with sizable indexed linked data structures, including object-oriented (OO) programs;
2. polymorphic routine calls or parallel threads, required in OO and parallel programs;
3. data pointers and linked structures such as search trees or linked lists, ruling out encoding of most programs in languages such as C™, C++™, or Modula-2™;
4. code pointers such as procedure variables or function variables, ruling out certain table-driven control structures often used in complex control applications such as telecommunications switching, and also ruling out the implementation of dynamic method vectoring, required in object oriented programming languages as an implementation for polymorphic routine calls;
5. the full range of scalar operations found in such languages as C™ or Java™; and
6. integer (truncated) division, modulus, remainder, and bitwise operations such as and, or, and xor.
In intertwining, multiple computations of the original program are combined into new multiple computations, such that there is no 1-to-1 mapping between old computations and corresponding new ones. This intertwining requires that operations be done in groups of two or more. For example, two additions may be coded together, or an addition and a subtraction, or two multiplications, and the like. Often, the source code does not provide such convenient pairs of operations which are both ready to execute at the same time, so decoy code must be added to provide the corresponding pairs. This greatly increases the size of the code.
U.S. Pat. No. 5,748,741 also depends on cascades, which are sizable data-flow graphs within a program where all outputs depend on all inputs. These are used for a variety of purposes, including delaying response to tampering via a clock cascade, and controlling the security level of the encoding. Since cascades are entirely composed of code added to the program to be encoded, this widespread use of cascades increases code bulk and slows execution speed.
The greatest failing of U.S. Pat. No. 5,748,741 is that it does not make any substantial changes to control flow, other than to add trapping codes and additional branches to branch into these trapping codes. As a result, the control structure of the encoded program is not obscured or tamper-protected, exposing information and vulnerabilities to attack.
As noted above, it is desirable to prevent users from making small, meaningful changes to computer programs, such as overriding copy protection and timeouts in demonstration software. It is also necessary to protect computer software against reverse engineering which might be used to identify valuable intellectual property contained within a software algorithm or model.
In other applications, such as emerging encryption and electronic signature technologies, there is a need to hide secret keys in software programs and transmissions, so that software programs can sign, encrypt and decrypt transactions and other software modules. At the same time, these secret keys must be protected against being leaked.
There is therefore a need for a method and system of making computer software resistant to tampering and reverse engineering. This design must be provided with consideration for the necessary processing power and real time delay to execute the protected software code, and the memory required to store it.
It is therefore an object of the invention to provide a method and system of making computer software resistant to tampering and reverse engineering which addresses the problems outlined above.
The method and system of the invention recognizes that attackers cannot be prevented from making copies and making arbitrary changes. However, the most significant problem is "useful tampering", which refers to making small changes in behaviour. For example, if the trial software was designed to stop working after ten invocations, tampering that changes the "ten" to "hundred" is a concern, but tampering that crashes the program totally is not a priority since the attacker gains no benefit.
Control-flow describes the manner in which execution progresses through the software code. The invention increases the complexity of the control flow by orders of magnitude, obscuring the flow of its algorithm and preventing the attacker from identifying and tampering with targeted areas. However, the invention does much more than this: it also changes the way in which control over execution flow is exercised, so that control becomes highly data-driven.
One aspect of the invention is broadly defined as a method of increasing the tamper-resistance and obscurity of computer software code comprising the step of: transforming the control flow in the computer software code to dissociate the observable operation of the transformed computer software code from the intent of the original software code.
Another aspect of the invention is a method of increasing the tamper-resistance and obscurity of computer software code comprising the step of: converting the control flow of the computer software code from its original form into data-driven form, to increase the tamper-resistance and obscurity of the computer software code.
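One generic way to picture control flow in data-driven form is a dispatch loop in which the next block to execute is selected by a computed value rather than by a visible loop or branch construct. The Python sketch below uses this familiar flattening idea purely as an illustration; it is not asserted to be the encoding used by the invention, and the function names, the `env` dictionary and the block numbering are all hypothetical.

```python
def sum_to_n(n):
    """Original form: the loop structure reveals the intent directly."""
    total, i = 0, 1
    while i <= n:
        total += i
        i += 1
    return total


def sum_to_n_flat(n):
    """Same computation with the next block chosen by a data value."""
    env = {"n": n, "total": 0, "i": 1, "result": None}

    def test(e):
        # Next block index is computed arithmetically: 1 = body, 2 = done.
        return 1 + int(e["i"] > e["n"])

    def body(e):
        e["total"] += e["i"]
        e["i"] += 1
        return 0              # back to the test block

    def done(e):
        e["result"] = e["total"]
        return -1             # negative index stops the dispatcher

    table = [test, body, done]
    state = 0
    while state >= 0:         # the only control structure left is the dispatcher
        state = table[state](env)
    return env["result"]
```

Both functions return the same value for any non-negative `n`, but in the second the `while i <= n` structure no longer appears; the path taken depends on the value returned by each block.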
Another aspect of the invention is broadly defined as a method of increasing the obscurity and tamper-resistance of computer software code comprising the step of converting its control instructions from its original form, in which the stereotyped control structures provided by human limitations and the limited, fixed repertoire of high-level control facilities provided in a high-level software language reveal the semantic content and intent of the software code, into a new domain without any such corresponding high-level semantic structure, so that the control structure is divorced both from the original intent of the programmer, and from the forms of control structure easily understood by a programmer reading the code.
Another aspect of the invention is broadly defined as a method of increasing the tamper-resistance of computer software code comprising the steps of: adding fake-robust control transfers to the computer software code, to increase the tamper-resistance of the computer software code. An operation is fake-robust when it appears to operate normally in the presence of tampering, but in actual fact responds to tampering by performing some quite different, meaningless action, while not causing program execution to abort. In response to tampering, the fake-robust control transfers branch to spurious destinations with high probability, causing execution to wander off into persistent nonsensical behaviour.
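The behaviour described as fake-robust can be sketched, under strong simplifying assumptions, with a jump table whose index is derived from the code image itself. The table size, the checksum (a plain byte sum), the offset of 5 and all names below are hypothetical choices made for this illustration and are not taken from the invention.

```python
N = 16                      # table size (assumed for the example)
PRISTINE = b"limit=10"      # stand-in for the untampered code image


def accept():
    return "accepted"


def refuse():
    return "refused"


def decoy(k):
    # A spurious destination: it runs without error but does nothing useful.
    return lambda: "nonsense-%d" % k


# Every slot holds a valid destination, so no index can cause an abort.
TABLE = [decoy(k) for k in range(N)]
BASE = sum(PRISTINE) % N
TABLE[BASE] = refuse
TABLE[(BASE + 5) % N] = accept


def transfer(code: bytes, passed: bool) -> str:
    """Branch to a destination computed from the code image and a test result."""
    index = (sum(code) + 5 * int(passed)) % N
    return TABLE[index]()
```

With the pristine image, `transfer` reaches `accept` or `refuse` as intended. With the image altered to `b"limit=99"`, both outcomes land on decoy blocks: execution continues, but into meaningless behaviour. Because a byte sum can collide, a spurious destination is reached only with high probability, not with certainty.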
Another aspect of the invention is broadly defined as an apparatus for increasing the tamper-resistance of computer software code comprising: means for re-sorting assignments in the computer software code without changing the semantic operation of the computer software code; means for copying multiple different segments of the computer software code into new segments; and means for adding fake-robust control transfers to the new segments, to increase the tamper-resistance of the computer software code.
Another aspect of the invention is broadly defined as a computer readable memory medium, storing computer software code executable to perform the steps of: re-sorting assignments in said computer software code without changing the semantic operation of said program; copying multiple different overlapping segments of said computer software code into new segments; and adding fake-robust control transfers to said new segments, to increase the tamper-resistance of said computer software code.
Another aspect of the invention is broadly defined as a computer data signal embodied in a carrier wave, the computer data signal comprising a set of machine executable code being executable by a computer to perform the steps of: re-sorting assignments in the computer software code without changing the semantic operation of the computer software code; copying multiple different overlapping segments of the computer software code into new segments; and adding fake-robust control transfers to the new segments, to increase the tamper-resistance of the computer software code.