The invention relates to a method, a computer program and apparatus for processing a message with a computer.
There are many examples of computer systems in which it is useful to be able to analyse symbols passing through or stored in the computer system. As will be appreciated from the following, the term “symbols” is to be construed broadly in this context. It includes computer messages, a term which is itself to be construed broadly and which covers, for example, messages in a computer language (including computer instructions, such as executable programs) and natural languages in computer-readable form (such as in documents, emails, etc.). “Symbols” also includes computer data in the conventional sense, i.e., typically, abstractions of real world artefacts, etc.
In one example of computer systems in which it is useful to be able to analyse symbols passing through or stored in the computer system, third parties can attempt to take control of a computer by “attacking” the computer system. One class of attack carried out by third parties involves tampering with messages in a computer system. Such “attacks” can be carried out by exploiting the well-known buffer overflow vulnerabilities of some computer systems. In another example, hacking can take place by the third party sending commands to the computer system which are correctly structured in the context of the language of the computer system, but which are intended to cause the computer system to perform undesirable actions. Such actions include returning an error message that can be used by the third party for reconnaissance, returning inappropriate information to a third party, or allowing the third party to gain illegal access to the computer system. Attacks of this type on relational databases are well known and yet are difficult to defend against. Relational databases are widely used, for example by e-commerce and many other websites to hold user data (such as login name and password, address and credit card details, etc.).
In another example, it may be desirable to monitor computer symbols or messages to ensure that the computer system is being used properly. For example, in an organisation, a user may be using a computer system inappropriately, for instance for purposes for which the user is not authorised, even though the user does not intend this use to be an “attack” on the computer system as such.
In our co-pending US and European patent applications entitled “A method, A Computer Program and Apparatus for Analysing Symbols in a Computer” having application numbers U.S. Ser. No. 11/672,253 and EP-A-1,830,253, respectively, there is described and disclosed a method for analysing symbols in a computer system. The method and apparatus described in the aforementioned patent applications, referred to herein as “Efficient Grammatical Clustering” (“EGC”), provide a means to understand usage patterns based on messages entering (or leaving) computer systems. For example, EGC provides a means of recognising messages that are the different database commands entering a relational database system. This enables a baseline of normal behaviour to be determined. EGC enables all new messages, i.e. messages that have not been seen by the system previously, to be recognised so that a proactive device can determine whether a new message (command) should be allowed to pass to the database or not.
The EGC patent applications, the entire contents of which are hereby incorporated by reference, relate to a method by which a unique execution path of any instrumented computer program can be determined. Such an execution path provides the basis for generalisation of execution paths into semantically similar clusters. Each execution path can be assigned a unique cluster identifier.
In one embodiment of the EGC patent applications, grammatical clustering of messages which are sentences within a computer language is performed. In this case, the computer program is a parser and the execution path is generated by the operation of the parser on some input sentence messages (in the appropriate computer language). In the EGC method, parsing of the incoming messages is thus extremely important as it is the parsing of the messages that enables the respective execution paths to be determined from which the unique cluster identifier can be assigned to the message.
Referring to FIG. 1, the EGC method is briefly described. There is shown in FIG. 1 an example of a computer system 106 connected to a network 105. The computer system 106 has a computer resource 103 which might be, for example, a relational database. The computer system 106 makes its computer resource 103 available to applications 102 interacting directly or across the computer network 105 to support one or more users 101. The interaction with the computer resource 103 is mediated through a computer language via the transmission of messages MSG 104 within the message language. Such messages are an example of “symbols”, as mentioned above, within the computer system. A process 202 is provided operating within or on the computer system 106 to observe messages. The message and the intent of the message can be determined via another process 201 explained below.
Typically, the messages MSG 104 might be used to specify the desired operational behaviour of components in the computer system 106. Thus, messages are used between components within the computer system, and messages are used by users to gain access to the computer system 106. Computer languages are used to facilitate the use of messages in the computer system. Each computer language is defined by a grammar so that messages conform to a known syntax. The grammar of such languages is published so that software developers can ensure that the messages of the software conform to the correct syntax.
The grammar of the computer language of the messages that are to be analysed is defined, e.g. using first order logic. This may be carried out in a manner that is known per se. For example, the programming language Prolog can be used to describe the grammar of the language as a set of first order logic clauses. This logic is then applied initially to a set of training examples of messages. Such messages are defined so as to be correct syntactically in the context of the language and appropriate in the sense that they are messages that are deemed to be acceptable in the context of usage of the system around which the messages pass. When the logic is applied to the messages, the identity of the clauses along a successful path through the logic is noted. In this way, paths of acceptable messages being parsed via the logic are obtained. These paths can then be grouped according to similarity. In turn, the messages that follow the respective paths can be grouped according to similarity in this sense, so that patterns of similar messages can be discerned. This means that new messages, which are different from messages used in the training, can then be allocated to patterns of messages that are known to be acceptable, or rejected.
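By way of illustration only, the recording and grouping of paths described above might be sketched as follows. The sketch is written in Python rather than Prolog. The toy SELECT grammar, the clause identifiers and the names tokenize, PathRecordingParser, parse_path and group_by_path are all hypothetical and do not come from the EGC patent applications. Grouping by exact equality of paths is assumed here as the simplest possible notion of similarity.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: numbers, quoted strings, words, single characters.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|('[^']*')|(\w+)|(.))")


def tokenize(message):
    """Split a message into (kind, value) tokens; words are upper-cased."""
    tokens = []
    for number, string, word, other in TOKEN_RE.findall(message):
        if number:
            tokens.append(("NUM", number))
        elif string:
            tokens.append(("STR", string))
        elif word:
            tokens.append(("WORD", word.upper()))
        elif other.strip():
            tokens.append(("PUNCT", other))
    return tokens


class PathRecordingParser:
    """Toy recursive-descent parser that notes each grammar clause it uses."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.path = []  # identifiers of the clauses used, in order

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return (None, None)

    def expect(self, kind, value=None):
        tok_kind, tok_value = self.peek()
        if tok_kind != kind or (value is not None and tok_value != value):
            raise ValueError(f"unexpected token {tok_value!r}")
        self.pos += 1

    def select_stmt(self):
        self.path.append("select_stmt")
        self.expect("WORD", "SELECT")
        self.expect("WORD")  # column name
        self.expect("WORD", "FROM")
        self.expect("WORD")  # table name
        if self.peek() == ("WORD", "WHERE"):
            self.where_clause()
        if self.pos != len(self.tokens):
            raise ValueError("unexpected trailing input")

    def where_clause(self):
        self.path.append("where_clause")
        self.expect("WORD", "WHERE")
        self.condition()
        while self.peek() in (("WORD", "AND"), ("WORD", "OR")):
            self.path.append("connective_" + self.peek()[1].lower())
            self.pos += 1
            self.condition()

    def condition(self):
        self.path.append("condition")
        self.operand()
        self.expect("PUNCT", "=")
        self.operand()

    def operand(self):
        kind, _ = self.peek()
        clause = {
            "WORD": "operand_column",
            "NUM": "operand_number",
            "STR": "operand_string",
        }.get(kind)
        if clause is None:
            raise ValueError("expected an operand")
        self.path.append(clause)
        self.pos += 1


def parse_path(message):
    """Return the clause path of a successful parse; raise ValueError otherwise."""
    parser = PathRecordingParser(tokenize(message))
    parser.select_stmt()
    return tuple(parser.path)


def group_by_path(messages):
    """Group messages whose parses follow exactly the same clause path."""
    clusters = defaultdict(list)
    for message in messages:
        clusters[parse_path(message)].append(message)
    return dict(clusters)


if __name__ == "__main__":
    training = [
        "SELECT name FROM users WHERE id = 5",
        "SELECT email FROM accounts WHERE uid = 42",
        "SELECT name FROM users WHERE name = 'bob'",
    ]
    for path, members in group_by_path(training).items():
        print(" > ".join(path), members)
```

Under these assumptions, the first two training messages differ only in their names and numeric values, follow the same path and fall into one group, whereas the third message compares against a string and therefore follows a different path.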
The EGC method works by analysing symbols into patterns, so that new symbols can be analysed more efficiently than in other known techniques. This enables the EGC methodology to be implemented in real-time with relatively little computational overhead. In one example, the method is carried out on new symbols to determine whether the new symbols fit a pattern of data that is known or constitute a new pattern. Patterns may also be referred to as “clusters” as they represent a cluster of similar paths through the computer logic. In practice, if the new symbols fit a pattern that is known, then a decision will already have been made as to whether symbols fitting that known pattern are to be deemed acceptable or not. If the symbols constitute a new pattern, in practice a decision will already have been made as to what to do with symbols that constitute a new pattern, such as “always deem not acceptable” or “send error report”, etc.
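Purely as an illustrative sketch, the handling of known and new patterns described above might look as follows in Python. The names Action, pattern_of and PatternPolicy are hypothetical. The function pattern_of is only a crude stand-in that replaces literal values with placeholders; it is not the parser-derived cluster identifier of the EGC method and is assumed here solely to keep the example self-contained.

```python
import re
from enum import Enum


class Action(Enum):
    ALLOW = "deem acceptable"
    DENY = "always deem not acceptable"
    REPORT = "send error report"


def pattern_of(message):
    """Crude stand-in for a cluster identifier (an assumption, not EGC itself):
    quoted strings and whole numbers are replaced by '?' placeholders."""
    skeleton = re.sub(r"'[^']*'", "?", message)
    skeleton = re.sub(r"\b\d+\b", "?", skeleton)
    return " ".join(skeleton.upper().split())


class PatternPolicy:
    """Holds decisions made in advance for known patterns and for new ones."""

    def __init__(self, new_pattern_action=Action.DENY):
        self.decisions = {}  # known pattern -> decision already made
        self.new_pattern_action = new_pattern_action

    def decide(self, example_message, action):
        """Record the decision for the pattern that the example belongs to."""
        self.decisions[pattern_of(example_message)] = action

    def check(self, message):
        """Return (action, is_new_pattern) for an incoming message."""
        pattern = pattern_of(message)
        if pattern in self.decisions:
            return self.decisions[pattern], False
        return self.new_pattern_action, True


if __name__ == "__main__":
    policy = PatternPolicy(new_pattern_action=Action.DENY)
    policy.decide("SELECT name FROM users WHERE id = 5", Action.ALLOW)
    for msg in (
        "SELECT name FROM users WHERE id = 77",
        "SELECT name FROM users WHERE id = 5 OR 1 = 1",
    ):
        action, is_new = policy.check(msg)
        print(msg, "->", action.value, "(new pattern)" if is_new else "(known pattern)")
```

In this sketch, checking a message reduces to computing its pattern and performing one dictionary lookup, which is one way in which the low overhead mentioned above could arise once the patterns have been established.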
The EGC system and method are not concerned with generating new rules for new messages. Instead, they are concerned with determining patterns for computer messages. In one embodiment, the patterns that are obtained can then be considered, for example “manually” by a human user, to determine whether a computer system has been compromised.