The present invention relates to a method, a computer program and apparatus for analyzing symbols in a computer.
There are many examples of computer systems in which it is useful to be able to analyze symbols passing through or stored in the computer system. As will be appreciated from the following, the term “symbols” in this context is to be construed broadly and includes, for example, computer messages. The term “computer messages” is also to be construed broadly and includes, for example, messages in a computer language (including computer instructions, such as executable programs) and natural languages in computer-readable form (such as in documents, emails, etc.). “Symbols” also includes computer data in the conventional sense, i.e., typically, abstractions of real world artifacts, etc.
In one example of computer systems in which it is useful to be able to analyze symbols passing through or stored in the computer system, third parties can attempt to take control of a computer by “attacking” the computer system. Such “attacks” can be carried out by exploiting the well-known buffer overflow vulnerabilities of some computer systems. In another example, hacking can take place by the third party sending commands to the computer system in which the commands are correctly structured in the context of the language of the computer system, but which are intended to cause the computer system to perform undesirable actions including to return an error message that can be used by the third party for reconnaissance or to return inappropriate information to a third party or to gain illegal access to the computer system. Attacks of this type on SQL databases are well known and yet are difficult to defend against. SQL databases are widely used, and are used for example by e-commerce and many other websites to hold user data (such as login name and password, address and credit card details, etc.).
In another example, it may be desirable to monitor computer symbols or messages to ensure that the computer system is being used properly and not inappropriately. For example, in an organization, a user may be using a computer system for purposes for which the user is not authorised, even though the user does not intend that use to be an “attack” on the computer system as such.
In our co-pending US and European patent applications entitled “A method, A Computer Program and Apparatus for Analyzing Symbols in a Computer” having application numbers U.S. Ser. No. 11/672,253 and EP-A-1,830,253, respectively, there is described and disclosed a method for analyzing symbols in a computer system. The method and apparatus, referred to herein as “Efficient Grammatical Clustering” (“EGC”), described in the aforementioned patent applications, provides a means to understand usage patterns based on messages entering (or leaving) computer systems. For example, EGC provides a means of being able to recognise messages that are the different database commands entering a relational database system. This enables a baseline of normal behavior to be determined. EGC enables all new messages, i.e. messages that have not been seen by the system previously, to be recognised so that a proactive device can determine whether a new message (command) should be allowed to pass to the database or not.
The EGC patent applications, the entire contents of which are hereby incorporated by reference, relate to a method by which a unique execution path of any instrumented computer program can be determined. Such an execution path provides the basis for generalization of execution paths into semantically similar clusters. Each execution path can be assigned a unique cluster identifier.
In one embodiment of the EGC patent applications, grammatical clustering of messages which are sentences within a computer language is performed. In this case, the computer program is a parser and the execution path is generated by the operation of the parser on some input sentence messages (in the appropriate computer language). In the EGC method, parsing of the incoming messages is thus extremely important as it is the parsing of the messages that enables the respective execution paths to be determined from which the unique cluster identifier can be assigned to the message.
Referring to FIGS. 1 and 2, the EGC method is briefly described. There is shown in FIG. 1 an example of a computer system 106 connected to a network 105.
The computer system 106 has a computer resource 103 which might be, for example, an SQL database. The computer system 106 makes its computer resource 103 available to applications 102 interacting directly or across the computer network 105 to support one or more users 101. The interaction with the computer resource 103 is mediated through a computer language via the transmission of Messages MSG 104 within the message language. A process 202 is provided operating within or on the computer system 106 to observe messages. The message and the intent of the message can be determined via another process 201 explained below.
Typically, the messages MSG 104 might be used to specify the desired operational behavior of components in the computer system 106. Thus, messages are used between components within the computer system, and messages are used by users to gain access to the computer system 106. Computer languages are used to facilitate the use of messages in the computer system. Each computer language is defined by a grammar so that messages conform to a known syntax. The grammar of such languages is published so that software developers can ensure that the messages of the software conform to the correct syntax.
The grammar of the computer language of the messages that are to be analyzed is defined, e.g. using first order logic. This may be carried out in a manner that is known per se. For example, the programming language Prolog can be used to describe the grammar of the language in first order logic. This logic is then applied initially to a set of training examples of messages. Such messages are defined so as to be correct syntactically in the context of the language and appropriate in the sense that they are messages that are deemed to be acceptable in the context of usage of the system around which the messages pass. The logic contains clauses. When the logic is applied to the messages, the identity of the clauses along a successful path through the logic is noted.
In this way, paths of acceptable messages being parsed via the logic are obtained. These paths can then be grouped according to similarity. In turn, the messages that follow the respective paths can be grouped according to similarity in this sense, so that patterns of similar messages can be discerned. This means that new messages, which are different from messages used in the training, can then be allocated to patterns of messages that are known to be acceptable, or rejected.
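By way of illustration only, the noting of clause identities and the grouping of messages by path can be sketched as follows. The sketch uses Python rather than Prolog, and the toy grammar, the clause names and the functions clause_path and group_by_path are hypothetical and are not taken from the EGC patent applications.

```python
from collections import defaultdict


def clause_path(message):
    """Parse a toy 'SELECT cols FROM table [WHERE col = literal]' message.

    Returns the tuple of (hypothetical) clause names used on a successful
    parse, or None if the message does not conform to the toy grammar.
    """
    toks = message.split()
    if len(toks) < 4 or toks[0].upper() != "SELECT":
        return None
    path = ["query"]
    # Note which column clause succeeded, not which columns were named.
    path.append("columns_star" if toks[1] == "*" else "columns_named")
    if toks[2].upper() != "FROM":
        return None
    path.append("from_table")
    rest = toks[4:]
    if not rest:
        path.append("no_where")
    elif len(rest) == 4 and rest[0].upper() == "WHERE" and rest[2] == "=":
        # The literal value itself is not part of the path.
        path.append("where_equals_literal")
    else:
        return None
    return tuple(path)


def group_by_path(messages):
    """Group messages that follow the same path; key None holds rejects."""
    groups = defaultdict(list)
    for message in messages:
        groups[clause_path(message)].append(message)
    return dict(groups)
```

In this sketch, two messages that differ only in a literal value follow the same path and so fall into the same group, whereas a message that does not parse is collected under the key None.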
The EGC method works by analyzing symbols into patterns, so that new symbols can be analyzed more efficiently than in other known techniques. This enables the EGC methodology to be implemented in real-time with relatively little computational overhead. In one example, the method is carried out on new symbols to determine whether the new symbols fit a pattern of data that is known or constitute a new pattern. Patterns may also be referred to as “clusters” as they represent a cluster of similar paths through the computer logic. In practice, if the new symbols fit a pattern that is known, then a decision will already have been made as to whether symbols fitting that known pattern are to be deemed acceptable or not. If the symbols constitute a new pattern, in practice a decision will have been made as to how such symbols are to be handled, for example “always deem not acceptable” or “send error report”, etc.
The EGC system and method is not concerned with generating new rules for new messages. Instead, it is concerned with determining patterns for computer messages. In one embodiment, the patterns that are obtained can then be considered, for example “manually” by a human user, to determine whether a computer system has been compromised.
Referring to FIG. 2, there is shown a simplified schematic flow chart for the process 201 by which messages are classified or clustered, using the EGC method, in dependence on semantic intent of the messages. Messages MSG 104 received by a computer are clustered using the EGC process 401 which produces a classification MSG CLASSIFICATION 402 of the message. The message classifications are stored, along with a copy of the respective messages, in a message store MSG STORE 403. As well as the message, other attributes about the message can be included in the message store. For example, these attributes could include, amongst others: the date and time the message was received; the username or application name that sent the message; network addressing information about the source and destination of the message; etc.
The EGC system works well. In particular, by analyzing the symbols into patterns, new symbols can be analyzed more efficiently than in previous known techniques, which makes it possible to implement the method in real-time with relatively little computational overhead. However, although the EGC system does work well, a full parse of each message is needed, which can be computationally intensive. Indeed, the process typically involves:
1) lexical analysis of a received message, in which the message is tokenised;
2) parsing of the tokenised message through a grammar, e.g. an instrumented grammar;
3) extracting the summarised execution path; and finally
4) mapping the extracted summarised execution path to a unique cluster identifier.
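The four steps listed above can be sketched, purely as an illustrative assumption, as a small pipeline. The regular-expression tokenizer, the ClusterRegistry class, the classify function and the stand-in parser toy_parse are all hypothetical. The manner in which the execution path is summarised is not specified here, so the sketch simply uses the tuple of clause names as the summary.

```python
import re

# Hypothetical token pattern: quoted strings, numbers, identifiers, or any
# single non-space punctuation character.
TOKEN_RE = re.compile(r"'[^']*'|\d+|[A-Za-z_]\w*|[^\s\w]")


def tokenize(message):
    """Step 1: lexical analysis of the message into a list of tokens."""
    return TOKEN_RE.findall(message)


def toy_parse(tokens):
    """Step 2 (stand-in): return the clause names used, or None on failure."""
    if not tokens or tokens[0].upper() != "SELECT":
        return None
    path = ["select_stmt"]
    if any(t.upper() == "WHERE" for t in tokens):
        path.append("where_clause")
    return path


class ClusterRegistry:
    """Step 4: maps each distinct summarised path to one cluster identifier."""

    def __init__(self):
        self._ids = {}

    def cluster_id(self, summary):
        if summary not in self._ids:
            self._ids[summary] = len(self._ids) + 1
        return self._ids[summary]


def classify(message, parse, registry):
    """Run all four steps; returns a cluster identifier or None."""
    tokens = tokenize(message)            # step 1
    full_path = parse(tokens)             # step 2
    if full_path is None:
        return None
    summary = tuple(full_path)            # step 3 (assumed trivial summary)
    return registry.cluster_id(summary)   # step 4
```

Every call to classify in this sketch repeats steps 1 to 4 in full, which is the per-message cost referred to above.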
With reference to FIG. 2, for each message MSG 104, the Clustering Process 401 provides a unique classification MSG CLASSIFICATION 402 of the semantic intent of the message. This uniqueness allows syntactically different messages to be classified in the same way because their class of semantic intent is identical.
In the context of a computer resource that is a relational database, the messages are received at the computer resource in the language of Structured Query Language (SQL). As examples, the unique message classification 402 for 7 specific messages is shown in FIG. 3. Performing a full parse of each message MSG 104 through the instrumented grammar so as to determine the semantic intent of the message can be extremely computationally intensive. Therefore, as data rates and volumes of processed traffic and data increase, a method to reduce the computational intensity whilst still providing the required or desired performance levels and accurate determining of semantic intent is sought.
According to a first aspect of the present invention, there is provided a computer-implemented method of analyzing symbols in a computer system, the symbols conforming to a specification for the symbols, in which the specification has been codified into a set of computer-readable rules; and, the symbols analyzed using the computer-readable rules to obtain patterns of the symbols by determining the path that is taken by the symbols through the rules that successfully terminates, and grouping the symbols according to said paths, the computer-implemented method comprising:
upon receipt of a message at a computer, performing a lexical analysis of the message; and,
in dependence on the lexical analysis of the message, assigning the message to one of the groups identified according to said paths.
The invention provides a method by which the repeated full execution of a parser is rendered unnecessary and replaced by a more efficient process that determines the appropriate cluster identifier to associate with a message. A lexical analysis is performed on a received message and, in dependence on this, the message may be successfully allocated to an appropriate cluster. Thus, an efficient and quick method is provided by which a message may be allocated to the appropriate message cluster. As compared to the basic EGC method, quicker and more efficient message allocation may be achieved since the assignment is determined in dependence on the lexical analysis alone.
In an embodiment, in the step of performing a lexical analysis of the message a sequence of tokens is generated corresponding to the message. It is preferred that in dependence on the sequence of tokens, a message digest is assigned to the message, the message digest corresponding to the said one of the groups.
Preferably, the tokens are tokens that are directly related to some of the tokens that would have been used in a full parse of the message. Preferably, the token sequence is a syntactic sequence, thereby enabling semantic grouping of the messages based on message syntax.
Thus, the tokens are not the full language based tokens that would normally be generated by a tokenizer. A “Selective Message Digest” (SMD) of an SMD token sequence produced at the lexical analysis stage may be used to allocate new messages to clusters. Thus, a full parse of a received message is not required for every message and so the process is quicker and more efficient than the EGC method described above.
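A minimal sketch of generating an SMD token sequence and a corresponding digest is given below. The particular selection rule is an assumption made for illustration: literal values are replaced by a placeholder for their type, and keywords and identifiers are folded to upper case. The names smd_tokens and smd_value, the token pattern and the choice of SHA-256 are likewise hypothetical.

```python
import hashlib
import re

# Hypothetical token pattern: quoted strings, numbers, identifiers, or any
# single non-space punctuation character.
TOKEN_RE = re.compile(r"'[^']*'|\d+|[A-Za-z_]\w*|[^\s\w]")


def smd_tokens(message):
    """Return the generalised (SMD) token sequence for a message."""
    out = []
    for tok in TOKEN_RE.findall(message):
        if tok[0] == "'":
            out.append("<STR>")        # string literal: keep only its form
        elif tok[0].isdigit():
            out.append("<NUM>")        # numeric literal: keep only its form
        elif tok[0].isalpha() or tok[0] == "_":
            out.append(tok.upper())    # keyword or identifier (assumed
                                       # case-insensitive)
        else:
            out.append(tok)            # punctuation and operators
    return out


def smd_value(message):
    """Digest of the SMD token sequence, usable as a lookup key."""
    joined = "\x1f".join(smd_tokens(message))  # unit separator between tokens
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Under these assumptions, two messages that differ only in a literal value yield the same SMD value, so that the second message can be allocated to the cluster of the first without being parsed.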
An embodiment of the invention provides a method of going directly from a lexical analysis phase to the cluster identifier without the computational complexity of a full parse of the message. In the case where the full parse has already been performed and the cluster identifier determined, then any repeat parse which would have generated such a cluster identifier can be determined solely by the SMD value generated at the lexical analysis stage. Embodiments of the invention provide a simple and robust method by which the beneficial effects of the EGC patent applications, discussed above, can be achieved in a significantly more computationally efficient and quick manner.
In embodiments, the invention provides a method whereby the repeated full execution of a parser is rendered unnecessary for previously processed message types and replaced by a much more efficient process that determines the appropriate cluster identifier to associate with a message by the “Selective Message Digest” (SMD) of the SMD token sequence produced at the lexical analysis stage.
Preferably, the token sequence is a syntactic sequence, thereby enabling semantic grouping of the messages based on message syntax. This enables a subsequent syntactic token sequence that has the same semantic grouping to be quickly identified.
Preferably, a message digest is calculated for a token sequence using a method selected from the group consisting of: shift and rotate on the entire token sequence; SHA family or Message-Digest algorithm 5 (MD5) algorithms on the entire token sequence; and an interleaved Message Digest method integrated into the tokenization process. Thus, commonly available and reliable algorithms may be used to generate the message digest.
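Two of the digest options named above can be sketched as follows, assuming that each token has already been mapped to a small non-negative integer code. The rotation amount of 5 bits, the 32-bit width and the function names rotl32, rotate_digest and hash_digest are illustrative assumptions and not a statement of the method actually used.

```python
import hashlib

MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit value left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def rotate_digest(token_codes):
    """Shift-and-rotate digest over the entire sequence of token codes."""
    h = 0
    for code in token_codes:
        # Rotating before mixing makes the result depend on token order.
        h = rotl32(h, 5) ^ (code & MASK32)
    return h


def hash_digest(token_codes, algorithm="sha256"):
    """SHA-family (or similar) digest over the entire token sequence.

    Each code is serialised as 4 big-endian bytes, so codes must fit in
    32 bits.
    """
    data = b"".join(code.to_bytes(4, "big") for code in token_codes)
    return hashlib.new(algorithm, data).hexdigest()
```

The rotate-based digest is cheap but offers no collision resistance, whereas a SHA-family digest costs more per message and makes accidental collisions between different token sequences very unlikely. Another algorithm name, such as "md5", may be passed to hash_digest where the local hashlib build supports it.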
In one example, not all tokens of a standard input language are used to create the message identifier. This provides the advantage that it is possible to group messages that are syntactically different into the same message cluster.
Preferably, the computer system includes a computer resource and the messages are directed to the computer resource. The method enables a determination to be made as to whether or not usage of the computer resource is changing at a semantic level.
In some cases, the messages directed to the computer system are attempts to inappropriately utilise the computer resource. In such cases, the method enables a system administrator or manager effectively to utilise detection to stop inappropriate use of the resource.
In a preferred embodiment, the computer system includes a computer resource and the messages are sequences of machine instructions that are about to run through a micro-processor within the computer system. Thus, the method can be used to detect buffer overflow exploits.
In one example, the computer resource is a relational database and messages are submitted in a language such as Structured Query Language. The method can therefore be used to detect and monitor inappropriate access and database attack techniques such as SQL injection.
Preferably, the process is performed progressively in that tokens are formed progressively as the message is received. This enables effective use of RAM to be made and enables the method to operate with messages that are non-grammatical structures.
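Progressive operation can be sketched as a tokenizer that accepts the message in arbitrary chunks and folds each completed token into a running digest, so that only the current partial token is held in memory. For brevity the sketch assumes whitespace-delimited tokens. The class ProgressiveDigest and the helper digest_of are hypothetical names.

```python
import hashlib


class ProgressiveDigest:
    """Consumes a message chunk by chunk and digests tokens as they complete."""

    def __init__(self):
        self._hash = hashlib.sha256()
        self._partial = []  # characters of the token currently being read

    def feed(self, chunk):
        """Process the next chunk of the message as it is received."""
        for ch in chunk:
            if ch.isspace():
                self._emit()
            else:
                self._partial.append(ch)

    def _emit(self):
        """Fold a completed token, in generalised form, into the digest."""
        if not self._partial:
            return
        token = "".join(self._partial)
        self._partial.clear()
        form = "<NUM>" if token.isdigit() else token.upper()
        self._hash.update(form.encode("utf-8") + b"\x1f")

    def finish(self):
        """Flush any trailing token and return the digest as hex."""
        self._emit()
        return self._hash.hexdigest()


def digest_of(chunks):
    """Convenience wrapper: digest a message supplied as a list of chunks."""
    progressive = ProgressiveDigest()
    for chunk in chunks:
        progressive.feed(chunk)
    return progressive.finish()
```

The resulting digest does not depend on where the chunk boundaries fall, since a token split across two chunks is only emitted once its terminating whitespace, or the end of the message, is reached.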
In an embodiment, upon receipt of a message a check is made as to whether or not there already exists an identified group for messages with the token sequence of the received message and, if there is, the message is assigned to the identified group. In such cases it is possible or probable that actions will have been identified for performance upon recognition of a message belonging to a particular cluster. Accordingly, the method enables the appropriate action to be identified and triggered quickly and efficiently.
In one embodiment, if it is ascertained that there is not already an existing identified group for a received message a full parse of the message is performed and a group is established for the message and for subsequent messages having the same token sequence. Thus, a full parse may be done only when it is necessary, thus increasing efficiency as unnecessary parsing may be avoided.
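The check for an existing group, with a full parse performed only when no group is yet known, can be sketched as a lookup table keyed on the digest of the token sequence. The class SmdClusterCache and the stand-in functions smd_key and toy_full_parse are hypothetical. In practice the full parse would be the EGC parse described above.

```python
def smd_key(message):
    """Stand-in SMD: whitespace tokens, numbers generalised, case folded."""
    return " ".join(
        "<NUM>" if tok.isdigit() else tok.upper() for tok in message.split()
    )


def toy_full_parse(message):
    """Stand-in for the expensive full parse; returns a cluster identifier."""
    words = [w.upper() for w in message.split()]
    if not words or words[0] != "SELECT":
        return 0  # hypothetical cluster for unrecognised messages
    return 1 if "WHERE" in words else 2


class SmdClusterCache:
    """Assigns messages to clusters, parsing only unseen token sequences."""

    def __init__(self, smd, full_parse):
        self._smd = smd
        self._full_parse = full_parse
        self._known = {}        # SMD value -> cluster identifier
        self.full_parses = 0    # number of full parses actually performed

    def classify(self, message):
        key = self._smd(message)
        if key in self._known:
            return self._known[key]          # fast path: no parse needed
        self.full_parses += 1
        cluster = self._full_parse(message)  # slow path: first sighting only
        self._known[key] = cluster
        return cluster
```

In this sketch a second message with the same token sequence is classified by a single dictionary lookup, and the counter full_parses shows how many messages actually required a full parse.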
Preferably, the method comprises generating, e.g. automatically, statistical data in respect of the groups to which messages belong. The generation of such statistical data is useful since it can be used for applications such as accounting, charging, monitoring performance and detecting inefficiencies and the like.
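As a simple illustration, statistical data of this kind might be accumulated from the stream of cluster identifiers assigned to incoming messages. The function cluster_statistics and the choice of count and share as the reported figures are assumptions made for the purposes of the sketch.

```python
from collections import Counter


def cluster_statistics(cluster_ids):
    """Return, per cluster, the message count and its share of all messages."""
    counts = Counter(cluster_ids)
    total = sum(counts.values())
    return {
        cluster: {"count": n, "share": n / total}
        for cluster, n in counts.items()
    }
```

For example, given the identifiers 1, 1, 2 and 1, the sketch reports three messages (a share of 0.75) for cluster 1 and one message (a share of 0.25) for cluster 2.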
According to a second aspect of the present invention, there is provided a computer program arranged such that when run on a computer it causes the computer to perform the method of the first aspect of the present invention.
According to a third aspect of the present invention, there is provided a computer programmed to carry out a method according to the first aspect of the present invention.
According to a fourth aspect of the present invention, there is provided a computer-implemented method of analyzing symbols in a computer system, the symbols conforming to a specification for the symbols, in which the specification has been codified into a set of computer-readable rules; and, the symbols analyzed using the computer-readable rules to obtain patterns of the symbols by determining the path that is taken by the symbols through the rules that successfully terminates, and grouping the symbols according to said paths, the method comprising: upon receipt of a message at a computer, performing a lexical analysis of the message to generate one or more tokens corresponding to the message; and, in dependence on the sequence of tokens, assigning the message to one of the groups identified according to said paths.
From a mathematical perspective, selective message digest (SMD), in which a message digest is performed based on a sequence of tokens in which it is the form of the tokens that is considered, and EGC as described above are hierarchical abstractions between the raw message, i.e. the textual sequence of characters, and the actual semantics or meaning of the message (in the context of its operating environment).
For example, starting from the sequence of characters and moving in an order in which each step represents a generalization, the following stages might be included:
1) A {sequence of characters}
2) A {sequence of tokens}
3) A {sequence of SMD tokens}
4) A {SMD Value}
5) One or more {SMD value} maps to a {Cluster ID}
At each level of generalization in the hierarchy specific detail is lost, whilst managing to track the “essence”, “intent”, or “motive” of the original message. This “essence” can be recognised in another non-identical message. The computer or computer program can thus easily be trained to respond to this other non-identical message in the same way that has been specified for the previously observed message. SMD is a much finer-detail method of tracking message “essence” than that provided by EGC.
For the SMD calculation it is only necessary that the tokenization succeeds. Even if the message is not valid (with respect to the language) an SMD value will still be generated. This property is particularly useful in the case in which the EGC grammar is deficient and the message is valid but not recognised (by parsing) by the grammar and in the case in which the message is not valid and correctly determined as such, but a (buggy) system sending the invalid message continues to do so and an action needs to be taken for each of these invalid messages. One typical action might be blocking the message arriving at the resource. In the context of an SQL Database, if the database receives the message it will attempt to process the message, consuming resources needlessly, eventually failing and returning an error to the sender. Another typical action might be translating the invalid message into a valid message. Both cases require the next invalid message to be determined to be of a form that has already been observed so that the appropriate action can be taken. SMD can be utilized to do just this.
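The handling of repeated invalid messages can be sketched as a table of actions keyed on SMD value. The action names "block" and "parse", the class ActionTable and the simplified smd_value function (whitespace tokens, numbers generalised, case folded, SHA-256) are hypothetical and are chosen only to show that a digest is obtained even for a message that would not parse.

```python
import hashlib


def smd_value(message):
    """Simplified SMD: needs only tokenization, not a valid message."""
    forms = ["<NUM>" if tok.isdigit() else tok.upper() for tok in message.split()]
    return hashlib.sha256(" ".join(forms).encode("utf-8")).hexdigest()


class ActionTable:
    """Maps the SMD value of a previously observed message to an action."""

    def __init__(self, default="parse"):
        self._actions = {}
        self._default = default

    def register(self, example_message, action):
        """Record the action to take for messages of this observed form."""
        self._actions[smd_value(example_message)] = action

    def action_for(self, message):
        """Return the recorded action, or the default for unseen forms."""
        return self._actions.get(smd_value(message), self._default)
```

In this sketch, once an action such as "block" has been registered for one malformed message, for example "SELEC * FORM users WHERE id = 5", a later message of the same form with a different literal value resolves to the same action without reaching the resource.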
Accordingly, in a fifth aspect, the invention provides a computer-implemented method of analyzing symbols in a computer, the method comprising:
upon receipt of a message at a computer, performing a lexical analysis of the message to determine tokens for the message; and,
selecting a form of the tokens to generate a sequence of generalised tokens representative of the message.
A computer program is also provided, the computer program being arranged such that when run on a computer it causes the computer to perform the method of the fifth aspect of the present invention.
A method is thus provided that offers a simple and robust way of processing messages when tokenization has succeeded, irrespective of whether or not the message is valid.
Preferably, a message digest is assigned to the sequence of generalised tokens. Thus, a simple and robust method is provided by which a selectively tokenised message can be assigned a means of identification and/or qualification.
Thus, the present method provides that even if a message is not valid (with respect to the language) an SMD value will still be generated. As mentioned above, this property is particularly useful in the case in which the EGC grammar is deficient and the message is valid but not recognised (by parsing) by the grammar and in the case in which the message is not valid and correctly determined as such, but a (buggy) system sending the invalid message continues to do so and an action needs to be taken for each of these invalid messages.