Machines communicate with each other through a wide variety of network protocols. At any given time, a vast number of protocols are in use on any given network, and this number is magnified by the variety of implementations and versions of each protocol. To give a general idea of the scope of the issue, a quick survey of five minutes of traffic for two buildings at the same company yielded more than 50 unique protocols.
This diversity of protocols creates a huge headache for any network or systems engineer responsible for understanding the traffic. For instance, if the network engineer is interested only in knowing what filenames are being transported via a file-sharing application, perhaps to check for illicit activity, the engineer must find the specification for that protocol and then write code to decode the packets according to the specification. If no such specification exists, the protocol must somehow be reverse engineered.
While there are tools like Ethereal that are hardwired to understand multiple protocols, there are so many variations and sub-protocols, each evolving across versions, that it is difficult for such a tool to cover them all and to stay up to date. At best, the network engineer's code would work for some time and then require updating whenever the protocol changed in any respect.
Furthermore, even where the protocol is known, extracting a particular field of interest can be far from trivial. For instance, even relatively simple protocols have variable-length fields, such as strings, which means the field of interest can lie at an arbitrary offset within the packet flow (e.g., a TCP flow). Extracting such a field therefore requires knowledge of the protocol structure and analysis of other fields to determine where the variable-length field begins and ends.
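By way of illustration only, the difficulty described above can be sketched with a small Python example. The message layout used here (a one-byte type, a length-prefixed user name, then a length-prefixed filename) is a hypothetical protocol invented for this sketch and does not correspond to any real file-sharing protocol; the function names are likewise assumptions. The sketch shows that the filename cannot be read from a fixed position, because its offset depends on the length of the field that precedes it.

```python
import struct

# Hypothetical message layout (assumed for illustration, not a real protocol):
#   1 byte   message type
#   1 byte   user name length (U)
#   U bytes  user name (UTF-8)
#   2 bytes  filename length (F), big-endian
#   F bytes  filename (UTF-8)

def build_message(msg_type: int, username: str, filename: str) -> bytes:
    """Encode a message in the hypothetical layout above."""
    user = username.encode("utf-8")
    name = filename.encode("utf-8")
    return (struct.pack("!BB", msg_type, len(user)) + user
            + struct.pack("!H", len(name)) + name)

def filename_offset(payload: bytes) -> int:
    """Return the byte offset at which the filename starts."""
    if len(payload) < 2:
        raise ValueError("payload too short for header")
    _msg_type, user_len = struct.unpack_from("!BB", payload, 0)
    # The filename length prefix sits after the variable-length user name.
    length_pos = 2 + user_len
    if len(payload) < length_pos + 2:
        raise ValueError("payload truncated before filename length")
    return length_pos + 2

def extract_filename(payload: bytes) -> str:
    """Walk the preceding fields, then slice out the filename."""
    start = filename_offset(payload)
    (name_len,) = struct.unpack_from("!H", payload, start - 2)
    if len(payload) < start + name_len:
        raise ValueError("payload truncated inside filename")
    return payload[start:start + name_len].decode("utf-8")

short_user = build_message(1, "al", "report.txt")
long_user = build_message(1, "alexandra", "report.txt")
print(filename_offset(short_user), extract_filename(short_user))
print(filename_offset(long_user), extract_filename(long_user))
```

In this sketch the same filename begins at offset 6 in the first message and at offset 13 in the second, solely because the user name differs in length. Even for so simple a layout, the extraction code must encode the structure of every field that precedes the one of interest.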
To date, there are some systems that analyze and parse network protocols with varying degrees of automation, but they are adequate only in limited circumstances. For instance, at least one framework exists that specifies and parses network protocols via hand-written grammars. While the hand-written grammars allow for precise recovery of all protocol elements, the downside is that such a framework requires significant manual effort to precisely describe the entire protocol. While a grammar is sometimes an improvement over creating entirely custom code, given the number and variety of data and fields for which a network engineer may be held responsible, such an approach is too costly.
In addition, there have been fully automatic approaches that model server responses for various protocols to fool attackers into attempting to exploit vulnerabilities. In a similar vein are systems that classify packet streams into protocols based on learned Markov models. While these more automatic methods are able to fool attackers and identify protocols, respectively, the models they learn are not sufficiently specific to extract arbitrary fields, and they are not protocol-agnostic.
These and other deficiencies of the state of the art in automatic analysis and extraction of data from network traffic will become apparent from the description below of the various exemplary non-limiting embodiments of the invention.