The approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
The Hypertext Transfer Protocol (HTTP) is standardized as RFC 2616 by the Internet Engineering Task Force (IETF) and is transported over the TCP/IP stack (Transmission Control Protocol/Internet Protocol).
HTTP is used for implementing numerous services. Indeed, more and more applications run within a web browser and their communications are based on HTTP.
One advantage of HTTP is its simplicity: the protocol supports a small number of request methods, and most applications use only two or three of them (mainly the methods called GET and POST).
Many applications make use of HTTP as a session protocol to convey different types of media such as simple text files, office documents, and audio and video files.
In what follows, files or data streams transported by HTTP will be referred to as HTTP contents.
According to the HTTP protocol, the HTTP content is inserted into an HTTP body part (or payload) of an HTTP message, and an HTTP header part contains control information of the HTTP message.
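By way of illustration only, the separation between the HTTP header part and the HTTP body part can be sketched as follows. This is a minimal sketch, assuming a complete HTTP/1.1 message held in memory; the function name `split_http_message` and the sample response are hypothetical and are introduced here purely for the example.

```python
def split_http_message(raw: bytes):
    """Split a raw HTTP/1.1 message into (start_line, headers, body).

    Assumption: the whole message is already available in memory and the
    header part is terminated by an empty line (CRLF CRLF).
    """
    # The first empty line separates the header part from the body part.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header field names are case-insensitive, so normalise to lower case.
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


# Hypothetical response used only for this example.
raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
start_line, headers, body = split_http_message(raw_response)
```

In this sketch, `headers` holds the control information of the message, while `body` holds the HTTP content.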
Prior to the transmission of the HTTP message, the HTTP content can be compressed or encrypted by the applications in order to reduce its size or to secure the transmission.
For instance, audio and video media are compressed by means of audio/video codecs. Similarly, an archive file (for instance in zip format, rar format, etc.) contains a set of compressed files.
In order to improve the efficiency of the HTTP protocol, some mechanisms such as persistent connections and pipelining have been specified in the HTTP protocol standard.
A persistent connection consists in keeping open the TCP connection that carries the HTTP session between an HTTP client and an HTTP server after the completion of an HTTP request (that is, after reception of the HTTP response from the server). The HTTP client may then send another HTTP request on the same TCP connection.
HTTP pipelining consists in sending several HTTP requests from an HTTP client to an HTTP server over a single TCP connection without waiting for the reception of the corresponding HTTP responses.
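As a non-limiting illustration of pipelining, the byte stream that a client could write on a single TCP connection may be sketched as follows. The helper `build_get`, the host name `example.com` and the paths are assumptions made for the example; no network access is performed, and the sketch only shows that several requests are placed back to back before any response is read.

```python
def build_get(path: str, host: str) -> bytes:
    """Build a minimal HTTP/1.1 GET request (hypothetical helper)."""
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n".encode("ascii")


# Three requests are concatenated and would be written on the same TCP
# connection without waiting for the corresponding responses.
pipelined = b"".join(build_get(p, "example.com") for p in ("/a", "/b", "/c"))

# A receiver reading this stream sees one request per empty-line delimiter
# (valid here because these GET requests carry no body).
requests = [r for r in pipelined.split(b"\r\n\r\n") if r]
```

With a persistent connection but without pipelining, the client would instead send each request only after the previous response has been received.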
Malicious applications such as malware, Trojans or Remote Administration Tools (RATs) also often use HTTP as a carrier protocol for communication between an infected machine and a Command and Control (C&C) server.
These malicious applications may use HTTP to carry stolen information and files, and prior to the transport, they can also carry out compression and/or encryption of the file in order to obfuscate the communication.
In that case, the data stream cannot be decrypted by an offline process if the encryption key is not known, except by applying a brute-force method.
Usually, malicious applications make use of basic obfuscation methods relying on scrambling codes such as XOR ciphering. However, in some cases, they can apply standard encryption such as AES (Advanced Encryption Standard) or 3DES (Triple Data Encryption Standard). In these cases, it may be necessary to identify in real time what payload is exchanged between the HTTP client and the HTTP server.
This requires that the suspicious contents be quickly analysed by a traffic analyser located between the client and the server. Indeed, the analysis of the system is preferably performed on the client before the encryption key is erased from the memory of the transmitter or receiver system by the malicious application.
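For illustration, a basic XOR scrambling of the kind referred to above may be sketched as follows. This is a simplified sketch under stated assumptions: the function name `xor_scramble`, the one-byte key and the sample data are hypothetical and do not reproduce any particular malicious application.

```python
from itertools import cycle


def xor_scramble(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data with a repeating key (hypothetical helper)."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


plaintext = b"GET /index.html HTTP/1.1"   # sample data for the example
key = b"\x5a"                             # assumed one-byte key

scrambled = xor_scramble(plaintext, key)
# Applying the same operation with the same key restores the original data.
restored = xor_scramble(scrambled, key)
```

Since the same operation both scrambles and restores the data, and since short keys leave only a small key space to explore, such scrambling offers far weaker protection than standard encryption such as AES or 3DES.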
According to some methods, pattern matching is used to classify a file. For example, the well-known Unix™ utility named “file” is based on pattern matching and uses the libmagic library to output the application related to a given file.
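By way of example, pattern matching on the leading bytes of a file may be sketched as follows. This is a deliberately simplified sketch and not the implementation of the "file" utility or of libmagic; the table `SIGNATURES` and the function `classify` are hypothetical names, and only a few widely documented signatures are listed.

```python
# A few well-known leading byte patterns ("magic numbers").
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"PK\x03\x04", "zip"),
    (b"Rar!\x1a\x07", "rar"),
    (b"%PDF-", "pdf"),
    (b"\x1f\x8b", "gzip"),
]


def classify(data: bytes) -> str:
    """Return a type label based on the first bytes of data."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


# Usage with in-memory samples standing in for stored files.
zip_label = classify(b"PK\x03\x04" + b"\x00" * 16)
pdf_label = classify(b"%PDF-1.4\n")
other_label = classify(b"plain text")
```

This sketch assumes that the complete beginning of the file is available at once, as is the case for a file stored on a device.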
However, such methods are applied to binary files, for example files stored on a device, and cannot be performed in real time on data streams communicated between a server and an online client.
There is a need to analyse in real time data streams (such as HTTP contents) carried over a telecommunications network and to classify them into different groups (or types) so as to carry out further analysis on data belonging to a given group or to some given groups.