(1) Field of the Invention
This invention relates to detection of items stored in a computer system, and particularly, but not exclusively, to detection of unwanted items introduced by steganography.
(2) Description of the Art
Steganography may be defined as the covert concealment of information, in the form of unwanted computer code, within data in a carrier file such as an image file. The intention is that the existence of such information cannot be detected without some further information which is secret. Steganography differs from cryptography. Cryptography can be described as being concerned with encrypting information so that an attacker cannot decipher it without some secret knowledge that it is hoped the attacker does not possess. Cryptography is therefore not necessarily concerned with secrecy regarding the existence of a message.
Steganography will normally be implemented with a large carrier file, so that only relatively minor perturbations are created in the carrier file by the introduction of secret information. These perturbations are small compared with the carrier document's apparent randomness. Many designers of existing steganographic systems use image files because of their large size. Known techniques include changing the least significant bit (LSB) of bitmap files or changing the LSB of some coefficients of JPEG files.
LSB changes are insignificant to human eyes, and so visual inspection will fail to detect steganographic information if the steganographic process is well designed.
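The LSB technique described above can be illustrated with a minimal sketch. It assumes the carrier is a flat sequence of 8-bit samples (for example, raw bitmap pixel values); the function names `embed_lsb` and `extract_lsb` are hypothetical and are not taken from any particular steganographic tool.

```python
def embed_lsb(carrier: bytes, payload: bytes) -> bytes:
    """Overwrite the LSB of successive carrier bytes with the payload bits."""
    # Expand the payload into individual bits, most significant bit first.
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    if len(bits) > len(carrier):
        raise ValueError("carrier too small for payload")
    out = bytearray(carrier)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit  # clear the LSB, then set it to the payload bit
    return bytes(out)


def extract_lsb(stego: bytes, n_bytes: int) -> bytes:
    """Rebuild n_bytes of payload from the LSBs of the first 8 * n_bytes samples."""
    out = bytearray()
    for i in range(n_bytes):
        value = 0
        for sample in stego[8 * i: 8 * i + 8]:
            value = (value << 1) | (sample & 1)
        out.append(value)
    return bytes(out)


carrier = bytes(range(256))  # stand-in for raw 8-bit pixel samples (assumption)
stego = embed_lsb(carrier, b"hi")
# No sample changes by more than one intensity level.
max_change = max(abs(a - b) for a, b in zip(carrier, stego))
```

Because each sample changes by at most one level out of 256, the modification is of the kind that visual inspection would not be expected to reveal.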
However, LSBs are not truly random and they should show some statistical properties. Conceptually, a designer of a steganographic process may adopt some defensive techniques to avoid detection. The designer may compress data for input to the steganographic process, which tends to decrease the size of data to be hidden using steganographic techniques. Alternatively, the designer can encrypt the data, which will tend to remove any pattern from it. Additionally, if a user only desires to hide a small amount of data, the steganographic process can be designed so that hidden data perturbs only a few bits of a carrier file in which it is inserted. A further concealment technique is to process the image so that its statistics match those of the original carrier file. Consequently, the statistics of the carrier file will be almost unchanged and it becomes unlikely that the presence of the steganographic modifications can be detected mathematically.
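As a rough illustration of the compression defence, the sketch below compares how many carrier LSBs would be overwritten with and without compression, assuming one payload bit per carrier byte. The sample payload is invented for the example, and `lsb_slots_needed` is a hypothetical helper.

```python
import zlib


def lsb_slots_needed(data: bytes) -> int:
    """Carrier bytes whose LSB would be overwritten, at one payload bit per byte."""
    return 8 * len(data)


# Invented, highly repetitive sample payload (assumption for illustration only).
payload = b"quarterly sales figures: confidential\n" * 40
compressed = zlib.compress(payload, 9)

print(lsb_slots_needed(payload), lsb_slots_needed(compressed))
```

The reduction depends entirely on how compressible the payload is; data that is already compressed or encrypted would shrink little, if at all.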
A subverted employee of an organisation, or software masquerading as such an employee, may use steganography to try to pass sensitive information from inside the organisation's secure logical perimeter to an agent (human or software) outside that perimeter. Conversely, to protect an organisation's security a firewall or other barrier device can be provided to detect and potentially prohibit export of sensitive information. It may also be suspected that steganography is being used to conceal the transfer of sensitive information.
Products are available which attempt to detect instances of steganography, and they rely on statistical properties of images. To assess the effectiveness of these products, tests were carried out using them with a number of data samples: the samples were digitised images in which information was hidden using a range of steganographic tools freely available from the Internet and based on publicly known principles. The samples included one which was an intentionally poor example of steganography. With most data samples, the tests showed that available steganography detection products give too high a false positive rate when their sensitivity settings are set to give an acceptable false negative rate. Here “false positive” means apparent detection of steganography where none exists, and “false negative” means failure to detect an actual instance of steganography. This demonstrates the weakness of techniques that rely on statistical properties of images.
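The trade-off between the two error rates can be made concrete with a small sketch. It assumes a detector that outputs a numeric score per image and flags the image when the score reaches a threshold; the scores below are invented for illustration and are not results from the tests described above.

```python
def error_rates(clean_scores, stego_scores, threshold):
    """Return (false positive rate, false negative rate) at a given threshold."""
    # False positive: a clean image whose score reaches the threshold.
    false_pos = sum(s >= threshold for s in clean_scores) / len(clean_scores)
    # False negative: a stego image whose score stays below the threshold.
    false_neg = sum(s < threshold for s in stego_scores) / len(stego_scores)
    return false_pos, false_neg


# Invented detector scores (assumption): higher means "more likely stego".
clean_scores = [0.10, 0.35, 0.40, 0.55, 0.60]
stego_scores = [0.30, 0.45, 0.50, 0.70, 0.90]

for threshold in (0.3, 0.7):
    print(threshold, error_rates(clean_scores, stego_scores, threshold))
```

With overlapping score distributions such as these, lowering the threshold far enough to catch every stego sample also flags most of the clean samples, which is the behaviour the tests attributed to the available products.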
It is known to detect unwanted information in the form of viruses in computer systems using some characteristic or signature that in each case the virus leaves in software or data it has attacked. U.S. Pat. No. 5,649,095 to Cozza discloses detection of a virus from virus-induced change in length of an affected file. Published International Application No. WO 02/103533 mentions use of a signature to detect a virus or other malicious code, but does not disclose how a signature is created.
EP 0896285 A1 makes reference to the use of signatures to detect viruses. It uses signatures to try to increase the chances of detecting both an original virus and variants thereof. WO 02/103533 uses similar techniques to spot malicious software and other so-called “malware”. These techniques suffer from the problem that the signature has to be originated by human intervention. That is, once software has been identified as being malicious, a human has to decide upon an appropriate signature for use by antiviral software.
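Signature-based detection of this general kind can be sketched as a search for known byte strings. The signature table below is hypothetical and stands in for entries a human analyst would have chosen; it does not reproduce the method of any of the cited documents.

```python
# Hypothetical, hand-chosen signatures (assumption): name -> characteristic byte string.
KNOWN_SIGNATURES = {
    "example-virus-a": bytes.fromhex("deadbeef0102"),
    "example-virus-b": b"\x90\x90\xeb\xfe",
}


def scan(data: bytes) -> list:
    """Return the names of all known signatures that occur anywhere in data."""
    return [name for name, signature in KNOWN_SIGNATURES.items() if signature in data]


infected = b"\x00" * 4 + bytes.fromhex("deadbeef0102") + b"\x00"
print(scan(infected))
print(scan(b"harmless"))
```

The scanner itself is trivial; the difficulty noted above lies in deciding which byte strings belong in the table.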
U.S. Pat. No. 5,452,442 to Kephart discloses extraction of virus signatures from source material by an automatic procedure. The procedure is relatively complex: a computer system implementing it executes the following steps:
    a) obtain virus samples;
    b) perform two filtering operations to remove from the virus samples all but invariant virus code from which signatures will be obtained;
    c) obtain a corpus of programs in common use on the relevant platform (hardware-operating system combination);
    d) calculate exact and partially matching probabilities for candidate signatures using the corpus of programs;
    e) combine the exact and partially matching probabilities to obtain an overall score for each candidate signature;
    f) select a threshold for comparison with candidate signature scores by (i) segregating the corpus of software into probe, training and test sets, (ii) using the probe set to provide trial signatures (byte strings), (iii) using the training set to estimate probabilities of trial signatures, (iv) counting trial signature frequencies in the test set, (v) producing lists of estimated probabilities versus frequency, (vi) determining false positive probabilities, and (vii) determining a threshold having a sufficiently low false positive probability but achieved by an acceptable proportion of trial signatures; and
    g) reject candidate signatures with overall scores which fail to achieve the threshold.
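The general shape of steps a) to g) can be conveyed by a heavily simplified sketch. It is not the procedure of U.S. Pat. No. 5,452,442: it uses exact matching only, replaces the probability estimates with a raw fraction of corpus programs containing the candidate, and takes the threshold as a given parameter rather than deriving it from probe, training and test sets. All names and sample data are hypothetical.

```python
def candidate_substrings(sample: bytes, length: int) -> set:
    """All byte strings of the given length that occur in one virus sample."""
    return {sample[i:i + length] for i in range(len(sample) - length + 1)}


def select_signatures(virus_samples, corpus, length, max_fp_rate):
    """Keep byte strings common to every sample that are rare in ordinary programs."""
    # Crude stand-in for step b): keep only strings present in all samples.
    common = set.intersection(*(candidate_substrings(s, length) for s in virus_samples))
    selected = []
    for candidate in sorted(common):
        # Crude stand-in for steps d) and e): fraction of corpus programs containing it.
        fp_rate = sum(candidate in program for program in corpus) / len(corpus)
        # Stand-in for step g): reject candidates that exceed the allowed rate.
        if fp_rate <= max_fp_rate:
            selected.append((candidate, fp_rate))
    return selected


# Invented samples and corpus (assumption for illustration only).
virus_samples = [b"AAAA\x90\x90EVILCODE\x00", b"BB\x90\x90EVILCODEZZ"]
corpus = [b"print hello \x90\x90EV", b"ordinary program", b"another CODE block"]
selected = select_signatures(virus_samples, corpus, length=4, max_fp_rate=0.0)
chosen = [candidate for candidate, _ in selected]
```

Even in this reduced form, the outcome depends on the choice of corpus and threshold, which is where much of the complexity of the full procedure arises.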