Digital data can in general be regarded as a sequence of bits. A fingerprint derived from the digital data can therefore be regarded as a unique or nearly unique description of the digital data. Such a fingerprint can, for example, be derived by applying a hash function to the digital data, whereby the fingerprint corresponds to the resulting hash value, which provides a unique or nearly unique description of the input data.
A fingerprint derived from a bit sequence of digital data can be used in many applications, e.g., for comparing the determined fingerprint with other fingerprints that are stored, for example, in a database. If the fingerprint matches one of the fingerprints in the database, then both fingerprints can be assumed, with high probability, to have been derived from the same digital data.
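By way of illustration only, deriving a fingerprint with a hash function and comparing it against stored fingerprints might be sketched as follows. The choice of SHA-256, the in-memory set standing in for a database, and the names `fingerprint`, `known_fingerprints` and `is_known` are assumptions made for this sketch and are not taken from any particular system.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return a hex digest that serves as the fingerprint of a bit sequence."""
    # SHA-256 is an assumed choice; any suitable hash function could be used.
    return hashlib.sha256(data).hexdigest()

# Hypothetical stand-in for a database of previously stored fingerprints.
known_fingerprints = {
    fingerprint(b"known message A"),
    fingerprint(b"known message B"),
}

def is_known(data: bytes) -> bool:
    """Report whether the fingerprint of `data` matches a stored fingerprint."""
    return fingerprint(data) in known_fingerprints
```

A match indicates, up to the negligible chance of a hash collision, that the same bit sequence was fingerprinted before.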
There are, however, applications in which simple hash techniques for determining a fingerprint of a bit sequence of digital data are rarely effective. Spam emails, for example, might contain one or more embedded images in which the spam message is rendered as text. The terms “spam” and “spam email” relate to unsolicited communication and in particular to unsolicited commercial emails. Because most spam images contain random variations and distortions, each image produces a different hash value, so that hash techniques are of little use for identifying spam email in this case.
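The sensitivity of a simple hash to small variations can be illustrated with a minimal sketch. The byte string below is a hypothetical stand-in for raw image data, and flipping a single bit is an assumed, simplified model of the random variations found in spam images.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hex digest of the input bytes (SHA-256 is an assumed choice of hash)."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical stand-in for raw image bytes; a real image would be far larger.
original = bytes(range(256)) * 4

# Flip the lowest bit of one byte, mimicking a single pixel of random noise.
altered_buffer = bytearray(original)
altered_buffer[100] ^= 0x01
altered = bytes(altered_buffer)

digest_a = sha256_hex(original)
digest_b = sha256_hex(altered)

# Number of byte positions at which the two inputs differ.
differing_bytes = sum(a != b for a, b in zip(original, altered))
```

Although the two inputs differ in a single byte, the two digests differ, so a lookup based on exact fingerprint equality would not relate the altered image to the original.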
U.S. patent application Ser. No. 2005/0216564 A1 discloses a method and apparatus for analysis of emails that contain images, e.g., in order to determine whether or not a received electronic mail is a spam email. One or more regions of an image embedded in the email are detected and pre-processing techniques are applied to locate regions, e.g., blocks or lines, of text in the images that may be distorted. The regions of text are then analyzed in order to determine whether the content of the text indicates that the received email is a spam email. Specialized extraction and rectification of embedded text followed by optical character recognition processing is applied to the regions of text to extract the content therefrom. Alternatively, text recognition or shape-matching processing is applied to detect the presence or absence of spam-indicative words from the regions of text. According to a further alternative described in the above mentioned document, other attributes of extracted text regions, such as size, location, color and complexity are used to build evidence for or against the presence of spam.
The method disclosed in the above-mentioned document is, however, not suitable for an email processing environment in which high email throughput is required. The reason is that the employed optical character recognition (OCR) techniques are computationally very expensive and are therefore poorly suited to such environments. Additionally, OCR analysis is relatively easy to circumvent, for example by altering the size and style of the text in the embedded image, or by writing the text in irregular patterns rather than straight lines.
Other techniques for analyzing image data make use of color and spatial information contained in the image to extract a set of features that can be compared against a database of stored image features.
For example, Gavrielides et al. describe in the document “Color-Based Descriptors for Image Fingerprinting,” IEEE Transactions on Multimedia, volume 8, no. 4, August 2006, pages 740-748, an image fingerprinting system which aims to extract unique and robust image descriptors. The image fingerprinting system consists mainly of two parts: fingerprint extraction and fingerprint matching. In the first part, a descriptor is extracted from each image and is used to create an indexed database. In the second part, the index for an image (query image) is compared to the indices of the rest of the database (target images), using a similarity measure to determine close matches between the query image and target images. The fingerprint extraction procedure involves the quantization of the image colors and the calculation of color histograms based on the resulting colors.
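The general idea of color quantization followed by histogram comparison can be sketched as follows. This is a generic illustration and not the descriptor of Gavrielides et al.; the uniform quantization into four levels per channel, the use of histogram intersection as the similarity measure, and all function names and pixel values are assumptions made for this sketch.

```python
from collections import Counter

def quantize(pixel, levels=4):
    """Map an (r, g, b) pixel with 0-255 channels to a coarse color bin."""
    step = 256 // levels
    r, g, b = pixel
    return (r // step, g // step, b // step)

def color_histogram(pixels, levels=4):
    """Normalized histogram over the quantized colors of an image."""
    counts = Counter(quantize(p, levels) for p in pixels)
    total = len(pixels)
    return {color_bin: n / total for color_bin, n in counts.items()}

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; identical histograms give 1.0."""
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())

# Hypothetical images given as flat lists of pixels.
query = [(250, 10, 10)] * 90 + [(10, 10, 250)] * 10
noisy = [(245, 12, 8)] * 90 + [(12, 9, 247)] * 10  # slight per-pixel noise
other = [(10, 250, 10)] * 100                       # unrelated image

sim_noisy = histogram_intersection(color_histogram(query), color_histogram(noisy))
sim_other = histogram_intersection(color_histogram(query), color_histogram(other))
```

In this sketch the slightly perturbed image falls into the same color bins as the query image and is therefore rated as a close match, whereas the unrelated image shares no bins with it. Pixel values close to a bin boundary could fall into different bins under noise, which is one limitation of such a coarse, assumed quantization.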
The more sophisticated techniques often involve image analysis that is too expensive to perform in an email processing environment that requires high email throughput. Additionally, these techniques are liable to produce misclassification rates that are considered high in an email filtering environment.
It is one object of the invention to provide an improved method of generating a fingerprint from a bit sequence which might relate to a bit sequence derived from an embedded image of an email. It is a further object of the invention to provide an improved system for generating such a fingerprint.