With the rapid growth of digital documentation services and document management systems, organizations increasingly store important, confidential, and secure information in the form of digital documents. Unauthorized dissemination of this information, whether accidental or deliberate, presents serious security risks to these organizations. It is therefore imperative that organizations protect such secure information, and that they detect and react to any disclosure of the secure information (or derivatives thereof) beyond the perimeters of the organization.
Additionally, organizations face the challenge of categorizing and maintaining a large corpus of digital information spread across potentially thousands of data stores, content management systems, end-user desktops, etc. It is therefore important for an organization to be able to store concise and lightweight fingerprints corresponding to its vast amounts of image data.
Many organizations store sensitive data in the form of digital images. Image data is susceptible to being transformed from its original form to a derivative form. Examples of transformations that produce derivative image data include image file format conversion (e.g., changing a BMP image format to a JPEG image format, etc.), cropping the original image, altering dimensions of the original image, changing the scale and/or orientation of the original image, rotating the image by an angle, etc.
It is therefore critical to the organization's security to be able to identify derivative forms of the secure image data and to identify any unauthorized disclosure of even such derivative forms. Any system or method built to prevent unauthorized disclosure would accordingly have to address at least two conflicting challenges: detecting derivative forms of the image data reliably, and keeping the stored fingerprints concise and lightweight.
One method to detect derivative image data is to sample features across the entire original image, record the values of the sampled features, and perform a nearest neighbor search of the sampled features. The nearest neighbors on the original image are compared against the nearest neighbors of the image being inspected to detect similarities. In one example of this prior art method, a histogram of RGB pixel values is generated for the entire original image and compared against a histogram of RGB pixel values generated for the entire image to be inspected. If the two histograms are sufficiently similar, a match is detected. However, this entire-image approach is not suitable for partial image matches (e.g., when the image to be inspected is only a portion of the original image), and does not handle several types of transformations in a derivative image. For example, cropping an image in half drastically changes its global characteristics, so the cropped image will escape detection when compared against the original image.
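The weakness of the entire-image approach can be illustrated with a minimal sketch. It is not taken from any particular prior art system: the function names, the bin count, the synthetic test image, and the use of histogram intersection as the similarity measure are all assumptions chosen for illustration.

```python
import numpy as np

def rgb_histogram(image, bins=8):
    """Normalized per-channel histogram of an H x W x 3 uint8 image."""
    hists = [
        np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ]
    hist = np.concatenate(hists).astype(np.float64)
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means the two histograms are identical."""
    return float(np.minimum(h1, h2).sum())

# Hypothetical original image: dark left half, bright right half.
original = np.zeros((64, 64, 3), dtype=np.uint8)
original[:, 32:] = 200

# Derivative image: the original cropped to its right half.
cropped = original[:, 32:]

full_score = histogram_intersection(rgb_histogram(original), rgb_histogram(original))
crop_score = histogram_intersection(rgb_histogram(original), rgb_histogram(cropped))
print(full_score, crop_score)
```

In this synthetic case the cropped image scores only about half the similarity of an exact copy, even though every one of its pixels comes from the original. A global histogram match with a strict threshold would therefore miss it.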
Other methods operate on local regions of the image, improving the ability to detect derivative image data. These methods typically proceed in two steps. In the first step, distinct features (hereinafter “feature points”) are identified within the image. The feature points are identified by locating edges or corners within the image. Other algorithmic approaches may also be employed to identify feature points. Examples of such algorithmic approaches include Harris detection, Moravec detection, Shi and Tomasi detection, Harris-Laplace detection, FAST, SIFT, etc.
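As an illustration of the first step, the following is a simplified sketch of a Harris-style corner detector. The box-shaped summation window (the usual formulation uses a Gaussian window), the constant k = 0.05, the threshold ratio, and all function names are assumptions made for this example.

```python
import numpy as np

def box_filter(a, radius):
    """Sum of each (2r+1) x (2r+1) neighborhood, using zero padding."""
    size = 2 * radius + 1
    padded = np.pad(a, radius, mode="constant")
    out = np.zeros(a.shape, dtype=np.float64)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out

def harris_response(gray, radius=1, k=0.05):
    """Corner response: large and positive at corners, negative on edges."""
    gray = gray.astype(np.float64)
    iy, ix = np.gradient(gray)           # derivatives along rows, columns
    sxx = box_filter(ix * ix, radius)
    syy = box_filter(iy * iy, radius)
    sxy = box_filter(ix * iy, radius)
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace * trace

def feature_points(gray, threshold_ratio=0.5):
    """Return (row, col) positions whose response exceeds a fraction of the maximum."""
    response = harris_response(gray)
    ys, xs = np.nonzero(response > threshold_ratio * response.max())
    return list(zip(ys.tolist(), xs.tolist()))

# Hypothetical test image: a bright square on a dark background.
image = np.zeros((20, 20))
image[5:15, 5:15] = 1.0
points = feature_points(image)
print(points)
```

On this test image the detected points cluster around the four corners of the square, while positions along its straight edges are rejected because the gradient there varies in only one direction.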
In the second step, descriptors are computed by examining the regions surrounding the feature points. The descriptors are recorded and searched to correlate derived regions within the image. Examples of the descriptor methods include creating a histogram, employing SIFT, using steerable filters, or using moment invariants.
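The histogram option for the second step can be sketched as a gradient-orientation histogram computed over the patch surrounding a feature point. The patch radius, the bin count, the normalization, and the function names below are illustrative assumptions. Unlike SIFT, this simplified descriptor is not rotation invariant.

```python
import numpy as np

def orientation_histogram(gray, point, radius=4, bins=8):
    """Magnitude-weighted histogram of gradient directions around a point."""
    y, x = point
    patch = gray[y - radius:y + radius + 1, x - radius:x + radius + 1]
    gy, gx = np.gradient(patch.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % (2 * np.pi)
    hist, _ = np.histogram(
        angle, bins=bins, range=(0, 2 * np.pi), weights=magnitude
    )
    norm = np.linalg.norm(hist)
    # Normalizing makes the descriptor insensitive to contrast changes.
    return hist / norm if norm > 0 else hist

def descriptor_distance(d1, d2):
    """Euclidean distance between two descriptors; smaller means more alike."""
    return float(np.linalg.norm(d1 - d2))

# Hypothetical test images: a vertical edge, a contrast-adjusted copy of it,
# and an unrelated horizontal edge.
vertical = np.zeros((21, 21))
vertical[:, 10:] = 1.0
adjusted = 0.5 * vertical + 0.2
horizontal = np.zeros((21, 21))
horizontal[10:, :] = 1.0

d_vertical = orientation_histogram(vertical, (10, 10))
d_adjusted = orientation_histogram(adjusted, (10, 10))
d_horizontal = orientation_histogram(horizontal, (10, 10))
print(descriptor_distance(d_vertical, d_adjusted))
print(descriptor_distance(d_vertical, d_horizontal))
```

The contrast-adjusted copy yields essentially the same descriptor as the original region, whereas the unrelated region yields a clearly different one. This is the property that allows derived regions to be correlated by searching the recorded descriptors.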
However, this approach also suffers from several disadvantages. The first disadvantage is that the descriptors are large and therefore require substantial storage. The cost of storing and maintaining these descriptors grows with the amount of digital information an organization intends to protect. Additionally, comparing feature points involves searching through a high-dimensional space, making this approach computationally slow.
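The scale of these costs can be made concrete with a rough estimate. A standard SIFT descriptor has 128 dimensions. The storage width per value, the number of feature points per image, and the number of protected images used below are hypothetical figures chosen only to show the arithmetic, and the brute-force search is a sketch rather than a description of any particular system.

```python
import numpy as np

DESCRIPTOR_DIM = 128          # dimensionality of a standard SIFT descriptor
BYTES_PER_VALUE = 4           # assumption: each value stored as a 32-bit float
POINTS_PER_IMAGE = 1_000      # assumption: hypothetical feature points per image
IMAGES = 1_000_000            # assumption: hypothetical number of protected images

def storage_bytes(images, points_per_image, dim, bytes_per_value):
    """Raw descriptor storage, ignoring any index or metadata overhead."""
    return images * points_per_image * dim * bytes_per_value

total = storage_bytes(IMAGES, POINTS_PER_IMAGE, DESCRIPTOR_DIM, BYTES_PER_VALUE)
print(total / 1e9, "GB")      # 512.0 GB under the assumptions above

def nearest_neighbor(query, database):
    """Brute-force search: touches every stored value for each query."""
    distances = np.linalg.norm(database - query, axis=1)
    index = int(np.argmin(distances))
    return index, float(distances[index])

# Small random database standing in for stored descriptors.
rng = np.random.default_rng(0)
database = rng.random((5000, DESCRIPTOR_DIM), dtype=np.float32)
query = database[1234] + 0.001        # slightly perturbed copy of one entry
index, distance = nearest_neighbor(query, database)
print(index, distance)
```

Each brute-force query costs time proportional to the number of stored descriptors multiplied by their dimensionality, and an inspected image contributes one such query per feature point.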