The leakage of proprietary and/or confidential data is a continuing problem for organizations such as corporations, governments and universities. Ubiquitous remote network access to an organization's computers increases productivity and is convenient, but at the same time creates ever greater challenges for protecting the data from being accessed by unauthorized parties such as competitors or criminals. Leakage of enterprise data can result both from intentional activity by unscrupulous employees and from unintentional but negligent actions of employees who do not follow robust security procedures.
Organizations lack visibility into the access and flow of sensitive documents and information. Administrators lack tools for tracking data access and usage. Tracking the access and flow of enterprise data and preventing leakage are more difficult than ever. Yet, organizations rightly want to limit the access and use of confidential data according to an enterprise-level information control policy.
Some technologies for tracking access and flow of enterprise data compare strings of text to a database of defined information or types of information. However, these technologies do not extend to circumstances where sensitive information is contained in an image.
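As an illustration only, the text-matching approach described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular product: the names `DEFINED_STRINGS`, `DEFINED_PATTERNS`, and `scan_text`, as well as the sample entries, are hypothetical and chosen purely for the example.

```python
import re

# Hypothetical "database" of defined information (exact strings) and
# defined types of information (patterns). All entries are illustrative.
DEFINED_STRINGS = {"Project Falcon", "internal use only"}
DEFINED_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
}


def scan_text(text: str) -> list[str]:
    """Return labels for every defined string or pattern found in text."""
    hits = []
    lowered = text.lower()
    # Case-insensitive comparison against the defined strings.
    for s in sorted(DEFINED_STRINGS):
        if s.lower() in lowered:
            hits.append(f"string:{s}")
    # Regular-expression comparison against the defined types of information.
    for name, pattern in DEFINED_PATTERNS.items():
        if pattern.search(text):
            hits.append(f"pattern:{name}")
    return hits


if __name__ == "__main__":
    print(scan_text("SSN 123-45-6789 appears in the Project Falcon memo"))
```

A scanner of this kind operates only on character data. When the same sensitive text is rendered as pixels in an image, there are no characters to compare, so the scan has nothing to match.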
Conventional data loss prevention (DLP) solutions have relied on traditional optical character recognition (OCR) technologies to determine whether an image contains sensitive information. However, OCR is not suitable for the high computational efficiency requirements of DLP systems, which may have to scan high volumes of data with minimal impact on transmission latency. Furthermore, conventional OCR technologies are limited in their ability to capture, process, and analyze complicated images. For instance, OCR technology ignores uniquely identifying image features, such as faces, logos, and graphics, and can easily be confused by such features and by irregular text, thereby leading to unacceptable inefficiencies, false positives, and/or false negatives. Lastly, OCR technology is very sensitive to external parameters such as illumination, perspective, noise and scale variations in the image.
Generic image classification technologies, such as automatic image tagging (e.g., as used in image search tools), are also not suitable for the high computational efficiency and accuracy requirements of data loss prevention. In particular, generic image classification techniques may attempt to identify arbitrary objects based on their appearance, regardless of whether these objects contain personally identifiable information. For instance, a generic image classification engine may spend computational resources trying to detect whether an image contains a picture of an animal or a building, thus wasting time and resources in a way that does not help find personally identifiable information.
It would be desirable to address these issues.