Organizations collect and store large volumes of data in various formats and locations. To access, analyze and protect their data quickly and efficiently, organizations must implement good data classification systems. Organizing data into effective classification systems often relies on detecting areas where files form groups with similar properties. Identifying similarities between files spread across multiple memory resources in one or more devices can be very challenging in the high-dimensional space that arises when modeling datasets with many attributes. In high-dimensional spaces, all objects appear sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient. This problem is known as the curse of dimensionality. As the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse, making it difficult to search.
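By way of illustration only, the sparsity effect described above can be observed numerically. The following minimal sketch assumes points drawn uniformly at random from a unit hypercube and uses Euclidean distance; the function name relative_contrast and its parameters are hypothetical and are not part of the disclosed subject matter. It measures how much farther the farthest point is than the nearest point, relative to the nearest distance.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest for random points around a query.

    Assumption: points are uniform in the unit hypercube [0, 1]^dim and
    distances are Euclidean. A small value means the nearest and farthest
    points are almost equally far away, so distance stops discriminating.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

Under these assumptions the contrast shrinks as the number of dimensions grows, which is one way of seeing why neighbor-based grouping becomes less effective for files described by many attributes.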
In particular, in an enterprise environment, where a large corpus of files is stored on many devices (e.g., file servers, SANs, NAS devices, laptops, desktops, mobile devices, flash drives and external drives), detecting similar files is a challenging problem. Compounding the problem, when dealing with files of non-trivial size, the number of characteristics, such as file content hashes, keywords and regular expression attributes, becomes large fairly quickly. Pairwise comparison of the files is not practical because computing a complete set of similar matches takes time on the order of n^2, i.e., O(n^2), where n is the number of files.
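The quadratic cost noted above can be made concrete with a short sketch. It is illustrative only: it assumes each file is represented by a set of characteristics (for example, content hashes or keywords) and uses Jaccard similarity as a stand-in similarity measure; the names jaccard, similar_pairs and pair_count are hypothetical. The sketch counts the comparisons that an exhaustive pairwise approach performs, which is n(n-1)/2 for n files.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similar_pairs(files, threshold):
    """Compare every pair of files; return the matches and the comparison count.

    `files` maps a file name to its set of characteristics.
    """
    comparisons = 0
    matches = []
    for (name_a, feats_a), (name_b, feats_b) in combinations(files.items(), 2):
        comparisons += 1
        if jaccard(feats_a, feats_b) >= threshold:
            matches.append((name_a, name_b))
    return matches, comparisons


def pair_count(n):
    """Number of unordered pairs among n files: n * (n - 1) / 2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    files = {
        "a.txt": {"h1", "h2"},
        "b.txt": {"h1", "h2", "h3"},
        "c.txt": {"h9"},
    }
    print(similar_pairs(files, threshold=0.5))
    print(pair_count(1_000_000))
```

For three files the exhaustive approach performs three comparisons, but for one million files it would perform 499,999,500,000, which illustrates why an approach that avoids comparing every pair is desirable.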
As a result, there is a general need for an improved system and method for identifying similarities among files stored in various formats on devices across many locations. There is also a need for an improved system to identify and protect sensitive data. Embodiments of the disclosed subject matter are directed to detecting similarities between files stored in various formats and on various devices, dispersed over many locations. Further embodiments of the disclosed subject matter are directed to identifying and protecting sensitive data.