Currently, sensitive personal information (also referred to as personally identifiable information) is detected in a storage device in near real-time via textual processing of the files in a file system for the storage device. Sensitive personal information, as used herein, refers to information (e.g., financial and health information, social security numbers, data about children, geolocation data) that can be used on its own or with other information to identify, contact, or locate a single person, or to identify an individual in context. A file is technically defined as an ordered set of characters implemented over a block device. A file, then, is a collection of extents, with each extent corresponding to a contiguous set of blocks obtained from a block device (also referred to as a logical unit number (LUN)). A block is a contiguous set of bits or bytes that forms an identifiable unit of data. Since sensitive personal information is detected in the storage device in near real-time via textual processing of the files at the file layer, all of the blocks associated with those files are also processed. As a result, a large quantity of blocks is processed in order to detect sensitive personal information in the storage device in near real-time. Detection is nonetheless performed at the file layer because the end user (or other consumer) is usually only interested in that level of granularity of detection.
Hence, the common approach detects sensitive personal information on a file-by-file basis, which leads to large amounts of data being processed even for a small change within a file, such as a change confined to a single block. Processing such a large quantity of blocks consumes an inordinate amount of computing resources. Furthermore, it increases the time required to detect sensitive personal information.
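The relationship between files, extents, and blocks described above, and the cost of scanning at the file layer, can be illustrated with a minimal sketch. The sketch is only an illustration under stated assumptions: the names Extent, File, and blocks_scanned_at_file_layer, the 4096-byte block size, and the example block numbers are all hypothetical and do not come from any particular file system or product.

```python
from __future__ import annotations

from dataclasses import dataclass

BLOCK_SIZE = 4096  # assumed block size in bytes (hypothetical)


@dataclass(frozen=True)
class Extent:
    """A contiguous run of blocks on a block device (LUN)."""
    start_block: int  # number of the first block in the run
    block_count: int  # how many contiguous blocks the run covers

    def blocks(self) -> range:
        return range(self.start_block, self.start_block + self.block_count)


@dataclass
class File:
    """A file modeled as an ordered collection of extents."""
    name: str
    extents: list[Extent]

    def blocks(self) -> list[int]:
        # Flatten the extents into the ordered list of block numbers.
        return [b for extent in self.extents for b in extent.blocks()]


def blocks_scanned_at_file_layer(file: File, changed_blocks: set[int]) -> int:
    """Count the blocks read when detection runs file by file.

    If any block belonging to the file has changed, the whole file is
    processed again, so every one of its blocks is read.
    """
    file_blocks = file.blocks()
    if changed_blocks.intersection(file_blocks):
        return len(file_blocks)
    return 0


# Hypothetical file made of three extents (16 blocks in total).
report = File("report.txt", [Extent(100, 8), Extent(500, 4), Extent(900, 4)])

# A small change touches a single block, yet all 16 blocks are rescanned.
scanned = blocks_scanned_at_file_layer(report, {502})
print(scanned, "blocks,", scanned * BLOCK_SIZE, "bytes")
```

In this hypothetical case, a change confined to block 502 causes all 16 blocks of the file (65,536 bytes at the assumed block size) to be processed, which is the inefficiency described above.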