Organizations store large amounts of data, for example, as files in file systems. The files are data sets that are typically owned by a single user. The data owner may have full control over the data set. However, other users may also have varying levels of control over the data set, including read access, write access, delete control, create control, modify control, list folder content control, read and execute control, special control, etc. Distinguishing the data owner from the other users who share the data set is important in cases such as security remediation, data migration, and compliance.
Security remediation takes place when data is compromised, for example by deletion or overwriting. When recovery occurs, an administrator needs to quickly and accurately discover who owns that data. This can be difficult because many users within the organization may have some level of access to the data, and many of them may access the data on a regular basis. It may not be immediately clear who the file owner is.
In addition, during data migration, an administrator may want to move data from one location to another. For example, if data has not been accessed in a long time, the administrator may want to move the data from expensive high performance storage to less expensive low performance storage. However, before moving the data, the administrator will want to notify the data owner of the change and/or get approval from the data owner. Again, the data owner must be determined. Furthermore, for data compliance, administrators may need to identify data owners during administrative activities and the execution of other programs.
A data owner can be identified by manually inspecting the records in the access logs and access control logs. Unfortunately, these logs contain a tremendous amount of data, especially when there are many users. This amount of information can be overwhelming, making it very difficult for an administrator to manually correlate the logs and conclusively identify the data owner.
In one conventional method, a data owner is identified based on the total number of accesses to the file. In effect, the user with the highest number of accesses is automatically recommended as the data owner. However, owner identification based only on the total number of accesses can lead to a high number of false positives. For example, a user with only read access may access the data many times a day, while the data owner may access the data only once a week. In this case, an administrator relying on the highest number of accesses would incorrectly identify the frequent reader as the owner of the data.
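The false positive described above can be illustrated with a minimal sketch. The log format, the user names, and the function name recommend_owner_by_access_count are hypothetical and chosen only for illustration; they are not taken from any particular system. The sketch assumes one week of access records, in which a read-only user reads the file five times a day and the actual owner writes to it once.

```python
from collections import Counter

# Hypothetical access log for one file over one week.
# Each record is a (user, operation) pair; names are illustrative only.
access_log = (
    [("reader_user", "read")] * (5 * 7)   # read-only user: 5 reads per day
    + [("owner_user", "write")]           # actual owner: 1 write per week
)


def recommend_owner_by_access_count(log):
    """Return the user with the most accesses, or None for an empty log.

    This mirrors the count-only approach: the operation type is ignored
    and only the total number of accesses per user is considered.
    """
    counts = Counter(user for user, _operation in log)
    if not counts:
        return None
    user, _count = counts.most_common(1)[0]
    return user


recommended = recommend_owner_by_access_count(access_log)
print(recommended)
```

Under these assumed records, the count-only approach recommends the read-only user rather than the actual owner, because the type of access plays no part in the ranking.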