Companies can store a tremendous amount of end-user data. For example, end-user data can include, but is not limited to, address information, credit card information, photographs, e-mails, healthcare records, financial records, electronic documents, messages, associations with other end-users, and other types of information. Not only do the end-users have an expectation of privacy, but in many cases there can be legal requirements on the dissemination and use of the data. As a result, unauthorized access and/or use of the end-user's data can result in dissatisfied customers and potential legal liability.
Furthermore, when an end-user requests to delete an account, a company may be under ethical and/or legal obligations to expeditiously remove information associated with the deleted account. For example, in most cases, at least some of the information associated with the deleted account contains user identifiable information (UII). The term “user identifiable information” or “UII” includes any information that can be directly identified with or linked to a specific individual or end-user, with or without his or her knowledge. Additionally, it is often the case that the UII related to a deleted account must be scrubbed or otherwise deleted from a company's storage systems within a specified time frame.
Traditionally, companies that store vast amounts of end-user data have both front-end systems and back-end systems (e.g., data warehouse(s)) for data storage purposes. However, because of the nature and volume of data stored in a back-end system, at least some of the data in the back-end system may not be indexed (i.e., index lookups are not available). One example of such a system is Hive. Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems, such as the Hadoop Distributed File System (HDFS).
Unfortunately, data in Hadoop is stored as files rather than as a database structure, and thus specific portions of information cannot be looked up via an index; locating them requires reading the entire file. That is, an indexing issue exists in some storage systems in which some or all of the data, such as user log information, is not indexed. Consequently, to remove UII from the user log information, a system must scan each file in the data warehouse for UII and subsequently rewrite the entire file. This infrastructure becomes particularly troublesome when a company attempts to comply with a user's request to delete his/her account, because the scan and rewrite must be repeated on each day that any user in the system deletes an account, which for a social networking company is essentially every day.
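By way of illustration only, the scan-and-rewrite burden described above can be sketched in Python as follows. The sketch is not taken from any particular system: the function names `scrub_file` and `scrub_warehouse`, the use of a local directory tree in place of a distributed file system, and the assumed record layout (one tab-separated record per line, with the user identifier in the first field) are all hypothetical. It shows only that, absent an index, every record of every file must be read and every file rewritten, regardless of how few records belong to deleted accounts.

```python
import os
import tempfile


def scrub_file(path, deleted_user_ids):
    """Rewrite one unindexed log file, omitting records of deleted users.

    Assumed (hypothetical) layout: one record per line, tab-separated,
    with the user identifier in the first field.
    Returns (lines_read, lines_removed).
    """
    lines_read = 0
    lines_removed = 0
    dir_name = os.path.dirname(path) or "."
    # Write the surviving records to a temporary file in the same directory.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w", encoding="utf-8", newline="") as out, \
            open(path, encoding="utf-8", newline="") as src:
        for line in src:  # no index, so every record is examined
            lines_read += 1
            user_id = line.split("\t", 1)[0].rstrip("\r\n")
            if user_id in deleted_user_ids:
                lines_removed += 1
                continue
            out.write(line)
    # Replace the original file with the scrubbed copy.
    os.replace(tmp_path, path)
    return lines_read, lines_removed


def scrub_warehouse(root, deleted_user_ids):
    """Scan and rewrite every file under root; return summary counts."""
    deleted = set(deleted_user_ids)
    totals = {"files": 0, "lines_read": 0, "lines_removed": 0}
    for dir_path, _dir_names, file_names in os.walk(root):
        for name in file_names:
            read, removed = scrub_file(os.path.join(dir_path, name), deleted)
            totals["files"] += 1
            totals["lines_read"] += read
            totals["lines_removed"] += removed
    return totals
```

In this sketch, the work performed is proportional to the total size of the stored data rather than to the number of deleted accounts, which is the source of the cost discussed in the following paragraph.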
Companies may want to comply with a user's request to delete his/her account and remove UII. However, scanning and rewriting each file in a data warehouse is an arduous process that is both time-consuming and processor-intensive. Furthermore, if a company has petabytes of user data, this process can quickly become unmanageable.
Overall, the examples herein of some prior or related systems and their associated limitations are intended to be illustrative and not exclusive. Upon reading the following, other limitations of existing or prior systems will become apparent to those of skill in the art.