Data may be anonymized to protect sensitive information in datasets. For example, names of individuals may be anonymized by using a cryptographic hash function that converts each name to an output value of a fixed size. The hash function generates the same hash value each time it is given the same name, and distinct names are overwhelmingly likely to produce distinct hash values, which allows the anonymized data to be analyzed without compromising personal information.
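As a minimal sketch of the hashing described above, a name may be converted to a fixed-size token as follows. The choice of SHA-256 and the function name `anonymize_name` are assumptions made for illustration only; the description requires only that a cryptographic hash function be used.

```python
import hashlib


def anonymize_name(name: str) -> str:
    """Return a fixed-size hexadecimal token for a name.

    SHA-256 is an assumed choice for this sketch; any cryptographic
    hash function with a fixed-size output could be substituted.
    """
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


# The same name always maps to the same token, so records belonging to
# one individual can still be grouped after anonymization.
tokens = {name: anonymize_name(name) for name in ("Tim Johnson", "Frank Thomas")}
for name, token in tokens.items():
    print(f"{name!r} -> {token}")
```

Each token has the same length regardless of the length of the name, which is the fixed-size property referred to above.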
Anonymizing information using a hash function can securely generate a corresponding token from the sensitive data. However, the hashed value resembles a random alphanumeric string, which makes the hashed values difficult to read. When the sensitive data is intended to be included in a user interface, using a hashed value can make the information presented in the user interface difficult to understand. For example, a user interface that displays user information may be easier to understand when individual users are identified by their names, such as “Tim Johnson”, “Frank Thomas”, etc. The user interface may be more difficult to understand when the individual users' names are replaced by anonymized hash values such as “7ab034b02b35902d074d0eba077b32a9” or “aab50cf88d2ae72ebd4835362d5e3b61.”
Attempts at improving the readability of hash values have included selecting a name for each hash value as required. For example, a first hashed value may be converted to “user 1”, a second hashed value may be converted to “user 2”, etc. However, creating identifiers in such a manner requires maintaining and updating a list of identifiers as the hash values are processed. Adding a new identifier to such a list requires locking access to a global counter sentinel or similar counter, and that locking makes parallelizing and scaling such a process difficult.
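The sequential naming approach described above may be sketched as follows. The class `SequentialLabeler` and its method `label_for` are hypothetical names introduced for illustration, and the sketch assumes a single process with multiple threads; it is not a description of any particular prior system.

```python
import threading


class SequentialLabeler:
    """Assigns "user 1", "user 2", ... to hash values in arrival order.

    Hypothetical illustration: the mapping and the counter are shared
    state, so every lookup must take the same lock.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._labels: dict[str, str] = {}
        self._next_id = 1

    def label_for(self, hash_value: str) -> str:
        # All workers serialize here, whether or not the hash value
        # has been seen before.
        with self._lock:
            label = self._labels.get(hash_value)
            if label is None:
                label = f"user {self._next_id}"
                self._labels[hash_value] = label
                self._next_id += 1
            return label


labeler = SequentialLabeler()
first = labeler.label_for("7ab034b02b35902d074d0eba077b32a9")
second = labeler.label_for("aab50cf88d2ae72ebd4835362d5e3b61")
repeat = labeler.label_for("7ab034b02b35902d074d0eba077b32a9")
print(first, second, repeat)
```

Because the label given to a hash value depends on how many other hash values were processed before it, independent workers cannot assign labels without coordinating through the shared counter and list.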
An additional, alternative and/or improved process for anonymizing sensitive data would be desirable.